Re: [HACKERS] Measuring replay lag

2017-02-16 Thread Abhijit Menon-Sen
Hi Thomas.

At 2017-02-15 00:48:41 +1300, thomas.mu...@enterprisedb.com wrote:
>
> Here is a new version with the buffer on the sender side as requested.

This looks good.

> + write_lag
> + interval
> + Estimated time taken for recent WAL records to be written on this
> +  standby server

I think I would find a slightly more detailed explanation helpful here.

A few tiny nits:

> +  * If the lsn hasn't advanced since last time, then do nothing.  This 
> way
> +  * we only record a new sample when new WAL has been written, which is
> +  * simple proxy for the time at which the log was written.

"which is simple" → "which is a simple"

> +  * If the buffer if full, for now we just rewind by one slot and 
> overwrite
> +  * the last sample, as a simple if somewhat uneven way to lower the
> +  * sampling rate.  There may be better adaptive compaction algorithms.

"buffer if" → "buffer is"

> + * Find out how much time has elapsed since WAL position 'lsn' or earlier was
> + * written to the lag tracking buffer and 'now'.  Return -1 if no time is
> + * available, and otherwise the elapsed time in microseconds.

Find out how much time has elapsed "between X and 'now'", or "since X".
(I prefer the former, i.e., s/since/between/.)

-- Abhijit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [COMMITTERS] pgsql: Update copyright for 2017

2017-01-03 Thread Abhijit Menon-Sen
At 2017-01-03 18:46:32 +0100, mag...@hagander.net wrote:
>
> Thoughts?

I think we should stop making wholesale changes to copyright notices
every year.

-- Abhijit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_basebackups and slots

2016-12-15 Thread Abhijit Menon-Sen
At 2016-12-16 07:32:24 +0100, mag...@hagander.net wrote:
>
> I really think the default should be "what most people want", and not
> "whatever is compatible with a mode that was necessary because we
> lacked infrastructure".

Very much agreed.

-- Abhijit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal for changes to recovery.conf API

2016-11-04 Thread Abhijit Menon-Sen
At 2016-11-04 10:35:05 +, si...@2ndquadrant.com wrote:
>
> Since the "lots of parameters" approach is almost exactly what we have
> now, I think we should do that first, get this patch committed and
> then sit back and discuss an improved API and see what downsides it
> introduces.

Thanks, I agree strongly with this.

Someone (Michael?) earlier mentioned the potential for introducing bugs
with this patch (the idea of merging recovery.conf into postgresql.conf
at all, not this particular incarnation).

I think the current proposed approach with

recovery_target=xid
recovery_target_xid=xxx

is preferable because (a) it doesn't introduce much new code, e.g., new
parsing functions, nor (b) need many changes in the documentation—all we
need is to say that of the various recovery_target_xxx parameters, the
one that has priority is the one named by recovery_target.

If I were to introduce recovery_target='xid xxx', I would at a minimum
need to factor out (or duplicate) parsing and error handling code, make
a type/value union-in-struct to store in the GUC *extra, then make sure
that we handle the older values in a backwards-compatible way, and move
a bunch of documentation around.

Can it be done? Yes, of course; and I'll do it if that's the consensus.
But it would be a backwards-compatible change to the current approach
anyway, and I think it would be better to put in the simple way now
before discussing the new API further.

Let's get the basic new *functionality* out so that people can play with
it, I'm sure we'll find a few non-API things that need tweaking as a
result of the change.

-- Abhijit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal for changes to recovery.conf API

2016-10-31 Thread Abhijit Menon-Sen
At 2016-09-28 13:13:56 -0400, robertmh...@gmail.com wrote:
>
> I hope that the fact that there's been no discussion for the last
> three weeks doesn't mean this effort is dead; I would like very
> much to see it move forward.

Here's an updated patch. Sorry, I got busy elswhere.

I struggled with the handling of recovery_target a little. For example,
one suggested alternative was:

recovery_target_type = xid
recovery_target_value = …

The problem with implementing it this way is that the _value setting
cannot be parsed without already having parsed the _type, and I didn't
want to force that sort of dependency.

What I've done instead is to make this work:

recovery_target = xid|time|name|lsn|immediate
recovery_target_xid = …
recovery_target_time = …
recovery_target_name = …
recovery_target_lsn = …

The recovery_target_xxx values are parsed as they used to be, but the
one that's used is the one that's set in recovery_target. That's easy to
explain, and the patch is much less intrusive, but I'm certainly open to
suggestions to improve this, and I have the time to work on this patch
with a view towards getting it committed in this cycle.

-- Abhijit

P.S. Sorry, I haven't been able to resolve the conflicts between Simon's
earlier recovery_startup_r10_api.v1b patch and the "pg_ctl promote -w"
changes in master. I was distracted by some illness in the family, but
I will post another update very soon.
>From 231ccefbc99c07f00560f4e88a961db186db704e Mon Sep 17 00:00:00 2001
From: Abhijit Menon-Sen <a...@2ndquadrant.com>
Date: Mon, 31 Oct 2016 11:32:51 +0530
Subject: Convert recovery.conf settings to GUCs

Based on unite_recoveryconf_postgresqlconf_v3.patch by Fujii Masao.
---
 contrib/pg_standby/pg_standby.c |   2 +-
 doc/src/sgml/backup.sgml|  31 +-
 doc/src/sgml/config.sgml| 480 ++
 doc/src/sgml/filelist.sgml  |   1 -
 doc/src/sgml/func.sgml  |   2 +-
 doc/src/sgml/high-availability.sgml |  42 +-
 doc/src/sgml/pgstandby.sgml |   4 +-
 doc/src/sgml/postgres.sgml  |   1 -
 doc/src/sgml/recovery-config.sgml   | 508 ---
 doc/src/sgml/ref/pgarchivecleanup.sgml  |   6 +-
 doc/src/sgml/release-9.1.sgml   |   5 +-
 doc/src/sgml/release.sgml   |   2 +-
 src/backend/access/transam/recovery.conf.sample |  29 +-
 src/backend/access/transam/xlog.c   | 520 +++-
 src/backend/access/transam/xlogarchive.c|   8 +-
 src/backend/postmaster/startup.c|   2 +
 src/backend/replication/walreceiver.c   |  24 ++
 src/backend/utils/misc/guc-file.l   |  18 +
 src/backend/utils/misc/guc.c| 430 
 src/backend/utils/misc/postgresql.conf.sample   |  24 ++
 src/bin/pg_archivecleanup/pg_archivecleanup.c   |   2 +-
 src/include/access/xlog.h   |  21 +
 src/include/access/xlog_internal.h  |   2 +
 src/include/utils/guc_tables.h  |   2 +
 24 files changed, 1202 insertions(+), 964 deletions(-)
 delete mode 100644 doc/src/sgml/recovery-config.sgml

diff --git a/contrib/pg_standby/pg_standby.c b/contrib/pg_standby/pg_standby.c
index e4136f9..8fcb85c 100644
--- a/contrib/pg_standby/pg_standby.c
+++ b/contrib/pg_standby/pg_standby.c
@@ -520,7 +520,7 @@ usage(void)
 	printf("  -w MAXWAITTIME max seconds to wait for a file (0=no limit) (default=0)\n");
 	printf("  -?, --help show this help, then exit\n");
 	printf("\n"
-		   "Main intended use as restore_command in recovery.conf:\n"
+		   "Main intended use as restore_command in postgresql.conf:\n"
 		   "  restore_command = 'pg_standby [OPTION]... ARCHIVELOCATION %%f %%p %%r'\n"
 		   "e.g.\n"
 	"  restore_command = 'pg_standby /mnt/server/archiverdir %%f %%p %%r'\n");
diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index 6eaed1e..2df0dc6 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -1190,8 +1190,15 @@ SELECT pg_stop_backup();


 
- Create a recovery command file recovery.conf in the cluster
- data directory (see ). You might
+ Set up recovery parameters in postgresql.conf (see
+  and
+ ).
+
+   
+   
+
+ Create a recovery trigger file recovery.trigger
+ in the cluster data directory. You might
  also want to temporarily modify pg_hba.conf to prevent
  ordinary users from connecting until you are sure the recovery was successful.
 
@@ -1203,7 +1210,7 @@ SELECT pg_stop_backup();
  recovery be terminated because of an external error, the server can
  simply be restarted and it will continue recovery.  Upon completion
  of the r

Re: [HACKERS] Proposal for changes to recovery.conf API

2016-09-06 Thread Abhijit Menon-Sen
At 2016-09-06 10:56:53 +0200, p...@2ndquadrant.com wrote:
>
> So +1 on the recovery_target = 'xid:xxx' idea.

OK, I'll make it work that way. New patch in a couple of days.

-- Abhijit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal for changes to recovery.conf API

2016-09-06 Thread Abhijit Menon-Sen
At 2016-09-06 14:40:54 +0900, michael.paqu...@gmail.com wrote:
>
> my best advice here is to make all those recovery_target_* parameters
> PGC_POSTMASTER so as they are loaded only once when the server starts,
> and then we define the recovery target type used in the startup
> process instead of trying to do so at GUC level.

I understand your approach in light of the GUC code, but I see things a
bit differently—the complexity comes largely from the specific handling
of recovery_target. I'll try to come up with a way to do it better. If
not, we have your suggestion to fall back on.

> +else if (recovery_target_lsn != 0)
> +recovery_target = RECOVERY_TARGET_LSN;
> This needs to check for InvalidXLogRecPtr.

Of course, thanks. Fixed locally in all the relevant places. Will repost
whenever there are other substantive changes.

-- Abhijit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal for changes to recovery.conf API

2016-09-05 Thread Abhijit Menon-Sen
Hi.

Here's an updated version of my patch, which now applies on top of the
patch that Simon posted earlier (recovery_startup_r10_api.v1b.patch).

A few notes:

1. I merged in recovery_target_lsn as a new GUC setting.
2. I fixed various minor nits in the earlier patch, not worth mentioning
   individually.
3. I haven't added back trigger_file, because Simon's patch removes it.
   I can add it back in separately after discussion (otherwise Simon's
   and my patches will conflict).
4. I've tested this to the extent that setting things in postgresql.conf
   works, and recovery.conf is still read if it exists, and so on.

One open issue is the various assign_recovery_target_xxx functions,
which Michael noted in his earlier review:

+static void
+assign_recovery_target_xid(const char *newval, void *extra)
+{
+   recovery_target_xid = *((TransactionId *) extra);
+
+   if (recovery_target_xid != InvalidTransactionId)
+   recovery_target = RECOVERY_TARGET_XID;
+   else if (recovery_target_name && *recovery_target_name)
+   recovery_target = RECOVERY_TARGET_NAME;
+   else if (recovery_target_time != 0)
+   recovery_target = RECOVERY_TARGET_TIME;
+   else if (recovery_target_lsn != 0)
+   recovery_target = RECOVERY_TARGET_LSN;
+   else
+   recovery_target = RECOVERY_TARGET_UNSET;
+}

(Note how recovery_target_lsn caused this—and three other functions
besides—to grow an extra branch.)

I don't like this code, but I'm not yet sure what to replace it with. I
think we should address the underlying problem—that the UI doesn't map
cleanly to what the code wants. There's been some discussion about this
earlier, but not any consensus that I could see.

Do we want something like this (easy to implement and document, perhaps
not especially convenient to use):

recovery_target = 'xid' # or 'time'/'name'/'lsn'/'immediate'
recovery_target_xid = xxx?  # the only setting we care about now
recovery_target_otherthings = parsed_but_ignored

Or something like this (a bit harder to implement):

recovery_target = 'xid:xxx' # or 'time:xxx' etc.

Alternatively, the do-nothing option is to move the tests from guc.c to
StartupXLOG and do it only once in some defined order (which would also
break the current last-mention-wins behaviour).

Thoughts? (I've added Fujii to the Cc: list, in case he has any
comments, since this is based on his earlier patch.)

-- Abhijit
commit c20d735648f5ea867fda5afc499a63d3877536a3
Author: Abhijit Menon-Sen <a...@2ndquadrant.com>
Date:   Thu Sep 1 09:01:04 2016 +0530

Convert recovery.conf settings to GUCs

Based on unite_recoveryconf_postgresqlconf_v3.patch by Fujii Masao.

diff --git a/contrib/pg_standby/pg_standby.c b/contrib/pg_standby/pg_standby.c
index e4136f9..8fcb85c 100644
--- a/contrib/pg_standby/pg_standby.c
+++ b/contrib/pg_standby/pg_standby.c
@@ -520,7 +520,7 @@ usage(void)
 	printf("  -w MAXWAITTIME max seconds to wait for a file (0=no limit) (default=0)\n");
 	printf("  -?, --help show this help, then exit\n");
 	printf("\n"
-		   "Main intended use as restore_command in recovery.conf:\n"
+		   "Main intended use as restore_command in postgresql.conf:\n"
 		   "  restore_command = 'pg_standby [OPTION]... ARCHIVELOCATION %%f %%p %%r'\n"
 		   "e.g.\n"
 	"  restore_command = 'pg_standby /mnt/server/archiverdir %%f %%p %%r'\n");
diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index 0f09d82..f43e41e 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -1174,8 +1174,15 @@ SELECT pg_stop_backup();


 
- Create a recovery command file recovery.conf in the cluster
- data directory (see ). You might
+ Set up recovery parameters in postgresql.conf (see
+  and
+ ).
+
+   
+   
+
+ Create a recovery trigger file recovery.trigger
+ in the cluster data directory. You might
  also want to temporarily modify pg_hba.conf to prevent
  ordinary users from connecting until you are sure the recovery was successful.
 
@@ -1187,7 +1194,7 @@ SELECT pg_stop_backup();
  recovery be terminated because of an external error, the server can
  simply be restarted and it will continue recovery.  Upon completion
  of the recovery process, the server will rename
- recovery.conf to recovery.done (to prevent
+ recovery.trigger to recovery.done (to prevent
  accidentally re-entering recovery mode later) and then
  commence normal database operations.
 
@@ -1203,12 +1210,11 @@ SELECT pg_stop_backup();

 

-The key part of all this is to set up a recovery configuration file that
-describes how you want to recover and how far the recovery should
-run.  You can use recovery.conf.sample (normally
-located in the installation's share/ directory) as a
-prototype

Re: [HACKERS] LSN as a recovery target

2016-09-04 Thread Abhijit Menon-Sen
At 2016-09-04 07:02:01 +0100, si...@2ndquadrant.com wrote:
>
> > By the way, what has been committed does not include the patch
> > adding the parsing context in case of an error as wanted upthread.
> > Perhaps that's not worth adding now as there is the GUC refactoring
> > potentially happening for the recovery parameters, so I don't mind
> > much. Just that's worth mentioning.
> 
> Hmm, that was unintentional. If something stalls the recovery
> parameter project, please remind me to commit that as well.

I'm aware of this, and will adjust accordingly in the GUC patch. Thanks
for the heads up.

-- Abhijit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal for changes to recovery.conf API

2016-09-01 Thread Abhijit Menon-Sen
At 2016-09-01 15:51:11 +0900, michael.paqu...@gmail.com wrote:
>
> -   (errmsg("starting point-in-time recovery to XID %u",
> -   recoveryTargetXid)));
> User loses information if those logs are removed.

Agreed. I'm revising the patch right now, and I decided to leave them.
I'll consider and comment on the remainder of your points after I've
posted an update. Thanks for reading.

-- Abhijit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal for changes to recovery.conf API

2016-08-31 Thread Abhijit Menon-Sen
At 2016-08-31 17:15:59 +0100, si...@2ndquadrant.com wrote:
>
> * Recovery parameters would now be part of the main postgresql.conf
> infrastructure
> Any parameters set in $DATADIR/recovery.conf will be read after the
> main parameter file, similar to the way that postgresql.conf.auto is
> read.
> (Abhijit)
> 
> * Parameters
> All of the parameters formerly set in recovery.conf can be set in
> postgresql.conf using RELOAD
> These parameters will have no defaults in postgresql.conf.sample
> Setting them has no effect during normal running, or once recovery ends.
>  https://www.postgresql.org/docs/devel/static/archive-recovery-settings.html
>  https://www.postgresql.org/docs/devel/static/recovery-target-settings.html
>  https://www.postgresql.org/docs/devel/static/standby-settings.html
> (Abhijit)

I've attached a WIP patch for the above (recovery_guc_v20160831.patch).
This was based on the unite_recoveryconf_postgresqlconf_v3.patch posted
by Fujii Masao.

Unfortunately, some parts conflict with the patch that Simon just posted
(e.g., his patch removes trigger_file altogether, whereas mine converts
it into a GUC along the lines of the original patch). Rather than trying
to untangle that right now, I'm posting what I have as-is, and I'll post
an updated version tomorrow.

-- Abhijit
commit 999c0c2632f0d4c20d20b9ac30cd258305f74bac
Author: Abhijit Menon-Sen <a...@2ndquadrant.com>
Date:   Wed Aug 31 22:18:07 2016 +0530

Convert recovery.conf settings into GUCs

Based on unite_recoveryconf_postgresqlconf_v3.patch by Fujii Masao.

diff --git a/contrib/pg_standby/pg_standby.c b/contrib/pg_standby/pg_standby.c
index e4136f9..8fcb85c 100644
--- a/contrib/pg_standby/pg_standby.c
+++ b/contrib/pg_standby/pg_standby.c
@@ -520,7 +520,7 @@ usage(void)
 	printf("  -w MAXWAITTIME max seconds to wait for a file (0=no limit) (default=0)\n");
 	printf("  -?, --help show this help, then exit\n");
 	printf("\n"
-		   "Main intended use as restore_command in recovery.conf:\n"
+		   "Main intended use as restore_command in postgresql.conf:\n"
 		   "  restore_command = 'pg_standby [OPTION]... ARCHIVELOCATION %%f %%p %%r'\n"
 		   "e.g.\n"
 	"  restore_command = 'pg_standby /mnt/server/archiverdir %%f %%p %%r'\n");
diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index 0f09d82..f43e41e 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -1174,8 +1174,15 @@ SELECT pg_stop_backup();


 
- Create a recovery command file recovery.conf in the cluster
- data directory (see ). You might
+ Set up recovery parameters in postgresql.conf (see
+  and
+ ).
+
+   
+   
+
+ Create a recovery trigger file recovery.trigger
+ in the cluster data directory. You might
  also want to temporarily modify pg_hba.conf to prevent
  ordinary users from connecting until you are sure the recovery was successful.
 
@@ -1187,7 +1194,7 @@ SELECT pg_stop_backup();
  recovery be terminated because of an external error, the server can
  simply be restarted and it will continue recovery.  Upon completion
  of the recovery process, the server will rename
- recovery.conf to recovery.done (to prevent
+ recovery.trigger to recovery.done (to prevent
  accidentally re-entering recovery mode later) and then
  commence normal database operations.
 
@@ -1203,12 +1210,11 @@ SELECT pg_stop_backup();

 

-The key part of all this is to set up a recovery configuration file that
-describes how you want to recover and how far the recovery should
-run.  You can use recovery.conf.sample (normally
-located in the installation's share/ directory) as a
-prototype.  The one thing that you absolutely must specify in
-recovery.conf is the restore_command,
+The key part of all this is to set up recovery parameters that
+specify how you want to recover and how far the recovery should
+run. The one thing that you absolutely must specify in
+postgresql.conf to recover from the backup is
+the restore_command,
 which tells PostgreSQL how to retrieve archived
 WAL file segments.  Like the archive_command, this is
 a shell command string.  It can contain %f, which is
@@ -1270,7 +1276,7 @@ restore_command = 'cp /mnt/server/archivedir/%f %p'

 If you want to recover to some previous point in time (say, right before
 the junior DBA dropped your main transaction table), just specify the
-required stopping point in recovery.conf.  You can specify
+required stopping point in postgresql.conf.  You can specify
 the stop point, known as the recovery target, either by
 date/time, named restore point or by completion of a specific transaction
 ID.  As of this writing only the date/time and named restore point options
@@ -1367,8 +

Re: [HACKERS] PoC: Make it possible to disallow WHERE-less UPDATE and DELETE

2016-07-21 Thread Abhijit Menon-Sen
At 2016-07-21 12:46:29 -0400, robertmh...@gmail.com wrote:
>
> I join with others in thinking it's a reasonable contrib module.

I don't like the use of the term "empty" to describe an UPDATE or DELETE
without a WHERE clause.

-- Abhijit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.6 and fsync=off

2016-04-29 Thread Abhijit Menon-Sen
At 2016-04-28 13:44:23 -0700, and...@anarazel.de wrote:
>
> Abhijit had a patch implementing automatically running fsync whenever
> reenabled IIRC. Abhijit?

The patch I had written is attached, and it's not quite the same thing.
Here's how I originally described it in response to a question from
Robert:

«In 20150115133245.gg5...@awork2.anarazel.de, Andres explained his
rationale as follows:

«What I am thinking of is that, currently, if you start the
server for initial loading with fsync=off, and then restart it,
you're open to data loss. So when the current config file
setting is changed from off to on, we should fsync the data
directory. Even if there was no crash restart.»

That's what I tried to implement.»

I remember there was some subsequent discussion about it being better to
issue fsync during a checkpoint when we see that its value has changed,
but if I did any work on it (which I have a vague memory of), I can't
find it now. Sorry.

Do you want a patch along those lines now, or is it too late?

-- Abhijit
>From 1768680b672bcb037446230323cabcf9960f7f9a Mon Sep 17 00:00:00 2001
From: Abhijit Menon-Sen <a...@2ndquadrant.com>
Date: Fri, 1 May 2015 17:59:51 +0530
Subject: Recursively fsync PGDATA on the next restart after fsync was disabled

Even if we didn't crash, we want to fsync PGDATA on startup if we know
the server ran with fsync=off at some point since the previous restart.
Otherwise, starting the server with fsync=off for initial data loading
and then restarting it opens you to data loss on a power failure after
the restart.
---
 src/backend/access/transam/xlog.c   | 9 +++--
 src/bin/pg_controldata/pg_controldata.c | 2 ++
 src/include/catalog/pg_control.h| 8 +++-
 3 files changed, 16 insertions(+), 3 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 084174d..af12992 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4823,6 +4823,7 @@ BootStrapXLOG(void)
 	ControlFile->checkPoint = checkPoint.redo;
 	ControlFile->checkPointCopy = checkPoint;
 	ControlFile->unloggedLSN = 1;
+	ControlFile->fsync_disabled = false;
 
 	/* Set important parameter values for use when replaying WAL */
 	ControlFile->MaxConnections = MaxConnections;
@@ -5893,10 +5894,12 @@ StartupXLOG(void)
 	 */
 
 	if (enableFsync &&
-		(ControlFile->state != DB_SHUTDOWNED &&
-		 ControlFile->state != DB_SHUTDOWNED_IN_RECOVERY))
+		(ControlFile->fsync_disabled ||
+		 (ControlFile->state != DB_SHUTDOWNED &&
+		  ControlFile->state != DB_SHUTDOWNED_IN_RECOVERY)))
 	{
 		fsync_pgdata(data_directory);
+		ControlFile->fsync_disabled = false;
 	}
 
 	if (ControlFile->state == DB_SHUTDOWNED)
@@ -8272,6 +8275,8 @@ CreateCheckPoint(int flags)
 	/* crash recovery should always recover to the end of WAL */
 	ControlFile->minRecoveryPoint = InvalidXLogRecPtr;
 	ControlFile->minRecoveryPointTLI = 0;
+	if (!enableFsync)
+		ControlFile->fsync_disabled = true;
 
 	/*
 	 * Persist unloggedLSN value. It's reset on crash recovery, so this goes
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index d8cfe5e..e99014f 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -290,6 +290,8 @@ main(int argc, char *argv[])
 		   (uint32) ControlFile.backupEndPoint);
 	printf(_("End-of-backup record required:%s\n"),
 		   ControlFile.backupEndRequired ? _("yes") : _("no"));
+	printf(_("Fsync disabled at runtime:%s\n"),
+		   ControlFile.fsync_disabled ? _("yes") : _("no"));
 	printf(_("Current wal_level setting:%s\n"),
 		   wal_level_str(ControlFile.wal_level));
 	printf(_("Current wal_log_hints setting:%s\n"),
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index 2e4c381..a71d73e 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -21,7 +21,7 @@
 
 
 /* Version identifier for this pg_control format */
-#define PG_CONTROL_VERSION	942
+#define PG_CONTROL_VERSION	943
 
 /*
  * Body of CheckPoint XLOG records.  This is declared here because we keep
@@ -182,6 +182,12 @@ typedef struct ControlFileData
 	bool		track_commit_timestamp;
 
 	/*
+	 * Indicates whether fsync was ever disabled since the last restart.
+	 * Tested and set at checkpoints, reset at startup.
+	 */
+	bool		fsync_disabled;
+
+	/*
 	 * This data is used to check for hardware-architecture compatibility of
 	 * the database and the backend executable.  We need not check endianness
 	 * explicitly, since the pg_control version will surely look wrong to a
-- 
1.9.1


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.6 and fsync=off

2016-04-27 Thread Abhijit Menon-Sen
Here's a patch just to help things along.

-- Abhijit
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index f3e3de0..353de2e 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -182,7 +182,9 @@
 
 #wal_level = minimal			# minimal, replica, or logical
 	# (change requires restart)
-#fsync = on# turns forced synchronization on or off
+#fsync = on# provide crash safety by flushing disk writes
+	# (disabling this can lead to unrecoverable
+	# data corruption)
 #synchronous_commit = on		# synchronization level;
 	# off, local, remote_write, remote_apply, or on
 #wal_sync_method = fsync		# the default is the first option

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.6 and fsync=off

2016-04-27 Thread Abhijit Menon-Sen
At 2016-04-27 17:58:08 +0800, cr...@2ndquadrant.com wrote:
>
> #fsync = on # turns forced synchronization on or 
> off

I suggest:# provide crash safety by flushing 
disk writes
  # (Disabling this can lead to 
unrecoverable data
  # loss if the system crashes.)

-- Abhijit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [BUGS] Breakage with VACUUM ANALYSE + partitions

2016-04-13 Thread Abhijit Menon-Sen
At 2016-04-12 09:00:57 -0400, robertmh...@gmail.com wrote:
>
> On Mon, Apr 11, 2016 at 1:17 PM, Andres Freund  wrote:
> >
> > 3) Actually handle the case of the last open segment not being
> >RELSEG_SIZE properly in _mdfd_getseg() - mdnblocks() does so.
> 
> #3 seems like it's probably about 15 years overdue, so let's do that
> anyway.

I don't have anything useful to contribute, I'm just trying to
understand this fix.

_mdfd_getseg() looks like this:

targetseg = blkno / ((BlockNumber) RELSEG_SIZE);
for (nextsegno = 1; nextsegno <= targetseg; nextsegno++)
{
Assert(nextsegno == v->mdfd_segno + 1);

if (v->mdfd_chain == NULL)
{
/*
 * Normally we will create new segments only if 
authorized by the
 * caller (i.e., we are doing mdextend()). […]
 */
if (behavior == EXTENSION_CREATE || InRecovery)
{
if (_mdnblocks(reln, forknum, v) < RELSEG_SIZE)
/* mdextend */
v->mdfd_chain = _mdfd_openseg(reln, forknum, 
+nextsegno, O_CREAT);
}
else
{
/* We won't create segment if not existent */
v->mdfd_chain = _mdfd_openseg(reln, forknum, 
nextsegno, 0);
}
if (v->mdfd_chain == NULL)
/* return NULL or ERROR */
}
v = v->mdfd_chain;
}

Do I understand correctly that the case of the "last open segment"
(i.e., the one for which mdfd_chain == NULL) not being RELSEG_SIZE
(i.e., _mdnblocks(reln, forknum, v) < RELSEG_SIZE) should not call
_mfdf_openseg on nextsegno if behaviour is not EXTENSION_CREATE or
InRecovery?

And that "We won't create segment if not existent" should happen, but
doesn't only because the next segment file wasn't removed earlier, so
we have to add an extra check for that case?

In other words, is something like the following what's needed here, or
is there more to it?

-- Abhijit

diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 578276d..58ea5a0 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -1783,6 +1783,8 @@ _mdfd_getseg(SMgrRelation reln, ForkNumber forknum, 
BlockNumber blkno,
 
if (v->mdfd_chain == NULL)
{
+   bool thisSegNeedsExtension = _mdnblocks(reln, forknum, 
v) < RELSEG_SIZE;
+
/*
 * Normally we will create new segments only if 
authorized by the
 * caller (i.e., we are doing mdextend()).  But when 
doing WAL
@@ -1799,7 +1801,7 @@ _mdfd_getseg(SMgrRelation reln, ForkNumber forknum, 
BlockNumber blkno,
 */
if (behavior == EXTENSION_CREATE || InRecovery)
{
-   if (_mdnblocks(reln, forknum, v) < RELSEG_SIZE)
+   if (thisSegNeedsExtension)
{
char   *zerobuf = palloc0(BLCKSZ);
 
@@ -1810,7 +1812,7 @@ _mdfd_getseg(SMgrRelation reln, ForkNumber forknum, 
BlockNumber blkno,
}
v->mdfd_chain = _mdfd_openseg(reln, forknum, 
+nextsegno, O_CREAT);
}
-   else
+   else if (!thisSegNeedsExtension)
{
/* We won't create segment if not existent */
v->mdfd_chain = _mdfd_openseg(reln, forknum, 
nextsegno, 0);


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Pglogical questions and problems

2016-04-13 Thread Abhijit Menon-Sen
At 2016-04-13 03:46:21 -0700, j...@commandprompt.com wrote:
>
> ALTER DATABASE ADD NODE;
> ALTER SCHEMA SUBSCRIBE ALL;
> CREATE REPLICATION SET;
> 
> But I am unaware if that is possible within the constraints of the
> extensions API.

It is not possible.

-- Abhijit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Incomplete startup packet errors

2016-04-13 Thread Abhijit Menon-Sen
At 2016-04-13 10:02:22 +0200, mag...@hagander.net wrote:
>
> I wonder if it would make sense to only log that error if *at least
> one byte* has been received and then it becomes empty.

Yes, it would be very nice to eliminate that logspam, as you say.

-- Abhijit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] dealing with extension dependencies that aren't quite 'e'

2016-04-05 Thread Abhijit Menon-Sen
At 2016-04-05 18:45:58 -0300, alvhe...@2ndquadrant.com wrote:
>
> I changed the regression test a bit more, so please recheck.

Looks good, thank you.

-- Abhijit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] dealing with extension dependencies that aren't quite 'e'

2016-04-05 Thread Abhijit Menon-Sen
OK, thanks for the clarification. Here's the earlier patch, but with
the relevant added docs and tests retained.

-- Abhijit
>From dfb6ded15246ec65cc911864bfcff285eef1c4d4 Mon Sep 17 00:00:00 2001
From: Abhijit Menon-Sen <a...@2ndquadrant.com>
Date: Tue, 5 Apr 2016 11:55:09 +0530
Subject: =?UTF-8?q?Implement=20"ALTER=20=E2=80=A6=20DEPENDS=20ON=20EXTENSI?=
 =?UTF-8?q?ON=20=E2=80=A6"=20for=20functions=20and=20triggers?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

A new dependency type, DEPENDENCY_AUTO_EXTENSION, indicates that an
object depends on an extension (but doesn't belong to it, so pg_dump
should continue to dump it) and should be dropped when the extension
itself is dropped. A new statement type, AlterObjectDependsStmt, is
used to declare such dependencies.

Includes documentation and tests.
---
 doc/src/sgml/catalogs.sgml | 13 ++
 doc/src/sgml/ref/alter_function.sgml   | 20 +
 doc/src/sgml/ref/alter_trigger.sgml| 16 +++
 src/backend/catalog/dependency.c   |  2 +
 src/backend/catalog/objectaddress.c| 25 +++
 src/backend/commands/alter.c   | 32 ++
 src/backend/nodes/copyfuncs.c  | 17 
 src/backend/nodes/equalfuncs.c | 15 +++
 src/backend/parser/gram.y  | 36 +++-
 src/backend/tcop/utility.c | 27 
 src/include/catalog/dependency.h   |  9 +++-
 src/include/catalog/objectaddress.h|  4 ++
 src/include/commands/alter.h   |  1 +
 src/include/nodes/nodes.h  |  1 +
 src/include/nodes/parsenodes.h | 15 +++
 src/include/parser/kwlist.h|  1 +
 src/test/modules/test_extensions/Makefile  |  2 +-
 .../test_extensions/expected/test_extdepend.out| 50 ++
 .../modules/test_extensions/sql/test_extdepend.sql | 32 ++
 src/tools/pgindent/typedefs.list   |  1 +
 20 files changed, 315 insertions(+), 4 deletions(-)
 create mode 100644 src/test/modules/test_extensions/expected/test_extdepend.out
 create mode 100644 src/test/modules/test_extensions/sql/test_extdepend.sql

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index bb75229..3c128a0 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -2888,6 +2888,19 @@
   
  
 
+
+
+ DEPENDENCY_AUTO_EXTENSION (x)
+ 
+  
+   The dependent object is not a member of the extension that is the
+   referenced object (and so should not be ignored by pg_dump), but
+   cannot function without it and should be dropped when the
+   extension itself is. The dependent object may be dropped on its
+   own as well.
+  
+ 
+

 
Other dependency flavors might be needed in future.
diff --git a/doc/src/sgml/ref/alter_function.sgml b/doc/src/sgml/ref/alter_function.sgml
index f40363e..1ef905f 100644
--- a/doc/src/sgml/ref/alter_function.sgml
+++ b/doc/src/sgml/ref/alter_function.sgml
@@ -29,6 +29,8 @@ ALTER FUNCTION name ( [ [ argmode ] [ argname ] argtype [, ...] ] )
 SET SCHEMA new_schema
+ALTER FUNCTION name ( [ [ argmode ] [ argname ] argtype [, ...] ] )
+DEPENDS ON EXTENSION extension_name
 
 where action is one of:
 
@@ -148,6 +150,15 @@ ALTER FUNCTION name ( [ [ extension_name
+
+ 
+  The name of an extension that the function depends on.
+ 
+
+   
+
 
  CALLED ON NULL INPUT
  RETURNS NULL ON NULL INPUT
@@ -300,6 +311,15 @@ ALTER FUNCTION sqrt(integer) SET SCHEMA maths;
   
 
   
+   To mark the function sqrt for type
+   integer as being dependent on the extension
+   mathlib:
+
+ALTER FUNCTION sqrt(integer) DEPENDS ON EXTENSION mathlib;
+
+  
+
+  
To adjust the search path that is automatically set for a function:
 
 ALTER FUNCTION check_password(text) SET search_path = admin, pg_temp;
diff --git a/doc/src/sgml/ref/alter_trigger.sgml b/doc/src/sgml/ref/alter_trigger.sgml
index c63b5df..37c4d03 100644
--- a/doc/src/sgml/ref/alter_trigger.sgml
+++ b/doc/src/sgml/ref/alter_trigger.sgml
@@ -22,6 +22,7 @@ PostgreSQL documentation
  
 
 ALTER TRIGGER name ON table_name RENAME TO new_name
+ALTER TRIGGER name ON table_name DEPENDS ON EXTENSION extension_name
 
  
 
@@ -70,6 +71,15 @@ ALTER TRIGGER name ON 
 

+
+   
+extension_name
+
+ 
+  The name of an extension that the trigger depends on.
+ 
+
+   
   
  
 
@@ -93,6 +103,12 @@ ALTER TRIGGER name ON 
 ALTER TRIGGER emp_stamp ON emp RENAME TO emp_track_chgs;
 
+
+  
+   To mark a trigger as being dependent on an extension:
+
+ALTER TRIGGER emp_stamp ON emp DEPENDS ON EXTENSION emplib;
+
  
 
  
diff --git a/src/backend/catalog/dependency.c b/src/backend/c

Re: [HACKERS] dealing with extension dependencies that aren't quite 'e'

2016-04-05 Thread Abhijit Menon-Sen
At 2016-04-05 12:33:56 +0530, a...@2ndquadrant.com wrote:
>
> Álvaro: I did document and test the extra types you added, but now
> that I think about it a bit more, it's hard to argue that it's useful
> to have a table, for example, depend on an extension. There's really
> nothing about a table that "doesn't work without" an extension.

I think it makes sense to implement this for triggers and functions. It
may also be useful for indexes and materialised views, which can refer
to functions in an extension (and in future, sequences as well).

It's certainly good to know the grammar would work if we wanted to add
other object types in future, but I think we should leave it at that.

Thoughts?

-- Abhijit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] dealing with extension dependencies that aren't quite 'e'

2016-04-05 Thread Abhijit Menon-Sen
Álvaro: I did document and test the extra types you added, but now that
I think about it a bit more, it's hard to argue that it's useful to have
a table, for example, depend on an extension. There's really nothing
about a table that "doesn't work without" an extension.

-- Abhijit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] dealing with extension dependencies that aren't quite 'e'

2016-04-05 Thread Abhijit Menon-Sen
Hi.

Here's the updated patch. It fixes a couple of small problems, and
includes documentation and tests (which I placed in a new file in
src/test/modules/test_extensions, on Petr's advice).

I wanted to post this before I went on to attempt any more grammar
cleanups. Please let me know if there's anything else you'd like to
see here.

Thanks again for your review and suggestions.

-- Abhijit
>From d4882446de50c49c6563dac6dbdd386c01f9477a Mon Sep 17 00:00:00 2001
From: Abhijit Menon-Sen <a...@2ndquadrant.com>
Date: Tue, 5 Apr 2016 11:55:09 +0530
Subject: =?UTF-8?q?Implement=20"ALTER=20=E2=80=A6=20DEPENDS=20ON=20EXTENSI?=
 =?UTF-8?q?ON=20=E2=80=A6"=20for=20various=20object=20types?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

A new dependency type, DEPENDENCY_AUTO_EXTENSION, indicates that an
object depends on an extension (but doesn't belong to it, so pg_dump
should continue to dump it) and should be dropped when the extension
itself is dropped. A new statement type, AlterObjectDependsStmt, is
used to declare such dependencies.

Includes documentation and tests.
---
 doc/src/sgml/catalogs.sgml |  13 +++
 doc/src/sgml/ref/alter_function.sgml   |  20 
 doc/src/sgml/ref/alter_operator.sgml   |  18 
 doc/src/sgml/ref/alter_rule.sgml   |  18 +++-
 doc/src/sgml/ref/alter_table.sgml  |  25 +
 doc/src/sgml/ref/alter_trigger.sgml|  16 +++
 doc/src/sgml/ref/alter_tsconfig.sgml   |  23 -
 doc/src/sgml/ref/alter_type.sgml   |  27 +
 src/backend/catalog/dependency.c   |   2 +
 src/backend/catalog/objectaddress.c|  25 +
 src/backend/commands/alter.c   |  32 ++
 src/backend/nodes/copyfuncs.c  |  17 
 src/backend/nodes/equalfuncs.c |  15 +++
 src/backend/parser/gram.y  |  86 +++-
 src/backend/tcop/utility.c |  27 +
 src/include/catalog/dependency.h   |   9 +-
 src/include/catalog/objectaddress.h|   4 +
 src/include/commands/alter.h   |   1 +
 src/include/nodes/nodes.h  |   1 +
 src/include/nodes/parsenodes.h |  15 +++
 src/include/parser/kwlist.h|   1 +
 src/test/modules/test_extensions/Makefile  |   2 +-
 .../test_extensions/expected/test_extdepend.out| 110 +
 .../modules/test_extensions/sql/test_extdepend.sql |  56 +++
 src/tools/pgindent/typedefs.list   |   1 +
 25 files changed, 558 insertions(+), 6 deletions(-)
 create mode 100644 src/test/modules/test_extensions/expected/test_extdepend.out
 create mode 100644 src/test/modules/test_extensions/sql/test_extdepend.sql

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index bb75229..3c128a0 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -2888,6 +2888,19 @@
   
  
 
+
+
+ DEPENDENCY_AUTO_EXTENSION (x)
+ 
+  
+   The dependent object is not a member of the extension that is the
+   referenced object (and so should not be ignored by pg_dump), but
+   cannot function without it and should be dropped when the
+   extension itself is. The dependent object may be dropped on its
+   own as well.
+  
+ 
+

 
Other dependency flavors might be needed in future.
diff --git a/doc/src/sgml/ref/alter_function.sgml b/doc/src/sgml/ref/alter_function.sgml
index f40363e..1ef905f 100644
--- a/doc/src/sgml/ref/alter_function.sgml
+++ b/doc/src/sgml/ref/alter_function.sgml
@@ -29,6 +29,8 @@ ALTER FUNCTION name ( [ [ argmode ] [ argname ] argtype [, ...] ] )
 SET SCHEMA new_schema
+ALTER FUNCTION name ( [ [ argmode ] [ argname ] argtype [, ...] ] )
+DEPENDS ON EXTENSION extension_name
 
 where action is one of:
 
@@ -148,6 +150,15 @@ ALTER FUNCTION name ( [ [ extension_name
+
+ 
+  The name of an extension that the function depends on.
+ 
+
+   
+
 
  CALLED ON NULL INPUT
  RETURNS NULL ON NULL INPUT
@@ -300,6 +311,15 @@ ALTER FUNCTION sqrt(integer) SET SCHEMA maths;
   
 
   
+   To mark the function sqrt for type
+   integer as being dependent on the extension
+   mathlib:
+
+ALTER FUNCTION sqrt(integer) DEPENDS ON EXTENSION mathlib;
+
+  
+
+  
To adjust the search path that is automatically set for a function:
 
 ALTER FUNCTION check_password(text) SET search_path = admin, pg_temp;
diff --git a/doc/src/sgml/ref/alter_operator.sgml b/doc/src/sgml/ref/alter_operator.sgml
index b2eaa7a..456243f 100644
--- a/doc/src/sgml/ref/alter_operator.sgml
+++ b/doc/src/sgml/ref/alter_operator.sgml
@@ -28,6 +28,9 @@ ALTER OPERATOR name ( { left_typenew_schema
 
 ALTER OPERATOR name ( { left_type | NO

Re: [HACKERS] dealing with extension dependencies that aren't quite 'e'

2016-04-04 Thread Abhijit Menon-Sen
At 2016-04-04 18:55:03 -0300, alvhe...@2ndquadrant.com wrote:
>
> At this point I think we're missing user-level docs for the additional
> clause in each ALTER command.

Thanks for having a look. Now that you're happy with the grammar, I'll
write the remaining docs and resubmit the patch later today.

> Do you have any ideas on how this would appear in regression tests?

No specific ideas.

At a high level, I think we should install an empty extension, create
one of each kind of object, alter them to depend on the extension, check
that pg_depend has the right 'x' rows, then drop the extension and make
sure the objects have gone away.

Does that sound reasonable? Suggestions welcome.

-- Abhijit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Support for N synchronous standby servers - take 2

2016-04-04 Thread Abhijit Menon-Sen
At 2016-04-04 17:28:07 +0900, masao.fu...@gmail.com wrote:
>
> Barring any objections, I'll commit this patch.

No objections, just a minor wording tweak:

doc/src/sgml/config.sgml:

"The synchronous standbys will be the standbys that their names appear
early in this list" should be "The synchronous standbys will be those
whose names appear earlier in this list".

doc/src/sgml/high-availability.sgml:

"The standbys that their names appear early in this list are given
higher priority and will be considered as synchronous" should be "The
standbys whose names appear earlier in the list are given higher
priority and will be considered as synchronous".

"The standbys that their names appear early in the list will be used as
the synchronous standby" should be "The standbys whose names appear
earlier in the list will be used as synchronous standbys".

You may prefer to reword this in some other way, but the current "that
their names appear" wording should be changed.

-- Abhijit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] dealing with extension dependencies that aren't quite 'e'

2016-04-01 Thread Abhijit Menon-Sen
At 2016-03-29 10:15:51 -0400, da...@pgmasters.net wrote:
>
> Either way it looks like you need to post a patch with more
> documentation - do you know when you'll have that ready?

Here it is.

(I was actually looking for other potential callers, but I couldn't find
any. There are some places that take a RangeVar and make a list from it,
but they are creating new nodes, and don't quite fit. So the only change
in this patch is to add a comment above the get_object_address_rv
function.)

Álvaro, do you like this one any better?

-- Abhijit
diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 951f59b..189b771 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -2864,6 +2864,19 @@
   
  
 
+
+
+ DEPENDENCY_AUTO_EXTENSION (x)
+ 
+  
+   The dependent object is not a member of the extension that is the
+   referenced object (and so should not be ignored by pg_dump), but
+   cannot function without it and should be dropped when the
+   extension itself is. The dependent object may be dropped on its
+   own as well.
+  
+ 
+

 
Other dependency flavors might be needed in future.
diff --git a/src/backend/catalog/dependency.c b/src/backend/catalog/dependency.c
index c48e37b..a284bed 100644
--- a/src/backend/catalog/dependency.c
+++ b/src/backend/catalog/dependency.c
@@ -587,6 +587,7 @@ findDependentObjects(const ObjectAddress *object,
 		{
 			case DEPENDENCY_NORMAL:
 			case DEPENDENCY_AUTO:
+			case DEPENDENCY_AUTO_EXTENSION:
 /* no problem */
 break;
 			case DEPENDENCY_INTERNAL:
@@ -786,6 +787,7 @@ findDependentObjects(const ObjectAddress *object,
 subflags = DEPFLAG_NORMAL;
 break;
 			case DEPENDENCY_AUTO:
+			case DEPENDENCY_AUTO_EXTENSION:
 subflags = DEPFLAG_AUTO;
 break;
 			case DEPENDENCY_INTERNAL:
diff --git a/src/backend/catalog/objectaddress.c b/src/backend/catalog/objectaddress.c
index d2aaa6d..39128c1 100644
--- a/src/backend/catalog/objectaddress.c
+++ b/src/backend/catalog/objectaddress.c
@@ -999,6 +999,32 @@ get_object_address(ObjectType objtype, List *objname, List *objargs,
 }
 
 /*
+ * Return an ObjectAddress based on a RangeVar and an obect name. The
+ * name of the relation identified by the RangeVar is prepended to the
+ * list passed in as objname. This is useful to find the ObjectAddress
+ * of objects that depend on a relation. All other considerations are
+ * exactly as for get_object_address above.
+ */
+
+ObjectAddress
+get_object_address_rv(ObjectType objtype, RangeVar *rel, List *objname,
+	  List *objargs, Relation *relp, LOCKMODE lockmode,
+	  bool missing_ok)
+{
+	if (rel)
+	{
+		objname = lcons(makeString(rel->relname), objname);
+		if (rel->schemaname)
+			objname = lcons(makeString(rel->schemaname), objname);
+		if (rel->catalogname)
+			objname = lcons(makeString(rel->catalogname), objname);
+	}
+
+	return get_object_address(objtype, objname, objargs,
+			  relp, lockmode, missing_ok);
+}
+
+/*
  * Find an ObjectAddress for a type of object that is identified by an
  * unqualified name.
  */
diff --git a/src/backend/commands/alter.c b/src/backend/commands/alter.c
index 5af0f2f..dd0b4c9f 100644
--- a/src/backend/commands/alter.c
+++ b/src/backend/commands/alter.c
@@ -23,6 +23,7 @@
 #include "catalog/pg_collation.h"
 #include "catalog/pg_conversion.h"
 #include "catalog/pg_event_trigger.h"
+#include "catalog/pg_extension.h"
 #include "catalog/pg_foreign_data_wrapper.h"
 #include "catalog/pg_foreign_server.h"
 #include "catalog/pg_language.h"
@@ -32,6 +33,7 @@
 #include "catalog/pg_opclass.h"
 #include "catalog/pg_opfamily.h"
 #include "catalog/pg_proc.h"
+#include "catalog/pg_trigger.h"
 #include "catalog/pg_ts_config.h"
 #include "catalog/pg_ts_dict.h"
 #include "catalog/pg_ts_parser.h"
@@ -391,6 +393,32 @@ ExecRenameStmt(RenameStmt *stmt)
 }
 
 /*
+ * Executes an ALTER OBJECT / DEPENDS ON EXTENSION statement.
+ *
+ * Return value is the address of the altered object.
+ */
+ObjectAddress
+ExecAlterObjectDependsStmt(AlterObjectDependsStmt *stmt)
+{
+	ObjectAddress address;
+	ObjectAddress extAddr;
+	Relation	rel = NULL;
+
+	address = get_object_address_rv(stmt->objectType, stmt->relation, stmt->objname,
+	stmt->objargs, , AccessExclusiveLock, false);
+
+	if (rel)
+		heap_close(rel, NoLock);
+
+	extAddr = get_object_address(OBJECT_EXTENSION, stmt->extname, NULL,
+ , AccessExclusiveLock, false);
+
+	recordDependencyOn(, , DEPENDENCY_AUTO_EXTENSION);
+
+	return address;
+}
+
+/*
  * Executes an ALTER OBJECT / SET SCHEMA statement.  Based on the object
  * type, the function appropriate to that type is executed.
  *
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 6b5d1d6..ecc260d 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -3203,6 +3203,20 @@ _copyRenameStmt(const RenameStmt *from)
 	return newnode;
 }
 
+static AlterObjectDependsStmt *

Re: [HACKERS] dealing with extension dependencies that aren't quite 'e'

2016-03-25 Thread Abhijit Menon-Sen
At 2016-03-24 22:48:51 +0530, a...@2ndquadrant.com wrote:
>
> > I think I would like to see code implement both alternatives to see
> > which one is least ugly.  Maybe a third idea will manifest itself
> > upon seeing those.
> 
> Here's the first one. ExecAlterObjectDependsStmt() looks like this:

Here's the second one, which is only slightly different from the first.
ExecAlterObjectDependsStmt() now looks like this:

+ObjectAddress
+ExecAlterObjectDependsStmt(AlterObjectDependsStmt *stmt)
+{
+ObjectAddress address;
+ObjectAddress extAddr;
+Relationrel = NULL;
+
+address = get_object_address_rv(stmt->objectType, stmt->relation, 
stmt->objname,
+stmt->objargs, , AccessExclusiveLock, 
false);
+
+if (rel)
+heap_close(rel, NoLock);
+
+extAddr = get_object_address(OBJECT_EXTENSION, stmt->extname, NULL,
+ , AccessExclusiveLock, false);
+
+recordDependencyOn(, , DEPENDENCY_AUTO_EXTENSION);
+
+return address;
+}

And the new get_object_address_rv() looks like this:

+ObjectAddress
+get_object_address_rv(ObjectType objtype, RangeVar *rel, List *objname,
+  List *objargs, Relation *relp, LOCKMODE lockmode,
+  bool missing_ok)
+{
+if (rel)
+{
+objname = lcons(makeString(rel->relname), objname);
+if (rel->schemaname)
+objname = lcons(makeString(rel->schemaname), objname);
+if (rel->catalogname)
+objname = lcons(makeString(rel->catalogname), objname);
+}
+
+return get_object_address(objtype, objname, objargs,
+  relp, lockmode, missing_ok);
+}

Complete patch attached for reference, as before. (I know I haven't
documented the function. I will go through the code to see if there are
any other potential callers, but I wanted to share what I had already.)

-- Abhijit
diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 951f59b..189b771 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -2864,6 +2864,19 @@
   
  
 
+
+
+ DEPENDENCY_AUTO_EXTENSION (x)
+ 
+  
+   The dependent object is not a member of the extension that is the
+   referenced object (and so should not be ignored by pg_dump), but
+   cannot function without it and should be dropped when the
+   extension itself is. The dependent object may be dropped on its
+   own as well.
+  
+ 
+

 
Other dependency flavors might be needed in future.
diff --git a/src/backend/catalog/dependency.c b/src/backend/catalog/dependency.c
index c48e37b..a284bed 100644
--- a/src/backend/catalog/dependency.c
+++ b/src/backend/catalog/dependency.c
@@ -587,6 +587,7 @@ findDependentObjects(const ObjectAddress *object,
 		{
 			case DEPENDENCY_NORMAL:
 			case DEPENDENCY_AUTO:
+			case DEPENDENCY_AUTO_EXTENSION:
 /* no problem */
 break;
 			case DEPENDENCY_INTERNAL:
@@ -786,6 +787,7 @@ findDependentObjects(const ObjectAddress *object,
 subflags = DEPFLAG_NORMAL;
 break;
 			case DEPENDENCY_AUTO:
+			case DEPENDENCY_AUTO_EXTENSION:
 subflags = DEPFLAG_AUTO;
 break;
 			case DEPENDENCY_INTERNAL:
diff --git a/src/backend/catalog/objectaddress.c b/src/backend/catalog/objectaddress.c
index d2aaa6d..137f94d 100644
--- a/src/backend/catalog/objectaddress.c
+++ b/src/backend/catalog/objectaddress.c
@@ -998,6 +998,24 @@ get_object_address(ObjectType objtype, List *objname, List *objargs,
 	return address;
 }
 
+ObjectAddress
+get_object_address_rv(ObjectType objtype, RangeVar *rel, List *objname,
+	  List *objargs, Relation *relp, LOCKMODE lockmode,
+	  bool missing_ok)
+{
+	if (rel)
+	{
+		objname = lcons(makeString(rel->relname), objname);
+		if (rel->schemaname)
+			objname = lcons(makeString(rel->schemaname), objname);
+		if (rel->catalogname)
+			objname = lcons(makeString(rel->catalogname), objname);
+	}
+
+	return get_object_address(objtype, objname, objargs,
+			  relp, lockmode, missing_ok);
+}
+
 /*
  * Find an ObjectAddress for a type of object that is identified by an
  * unqualified name.
diff --git a/src/backend/commands/alter.c b/src/backend/commands/alter.c
index 5af0f2f..dd0b4c9f 100644
--- a/src/backend/commands/alter.c
+++ b/src/backend/commands/alter.c
@@ -23,6 +23,7 @@
 #include "catalog/pg_collation.h"
 #include "catalog/pg_conversion.h"
 #include "catalog/pg_event_trigger.h"
+#include "catalog/pg_extension.h"
 #include "catalog/pg_foreign_data_wrapper.h"
 #include "catalog/pg_foreign_server.h"
 #include "catalog/pg_language.h"
@@ -32,6 +33,7 @@
 #include "catalog/pg_opclass.h"
 #include "catalog/pg_opfamily.h"
 #include "catalog/pg_proc.h"
+#include "catalog/pg_trigger.h"
 #include "catalog/pg_ts_config.h"
 #include "catalog/pg_ts_dict.h"
 #include "catalog/pg_ts_parser.h"
@@ -391,6 +393,32 @@ ExecRenameStmt(RenameStmt *stmt)
 }
 
 /*
+ * Executes an ALTER OBJECT / DEPENDS ON 

Re: [HACKERS] dealing with extension dependencies that aren't quite 'e'

2016-03-24 Thread Abhijit Menon-Sen
At 2016-03-24 12:31:16 -0300, alvhe...@2ndquadrant.com wrote:
>
> In other words I think the conclusion here is that we must use
> qualified_name in the new production rather than switching the old
> production to any_name.

Makes sense.

> I think I would like to see code implement both alternatives to see
> which one is least ugly.  Maybe a third idea will manifest itself upon
> seeing those.

Here's the first one. ExecAlterObjectDependsStmt() looks like this:

+ObjectAddress
+ExecAlterObjectDependsStmt(AlterObjectDependsStmt *stmt)
+{
+ObjectAddress address;
+ObjectAddress extAddr;
+Relationrel = NULL;
+
+/*
+ * If the parser handed us a RangeVar, we add the relation's name to
+ * stmt->objname so that we can pass it to get_object_address().
+ */
+if (stmt->relation)
+{
+stmt->objname = lcons(makeString(stmt->relation->relname), 
stmt->objname);
+if (stmt->relation->schemaname)
+stmt->objname = lcons(makeString(stmt->relation->schemaname), 
stmt->objname);
+if (stmt->relation->catalogname)
+stmt->objname = lcons(makeString(stmt->relation->catalogname), 
stmt->objname);
+}
+
+address = get_object_address(stmt->objectType, stmt->objname, 
stmt->objargs,
+ , AccessExclusiveLock, false);
+
+if (rel)
+heap_close(rel, NoLock);
+
+extAddr = get_object_address(OBJECT_EXTENSION, stmt->extname, NULL,
+ , AccessExclusiveLock, false);
+
+recordDependencyOn(, , DEPENDENCY_AUTO_EXTENSION);
+
+return address;
+}

(This works fine for both functions and triggers, I tested it.)

Complete patch attached for reference.

I'll post the get_object_address_rv() variant tomorrow, but comments are
welcome in the meantime.

-- Abhijit
diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 951f59b..189b771 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -2864,6 +2864,19 @@
   
  
 
+
+
+ DEPENDENCY_AUTO_EXTENSION (x)
+ 
+  
+   The dependent object is not a member of the extension that is the
+   referenced object (and so should not be ignored by pg_dump), but
+   cannot function without it and should be dropped when the
+   extension itself is. The dependent object may be dropped on its
+   own as well.
+  
+ 
+

 
Other dependency flavors might be needed in future.
diff --git a/src/backend/catalog/dependency.c b/src/backend/catalog/dependency.c
index c48e37b..a284bed 100644
--- a/src/backend/catalog/dependency.c
+++ b/src/backend/catalog/dependency.c
@@ -587,6 +587,7 @@ findDependentObjects(const ObjectAddress *object,
 		{
 			case DEPENDENCY_NORMAL:
 			case DEPENDENCY_AUTO:
+			case DEPENDENCY_AUTO_EXTENSION:
 /* no problem */
 break;
 			case DEPENDENCY_INTERNAL:
@@ -786,6 +787,7 @@ findDependentObjects(const ObjectAddress *object,
 subflags = DEPFLAG_NORMAL;
 break;
 			case DEPENDENCY_AUTO:
+			case DEPENDENCY_AUTO_EXTENSION:
 subflags = DEPFLAG_AUTO;
 break;
 			case DEPENDENCY_INTERNAL:
diff --git a/src/backend/commands/alter.c b/src/backend/commands/alter.c
index 5af0f2f..339a313 100644
--- a/src/backend/commands/alter.c
+++ b/src/backend/commands/alter.c
@@ -23,6 +23,7 @@
 #include "catalog/pg_collation.h"
 #include "catalog/pg_conversion.h"
 #include "catalog/pg_event_trigger.h"
+#include "catalog/pg_extension.h"
 #include "catalog/pg_foreign_data_wrapper.h"
 #include "catalog/pg_foreign_server.h"
 #include "catalog/pg_language.h"
@@ -32,6 +33,7 @@
 #include "catalog/pg_opclass.h"
 #include "catalog/pg_opfamily.h"
 #include "catalog/pg_proc.h"
+#include "catalog/pg_trigger.h"
 #include "catalog/pg_ts_config.h"
 #include "catalog/pg_ts_dict.h"
 #include "catalog/pg_ts_parser.h"
@@ -391,6 +393,45 @@ ExecRenameStmt(RenameStmt *stmt)
 }
 
 /*
+ * Executes an ALTER OBJECT / DEPENDS ON EXTENSION statement.
+ *
+ * Return value is the address of the altered object.
+ */
+ObjectAddress
+ExecAlterObjectDependsStmt(AlterObjectDependsStmt *stmt)
+{
+	ObjectAddress address;
+	ObjectAddress extAddr;
+	Relation	rel = NULL;
+
+	/*
+	 * If the parser handed us a RangeVar, we add the relation's name to
+	 * stmt->objname so that we can pass it to get_object_address().
+	 */
+	if (stmt->relation)
+	{
+		stmt->objname = lcons(makeString(stmt->relation->relname), stmt->objname);
+		if (stmt->relation->schemaname)
+			stmt->objname = lcons(makeString(stmt->relation->schemaname), stmt->objname);
+		if (stmt->relation->catalogname)
+			stmt->objname = lcons(makeString(stmt->relation->catalogname), stmt->objname);
+	}
+
+	address = get_object_address(stmt->objectType, stmt->objname, stmt->objargs,
+ , AccessExclusiveLock, false);
+
+	if (rel)
+		heap_close(rel, NoLock);
+
+	extAddr = get_object_address(OBJECT_EXTENSION, stmt->extname, NULL,
+ , AccessExclusiveLock, false);
+
+	recordDependencyOn(, , 

Re: [HACKERS] dealing with extension dependencies that aren't quite 'e'

2016-03-23 Thread Abhijit Menon-Sen
Hi.

I implemented "ALTER FUNCTION … DEPENDS ON EXTENSION" using a new node
(AlterObjectDependsStmt), and tried to add "ALTER TRIGGER … DEPENDS ON
EXTENSION" (partly because I wanted to make sure the code could support
multiple object types, partly because it's convenient in this particular
use case). Then I ran into a problem, which I could use some help with.

The main ExecAlterObjectDependsStmt() function is very simple:

+ObjectAddress
+ExecAlterObjectDependsStmt(AlterObjectDependsStmt *stmt)
+{
+   ObjectAddress address;
+   ObjectAddress extAddr;
+
+   address = get_object_address(stmt->objectType, stmt->objname, stmt->objargs,
+NULL, AccessExclusiveLock, false);
+   extAddr = get_object_address(OBJECT_EXTENSION, stmt->extname, NULL,
+NULL, AccessExclusiveLock, false);
+
+   recordDependencyOn(, , DEPENDENCY_AUTO_EXTENSION);
+
+   return address;
+}

All that's needed is to construct the node based on variants of the
ALTER command:

+AlterObjectDependsStmt:
+ALTER FUNCTION function_with_argtypes DEPENDS ON EXTENSION name
+{
+AlterObjectDependsStmt *n = 
makeNode(AlterObjectDependsStmt);
+n->objectType = OBJECT_FUNCTION;
+n->objname = $3->funcname;
+n->objargs = $3->funcargs;
+n->extname = list_make1(makeString($7));
+$$ = (Node *)n;
+}
+| ALTER TRIGGER name ON any_name DEPENDS ON EXTENSION name
+{
+AlterObjectDependsStmt *n = 
makeNode(AlterObjectDependsStmt);
+n->objectType = OBJECT_TRIGGER;
+n->objname = list_make1(lappend($5, makeString($3)));
+n->objargs = NIL;
+n->extname = list_make1(makeString($9));
+$$ = (Node *)n;
+}
+;

Now, the first part of this works fine. But with the second part, I get
a reduce/reduce conflict if I use any_name. Here's an excerpt from the
verbose bison output:

State 2920

  1181 AlterObjectDependsStmt: ALTER TRIGGER name ON any_name DEPENDS . ON 
EXTENSION name

ON  shift, and go to state 3506


State 2921

  1165 RenameStmt: ALTER TRIGGER name ON qualified_name RENAME . TO name

TO  shift, and go to state 3507


State 2922

  829 attrs: '.' . attr_name
  2006 indirection_el: '.' . attr_name
  2007   | '.' . '*'

  …

  attr_name   go to state 3508
  ColLabelgo to state 1937
  unreserved_keyword  go to state 1938
  col_name_keywordgo to state 1939
  type_func_name_keyword  go to state 1940
  reserved_keywordgo to state 1941


State 3506

  1181 AlterObjectDependsStmt: ALTER TRIGGER name ON any_name DEPENDS ON . 
EXTENSION name

EXTENSION  shift, and go to state 4000


State 3507

  1165 RenameStmt: ALTER TRIGGER name ON qualified_name RENAME TO . name

…

namego to state 4001
ColId   go to state 543
unreserved_keyword  go to state 544
col_name_keywordgo to state 545


State 3508

  829 attrs: '.' attr_name .
  2006 indirection_el: '.' attr_name .

RENAMEreduce using rule 2006 (indirection_el)
'['   reduce using rule 2006 (indirection_el)
'.'   reduce using rule 829 (attrs)
'.'   [reduce using rule 2006 (indirection_el)]
$default  reduce using rule 829 (attrs)

I can see that the problem is that it can reduce '.' in two ways, but
I'm afraid I don't remember enough about yacc to know how to fix it. If
anyone has suggestions, I would be grateful.

Meanwhile, I tried using qualified_name instead of any_name, which does
work without conflicts, but I end up with a RangeVar instead of the sort
of List* that I can feed to get_object_address.

I could change ExecAlterObjectDependsStmt() to check if stmt->relation
is set, and if so, convert the RangeVar back to a List* in the right
format before passing it to get_object_address. That would work fine,
but it doesn't seem like a good solution.

I could write some get_object_address_rv() wrapper that does essentially
the same conversion, but that's only slightly better.

Are there any other options?

(Apart from the above, the rest of the patch is really just boilerplate
for the new node type, but I've attached it anyway for reference.)

-- Abhijit
diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 951f59b..189b771 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -2864,6 +2864,19 @@
   
  
 
+
+
+ DEPENDENCY_AUTO_EXTENSION (x)
+ 
+  
+   The dependent object is not a member of the extension that is the
+   referenced object (and so should not be ignored by pg_dump), but
+   cannot function without it and should be dropped when the
+   extension itself is. The dependent object may be 

Re: [HACKERS] dealing with extension dependencies that aren't quite 'e'

2016-03-21 Thread Abhijit Menon-Sen
At 2016-03-21 15:43:09 -0400, robertmh...@gmail.com wrote:
>
> I also think we should allow a function to depend on multiple
> extensions, as Alvaro mentions downthread.

I'm working on an updated patch, will post shortly.

-- Abhijit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] dealing with extension dependencies that aren't quite 'e'

2016-03-21 Thread Abhijit Menon-Sen
At 2016-03-21 12:04:40 +0530, a...@2ndquadrant.com wrote:
>
> I'll write up a patch for this. Thanks for the suggestions.

Here's a patch to implement ALTER FUNCTION x DEPENDS ON EXTENSION y.

The changes to functioncmds.c:AlterFunction were less intrusive than I
had originally feared.

-- Abhijit
>From 722f23b75369f035f094753e47663c50580b052c Mon Sep 17 00:00:00 2001
From: Abhijit Menon-Sen <a...@2ndquadrant.com>
Date: Tue, 1 Mar 2016 06:44:28 +0530
Subject: =?UTF-8?q?Add=20DEPENDENCY=5FAUTO=5FEXTENSION/ALTER=20FUNCTION=20?=
 =?UTF-8?q?=E2=80=A6=20DEPENDS=20ON=20EXTENSION=20=E2=80=A6?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The new dependency type is used for functions that require an extension
to operate, but do not actually belong to the extension per se and, in
particular, should not be ignored by pg_dump. The new command allows a
function to be marked as depending on an extension in this way.
---
 doc/src/sgml/catalogs.sgml  | 13 +
 src/backend/catalog/dependency.c|  2 ++
 src/backend/commands/functioncmds.c | 31 ---
 src/backend/parser/gram.y   |  7 ++-
 src/include/catalog/dependency.h|  9 -
 src/include/parser/kwlist.h |  1 +
 6 files changed, 58 insertions(+), 5 deletions(-)

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 951f59b..189b771 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -2864,6 +2864,19 @@
   
  
 
+
+
+ DEPENDENCY_AUTO_EXTENSION (x)
+ 
+  
+   The dependent object is not a member of the extension that is the
+   referenced object (and so should not be ignored by pg_dump), but
+   cannot function without it and should be dropped when the
+   extension itself is. The dependent object may be dropped on its
+   own as well.
+  
+ 
+

 
Other dependency flavors might be needed in future.
diff --git a/src/backend/catalog/dependency.c b/src/backend/catalog/dependency.c
index c48e37b..a284bed 100644
--- a/src/backend/catalog/dependency.c
+++ b/src/backend/catalog/dependency.c
@@ -587,6 +587,7 @@ findDependentObjects(const ObjectAddress *object,
 		{
 			case DEPENDENCY_NORMAL:
 			case DEPENDENCY_AUTO:
+			case DEPENDENCY_AUTO_EXTENSION:
 /* no problem */
 break;
 			case DEPENDENCY_INTERNAL:
@@ -786,6 +787,7 @@ findDependentObjects(const ObjectAddress *object,
 subflags = DEPFLAG_NORMAL;
 break;
 			case DEPENDENCY_AUTO:
+			case DEPENDENCY_AUTO_EXTENSION:
 subflags = DEPFLAG_AUTO;
 break;
 			case DEPENDENCY_INTERNAL:
diff --git a/src/backend/commands/functioncmds.c b/src/backend/commands/functioncmds.c
index a745d73..b0f0f88 100644
--- a/src/backend/commands/functioncmds.c
+++ b/src/backend/commands/functioncmds.c
@@ -41,6 +41,7 @@
 #include "catalog/objectaccess.h"
 #include "catalog/pg_aggregate.h"
 #include "catalog/pg_cast.h"
+#include "catalog/pg_extension.h"
 #include "catalog/pg_language.h"
 #include "catalog/pg_namespace.h"
 #include "catalog/pg_proc.h"
@@ -50,6 +51,7 @@
 #include "catalog/pg_type_fn.h"
 #include "commands/alter.h"
 #include "commands/defrem.h"
+#include "commands/extension.h"
 #include "commands/proclang.h"
 #include "miscadmin.h"
 #include "optimizer/var.h"
@@ -466,7 +468,8 @@ compute_common_attribute(DefElem *defel,
 		 List **set_items,
 		 DefElem **cost_item,
 		 DefElem **rows_item,
-		 DefElem **parallel_item)
+		 DefElem **parallel_item,
+		 DefElem **extdepend_item)
 {
 	if (strcmp(defel->defname, "volatility") == 0)
 	{
@@ -521,6 +524,13 @@ compute_common_attribute(DefElem *defel,
 
 		*parallel_item = defel;
 	}
+	else if (strcmp(defel->defname, "extdepend") == 0)
+	{
+		if (*extdepend_item)
+			goto duplicate_error;
+
+		*extdepend_item = defel;
+	}
 	else
 		return false;
 
@@ -682,7 +692,8 @@ compute_attributes_sql_style(List *options,
 		  _items,
 		  _item,
 		  _item,
-		  _item))
+		  _item,
+		  NULL))
 		{
 			/* recognized common option */
 			continue;
@@ -1179,6 +1190,7 @@ AlterFunction(AlterFunctionStmt *stmt)
 	DefElem*cost_item = NULL;
 	DefElem*rows_item = NULL;
 	DefElem*parallel_item = NULL;
+	DefElem*extdepend_item = NULL;
 	ObjectAddress address;
 
 	rel = heap_open(ProcedureRelationId, RowExclusiveLock);
@@ -1217,7 +1229,8 @@ AlterFunction(AlterFunctionStmt *stmt)
 	 _items,
 	 _item,
 	 _item,
-	 _item) == false)
+	 _item,
+	 _item) == false)
 			elog(ERROR, "option \"%s\" not recognized", defel->defname);
 	}
 
@@ -1303,6 +1316,18 @@ AlterFunction(AlterFunctionStmt *stmt)
 	heap_close(rel, NoLock);
 	heap_fr

Re: [HACKERS] dealing with extension dependencies that aren't quite 'e'

2016-03-21 Thread Abhijit Menon-Sen
At 2016-03-21 13:04:33 +0300, a.korot...@postgrespro.ru wrote:
>
> I'm not sure why we want to make new dependency type by ALTER FUNCTION
> command, not ALTER EXTENSION?

It's a matter of semantics. It means something very different than what
an 'e' dependency means. The extension doesn't own the function (and so
pg_dump shouldn't ignore it), but the function depends on the extension
(and so dropping the extension should drop it).

> The argument could be that 'x' dependency type would be used for other
> objects not extensions.

I suppose this is possible, but yes, I agree with you that it's not
clear how or why this would be useful.

> So, I would prefer this patch to extend ALTER EXTENSION command while
> it's aimed to this particular problem.

OK, so that's what the patch does, and it's certainly the simplest
approach for reasons discussed earlier (though perhaps not as clear
semantically as the ALTER FUNCTION approach). But:

> I even think we could extend existent grammar rule
> 
> | ALTER EXTENSION name add_drop FUNCTION function_with_argtypes
> > *** AlterExtensionContentsStmt:
> > *** 3982,3987 
> > --- 3987,3993 
> > n->objtype = OBJECT_FUNCTION;
> > n->objname = $6->funcname;
> > n->objargs = $6->funcargs;
> > +   n->deptype = 'e';
> > $$ = (Node *)n;
> > }

How exactly do you propose to do this, i.e., what would the final
command to declare an 'x' dependency look like?

-- Abhijit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] dealing with extension dependencies that aren't quite 'e'

2016-03-21 Thread Abhijit Menon-Sen
At 2016-03-19 17:46:25 -0300, alvhe...@2ndquadrant.com wrote:
>
> I don't think the first patch is acceptable standalone -- we need both
> things together.

OK.

> But in reality, pg_depend handling is mixed up with other changes all
> over the place.

Yes, I noticed that. :-/

> Anyway I think this should be something along the lines of
> ALTER FUNCTION foo() DEPENDS ON EXTENSION bar;

OK. That's reasonable.

> ALTER FUNCTION foo() OWNED BY EXTENSION bar;

If the function is really OWNED BY EXTENSION, then the right way to
declare it would seem to be ALTER EXTENSION … ADD FUNCTION. I prefer
DEPENDS ON EXTENSION for this reason, there's no ambiguity about what
we're declaring.

> Another argument to focus only on extensions is that pg_dump knows
> specifically about extensions for supressing objects to dump, and we
> don't have any other object type doing the same kind of thing; so
> perhaps extensions-only is fine.

That's the argument that motivates this particular patch. I think if we
have a DEPENDS ON EXTENSION framework, it (a) addresses the immediate
need, and (b) gives us a straightforward way to add DEPENDS ON  in
future when we find some need for it.

I'll write up a patch for this. Thanks for the suggestions.

-- Abhijit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_rewind just doesn't fsync *anything*?

2016-03-09 Thread Abhijit Menon-Sen
At 2016-03-10 08:35:43 +0100, michael.paqu...@gmail.com wrote:
>
> > I guess the easiest fix would be to shell out to initdb -s?
> 
> What do you mean? I am not sure what initdb has to do with that as we
> have no need for it in pg_rewind.

initdb -S/--sync-only fsyncs everything in the data directory and exits.

-- Abhijit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] dealing with extension dependencies that aren't quite 'e'

2016-02-29 Thread Abhijit Menon-Sen
At 2016-02-29 19:56:07 -0600, jim.na...@bluetreble.com wrote:
>
> I don't see why this would be limited to just functions. […] Am I
> missing something?

No, you are not missing anything. The specific problem I was trying to
solve involved a function, so I sketched out a solution for functions.
Once we have some consensus on whether that's an acceptable approach,
I'll extend the patch in whatever way we agree seems appropriate.

> Maybe the better way to handle this would be through ALTER EXTENSION?

That's what this (second) patch does.

> Given the audience for this, I think it'd probably be OK to just
> provide a function that does this, instead of DDL.

That seems like a promising idea. Can you suggest some possible usage?
Thanks.

-- Abhijit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] dealing with extension dependencies that aren't quite 'e'

2016-02-29 Thread Abhijit Menon-Sen
At 2016-01-18 11:08:19 +0530, a...@2ndquadrant.com wrote:
>
> I'm proposing to address a part of that problem by allowing extension
> dependencies to be explicitly declared for functions and objects
> created either by a user or dynamically by the extension itself—things
> that need the extension to function, but aren't a part of it.

I didn't hear any further suggestions, so here's a patch for discussion.

1. This adds the 'x'/DEPENDENCY_AUTO_EXTENSION type.
2. This adds an 'ALTER FUNCTION … ADD DEPENDENT FUNCTION …' command.

I split up the two because we may want the new dependency type without
going to the trouble of adding a new command. Maybe extension authors
should just insert an 'x' row into pg_depend directly?

I was inclined to implement it using ALTER FUNCTION, but AlterFunction()
is focused on altering the pg_proc entry for a function, so the new code
didn't fit. Ultimately, ExecAlterExtensionContentsStmt() was the closest
match, so that's where I did it.

Comments welcome. I'll add this patch to the CF.

-- Abhijit
>From 9835f0990a015431393d608c8710d9effe301c9d Mon Sep 17 00:00:00 2001
From: Abhijit Menon-Sen <a...@2ndquadrant.com>
Date: Tue, 1 Mar 2016 06:44:28 +0530
Subject: Add a DEPENDENCY_AUTO_EXTENSION dependency type

This is useful for functions that require the extension to operate, but
do not belong to the extension per se and, in particular, should not be
ignored by pg_dump.
---
 doc/src/sgml/catalogs.sgml   | 13 +
 src/backend/catalog/dependency.c |  2 ++
 src/include/catalog/dependency.h |  9 -
 3 files changed, 23 insertions(+), 1 deletion(-)

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 951f59b..189b771 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -2864,6 +2864,19 @@
   
  
 
+
+
+ DEPENDENCY_AUTO_EXTENSION (x)
+ 
+  
+   The dependent object is not a member of the extension that is the
+   referenced object (and so should not be ignored by pg_dump), but
+   cannot function without it and should be dropped when the
+   extension itself is. The dependent object may be dropped on its
+   own as well.
+  
+ 
+

 
Other dependency flavors might be needed in future.
diff --git a/src/backend/catalog/dependency.c b/src/backend/catalog/dependency.c
index c48e37b..a284bed 100644
--- a/src/backend/catalog/dependency.c
+++ b/src/backend/catalog/dependency.c
@@ -587,6 +587,7 @@ findDependentObjects(const ObjectAddress *object,
 		{
 			case DEPENDENCY_NORMAL:
 			case DEPENDENCY_AUTO:
+			case DEPENDENCY_AUTO_EXTENSION:
 /* no problem */
 break;
 			case DEPENDENCY_INTERNAL:
@@ -786,6 +787,7 @@ findDependentObjects(const ObjectAddress *object,
 subflags = DEPFLAG_NORMAL;
 break;
 			case DEPENDENCY_AUTO:
+			case DEPENDENCY_AUTO_EXTENSION:
 subflags = DEPFLAG_AUTO;
 break;
 			case DEPENDENCY_INTERNAL:
diff --git a/src/include/catalog/dependency.h b/src/include/catalog/dependency.h
index 049bf9f..380f74a 100644
--- a/src/include/catalog/dependency.h
+++ b/src/include/catalog/dependency.h
@@ -61,6 +61,12 @@
  * created only during initdb.  The fields for the dependent object
  * contain zeroes.
  *
+ * DEPENDENCY_AUTO_EXTENSION ('x'): the dependent object is not a member
+ * of the extension that is the referenced object (and so should not be
+ * ignored by pg_dump), but cannot function without it and should be
+ * dropped when the extension itself is. The dependent object may be
+ * dropped on its own as well.
+ *
  * Other dependency flavors may be needed in future.
  */
 
@@ -70,7 +76,8 @@ typedef enum DependencyType
 	DEPENDENCY_AUTO = 'a',
 	DEPENDENCY_INTERNAL = 'i',
 	DEPENDENCY_EXTENSION = 'e',
-	DEPENDENCY_PIN = 'p'
+	DEPENDENCY_PIN = 'p',
+	DEPENDENCY_AUTO_EXTENSION = 'x'
 } DependencyType;
 
 /*
-- 
2.1.4

>From f06f59e8f7f25406510178fbf62b59dcfd59428c Mon Sep 17 00:00:00 2001
From: Abhijit Menon-Sen <a...@2ndquadrant.com>
Date: Tue, 1 Mar 2016 06:46:11 +0530
Subject: =?UTF-8?q?Add=20experimental=20'ALTER=20EXTENSION=20=E2=80=A6=20A?=
 =?UTF-8?q?DD=20DEPENDENT=20FUNCTION=20=E2=80=A6'=20command?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

This uses the new 'x' dependency type.
---
 src/backend/commands/extension.c |  7 +--
 src/backend/nodes/copyfuncs.c|  1 +
 src/backend/nodes/equalfuncs.c   |  1 +
 src/backend/parser/gram.y| 39 ++-
 src/include/nodes/parsenodes.h   |  1 +
 src/include/parser/kwlist.h  |  1 +
 6 files changed, 47 insertions(+), 3 deletions(-)

diff --git a/src/backend/commands/extension.c b/src/backend/commands/extension.c
index 9d84b79..c33dca6 100644
--- a/src/backend/commands/extension.c
+++ b/src/backend/commands/extension.c
@@ -2985,6 +2985,7 @@ ExecAlterExtensionContentsStmt(AlterExtensionContentsStmt *stmt,
 	ObjectAddress object;
 	Relati

Re: [HACKERS] Writing new unit tests with PostgresNode

2016-02-21 Thread Abhijit Menon-Sen
At 2016-02-22 07:45:37 +, e...@xs4all.nl wrote:
> 
> I think what's needed is:
> 
> use 5.008_008;

That is meant to fail if the code is run on a version of Perl older than
5.8.8, not fail if the code uses language features not present in 5.8.8.
It is little help for those trying to maintain backwards compatibility.

-- Abhijit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] dealing with extension dependencies that aren't quite 'e'

2016-01-17 Thread Abhijit Menon-Sen
At 2016-01-16 12:18:53 -0500, robertmh...@gmail.com wrote:
>
> This seems like one manifestation of the more general problem that we
> don't have any real idea what objects a function definition depends
> on.

Yes.

I'm proposing to address a part of that problem by allowing extension
dependencies to be explicitly declared for functions and objects created
either by a user or dynamically by the extension itself—things that need
the extension to function, but aren't a part of it.

Put that way, ALTER EXTENSION doesn't sound like the way to do it. Maybe
ALTER FUNCTION … DEPENDS ON EXTENSION …? I don't particularly care how
the dependency is recorded, it's the dependency type that's important.

I'll post a patch along those lines in a bit, just so we have something
concrete to discuss; meanwhile, suggestions for another syntax to record
the dependency are welcome.

-- Abhijit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] dealing with extension dependencies that aren't quite 'e'

2016-01-16 Thread Abhijit Menon-Sen
Right, here's another try.

The extension does trigger-based DML auditing. You install it using
CREATE EXTENSION and then call one of its functions to enable auditing
for a particular table. That function will create a customised trigger
function based on the table's columns and a trigger that uses it:

CREATE FUNCTION fn_audit_$table_name() RETURNS TRIGGER …
CREATE TRIGGER … ON $table_name … EXECUTE fn_audit_$table_name;

All that works fine (with pg_dump too). But if you drop the extension,
the triggers stop working because the trigger function calls functions
in the extension that are now gone.

To mitigate this problem, the extension actually does:

CREATE FUNCTION fn_audit…
ALTER EXTENSION … ADD FUNCTION fn_audit…

Now the trigger depends on the trigger function (as before), and the
trigger function depends on the extension, so you can't inadvertently
break the system by dropping the extension.

But now pg_dump has a problem: it'll dump the trigger definitions, but
not the trigger functions (because of their new 'e' dependency on the
extension). So if you restore, you get the extension and the triggers,
but the trigger functions are gone, and things break.

*This* is the problem I'm trying to solve. Sorry, my earlier explanation
was not clear, because I didn't fully understand the problem and what
the extension was doing.

One possible solution is to make the trigger function depend on the
extension with a dependency type that isn't 'e', and therefore doesn't
prevent pg_dump from including the function in its output. We would need
some way to record the dependency, but no changes to pg_dump would be
needed.

Thoughts?

-- Abhijit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] dealing with extension dependencies that aren't quite 'e'

2016-01-15 Thread Abhijit Menon-Sen
I'm sorry, it wasn't clear from my earlier post that the triggers are
dependent on a function provided by the extension.

So when you do CREATE EXTENSION foo, it creates foo_somefunc() that
RETURNS TRIGGER. Later, a trigger is created (somehow; in this case it
is by some other function in the extension, but it could be by the user
directly as well) that executes this function.

But that's only a partial answer to the questions raised here, and I'll
return to the drawing board and draw up a more complete explanation.

Thanks for reading.

-- Abhijit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] dealing with extension dependencies that aren't quite 'e'

2016-01-14 Thread Abhijit Menon-Sen
Hi.

I'm looking at an extension that creates some triggers (on user tables)
dynamically (i.e., not during CREATE EXTENSION, but afterwards). The
author has two problems with it:

* «DROP EXTENSION ext» won't work without adding CASCADE, which is an
  (admittedly relatively minor) inconvenience to users.

* More importantly, pg_dump will dump all those trigger definitions,
  which is inappropriate in this case because the extension will try
  to create them.

As an experiment, I implemented "ALTER EXTENSION … ADD TRIGGER … ON …"
(a few-line patch to gram.y) and taught pg_dump.c's getTriggers() to
skip triggers with any 'e' dependency.

That works, in the sense that DROP EXTENSION will drop the triggers (but
the triggers can't be dropped on their own any more), and pg_dump will
ignore them. I'm trying to find a more generally useful mechanism that
addresses this problem and has a chance of being accepted upstream.

Rather than overloading 'e', we could introduce a new dependency type
that references an extension, but means that the dependent object should
be dropped when the extension is, but it can also be dropped on its own,
and pg_dump should ignore it. That would work for this case, and I can
imagine it *may* be useful for other types of objects (e.g., sequences
that depend on a seqam function provided by an extension, indexes that
depend on index functions provided by an extension).

But that leaves open the question of how exactly to record the
dependency: ALTER EXTENSION … ADD … is the easiest way, but it doesn't
feel right to introduce object-type-specific logic there to determine
the dependency type to use. Besides, I'm not entirely comfortable with
tying pg_dump behaviour so closely with the dependency, though there's
some sort of precedent there with deptype = 'i'.

In short, I can see some scope for improvement, but not clearly enough
to make a concrete proposal. I'm hoping for advice or suggestions with
a view towards submitting something to the next commitfest (2016-03).

Thanks.

-- Abhijit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] introduce XLogLockBlockRangeForCleanup()

2015-06-23 Thread Abhijit Menon-Sen
 either skip
it until we reach the RBM_ZERO_AND_CLEANUP_LOCK part, or add a new
branch before the «if (!isExtend)» one.

In summary, we're either adding one new branch that handles everything
related to RBM_CACHE_ONLY and side-stepping some things on the way that
happen to not apply (isLocalBuf, isExtend, TRACE_*), or we're adding a
bunch of tests to ReadBuffer_common.

But splitting BufferAlloc into BufferAlloc_shared and the rest of
BufferAlloc doesn't look especially nice either. If you look at the
shared_buffers lookup from BufferAlloc:

 /* see if the block is in the buffer pool already */
 LWLockAcquire(newPartitionLock, LW_SHARED);
 buf_id = BufTableLookup(newTag, newHash);
 if (buf_id = 0)
 {
 /*
  * Found it.  Now, pin the buffer so no one can steal it from the
  * buffer pool, and check to see if the correct data has been loaded
  * into the buffer.
  */
 buf = GetBufferDescriptor(buf_id);
 
 valid = PinBuffer(buf, strategy);
 
 /* Can release the mapping lock as soon as we've pinned it */
 LWLockRelease(newPartitionLock);
 
 *foundPtr = TRUE;
 
 if (!valid)
 {
 /*
  * We can only get here if (a) someone else is still reading in
  * the page, or (b) a previous read attempt failed.  We have to
  * wait for any active read attempt to finish, and then set up our
  * own read attempt if the page is still not BM_VALID.
  * StartBufferIO does it all.
  */
 if (StartBufferIO(buf, true))
 {
 /*
  * If we get here, previous attempts to read the buffer must
  * have failed ... but we shall bravely try again.
  */
 *foundPtr = FALSE;
 }
 }
 
 return buf;
 }
 
 /*
  * Didn't find it in the buffer pool.  We'll have to initialize a new
  * buffer.  Remember to unlock the mapping lock while doing the work.
  */
 LWLockRelease(newPartitionLock);
 
 /* Loop here in case we have to try another victim buffer */
 for (;;)
 {

There are two options here. I split out BufferAlloc_shared and either
make BufferAlloc call it, or not.

If I make BufferAlloc call it, the LWLockAcquire/Release could move to
the _shared function without confusion, but I'll either have to return
the PinBuffer BM_VALID test result via a validPtr, or repeat the test
in the caller before calling StartBufferIO. Both the INIT_BUFFERTAG and
BufTableHashCode() would also have to be in both functions, since they
are used for BufTableInsert() later in BufferAlloc.

On the other hand, if I don't make BufferAlloc call the _shared
function, then we end up with a new few-line special-case function plus
either a self-contained new branch in ReadBuffer_common, or a bunch of
RBM_CACHE_ONLY tests. That doesn't really seem an improvement over what
the original patch did, i.e. introduce a single special-case function to
handle a single special-case access.

I'm open to suggestions.

-- Abhijit

P.S. For reference, I've attached the patch that I posted earlier,
rebased to apply to master.
commit 4cbbbd242b7ee6333e1d1a09794aa1cd9d39035f
Author: Abhijit Menon-Sen a...@2ndquadrant.com
Date:   Tue Jun 23 13:38:01 2015 +0530

Introduce XLogLockBlockRangeForCleanup()

When replaying B-Tree vacuum records, we need to confirm that a range of
blocks is unpinned, but there was no way to ask the buffer manager if a
block was pinned without having it read in uncached pages.

This patch exposes a new interface that returns pages if they're in the
cache, or InvalidBuffer, and uses that in a function that locks a range
of blocks without reading them in from disk.

diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index 2f85867..e49a9ba 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -416,33 +416,10 @@ btree_xlog_vacuum(XLogReaderState *record)
 	{
 		RelFileNode thisrnode;
 		BlockNumber thisblkno;
-		BlockNumber blkno;
 
 		XLogRecGetBlockTag(record, 0, thisrnode, NULL, thisblkno);
-
-		for (blkno = xlrec-lastBlockVacuumed + 1; blkno  thisblkno; blkno++)
-		{
-			/*
-			 * We use RBM_NORMAL_NO_LOG mode because it's not an error
-			 * condition to see all-zero pages.  The original btvacuumpage
-			 * scan would have skipped over all-zero pages, noting them in FSM
-			 * but not bothering to initialize them just yet; so we mustn't
-			 * throw an error here.  (We could skip acquiring the cleanup lock
-			 * if PageIsNew, but it's probably not worth the cycles to test.)
-			 *
-			 * XXX we don't actually need to read the block, we just need to
-			 * confirm it is unpinned. If we had a special call into the
-			 * buffer manager we could optimise this so that if the block is
-			 * not in shared_buffers

Re: [HACKERS] skipping pg_log in basebackup (was Re: pg_basebackup and pg_stat_tmp directory)

2015-06-12 Thread Abhijit Menon-Sen
At 2015-06-11 14:38:03 +0900, langote_amit...@lab.ntt.co.jp wrote:

  On the other hand, I don't like the idea of doing (3) by adding
  command line arguments to pg_basebackup and adding a new option to
  the command. I don't think that level of flexibility is justified;
  it would also make it easier to end up with a broken base backup (by
  inadvertently excluding more than you meant to).
 
 Maybe a combination of (2) and part of (3). In absence of any command
 line argument, the behavior is (2), to exclude. Provide an option to
 *include* it (-S/--serverlog).

I don't like that idea any more than having the command-line argument to
exclude pg_log. (And people who store torrented files in PGDATA may like
it even less.)

-- Abhijit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] skipping pg_log in basebackup (was Re: pg_basebackup and pg_stat_tmp directory)

2015-06-11 Thread Abhijit Menon-Sen
At 2015-06-11 14:28:36 +0900, michael.paqu...@gmail.com wrote:

 After spending the night thinking about that, honestly, I think that
 we should go with (2) and keep the base backup as light-weight as
 possible and not bother about a GUC.

OK. Then the patch I posted earlier should be sufficient.

-- Abhijit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] skipping pg_log in basebackup (was Re: pg_basebackup and pg_stat_tmp directory)

2015-06-10 Thread Abhijit Menon-Sen
At 2015-06-10 13:22:27 -0400, robertmh...@gmail.com wrote:

 I'm not clear on which of these options you are voting for:
 
 (1) include pg_log in pg_basebackup as we do currently
 (2) exclude it
 (3) add a switch controlling whether or not it gets excluded
 
 I can live with (3), but I bet most people want (2).

Thanks for spelling out the options.

I strongly prefer (2), but I could live with (3) if it were done as a
GUC setting. (And if that's what we decide to do, I'm willing to write
up the patch.)

Whether or not it's a good idea to let one's logfiles grow to 8GB, the
fact that doing so breaks base backups means that being able to exclude
pg_log *somehow* is more of a necessity than personal preference.

On the other hand, I don't like the idea of doing (3) by adding command
line arguments to pg_basebackup and adding a new option to the command.
I don't think that level of flexibility is justified; it would also
make it easier to end up with a broken base backup (by inadvertently
excluding more than you meant to).

-- Abhijit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] skipping pg_log in basebackup (was Re: pg_basebackup and pg_stat_tmp directory)

2015-06-07 Thread Abhijit Menon-Sen
Hi.

This is a followup to a 2014-02 discussion that led to pg_stats_temp
being excluded from pg_basebackup. At the time, it was discussed to
exclude pg_log as well, but nothing eventually came of that.

Recently, one of our customers has had a basebackup fail because pg_log
contained files that were 8GB:

FATAL: archive member pg_log/postgresql-20150119.log too large for tar format

I think pg_basebackup should also skip pg_log entries, as it does for
pg_stats_temp and pg_replslot, etc. I've attached a patch along those
lines for discussion.

-- Abhijit

P.S. Aren't we leaking statrelpath?
From 8db162c1385b1cdd2b0e666975b76aa814f09f35 Mon Sep 17 00:00:00 2001
From: Abhijit Menon-Sen a...@2ndquadrant.com
Date: Mon, 27 Apr 2015 12:58:52 +0530
Subject: Skip files in pg_log during basebackup

---
 src/backend/replication/basebackup.c | 29 +
 1 file changed, 29 insertions(+)

diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index 1e86e4c..cc75a03 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -27,6 +27,7 @@
 #include nodes/pg_list.h
 #include pgtar.h
 #include pgstat.h
+#include postmaster/syslogger.h
 #include replication/basebackup.h
 #include replication/walsender.h
 #include replication/walsender_private.h
@@ -72,6 +73,9 @@ static bool backup_started_in_recovery = false;
 /* Relative path of temporary statistics directory */
 static char *statrelpath = NULL;
 
+/* Relative path to log directory */
+static char logpath[MAXPGPATH];
+
 /*
  * Size of each block sent into the tar stream for larger files.
  */
@@ -157,6 +161,18 @@ perform_base_backup(basebackup_options *opt, DIR *tblspcdir)
 		else
 			statrelpath = pgstat_stat_directory;
 
+		/*
+		 * Do the same for the log_directory.
+		 */
+
+		if (is_absolute_path(Log_directory) 
+			path_is_prefix_of_path(DataDir, Log_directory))
+			snprintf(logpath, MAXPGPATH, ./%s, Log_directory + datadirpathlen + 1);
+		else if (strncmp(Log_directory, ./, 2) != 0)
+			snprintf(logpath, MAXPGPATH, ./%s, Log_directory);
+		else
+			strncpy(logpath, Log_directory, MAXPGPATH);
+
 		/* Add a node for the base directory at the end */
 		ti = palloc0(sizeof(tablespaceinfo));
 		ti-size = opt-progress ? sendDir(., 1, true, tablespaces, true) : -1;
@@ -965,6 +981,19 @@ sendDir(char *path, int basepathlen, bool sizeonly, List *tablespaces,
 		}
 
 		/*
+		 * We can skip pg_log (or whatever log_directory is set to, if
+		 * it's under the data directory), but include it as an empty
+		 * directory anyway, so we get permissions right.
+		 */
+		if (strcmp(pathbuf, logpath) == 0)
+		{
+			if (!sizeonly)
+_tarWriteHeader(pathbuf + basepathlen + 1, NULL, statbuf);
+			size += 512;
+			continue;
+		}
+
+		/*
 		 * Skip pg_replslot, not useful to copy. But include it as an empty
 		 * directory anyway, so we get permissions right.
 		 */
-- 
1.9.1


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] skipping pg_log in basebackup

2015-06-07 Thread Abhijit Menon-Sen
At 2015-06-08 13:09:02 +0900, michael.paqu...@gmail.com wrote:

 It seems to be that:
 http://www.postgresql.org/message-id/cahgqgwh0okz6ckpjkcwojga3ejwffm1enrmro3dkdoteaai...@mail.gmail.com

(Note that this is about calculating the wrong size, whereas my bug is
about the file being too large to be written to a tar archive.)

 And a recent discussion about that is this one:
 http://www.postgresql.org/message-id/82897A1301080E4B8E461DDAA0FFCF142A1B2660@SYD1216

Oh, sorry, I somehow did miss that thread entirely. Thanks for the
pointer. (I've added Vaishnavi to the Cc: list here.)

I'm not convinced that we need a mechanism to let people exclude the
torrent files they've stored in their data directory, but if we have to
do it, the idea of having a GUC setting rather than specifying excludes
on the basebackup command line each time does have a certain appeal.

Anyone else interested in doing it that way?

-- Abhijit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_xlog - pg_xjournal?

2015-06-02 Thread Abhijit Menon-Sen
At 2015-06-01 23:35:23 -0500, htf...@gmail.com wrote:

 No, it won't prevent the incredibly stupid from doing incredibly
 stupid things, nothing will.

I hate to speechify, but I think we should try hard to avoid framing
such questions in terms of incredibly stupid people and the things
they might do.

We have anecdotal and circumstantial evidence that the names pg_xlog and
pg_clog have given some people the impression that they can delete files
therein. Sometimes do this when their server is in imminent danger of
running out of space, sometimes not. But our documentation makes it
clear that these files are important.

I think naming these directories to convey the right impression is a
straightforward interface design problem, and we also know that big
flashing red warnings are less effective than one might hope for. I
do not think a bigger, stripier warning is worth doing in isolation.
I do think it's worth choosing better names.

-- Abhijit

P.S. Unrelated to Michael's mail, but I also don't think it's worth
debating whether people will run rm -rf *log or rm -rf log/* or
whatever other variant you can think of. I'm arguing for correcting
a mis-perception, not try to dodge specific harmful commands. Tom's
proposal of using a symlink but dropping it after third-party tools
have had time to catch up seems like the best approach to me.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_xlog - pg_xjournal?

2015-05-31 Thread Abhijit Menon-Sen
At 2015-05-31 13:46:33 -0400, t...@sss.pgh.pa.us wrote:

 just always create pg_xlog as a symlink to pg_xjournal during initdb.

At first glance, the Subject: of this thread made me think that *was*
Joel's proposal. :-) But I think it's a great idea, and worth doing.

I think pg_journal (no x) is sufficient. The journal is an idea that
people are familiar with from filesystems anyway.

 Note that we'd really also have to rename pg_clog etc

pg_clog could become pg_commits or pg_xactstatus or pg_commit_status or
something. What else is there? I'd hope pg_logical can be left alone.

 A more difficult question is whether we'd also rename pg_resetxlog,
 pg_receivexlog, etc.

I don't think it's necessary. (Of course, people have wanted to rename
pg_resetxlog to make it sound more scary anyway, but that's a different
matter.)

 In the end though, this is a lot of thrashing for a problem that
 only comes up rarely ...

I'll agree with Joel that it comes up far too often for comfort anyway.
I've known a number of people who were on the verge of deleting stuff
from pg_xlog, but just happened to check with me first.

-- Abhijit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] [PATCH, TRIVIAL] don't specify S_IRUSR|S_IWUSR without O_CREAT

2015-05-29 Thread Abhijit Menon-Sen
Just for the record: a minor nit I noticed yesterday.

-- Abhijit
From 07353c86483f7e26d44a9bbe94b32315537cee73 Mon Sep 17 00:00:00 2001
From: Abhijit Menon-Sen a...@2ndquadrant.com
Date: Fri, 29 May 2015 23:15:15 +0530
Subject: The file mode is ignored without O_CREAT, so set it to 0

---
 src/backend/storage/file/fd.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index b4f6590..2f313ff 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -416,6 +416,7 @@ void
 fsync_fname(char *fname, bool isdir)
 {
 	int			fd;
+	int			flags;
 	int			returncode;
 
 	/*
@@ -423,14 +424,13 @@ fsync_fname(char *fname, bool isdir)
 	 * systems don't allow us to fsync files opened read-only; so we need both
 	 * cases here
 	 */
+	flags = PG_BINARY;
 	if (!isdir)
-		fd = OpenTransientFile(fname,
-			   O_RDWR | PG_BINARY,
-			   S_IRUSR | S_IWUSR);
+		flags |= O_RDWR;
 	else
-		fd = OpenTransientFile(fname,
-			   O_RDONLY | PG_BINARY,
-			   S_IRUSR | S_IWUSR);
+		flags |= O_RDONLY;
+
+	fd = OpenTransientFile(fname, flags, 0);
 
 	/*
 	 * Some OSs don't allow us to open directories at all (Windows returns
-- 
1.9.1


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously

2015-05-29 Thread Abhijit Menon-Sen
At 2015-05-29 13:14:18 -0400, t...@sss.pgh.pa.us wrote:

 Pushed with minor revisions.

Thanks, looks good.

 Since we're only logging the failures anyway, I think it is reasonable
 to log a complaint for any unwritable file in the data directory. 

Sounds reasonable, patch attached. ETXTBSY has no special meaning that I
can find under Windows, so I included that change as well.

-- Abhijit
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index b4f6590..70e2347 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -2646,11 +2646,12 @@ pre_sync_fname(const char *fname, bool isdir, int elevel)
 
 	if (fd  0)
 	{
-		if (errno == EACCES || (isdir  errno == EISDIR))
+		/* Some platforms don't support opening directories at all. */
+		if (isdir  errno == EISDIR)
 			return;
 
-#ifdef ETXTBSY
-		if (errno == ETXTBSY)
+#ifdef WIN32
+		if (isdir  errno == EACCES)
 			return;
 #endif
 
@@ -2701,11 +2702,12 @@ fsync_fname_ext(const char *fname, bool isdir, int elevel)
 	fd = OpenTransientFile((char *) fname, flags, 0);
 	if (fd  0)
 	{
-		if (errno == EACCES || (isdir  errno == EISDIR))
+		/* Some platforms don't support opening directories at all. */
+		if (isdir  errno == EISDIR)
 			return;
 
-#ifdef ETXTBSY
-		if (errno == ETXTBSY)
+#ifdef WIN32
+		if (isdir  errno == EACCES)
 			return;
 #endif
 
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index f0d66fa..f343168 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -617,11 +617,12 @@ pre_sync_fname(const char *fname, bool isdir)
 
 	if (fd  0)
 	{
-		if (errno == EACCES || (isdir  errno == EISDIR))
+		/* Some platforms don't support opening directories at all. */
+		if (isdir  errno == EISDIR)
 			return;
 
-#ifdef ETXTBSY
-		if (errno == ETXTBSY)
+#ifdef WIN32
+		if (isdir  errno == EACCES)
 			return;
 #endif
 
@@ -682,11 +683,12 @@ fsync_fname_ext(const char *fname, bool isdir)
 	fd = open(fname, flags);
 	if (fd  0)
 	{
-		if (errno == EACCES || (isdir  errno == EISDIR))
+		/* Some platforms don't support opening directories at all. */
+		if (isdir  errno == EISDIR)
 			return;
 
-#ifdef ETXTBSY
-		if (errno == ETXTBSY)
+#ifdef WIN32
+		if (isdir  errno == EACCES)
 			return;
 #endif
 

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously

2015-05-28 Thread Abhijit Menon-Sen
At 2015-05-28 19:56:15 +0530, a...@2ndquadrant.com wrote:

 I have a separate patch to initdb with the corresponding changes, which
 I will post after dinner and a bit more testing.

Here's that patch too.

I was a bit unsure how far to go with this. It fixes the problem of not
following pg_xlog if it's a symlink (Andres) and the one where it died
on bogus stuff in pg_tblspc (Tom) and unreadable files anywhere else
(it now spews errors to stderr, but carries on for as long as it can).

I've done a bit more than strictly necessary to fix those problems,
though, and made the code largely similar to what's in the other patch.
If you want something minimal, let me know and I'll cut it down to size.

-- Abhijit

P.S. If this patch gets applied, then the Adapted from walktblspc_links
in initdb.c comment in fd.c should be changed: s/links/entries/.
From 8e69433fae0b4f51879793f6594c76b99d764818 Mon Sep 17 00:00:00 2001
From: Abhijit Menon-Sen a...@2ndquadrant.com
Date: Thu, 28 May 2015 22:02:29 +0530
Subject: Make initdb -S behave like xlog.c:fsync_pgdata()

In particular, it should silently skip unreadable/unexpected files in
the data directory and follow symlinks only for pg_xlog and under
pg_tblspc.
---
 src/bin/initdb/initdb.c | 235 +---
 1 file changed, 122 insertions(+), 113 deletions(-)

diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index f1c4920..8ebaa55 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -218,9 +218,9 @@ static char **filter_lines_with_token(char **lines, const char *token);
 static char **readfile(const char *path);
 static void writefile(char *path, char **lines);
 static void walkdir(char *path, void (*action) (char *fname, bool isdir));
-static void walktblspc_links(char *path, void (*action) (char *fname, bool isdir));
+static void walktblspc_entries(char *path, void (*action) (char *fname, bool isdir));
 static void pre_sync_fname(char *fname, bool isdir);
-static void fsync_fname(char *fname, bool isdir);
+static void fsync_fname_ext(char *fname, bool isdir);
 static FILE *popen_check(const char *command, const char *mode);
 static void exit_nicely(void);
 static char *get_id(void);
@@ -248,7 +248,7 @@ static void load_plpgsql(void);
 static void vacuum_db(void);
 static void make_template0(void);
 static void make_postgres(void);
-static void perform_fsync(void);
+static void fsync_pgdata(void);
 static void trapsig(int signum);
 static void check_ok(void);
 static char *escape_quotes(const char *src);
@@ -518,38 +518,38 @@ writefile(char *path, char **lines)
  * walkdir: recursively walk a directory, applying the action to each
  * regular file and directory (including the named directory itself).
  *
- * Adapted from copydir() in copydir.c.
+ * Adapted from walkdir() in backend/storage/file/fd.c.
  */
 static void
 walkdir(char *path, void (*action) (char *fname, bool isdir))
 {
 	DIR		   *dir;
-	struct dirent *direntry;
-	char		subpath[MAXPGPATH];
+	struct dirent *de;
 
 	dir = opendir(path);
 	if (dir == NULL)
 	{
 		fprintf(stderr, _(%s: could not open directory \%s\: %s\n),
 progname, path, strerror(errno));
-		exit_nicely();
+		return;
 	}
 
-	while (errno = 0, (direntry = readdir(dir)) != NULL)
+	while (errno = 0, (de = readdir(dir)) != NULL)
 	{
+		char		subpath[MAXPGPATH];
 		struct stat fst;
 
-		if (strcmp(direntry-d_name, .) == 0 ||
-			strcmp(direntry-d_name, ..) == 0)
+		if (strcmp(de-d_name, .) == 0 ||
+			strcmp(de-d_name, ..) == 0)
 			continue;
 
-		snprintf(subpath, MAXPGPATH, %s/%s, path, direntry-d_name);
+		snprintf(subpath, MAXPGPATH, %s/%s, path, de-d_name);
 
 		if (lstat(subpath, fst)  0)
 		{
 			fprintf(stderr, _(%s: could not stat file \%s\: %s\n),
 	progname, subpath, strerror(errno));
-			exit_nicely();
+			continue;
 		}
 
 		if (S_ISDIR(fst.st_mode))
@@ -559,75 +559,72 @@ walkdir(char *path, void (*action) (char *fname, bool isdir))
 	}
 
 	if (errno)
-	{
 		fprintf(stderr, _(%s: could not read directory \%s\: %s\n),
 progname, path, strerror(errno));
-		exit_nicely();
-	}
 
-	if (closedir(dir))
-	{
-		fprintf(stderr, _(%s: could not close directory \%s\: %s\n),
-progname, path, strerror(errno));
-		exit_nicely();
-	}
+	(void) closedir(dir);
 
-	/*
-	 * It's important to fsync the destination directory itself as individual
-	 * file fsyncs don't guarantee that the directory entry for the file is
-	 * synced.  Recent versions of ext4 have made the window much wider but
-	 * it's been an issue for ext3 and other filesystems in the past.
-	 */
 	(*action) (path, true);
 }
 
 /*
- * walktblspc_links: call walkdir on each entry under the given
- * pg_tblspc directory, or do nothing if pg_tblspc doesn't exist.
+ * walktblspc_entries -- apply the action to the entries in pb_tblspc
+ *
+ * We expect to find only symlinks to tablespace directories here, but
+ * we'll apply the action to regular files, and call walkdir() on any
+ * directories as well

Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously

2015-05-28 Thread Abhijit Menon-Sen
At 2015-05-28 17:37:16 -0400, t...@sss.pgh.pa.us wrote:

 I've committed this after some mostly-cosmetic rearrangements.

Thank you.

 I have to leave shortly, so I'll look at the initdb cleanup tomorrow.

Here's a revision of that patch that's more along the lines of what you
committed.

It occurred to me that we should probably also silently skip EACCES on
opendir and stat/lstat in walkdir. That's what the attached initdb patch
does. If you think it's a good idea, there's also a small fixup attached
for the backend version too.

-- Abhijit
From 26bb25e99a2954381587bbfe296a50b19a7f047c Mon Sep 17 00:00:00 2001
From: Abhijit Menon-Sen a...@2ndquadrant.com
Date: Thu, 28 May 2015 22:02:29 +0530
Subject: Make initdb -S behave like xlog.c:fsync_pgdata()

In particular, it should silently skip unreadable/unexpected files in
the data directory and follow symlinks only for pg_xlog and under
pg_tblspc.
---
 src/bin/initdb/initdb.c | 305 
 1 file changed, 151 insertions(+), 154 deletions(-)

diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index f1c4920..9d5478c 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -71,6 +71,13 @@
 /* Ideally this would be in a .h file, but it hardly seems worth the trouble */
 extern const char *select_default_timezone(const char *share_path);
 
+/* Define PG_FLUSH_DATA_WORKS if we have an implementation for pg_flush_data */
+#if defined(HAVE_SYNC_FILE_RANGE)
+#define PG_FLUSH_DATA_WORKS 1
+#elif defined(USE_POSIX_FADVISE)  defined(POSIX_FADV_DONTNEED)
+#define PG_FLUSH_DATA_WORKS 1
+#endif
+
 static const char *auth_methods_host[] = {trust, reject, md5, password, ident, radius,
 #ifdef ENABLE_GSS
 	gss,
@@ -217,10 +224,13 @@ static char **filter_lines_with_token(char **lines, const char *token);
 #endif
 static char **readfile(const char *path);
 static void writefile(char *path, char **lines);
-static void walkdir(char *path, void (*action) (char *fname, bool isdir));
-static void walktblspc_links(char *path, void (*action) (char *fname, bool isdir));
-static void pre_sync_fname(char *fname, bool isdir);
-static void fsync_fname(char *fname, bool isdir);
+static void walkdir(const char *path,
+	void (*action) (const char *fname, bool isdir),
+	bool process_symlinks);
+#ifdef PG_FLUSH_DATA_WORKS
+static void pre_sync_fname(const char *fname, bool isdir);
+#endif
+static void fsync_fname_ext(const char *fname, bool isdir);
 static FILE *popen_check(const char *command, const char *mode);
 static void exit_nicely(void);
 static char *get_id(void);
@@ -248,7 +258,7 @@ static void load_plpgsql(void);
 static void vacuum_db(void);
 static void make_template0(void);
 static void make_postgres(void);
-static void perform_fsync(void);
+static void fsync_pgdata(void);
 static void trapsig(int signum);
 static void check_ok(void);
 static char *escape_quotes(const char *src);
@@ -521,56 +531,58 @@ writefile(char *path, char **lines)
  * Adapted from copydir() in copydir.c.
  */
 static void
-walkdir(char *path, void (*action) (char *fname, bool isdir))
+walkdir(const char *path,
+		void (*action) (const char *fname, bool isdir),
+		bool process_symlinks)
 {
 	DIR		   *dir;
-	struct dirent *direntry;
-	char		subpath[MAXPGPATH];
+	struct dirent *de;
 
 	dir = opendir(path);
 	if (dir == NULL)
 	{
-		fprintf(stderr, _(%s: could not open directory \%s\: %s\n),
-progname, path, strerror(errno));
-		exit_nicely();
+		if (errno != EACCES)
+			fprintf(stderr, _(%s: could not open directory \%s\: %s\n),
+	progname, path, strerror(errno));
+		return;
 	}
 
-	while (errno = 0, (direntry = readdir(dir)) != NULL)
+	while (errno = 0, (de = readdir(dir)) != NULL)
 	{
+		char		subpath[MAXPGPATH];
 		struct stat fst;
+		int			sret;
 
-		if (strcmp(direntry-d_name, .) == 0 ||
-			strcmp(direntry-d_name, ..) == 0)
+		if (strcmp(de-d_name, .) == 0 ||
+			strcmp(de-d_name, ..) == 0)
 			continue;
 
-		snprintf(subpath, MAXPGPATH, %s/%s, path, direntry-d_name);
+		snprintf(subpath, MAXPGPATH, %s/%s, path, de-d_name);
+
+		if (process_symlinks)
+			sret = stat(subpath, fst);
+		else
+			sret = lstat(subpath, fst);
 
-		if (lstat(subpath, fst)  0)
+		if (sret  0)
 		{
-			fprintf(stderr, _(%s: could not stat file \%s\: %s\n),
-	progname, subpath, strerror(errno));
-			exit_nicely();
+			if (errno != EACCES)
+fprintf(stderr, _(%s: could not stat file \%s\: %s\n),
+		progname, subpath, strerror(errno));
+			continue;
 		}
 
-		if (S_ISDIR(fst.st_mode))
-			walkdir(subpath, action);
-		else if (S_ISREG(fst.st_mode))
+		if (S_ISREG(fst.st_mode))
 			(*action) (subpath, false);
+		else if (S_ISDIR(fst.st_mode))
+			walkdir(subpath, action, false);
 	}
 
 	if (errno)
-	{
 		fprintf(stderr, _(%s: could not read directory \%s\: %s\n),
 progname, path, strerror(errno));
-		exit_nicely();
-	}
 
-	if (closedir(dir))
-	{
-		fprintf(stderr, _(%s: could not close directory \%s\: %s\n),
-progname, path, strerror

Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously

2015-05-28 Thread Abhijit Menon-Sen
Hi.

Here's an updated patch for the fsync problem(s).

A few points that may be worth considering:

1. I've made the ReadDir → ReadDirExtended change, but in retrospect I
   don't think it's really worth it. Using ReadDir and letting it die
   is probably fine. (Aside: I left ReadDirExtended static for now.)

2. Robert, are you comfortable with what fsync_pgdata() does in xlog.c?
   I wasn't sure if I should move that to fd.c as well. I think it's
   borderline OK for now.

3. There's a comment in fsync_fname that we need to open directories
   with O_RDONLY and non-directories with O_RDWR. Does this also apply
   to pre_sync_fname (i.e. sync_file_range and posix_fadvise) on any
   platforms we care about? It doesn't seem so, but I'm not sure.

4. As a partial aside, why does fsync_fname use OpenTransientFile? It
   looks like it should use BasicOpenFile as pre_sync_fname does. We
   close the fd immediately after calling fsync anyway. Or is there
   some reason I'm missing that we should use OpenTransientFile in
   pre_sync_fname too?

   (Aside²: it's pointless to specify S_IRUSR|S_IWUSR in fsync_fname
   since we're not setting O_CREAT. Minor, so will submit separate
   patch.)

5. I made walktblspc_entries use stat() instead of lstat(), so that we
   can handle symlinks and directories the same way. Note that we won't
   continue to follow links inside the sub-directories because we use
   walkdir instead of recursing.

   But this depends on S_ISDIR() returning true after stat() on Windows
   with a junction. Does that work? And are the entries in pb_tblspc on
   Windows actually junctions?

I've tested this with unreadable files in PGDATA and junk inside
pb_tblspc. Feedback welcome.

I have a separate patch to initdb with the corresponding changes, which
I will post after dinner and a bit more testing.

-- Abhijit
From 491dcb8c7203fd992dd9c441f5463a75c88e6f52 Mon Sep 17 00:00:00 2001
From: Abhijit Menon-Sen a...@2ndquadrant.com
Date: Thu, 28 May 2015 17:22:18 +0530
Subject: Skip unreadable files in fsync_pgdata; follow only expected symlinks

---
 src/backend/access/transam/xlog.c |  40 +-
 src/backend/storage/file/fd.c | 262 ++
 src/include/storage/fd.h  |   6 +-
 3 files changed, 226 insertions(+), 82 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 087b6be..32447e7 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -11605,20 +11605,53 @@ SetWalWriterSleeping(bool sleeping)
 
 /*
  * Issue fsync recursively on PGDATA and all its contents.
+ *
+ * We fsync regular files and directories wherever they are, but we
+ * follow symlinks only for pg_xlog and under pg_tblspc.
  */
 static void
 fsync_pgdata(char *datadir)
 {
+	bool		xlog_is_symlink;
+	char		pg_xlog[MAXPGPATH];
+	char		pg_tblspc[MAXPGPATH];
+	struct stat st;
+
 	if (!enableFsync)
 		return;
 
+	snprintf(pg_xlog, MAXPGPATH, %s/pg_xlog, datadir);
+	snprintf(pg_tblspc, MAXPGPATH, %s/pg_tblspc, datadir);
+
+	/*
+	 * If pg_xlog is a symlink, then we need to recurse into it separately,
+	 * because the first walkdir below will ignore it.
+	 */
+	xlog_is_symlink = false;
+
+	if (lstat(pg_xlog, st)  0)
+		ereport(ERROR,
+(errcode_for_file_access(),
+errmsg(could not stat directory \pg_xlog\: %m)));
+
+#ifndef WIN32
+	if (S_ISLNK(st.st_mode))
+		xlog_is_symlink = true;
+#else
+	if (pgwin32_is_junction(subpath))
+		xlog_is_symlink = true;
+#endif
+
 	/*
 	 * If possible, hint to the kernel that we're soon going to fsync the data
 	 * directory and its contents.
 	 */
 #if defined(HAVE_SYNC_FILE_RANGE) || \
 	(defined(USE_POSIX_FADVISE)  defined(POSIX_FADV_DONTNEED))
-	walkdir(datadir, pre_sync_fname);
+	walkdir(DEBUG1, datadir, pre_sync_fname);
+	if (xlog_is_symlink)
+		walkdir(DEBUG1, pg_xlog, pre_sync_fname);
+	walktblspc_entries(DEBUG1, pg_tblspc, pre_sync_fname);
 #endif
 
 	/*
@@ -11628,5 +11661,8 @@ fsync_pgdata(char *datadir)
 	 * file fsyncs don't guarantee that the directory entry for the file is
 	 * synced.
 	 */
-	walkdir(datadir, fsync_fname);
+	walkdir(LOG, datadir, fsync_fname_ext);
+	if (xlog_is_symlink)
+		walkdir(LOG, pg_xlog, fsync_fname_ext);
+	walktblspc_entries(LOG, pg_tblspc, fsync_fname_ext);
 }
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 68d43c6..916f8b5 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -1921,6 +1921,37 @@ TryAgain:
 }
 
 /*
+ * Read a directory opened with AllocateDir and return the next dirent.
+ * On error, ereports at a caller-specified level and returns NULL if
+ * the level is not ERROR or more severe.
+ */
+static struct dirent *
+ReadDirExtended(int elevel, DIR *dir, const char *dirname)
+{
+	struct dirent *dent;
+
+	/* Give a generic message for AllocateDir failure, if caller didn't */
+	if (dir == NULL)
+	{
+		ereport(elevel,
+(errcode_for_file_access(),
+ errmsg(could

[HACKERS] [PATCH] readlink missing nul-termination in pg_rewind

2015-05-28 Thread Abhijit Menon-Sen
This is just something I noticed in passing. (I did a quick check of all
the other uses of readlink in the source, and they do get this right.)

-- Abhijit

P.S. Also in passing, I note that pg_rewind will follow links under any
directory anywhere named pg_tblspc (which probably doesn't matter), and
does not follow pg_xlog if it's a symlink (which probably does). If you
want, I can submit a trivial patch for the latter.
From d1e5cbea21bb916253bce2ad189deb1924864508 Mon Sep 17 00:00:00 2001
From: Abhijit Menon-Sen a...@2ndquadrant.com
Date: Thu, 28 May 2015 21:03:50 +0530
Subject: readlink() doesn't nul-terminate the buffer, so we have to

---
 src/bin/pg_rewind/copy_fetch.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/src/bin/pg_rewind/copy_fetch.c b/src/bin/pg_rewind/copy_fetch.c
index 9e317a2..4a7150b 100644
--- a/src/bin/pg_rewind/copy_fetch.c
+++ b/src/bin/pg_rewind/copy_fetch.c
@@ -126,6 +126,7 @@ recurse_dir(const char *datadir, const char *parentpath,
 		 fullpath);
 			}
 
+			link_target[len] = '\0';
 			callback(path, FILE_TYPE_SYMLINK, 0, link_target);
 
 			/*
-- 
1.9.1


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously

2015-05-27 Thread Abhijit Menon-Sen
At 2015-05-27 11:46:39 +0530, a...@2ndquadrant.com wrote:

 I'm trying a couple of approaches to that (e.g. using readdir directly
 instead of ReadDir), but other suggestions are welcome.

Here's what that looks like, but not yet fully tested.

-- Abhijit
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 087b6be..6956e88 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -11609,16 +11609,41 @@ SetWalWriterSleeping(bool sleeping)
 static void
 fsync_pgdata(char *datadir)
 {
+	bool		xlog_is_symlink;
+	char		pg_xlog[MAXPGPATH];
+	char		pg_tblspc[MAXPGPATH];
+	struct stat st;
+
 	if (!enableFsync)
 		return;
 
+	snprintf(pg_xlog, MAXPGPATH, %s/pg_xlog, datadir);
+	snprintf(pg_tblspc, MAXPGPATH, %s/pg_tblspc, datadir);
+
+	if (lstat(pg_xlog, st)  0)
+		ereport(ERROR,
+(errcode_for_file_access(),
+errmsg(could not stat directory \pg_xlog\: %m)));
+
+	xlog_is_symlink = false;
+#ifndef WIN32
+	if (S_ISLNK(st.st_mode))
+		xlog_is_symlink = true;
+#else
+	if (pgwin32_is_junction(subpath))
+		xlog_is_symlink = true;
+#endif
+
 	/*
 	 * If possible, hint to the kernel that we're soon going to fsync the data
 	 * directory and its contents.
 	 */
 #if defined(HAVE_SYNC_FILE_RANGE) || \
 	(defined(USE_POSIX_FADVISE)  defined(POSIX_FADV_DONTNEED))
-	walkdir(datadir, pre_sync_fname);
+	walkdir(DEBUG1, datadir, pre_sync_fname);
+	if (xlog_is_symlink)
+		walkdir(DEBUG1, pg_xlog, pre_sync_fname);
+	walktblspc_links(DEBUG1, pg_tblspc, pre_sync_fname);
 #endif
 
 	/*
@@ -11628,5 +11653,8 @@ fsync_pgdata(char *datadir)
 	 * file fsyncs don't guarantee that the directory entry for the file is
 	 * synced.
 	 */
-	walkdir(datadir, fsync_fname);
+	walkdir(WARNING, datadir, fsync_fname_ext);
+	if (xlog_is_symlink)
+		walkdir(DEBUG1, pg_xlog, fsync_fname_ext);
+	walktblspc_links(WARNING, pg_tblspc, fsync_fname_ext);
 }
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 68d43c6..711b4b7 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -2443,48 +2443,113 @@ looks_like_temp_rel_name(const char *name)
 /*
  * Hint to the OS that it should get ready to fsync() this file.
  *
- * Adapted from pre_sync_fname in initdb.c
+ * Ignores errors trying to open unreadable files, and logs other errors at a
+ * caller-specified level.
  */
 void
-pre_sync_fname(char *fname, bool isdir)
+pre_sync_fname(int elevel, char *fname, bool isdir)
 {
 	int			fd;
 
 	fd = BasicOpenFile(fname, O_RDONLY | PG_BINARY, 0);
 
+	if (fd  0)
+	{
+		if (errno == EACCES || (isdir  errno == EISDIR))
+			return;
+
+#ifdef ETXTBSY
+		if (errno == ETXTBSY)
+			return;
+#endif
+
+		ereport(elevel,
+(errcode_for_file_access(),
+ errmsg(could not open file \%s\: %m, fname)));
+	}
+
+	(void) pg_flush_data(fd, 0, 0);
+	(void) close(fd);
+}
+
+/*
+ * fsync_fname_ext -- Try to fsync a file or directory
+ *
+ * Ignores errors trying to open unreadable files, or trying to fsync
+ * directories on systems where that isn't allowed/required, and logs other
+ * errors at a caller-specified level.
+ */
+void
+fsync_fname_ext(int elevel, char *fname, bool isdir)
+{
+	int			fd;
+	int			returncode;
+
 	/*
-	 * Some OSs don't allow us to open directories at all (Windows returns
-	 * EACCES)
+	 * Some OSs require directories to be opened read-only whereas other
+	 * systems don't allow us to fsync files opened read-only; so we need both
+	 * cases here
 	 */
-	if (fd  0  isdir  (errno == EISDIR || errno == EACCES))
-		return;
+	if (!isdir)
+		fd = OpenTransientFile(fname,
+			   O_RDWR | PG_BINARY,
+			   S_IRUSR | S_IWUSR);
+	else
+		fd = OpenTransientFile(fname,
+			   O_RDONLY | PG_BINARY,
+			   S_IRUSR | S_IWUSR);
 
 	if (fd  0)
-		ereport(FATAL,
-(errmsg(could not open file \%s\: %m,
-		fname)));
+	{
+		if (errno == EACCES || (isdir  errno == EISDIR))
+			return;
 
-	pg_flush_data(fd, 0, 0);
+#ifdef ETXTBSY
+		if (errno == ETXTBSY)
+			return;
+#endif
 
-	close(fd);
+		ereport(elevel,
+(errcode_for_file_access(),
+ errmsg(could not open file \%s\: %m, fname)));
+	}
+
+	returncode = pg_fsync(fd);
+
+	/*
+	 * Some OSes don't allow us to fsync directories at all, so we can ignore
+	 * those errors. Anything else needs to be logged.
+	 */
+	if (returncode != 0  !(isdir  errno == EBADF))
+		ereport(elevel,
+(errcode_for_file_access(),
+ errmsg(could not fsync file \%s\: %m, fname)));
+
+	CloseTransientFile(fd);
 }
 
 /*
  * walkdir: recursively walk a directory, applying the action to each
- * regular file and directory (including the named directory itself)
- * and following symbolic links.
+ * regular file and directory (including the named directory itself).
  *
- * NB: There is another version of walkdir in initdb.c, but that version
- * behaves differently with respect to symbolic links.  Caveat emptor!
+ * Adapted from walkdir in initdb.c
  */
 void
-walkdir(char *path, void (*action) (char 

Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously

2015-05-27 Thread Abhijit Menon-Sen
At 2015-05-27 20:22:18 -0400, t...@sss.pgh.pa.us wrote:

 I doubt that that (not using AllocateDir) […]

(Disregarding per your followup.)

 What I think is a reasonable compromise is to treat AllocateDir
 failure as nonfatal, but to continue using ReadDir etc which means
 that any post-open failure in reading a successfully-opened directory
 would be fatal.

OK, that's halfway between the two patches I posted. Will do.

 Meanwhile, it seems like you have copied a couple of very dubious
 decisions in initdb's walktblspc_links(): […]

 Now, we don't really have to do anything about these things in order
 to fix the immediate problem, but I wonder if we shouldn't try to
 clean up initdb's behavior while we're at it.

OK, I'll fix that in both.

 Independently of that, I thought the plan was to complain about any
 problems at LOG message level, not DEBUG1, and definitely not WARNING
 (which is lower than LOG for this purpose). 

I'll change the level to LOG for fsync_fname_ext, but I think DEBUG1 is
more appropriate for the pre_sync_fname run. Or do you think I should
use LOG for that too?

 And why would you use a different message level for pg_xlog/ than for
 other files anyway?

That was just a mistake, I forgot to change it after copying.

Thanks for having a look. I'll post a revised patch shortly.

-- Abhijit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously

2015-05-27 Thread Abhijit Menon-Sen
At 2015-05-27 23:41:29 -0400, t...@sss.pgh.pa.us wrote:

 What about turning ReadDir into a wrapper around a ReadDirExtended
 function that adds an elevel argument?

Sounds reasonable, doing.

-- Abhijit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously

2015-05-26 Thread Abhijit Menon-Sen
At 2015-05-26 19:07:20 +0200, and...@anarazel.de wrote:

 Abhijit, do you recall why the code was changed to follow all symlinks
 in contrast to explicitly going through the tablespaces as initdb -S
 does? I'm pretty sure early versions of the patch pretty much had a
 verbatim copy of the initdb logic?

Yes, earlier versions of the patch did follow symlinks only in
pg_tblspc. I changed this because Álvaro pointed out that this
was likely to miss important stuff, e.g. in

20150403163232.ga28...@eldon.alvh.no-ip.org,
20150406131636.gd4...@alvh.no-ip.org

-- Abhijit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously

2015-05-26 Thread Abhijit Menon-Sen
At 2015-05-26 22:44:03 +0200, and...@anarazel.de wrote:

 So, this was discussed in the following thread, starting at:
 http://archives.postgresql.org/message-id/20150403163232.GA28444%40eldon.alvh.no-ip.org

Sorry, I didn't see this before replying.

 There are no other places we it's allowed to introduce symlinks and
 we have refuted bugreports of people having problems after doing that.

OK.

 So what I propose is:
 1) Remove the automatic symlink following
 2) Follow pg_tbspc/*, pg_xlog if it's a symlink, fix the latter in
initdb -S
 3) Add a elevel argument to walkdir(), return if AllocateDir() fails,
continue for stat() failures in the readdir() loop.
 4) Add elevel argument to pre_sync_fname, fsync_fname, return after
errors.
 5) Accept EACCESS, ETXTBSY (if defined) when open()ing the files. By
virtue of not following symlinks we should not need to worry about
EROFS

Makes sense. I've got most of that done, I'll just remove the symlink
following (or rather, restrict it to the cases mentioned), test a bit,
and post.

 I'm inclined to think that 4) is a big enough compat break that a
 fsync_fname_ext with the new argument is a good idea.

Agreed.

-- Abhijit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously

2015-05-25 Thread Abhijit Menon-Sen
At 2015-05-26 03:54:51 +0200, and...@anarazel.de wrote:

 Say a symlink goes to a binary, which is currently being executed:
 ETXTBSY. Or the file is in a readonly filesystem: EROFS.  So we'd
 need to ignore a lot of errors, possibly ignoring valid ones.

Right. That's why I started out by being conservative and following only
the expected symlinks in pg_tblspc (as other parts of the code do).

 Another thing is whether we should handle a recursive symlink in
 pgdata? I personally think not, but...

I think not too.

 It's also not just as simple as making fsync_fname fail gracefully
 upon EACCESS - the opendir() could fail just as well.

I'll post a proposed patch shortly.

-- Abhijit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] a fast bloat measurement tool (was Re: Measuring relation free space)

2015-05-12 Thread Abhijit Menon-Sen
At 2015-05-13 03:04:11 +0200, and...@anarazel.de wrote:

 I can live with that, although I'd personally go with
 pgstattuple_approx()...

I could live with that too. Here's an incremental patch to rename the
function. (I'm not resubmitting the whole thing since you said you've
made some other changes too.)

Thank you.

-- Abhijit
From 57d597f176294ebfe30efa6d73569505cbb41e31 Mon Sep 17 00:00:00 2001
From: Abhijit Menon-Sen a...@2ndquadrant.com
Date: Wed, 13 May 2015 09:19:07 +0530
Subject: Rename pgstatapprox to pgstattuple_approx

---
 contrib/pgstattuple/pgstatapprox.c|  4 ++--
 contrib/pgstattuple/pgstattuple--1.2--1.3.sql |  4 ++--
 contrib/pgstattuple/pgstattuple--1.3.sql  |  4 ++--
 doc/src/sgml/pgstattuple.sgml | 12 ++--
 4 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/contrib/pgstattuple/pgstatapprox.c b/contrib/pgstattuple/pgstatapprox.c
index 46f5bc0..f3b5671 100644
--- a/contrib/pgstattuple/pgstatapprox.c
+++ b/contrib/pgstattuple/pgstatapprox.c
@@ -21,7 +21,7 @@
 #include utils/tqual.h
 #include commands/vacuum.h
 
-PG_FUNCTION_INFO_V1(pgstatapprox);
+PG_FUNCTION_INFO_V1(pgstattuple_approx);
 
 typedef struct output_type
 {
@@ -234,7 +234,7 @@ build_tuple(output_type *stat, FunctionCallInfo fcinfo)
  */
 
 Datum
-pgstatapprox(PG_FUNCTION_ARGS)
+pgstattuple_approx(PG_FUNCTION_ARGS)
 {
 	Oid			relid = PG_GETARG_OID(0);
 	Relation	rel;
diff --git a/contrib/pgstattuple/pgstattuple--1.2--1.3.sql b/contrib/pgstattuple/pgstattuple--1.2--1.3.sql
index bfcf464..99301a2 100644
--- a/contrib/pgstattuple/pgstattuple--1.2--1.3.sql
+++ b/contrib/pgstattuple/pgstattuple--1.2--1.3.sql
@@ -3,7 +3,7 @@
 -- complain if script is sourced in psql, rather than via ALTER EXTENSION
 \echo Use ALTER EXTENSION pgstattuple UPDATE TO '1.3' to load this file. \quit
 
-CREATE FUNCTION pgstatapprox(IN reloid regclass,
+CREATE FUNCTION pgstattuple_approx(IN reloid regclass,
 OUT table_len BIGINT,   -- physical table length in bytes
 OUT scanned_percent FLOAT8, -- what percentage of the table's pages was scanned
 OUT approx_tuple_count BIGINT,  -- estimated number of live tuples
@@ -14,5 +14,5 @@ CREATE FUNCTION pgstatapprox(IN reloid regclass,
 OUT dead_tuple_percent FLOAT8,  -- dead tuples in % (based on estimate)
 OUT approx_free_space BIGINT,   -- estimated free space in bytes
 OUT approx_free_percent FLOAT8) -- free space in % (based on estimate)
-AS 'MODULE_PATHNAME', 'pgstatapprox'
+AS 'MODULE_PATHNAME', 'pgstattuple_approx'
 LANGUAGE C STRICT;
diff --git a/contrib/pgstattuple/pgstattuple--1.3.sql b/contrib/pgstattuple/pgstattuple--1.3.sql
index f69b619..f3996e7 100644
--- a/contrib/pgstattuple/pgstattuple--1.3.sql
+++ b/contrib/pgstattuple/pgstattuple--1.3.sql
@@ -80,7 +80,7 @@ LANGUAGE C STRICT;
 
 /* New stuff in 1.3 begins here */
 
-CREATE FUNCTION pgstatapprox(IN reloid regclass,
+CREATE FUNCTION pgstattuple_approx(IN reloid regclass,
 OUT table_len BIGINT,   -- physical table length in bytes
 OUT scanned_percent FLOAT8, -- what percentage of the table's pages was scanned
 OUT approx_tuple_count BIGINT,  -- estimated number of live tuples
@@ -91,5 +91,5 @@ CREATE FUNCTION pgstatapprox(IN reloid regclass,
 OUT dead_tuple_percent FLOAT8,  -- dead tuples in % (based on estimate)
 OUT approx_free_space BIGINT,   -- estimated free space in bytes
 OUT approx_free_percent FLOAT8) -- free space in % (based on estimate)
-AS 'MODULE_PATHNAME', 'pgstatapprox'
+AS 'MODULE_PATHNAME', 'pgstattuple_approx'
 LANGUAGE C STRICT;
diff --git a/doc/src/sgml/pgstattuple.sgml b/doc/src/sgml/pgstattuple.sgml
index 4cba006..cca6dc9 100644
--- a/doc/src/sgml/pgstattuple.sgml
+++ b/doc/src/sgml/pgstattuple.sgml
@@ -361,19 +361,19 @@ pending_tuples | 0
varlistentry
 term
  indexterm
-  primarypgstatapprox/primary
+  primarypgstattuple_approx/primary
  /indexterm
- functionpgstatapprox(regclass) returns record/
+ functionpgstattuple_approx(regclass) returns record/
 /term
 
 listitem
  para
-  functionpgstatapprox/function is a faster alternative to
+  functionpgstattuple_approx/function is a faster alternative to
   functionpgstattuple/function that returns approximate results.
   The argument is the target relation's OID.
   For example:
 programlisting
-test= SELECT * FROM pgstatapprox('pg_catalog.pg_proc'::regclass);
+test= SELECT * FROM pgstattuple_approx('pg_catalog.pg_proc'::regclass);
 -[ RECORD 1 ]+---
 table_len| 573440
 scanned_percent  | 2
@@ -392,7 +392,7 @@ approx_free_percent  | 2.09
  para
  Whereas functionpgstattuple/function always performs a
  full-table scan and returns an exact count of live and dead tuples
- (and their sizes) and free space, functionpgstatapprox/function
+ (and their sizes) and free space, functionpgstattuple_approx

Re: [HACKERS] a fast bloat measurement tool (was Re: Measuring relation free space)

2015-05-11 Thread Abhijit Menon-Sen
Hi Andres.

I've attached an updated patch for pgstatbloat, as well as a patch to
replace two uses of BuildTupleFromCStrings() elsewhere in pgstattuple.

I've made the changes you mentioned in your earlier mail, except that I
have not changed the name pending further suggestions about what would
be the best name.

Also:

At 2015-05-09 15:36:49 +0530, a...@2ndquadrant.com wrote:

 At 2015-05-09 02:20:51 +0200, and...@anarazel.de wrote:
 
  I haven't checked, but I'm not sure that it's safe/meaningful to
  call PageGetHeapFreeSpace() on a new page.
 
 OK, I'll check and fix if necessary.

You're right, PageGetHeapFreeSpace() isn't safe on a new page. I've
added a guard to that call in the attached patch, but I'm not sure
that's the right thing to do.

Should I copy the orphaned new-page handling from lazy_scan_heap? What
about for empty pages? Neither feels like a very good thing to do, but
then neither does skipping the new page silently. Should I count the
space it would have free if it were initialised, but leave the page
alone for VACUUM to deal with? Or just leave it as it is?

-- Abhijit
From 421267bba8394255feed8f9b9424d25d0be9f979 Mon Sep 17 00:00:00 2001
From: Abhijit Menon-Sen a...@2ndquadrant.com
Date: Mon, 11 May 2015 15:54:48 +0530
Subject: Make pgstattuple use heap_form_tuple instead of
 BuildTupleFromCStrings

---
 contrib/pgstattuple/pgstatindex.c | 43 ++---
 contrib/pgstattuple/pgstattuple.c | 45 ++-
 2 files changed, 37 insertions(+), 51 deletions(-)

diff --git a/contrib/pgstattuple/pgstatindex.c b/contrib/pgstattuple/pgstatindex.c
index a2ea5d7..608f729 100644
--- a/contrib/pgstattuple/pgstatindex.c
+++ b/contrib/pgstattuple/pgstatindex.c
@@ -257,7 +257,8 @@ pgstatindex_impl(Relation rel, FunctionCallInfo fcinfo)
 	{
 		TupleDesc	tupleDesc;
 		int			j;
-		char	   *values[10];
+		Datum		values[10];
+		bool		nulls[10];
 		HeapTuple	tuple;
 
 		/* Build a tuple descriptor for our result type */
@@ -265,33 +266,31 @@ pgstatindex_impl(Relation rel, FunctionCallInfo fcinfo)
 			elog(ERROR, return type must be a row type);
 
 		j = 0;
-		values[j++] = psprintf(%d, indexStat.version);
-		values[j++] = psprintf(%d, indexStat.level);
-		values[j++] = psprintf(INT64_FORMAT,
-			   (indexStat.root_pages +
-indexStat.leaf_pages +
-indexStat.internal_pages +
-indexStat.deleted_pages +
-indexStat.empty_pages) * BLCKSZ);
-		values[j++] = psprintf(%u, indexStat.root_blkno);
-		values[j++] = psprintf(INT64_FORMAT, indexStat.internal_pages);
-		values[j++] = psprintf(INT64_FORMAT, indexStat.leaf_pages);
-		values[j++] = psprintf(INT64_FORMAT, indexStat.empty_pages);
-		values[j++] = psprintf(INT64_FORMAT, indexStat.deleted_pages);
+		values[j++] = Int32GetDatum(indexStat.version);
+		values[j++] = Int32GetDatum(indexStat.level);
+		values[j++] = Int64GetDatum((indexStat.root_pages +
+	 indexStat.leaf_pages +
+	 indexStat.internal_pages +
+	 indexStat.deleted_pages +
+	 indexStat.empty_pages) * BLCKSZ);
+		values[j++] = Int32GetDatum(indexStat.root_blkno);
+		values[j++] = Int32GetDatum(indexStat.internal_pages);
+		values[j++] = Int32GetDatum(indexStat.leaf_pages);
+		values[j++] = Int32GetDatum(indexStat.empty_pages);
+		values[j++] = Int32GetDatum(indexStat.deleted_pages);
 		if (indexStat.max_avail  0)
-			values[j++] = psprintf(%.2f,
-   100.0 - (double) indexStat.free_space / (double) indexStat.max_avail * 100.0);
+			values[j++] = Float8GetDatum(100.0 - (double) indexStat.free_space / (double) indexStat.max_avail * 100.0);
 		else
-			values[j++] = pstrdup(NaN);
+			values[j++] = CStringGetDatum(NaN);
 		if (indexStat.leaf_pages  0)
-			values[j++] = psprintf(%.2f,
-   (double) indexStat.fragments / (double) indexStat.leaf_pages * 100.0);
+			values[j++] = Float8GetDatum((double) indexStat.fragments / (double) indexStat.leaf_pages * 100.0);
 		else
-			values[j++] = pstrdup(NaN);
+			values[j++] = CStringGetDatum(NaN);
 
-		tuple = BuildTupleFromCStrings(TupleDescGetAttInMetadata(tupleDesc),
-	   values);
+		for (j = 0; j  10; j++)
+			nulls[j] = false;
 
+		tuple = heap_form_tuple(tupleDesc, values, nulls);
 		result = HeapTupleGetDatum(tuple);
 	}
 
diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index c3a8b1d..552f188 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -85,28 +85,20 @@ static Datum
 build_pgstattuple_type(pgstattuple_type *stat, FunctionCallInfo fcinfo)
 {
 #define NCOLUMNS	9
-#define NCHARS		32
 
 	HeapTuple	tuple;
-	char	   *values[NCOLUMNS];
-	char		values_buf[NCOLUMNS][NCHARS];
+	Datum		values[NCOLUMNS];
+	bool		nulls[NCOLUMNS];
 	int			i;
 	double		tuple_percent;
 	double		dead_tuple_percent;
 	double		free_percent;	/* free/reusable space in % */
 	TupleDesc	tupdesc;
-	AttInMetadata *attinmeta;
 
 	/* Build a tuple descriptor for our result type

Re: [HACKERS] a fast bloat measurement tool (was Re: Measuring relation free space)

2015-05-11 Thread Abhijit Menon-Sen
At 2015-05-11 19:15:47 +0200, and...@anarazel.de wrote:

 TBH, I'd rather not touch unrelated things right now. We're pretty
 badly behind...

OK. That patch is independent; just ignore it.

 I don't really care how it's named, as long as it makes clear that
 it's not an exact measurement.

Not having heard any better suggestions, I picked pgstatapprox as a
compromise between length and familiarity/consistency with pgstattuple.

  Should I count the space it would have free if it were initialised,
  but leave the page alone for VACUUM to deal with?

And this is what the attached patch does.

I also cleaned up a few things that I didn't like but had left alone to
make the code look similar to pgstattuple. In particular, build_tuple()
now does nothing but build a tuple from values calculated earlier in
pgstatapprox_heap().

Thank you.

-- Abhijit

P.S. What, if anything, should be done about the complicated and likely
not very useful skip-only-min#-blocks logic in lazy_scan_heap?
From 0a99fc61d36e3a3ff4003db95e5c299a5f07a850 Mon Sep 17 00:00:00 2001
From: Abhijit Menon-Sen a...@2ndquadrant.com
Date: Fri, 26 Dec 2014 12:37:13 +0530
Subject: Add pgstatapprox to pgstattuple

---
 contrib/pgstattuple/Makefile   |   4 +-
 contrib/pgstattuple/pgstatapprox.c | 277 +
 contrib/pgstattuple/pgstattuple--1.2--1.3.sql  |  18 ++
 .../{pgstattuple--1.2.sql = pgstattuple--1.3.sql} |  18 +-
 contrib/pgstattuple/pgstattuple.control|   2 +-
 doc/src/sgml/pgstattuple.sgml  | 135 ++
 6 files changed, 450 insertions(+), 4 deletions(-)
 create mode 100644 contrib/pgstattuple/pgstatapprox.c
 create mode 100644 contrib/pgstattuple/pgstattuple--1.2--1.3.sql
 rename contrib/pgstattuple/{pgstattuple--1.2.sql = pgstattuple--1.3.sql} (72%)

diff --git a/contrib/pgstattuple/Makefile b/contrib/pgstattuple/Makefile
index 862585c..6083dab 100644
--- a/contrib/pgstattuple/Makefile
+++ b/contrib/pgstattuple/Makefile
@@ -1,10 +1,10 @@
 # contrib/pgstattuple/Makefile
 
 MODULE_big	= pgstattuple
-OBJS		= pgstattuple.o pgstatindex.o $(WIN32RES)
+OBJS		= pgstattuple.o pgstatindex.o pgstatapprox.o $(WIN32RES)
 
 EXTENSION = pgstattuple
-DATA = pgstattuple--1.2.sql pgstattuple--1.1--1.2.sql pgstattuple--1.0--1.1.sql pgstattuple--unpackaged--1.0.sql
+DATA = pgstattuple--1.3.sql pgstattuple--1.2--1.3.sql pgstattuple--1.1--1.2.sql pgstattuple--1.0--1.1.sql pgstattuple--unpackaged--1.0.sql
 PGFILEDESC = pgstattuple - tuple-level statistics
 
 REGRESS = pgstattuple
diff --git a/contrib/pgstattuple/pgstatapprox.c b/contrib/pgstattuple/pgstatapprox.c
new file mode 100644
index 000..46f5bc0
--- /dev/null
+++ b/contrib/pgstattuple/pgstatapprox.c
@@ -0,0 +1,277 @@
+/*
+ * contrib/pgstattuple/pgstatapprox.c
+ * Fast bloat estimation functions
+ */
+
+#include postgres.h
+
+#include access/visibilitymap.h
+#include access/transam.h
+#include access/xact.h
+#include access/multixact.h
+#include access/htup_details.h
+#include catalog/namespace.h
+#include funcapi.h
+#include miscadmin.h
+#include storage/bufmgr.h
+#include storage/freespace.h
+#include storage/procarray.h
+#include storage/lmgr.h
+#include utils/builtins.h
+#include utils/tqual.h
+#include commands/vacuum.h
+
+PG_FUNCTION_INFO_V1(pgstatapprox);
+
+typedef struct output_type
+{
+	uint64		table_len;
+	uint64		scanned_percent;
+	uint64		tuple_count;
+	uint64		tuple_len;
+	double		tuple_percent;
+	uint64		dead_tuple_count;
+	uint64		dead_tuple_len;
+	double		dead_tuple_percent;
+	uint64		free_space;
+	double		free_percent;
+} output_type;
+
+#define NUM_OUTPUT_COLUMNS 10
+
+/*
+ * This function takes an already open relation and scans its pages,
+ * skipping those that have the corresponding visibility map bit set.
+ * For pages we skip, we find the free space from the free space map
+ * and approximate tuple_len on that basis. For the others, we count
+ * the exact number of dead tuples etc.
+ *
+ * This scan is loosely based on vacuumlazy.c:lazy_scan_heap(), but
+ * we do not try to avoid skipping single pages.
+ */
+
+static void
+pgstatapprox_heap(Relation rel, output_type *stat)
+{
+	BlockNumber scanned,
+nblocks,
+blkno;
+	Buffer		vmbuffer = InvalidBuffer;
+	BufferAccessStrategy bstrategy;
+	TransactionId OldestXmin;
+	uint64		misc_count = 0;
+
+	OldestXmin = GetOldestXmin(rel, true);
+	bstrategy = GetAccessStrategy(BAS_BULKREAD);
+
+	nblocks = RelationGetNumberOfBlocks(rel);
+	scanned = 0;
+
+	for (blkno = 0; blkno  nblocks; blkno++)
+	{
+		Buffer		buf;
+		Page		page;
+		OffsetNumber offnum,
+	maxoff;
+		Size		freespace;
+
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * If the page has only visible tuples, then we can find out the
+		 * free space from the FSM and move on.
+		 */
+
+		if (visibilitymap_test(rel, blkno, vmbuffer))
+		{
+			freespace = GetRecordedFreeSpace(rel, blkno);
+			stat-tuple_len += BLCKSZ - freespace;
+			stat-free_space += freespace;
+			continue;
+		}
+
+		buf

Re: [HACKERS] a fast bloat measurement tool (was Re: Measuring relation free space)

2015-05-09 Thread Abhijit Menon-Sen
At 2015-05-09 02:20:51 +0200, and...@anarazel.de wrote:

  + * Abhijit Menon-Sen a...@2ndquadrant.com
  + * Portions Copyright (c) 2001,2002Tatsuo Ishii (from pgstattuple)
 
 I think for new extension we don't really add authors and such anymore.

OK, I'll get rid of the boilerplate.

 Hm. I do wonder if this should really be called 'statbloat'. Perhaps
 it'd more appropriately be called pg_estimate_bloat or somesuch?

Since we've moved it into pgstattuple, perhaps pgstattuple_approximate()
or pgstattuple_estimated() would be a better name. I don't really care,
I'll change it to whatever people like.

 Also, is free_space really exact? The fsm isn't byte exact.

You're right, I'll fix that.

 Why go through C strings?  I personally think we really shouldn't add
 more callers to BuildTupleFromCStrings, it's such an absurd interface.

I just copied this more or less blindly from pgstattuple. I'll fix it
and submit a separate patch to fix the equivalent code in pgstattuple.

 I think it'd actually be fine to just say that the relation has to be
 a table or matview.

Good point. Agreed.

 I don't believe that rationale is really true these days. I'd much
 rather just rip this out here than copy the rather complex logic.

Yes, thanks, I very much agree that this isn't really worth copying.
I'll drop it in my next submission.

 I haven't checked, but I'm not sure that it's safe/meaningful to call
 PageGetHeapFreeSpace() on a new page.

OK, I'll check and fix if necessary.

Thanks for the feedback. I'll submit a revised patch shortly.

-- Abhijit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] initdb -S and tablespaces

2015-05-01 Thread Abhijit Menon-Sen
At 2015-05-01 09:57:28 -0400, robertmh...@gmail.com wrote:

 If you don't object to this version, I'll commit it.

Looks fine to me, thank you.

As for the non-backpatchable 0002, I agree with Andres that it should be
included in 9.5; but I take it you're still not convinced? Should I add
that to the CF separately for discussion, or what?

-- Abhijit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] initdb -S and tablespaces

2015-05-01 Thread Abhijit Menon-Sen
At 2015-05-01 08:10:16 -0400, robertmh...@gmail.com wrote:

 It seems to me that, at a minimum, it would be good to split those
 controversial and definitely not-back-patchable changes into their
 own patch.

OK, split here (0002*).

 I do mind putting it into xlog.c instead of some place that's actually
 appropriate.

OK, moved to storage/file/fd.c (0001*).

-- Abhijit
From 088b80eb0825339eca688e4347a4ef547edcadbb Mon Sep 17 00:00:00 2001
From: Abhijit Menon-Sen a...@2ndquadrant.com
Date: Thu, 6 Nov 2014 00:45:56 +0530
Subject: Recursively fsync PGDATA at startup after a crash

This is so that we don't lose older unflushed writes in the event of
a power failure after crash recovery, where more recent writes are
preserved.

See 20140918083148.ga17...@alap3.anarazel.de for details.
---
 src/backend/access/transam/xlog.c |  49 +
 src/backend/storage/file/fd.c | 112 ++
 src/include/storage/fd.h  |   2 +
 3 files changed, 163 insertions(+)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6cf4415..084174d 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -845,6 +845,8 @@ static void WALInsertLockAcquireExclusive(void);
 static void WALInsertLockRelease(void);
 static void WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt);
 
+static void fsync_pgdata(char *datadir);
+
 /*
  * Insert an XLOG record represented by an already-constructed chain of data
  * chunks.  This is a low-level routine; to construct the WAL record header
@@ -5878,6 +5880,25 @@ StartupXLOG(void)
 		ereport(FATAL,
 (errmsg(control file contains invalid data)));
 
+	/*
+	 * If we need to perform crash recovery, we issue an fsync on the
+	 * data directory and its contents to try to ensure that any data
+	 * written before the crash are flushed to disk. Otherwise a power
+	 * failure in the near future might cause earlier unflushed writes
+	 * to be lost, even though more recent data written to disk from
+	 * here on would be persisted.
+	 *
+	 * We also do this if the control file indicates that fsync was
+	 * disabled at some point while the server was running earlier.
+	 */
+
+	if (enableFsync 
+		(ControlFile-state != DB_SHUTDOWNED 
+		 ControlFile-state != DB_SHUTDOWNED_IN_RECOVERY))
+	{
+		fsync_pgdata(data_directory);
+	}
+
 	if (ControlFile-state == DB_SHUTDOWNED)
 	{
 		/* This is the expected case, so don't be chatty in standalone mode */
@@ -11123,3 +11144,31 @@ SetWalWriterSleeping(bool sleeping)
 	XLogCtl-WalWriterSleeping = sleeping;
 	SpinLockRelease(XLogCtl-info_lck);
 }
+
+/*
+ * Issue fsync recursively on PGDATA and all its contents.
+ */
+static void
+fsync_pgdata(char *datadir)
+{
+	if (!enableFsync)
+		return;
+
+	/*
+	 * If possible, hint to the kernel that we're soon going to fsync
+	 * the data directory and its contents.
+	 */
+#if defined(HAVE_SYNC_FILE_RANGE) || \
+	(defined(USE_POSIX_FADVISE)  defined(POSIX_FADV_DONTNEED))
+	walkdir(datadir, pre_sync_fname);
+#endif
+
+	/*
+	 * Now we do the fsync()s in the same order.
+	 *
+	 * It's important to fsync the destination directory itself as individual
+	 * file fsyncs don't guarantee that the directory entry for the file is
+	 * synced.
+	 */
+	walkdir(datadir, fsync_fname);
+}
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index f796717..aba12ca 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -2439,3 +2439,115 @@ looks_like_temp_rel_name(const char *name)
 		return false;
 	return true;
 }
+
+/*
+ * Hint to the OS that it should get ready to fsync() this file.
+ *
+ * Adapted from pre_sync_fname in initdb.c
+ */
+void
+pre_sync_fname(char *fname, bool isdir)
+{
+	int			fd;
+
+	fd = open(fname, O_RDONLY | PG_BINARY);
+
+	/*
+	 * Some OSs don't allow us to open directories at all (Windows returns
+	 * EACCES)
+	 */
+	if (fd  0  isdir  (errno == EISDIR || errno == EACCES))
+		return;
+
+	if (fd  0)
+		ereport(FATAL,
+(errmsg(could not open file \%s\ before fsync,
+		fname)));
+
+	pg_flush_data(fd, 0, 0);
+
+	close(fd);
+}
+
+/*
+ * walkdir: recursively walk a directory, applying the action to each
+ * regular file and directory (including the named directory itself)
+ * and following symbolic links.
+ */
+void
+walkdir(char *path, void (*action) (char *fname, bool isdir))
+{
+	DIR		   *dir;
+	struct dirent *de;
+
+	dir = AllocateDir(path);
+	while ((de = ReadDir(dir, path)) != NULL)
+	{
+		char		subpath[MAXPGPATH];
+		struct stat fst;
+
+		CHECK_FOR_INTERRUPTS();
+
+		if (strcmp(de-d_name, .) == 0 ||
+			strcmp(de-d_name, ..) == 0)
+			continue;
+
+		snprintf(subpath, MAXPGPATH, %s/%s, path, de-d_name);
+
+		if (lstat(subpath, fst)  0)
+			ereport(ERROR,
+	(errcode_for_file_access(),
+	 errmsg(could not stat file \%s\: %m, subpath)));
+
+		if (S_ISREG(fst.st_mode))
+			(*action) (subpath, false);
+		else if (S_ISDIR(fst.st_mode))
+			walkdir

Re: [HACKERS] initdb -S and tablespaces

2015-04-30 Thread Abhijit Menon-Sen
At 2015-04-30 15:37:44 -0400, robertmh...@gmail.com wrote:

 1. It doesn't do that.  As soon as we fsync the data directory, we
reset the flag.  That's not what ever disabled means to me.

Could you suggest an acceptable alternative wording? I can't immediately
think of anything better than disabled since the last restart. That is
conditional on our resetting the flag, which we will do only if fsync is
enabled at startup. So it's true, but not the whole truth.

 2. I don't know why it's part of this patch.

In 20150115133245.gg5...@awork2.anarazel.de, Andres explained his
rationale as follows:

«What I am thinking of is that, currently, if you start the server
for initial loading with fsync=off, and then restart it, you're open
to data loss. So when the current config file setting is changed
from off to on, we should fsync the data directory. Even if there
was no crash restart.»

That's what I tried to implement.

 Also, it seems awfully unfortunate to me that we're duplicating a
 whole pile of code into xlog.c here.

I have pointed out and discussed the duplication several times. I did it
this way only because we were considering backporting the changes, and
at the time it seemed better to do this and fix the duplication in a
separate patch.

-- Abhijit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] initdb -S and tablespaces

2015-04-30 Thread Abhijit Menon-Sen
At 2015-04-30 16:56:17 -0700, t...@sss.pgh.pa.us wrote:

 As for the notion that this needs to be back-patched, I would say no.

Not even just the fsync after crash part? I could separate that out
from the control file changes and try to eliminate the duplication. I
think that would be worth back-patching, at least.

-- Abhijit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] basebackups during ALTER DATABASE ... SET TABLESPACE ... not safe?

2015-04-27 Thread Abhijit Menon-Sen
At 2015-01-30 21:36:42 +0100, and...@2ndquadrant.com wrote:

  Here's an alternative approach. I think it generally is superior and
  going in the right direction, but I'm not sure it's backpatchable.
  
  It basically consists out of:
  1) Make GetLockConflicts() actually work.
 
 already commited as being a independent problem.
 
  2) Allow the startup process to actually acquire locks other than
 AccessExclusiveLocks. There already is code acquiring other locks,
 but it's currently broken because they're acquired in blocking mode
 which isn't actually supported in the startup mode. Using this
 infrastructure we can actually fix a couple small races around
 database creation/drop.
  3) Allow session locks during recovery to be heavier than
 RowExclusiveLock - since relation/object session locks aren't ever
 taken out by the user (without a plain lock first) that's safe.
 
 merged and submitted independently.
 
  5) Make walsender actually react to recovery conflict interrupts
 
 submitted here. (0003)
 
  4) Perform streaming base backups from within a transaction command, to
 provide error handling.
  6) Prevent access to the template database of a CREATE DATABASE during
 WAL replay.
  7) Add an interlock to prevent base backups and movedb() to run
 concurrently. What we actually came here for.
 
 combined, submitted here. (0004)
 
 I think this actually doesn't look that bad.

Hi Andres. Do you intend to commit these patches? (I just verified that
they apply without trouble to HEAD.)

-- Abhijit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] a fast bloat measurement tool (was Re: Measuring relation free space)

2015-04-23 Thread Abhijit Menon-Sen
At 2015-04-24 07:22:27 +0530, amit.kapil...@gmail.com wrote:

 Few minor issues:
 1.
 warning on windows
 
 \pgstatbloat.c(193): warning C4715: 'pgstatbloat' : not all control paths
 return a value

This is because the function ends with an ereport(ERROR, …). What would
you suggest here? Just stick a PG_RETURN_NULL() at the end?

 2.
 Failed to install pgstattuple--1.3.sql
 
 You have to name sql file as pgstattuple--1.3.sql rather
 than pgstattuple--1.2.sql

How are you applying the patch? It already contains this:

diff --git a/contrib/pgstattuple/pgstattuple--1.2.sql 
b/contrib/pgstattuple/pgstattuple--1.3.sql
similarity index 73%
rename from contrib/pgstattuple/pgstattuple--1.2.sql
rename to contrib/pgstattuple/pgstattuple--1.3.sql
index e5fa2f5..93593ee 100644
--- a/contrib/pgstattuple/pgstattuple--1.2.sql
+++ b/contrib/pgstattuple/pgstattuple--1.3.sql

So it should do the right thing with git apply.

-- Abhijit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] a fast bloat measurement tool (was Re: Measuring relation free space)

2015-04-23 Thread Abhijit Menon-Sen
At 2015-04-24 08:35:40 +0530, amit.kapil...@gmail.com wrote:

  Just stick a PG_RETURN_NULL() at the end?
 
 That should also work.

OK, updated patch attached with just that one change.

I'm not doing anything about the rename. I don't know if it's possible
to make patch(1) rename a file at all (it wasn't, but maybe something
has changed recently?).

Note to friendly neighbourhood committers: the patch is expected to
rename pgstattuple--1.2.sql to pgstattuple--1.3.sql, which it does
if applied using git apply.

-- Abhijit
From f67ddc68ab03fe8237058dcdca1e410ef9e95b5e Mon Sep 17 00:00:00 2001
From: Abhijit Menon-Sen a...@2ndquadrant.com
Date: Fri, 26 Dec 2014 12:37:13 +0530
Subject: Add pgstatbloat to pgstattuple

---
 contrib/pgstattuple/Makefile   |   4 +-
 contrib/pgstattuple/pgstatbloat.c  | 389 +
 contrib/pgstattuple/pgstattuple--1.2--1.3.sql  |  18 +
 .../{pgstattuple--1.2.sql = pgstattuple--1.3.sql} |  18 +-
 contrib/pgstattuple/pgstattuple.control|   2 +-
 doc/src/sgml/pgstattuple.sgml  | 135 +++
 6 files changed, 562 insertions(+), 4 deletions(-)
 create mode 100644 contrib/pgstattuple/pgstatbloat.c
 create mode 100644 contrib/pgstattuple/pgstattuple--1.2--1.3.sql
 rename contrib/pgstattuple/{pgstattuple--1.2.sql = pgstattuple--1.3.sql} (73%)

diff --git a/contrib/pgstattuple/Makefile b/contrib/pgstattuple/Makefile
index 862585c..d7d27a5 100644
--- a/contrib/pgstattuple/Makefile
+++ b/contrib/pgstattuple/Makefile
@@ -1,10 +1,10 @@
 # contrib/pgstattuple/Makefile
 
 MODULE_big	= pgstattuple
-OBJS		= pgstattuple.o pgstatindex.o $(WIN32RES)
+OBJS		= pgstattuple.o pgstatindex.o pgstatbloat.o $(WIN32RES)
 
 EXTENSION = pgstattuple
-DATA = pgstattuple--1.2.sql pgstattuple--1.1--1.2.sql pgstattuple--1.0--1.1.sql pgstattuple--unpackaged--1.0.sql
+DATA = pgstattuple--1.3.sql pgstattuple--1.2--1.3.sql pgstattuple--1.1--1.2.sql pgstattuple--1.0--1.1.sql pgstattuple--unpackaged--1.0.sql
 PGFILEDESC = pgstattuple - tuple-level statistics
 
 REGRESS = pgstattuple
diff --git a/contrib/pgstattuple/pgstatbloat.c b/contrib/pgstattuple/pgstatbloat.c
new file mode 100644
index 000..612e22b
--- /dev/null
+++ b/contrib/pgstattuple/pgstatbloat.c
@@ -0,0 +1,389 @@
+/*
+ * contrib/pgstattuple/pgstatbloat.c
+ *
+ * Abhijit Menon-Sen a...@2ndquadrant.com
+ * Portions Copyright (c) 2001,2002	Tatsuo Ishii (from pgstattuple)
+ *
+ * Permission to use, copy, modify, and distribute this software and
+ * its documentation for any purpose, without fee, and without a
+ * written agreement is hereby granted, provided that the above
+ * copyright notice and this paragraph and the following two
+ * paragraphs appear in all copies.
+ *
+ * IN NO EVENT SHALL THE AUTHOR BE LIABLE TO ANY PARTY FOR DIRECT,
+ * INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
+ * LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
+ * DOCUMENTATION, EVEN IF THE UNIVERSITY OF CALIFORNIA HAS BEEN ADVISED
+ * OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ * THE AUTHOR SPECIFICALLY DISCLAIMS ANY WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE.  THE SOFTWARE PROVIDED HEREUNDER IS ON AN AS
+ * IS BASIS, AND THE AUTHOR HAS NO OBLIGATIONS TO PROVIDE MAINTENANCE,
+ * SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
+ */
+
+#include postgres.h
+
+#include access/visibilitymap.h
+#include access/transam.h
+#include access/xact.h
+#include access/multixact.h
+#include access/htup_details.h
+#include catalog/namespace.h
+#include funcapi.h
+#include miscadmin.h
+#include storage/bufmgr.h
+#include storage/freespace.h
+#include storage/procarray.h
+#include storage/lmgr.h
+#include utils/builtins.h
+#include utils/tqual.h
+#include commands/vacuum.h
+
+PG_FUNCTION_INFO_V1(pgstatbloat);
+
+/*
+ * tuple_percent, dead_tuple_percent and free_percent are computable,
+ * so not defined here.
+ */
+typedef struct pgstatbloat_output_type
+{
+	uint64		table_len;
+	uint64		tuple_count;
+	uint64		misc_count;
+	uint64		tuple_len;
+	uint64		dead_tuple_count;
+	uint64		dead_tuple_len;
+	uint64		free_space;
+	uint64		total_pages;
+	uint64		scanned_pages;
+} pgstatbloat_output_type;
+
+static Datum build_output_type(pgstatbloat_output_type *stat,
+			   FunctionCallInfo fcinfo);
+static Datum pgstatbloat_heap(Relation rel, FunctionCallInfo fcinfo);
+
+/*
+ * build a pgstatbloat_output_type tuple
+ */
+static Datum
+build_output_type(pgstatbloat_output_type *stat, FunctionCallInfo fcinfo)
+{
+#define NCOLUMNS	10
+#define NCHARS		32
+
+	HeapTuple	tuple;
+	char	   *values[NCOLUMNS];
+	char		values_buf[NCOLUMNS][NCHARS];
+	int			i;
+	double		tuple_percent;
+	double		dead_tuple_percent;
+	double		free_percent;	/* free/reusable space in % */
+	double		scanned_percent;
+	TupleDesc	tupdesc;
+	AttInMetadata *attinmeta;
+
+	/* Build a tuple descriptor for our result type

Re: [HACKERS] a fast bloat measurement tool (was Re: Measuring relation free space)

2015-04-22 Thread Abhijit Menon-Sen
At 2015-04-18 12:28:36 +0530, amit.kapil...@gmail.com wrote:

 I think you have missed to address the below point:
 
  4) prefetch

Updated patch attached, as well as the diff against the earlier version
just to make it easier to see the exact change I made (which is to copy
the skip logic from lazy_scan_heap, as you suggested).

I'm not entirely convinced that this change is worthwhile, but it's easy
to make and difficult to argue against, so here it is. (I did test, and
it seems to work OK after the change.)

Amit, Tomas, thanks again for your review comments.

-- Abhijit
From 3edb5426292d6097cb66339b865e99bf4f766646 Mon Sep 17 00:00:00 2001
From: Abhijit Menon-Sen a...@2ndquadrant.com
Date: Fri, 26 Dec 2014 12:37:13 +0530
Subject: Add pgstatbloat to pgstattuple

---
 contrib/pgstattuple/Makefile   |   4 +-
 contrib/pgstattuple/pgstatbloat.c  | 387 +
 contrib/pgstattuple/pgstattuple--1.2--1.3.sql  |  18 +
 .../{pgstattuple--1.2.sql = pgstattuple--1.3.sql} |  18 +-
 contrib/pgstattuple/pgstattuple.control|   2 +-
 doc/src/sgml/pgstattuple.sgml  | 135 +++
 6 files changed, 560 insertions(+), 4 deletions(-)
 create mode 100644 contrib/pgstattuple/pgstatbloat.c
 create mode 100644 contrib/pgstattuple/pgstattuple--1.2--1.3.sql
 rename contrib/pgstattuple/{pgstattuple--1.2.sql = pgstattuple--1.3.sql} (73%)

diff --git a/contrib/pgstattuple/Makefile b/contrib/pgstattuple/Makefile
index 862585c..d7d27a5 100644
--- a/contrib/pgstattuple/Makefile
+++ b/contrib/pgstattuple/Makefile
@@ -1,10 +1,10 @@
 # contrib/pgstattuple/Makefile
 
 MODULE_big	= pgstattuple
-OBJS		= pgstattuple.o pgstatindex.o $(WIN32RES)
+OBJS		= pgstattuple.o pgstatindex.o pgstatbloat.o $(WIN32RES)
 
 EXTENSION = pgstattuple
-DATA = pgstattuple--1.2.sql pgstattuple--1.1--1.2.sql pgstattuple--1.0--1.1.sql pgstattuple--unpackaged--1.0.sql
+DATA = pgstattuple--1.3.sql pgstattuple--1.2--1.3.sql pgstattuple--1.1--1.2.sql pgstattuple--1.0--1.1.sql pgstattuple--unpackaged--1.0.sql
 PGFILEDESC = pgstattuple - tuple-level statistics
 
 REGRESS = pgstattuple
diff --git a/contrib/pgstattuple/pgstatbloat.c b/contrib/pgstattuple/pgstatbloat.c
new file mode 100644
index 000..b19459a
--- /dev/null
+++ b/contrib/pgstattuple/pgstatbloat.c
@@ -0,0 +1,387 @@
+/*
+ * contrib/pgstattuple/pgstatbloat.c
+ *
+ * Abhijit Menon-Sen a...@2ndquadrant.com
+ * Portions Copyright (c) 2001,2002	Tatsuo Ishii (from pgstattuple)
+ *
+ * Permission to use, copy, modify, and distribute this software and
+ * its documentation for any purpose, without fee, and without a
+ * written agreement is hereby granted, provided that the above
+ * copyright notice and this paragraph and the following two
+ * paragraphs appear in all copies.
+ *
+ * IN NO EVENT SHALL THE AUTHOR BE LIABLE TO ANY PARTY FOR DIRECT,
+ * INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
+ * LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
+ * DOCUMENTATION, EVEN IF THE UNIVERSITY OF CALIFORNIA HAS BEEN ADVISED
+ * OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ * THE AUTHOR SPECIFICALLY DISCLAIMS ANY WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE.  THE SOFTWARE PROVIDED HEREUNDER IS ON AN AS
+ * IS BASIS, AND THE AUTHOR HAS NO OBLIGATIONS TO PROVIDE MAINTENANCE,
+ * SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
+ */
+
+#include postgres.h
+
+#include access/visibilitymap.h
+#include access/transam.h
+#include access/xact.h
+#include access/multixact.h
+#include access/htup_details.h
+#include catalog/namespace.h
+#include funcapi.h
+#include miscadmin.h
+#include storage/bufmgr.h
+#include storage/freespace.h
+#include storage/procarray.h
+#include storage/lmgr.h
+#include utils/builtins.h
+#include utils/tqual.h
+#include commands/vacuum.h
+
+PG_FUNCTION_INFO_V1(pgstatbloat);
+
+/*
+ * tuple_percent, dead_tuple_percent and free_percent are computable,
+ * so not defined here.
+ */
+typedef struct pgstatbloat_output_type
+{
+	uint64		table_len;
+	uint64		tuple_count;
+	uint64		misc_count;
+	uint64		tuple_len;
+	uint64		dead_tuple_count;
+	uint64		dead_tuple_len;
+	uint64		free_space;
+	uint64		total_pages;
+	uint64		scanned_pages;
+} pgstatbloat_output_type;
+
+static Datum build_output_type(pgstatbloat_output_type *stat,
+			   FunctionCallInfo fcinfo);
+static Datum pgstatbloat_heap(Relation rel, FunctionCallInfo fcinfo);
+
+/*
+ * build a pgstatbloat_output_type tuple
+ */
+static Datum
+build_output_type(pgstatbloat_output_type *stat, FunctionCallInfo fcinfo)
+{
+#define NCOLUMNS	10
+#define NCHARS		32
+
+	HeapTuple	tuple;
+	char	   *values[NCOLUMNS];
+	char		values_buf[NCOLUMNS][NCHARS];
+	int			i;
+	double		tuple_percent;
+	double		dead_tuple_percent;
+	double		free_percent;	/* free/reusable space in % */
+	double		scanned_percent;
+	TupleDesc	tupdesc;
+	AttInMetadata *attinmeta;
+
+	/* Build

Re: [HACKERS] initdb -S and tablespaces

2015-04-16 Thread Abhijit Menon-Sen
Hi.

Here's a variation of the earlier patch that follows all links in
PGDATA. Does this look more like what you had in mind?

-- Abhijit
From d86888b0d2f5a3a57027d26ce050a3bbb58670d3 Mon Sep 17 00:00:00 2001
From: Abhijit Menon-Sen a...@2ndquadrant.com
Date: Thu, 6 Nov 2014 00:45:56 +0530
Subject: Recursively fsync PGDATA at startup if needed

It's needed if we need to perform crash recovery or if fsync was
disabled at some point while the server was running earlier (and
we must store that information in the control file).

This is so that we don't lose older unflushed writes in the event of
a power failure after crash recovery, where more recent writes are
preserved.

See 20140918083148.ga17...@alap3.anarazel.de for details.
---
 src/backend/access/transam/xlog.c   | 168 
 src/bin/pg_controldata/pg_controldata.c |   2 +
 src/include/catalog/pg_control.h|   8 +-
 3 files changed, 177 insertions(+), 1 deletion(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 2580996..263d2d0 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -835,6 +835,8 @@ static void WALInsertLockAcquireExclusive(void);
 static void WALInsertLockRelease(void);
 static void WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt);
 
+static void fsync_pgdata(char *datadir);
+
 /*
  * Insert an XLOG record represented by an already-constructed chain of data
  * chunks.  This is a low-level routine; to construct the WAL record header
@@ -4811,6 +4813,7 @@ BootStrapXLOG(void)
 	ControlFile-checkPoint = checkPoint.redo;
 	ControlFile-checkPointCopy = checkPoint;
 	ControlFile-unloggedLSN = 1;
+	ControlFile-fsync_disabled = false;
 
 	/* Set important parameter values for use when replaying WAL */
 	ControlFile-MaxConnections = MaxConnections;
@@ -5868,6 +5871,27 @@ StartupXLOG(void)
 		ereport(FATAL,
 (errmsg(control file contains invalid data)));
 
+	/*
+	 * If we need to perform crash recovery, we issue an fsync on the
+	 * data directory and its contents to try to ensure that any data
+	 * written before the crash are flushed to disk. Otherwise a power
+	 * failure in the near future might cause earlier unflushed writes
+	 * to be lost, even though more recent data written to disk from
+	 * here on would be persisted.
+	 *
+	 * We also do this if the control file indicates that fsync was
+	 * disabled at some point while the server was running earlier.
+	 */
+
+	if (enableFsync 
+		(ControlFile-fsync_disabled ||
+		 (ControlFile-state != DB_SHUTDOWNED 
+		  ControlFile-state != DB_SHUTDOWNED_IN_RECOVERY)))
+	{
+		fsync_pgdata(data_directory);
+		ControlFile-fsync_disabled = false;
+	}
+
 	if (ControlFile-state == DB_SHUTDOWNED)
 	{
 		/* This is the expected case, so don't be chatty in standalone mode */
@@ -8236,6 +8260,8 @@ CreateCheckPoint(int flags)
 	/* crash recovery should always recover to the end of WAL */
 	ControlFile-minRecoveryPoint = InvalidXLogRecPtr;
 	ControlFile-minRecoveryPointTLI = 0;
+	if (!enableFsync)
+		ControlFile-fsync_disabled = true;
 
 	/*
 	 * Persist unloggedLSN value. It's reset on crash recovery, so this goes
@@ -11107,3 +11133,145 @@ SetWalWriterSleeping(bool sleeping)
 	XLogCtl-WalWriterSleeping = sleeping;
 	SpinLockRelease(XLogCtl-info_lck);
 }
+
+/*
+ * Hint to the OS that it should get ready to fsync() this file.
+ *
+ * Adapted from pre_sync_fname in initdb.c
+ */
+static void
+pre_sync_fname(char *fname, bool isdir)
+{
+	int			fd;
+
+	fd = open(fname, O_RDONLY | PG_BINARY);
+
+	/*
+	 * Some OSs don't allow us to open directories at all (Windows returns
+	 * EACCES)
+	 */
+	if (fd  0  isdir  (errno == EISDIR || errno == EACCES))
+		return;
+
+	if (fd  0)
+		ereport(FATAL,
+(errmsg(could not open file \%s\ before fsync,
+		fname)));
+
+	pg_flush_data(fd, 0, 0);
+
+	close(fd);
+}
+
+/*
+ * walkdir: recursively walk a directory, applying the action to each
+ * regular file and directory (including the named directory itself)
+ * and following symbolic links.
+ */
+static void
+walkdir(char *path, void (*action) (char *fname, bool isdir))
+{
+	DIR		   *dir;
+	struct dirent *de;
+
+	dir = AllocateDir(path);
+	while ((de = ReadDir(dir, path)) != NULL)
+	{
+		char		subpath[MAXPGPATH];
+		struct stat fst;
+
+		CHECK_FOR_INTERRUPTS();
+
+		if (strcmp(de-d_name, .) == 0 ||
+			strcmp(de-d_name, ..) == 0)
+			continue;
+
+		snprintf(subpath, MAXPGPATH, %s/%s, path, de-d_name);
+
+		if (lstat(subpath, fst)  0)
+			ereport(ERROR,
+	(errcode_for_file_access(),
+	 errmsg(could not stat file \%s\: %m, subpath)));
+
+		if (S_ISREG(fst.st_mode))
+			(*action) (subpath, false);
+		else if (S_ISDIR(fst.st_mode))
+			walkdir(subpath, action);
+#ifndef WIN32
+		else if (S_ISLNK(fst.st_mode))
+#else
+		else if (pg_win32_is_junction(subpath))
+#endif
+		{
+#if defined(HAVE_READLINK) || defined(WIN32)
+			char		linkpath[MAXPGPATH];
+			int			len;
+			struct stat lst;
+
+			len

Re: [HACKERS] initdb -S and tablespaces

2015-04-15 Thread Abhijit Menon-Sen
At 2015-04-06 10:16:36 -0300, alvhe...@2ndquadrant.com wrote:

 Well, we have many things that can be set as symlinks; some you can
 have initdb create the links for you (initdb --xlogdir), others you
 can move manually.

Looking at sendDir() in backend/replication/basebackup.c, I notice that
the only places where it expects/allows symlinks are for pg_xlog and in
pg_tblspc.

Still, thanks to the example in basebackup.c, I've got most of a patch
that should follow any symlinks in PGDATA. I just need to test a little
more before I post it.

-- Abhijit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] initdb -S and tablespaces

2015-04-15 Thread Abhijit Menon-Sen
At 2015-04-15 11:40:34 +0530, a...@2ndquadrant.com wrote:

 Still, thanks to the example in basebackup.c, I've got most of a patch
 that should follow any symlinks in PGDATA.

I notice that src/bin/pg_rewind/copy_fetch.c has a traverse_datadir()
function that does what we want (but it recurses into symlinks only
inside pg_tblspc).

-- Abhijit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] initdb -S and tablespaces

2015-04-06 Thread Abhijit Menon-Sen
Hi Álvaro.

Thanks for taking a look at the patch.

At 2015-04-03 13:32:32 -0300, alvhe...@2ndquadrant.com wrote:

 But then, since the name is already telling us that it's all about
 PGDATA, maybe we can remove the recursively part of the name?

Sure, that makes sense too. Since you and Andres both like
«fsync_pgdata(data_directory)», I'll change it accordingly.

 I also noticed that walkdir() thinks it is completely agnostic on
 what the passed action is; except that the comment at the bottom talks
 about how fsync on directories is important, which seems out of place.

Yes. The function behaves as documented, but the comment is clearly too
specific. I'm not sure where to put it. I could make walkdir() NOT do
it, and instead do it in the caller with the same comment. Thoughts?

 Hmm ... Actually, since surely we must follow symlinks everywhere,
 why do we have to do this separately for pg_tblspc? Shouldn't that
 link-following occur automatically when walking PGDATA in the first
 place?

I'm not sure about that (and that's why I've not attached an updated
patch here). The original idea was to follow only those links that we
expect to be in PGDATA.

I suppose it would be easier in terms of the code to follow all links,
but I don't know if it's the right thing. If that's what you think we
should do, I can post a simplified patch.

 Maybe fsync_recursively should Assert() that it's only being called
 with enableFsync=on; or perhaps we can have it return early if it's
 unset.

I prefer the latter. Will change.

-- Abhijit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] a fast bloat measurement tool (was Re: Measuring relation free space)

2015-04-03 Thread Abhijit Menon-Sen
At 2015-03-31 22:43:49 +0530, a...@2ndquadrant.com wrote:

 I'm just posting this WIP patch where I've renamed fastbloat to
 pgstatbloat as suggested by Tomas, and added in the documentation, and
 so on.

Here's the revised version that also adds the count of RECENTLY_DEAD and
INSERT/DELETE_IN_PROGRESS tuples to the call to vac_estimate_reltuples.

Tomas, I agree that the build_output_tuple function could use cleaning
up, but it's the same as the corresponding code in pgstattuple, and it
seems to me reasonable to clean both up together in a separate patch.

-- Abhijit
From d11b84627438fca12955dfad77042be0a27c9acc Mon Sep 17 00:00:00 2001
From: Abhijit Menon-Sen a...@2ndquadrant.com
Date: Fri, 26 Dec 2014 12:37:13 +0530
Subject: Add pgstatbloat to pgstattuple

---
 contrib/pgstattuple/Makefile   |   4 +-
 contrib/pgstattuple/pgstatbloat.c  | 340 +
 contrib/pgstattuple/pgstattuple--1.2--1.3.sql  |  18 ++
 .../{pgstattuple--1.2.sql = pgstattuple--1.3.sql} |  18 +-
 contrib/pgstattuple/pgstattuple.control|   2 +-
 doc/src/sgml/pgstattuple.sgml  | 135 
 6 files changed, 513 insertions(+), 4 deletions(-)
 create mode 100644 contrib/pgstattuple/pgstatbloat.c
 create mode 100644 contrib/pgstattuple/pgstattuple--1.2--1.3.sql
 rename contrib/pgstattuple/{pgstattuple--1.2.sql = pgstattuple--1.3.sql} (73%)

diff --git a/contrib/pgstattuple/Makefile b/contrib/pgstattuple/Makefile
index 862585c..d7d27a5 100644
--- a/contrib/pgstattuple/Makefile
+++ b/contrib/pgstattuple/Makefile
@@ -1,10 +1,10 @@
 # contrib/pgstattuple/Makefile
 
 MODULE_big	= pgstattuple
-OBJS		= pgstattuple.o pgstatindex.o $(WIN32RES)
+OBJS		= pgstattuple.o pgstatindex.o pgstatbloat.o $(WIN32RES)
 
 EXTENSION = pgstattuple
-DATA = pgstattuple--1.2.sql pgstattuple--1.1--1.2.sql pgstattuple--1.0--1.1.sql pgstattuple--unpackaged--1.0.sql
+DATA = pgstattuple--1.3.sql pgstattuple--1.2--1.3.sql pgstattuple--1.1--1.2.sql pgstattuple--1.0--1.1.sql pgstattuple--unpackaged--1.0.sql
 PGFILEDESC = pgstattuple - tuple-level statistics
 
 REGRESS = pgstattuple
diff --git a/contrib/pgstattuple/pgstatbloat.c b/contrib/pgstattuple/pgstatbloat.c
new file mode 100644
index 000..2dd1900
--- /dev/null
+++ b/contrib/pgstattuple/pgstatbloat.c
@@ -0,0 +1,340 @@
+/*
+ * contrib/pgstattuple/pgstatbloat.c
+ *
+ * Abhijit Menon-Sen a...@2ndquadrant.com
+ * Portions Copyright (c) 2001,2002	Tatsuo Ishii (from pgstattuple)
+ *
+ * Permission to use, copy, modify, and distribute this software and
+ * its documentation for any purpose, without fee, and without a
+ * written agreement is hereby granted, provided that the above
+ * copyright notice and this paragraph and the following two
+ * paragraphs appear in all copies.
+ *
+ * IN NO EVENT SHALL THE AUTHOR BE LIABLE TO ANY PARTY FOR DIRECT,
+ * INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
+ * LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
+ * DOCUMENTATION, EVEN IF THE UNIVERSITY OF CALIFORNIA HAS BEEN ADVISED
+ * OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ * THE AUTHOR SPECIFICALLY DISCLAIMS ANY WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE.  THE SOFTWARE PROVIDED HEREUNDER IS ON AN AS
+ * IS BASIS, AND THE AUTHOR HAS NO OBLIGATIONS TO PROVIDE MAINTENANCE,
+ * SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
+ */
+
+#include postgres.h
+
+#include access/visibilitymap.h
+#include access/transam.h
+#include access/xact.h
+#include access/multixact.h
+#include access/htup_details.h
+#include catalog/namespace.h
+#include funcapi.h
+#include miscadmin.h
+#include storage/bufmgr.h
+#include storage/freespace.h
+#include storage/procarray.h
+#include storage/lmgr.h
+#include utils/builtins.h
+#include utils/tqual.h
+#include commands/vacuum.h
+
+PG_FUNCTION_INFO_V1(pgstatbloat);
+
+/*
+ * tuple_percent, dead_tuple_percent and free_percent are computable,
+ * so not defined here.
+ */
+typedef struct pgstatbloat_output_type
+{
+	uint64		table_len;
+	uint64		tuple_count;
+	uint64		misc_count;
+	uint64		tuple_len;
+	uint64		dead_tuple_count;
+	uint64		dead_tuple_len;
+	uint64		free_space;
+	uint64		total_pages;
+	uint64		scanned_pages;
+} pgstatbloat_output_type;
+
+static Datum build_output_type(pgstatbloat_output_type *stat,
+			   FunctionCallInfo fcinfo);
+static Datum pgstatbloat_heap(Relation rel, FunctionCallInfo fcinfo);
+
+/*
+ * build a pgstatbloat_output_type tuple
+ */
+static Datum
+build_output_type(pgstatbloat_output_type *stat, FunctionCallInfo fcinfo)
+{
+#define NCOLUMNS	10
+#define NCHARS		32
+
+	HeapTuple	tuple;
+	char	   *values[NCOLUMNS];
+	char		values_buf[NCOLUMNS][NCHARS];
+	int			i;
+	double		tuple_percent;
+	double		dead_tuple_percent;
+	double		free_percent;	/* free/reusable space in % */
+	double		scanned_percent;
+	TupleDesc	tupdesc;
+	AttInMetadata *attinmeta;
+
+	/* Build a tuple

Re: [HACKERS] What exactly is our CRC algorithm?

2015-04-02 Thread Abhijit Menon-Sen
At 2015-03-25 19:18:51 +0200, hlinn...@iki.fi wrote:

 I think we'll need a version check there. […]
 
 You want to write that or should I?

I'm not familiar with MSVC at all, so it would be nice if you did it.

 How do you like this latest version of the patch otherwise?

I'm sorry, but I'm still not especially fond of it.

Apart from removing the startup check so that client programs can also
use the best available CRC32C implementation without jumping through
hoops, I don't feel that the other changes buy us very much.

Also, assuming that the point is that people who don't care about CRCs
deeply should nevertheless be able to produce special-purpose binaries
with only the optimal implementation included, we should probably have
some instructions about how to do that.

Thinking about what you were trying to do, I would find an arrangement
roughly like the following to be clearer to follow in terms of adding
new implementations and so on:

#if defined(USE_CRC_SSE42) || …can build SSE4.2 CRC code…
#define HAVE_CRC_SSE42 1
pg_crc32c_sse42() { … }
bool sse42_crc32c_available() { … }
pg_comp_crc32c = pg_crc32c_sse42;
#elif defined(USE_CRC_ARM) || …can build ARM CRC code…
#define HAVE_CRC_ARM 1
pg_crc32c_arm() { … }
bool arm_crc32c_available() { … }
pg_comp_crc32c = pg_crc32c_arm;
#endif

#define CRC_SELECTION 1
#if defined(USE_CRC_SSE42) || defined(USE_CRC_ARM) || …
#undef CRC_SELECTION
#endif

#ifdef CRC_SELECTION
pg_crc32c_sb8() { … }

pg_comp_crc32c_choose(…)
{
pg_comp_crc32c = pg_crc32c_sb8;

#ifdef HAVE_CRC_SSE42
if (sse42_crc32c_available())
pg_comp_crc32c = pg_crc32c_sse42;
#elif …
…
#endif

return pg_comp_crc32c(…);
}

pg_comp_crc32c = pg_crc32c_choose;
#endif

Then someone who wants to force the building of (only) the SSE4.2
implementation can build with -DUSE_CRC_SSE42. And if you turn on
USE_CRC_ARM when you can't build ARM code, it won't build. (The
HAVE_CRC_xxx #defines could also move to configure tests.)

If you don't specify any USE_CRC_xxx explicitly, then it'll build
whichever (probably) one it can, and select it at runtime if it's
available.

All that said, I do recognise that there are all relatively cosmetic
concerns, and I don't object seriously to any of it. On the contrary,
thanks for taking the time to review and work on the patch. Nobody
else has expressed an opinion, so I'll leave it to you to decide whether
to commit as-is, or if you want me to pursue the above approach instead.

In the realm of very minor nitpicking, here are a couple of points I
noticed in crc_instructions.h:

1. I think «if (pg_crc32_instructions_runtime_check())» would read
   better if the function were named crc32c_instructions_available or
   pg_crc32c_is_hw_accelerated, or something like that.

2. It documents pg_accumulate_crc32c_byte and
   pg_accumulate_crc32c_uint64, but actually defines pg_asm_crc32b and
   pg_asm_crc32q. If you update the code rather than the documentation,
   _update_ may be slightly preferable to _accumulate_, and maybe the
   suffixes should be _uint8 and _uint64.

3. The descriptions (e.g. Add one byte to the current crc value.)
   should also probably read Update the current crc value for the given
   byte/eight bytes of data.

Thanks.

-- Abhijit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] What exactly is our CRC algorithm?

2015-04-02 Thread Abhijit Menon-Sen
At 2015-04-03 00:33:10 +0300, hlinn...@iki.fi wrote:

 I came up with the attached.

I like it very much.

src/port/Makefile has (note src/srv):

+# pg_crc32c_sse42.o and its _src.o version need CFLAGS_SSE42
+pg_crc32c_sse42.o: CFLAGS+=$(CFLAGS_SSE42)
+pg_crc32c_sse42_srv.o: CFLAGS+=$(CFLAGS_SSE42)

Other than that, this looks great. Thank you.

 BTW, we might want to move the traditional and legacy crc32
 implementations out of src/common. They are only used in backend
 code now.

I agree.

-- Abhijit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] What exactly is our CRC algorithm?

2015-04-02 Thread Abhijit Menon-Sen
At 2015-04-02 17:58:23 +0300, hlinn...@iki.fi wrote:

 We're only using inline assembly to force producing SSE 4.2 code, even
 when -msse4.2 is not used. That feels wrong.

Why? It feels OK to me (and to Andres, per earlier discussions about
exactly this topic). Doing it this way allows the binary to run on a
non-SSE4.2 platform (and not use the CRC instructions).

Also, -msse4.2 was added to the compiler later than support for the
instructions was added to the assembler.

 We have a buildfarm animal that still uses gcc 2.95.3, which was
 released in 2001. I don't have a compiler of that vintage to test
 with, but I assume an old enough assembler would not know about the
 crc32q instruction and fail to compile.

GCC from 2002 wouldn't support the symbolic operand names in inline
assembly. binutils from 2007 (IIRC) wouldn't support the assembler
instructions themselves.

We could work around the latter by using the appropriate sequence of
bytes. We could work around the former by using the old syntax for
operands.

 I believe the GCC way to do this would be to put the SSE4.2-specific
 code into a separate source file, and compile that file with
 -msse4.2. And when you compile with -msse4.2, gcc actually also
 supports the _mm_crc32_u8/u64 intrinsics.

I have no objection to this.

Building only that file with -msse4.2 would resolve the problem of the
output binary requiring SSE4.2; and the only compilers to be excluded
are old enough to be uninteresting at least to me personally.

Have you done/are you doing this already, or do you want me to? I could
use advice on how to add build flags to only one file, since I don't
know of any precendent for that.

-- Abhijit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] a fast bloat measurement tool (was Re: Measuring relation free space)

2015-03-31 Thread Abhijit Menon-Sen
Hi.

I'm just posting this WIP patch where I've renamed fastbloat to
pgstatbloat as suggested by Tomas, and added in the documentation, and
so on. I still have to incorporate Amit's comments about the estimation
of reltuples according to the way vacuum does it, and I expect to post
that tomorrow (I just need to test a little more).

In the meantime, if anyone else was having trouble installing the
extension due to the incorrect version in the control file, this is the
patch you should be using.

-- Abhijit
From f809e070e8ea13b74c6206ca67a7eaf2a32e60fa Mon Sep 17 00:00:00 2001
From: Abhijit Menon-Sen a...@2ndquadrant.com
Date: Fri, 26 Dec 2014 12:37:13 +0530
Subject: Add pgstatbloat to pgstattuple

---
 contrib/pgstattuple/Makefile   |   4 +-
 contrib/pgstattuple/pgstatbloat.c  | 346 +
 contrib/pgstattuple/pgstattuple--1.2--1.3.sql  |  18 ++
 .../{pgstattuple--1.2.sql = pgstattuple--1.3.sql} |  18 +-
 contrib/pgstattuple/pgstattuple.control|   2 +-
 doc/src/sgml/pgstattuple.sgml  | 135 
 6 files changed, 519 insertions(+), 4 deletions(-)
 create mode 100644 contrib/pgstattuple/pgstatbloat.c
 create mode 100644 contrib/pgstattuple/pgstattuple--1.2--1.3.sql
 rename contrib/pgstattuple/{pgstattuple--1.2.sql = pgstattuple--1.3.sql} (73%)

diff --git a/contrib/pgstattuple/Makefile b/contrib/pgstattuple/Makefile
index 862585c..d7d27a5 100644
--- a/contrib/pgstattuple/Makefile
+++ b/contrib/pgstattuple/Makefile
@@ -1,10 +1,10 @@
 # contrib/pgstattuple/Makefile
 
 MODULE_big	= pgstattuple
-OBJS		= pgstattuple.o pgstatindex.o $(WIN32RES)
+OBJS		= pgstattuple.o pgstatindex.o pgstatbloat.o $(WIN32RES)
 
 EXTENSION = pgstattuple
-DATA = pgstattuple--1.2.sql pgstattuple--1.1--1.2.sql pgstattuple--1.0--1.1.sql pgstattuple--unpackaged--1.0.sql
+DATA = pgstattuple--1.3.sql pgstattuple--1.2--1.3.sql pgstattuple--1.1--1.2.sql pgstattuple--1.0--1.1.sql pgstattuple--unpackaged--1.0.sql
 PGFILEDESC = pgstattuple - tuple-level statistics
 
 REGRESS = pgstattuple
diff --git a/contrib/pgstattuple/pgstatbloat.c b/contrib/pgstattuple/pgstatbloat.c
new file mode 100644
index 000..15c2cb9
--- /dev/null
+++ b/contrib/pgstattuple/pgstatbloat.c
@@ -0,0 +1,346 @@
+/*
+ * contrib/pgstattuple/pgstatbloat.c
+ *
+ * Abhijit Menon-Sen a...@2ndquadrant.com
+ * Portions Copyright (c) 2001,2002	Tatsuo Ishii (from pg_stattuple)
+ *
+ * Permission to use, copy, modify, and distribute this software and
+ * its documentation for any purpose, without fee, and without a
+ * written agreement is hereby granted, provided that the above
+ * copyright notice and this paragraph and the following two
+ * paragraphs appear in all copies.
+ *
+ * IN NO EVENT SHALL THE AUTHOR BE LIABLE TO ANY PARTY FOR DIRECT,
+ * INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
+ * LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
+ * DOCUMENTATION, EVEN IF THE UNIVERSITY OF CALIFORNIA HAS BEEN ADVISED
+ * OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ * THE AUTHOR SPECIFICALLY DISCLAIMS ANY WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE.  THE SOFTWARE PROVIDED HEREUNDER IS ON AN AS
+ * IS BASIS, AND THE AUTHOR HAS NO OBLIGATIONS TO PROVIDE MAINTENANCE,
+ * SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
+ */
+
+#include postgres.h
+
+#include access/visibilitymap.h
+#include access/transam.h
+#include access/xact.h
+#include access/multixact.h
+#include access/htup_details.h
+#include catalog/namespace.h
+#include funcapi.h
+#include miscadmin.h
+#include storage/bufmgr.h
+#include storage/freespace.h
+#include storage/procarray.h
+#include storage/lmgr.h
+#include utils/builtins.h
+#include utils/tqual.h
+#include commands/vacuum.h
+
+PG_FUNCTION_INFO_V1(pgstatbloat);
+
+/*
+ * tuple_percent, dead_tuple_percent and free_percent are computable,
+ * so not defined here.
+ */
+typedef struct pgstatbloat_output_type
+{
+	uint64		table_len;
+	uint64		tuple_count;
+	uint64		tuple_len;
+	uint64		dead_tuple_count;
+	uint64		dead_tuple_len;
+	uint64		free_space;
+	uint64		total_pages;
+	uint64		scanned_pages;
+} pgstatbloat_output_type;
+
+static Datum build_output_type(pgstatbloat_output_type *stat,
+			   FunctionCallInfo fcinfo);
+static Datum fbstat_relation(Relation rel, FunctionCallInfo fcinfo);
+static Datum fbstat_heap(Relation rel, FunctionCallInfo fcinfo);
+
+/*
+ * build a pgstatbloat_output_type tuple
+ */
+static Datum
+build_output_type(pgstatbloat_output_type *stat, FunctionCallInfo fcinfo)
+{
+#define NCOLUMNS	10
+#define NCHARS		32
+
+	HeapTuple	tuple;
+	char	   *values[NCOLUMNS];
+	char		values_buf[NCOLUMNS][NCHARS];
+	int			i;
+	double		tuple_percent;
+	double		dead_tuple_percent;
+	double		free_percent;	/* free/reusable space in % */
+	double		scanned_percent;
+	TupleDesc	tupdesc;
+	AttInMetadata *attinmeta;
+
+	/* Build a tuple descriptor for our

Re: [HACKERS] Auditing extension for PostgreSQL (Take 2)

2015-03-22 Thread Abhijit Menon-Sen
At 2015-02-24 11:22:41 -0500, da...@pgmasters.net wrote:

 Patch v3 is attached.

 […]

 +/* Log class enum used to represent bits in auditLogBitmap */
 +enum LogClass
 +{
 + LOG_NONE = 0,
 +
 + /* SELECT */
 + LOG_READ = (1  0),
 +
 + /* INSERT, UPDATE, DELETE, TRUNCATE */
 + LOG_WRITE = (1  1),
 +
 + /* DDL: CREATE/DROP/ALTER */
 + LOG_DDL = (1  2),
 +
 + /* Function execution */
 + LOG_FUNCTION = (1  4),
 +
 + /* Function execution */
 + LOG_MISC = (1  5),
 +
 + /* Absolutely everything */
 + LOG_ALL = ~(uint64)0
 +};

The comment above LOG_MISC should be changed.

More fundamentally, this classification makes it easy to reuse LOGSTMT_*
(and a nice job you've done of that, with just a few additional special
cases), I don't think this level is quite enough for our needs. I think
it should at least be possible to specifically log commands that affect
privileges and roles.

I'm fond of finer categorisation for DDL as well, but I could live with
all DDL being lumped together.

I'm experimenting with a few approaches to do this without reintroducing
switch statements to test every command. That will require core changes,
but I think we can find an acceptable arrangement. I'll post a proof of
concept in a few days.

 + * Takes an AuditEvent and, if it log_check(), writes it to the audit
 log.

I don't think log_check is the most useful name, because this sentence
doesn't tell me what the function may do. Similarly, I would prefer to
have log_acl_check be renamed acl_grants_audit or similar. (These are
all static functions anyway, I don't think a log_ prefix is needed.)

 + /* Free the column set */
 + bms_free(tmpSet);

(An aside, really: there are lots of comments like this, which I don't
think add anything to understanding the code, and should be removed.)

 + /*
 +  * We don't have access to the parsetree here, so we have to 
 generate
 +  * the node type, object type, and command tag by decoding
 +  * rte-requiredPerms and rte-relkind.
 +  */
 + auditEvent.logStmtLevel = LOGSTMT_MOD;

(I am also trying to find a way to avoid having to do this.)

 + /* Set object type based on relkind */
 + switch (class-relkind)
 + {
 + case RELKIND_RELATION:
 + utilityAuditEvent.objectType = 
 OBJECT_TYPE_TABLE;
 + break;

This occurs elsewhere too. But I suppose new relkinds are added less
frequently than new commands.

Again on a larger level, I'm not sure how I feel about _avoiding_ the
use of event triggers for audit logging. Regardless of whether we use
the deparse code (which I personally think is a good idea; Álvaro has
been working on it, and it looks very nice) to log extra information,
using the object access hook inevitably means we have to reimplement
the identification/classification code here.

In old pgaudit, I think that extra effort is justified by the need to
be backwards compatible with pre-event trigger releases. In a 9.5-only
version, I am not at all convinced that this makes sense.

Thoughts?

-- Abhijit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] MD5 authentication needs help -SCRAM

2015-03-18 Thread Abhijit Menon-Sen
At 2015-03-14 09:44:02 +0200, hlinn...@iki.fi wrote:

 Perhaps it would be time to restart the discussion on standardizing
 SRP as a SASL mechanism in IETF.

I haven't seen much evidence that there's any interest in doing this; in
fact, I can't remember the author of the draft you pointed to being very
active in the discussions either.

 Assume that the connection is not encrypted, and Eve captures the
 SCRAM handshake between Alice and Bob. Using the captured handshake,
 she can try to guess the password, offline. With a PAKE protocol, she
 cannot do that.

OK. I agree that this is a nice property. SCRAM made the design decision
to hinder such attacks by using PBKDF2 rather than a zero-knowledge key
exchange mechanism as SRP does. This was partly due to the trend that I
mentioned of wanting to require TLS everywhere.

I'm obviously biased in this matter, but I think it's acceptable for the
potential attack to be frustrated by the use of PBKDF2 and defeated by
the use of TLS (which is already possible with Postgres); and that in
the balance, SCRAM is easier to implement securely than SRP.

Of course, if you want to use x as your password everywhere, then SRP
is preferable. ;-)

-- Abhijit

P.S. I don't know why the SRP code was removed from LibreSSL; nor am I
sure how seriously to take that. It's possible that it's only because
it's (still) rather obscure.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] MD5 authentication needs help -SCRAM

2015-03-18 Thread Abhijit Menon-Sen
As a followup, I spoke to an IETF friend who's used and implemented both
SRP and SCRAM. He agrees that SRP is cryptographically solid, that it's
significantly more difficult to implement (and therefore has a bit of a
monoculture risk overall, though of course that wouldn't apply to us if
we were to write the code from scratch).

Apparently the patent status is still not entirely clear. Two of the
patents expired, but there are others that may be relevant. Stanford
claims a patent, but apparently grant a free license if you do meet
certain conditions. But he doesn't know of anyone having to go to
court over the use of SRP.

-- Abhijit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] initdb -S and tablespaces

2015-03-10 Thread Abhijit Menon-Sen
At 2015-01-15 14:32:45 +0100, and...@2ndquadrant.com wrote:

 What I am thinking of is that, currently, if you start the server for
 initial loading with fsync=off, and then restart it, you're open to
 data loss. So when the current config file setting is changed from off
 to on, we should fsync the data directory. Even if there was no crash
 restart.

Patch attached.

Changes:

1. Renamed perform_fsync to fsync_recursively (otherwise it would read
   fsync_pgdata(pg_data))
2. Added ControlData-fsync_disabled to record whether fsync was ever
   disabled while the server was running (tested in CreateCheckPoint)
3. Run fsync_recursively at startup only if fsync is enabled
4. Run it if we're doing crash recovery, or fsync was disabled
5. Use pg_flush_data in pre_sync_fname
6. Issue fsync on directories too
7. Tested that it works if pg_xlog is a symlink (no changes).

(In short, everything you mentioned in your earlier mail.)

Note that I set ControlData-fsync_disabled to false in BootstrapXLOG,
but it gets set to true during a later CreateCheckPoint(). This means
we run fsync again at startup after initdb. I'm not sure what to do
about that.

Is this about what you had in mind?

-- Abhijit
From bb2b5f130525dd44a10eab6829b9802b8f6f7eed Mon Sep 17 00:00:00 2001
From: Abhijit Menon-Sen a...@2ndquadrant.com
Date: Thu, 6 Nov 2014 00:45:56 +0530
Subject: Recursively fsync PGDATA at startup if needed

It's needed if we need to perform crash recovery or if fsync was
disabled at some point while the server was running earlier (and
we must store that information in the control file).

This is so that we don't lose older unflushed writes in the event of
a power failure after crash recovery, where more recent writes are
preserved.

See 20140918083148.ga17...@alap3.anarazel.de for details.
---
 src/backend/access/transam/xlog.c   | 171 
 src/bin/pg_controldata/pg_controldata.c |   2 +
 src/include/catalog/pg_control.h|   8 +-
 3 files changed, 180 insertions(+), 1 deletion(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 71cbe0e..75a6862 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -833,6 +833,12 @@ static void WALInsertLockAcquireExclusive(void);
 static void WALInsertLockRelease(void);
 static void WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt);
 
+static void pre_sync_fname(char *fname, bool isdir);
+static void walkdir(char *path, void (*action) (char *fname, bool isdir));
+static void walktblspc_links(char *path,
+			 void (*action) (char *fname, bool isdir));
+static void fsync_recursively(char *pg_data);
+
 /*
  * Insert an XLOG record represented by an already-constructed chain of data
  * chunks.  This is a low-level routine; to construct the WAL record header
@@ -4721,6 +4727,7 @@ BootStrapXLOG(void)
 	ControlFile-checkPoint = checkPoint.redo;
 	ControlFile-checkPointCopy = checkPoint;
 	ControlFile-unloggedLSN = 1;
+	ControlFile-fsync_disabled = false;
 
 	/* Set important parameter values for use when replaying WAL */
 	ControlFile-MaxConnections = MaxConnections;
@@ -5787,6 +5794,27 @@ StartupXLOG(void)
 		ereport(FATAL,
 (errmsg(control file contains invalid data)));
 
+	/*
+	 * If we need to perform crash recovery, we issue an fsync on the
+	 * data directory and its contents to try to ensure that any data
+	 * written before the crash are flushed to disk. Otherwise a power
+	 * failure in the near future might cause earlier unflushed writes
+	 * to be lost, even though more recent data written to disk from
+	 * here on would be persisted.
+	 *
+	 * We also do this if the control file indicates that fsync was
+	 * disabled at some point while the server was running earlier.
+	 */
+
+	if (enableFsync 
+		(ControlFile-fsync_disabled ||
+		 (ControlFile-state != DB_SHUTDOWNED 
+		  ControlFile-state != DB_SHUTDOWNED_IN_RECOVERY)))
+	{
+		fsync_recursively(data_directory);
+		ControlFile-fsync_disabled = false;
+	}
+
 	if (ControlFile-state == DB_SHUTDOWNED)
 	{
 		/* This is the expected case, so don't be chatty in standalone mode */
@@ -8137,6 +8165,8 @@ CreateCheckPoint(int flags)
 	/* crash recovery should always recover to the end of WAL */
 	ControlFile-minRecoveryPoint = InvalidXLogRecPtr;
 	ControlFile-minRecoveryPointTLI = 0;
+	if (!enableFsync)
+		ControlFile-fsync_disabled = true;
 
 	/*
 	 * Persist unloggedLSN value. It's reset on crash recovery, so this goes
@@ -11008,3 +11038,144 @@ SetWalWriterSleeping(bool sleeping)
 	XLogCtl-WalWriterSleeping = sleeping;
 	SpinLockRelease(XLogCtl-info_lck);
 }
+
+/*
+ * Hint to the OS that it should get ready to fsync() this file.
+ *
+ * Adapted from pre_sync_fname in initdb.c
+ */
+static void
+pre_sync_fname(char *fname, bool isdir)
+{
+	int			fd;
+
+	fd = open(fname, O_RDONLY | PG_BINARY);
+
+	/*
+	 * Some OSs don't allow us to open directories at all (Windows returns
+	 * EACCES)
+	 */
+	if (fd  0  isdir

Re: [HACKERS] MD5 authentication needs help -SCRAM

2015-03-09 Thread Abhijit Menon-Sen
At 2015-03-09 13:52:10 +0200, hlinn...@iki.fi wrote:

 Do you have any insight on why the IETF working group didn't choose a
 PAKE protocol instead of or in addition to SCRAM, when SCRAM was
 standardized?

Hi Heikki.

It was a long time ago, but I recall that SRP was patent-encumbered:

https://datatracker.ietf.org/ipr/search/?rfc=2945submit=rfc

The Wikipedia page says the relevant patents expired in 2011 and 2013.
I haven't followed SRP development since then, maybe it's been revised.

When SCRAM was being discussed, I can't recall any other proposals for
PAKE protocols. Besides, as you may already know, anyone can submit an
internet-draft about anything. It needs to gain general support for an
extended period in order to advance through the standards process.

Could you please explain what exactly you mean about a SCRAM
eavesdropper gaining some advantage in being able to mount a
dictionary attack? I didn't follow that part.

-- Abhijit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] MD5 authentication needs help -SCRAM

2015-03-09 Thread Abhijit Menon-Sen
At 2015-03-08 12:48:44 -0700, j...@agliodbs.com wrote:

 Since SCRAM has been brought up a number of times here, I thought
 I'd loop in the PostgreSQL contributor who is co-author of the SCRAM
 standard to see if he has anything to say about implementing SCRAM as
 a built-in auth method for Postgres.

I think it's a good idea.

-- Abhijit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgaudit - an auditing extension for PostgreSQL

2015-02-17 Thread Abhijit Menon-Sen
At 2015-02-17 13:01:46 -0500, sfr...@snowman.net wrote:

 I have to admit that I'm confused by this.  Patches don't stabilise
 through sitting in the archives, they stabilise when the comments are
 being addressed, the patch updated, and further comments are
 addressing less important issues.  The issues which Robert and I had
 both commented on didn't appear to be getting addressed.

I'm confused and unhappy about your characterisation of the state of
this patch. You make it seem as though there was broad consensus about
the changes that were needed, and that I left the patch sitting in the
archives for a long time without addressing important issues.

You revised and refined your GRANT proposal in stages, and I tried to
adapt the code to suit. I'm sorry that my efforts were not fast enough 
or responsive enough to make you feel that progress was being made. But
nobody else commented in detail on the GRANT changes except to express
general misgivings, and nobody else even disagreed when I inferred, in
light of the lack of consensus that Robert pointed out, that the code
was unlikely to make it into 9.5.

Given that I've maintained the code over the past year despite its being
rejected in an earlier CF, and given the circumstances outlined above, I
do not think it's reasonable to conclude after a couple of weeks without
a new version that it was abandoned. As I had mentioned earlier, there
are people who already use pgaudit as-is, and complain if I break it.

Anyway, I guess there is no such thing as a constructive history
discussion, so I'll drop it.

-- Abhijit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgaudit - an auditing extension for PostgreSQL

2015-02-17 Thread Abhijit Menon-Sen
At 2015-02-17 10:52:55 -0500, sfr...@snowman.net wrote:

 From the old thread, David had offered to submit a pull request if
 there was interest and I didn't see any response...

For whatever it's worth, that's because I've been away from work, and
only just returned. I had it on my list to look at the code tomorrow.

-- Abhijit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] What exactly is our CRC algorithm?

2015-02-11 Thread Abhijit Menon-Sen
At 2015-02-10 14:30:51 +0200, hlinnakan...@vmware.com wrote:

 I propose that we add a new header file in src/port, let's call it
 crc_instructions.h […] the point of the crc_instructions.h header
 is to hide the differences between compilers and architectures.

Moving the assembly code/compiler intrinsics to a separate header is
fine with me, but I really don't like the proposed implementation at
all. Here are some comments:

1. I don't mind moving platform-specific tests for CRC32C instructions
   to configure, but if we only define PG_HAVE_CRC32C_INSTRUCTIONS, we
   would anyway have to reproduce all that platform-specific stuff in
   the header file. To do it properly, I think we should generate the
   right version of crc_instructions.h for the platform. Otherwise,
   what's the point? Might as well have only the complicated header.

2. At this point, I think we should stick with the _sse function rather
   than a generic _hw one to drive the platform-specific instructions.
   The structure of that function (n/8*(8bytes)+n%8*(bytes)) is specific
   to what works best for the SSE instructions. On another platform, we
   may need something very similar, or very different.
   
   With the proposed structure, _hw would inevitably become #ifdef'ed
   non-trivially, e.g. to run a single-byte loop at the start to align
   the main loop, but only for certain combinations of #defines.
   
   I think we should consider what makes the most sense when someone
   actually wants to add support for another platform/compiler, and
   not before.

3. I dislike everything about pg_crc32_instructions_runtime_check()—the
   name, the separate PG_CRC32C_INSTRUCTIONS_NEED_RUNTIME_CHECK #define,
   the fact that you try to do the test inline rather than at startup…

   Maybe you're trying to avoid checking separately at startup so that a
   program that links with code from src/common doesn't need to do its
   own test. But I don't think the results are worth it.

   As for omitting the slicing-by-8 tables, if there's a more compelling
   reason to do that than Gentoo users might like that ;-), I think it
   should be done with a straightforward -DUSE_CRC_SSE style arrangement
   rather than figuring out that you HAVE_CRC32C_INSTRUCTIONS but don't
   NEED_RUNTIME_CHECK. If you want to build special-purpose binaries,
   just pick whichever CRC implementation you like, done.

 I believe the CRC instructions are also available in 32-bit mode, so
 the check for __x86_64__ is not needed. Right?

I'm not sure, but I guess you're right.

 It would be nice to use GCC's builtin intrinsics,
 __builtin_ia32_crc32* where available, but I guess those are only
 available if you build with -msse4.2, and that would defeat the
 point of the runtime check.

Exactly.

-- Abhijit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] What exactly is our CRC algorithm?

2015-02-09 Thread Abhijit Menon-Sen
At 2015-02-09 12:52:41 +0200, hlinnakan...@vmware.com wrote:

 Ok, I've committed a patch that just moves the existing code to
 common/pg_crc.c […]

Thanks, looks good.

 Attached is a rebased version of the slicing-by-8 patch.

Looks OK too.

 Do you have access to big-endian hardware to test this on?

Yes, I tested it on a Linux/ppc system. I wasn't speculating when I said
it does make a noticeable difference, though I'm afraid I did not keep
the timings after submitting the revised patch. The speedup was some
double-digit percentage, IIRC.

-- Abhijit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] What exactly is our CRC algorithm?

2015-02-08 Thread Abhijit Menon-Sen
At 2015-02-08 18:46:30 +0200, hlinnakan...@vmware.com wrote:

 So I propose to move pg_crc.c to src/common, and move the tables
 from pg_crc_tables.h directly to pg_crc.c. Thoughts?

Sounds fine to me.

-- Abhijit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgaudit - an auditing extension for PostgreSQL

2015-01-27 Thread Abhijit Menon-Sen
At 2015-01-26 17:45:52 -0500, robertmh...@gmail.com wrote:

  Based on the recent emails, it appears there's been a shift of
  preference to having it be in-core […]
 
 Well, I'm not sure that anyone else here agreed with me on that

Sure, an in-core AUDIT command would be great. Stephen has always said
that would be the most preferable solution; and if we had the code to
implement it, I doubt anyone would prefer the contrib module instead.
But we don't have that code now, and we won't have it in time for 9.5.

We had an opportunity to work on pgaudit in its current form, and I like
to think that the result is useful. To me, the question has always been
whether some variant of that code would be acceptable for 9.5's contrib.
If so, I had some time to work on that. If not… well, hard luck. But the
option to implement AUDIT was not available to me, which is why I have
not commented much on it recently.

 The basic dynamic here seems to be you asking for changes and Abhijit
 making them but without any real confidence, and I don't feel good
 about that. 

I understand how I might have given you that impression, but I didn't
mean to, and I don't really feel that way.

I appreciate Stephen's suggestions and, although it took me some time to
understand them fully, I think the use of GRANT to provide finer-grained
auditing configuration has improved pgaudit. I am slightly concerned by
the resulting complexity, but I think that can be addressed by examples
and so on. I wouldn't be unhappy if this code were to go into contrib.

(I should point out that it is also not the case that I do not hold any
opinions and would be happy with anything pgaudit-shaped being included.
For example, I strongly prefer GRANT to the 'alice:*:*' approach.)

Anyway, I think it's reasonably clear now that pgaudit is unlikely to
make it into 9.5 in any form, so I'll find something else to do.

-- Abhijit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] a fast bloat measurement tool (was Re: Measuring relation free space)

2015-01-27 Thread Abhijit Menon-Sen
At 2015-01-26 18:47:29 -0600, jim.na...@bluetreble.com wrote:

 Anything happen with this?

Nothing so far.

-- Abhijit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] a fast bloat measurement tool (was Re: Measuring relation free space)

2015-01-27 Thread Abhijit Menon-Sen
At 2015-01-27 17:00:27 -0600, jim.na...@bluetreble.com wrote:

 It would be best to get this into a commit fest so it's not lost.

It's there already.

-- Abhijit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Perl coding error in msvc build system?

2015-01-23 Thread Abhijit Menon-Sen
At 2014-06-03 22:30:50 -0400, pete...@gmx.net wrote:

 I'm not sure whether the following coding actually detects any errors:
 
 Solution.pm:
 
 open(P, cl /? 21 |) || die cl command not found;

Since nobody with a Windows system has commented, I'm just writing to
say that from a Perl perspective, I agree with your analysis and the
patch looks perfectly sensible.

-- Abhijit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_basebackup fails with long tablespace paths

2015-01-23 Thread Abhijit Menon-Sen
At 2014-12-24 08:10:46 -0500, pete...@gmx.net wrote:

 As a demo for how this might look, attached is a wildly incomplete
 patch to produce GNU long-link headers.

Hi Peter.

In what way exactly is this patch wildly incomplete? (I ask because it's
been added to the current CF).

-- Abhijit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgaudit - an auditing extension for PostgreSQL

2015-01-21 Thread Abhijit Menon-Sen
[After a discussion on IRC with Stephen…]

At 2015-01-20 21:47:02 -0500, sfr...@snowman.net wrote:

 Review the opening of this email though and consider that we could
 look at what privileges has the audit role granted to the current
 role? as defining what is to be audited.

Right, I understand now how that would work. I'll try to find time to
(a) implement this, (b) remove the backwards-compatibility code, and
(c) split up the USE_DEPARSE_FUNCTIONS stuff.

  For example, what if I want to see all the tables created and
  dropped by a particular user?
 
 I hadn't been intending to address that with this scheme, but I think
 we have that by looking for privilege grants where the audit role is
 the grantee and the role-to-be-audited the grantor.

For CREATE, yes, with a bit of extra ACL-checking code in the utility
hook; but I don't think we'll get very far without the ability to log
ALTER/DROP too. :-) So there has to be some alternative mechanism for
that, and I'm hoping Robert (or anyone!) has something in mind.

 Well, I was primairly digging for someone to say yes, the object
 access stuff will be reworked to be based on event triggers, thus
 removing the need for the object access bits.

(This doesn't entirely make sense to me, but it's tangential anyway, so
I won't comment on it for now.)

-- Abhijit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgaudit - an auditing extension for PostgreSQL

2015-01-20 Thread Abhijit Menon-Sen
At 2015-01-20 20:36:39 -0500, robertmh...@gmail.com wrote:

 I think this is throwing the baby out with the bathwater.  Neither C
 functions nor all-or-nothing are going to be of any practical use.

Do you see some approach that has a realistic chance of making 9.5 and
would also actually be worth putting into 9.5? Suggestions very welcome.

-- Abhijit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


  1   2   3   4   5   >