Re: [HACKERS] Streaming replication and WAL archive interactions

2015-04-21 Thread Heikki Linnakangas

On 04/22/2015 03:30 AM, Michael Paquier wrote:

This is going to change a behavior that people have been used to for a
couple of releases. I would not mind having this patch make
"archive_mode = on during recovery" mean: archive only the segments
generated by this node, plus the last partial segment on the old
timeline at promotion, without renaming, to preserve the
backward-compatible behavior. If master and standby point to separate
archive locations, the operator can then sort out later which one he
wants to use. If they point to the same location, archive_command
scripts already do such renaming internally; at least that's what I
suspect an experienced user would do, though it's true that not many
people are experienced in this area, and I see mistakes regarding such
things on a weekly basis... This .partial segment renaming is something
that we should let the archive_command manage with its own internal logic.


Currently, the archive command doesn't know if the segment it's 
archiving is partial or not, so you can't put any logic there to manage 
it. But if we archive it with the .partial suffix, then you can put 
logic in the restore_command to check for .partial files, if you really 
want to.


I feel that the best approach is to archive the last, partial segment, 
but with the .partial suffix. I don't see any plausible real-world setup 
where the current behaviour would be better. I don't really see much 
need to archive the partial segment at all, but there's also no harm in 
doing it, as long as it's clearly marked with the .partial suffix.


BTW, pg_receivexlog also uses a ".partial" file while it's streaming 
WAL from the server. The .partial suffix is removed when the segment is 
complete. So there's some precedent for this. pg_receivexlog just adds 
".partial" to the filename, though; it doesn't record what portion of 
the file is valid, as I suggested here. Perhaps we should follow 
pg_receivexlog's example at promotion too, for consistency.
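
For illustration, the two naming schemes could be built at promotion 
roughly like this. This is only a sketch: XLogFileName(), MAXFNAMELEN and 
XLogSegSize are existing symbols, but the function, its arguments and the 
exact offset format are illustrative, and the offset-suffix form is just 
the proposal from upthread:

    #include "postgres.h"
    #include "access/xlog_internal.h"

    /* hypothetical helper: name the last, partial segment of the old TLI */
    static void
    build_partial_names(TimeLineID oldTLI, XLogSegNo endSegNo,
                        XLogRecPtr endOfLog)
    {
        char        segname[MAXFNAMELEN];
        char        simple[MAXFNAMELEN + 16];
        char        with_offset[MAXFNAMELEN + 32];

        XLogFileName(segname, oldTLI, endSegNo);

        /* pg_receivexlog-style: just append ".partial" */
        snprintf(simple, sizeof(simple), "%s.partial", segname);

        /* proposed variant: also record how far the segment is valid,
         * similar to the naming of backup history files */
        snprintf(with_offset, sizeof(with_offset), "%s.%08X.partial",
                 segname, (uint32) (endOfLog % XLogSegSize));
    }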


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Add pg_settings.pending_restart column

2015-04-21 Thread Michael Paquier
On Thu, Mar 5, 2015 at 12:04 PM, Peter Eisentraut  wrote:
> On 2/17/15 10:45 AM, Robert Haas wrote:
>> You don't really need the "else" here, and in parallel cases:
>>
>>  if (*conf->variable != newval)
>>  {
>> +record->status |= GUC_PENDING_RESTART;
>>  ereport(elevel,
>>  (errcode(ERRCODE_CANT_CHANGE_RUNTIME_PARAM),
>>   errmsg("parameter \"%s\" cannot be
>> changed without restarting the server",
>>  name)));
>>  return 0;
>>  }
>> +else
>> +record->status &= ~GUC_PENDING_RESTART;
>>  return -1;
>>
>> The if-statement ends with "return 0" so there is no reason for the "else".
>
> I kind of liked the symmetry of if/else, but I can change it.

This feature looks useful to me. I had a quick look and it is working
as intended: issuing SIGHUP to reload parameters updates the
pending_restart status correctly.

One additional comment on top of what has already been mentioned is
that this lacks parentheses, IMO:
- values[16] = conf->status & GUC_PENDING_RESTART ? "t" : "f";
+ values[16] = (conf->status & GUC_PENDING_RESTART) ? "t" : "f";
Also, the documentation was not correctly formatted.

Changes with ALTER SYSTEM (and include files) get recognized as well.
For example:
=# \! echo max_prepared_transactions = 100 >> $PGDATA/postgresql.conf
=# select pg_reload_conf();
 pg_reload_conf
----------------
 t
(1 row)
=# select name from pg_settings where pending_restart;
            name
---------------------------
 max_prepared_transactions
(1 row)
=# alter system set max_connections = 1000;
ALTER SYSTEM
=# select pg_reload_conf();
 pg_reload_conf
----------------
 t
(1 row)
=# select name from pg_settings where pending_restart;
            name
---------------------------
 max_connections
 max_prepared_transactions
(2 rows)

Attached is a rebased patch with previous comments addressed as I was
looking at it.
Switching this patch as "Ready for committer".
Regards,
-- 
Michael
diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index d0b78f2..53d3f4f 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -8822,6 +8822,14 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx
   or when examined by a non-superuser)
   
  
+ 
+  pending_restart
+  boolean
+  true if the value has been changed in the
+  configuration file but needs a restart; or false
+  otherwise.
+  
+ 
 

   
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index f43aff2..e09b021 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -5897,12 +5897,14 @@ set_config_option(const char *name, const char *value,
 {
 	if (*conf->variable != newval)
 	{
+		record->status |= GUC_PENDING_RESTART;
 		ereport(elevel,
 (errcode(ERRCODE_CANT_CHANGE_RUNTIME_PARAM),
  errmsg("parameter \"%s\" cannot be changed without restarting the server",
 		name)));
 		return 0;
 	}
+	record->status &= ~GUC_PENDING_RESTART;
 	return -1;
 }
 
@@ -5985,12 +5987,14 @@ set_config_option(const char *name, const char *value,
 {
 	if (*conf->variable != newval)
 	{
+		record->status |= GUC_PENDING_RESTART;
 		ereport(elevel,
 (errcode(ERRCODE_CANT_CHANGE_RUNTIME_PARAM),
  errmsg("parameter \"%s\" cannot be changed without restarting the server",
 		name)));
 		return 0;
 	}
+	record->status &= ~GUC_PENDING_RESTART;
 	return -1;
 }
 
@@ -6073,12 +6077,14 @@ set_config_option(const char *name, const char *value,
 {
 	if (*conf->variable != newval)
 	{
+		record->status |= GUC_PENDING_RESTART;
 		ereport(elevel,
 (errcode(ERRCODE_CANT_CHANGE_RUNTIME_PARAM),
  errmsg("parameter \"%s\" cannot be changed without restarting the server",
 		name)));
 		return 0;
 	}
+	record->status &= ~GUC_PENDING_RESTART;
 	return -1;
 }
 
@@ -6179,12 +6185,14 @@ set_config_option(const char *name, const char *value,
 	if (*conf->variable == NULL || newval == NULL ||
 		strcmp(*conf->variable, newval) != 0)
 	{
+		record->status |= GUC_PENDING_RESTART;
 		ereport(elevel,
 (errcode(ERRCODE_CANT_CHANGE_RUNTIME_PARAM),
  errmsg("parameter \"%s\" cannot be changed without restarting the server",
 		name)));
 		return 0;
 	}
+	record->status &= ~GUC_PENDING_RESTART;
 	return -1;
 }
 
@@ -6272,12 +6280,14 @@ set_config_option(const char *name, const char *value,
 {
 	if (*conf->variable != newval)
 	{
+		record->status |= GUC_PENDING_RESTART;
 		ereport(elevel,
 (errcode(ERRCODE_CANT_CHANGE_RUNT

Re: [HACKERS] Streaming replication and WAL archive interactions

2015-04-21 Thread Heikki Linnakangas

On 04/22/2015 12:42 AM, Robert Haas wrote:

On Tue, Apr 21, 2015 at 6:55 AM, Heikki Linnakangas  wrote:

On 04/21/2015 12:04 PM, Michael Paquier wrote:

On Tue, Apr 21, 2015 at 4:38 PM, Heikki Linnakangas 
wrote:

Note that even though we don't archive the partial last segment on the
previous timeline, the same WAL is copied to the first segment on the new
timeline. So the WAL isn't lost.


But if the failed master has archived those segments safely, we may need
them, no? I am not sure we can ignore a user who would want to do a PITR
with recovery_target_timeline pointing to the one of the failed master.


I think it would be acceptable. If you want to maintain an up-to-the-second
archive, you can use pg_receivexlog. Mind you, if the standby wasn't
promoted, the partial segment would not be present in the archive anyway.
And you can copy the WAL segment manually from 000200XX to
pg_xlog/000100XX before starting PITR.

Another thought is that we could archive the partial file, but with a
different name to avoid confusing it with the full segment. For example, we
could archive a partial 00010012 segment as
"00020012.0128.partial", where 0128 indicates how
far that file is valid (this naming is similar to how the backup history
files are named). Recovery wouldn't automatically pick up those files, but
the DBA could easily copy the partial file into pg_xlog with the full
segment's name, if he wants to do PITR to that piece of WAL.


So, suppose you have A replicating to B (via an archive) replicating to C
(via a separate archive); A dies, B is promoted.  It sounds to me like
today this will work and with your proposed change it will require
manual intervention.


No. If there is no streaming replication involved, no partial files will 
be archived, with or without this patch. There is no change to that 
scenario.


Note that it's a bit complicated to set up that scenario today. 
Archiving is never enabled in recovery mode, so you'll need to use a 
custom cron job or something to maintain the archive that C uses. The 
files will not automatically flow from B to the second archive. With the 
patch we're discussing, however, it would be easy: just set 
archive_mode='always' in B.


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Authenticating from SSL certificates

2015-04-21 Thread kee...@thebrocks.net
Hello,

I'm looking into connecting to Postgres using authentication from client
certificates. [1]

The documentation states that the common name (aka CN) is read from the
certificate and used as the user's login (aka auth_user).
The problem is that the common name is typically the user's full name. A
field like the email address would contain a more computer-friendly
identifier.

So my feature request is to allow the Postgres admin to specify which field
of the SSL client certificate is used to read the auth_user.


I started to dig into the code and have some thoughts, but wanted to get
any advice before I started writing up some code.


Add a "user" option to pg_hba.conf:
# TYPE  DATABASE USER  ADDRESS   METHOD
hostssl all  all   all   cert map=usermap user=CN

1. Documentation seems straight forward [1]
2. The configuration value would be added in parse_hba_line and this value
is accessible via port->hba.
3. The certificate can be parsed from port->peer with something like
X509_NAME_field_to_text [2] (a rough sketch follows below).
4. The user-requested field would then be passed as auth_user
into check_usermap [3].
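
A minimal sketch of that step, assuming the chosen field is the
certificate's emailAddress and using only standard OpenSSL calls (this is
not the existing PostgreSQL code; get_auth_user_from_cert is a
hypothetical helper and error handling is omitted):

    #include "postgres.h"
    #include <openssl/objects.h>
    #include <openssl/x509.h>

    /*
     * Extract the emailAddress field from the verified client certificate
     * instead of the CN; the result would be used as auth_user.
     */
    static char *
    get_auth_user_from_cert(X509 *peer)
    {
        char        buf[256];
        int         len;

        len = X509_NAME_get_text_by_NID(X509_get_subject_name(peer),
                                        NID_pkcs9_emailAddress,
                                        buf, sizeof(buf));
        if (len < 0)
            return NULL;        /* field not present in the certificate */

        return pstrdup(buf);
    }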

The current code parses the ssl common name and populates peer_cn pretty
early on. [4]
That suggests to me that most of the ssl parsing wants to be done up front.
Then again, peer_cn is not used anywhere else so it may be fine to just
delete this field from the structure.


An alternative is to populate peer_cn with the user requested field. [4]
The configuration option would be in postgresql.conf and would reside in a
global variable (similar to ssl_cert_file).

Any pointers would be great.
I could find a little history in the archives, but couldn't determine if
any decisions or conclusions had been made.

Thanks,
Keenan

[1]: http://www.postgresql.org/docs/9.4/static/auth-methods.html#AUTH-CERT
[2]:
https://github.com/postgres/postgres/blob/b0a738f428ca4e52695c0f019c1560c64cc59aef/contrib/sslinfo/sslinfo.c#L171-L192
[3]:
https://github.com/postgres/postgres/blob/b0a738f428ca4e52695c0f019c1560c64cc59aef/src/backend/libpq/auth.c#L2153
[4]:
https://github.com/postgres/postgres/blob/b0a738f428ca4e52695c0f019c1560c64cc59aef/src/backend/libpq/be-secure-openssl.c#L428-L445


Re: Custom/Foreign-Join-APIs (Re: [HACKERS] [v9.5] Custom Plan API)

2015-04-21 Thread Kouhei Kaigai
Hanada-san,

> I reviewed the Custom/Foreign join API patch again after writing a patch of 
> join
> push-down support for postgres_fdw.
>
Thanks for your dedicated work; my comments are inline below.

> Here, please let me summarize the changes in the patch as the result of my 
> review.
> 
> * Add set_join_pathlist_hook_type in add_paths_to_joinrel
> This hook is intended to provide a chance to add one or more CustomPaths for 
> an
> actual join combination.  If the join is reversible, the hook is called for 
> both
> A * B and B * A.  This is different from FDW API but it seems fine because 
> FDWs
> should have chances to process the join in more abstract level than CSPs.
> 
> Parameters are same as hash_inner_and_outer, so they would be enough for 
> hash-like
> or nestloop-like methods.  I’m not sure whether mergeclause_list is necessary
> as a parameter or not.  It’s information for merge join which is generated 
> when
> enable_mergejoin is on and the join is not FULL OUTER.  Does some CSP need it
> for processing a join in its own way?  Then it must be in parameter list 
> because
> select_mergejoin_clauses is static so it’s not accessible from external 
> modules.
>
I think a preferable way is for the extension itself to reproduce the
mergeclause_list, rather than passing it as a hook argument, because it is
uncertain whether a CSP should follow the "enable_mergejoin" parameter even
if it implements merge-join-like logic. Of course, this requires exposing
select_mergejoin_clauses. That seems to me the straightforward way.
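
To make the intended usage concrete, a custom scan provider would register
and use the proposed hook roughly as below. This is only a sketch: the
parameter list is assumed to mirror hash_inner_and_outer as described
above, and create_my_custom_join_path is a hypothetical helper inside the
extension, not an existing API.

    #include "postgres.h"
    #include "optimizer/pathnode.h"
    #include "optimizer/paths.h"

    /* hypothetical helper provided by the extension */
    extern CustomPath *create_my_custom_join_path(PlannerInfo *root,
                                                  RelOptInfo *joinrel,
                                                  RelOptInfo *outerrel,
                                                  RelOptInfo *innerrel,
                                                  JoinType jointype,
                                                  JoinPathExtraData *extra);

    static set_join_pathlist_hook_type prev_set_join_pathlist_hook = NULL;

    static void
    my_join_pathlist_hook(PlannerInfo *root, RelOptInfo *joinrel,
                          RelOptInfo *outerrel, RelOptInfo *innerrel,
                          JoinType jointype, JoinPathExtraData *extra)
    {
        CustomPath *cpath;

        /* let any previously installed hook add its paths too */
        if (prev_set_join_pathlist_hook)
            prev_set_join_pathlist_hook(root, joinrel, outerrel, innerrel,
                                        jointype, extra);

        /* build a CustomPath that produces the join result directly */
        cpath = create_my_custom_join_path(root, joinrel, outerrel, innerrel,
                                           jointype, extra);
        if (cpath != NULL)
            add_path(joinrel, &cpath->path);
    }

    void
    _PG_init(void)
    {
        prev_set_join_pathlist_hook = set_join_pathlist_hook;
        set_join_pathlist_hook = my_join_pathlist_hook;
    }

The planner simply picks this CustomPath if it turns out to be the cheapest;
nothing is removed from the existing pathlist.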

> The timing of the hooking, after considering all built-in path types, seems 
> fine
> because some of CSPs might want to use built-in paths as a template or a 
> source.
> 
> One concern is in the document of the hook function.  "Implementing Custom 
> Paths”
> says:
> 
> > A custom scan provider will be also able to add paths by setting the 
> > following
> hook, to replace built-in join paths by custom-scan that performs as if a scan
> on preliminary joined relations, which us called after the core code has 
> generated
> what it believes to be the complete and correct set of access paths for the 
> join.
> 
> I think “replace” would mis-lead readers that CSP can remove or edit existing
> built-in paths listed in RelOptInfo#pathlist or linked from cheapest_foo.  
> IIUC
> CSP can just add paths for the join relation, and planner choose it if it’s 
> the
> cheapest.
>
I adjusted the documentation as follows:

   A custom scan provider will also be able to add paths by setting the
   following hook, to add CustomPath nodes that perform as if they were
   the built-in join logic. Such a path is typically expected to take two
   input relations and generate a joined output stream, or simply to scan
   a preliminarily joined relation, like a materialized view. This hook is
   called after the core join logic has been considered, and the planner
   then chooses the best path for the join among the built-in and custom
   ones.

Hopefully this describes more correctly what the hook does.
The v12 patch updates only this portion.

> * Add new FDW API GetForeignJoinPaths in make_join_rel
> This FDW API is intended to provide a chance to add ForeignPaths for a join 
> relation.
> This is called only once for a join relation, so FDW should consider reversed
> combination if it’s meaningful in their own mechanisms.
> 
> Note that this is called only when the join relation was *NOT* found in the
> PlannerInfo, to avoid redundant calls.
>
Yep, it is designed according to the discussion upthread.
It can produce N-way remote join paths even if an intermediate join relation
is more expensive than a local join plus two foreign scans.

> Parameters seems enough for postgres_fdw to process N-way join on remote side
> with pushing down join conditions and remote filters.
>
You have confirmed that clearly.
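
For reference, the FDW side might wire this up roughly as follows. This is
a sketch under the assumption that the callback's parameters follow the
patch discussed here and that it is exposed through FdwRoutine like the
other planner callbacks; remote_join_is_safe and make_remote_join_path are
illustrative helpers the FDW would implement, not existing APIs.

    #include "postgres.h"
    #include "fmgr.h"
    #include "foreign/fdwapi.h"
    #include "optimizer/pathnode.h"

    /* hypothetical helpers inside the FDW */
    extern bool remote_join_is_safe(PlannerInfo *root, RelOptInfo *joinrel,
                                    RelOptInfo *outerrel, RelOptInfo *innerrel,
                                    JoinType jointype);
    extern ForeignPath *make_remote_join_path(PlannerInfo *root,
                                              RelOptInfo *joinrel,
                                              RelOptInfo *outerrel,
                                              RelOptInfo *innerrel,
                                              JoinType jointype,
                                              JoinPathExtraData *extra);

    static void
    myfdwGetForeignJoinPaths(PlannerInfo *root, RelOptInfo *joinrel,
                             RelOptInfo *outerrel, RelOptInfo *innerrel,
                             JoinType jointype, JoinPathExtraData *extra)
    {
        ForeignPath *path;

        /* called once per join relation; bail out if it cannot be pushed down */
        if (!remote_join_is_safe(root, joinrel, outerrel, innerrel, jointype))
            return;

        path = make_remote_join_path(root, joinrel, outerrel, innerrel,
                                     jointype, extra);
        add_path(joinrel, (Path *) path);
    }

    PG_FUNCTION_INFO_V1(myfdw_handler);

    Datum
    myfdw_handler(PG_FUNCTION_ARGS)
    {
        FdwRoutine *routine = makeNode(FdwRoutine);

        /* ... the usual scan-related callbacks ... */
        routine->GetForeignJoinPaths = myfdwGetForeignJoinPaths;

        PG_RETURN_POINTER(routine);
    }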

> * Treat scanrelid == 0 as pseudo scan
> A foreign/custom join is represented by a scan against a pseudo relation, i.e.
> result of a join.  Usually Scan has valid scanrelid, oid of a relation being
> scanned, and many functions assume that it’s always valid.  The patch adds 
> another
> code paths for scanrelid == 0 as custom/foreign join scans.
>
Right.

> * Pseudo scan target list support
> CustomScan and ForeignScan have csp_ps_tlist and fdw_ps_tlist respectively, 
> for
> column reference tracking.  A scan generated for custom/foreign join would 
> have
> column from multiple relations in its target list, i.e. output columns.  
> Ordinary
> scans have all valid columns of the relation as output, so references to them
> can be resolved easily, but we need an additional mechanism to determine where
> a reference in a target list of custom/foreign scan come from.  This is very
> similar to what IndexOnlyScan does, so we reuse INDEX_VAR as mark of an 
> indirect
> reference to another relation’s var.
>
Right. The FDW/CSP driver is responsible for setting *_ps_tlist to inform
the core planner which columns of which relations are referenced, and which
attr

Re: [HACKERS] Idea: closing the loop for "pg_ctl reload"

2015-04-21 Thread Jan de Visser
On April 21, 2015 09:34:51 PM Jan de Visser wrote:
> On April 21, 2015 09:01:14 PM Jan de Visser wrote:
> > On April 21, 2015 07:32:05 PM Payal Singh wrote:
... snip ...
> 
> Urgh. It appears you are right. Will fix.
> 
> jan

Attached is a new attempt. This was one from the category "I have no idea
how that ever worked", but whatever. For reference, this is how it looks for
me (magic man-behind-the-curtain postgresql.conf editing omitted):

jan@wolverine:~/Projects/postgresql$ initdb -D data
... Bla bla bla ...
jan@wolverine:~/Projects/postgresql$ pg_ctl -D data -l logfile start
server starting
jan@wolverine:~/Projects/postgresql$ tail -5 logfile
LOG:  database system was shut down at 2015-04-21 22:03:33 EDT
LOG:  database system is ready to accept connections
LOG:  autovacuum launcher started
jan@wolverine:~/Projects/postgresql$ pg_ctl -D data reload
server signaled
jan@wolverine:~/Projects/postgresql$ tail -5 logfile
LOG:  database system was shut down at 2015-04-21 22:03:33 EDT
LOG:  database system is ready to accept connections
LOG:  autovacuum launcher started
LOG:  received SIGHUP, reloading configuration files
jan@wolverine:~/Projects/postgresql$ pg_ctl -D data reload
server signaled
pg_ctl: Reload of server with PID 14656 FAILED
Consult the server log for details.
jan@wolverine:~/Projects/postgresql$ tail -5 logfile
LOG:  autovacuum launcher started
LOG:  received SIGHUP, reloading configuration files
LOG:  received SIGHUP, reloading configuration files
LOG:  syntax error in file "/home/jan/Projects/postgresql/data/postgresql.conf" 
line 1, near end of line
LOG:  configuration file "/home/jan/Projects/postgresql/data/postgresql.conf" 
contains errors; no changes were applied
jan@wolverine:~/Projects/postgresql$

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index a9f20ac..a7819d2 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -1222,6 +1222,15 @@ PostmasterMain(int argc, char *argv[])
 #endif
 
 	/*
+	 * Update postmaster.pid with startup time as the last reload time:
+	 */
+	{
+		char last_reload_info[32];
+		snprintf(last_reload_info, 32, "%ld %d", (long) MyStartTime, 1);
+		AddToDataDirLockFile(LOCK_FILE_LINE_LAST_RELOAD, last_reload_info);
+	}
+
+	/*
 	 * Remember postmaster startup time
 	 */
 	PgStartTime = GetCurrentTimestamp();
@@ -2341,6 +2350,8 @@ static void
 SIGHUP_handler(SIGNAL_ARGS)
 {
 	int			save_errno = errno;
+	bool		reload_success;
+	char		last_reload_info[32];
 
 	PG_SETMASK(&BlockSig);
 
@@ -2348,7 +2359,16 @@ SIGHUP_handler(SIGNAL_ARGS)
 	{
 		ereport(LOG,
 (errmsg("received SIGHUP, reloading configuration files")));
-		ProcessConfigFile(PGC_SIGHUP);
+		reload_success = ProcessConfigFile(PGC_SIGHUP);
+
+		/*
+		 * Write the current time and the result of the reload to the
+		 * postmaster.pid file.
+		 */
+		snprintf(last_reload_info, 32, "%ld %d",
+				 (long) time(NULL), reload_success);
+		AddToDataDirLockFile(LOCK_FILE_LINE_LAST_RELOAD, last_reload_info);
+
 		SignalChildren(SIGHUP);
 		SignalUnconnectedWorkers(SIGHUP);
 		if (StartupPID != 0)
diff --git a/src/backend/utils/misc/guc-file.l b/src/backend/utils/misc/guc-file.l
index c5e0fac..3162cd5 100644
--- a/src/backend/utils/misc/guc-file.l
+++ b/src/backend/utils/misc/guc-file.l
@@ -109,7 +109,7 @@ STRING			\'([^'\\\n]|\\.|\'\')*\'
  * All options mentioned in the configuration file are set to new values.
  * If an error occurs, no values will be changed.
  */
-void
+bool
 ProcessConfigFile(GucContext context)
 {
 	bool		error = false;
@@ -202,7 +202,7 @@ ProcessConfigFile(GucContext context)
 		 * the config file.
 		 */
 		if (head == NULL)
-			return;
+			return false;
 	}
 
 	/*
@@ -430,6 +430,7 @@ ProcessConfigFile(GucContext context)
 	 * freed here.
 	 */
 	FreeConfigVariables(head);
+	return !error;
 }
 
 /*
diff --git a/src/bin/pg_ctl/pg_ctl.c b/src/bin/pg_ctl/pg_ctl.c
index 80d7bc7..0ffe97b 100644
--- a/src/bin/pg_ctl/pg_ctl.c
+++ b/src/bin/pg_ctl/pg_ctl.c
@@ -73,6 +73,20 @@ typedef enum
 	RUN_AS_SERVICE_COMMAND
 } CtlCommand;
 
+typedef struct
+{
+	pgpid_t		pid;
+	char	   *datadir;
+	time_t		startup_ts;
+	int			port;
+	char	   *socketdir;
+	char	   *listenaddr;
+	unsigned long shmkey;
+	int			shmid;
+	time_t		reload_ts;
+	bool		reload_ok;
+} PIDFileContents;
+
 #define DEFAULT_WAIT	60
 
 static bool do_wait = false;
@@ -153,6 +167,8 @@ static int	CreateRestrictedProcess(char *cmd, PROCESS_INFORMATION *processInfo,
 static pgpid_t get_pgpid(bool is_status_request);
 static char **readfile(const char *path);
 static void free_readfile(char **optlines);
+static PIDFileContents * get_pidfile_contents(const char *path);
+static void free_pidfile_contents(PIDFileContents *contents);
 static int	start_postmaster(void);
 static void read_post_opts(void);
 
@@ -415,6 +431,78 @@ free_readfile(char **optlines)
 }
 
 /*
+ * Read and parse the contents of

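
(The rest of the patch is cut off in the archive. For context, the pg_ctl
side might check the new postmaster.pid line roughly like this; a
hypothetical sketch, not the truncated patch code, and the line number and
format are assumptions based on the snprintf() calls above.)

    #include <stdio.h>
    #include <stdbool.h>
    #include <time.h>

    #define LAST_RELOAD_LINENO 8    /* assumed position of the new line */

    /* read "<epoch seconds> <1|0>" from postmaster.pid, if present */
    static bool
    read_last_reload(const char *pidpath, time_t *ts, bool *ok)
    {
        FILE       *fp = fopen(pidpath, "r");
        char        line[256];
        int         lineno = 0;
        bool        found = false;

        if (fp == NULL)
            return false;
        while (fgets(line, sizeof(line), fp) != NULL)
        {
            long        t;
            int         success;

            if (++lineno == LAST_RELOAD_LINENO &&
                sscanf(line, "%ld %d", &t, &success) == 2)
            {
                *ts = (time_t) t;
                *ok = (success != 0);
                found = true;
                break;
            }
        }
        fclose(fp);
        return found;
    }

pg_ctl would then compare the recorded timestamp with the time it sent
SIGHUP and print a failure message if the reload did not succeed.
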
Re: [HACKERS] Idea: closing the loop for "pg_ctl reload"

2015-04-21 Thread Jan de Visser
On April 21, 2015 09:01:14 PM Jan de Visser wrote:
> On April 21, 2015 07:32:05 PM Payal Singh wrote:
> > I'm trying to review this patch and applied
> > http://www.postgresql.org/message-id/attachment/37123/Let_pg_ctl_check_the
> > _r esult_of_a_postmaster_config_reload.patch to postgres. gmake check
> > passed but while starting postgres I see:
> > 
> > [postgres@vagrant-centos65 data]$ LOG:  incomplete data in
> > "postmaster.pid": found only 5 newlines while trying to add line 8
> > LOG:  redirecting log output to logging collector process
> > HINT:  Future log output will appear in directory "pg_log".
> > 
> > 
> > Also, a simple syntax error test gave no warning at all on doing a reload,
> > but just as before there was an error message in the logs:
> > 
> > [postgres@vagrant-centos65 data]$ /usr/local/pgsql/bin/pg_ctl -D
> > /usr/local/pgsql/data reload
> > server signaled
> > [postgres@vagrant-centos65 data]$ cd pg_log
> > [postgres@vagrant-centos65 pg_log]$ ls
> > postgresql-2015-04-21_232328.log  postgresql-2015-04-21_232858.log
> > [postgres@vagrant-centos65 pg_log]$ grep error
> > postgresql-2015-04-21_232858.log
> > LOG:  syntax error in file "/usr/local/pgsql/data/postgresql.conf" line
> > 211, near token "/"
> > LOG:  configuration file "/usr/local/pgsql/data/postgresql.conf" contains
> > errors; no changes were applied
> > 
> > I'm guessing since this is a patch submitted to the commitfest after the
> > current one, am I too early to start reviewing it?
> > 
> > Payal
> 
> But, but, but...  it worked for me... :-)
> 
> I'll have a look. I'll apply my patch to a clean tree and see if any bits
> have rotted in the last month or so.
> 
> One thing to note is that you won't get the actual error; just a message
> that reloading failed.
> 
> jan

Urgh. It appears you are right. Will fix.

jan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Idea: closing the loop for "pg_ctl reload"

2015-04-21 Thread Jan de Visser
(Please don't top post)

On April 21, 2015 07:32:05 PM Payal Singh wrote:
> I'm trying to review this patch and applied
> http://www.postgresql.org/message-id/attachment/37123/Let_pg_ctl_check_the_r
> esult_of_a_postmaster_config_reload.patch to postgres. gmake check passed
> but while starting postgres I see:
> 
> [postgres@vagrant-centos65 data]$ LOG:  incomplete data in
> "postmaster.pid": found only 5 newlines while trying to add line 8
> LOG:  redirecting log output to logging collector process
> HINT:  Future log output will appear in directory "pg_log".
> 
> 
> Also, a simple syntax error test gave no warning at all on doing a reload,
> but just as before there was an error message in the logs:
> 
> [postgres@vagrant-centos65 data]$ /usr/local/pgsql/bin/pg_ctl -D
> /usr/local/pgsql/data reload
> server signaled
> [postgres@vagrant-centos65 data]$ cd pg_log
> [postgres@vagrant-centos65 pg_log]$ ls
> postgresql-2015-04-21_232328.log  postgresql-2015-04-21_232858.log
> [postgres@vagrant-centos65 pg_log]$ grep error
> postgresql-2015-04-21_232858.log
> LOG:  syntax error in file "/usr/local/pgsql/data/postgresql.conf" line
> 211, near token "/"
> LOG:  configuration file "/usr/local/pgsql/data/postgresql.conf" contains
> errors; no changes were applied
> 
> I'm guessing since this is a patch submitted to the commitfest after the
> current one, am I too early to start reviewing it?
> 
> Payal

But, but, but...  it worked for me... :-)

I'll have a look. I'll apply my patch to a clean tree and see if any bits have 
rotted in the last month or so. 

One thing to note is that you won't get the actual error; just a message that 
reloading failed.

jan



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Streaming replication and WAL archive interactions

2015-04-21 Thread Michael Paquier
On Wed, Apr 22, 2015 at 6:42 AM, Robert Haas  wrote:
>
> On Tue, Apr 21, 2015 at 6:55 AM, Heikki Linnakangas  wrote:
> > On 04/21/2015 12:04 PM, Michael Paquier wrote:
> >> On Tue, Apr 21, 2015 at 4:38 PM, Heikki Linnakangas 
> >> wrote:
> >>> Note that even though we don't archive the partial last segment on the
> >>> previous timeline, the same WAL is copied to the first segment on the new
> >>> timeline. So the WAL isn't lost.
> >>
> >> But if the failed master has archived those segments safely, we may need
> >> them, no? I am not sure we can ignore a user who would want to do a PITR
> >> with recovery_target_timeline pointing to the one of the failed master.
> >
> > I think it would be acceptable. If you want to maintain an up-to-the-second
> > archive, you can use pg_receivexlog. Mind you, if the standby wasn't
> > promoted, the partial segment would not be present in the archive anyway.
> > And you can copy the WAL segment manually from 000200XX to
> > pg_xlog/000100XX before starting PITR.
> >
> > Another thought is that we could archive the partial file, but with a
> > different name to avoid confusing it with the full segment. For example, we
> > could archive a partial 00010012 segment as
> > "00020012.0128.partial", where 0128 indicates how
> > far that file is valid (this naming is similar to how the backup history
> > files are named). Recovery wouldn't automatically pick up those files, but
> > the DBA could easily copy the partial file into pg_xlog with the full
> > segment's name, if he wants to do PITR to that piece of WAL.
>
> So, suppose you have A replicating to B (via an archive) replicating to C
> (via a separate archive); A dies, B is promoted.  It sounds to me like
> today this will work and with your proposed change it will require
> manual intervention.  I don't think that's OK.


This is going to change a behavior that people have been used to for a
couple of releases. I would not mind having this patch make
"archive_mode = on during recovery" mean: archive only the segments
generated by this node, plus the last partial segment on the old
timeline at promotion, without renaming, to preserve the
backward-compatible behavior. If master and standby point to separate
archive locations, the operator can then sort out later which one he
wants to use. If they point to the same location, archive_command
scripts already do such renaming internally; at least that's what I
suspect an experienced user would do, though it's true that not many
people are experienced in this area, and I see mistakes regarding such
things on a weekly basis... This .partial segment renaming is something
that we should let the archive_command manage with its own internal logic.
Regards,
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fix broken Install.bat when target directory contains a space

2015-04-21 Thread Michael Paquier
On Wed, Apr 22, 2015 at 12:40 AM, Asif Naeem  wrote:

> Thank you Michael, latest patch looks good to me. I have changed its
> status to ready for committer.
>

Thanks!
-- 
Michael


Re: [HACKERS] Idea: closing the loop for "pg_ctl reload"

2015-04-21 Thread Payal Singh
I'm trying to review this patch and applied
http://www.postgresql.org/message-id/attachment/37123/Let_pg_ctl_check_the_result_of_a_postmaster_config_reload.patch
to postgres. gmake check passed but while starting postgres I see:

[postgres@vagrant-centos65 data]$ LOG:  incomplete data in
"postmaster.pid": found only 5 newlines while trying to add line 8
LOG:  redirecting log output to logging collector process
HINT:  Future log output will appear in directory "pg_log".


Also, a simple syntax error test gave no warning at all on doing a reload,
but just as before there was an error message in the logs:

[postgres@vagrant-centos65 data]$ /usr/local/pgsql/bin/pg_ctl -D
/usr/local/pgsql/data reload
server signaled
[postgres@vagrant-centos65 data]$ cd pg_log
[postgres@vagrant-centos65 pg_log]$ ls
postgresql-2015-04-21_232328.log  postgresql-2015-04-21_232858.log
[postgres@vagrant-centos65 pg_log]$ grep error
postgresql-2015-04-21_232858.log
LOG:  syntax error in file "/usr/local/pgsql/data/postgresql.conf" line
211, near token "/"
LOG:  configuration file "/usr/local/pgsql/data/postgresql.conf" contains
errors; no changes were applied

I'm guessing since this is a patch submitted to the commitfest after the
current one, am I too early to start reviewing it?

Payal

Payal Singh,
Database Administrator,
OmniTI Computer Consulting Inc.
Phone: 240.646.0770 x 253

On Thu, Mar 5, 2015 at 4:06 PM, Jim Nasby  wrote:

> On 3/4/15 7:13 PM, Jan de Visser wrote:
>
>> On March 4, 2015 11:08:09 PM Andres Freund wrote:
>>
>>> Let's get the basic feature (notification of failed reloads) done
>>> first. That will be required with or without including the error
>>> message.  Then we can get more fancy later, if somebody really wants to
>>> invest the time.
>>>
>>
>> Except for also fixing pg_reload_conf() to pay attention to what happens
>> with
>> postmaster.pid, the patch I submitted does exactly what you want I think.
>>
>> And I don't mind spending time on stuff like this. I'm not smart enough
>> to deal
>> with actual, you know, database technology.
>>
>
> Yeah, lets at least get this wrapped and we can see about improving it.
>
> I like the idea of doing a here-doc or similar in the .pid, though I think
> it'd be sufficient to just prefix all the continuation lines with a tab. An
> uglier option would be just stripping the newlines out.
> --
> Jim Nasby, Data Architect, Blue Treble Consulting
> Data in Trouble? Get it in Treble! http://BlueTreble.com
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>


Re: [HACKERS] Freeze avoidance of very large table.

2015-04-21 Thread Jim Nasby

On 4/21/15 3:21 PM, Robert Haas wrote:

It's possible that we could use this infrastructure to freeze more
aggressively in other circumstances.  For example, perhaps VACUUM
should freeze any page it intends to mark all-visible.  That's not a
guaranteed win, because it might increase WAL volume: setting a page
all-visible does not emit an FPI for that page, but freezing any tuple
on it would, if the page hasn't otherwise been modified since the last
checkpoint.  Even if that were no issue, the freezing itself must be
WAL-logged.  But if we could somehow get to a place where all-visible
=> frozen, then autovacuum would never need to visit all-visible
pages, a huge win.


I don't know how bad the extra WAL traffic would be; we'd obviously need 
to incur it eventually, so it's a question of how common it is for a 
page to go all-visible but then go not-all-visible again before 
freezing. It would presumably be far more traffic than some form of a 
FrozenMap though...



We could also attack the problem from the other end.  Instead of
trying to set the bits on the individual tuples, we could decide that
whenever a page is marked all-visible, we regard it as frozen
regardless of the bits set or not set on the individual tuples.
Anybody who wants to modify the page must freeze any unfrozen tuples
"for real" before clearing the visibility map bit.  This would have
the same end result as the previous idea: all-visible would
essentially imply frozen, and autovacuum could ignore those pages
categorically.


Pushing what's currently background work onto foreground processes 
doesn't seem like a good idea...



I'm not saying those ideas don't have problems, because they do.  But
I think they are worth further exploring.  The main reason I gave up
on that is because Heikki was working on the XID-to-LSN mapping stuff.
That seemed like a better approach than either of the above, so as
long as Heikki was working on that, there wasn't much reason to pursue
more lowbrow approaches.  Clearly, though, we need to do something
about this.  Freezing is a big problem for lots of users.


Did XID-LSN die? I see at the bottom of the thread that it was returned 
with feedback; I guess Heikki just hasn't had time and there are no major 
blockers? From what I remember this is probably a better solution, but 
if it's not going to make it into 9.6 then we should probably at least 
look further into a FM.



All that having been said, I don't think adding a new fork is a good
approach.  We already have problems pretty commonly where our
customers complain about running out of inodes.  Adding another fork
for every table would exacerbate that problem considerably.


Andres' idea of adding this to the VM may work well to handle that. It 
would double the size of the VM, but at two bits per 8 kB heap page that 
is still only 1/32768 of the heap size (roughly a 32,000:1 ratio), or 2MB 
for a 64GB table.

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] GSoC 2015 proposal: Support for microvacuum for GiST

2015-04-21 Thread Ilia Ivanicki
>
> GSoC should be treated as a full-time job, that's how much time you're
>> expected to dedicate to it. Having bachelor's degree exams in June would be
>> a serious problem. You'll need to discuss with the potential mentors on how
>> to make up for that time.
>>
>
My bachelor's diploma is almost done and I will have enough time for GSoC
work.


> Other than that, the schedule seems fairly relaxed. In fact, this project
> seems a bit too small for a GSoC project. I'd suggest coming up with some
> additional GiST-related work that you could do, in addition to the
> microvacuum thing. Otherwise I think there's a risk that you finish the
> patch in May, and have nothing to do for the rest of the summer.


I want to take on an additional work item for GSoC 2015.

I don't know which of the TODO items are already completed, but I have put
together a list of candidates:

1) Add support for microvacuum for the GIN index, together with Anastasiya
Lubennikova (she will implement the amgettuple function in GIN), if that is
a feasible feature.

2) The bug where an index on inet changes the query result (
http://www.postgresql.org/message-id/flat/201010112055.o9bktzf7011...@wwwmaster.postgresql.org#201010112055.o9bktzf7011...@wwwmaster.postgresql.org
)

3) Teach GIN cost estimation about "fast scans" (I know very little about
GIN, but the discussion on the mailing list was interesting to me)

4) pg_restore unusable for expensive matviews (
http://www.postgresql.org/message-id/flat/20140820021530.2534.43...@wrigleys.postgresql.org#20140820021530.2534.43...@wrigleys.postgresql.org
)

5) Maybe the community can suggest something else GiST-related, or perhaps
microvacuum for GiST by itself will be useful for everyone.

Best wishes,
Ivanitskiy Ilya.


[HACKERS] Buffer management improvement wiki page

2015-04-21 Thread Jim Nasby
There's been far more ideas and testing done around improving shared 
buffer management than I can remember, and I suspect I'm not alone in 
that regard. So I've created a wiki page as a place to pull this 
information together. I'll try and keep highlights/important links 
posted there, but help would be welcome. I think that in particular any 
test data would be very useful to post there.


https://wiki.postgresql.org/wiki/Shared_Buffer_Improvements
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2015-04-21 Thread Robert Haas
On Mon, Apr 20, 2015 at 2:53 PM, Jim Nasby  wrote:
> I think that would help, but it still leaves user backends trying to advance
> the clock, which is quite painful. Has anyone tested running the clock in
> the background? We need a wiki page with all the ideas that have been tested
> around buffer management...

Amit's bgreclaimer patch did that, but we weren't able to demonstrate
a clear benefit.  I haven't given up on the idea yet, but we've got to
be able to prove that it's a good idea in practice as well as in
theory.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Update docs in fdwhandler.sgml

2015-04-21 Thread Robert Haas
On Tue, Apr 21, 2015 at 5:45 AM, Etsuro Fujita
 wrote:
> Since we now allow CHECK constraints to be placed on foreign tables, not
> only NOT NULL, I think it'd be better to update docs on considerations
> about constraints on foreign tables in fdwhandler.sgml, so as to provide
> more general considerations.  Please find attached a patch.

Looks good to me, so committed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Streaming replication and WAL archive interactions

2015-04-21 Thread Robert Haas
On Tue, Apr 21, 2015 at 6:55 AM, Heikki Linnakangas  wrote:
> On 04/21/2015 12:04 PM, Michael Paquier wrote:
>> On Tue, Apr 21, 2015 at 4:38 PM, Heikki Linnakangas 
>> wrote:
>>> Note that even though we don't archive the partial last segment on the
>>> previous timeline, the same WAL is copied to the first segment on the new
>>> timeline. So the WAL isn't lost.
>>
>> But if the failed master has archived those segments safely, we may need
>> them, no? I am not sure we can ignore a user who would want to do a PITR
>> with recovery_target_timeline pointing to the one of the failed master.
>
> I think it would be acceptable. If you want to maintain an up-to-the-second
> archive, you can use pg_receivexlog. Mind you, if the standby wasn't
> promoted, the partial segment would not be present in the archive anyway.
> And you can copy the WAL segment manually from 000200XX to
> pg_xlog/000100XX before starting PITR.
>
> Another thought is that we could archive the partial file, but with a
> different name to avoid confusing it with the full segment. For example, we
> could archive a partial 00010012 segment as
> "00020012.0128.partial", where 0128 indicates how
> far that file is valid (this naming is similar to how the backup history
> files are named). Recovery wouldn't automatically pick up those files, but
> the DBA could easily copy the partial file into pg_xlog with the full
> segment's name, if he wants to do PITR to that piece of WAL.

So, suppose you have A replicating to B (via an archive) replicating to C
(via a separate archive); A dies, B is promoted.  It sounds to me like
today this will work and with your proposed change it will require
manual intervention.  I don't think that's OK.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Shouldn't CREATE TABLE LIKE copy the relhasoids property?

2015-04-21 Thread Robert Haas
On Mon, Apr 20, 2015 at 5:41 PM, Bruce Momjian  wrote:
> On Mon, Apr 20, 2015 at 05:04:14PM -0400, Robert Haas wrote:
>> On Mon, Apr 20, 2015 at 4:11 PM, Bruce Momjian  wrote:
>> > Slightly improved patch applied.
>>
>> Is it?
>
> The patch has a slightly modified 'if' statement to check a constant
> before calling a function, and uses else if:
>
> < + if (!interpretOidsOption(stmt->options, true) && cxt.hasoids)
> ---
> > + if (cxt.hasoids && !interpretOidsOption(stmt->options, true))
> 47c57
> < + if (interpretOidsOption(stmt->options, true) && !cxt.hasoids)
> ---
> > + else if (!cxt.hasoids && interpretOidsOption(stmt->options, 
> true))
>
> I realize the change is subtle.

What I meant was - I didn't see an attachment on that message.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Reducing spinlock acquisition within clock sweep loop

2015-04-21 Thread Merlin Moncure
On Tue, Apr 21, 2015 at 3:25 PM, Andres Freund  wrote:
> On 2015-04-21 14:46:04 -0500, Merlin Moncure wrote:
>> The main source of contention, buffer_strategy_lock, has been removed
>> FWICT via 5d7962c6 and d72731a7.  However, during the sweep each tick
>> locks the buffer header via a spinlock in order to adjust
>> usage_count.
>
> FWIW, I think the best approach to achieve this is to make most buffer
> header manipulations lockless. It's not trivial but quite possible. I had
> a very rough POC a while back ([1]); that only degenerated to something
> spinning for a couple edge cases (!BM_VALID || BM_PIN_COUNT_WAITER
> IIRC).  That'd not necessarily reduce the number of atomic ops, but
> making things lockless does away with the problem of sleeping
> while holding a spinlock etc.

Right.  lockless doesn't necessarily translate to better performance
though if you have to increase atomic ops to get away with it.

> The primary use case for that is less the clock sweep than things like
> index root pages and small lookup pages. I've seen several times that the
> spinlock can really hurt there.

Yeah.  I'm trying to address that basically...see below.

>> Clock sweep buffer acquisition currently looks like this:
>> loop:
>> tick();
>> spinlock();
>> in use? goto loop
>> usage_count > 0? yes: decrement, goto loop
>> return buffer;
>>
>> Proposed buffer acquisition would look like:
>> loop:
>> tick();
>> victim < last_victim? yes: atomic re-fetch completePasses (see below)
>> usage_count > completePasses? yes: goto loop
>> try_spinlock(); no lock? goto loop
>> usage_count > completePasses? yes: goto loop
>> in use? goto loop
>> return buffer;
>>
>> (try_spinlock is an adorned TAS() as opposed to a full spin).  Usage
>> count is double-checked after lock acquisition in case the local copy
>> is stale.
>
> I don't really see what this improves? You could do the "try spinlock"
> bit just as well with the old algorithm and I don't really see an
> accuracy benefit in the proposed approach? Maybe I'm missing something
> entirely?

Yes: the 'try lock' isn't necessary here, I just threw it in; it's
been suggested a few times in the past. I see no useful reason why the
clock sweep (using the current algorithm) should sit and spin.
Contended buffers should be hopped over.  The real win is in
eliminating the spinlock during the clock tick except for buffers that
are real candidates for being returned.  refcount is a different
problem with different characteristics; it has to be 100% accurate while
usage_count does not, something which can be exploited (previously I
suggested simply removing all locking around it, but I'm discarding
that idea as there's no way to really prove it works well in unusual
circumstances).

> [1]: The basic trick I have in mind is that by putting flags, usagecount
> and refcount into one integer (in separate bits) it is possible to do
> CAS loops to manipulate each of those. E.g. to decrement usagecount you
> can do something roughly like
>
> do
> {
>     val = pg_atomic_read_u32(desc->flag_and_counts);
>     if (BufferDescUsageCount(val) == 0)
>         break;
>
>     newval = BufferDescChangeUsageCount(val, -1);
> }
> while (!pg_atomic_compare_exchange_u32(&desc->flag_and_counts,
>                                        &val, newval))
>
> Where BufferDescUsageCount just does some bit masking and shifting to
> manipulate the right part. Same for refcount and most flags.
>
> To deal with changing tags I'd introduced a flag value that was used to
> say 'you have to spinlock this time'.

Interesting.  I'm not sure that these ideas are contradictory,
especially since refcount and usage_count are not always manipulated
at the same time.

Maybe you can get the best of both worlds: lockless pin/unpin with a
lock-free sweep. You'd have a fancy 'BufferDescIncreaseUsageCount',
but not a decrease, since completePasses (or some derivative of
it) increases to give you the 'decrease', which is implicit as the
clock sweeps around.  What I'm proposing would reduce the atomic ops
from the current 'atomic increment + spinlock' to a simple
'increment', plus the occasional atomic read to re-fetch completePasses.

OTOH, one tricky bit I didn't mention is that inferring usage_count
from an always-incrementing value and completePasses would require the
'increment' routine to have knowledge of completePasses, which it
previously didn't need for the 'max 5' check. That requires an atomic
read, although perhaps not on every pin; a local copy would
mostly do.
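
For what it's worth, Andres' CAS-loop idea written out as a self-contained
sketch looks something like the following. The bit layout and field names
are purely illustrative (this is not the actual BufferDesc), but the atomic
primitives are the existing ones from port/atomics.h:

    #include "postgres.h"
    #include "port/atomics.h"

    /* illustrative layout: low 18 bits refcount, next 4 bits usage count */
    #define BUF_REFCOUNT_MASK    0x0003FFFFu
    #define BUF_USAGECOUNT_MASK  0x003C0000u
    #define BUF_USAGECOUNT_ONE   0x00040000u

    static inline uint32
    BufferDescUsageCount(uint32 val)
    {
        return (val & BUF_USAGECOUNT_MASK) >> 18;
    }

    /* decrement the usage count without taking the buffer-header spinlock */
    static void
    DecrementUsageCountLockless(pg_atomic_uint32 *flag_and_counts)
    {
        uint32      val = pg_atomic_read_u32(flag_and_counts);

        for (;;)
        {
            uint32      newval;

            if (BufferDescUsageCount(val) == 0)
                break;          /* nothing to decrement */

            newval = val - BUF_USAGECOUNT_ONE;

            /* on failure, 'val' is refreshed with the current value */
            if (pg_atomic_compare_exchange_u32(flag_and_counts, &val, newval))
                break;
        }
    }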

merlin


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Row security violation error is misleading

2015-04-21 Thread Dean Rasheed
On 21 April 2015 at 20:50, Stephen Frost  wrote:
> Thanks a lot for this.  Please take a look at the attached.

I've given this a quick read-through, and it looks good to me. The
interaction of permissive and restrictive policies from hooks matches
my expectations, and it's a definite improvement having tests for RLS
hooks.

The only thing I spotted was that the file comment for
test_rls_hooks.c needs updating.

Is there any documentation for hooks? If not, perhaps that's something
we should be considering too.

Regards,
Dean


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel Seq Scan

2015-04-21 Thread Robert Haas
On Tue, Apr 21, 2015 at 9:38 AM, Amit Kapila  wrote:
> On Mon, Apr 20, 2015 at 10:08 PM, Robert Haas  wrote:
>>
>> On Tue, Apr 7, 2015 at 11:58 PM, Amit Kapila 
>> wrote:
>> > One disadvantage of retaining parallel-paths could be that it can
>> > increase the number of combinations planner might need to evaluate
>> > during planning (in particular during join path evaluation) unless we
>> > do some special handling to avoid evaluation of such combinations.
>>
>> Yes, that's true.  But the overhead might not be very much.  In the
>> common case, many baserels and joinrels will have no parallel paths
>> because the non-parallel paths is known to be better anyway.  Also, if
>> parallelism does seem to be winning, we're probably planning a query
>> that involves accessing a fair amount of data,
>
> Am I understanding right that by the above you mean we should retain both
> the parallel and non-parallel paths only if the parallel path wins over the
> non-parallel path?

Yes.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Turning off HOT/Cleanup sometimes

2015-04-21 Thread Peter Eisentraut
On 4/21/15 4:45 PM, Jim Nasby wrote:
> This comment made me wonder... has anyone considered handing the pruning
> work off to a bgworker, at least for SELECTs? That means the selects
> themselves wouldn't be burdened by the actual prune work, only in
> notifying the bgworker. While that's not going to be free, presumably
> it's a lot cheaper...

The nice thing about having foreground queries do the light cleanup is
that they can work in parallel and naturally hit the interesting parts
of the table first.

In order for a background worker to keep up with some of the workloads
that have been presented as counterexamples, you'd need multiple
background workers operating in parallel and preferring to work on
certain parts of a table.  That would require a lot more sophisticated
job management than we currently have for, say, autovacuum.



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Performance tuning assisted by a GUI application

2015-04-21 Thread Jim Nasby

On 4/16/15 8:42 AM, Jacek Wielemborek wrote:

I had a brief discussion on #postgresql and thought that perhaps there
might be a need for a tool that would enable a fine-tuning of PostgreSQL
performance settings by conveniently testing them with a sample SQL
query with the aid of a simple GUI application. To illustrate this, I
created this little proof of concept:

https://gist.github.com/d33tah/d01f3599e55e53d00f68

Screenshot can be found here: https://imgur.com/TguH6Xq


...


What do you think? Would anybody be interested in an application like this?


Possibly not in hackers, but I think people on -general might be very 
interested. Now whether it's a good idea to hand them such a tool... ;)

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Turning off HOT/Cleanup sometimes

2015-04-21 Thread Robert Haas
On Tue, Apr 21, 2015 at 11:04 AM, Bruce Momjian  wrote:
> Yes, it might be too much optimization to try to get the checkpoint to
> flush all those pages sequentially, but I was thinking of our current
> behavior where, after an update of all rows, we effectively write out
> the entire table because we have dirtied every page.  I guess with later
> prune-based writes, we aren't really writing all the pages as we have
> the pattern where pages with prunable content is kind of random. I guess
> I was just wondering what value there is to your write-then-skip idea,
> vs just writing the first X% of pages we find?  Your idea certainly
> spreads out the pruning, and doesn't require knowing the size of the
> table, though I thought that information was easily determined.
>
> One thing to consider is how we handle pruning of index scans that hit
> multiple heap pages.  Do we still write X% of the pages in the table, or
> X% of the heap pages we actually access via SELECT?  With the
> write-then-skip approach, we would do X% of the pages we access, while
> with the first-X% approach, we would probably prune all of them as we
> would not be accessing most of the table.  I don't think we can do the
> first-X% of pages and have the percentage based on the number of
> pages accessed as we have no way to know how many heap pages we will
> access from the index.  (We would know for bitmap scans, but that
> complexity doesn't seem worth it.)  That would argue, for consistency
> with sequential and index-based heap access, that your approach is best.

I actually implemented something like this for setting hint bits a few
years ago:

http://www.postgresql.org/message-id/aanlktik5qzr8wts0mqcwwmnp-qhgrdky5av5aob7w...@mail.gmail.com
http://www.postgresql.org/message-id/aanlktimgkag7wdu-x77gnv2gh6_qo5ss1u5b6q1ms...@mail.gmail.com

At least in later versions, the patch writes a certain number of
hinted pages, then skips writing a run of pages, then writes another
run of hinted pages.  The basic problem here is that, after the fsync
queue compaction patch went in, the benefits on my tests were pretty
modest.  Yeah, it costs something to write out lots of dirty pages,
but before the fsync queue compaction stuff, the initial scan of an
unhinted table took like 6x the time on the machine I tested on, but
after that, it was like 1.5x the time.  Blunting that spike just
wasn't exciting enough.

It strikes me that it would be better to have an integrated strategy
for this problem.  It doesn't make sense to have one strategy for
deciding whether to set hint bits and a separate strategy for deciding
whether to HOT-prune.  And if we decide to set hint bits and
HOT-prune, it might be smart to try to mark the page all-visible, too,
if it is and we're not about to update it.  I believe we're losing a
lot of performance on OLTP workloads by re-dirtying the same pages
over and over again.  We've probably all hit cases where there is an
obvious loss of performance because of this sort of thing, but I'm
starting to think it's hurting us in a lot of less-obvious ways.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Turning off HOT/Cleanup sometimes

2015-04-21 Thread Jim Nasby

On 4/21/15 10:04 AM, Bruce Momjian wrote:

One thing to consider is how we handle pruning of index scans that hit
multiple heap pages.  Do we still write X% of the pages in the table, or
X% of the heap pages we actually access via SELECT?  With the
write-then-skip approach, we would do X% of the pages we access, while
with the first-X% approach, we would probably prune all of them as we
would not be accessing most of the table.  I don't think we can do the
first-X% of pages and have the percentage based on the number of
pages accessed as we have no way to know how many heap pages we will
access from the index.


This comment made me wonder... has anyone considered handing the pruning 
work off to a bgworker, at least for SELECTs? That means the selects 
themselves wouldn't be burdened by the actual prune work, only in 
notifying the bgworker. While that's not going to be free, presumably 
it's a lot cheaper...

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Turning off HOT/Cleanup sometimes

2015-04-21 Thread Robert Haas
On Mon, Apr 20, 2015 at 6:13 PM, Alvaro Herrera
 wrote:
> Bruce Momjian wrote:
>> On Mon, Apr 20, 2015 at 04:19:22PM -0300, Alvaro Herrera wrote:
>> > Bruce Momjian wrote:
>> > This seems simple to implement: keep two counters, where the second one
>> > is pages we skipped cleanup in.  Once that counter hits SOME_MAX_VALUE,
>> > reset the first counter so that further 5 pages will get HOT pruned.  5%
>> > seems a bit high though.  (In Simon's design, SOME_MAX_VALUE is
>> > essentially +infinity.)
>>
>> This would tend to dirty non-sequential heap pages --- it seems best to
>> just clean as many as we are supposed to, then skip the rest, so we can
>> write sequential dirty pages to storage.
>
> Keep in mind there's a disconnect between dirtying a page and writing it
> to storage.  A page could remain dirty for a long time in the buffer
> cache.  This writing of sequential pages would occur at checkpoint time
> only, which seems the wrong thing to optimize.  If some other process
> needs to evict pages to make room to read some other page in, surely
> it's going to try one page at a time, not write "many sequential dirty
> pages."

Well, for a big sequential scan, we use a ring buffer, so we will
typically be evicting the pages that we ourselves read in moments
before.  So in this case we would do a lot of sequential writes of
dirty pages.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Replication identifiers, take 4

2015-04-21 Thread Andres Freund
On 2015-04-21 16:26:08 -0400, Robert Haas wrote:
> On Tue, Apr 21, 2015 at 8:08 AM, Andres Freund  wrote:
> > I've now named the functions:
> >
> > * pg_replication_origin_create
> > * pg_replication_origin_drop
> > * pg_replication_origin_get (map from name to id)
> > * pg_replication_progress_setup_origin : configure session to replicate
> >   from a specific origin
> > * pg_replication_progress_reset_origin
> > * pg_replication_progress_setup_tx_details : configure per transaction
> >   details (LSN and timestamp currently)
> > * pg_replication_progress_is_replaying : Is an origin configured for the
> >   session
> > * pg_replication_progress_advance : "manually" set the replication
> >   progress to a value. Primarily useful for copying values from other
> >   systems and such.
> > * pg_replication_progress_get : How far has replay progressed for a
> >   certain origin
> > * pg_get_replication_progress : SRF returning the replay progress for
> >   all origins.
> >
> > Any comments?
> 
> Why are we using functions for this rather than DDL?

Unless I'm missing something, the only two we really could use DDL for are
pg_replication_origin_create/pg_replication_origin_drop. We could use
DDL for them if we really want, but I'm not really seeing the advantage.

Greetings,

Andres Freund




Re: [HACKERS] Freeze avoidance of very large table.

2015-04-21 Thread Robert Haas
On Tue, Apr 21, 2015 at 4:27 PM, Andres Freund  wrote:
> On 2015-04-21 16:21:47 -0400, Robert Haas wrote:
>> All that having been said, I don't think adding a new fork is a good
>> approach.  We already have problems pretty commonly where our
>> customers complain about running out of inodes.  Adding another fork
>> for every table would exacerbate that problem considerably.
>
> Really? These days? There are good arguments against another fork
> (increased number of fsyncs, more stat calls, increased number of file
> handles, more WAL logging, ...), but the number of inodes itself
> seems like something halfway recent filesystems should handle.

Not making it up...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Freeze avoidance of very large table.

2015-04-21 Thread Andres Freund
On 2015-04-21 16:21:47 -0400, Robert Haas wrote:
> All that having been said, I don't think adding a new fork is a good
> approach.  We already have problems pretty commonly where our
> customers complain about running out of inodes.  Adding another fork
> for every table would exacerbate that problem considerably.

Really? These days? There are good arguments against another fork
(increased number of fsyncs, more stat calls, increased number of file
handles, more WAL logging, ...), but the number of inodes itself
seems like something halfway recent filesystems should handle.

Greetings,

Andres Freund




Re: [HACKERS] Replication identifiers, take 4

2015-04-21 Thread Robert Haas
On Tue, Apr 21, 2015 at 8:08 AM, Andres Freund  wrote:
> I've now named the functions:
>
> * pg_replication_origin_create
> * pg_replication_origin_drop
> * pg_replication_origin_get (map from name to id)
> * pg_replication_progress_setup_origin : configure session to replicate
>   from a specific origin
> * pg_replication_progress_reset_origin
> * pg_replication_progress_setup_tx_details : configure per transaction
>   details (LSN and timestamp currently)
> * pg_replication_progress_is_replaying : Is an origin configured for the
>   session
> * pg_replication_progress_advance : "manually" set the replication
>   progress to a value. Primarily useful for copying values from other
>   systems and such.
> * pg_replication_progress_get : How far has replay progressed for a
>   certain origin
> * pg_get_replication_progress : SRF returning the replay progress for
>   all origins.
>
> Any comments?

Why are we using functions for this rather than DDL?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Reducing spinlock acquisition within clock sweep loop

2015-04-21 Thread Andres Freund
On 2015-04-21 14:46:04 -0500, Merlin Moncure wrote:
> The main source of contention, buffer_strategy_lock, has been removed
> FWICT via 5d7962c6 and d72731a7.  However, during the sweep each tick
> locks the buffer header via spinlock in order to adjust
> usage_count.

FWIW, I think the best approach to achieve this is to make most buffer
header manipulations lockless. It's not trivial but quite possible. I'd
a very rough POC a while back ([1]); that only degenerated to something
spinning for a couple edge cases (!BM_VALID || BM_PIN_COUNT_WAITER
IIRC).  That'd not necessarily reduce the number of atomic ops, but
making things lockless makes does away with the problem of sleeping
while holding a spinlock etc.

The primary use case for that is less the clock sweep than things like index
root pages and small lookup pages. I've seen several times that the
spinlock can really hurt there.


I do believe that we can't continue with our current clock sweep
approach for much longer. It's already extremely expensive and even if
we fix some of the most glaring problems (as discussed nearby atm), it's
still a horribly cache/runtime inefficient way to do things in many
workloads.

> Right now usage_count ranges across a small number (currently 0-5),
> incrementing upon allocation and decrementing on sweep.  Because of
> this, each tick of the clock sweep spinlocks the buffer in preparation
> of A. the buffer being returned or B. the buffer having a non zero
> usage count so that it can be safely decremented.   B is potential
> optimization fruit, I think.

> The idea:
> Instead, why not let usage_count rotate around on its own in a
> clock-like fashion?  the idea is that, instead of defining usage count
> as a fixed offset from zero, it is based on the number of times the
> buffer has been touched since the last sweep.  In other words, each
> time the buffer clock crosses midnight, 'usage count' for the whole
> pool drops by one, implicitly. Each time a buffer is touched, its
> usage_count is set to 'completePasses' + 1, but, just like today, is
> never allowed to increment past completePasses + 5.  So, what used to
> be usage_count becomes defined as the distance between the usage_count
> and completePasses.
>
> Because of usage_count tracking rotation, calculating what used to be
> usage_count requires a bit of logic to arrive at the right number as
> rotation happens.  I'm pretty sure all the rotation edge cases can be
> handled but I'm glossing over them for now.

> Clock sweep buffer acquisition currently looks like this:
> loop:
> tick();
> spinlock();
> in use? goto loop
> usage_count > 0? yes: decrement, goto loop
> return buffer;
>
> Proposed buffer acquisition would look like:
> loop:
> tick();
> victim < last_victim? yes: atomic re-fetch completePasses (see below)
> usage_count > completePasses? yes: goto loop
> try_spinlock(); no lock? goto loop
> usage_count > completePasses? yes: goto loop
> in use? goto loop
> return buffer;
>
> (try_spinlock is an adorned TAS() as opposed to a full spin).  Usage
> count is double checked after lock acquisition in case the local copy
> is stale.

I don't really see what this improves? You could do the "try spinlock"
bit just as well with the old algorithm and I don't really see an
accuracy benefit in the proposed approach? Maybe I'm missing something
entirely?


[1]: The basic trick I have in mind is that by putting flags, usagecount
and refcount into one integer (in separate bits) it is possible to do
CAS loops to manipulate each of those. E.g. to decrement usagecount you
can do something roughly like

do
{
    val = pg_atomic_read_u32(&desc->flag_and_counts);
    if (BufferDescUsageCount(val) == 0)
        break;

    newval = BufferDescChangeUsageCount(val, -1);
}
while (!pg_atomic_compare_exchange_u32(&desc->flag_and_counts,
                                       &val, newval));

Where BufferDescUsageCount just does some bit masking and shifting to
manipulate the right part. Same for refcount and most flags.

To deal with changing tags I'd introduced a flag value that was used to
say 'you have to spinlock this time'.
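
For illustration, a rough sketch of what such helpers might look like. The
field name flag_and_counts and the specific bit layout (18 refcount bits,
4 usagecount bits, the rest flags) are assumptions made up for this sketch,
not an actual layout:

/* hypothetical packing of refcount, usagecount and flags into one uint32 */
#define BUF_REFCOUNT_BITS       18
#define BUF_USAGECOUNT_BITS     4
#define BUF_USAGECOUNT_SHIFT    BUF_REFCOUNT_BITS
#define BUF_USAGECOUNT_MASK \
    (((1U << BUF_USAGECOUNT_BITS) - 1) << BUF_USAGECOUNT_SHIFT)

/* extract the usage count from the packed word */
#define BufferDescUsageCount(val) \
    (((val) & BUF_USAGECOUNT_MASK) >> BUF_USAGECOUNT_SHIFT)

/* return a copy of the packed word with the usage count adjusted by delta;
 * callers must have checked the bounds beforehand, as in the loop above */
#define BufferDescChangeUsageCount(val, delta) \
    (((val) & ~BUF_USAGECOUNT_MASK) | \
     (((BufferDescUsageCount(val) + (delta)) << BUF_USAGECOUNT_SHIFT) & \
      BUF_USAGECOUNT_MASK))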

Greetings,

Andres Freund




Re: [HACKERS] Freeze avoidance of very large table.

2015-04-21 Thread Robert Haas
On Mon, Apr 20, 2015 at 7:59 PM, Jim Nasby  wrote:
> http://www.postgresql.org/message-id/ca+tgmoaemnolzmvbb8gvy69na8zw9bwpiz9+tlz-lnabozi...@mail.gmail.com
> has a WIP patch that goes the route of using a tuple flag to indicate
> frozen, but also raises a lot of concerns about visibility, because it means
> we'd stop using FrozenXID. That impacts a large amount of code. There were
> some followup patches as well as a bunch of discussion of how to make it
> visible that a tuple was frozen or not. That thread died in January 2014.

Actually, this change has already been made, so it's not so much of a
to-do as a was-done.  See commit
37484ad2aacef5ec794f4dd3d5cf814475180a78.  The immediate thing we got
out of that change is that when CLUSTER or VACUUM FULL rewrite a
table, they now freeze all of the tuples using this method.  See
commits 3cff1879f8d03cb729368722ca823a4bf74c0cac and
af2543e884db06c0beb75010218cd88680203b86.  Previously, CLUSTER or
VACUUM FULL would not freeze anything, which meant that people who
tried to use VACUUM FULL to recover from XID wraparound problems got
nowhere, and even people who knew when to use which tool could end up
having to VACUUM FULL and then VACUUM FREEZE afterward, rewriting the
table twice, an annoyance.

It's possible that we could use this infrastructure to freeze more
aggressively in other circumstances.  For example, perhaps VACUUM
should freeze any page it intends to mark all-visible.  That's not a
guaranteed win, because it might increase WAL volume: setting a page
all-visible does not emit an FPI for that page, but freezing any tuple
on it would, if the page hasn't otherwise been modified since the last
checkpoint.  Even if that were no issue, the freezing itself must be
WAL-logged.  But if we could somehow get to a place where all-visible
=> frozen, then autovacuum would never need to visit all-visible
pages, a huge win.

We could also attack the problem from the other end.  Instead of
trying to set the bits on the individual tuples, we could decide that
whenever a page is marked all-visible, we regard it as frozen
regardless of the bits set or not set on the individual tuples.
Anybody who wants to modify the page must freeze any unfrozen tuples
"for real" before clearing the visibility map bit.  This would have
the same end result as the previous idea: all-visible would
essentially imply frozen, and autovacuum could ignore those pages
categorically.

I'm not saying those ideas don't have problems, because they do.  But
I think they are worth further exploring.  The main reason I gave up
on that is because Heikki was working on the XID-to-LSN mapping stuff.
That seemed like a better approach than either of the above, so as
long as Heikki was working on that, there wasn't much reason to pursue
more lowbrow approaches.  Clearly, though, we need to do something
about this.  Freezing is a big problem for lots of users.

All that having been said, I don't think adding a new fork is a good
approach.  We already have problems pretty commonly where our
customers complain about running out of inodes.  Adding another fork
for every table would exacerbate that problem considerably.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Row security violation error is misleading

2015-04-21 Thread Stephen Frost
Dean,

* Dean Rasheed (dean.a.rash...@gmail.com) wrote:
> On 7 April 2015 at 16:21, Stephen Frost  wrote:
> > Agreed and we actually have a patch from Dean already to address this,
> > it's just been waiting on me (with a couple of other ones).  It'd
> > certainly be great if you have time to take a look at those, though,
> > generally speaking, I feel pretty happy about where those are and believe
> > they really just need to be reviewed/tested and maybe a bit of
> > wordsmithing around the docs.
> 
> The first of those patches [1] has bit-rotted somewhat, due to the
> recent changes to the way rowmarks are handled, so here's an updated
> version. It wasn't a trivial merge, because commit
> cb1ca4d800621dcae67ca6c799006de99fa4f0a5 made a change to
> ExecBuildAuxRowMark() without a matching change to
> preprocess_targetlist(), and one of the new RLS-with-inheritance tests
> fell over that.

Thanks a lot for this.  Please take a look at the attached.  It still
includes the preprocess_targetlist() changes, so the regression tests
don't fail, but that'll be committed independently (hopefully soon).

I've taken an initial look at the second patch (actually, a few times)
and plan to do a thorough review soon.  I'd definitely like to get these
both committed and done with very shortly.  Again, apologies about the
delays; this past weekend was quite a bit busier than I originally
anticipated.

Thanks!

Stephen
From 33cbd926f935dbf3de2302c6c5bb5babebceaacf Mon Sep 17 00:00:00 2001
From: Stephen Frost 
Date: Sun, 19 Apr 2015 18:58:02 -0400
Subject: [PATCH] RLS fixes, new hooks, and new test module

In prepend_row_security_policies(), defaultDeny was always true, so if
there were any hook policies, the RLS policies on the table would just
get discarded.  Fixed to start off with defaultDeny as false and then
properly set later if we detect that only the default deny policy exists
for the internal policies.

The infinite recursion detection in fireRIRrules() didn't properly
manage the activeRIRs list in the case of WCOs, so it would incorrectly
report infinite recursion if the same relation with RLS appeared more
than once in the rtable, for example "UPDATE t ... FROM t ...".

Further, the RLS expansion code in fireRIRrules() was handling RLS in
the main loop through the rtable, which led to RTEs being visited twice
if they contained sublink subqueries, which
prepend_row_security_policies() attempted to handle by exiting early if
the RTE already had securityQuals.  That doesn't work, however, since
if the query involved a security barrier view on top of a table with
RLS, the RTE would already have securityQuals (from the view) by the
time fireRIRrules() was invoked, and so the table's RLS policies would
be ignored.  This is fixed in fireRIRrules() by handling RLS in a
separate loop at the end, after dealing with any other sublink
subqueries, thus ensuring that each RTE is only visited once for RLS
expansion.

The inheritance planner code didn't correctly handle non-target
relations with RLS, which would get turned into subqueries during
planning. Thus an update of the form "UPDATE t1 ... FROM t2 ..." where
t1 has inheritance and t2 has RLS quals would fail.  Fix by making sure
to copy in and update the securityQuals when they exist for non-target
relations.

process_policies() was adding WCOs to non-target relations, which is
unnecessary, and could lead to a lot of wasted time in the rewriter and
the planner. Fix by only adding WCO policies when working on the result
relation.

Lastly, as noted by Dean, we were simply adding policies returned by the
hook provided to the list of quals being AND'd, meaning that they would
actually restrict records returned and there was no option to have
internal policies and hook-based policies work together permissively (as
all internal policies currently work).  Instead, explicitly add support
for both permissive and restrictive policies by having a hook for each
and combining the results appropriately.  To ensure this is all done
correctly, add a new test module (test_rls_hooks) to test the various
combinations of internal, permissive, and restrictive hook policies.

Largely from Dean Rasheed (thanks!):

caezatcvmfufuowwhnbtcgi6aquyjq0-1fykd0t3xbwjvn+x...@mail.gmail.com

Author: Dean Rasheed, though I added the new hooks and test module.
---
 src/backend/optimizer/plan/planner.c   |  45 ++-
 src/backend/optimizer/prep/preptlist.c |  34 +--
 src/backend/rewrite/rewriteHandler.c   | 127 ++---
 src/backend/rewrite/rowsecurity.c  | 248 +++--
 src/include/rewrite/rowsecurity.h  |   9 +-
 src/test/modules/Makefile  |   1 +
 src/test/modules/test_rls_hooks/.gitignore |   4 +
 src/test/modules/test_rls_hooks/Makefile   |  22 ++
 src/test/modules/test_rls_hooks/README |  16 ++
 .../test_rls_hooks/expected/test_rls_hooks.out | 193

[HACKERS] Reducing spinlock acquisition within clock sweep loop

2015-04-21 Thread Merlin Moncure
Background:
The main source of contention, buffer_strategy_lock, has been removed
FWICT via 5d7962c6 and d72731a7.  However, during the sweep each tick
locks the buffer header via spinlock in order to adjust
usage_count.  I believe that removing this lock is possible
optimization fruit.  I doubt that this is a major contention point for
most workloads (over the years, I've only seen one or two cases where
I suspected clock sweep was getting gummed up by buffer locks) but for
systems that are already under lock duress it's hard to argue that
spamming spinlocks in a tight loop makes things easier for the
hardware that is protecting cachelines.  In short, clock sweep heavy
workloads are issuing a lot of atomic ops.  Any savings here is good.

Right now usage_count ranges across a small number (currently 0-5),
incrementing upon allocation and decrementing on sweep.  Because of
this, each tick of the clock sweep spinlocks the buffer in preparation
of A. the buffer being returned or B. the buffer having a non zero
usage count so that it can be safely decremented.   B is potential
optimization fruit, I think.

The idea:
Instead, why not let usage_count rotate around on its own in a
clock-like fashion?  the idea is that, instead of defining usage count
as a fixed offset from zero, it is based on the number of times the
buffer has been touched since the last sweep.  In other words, each
time the buffer clock crosses midnight, 'usage count' for the whole
pool drops by one, implicitly. Each time a buffer is touched, its
usage_count is set to 'completePasses' + 1, but, just like today, is
never allowed to increment past completePasses + 5.  So, what used to
be usage_count becomes defined as the distance between the usage_count
and completePasses.

Because of usage_count tracking rotation, calculating what used to be
usage_count requires a bit of logic to arrive at the right number as
rotation happens.  I'm pretty sure all the rotation edge cases can be
handled but I'm glossing over them for now.
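
To make that concrete, here is a minimal sketch of the distance calculation,
assuming both the per-buffer stamp and completePasses are plain uint32
counters; the name effective_usage_count and the wraparound handling are
illustrative assumptions only:

#define BM_MAX_USAGE_COUNT 5

static inline uint32
effective_usage_count(uint32 buf_stamp, uint32 complete_passes)
{
    /*
     * Buffers record complete_passes + n (n <= BM_MAX_USAGE_COUNT) when
     * touched, so the forward distance above the current complete_passes
     * is the old-style usage_count.  Unsigned subtraction handles both
     * counters wrapping around.
     */
    uint32 delta = buf_stamp - complete_passes;

    /* stamps from older passes show up as huge deltas; treat them as 0 */
    if (delta > BM_MAX_USAGE_COUNT)
        return 0;
    return delta;
}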

Clock sweep buffer acquisition currently looks like this:
loop:
tick();
spinlock();
in use? goto loop
usage_count > 0? yes: decrement, goto loop
return buffer;

Proposed buffer acquisition would look like:
loop:
tick();
victim < last_victim? yes: atomic re-fetch completePasses (see below)
usage_count > completePasses? yes: goto loop
try_spinlock(); no lock? goto loop
usage_count > completePasses? yes: goto loop
in use? goto loop
return buffer;

(try_spinlock is an adorned TAS() as opposed to a full spin).  Usage
count is double checked after lock acquisition in case the local copy
is stale.

Bad stuff:
1. Complexity. Is it worth it?  Calculating proper 'usage_count'
distance in the face of two rotating numbers requires some clever
coding.

2. Unfortunately, since we've (for very good reasons) got a clock
sweep that can be concurrently engaged by many actors, it's no longer
safe to assume that completePasses can be safely examined outside of
the sweep loop but must instead be checked with every tick.  Doing this
with another atomic read would defeat the entire purpose of the patch,
but perhaps we can check it based on a heuristic: it's examined when
victim is less than the last_seen victim (which is locally saved off).

...which would add up to a tiny (zero?) risk of misjudging
usage_distance.  The fallout, if it were to happen, would be to
possibly return a buffer that would have a usage_count = 1 with the
old method.

3. Proper usage_count, a.k.a. distance, is managed relative to
completePasses, which slightly changes the semantics because the entire
buffer pool drops in usage_count at once.  I'm doubtful this matters
though.

Just spitballing here -- wondering if the idea has legs.  I think this
should be compatible with Robert's usage_count tweaking with flags, FWICT.
I'm inclined to work up a POC patch unless this is unworkable for some
reason.  The payoff, for workloads with high sweep activity, would be
drastically reduced spinlock activity and associated cache line
stress.

merlin




Re: [HACKERS] PATCH: Add 'pid' column to pg_replication_slots

2015-04-21 Thread Robert Haas
On Tue, Apr 21, 2015 at 10:54 AM, Andres Freund  wrote:
> On 2015-04-21 10:53:08 -0400, Robert Haas wrote:
>> On Tue, Apr 21, 2015 at 6:17 AM, Craig Ringer  wrote:
>> >> I don't really like the 'pid' field for pg_replication_slots. About
>> >> naming it 'active_in' or such?
>> >
>> > It was originally named active_pid, but changed based on feedback from
>> > others that 'pid' would be consistent with pg_stat_activity and
>> > pg_replication_slots. I have no strong opinion on the name, though I'd
>> > prefer it reflect that the field does in fact represent a process ID.
>>
>> Agreed.  I don't like the as-committed name of active_in either.  It's
>> not at all clear what that means.
>
> I like it being called active_*, that makes the correlation to active
> clear. active_pid then?

wfm

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] parallel mode and parallel contexts

2015-04-21 Thread Robert Haas
On Mon, Apr 20, 2015 at 6:49 PM, Peter Geoghegan  wrote:
> I see that you're using git format-patch to generate this. But the
> patch is only patch 1/4. Is that intentional? Where are the other
> pieces?
>
> I think that the parallel seqscan patch, and the assessing parallel
> safety patch are intended to fit together with this patch, but I can't
> find a place where there is a high level overview explaining just how
> they fit together (I notice Amit's patch has an "#include
> "access/parallel.h", which is here, but that wasn't trivial to figure
> out). I haven't been paying too much attention to this patch series.

The intended order of application is:

- parallel mode/contexts
- assess parallel safety
- parallel heap scan
- parallel seq scan

The parallel heap scan patch should probably be merged into the
parallel seq scan patch, which should probably then be split into
several patches, one or more with general parallel query
infrastructure and then another with stuff specific to parallel
sequential scan.  But we're not quite there yet.  Until we can get
agreement on how to finalize the parallel mode/contexts patch, I
haven't been too worried about getting the other stuff sorted out
exactly right.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] parallel mode and parallel contexts

2015-04-21 Thread Robert Haas
On Tue, Mar 24, 2015 at 11:56 PM, Alvaro Herrera
 wrote:
>> Well, it's not actually the same message.  They're all a bit
>> different.  Or mostly all of them.  And the variable part is not a
>> command name, as in the PreventTransactionChain() case, so it would
>> affect translatability if we did that.
>
> Three things that vary are 1) the fact that some check IsParallelWorker()
> and others check IsInParallelMode(), and 2) that some of them are
> ereports() while others are elog(), and 3) that some are ERROR and
> others are FATAL.  Is there a reason for these differences?

The message text also varies, because it states the particular
operation that is prohibited.

We should check IsParallelWorker() for operations that are allowed in
the master during parallel mode, but not allowed in the workers - e.g.
the master can scan its own temporary relations, but its workers
can't.  We should check IsInParallelMode() for operations that are
completely off-limits in parallel mode - i.e. writes.

We use ereport() where we expect that SQL could hit that check, and
elog() where we expect that only (buggy?) C code could hit that check.

We use FATAL for some of the checks in xact.c, for parity with other,
similar checks in xact.c that relate to checking the transaction
state.  I think this is because we assume that if the transaction
state is hosed, it's necessary to terminate the backend to recover.
In other cases, we use ERROR.
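
As a purely illustrative sketch of that distinction (the call sites and
error texts below are made up, not taken from the patch):

/* completely off-limits in parallel mode, reachable from SQL: ereport */
if (IsInParallelMode())
    ereport(ERROR,
            (errcode(ERRCODE_INVALID_TRANSACTION_STATE),
             errmsg("cannot insert tuples during a parallel operation")));

/* allowed in the master but not in workers, and only buggy C code should
 * ever reach this: elog */
if (IsParallelWorker())
    elog(ERROR, "cannot access temporary relations in a parallel worker");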

> (Note typo "restrction" in quoted paragraph above.)

Thanks, will fix.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] parallel mode and parallel contexts

2015-04-21 Thread Robert Haas
On Tue, Mar 24, 2015 at 3:26 PM, Andres Freund  wrote:
>> You'd need some kind of
>> API that says "pretend I'm waiting for this lock, but don't really
>> wait for it", and you'd need to be darn sure that you removed yourself
>> from the wait queue again before doing any other heavyweight lock
>> manipulation.  Do you have specific thoughts on how to implement this?
>
> I've thought some about this, and I think it's a bit easier to not do it
> on the actual lock waitqueues, but teach deadlock.c about that kind of
> blocking.
>
> deadlock.c is far from simple, and at least I don't find the control
> flow to be particularly clear. So it's not easy.  It'd be advantageous
> to tackle things at that level because it'd avoid the need to acquire
> locks on some lock's waitqueue when blocking; we're going to do that a
> lot.
>
> But It seems to me that it should be possible to suceed: In
> FindLockCycleRecurse(), in the case that we're not waiting for an actual
> lock (checkProc->links.next == NULL) we can add a case that considers
> the 'blocking parallelism' case. ISTM that that's just a
> FindLockCycleRecurse() call on the process that we're waiting for. We'd
> either have to find the other side's locktag for DEADLOCK_INFO or invent
> another field in there; but that seems like a solvable problem.

I (finally) had some time to look at this today.  Initially, it looked
good.  Unfortunately, the longer I looked at it, the less convinced I
got that we could solve the problem this way.  The core of the issue
is that the Funnel node in the parallel group leader basically does
this:

while (!local_scan_done || !remote_scan_done)
{
    attempt a read from each remaining worker's tuple queue,
        blocking only if local_scan_done;
    if (we got a tuple)
        return it;
    else if (there are no remaining workers)
        remote_scan_done = true;

    attempt to produce a tuple just as if we were a worker ourselves;
    if (we got a tuple)
        return it;
    else
        local_scan_done = true;
}

Imagine that we're doing a parallel sequential scan; each worker
claims one page but goes into the tank before it has returned all of
the tuples on that page.  The master reads all the remaining pages but
must now wait for the workers to finish returning the tuples on the
pages they claimed.

So what this means is:

1. The master doesn't actually *wait* until the very end of the parallel phase.
2. When the master does wait, it waits for all of the parallel workers
at once, not each one individually.

So, I don't think anything as simplistic as teaching a blocking
shm_mq_receive() to tip off the deadlock detector that we're waiting
for the process on the other end of that particular queue can ever
work.  Because of point #2, that never happens.  When I first started
thinking about how to fix this, I said, well, that's no big deal, we
can just advertise the whole list of processes that we're waiting for
in shared memory, rather than just the one.  This is a bit tricky,
though. Any general API for any backend to advertise that it's waiting
for an arbitrary subset of the other backends would require O(n^2)
shared memory state.  That wouldn't be completely insane, but it's not
great, either.  For this particular case, we could optimize that down
to O(n) by just chaining all of the children of a given parallel group
leader in a linked list whose nodes are inlined in their PGPROCs, but that
doesn't feel very general, because it forecloses the possibility of
the children ever using that API, and I think they might need to.  If
nothing else, they might need to advertise that they're blocking on
the master if they are trying to send data, the queue is full, and
they have to wait for the master to drain some of it before they can
proceed.

After thinking about it a bit more, I realized that even if we settle
on some solution to that problem, there's another issues: the
wait-edges created by this system don't really have the same semantics
as regular lock waits.  Suppose A is waiting on a lock held by B and B
is waiting for a lock held by A; that's a deadlock.  But if A is
waiting for B to write to a tuple queue and B is waiting for A to read
from a tuple queue, that's not a deadlock if the queues in question
are the same.  If they are different queues, it might be a deadlock,
but then again maybe not.  It may be that A is prepared to accept B's
message from one queue, and that upon fully receiving it, it will do
some further work that will lead it to write a tuple into the other
queue.  If so, we're OK; if not, it's a deadlock.  I'm not sure
whether you'll want to argue that that is an implausible scenario, but
I'm not too sure it is.  The worker could be saying "hey, I need some
additional piece of your backend-local state in order to finish this
computation", and the master could then provide it.  I don't have any
plans like that, but it's been suggested previously by others, so it's
not an obviously nonsensical thing to want to do.

A 

Re: [HACKERS] PATCH: Add 'pid' column to pg_replication_slots

2015-04-21 Thread Bruce Momjian
On Tue, Apr 21, 2015 at 04:54:57PM +0200, Andres Freund wrote:
> On 2015-04-21 10:53:08 -0400, Robert Haas wrote:
> > On Tue, Apr 21, 2015 at 6:17 AM, Craig Ringer  wrote:
> > >> I don't really like the 'pid' field for pg_replication_slots. About
> > >> naming it 'active_in' or such?
> > >
> > > It was originally named active_pid, but changed based on feedback from
> > > others that 'pid' would be consistent with pg_stat_activity and
> > > pg_replication_slots. I have no strong opinion on the name, though I'd
> > > prefer it reflect that the field does in fact represent a process ID.
> > 
> > Agreed.  I don't like the as-committed name of active_in either.  It's
> > not at all clear what that means.
> 
> I like it being called active_*, that makes the correlation to active
> clear. active_pid then?

Let's call it active_procpid.  (Runs for cover!)
 

(For background, see 9.2 release note item:

Rename pg_stat_activity.procpid to pid, to match other system tables
(Magnus Hagander)

The 'p' in 'pid' stands for 'proc', so 'procpid' is redundant.)

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +




Re: [HACKERS] INSERT ... ON CONFLICT IGNORE (and UPDATE) 3.0

2015-04-21 Thread Andres Freund
On 2015-04-21 16:57:45 +0200, Andres Freund wrote:
> * I still think it's unacceptable to redefine
>   XLOG_HEAP_LAST_MULTI_INSERT as XLOG_HEAP_SPECULATIVE_TUPLE like you
>   did. I'll try to find something better.

I think we should "just" split this into different flag values for
insert/update/delete.

I.e. something like

/* flags for heap insert and multi insert */
#define XLH_INSERT_ALL_VISIBLE_CLEARED
#define XLH_INSERT_LAST_MULTI_INSERT
#define XLH_INSERT_IS_SPECULATIVE
#define XLH_INSERT_CONTAINS_NEW_TUPLE

/* flags for update */
#define XLH_UPDATE_OLD_ALL_VISIBLE_CLEARED
#define XLH_UPDATE_NEW_ALL_VISIBLE_CLEARED
#define XLH_UPDATE_CONTAINS_OLD_TUPLE
#define XLH_UPDATE_CONTAINS_OLD_KEY
#define XLH_UPDATE_CONTAINS_NEW_TUPLE
#define XLH_UPDATE_PREFIX_FROM_OLD
#define XLH_UPDATE_SUFFIX_FROM_OLD

/* flags for delete */
#define XLH_DELETE_ALL_VISIBLE_CLEARED
#define XLH_DELETE_CONTAINS_OLD_TUPLE
#define XLH_DELETE_CONTAINS_OLD_KEY
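
Presumably each record type would then get its own bit numbering, so the
same low bits can be reused across insert/update/delete. As a sketch (the
concrete values below are only an assumption):

/* flags for heap insert and multi insert */
#define XLH_INSERT_ALL_VISIBLE_CLEARED      (1<<0)
#define XLH_INSERT_LAST_MULTI_INSERT        (1<<1)
#define XLH_INSERT_IS_SPECULATIVE           (1<<2)
#define XLH_INSERT_CONTAINS_NEW_TUPLE       (1<<3)

/* flags for update; numbering restarts because the record type differs */
#define XLH_UPDATE_OLD_ALL_VISIBLE_CLEARED  (1<<0)
#define XLH_UPDATE_NEW_ALL_VISIBLE_CLEARED  (1<<1)
/* ... and so on for the remaining update and delete flags */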

Greetings,

Andres Freund




Re: [HACKERS] preprocess_targetlist and inheiritance

2015-04-21 Thread Stephen Frost
Alvaro,

On Tuesday, April 21, 2015, Alvaro Herrera  wrote:

> Stephen Frost wrote:
> > Tom, all,
> >
> >   Looks like preprocess_targetlist() should have been adjusted with the
> >   changes to ExecBuildAuxRowMark() to support foreign tables being part
> >   of inheritance trees (cb1ca4d800621dcae67ca6c799006de99fa4f0a5) to
> >   also include the tableoid regardless of the rowMark type, if the
> >   relation is the parent of an inheritance tree.
>
> Uh, this patch was already submitted separately:
>   https://www.postgresql.org/message-id/552cf0b6.8010...@lab.ntt.co.jp


Oh, excellent. Looks like the discussion agrees that it makes sense to do
this, and that the RLS case didn't involve any foreign tables yet still ran
into the issue.

Is that correct or am I missing something?

Apologies, on my mobile currently.

Thanks!

Stephen


Re: [HACKERS] Freeze avoidance of very large table.

2015-04-21 Thread Andres Freund
On 2015-04-22 00:15:53 +0900, Sawada Masahiko wrote:
> On Wed, Apr 22, 2015 at 12:02 AM, Andres Freund  wrote:
> > On 2015-04-21 23:59:45 +0900, Sawada Masahiko wrote:
> >> A page marked frozen could still have dead tuples for now, but I think we
> >> should change it so that a frozen page guarantees the page is all frozen
> >> *and* all visible.
> >
> > It shouldn't. That'd potentially cause corruption after a wraparound. A
> > tuple's visibility might change due to that.
> 
> A page marked frozen could still have some dead tuples, right?

Well, right now we don't really freeze pages, only tuples. But in what
you described above that could happen.

> I think we should clear the FrozenMap bit (and the page header flag) on
> delete operations, and the bit should be set only by vacuum.

Yes.

> So accordingly, a frozen page guarantees that it is all frozen and all
> visible?

I think that's how it has to be, yes.

I *do* wonder if we shouldn't redefine the VM to also contain
information about the frozenness. Having two identically structured maps
that'll often both have to be touched at the same time isn't
nice. Neither is adding another fork.  Given the size of the files
pg_upgrade could be made to rewrite them.  The bigger question is
probably how bad that'd be for index-only efficiency.
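
A rough sketch of that direction, assuming two bits per heap page in the
existing map rather than a new fork (the flag names and the vm_get_status()
lookup below are stand-ins for illustration, not anything from the patch):

/* hypothetical layout: two bits per heap page in one combined map fork */
#define VISIBILITYMAP_ALL_VISIBLE   0x01
#define VISIBILITYMAP_ALL_FROZEN    0x02    /* must imply ALL_VISIBLE */
#define BITS_PER_HEAPBLOCK          2

/* an anti-wraparound vacuum could then skip fully-frozen blocks */
if ((vm_get_status(onerel, blkno, &vmbuffer) & VISIBILITYMAP_ALL_FROZEN) != 0)
    continue;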

Greetings,

Andres Freund




Re: [HACKERS] Fix broken Install.bat when target directory contains a space

2015-04-21 Thread Asif Naeem
Thank you Michael, latest patch looks good to me. I have changed its status
to ready for committer.

Regards,
Muhammad Asif Naeem

On Tue, Apr 21, 2015 at 6:02 PM, Michael Paquier 
wrote:

>
>
> On Tue, Apr 21, 2015 at 4:33 PM, Asif Naeem  wrote:
>
>> The v2 patch looks good to me, just a minor concern on usage message i.e.
>>
>> C:\PG\postgresql\src\tools\msvc>install
>>> Invalid command line options.
>>> Usage: "install.bat  [installtype]"
>>> installtype: client
>>
>>
>> It seems that there are two install options i.e. client, all (any other
>> string other than client is being considered or treated as all), the
>> following install command works i.e.
>>
>> install C:\PG\postgresql\inst option_does_not_exist
>>
>>
>> As your patch affects this area of code, I thought to share these
>> findings with you. BTW, it is a minor thing that can be handled in another
>> patch.
>>
>
> Well, that's the same behavior that this script has had for ages.
> Let's just update the usage message to mention both "all" and "client". I
> see no point in breaking a behavior that has been like that for ages, and
> the main point of this patch is to fix the install path issue.
>
>
>> If you like please feel free to change status to ready for committer.
>>
>
> Well, I don't think that the patch author should do that. So I won't do it
> by myself.
>
> Attached is an updated patch.
> Regards,
> --
> Michael
>


Re: [HACKERS] preprocess_targetlist and inheiritance

2015-04-21 Thread Alvaro Herrera
Stephen Frost wrote:
> Tom, all,
> 
>   Looks like preprocess_targetlist() should have been adjusted with the
>   changes to ExecBuildAuxRowMark() to support foreign tables being part
>   of inheritance trees (cb1ca4d800621dcae67ca6c799006de99fa4f0a5) to
>   also include the tableoid regardless of the rowMark type, if the
>   relation is the parent of an inheritance tree.

Uh, this patch was already submitted separately:
  https://www.postgresql.org/message-id/552cf0b6.8010...@lab.ntt.co.jp

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Replication identifiers, take 4

2015-04-21 Thread Andres Freund
On 2015-04-21 12:20:42 -0300, Alvaro Herrera wrote:
> Andres Freund wrote:
> > Catalog wise there's an actual table 'pg_replication_origin' that maps
> > between 'roident' and 'roname'. There's a pg_replication_progress view
> > (used to be named pg_replication_identifier_progress). I'm not sure if
> > the latter name isn't too generic? Maybe
> > pg_logical_replication_progress?
> 
> I think if we wanted "pg_logical_replication_progress" (and I don't
> really agree that we do) then we would add the "logical" bit to the
> names above as well.  This seems unnecessary.  pg_replication_progress
> seems okay to me.

Cool.

> > * pg_replication_progress_get : How far has replay progressed for a
> >   certain origin
> > * pg_get_replication_progress : SRF returning the replay progress for
> >   all origins.
> 
> This combination seems confusing.  In some other thread not too long ago
> there was the argument that "all functions 'get' something, so that verb
> should not appear in the function name".

> That would call for "pg_replication_progress" on the singleton.

Hm. I don't like that. That'd e.g. clash with the above view. I think
it's good to distinguish between functions (that have a verb in the
name) and views/tables (that don't).

I agree that the above combination isn't optimal. Although pg_get (and
pg_stat_get) is what's used for a lot of other SRF backed views. Maybe
naming the SRF pg_get_all_replication_progress?

> > * pg_replication_progress_setup_tx_details : configure per transaction
> >   details (LSN and timestamp currently)
> 
> Not sure about the "tx" here.  We use "xact" as an abbreviation for
> "transaction" in most places.

Oh, yea. Xact is more consistent.

> If nowadays we don't like that term, maybe just spell out
> "transaction" in full.  I assume this function pairs up with
> pg_replication_progress_setup_origin, yes?

pg_replication_progress_setup_origin sets up the per session state,
setup_xact_details the "per replayed transaction" state.

Greetings,

Andres Freund




Re: [HACKERS] Replication identifiers, take 4

2015-04-21 Thread Alvaro Herrera
Andres Freund wrote:

> I'm working on changing this (I've implemented the missing WAL
> bits). I'd like to discuss the new terms for a sec, before I go and
> revise the docs.
> 
> I'm now calling the feature 'replication progress tracking'. There's
> "replication origins" and there's progress tracking infrastructure that
> tracks how far data from a "replication origin" has replicated.

Sounds good to me.

> Catalog wise there's an actual table 'pg_replication_origin' that maps
> between 'roident' and 'roname'. There's a pg_replication_progress view
> (used to be named pg_replication_identifier_progress). I'm not sure if
> the latter name isn't too generic? Maybe
> pg_logical_replication_progress?

I think if we wanted "pg_logical_replication_progress" (and I don't
really agree that we do) then we would add the "logical" bit to the
names above as well.  This seems unnecessary.  pg_replication_progress
seems okay to me.

> I've now named the functions:
> 
> * pg_replication_origin_create
> * pg_replication_origin_drop
> * pg_replication_origin_get (map from name to id)
> * pg_replication_progress_setup_origin : configure session to replicate
>   from a specific origin
> * pg_replication_progress_reset_origin
> * pg_replication_progress_is_replaying : Is an origin configured for the
>   session
> * pg_replication_progress_advance : "manually" set the replication
>   progress to a value. Primarily useful for copying values from other
>   systems and such.

These all look acceptable to me.

> * pg_replication_progress_get : How far has replay progressed for a
>   certain origin
> * pg_get_replication_progress : SRF returning the replay progress for
>   all origins.

This combination seems confusing.  In some other thread not too long ago
there was the argument that "all functions 'get' something, so that verb
should not appear in the function name".  That would call for
"pg_replication_progress" on the singleton.  Maybe to distinguish the
SRF, add "all" as a suffix?

> * pg_replication_progress_setup_tx_details : configure per transaction
>   details (LSN and timestamp currently)

Not sure about the "tx" here.  We use "xact" as an abbreviation for
"transaction" in most places.  If nowadays we don't like that term,
maybe just spell out "transaction" in full.  I assume this function
pairs up with pg_replication_progress_setup_origin, yes?

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Freeze avoidance of very large table.

2015-04-21 Thread Sawada Masahiko
On Wed, Apr 22, 2015 at 12:02 AM, Andres Freund  wrote:
> On 2015-04-21 23:59:45 +0900, Sawada Masahiko wrote:
>> A page marked frozen could still have dead tuples for now, but I think we
>> should change it so that a frozen page guarantees the page is all frozen
>> *and* all visible.
>
> It shouldn't. That'd potentially cause corruption after a wraparound. A
> tuple's visibility might change due to that.

A page marked frozen could still have some dead tuples, right?
I think we should clear the FrozenMap bit (and the page header flag) on
delete operations, and the bit should be set only by vacuum.
So accordingly, a frozen page guarantees that it is all frozen and all visible?

Regards,

---
Sawada Masahiko




[HACKERS] preprocess_targetlist and inheiritance

2015-04-21 Thread Stephen Frost
Tom, all,

  Looks like preprocess_targetlist() should have been adjusted with the
  changes to ExecBuildAuxRowMark() to support foreign tables being part
  of inheritance trees (cb1ca4d800621dcae67ca6c799006de99fa4f0a5) to
  also include the tableoid regardless of the rowMark type, if the
  relation is the parent of an inheritance tree.

  This was noted by Dean Rasheed while working on RLS since it was
  causing one of the new RLS-with-inheritance regression tests to fail
  with: ERROR:  could not find junk tableoid1 column

  This does change the output a bit in the regression tests due to the
  change in ordering of the columns displayed by explain.

  Patch attached for your review.

  Thoughts?

  I'm happy to push this if no one has any issues with it, but also
  won't object if you'd prefer to.

Thanks!

Stephen
From c7af0f658666769f010643bba9c7467ddc42c13c Mon Sep 17 00:00:00 2001
From: Stephen Frost 
Date: Tue, 21 Apr 2015 09:52:09 -0400
Subject: [PATCH] Pull in tableoid for inheiritance with rowMarks

As noted by Dean Rasheed[1], cb1ca4d800621dcae67ca6c799006de99fa4f0a5
changed ExecBuildAuxRowMark() to always look for the tableoid in the
target list, but didn't also change preprocess_targetlist() to always
include the tableoid.  This resulted in errors with soon-to-be-added RLS
with inheritance tests, though I suspect there could be other issues
arising from this.

Pushing this independently as it's not directly related to the other RLS
changes which are coming.

Author: Dean Rasheed (extracted from his rls.v6.patch).

[1] caezatcvmfufuowwhnbtcgi6aquyjq0-1fykd0t3xbwjvn+x...@mail.gmail.com
---
 contrib/postgres_fdw/expected/postgres_fdw.out | 52 +-
 src/backend/optimizer/prep/preptlist.c | 34 -
 2 files changed, 43 insertions(+), 43 deletions(-)

diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 783cb41..93e9836 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -3193,26 +3193,26 @@ select * from bar where f1 in (select f1 from foo) for update;
   QUERY PLAN  
 --
  LockRows
-   Output: bar.f1, bar.f2, bar.ctid, bar.tableoid, bar.*, foo.ctid, foo.tableoid, foo.*
+   Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
->  Hash Join
- Output: bar.f1, bar.f2, bar.ctid, bar.tableoid, bar.*, foo.ctid, foo.tableoid, foo.*
+ Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
  Hash Cond: (bar.f1 = foo.f1)
  ->  Append
->  Seq Scan on public.bar
- Output: bar.f1, bar.f2, bar.ctid, bar.tableoid, bar.*
+ Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
->  Foreign Scan on public.bar2
- Output: bar2.f1, bar2.f2, bar2.ctid, bar2.tableoid, bar2.*
+ Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
  ->  Hash
-   Output: foo.ctid, foo.tableoid, foo.*, foo.f1
+   Output: foo.ctid, foo.*, foo.tableoid, foo.f1
->  HashAggregate
- Output: foo.ctid, foo.tableoid, foo.*, foo.f1
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
  Group Key: foo.f1
  ->  Append
->  Seq Scan on public.foo
- Output: foo.ctid, foo.tableoid, foo.*, foo.f1
+ Output: foo.ctid, foo.*, foo.tableoid, foo.f1
->  Foreign Scan on public.foo2
- Output: foo2.ctid, foo2.tableoid, foo2.*, foo2.f1
+ Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
 (22 rows)
 
@@ -3230,26 +3230,26 @@ select * from bar where f1 in (select f1 from foo) for share;
   QUERY PLAN  
 --
  LockRows
-   Output: bar.f1, bar.f2, bar.ctid, bar.tableoid, bar.*, foo.ctid, foo.tableoid, foo.*
+   Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
->  Hash Join
- Output: bar.f1, bar.f2, bar.ctid, bar.tableoid, bar.*, foo.ctid, foo.tableoid, foo.*
+ Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
  Hash Cond: (bar.f1 = foo.f1)
  ->  Append
  

Re: [HACKERS] WIP Patch for GROUPING SETS phase 1

2015-04-21 Thread Andrew Gierth
> "Svenne" == Svenne Krap  writes:

 Svenne> I have the explains,

Can you post the explain analyze outputs?

If need be, you can anonymize the table and column names and any
identifiers by using the anonymization option of explain.depesz.com, but
please only do that if you actually need to.

-- 
Andrew (irc:RhodiumToad)




Re: [HACKERS] Turning off HOT/Cleanup sometimes

2015-04-21 Thread Bruce Momjian
On Mon, Apr 20, 2015 at 07:13:38PM -0300, Alvaro Herrera wrote:
> Bruce Momjian wrote:
> > On Mon, Apr 20, 2015 at 04:19:22PM -0300, Alvaro Herrera wrote:
> > > Bruce Momjian wrote:
> > > 
> > > This seems simple to implement: keep two counters, where the second one
> > > is pages we skipped cleanup in.  Once that counter hits SOME_MAX_VALUE,
> > > reset the first counter so that a further 5 pages will get HOT pruned.  5%
> > > seems a bit high though.  (In Simon's design, SOME_MAX_VALUE is
> > > essentially +infinity.)
> > 
> > This would tend to dirty non-sequential heap pages --- it seems best to
> > just clean as many as we are supposed to, then skip the rest, so we can
> > write sequential dirty pages to storage.
> 
> Keep in mind there's a disconnect between dirtying a page and writing it
> to storage.  A page could remain dirty for a long time in the buffer
> cache.  This writing of sequential pages would occur at checkpoint time
> only, which seems the wrong thing to optimize.  If some other process
> needs to evict pages to make room to read some other page in, surely
> it's going to try one page at a time, not write "many sequential dirty
> pages."

Yes, it might be too much optimization to try to get the checkpoint to
flush all those pages sequentially, but I was thinking of our current
behavior where, after an update of all rows, we effectively write out
the entire table because we have dirtied every page.  I guess with later
prune-based writes, we aren't really writing all the pages as we have
the pattern where pages with prunable content are kind of random. I guess
I was just wondering what value there is to your write-then-skip idea,
vs just writing the first X% of pages we find?  Your idea certainly
spreads out the pruning, and doesn't require knowing the size of the
table, though I thought that information was easily determined.

One thing to consider is how we handle pruning of index scans that hit
multiple heap pages.  Do we still write X% of the pages in the table, or
X% of the heap pages we actually access via SELECT?  With the
write-then-skip approach, we would do X% of the pages we access, while
with the first-X% approach, we would probably prune all of them as we
would not be accessing most of the table.  I don't think we can do the
first X% of pages and have the percentage based on the number of
pages accessed as we have no way to know how many heap pages we will
access from the index.  (We would know for bitmap scans, but that
complexity doesn't seem worth it.)  That would argue, for consistency
with sequential and index-based heap access, that your approach is best.
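
For reference, a minimal sketch of the two-counter prune/skip scheme quoted
above; the names pruned_count, skipped_count, PRUNE_BATCH and SKIP_BATCH are
placeholders rather than anything from an actual patch:

#define PRUNE_BATCH 5       /* pages we prune before starting to skip */
#define SKIP_BATCH  1000    /* SOME_MAX_VALUE in the discussion above */

static int  pruned_count = 0;   /* pages pruned since the last reset */
static int  skipped_count = 0;  /* pages skipped since the last reset */

static bool
should_prune_page(void)
{
    if (pruned_count < PRUNE_BATCH)
    {
        pruned_count++;
        return true;
    }
    if (++skipped_count >= SKIP_BATCH)
    {
        /* let a further PRUNE_BATCH pages get HOT pruned */
        pruned_count = 0;
        skipped_count = 0;
    }
    return false;
}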

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +




Re: [HACKERS] Freeze avoidance of very large table.

2015-04-21 Thread Andres Freund
On 2015-04-21 23:59:45 +0900, Sawada Masahiko wrote:
> A page marked frozen could still have dead tuples for now, but I think we
> should change it so that a frozen page guarantees the page is all frozen
> *and* all visible.

It shouldn't. That'd potentially cause corruption after a wraparound. A
tuple's visibility might change due to that.

Greetings,

Andres Freund




[HACKERS] Freeze avoidance of very large table.

2015-04-21 Thread Sawada Masahiko
On Tue, Apr 21, 2015 at 7:00 AM, Jim Nasby  wrote:
> On 4/20/15 2:45 AM, Sawada Masahiko wrote:
>>
>> The current patch adds a new source file src/backend/access/heap/frozenmap.c
>> which is quite similar to visibilitymap.c. They have similar code but
>> are separated for now. I can refactor this source code, e.g. by adding
>> bitmap.c, if needed.
>

Thank you for having a look at this patch.

>
> My feeling is we'd definitely want this refactored; it looks to be a whole
> lot of duplicated code. But before working on that we should get consensus
> that a FrozenMap is a good idea.

Yes, we need to get consensus about FrozenMap before starting work on it.
In addition to the comments you pointed out, I noticed one problem I should
address: the FrozenMap bit needs to be cleared on deletion (i.e. when xmax
is set).
A page marked frozen could still have dead tuples for now, but I think we
should change it so that a frozen page guarantees the page is all frozen
*and* all visible.

> Are there any meaningful differences between the two, besides the obvious
> name changes?

No, there aren't.

> I think there's also a bunch of XLOG stuff that could be refactored too...

I agree with you.

>> Also, when skipping vacuum by visibility map, we can skip at least
>> SKIP_PAGE_THESHOLD consecutive pages, but there is no such mechanism in the
>> frozen map.
>
>
> That's probably something else that can be factored out, since it's
> basically the same logic. I suspect we just need to && some of the checks so
> we're looking at both FM and VM at the same time.

The FrozenMap is used to skip scanning only during anti-wraparound vacuum or
when freezing all tuples (i.e. scan_all is true).
A normal vacuum uses only the VM and doesn't use the FM for now.

> Other comments...
>
> It would be nice if we didn't need another page bit for FM; do you see any
> reasonable way that could happen?

We may be able to remove the FM page bit from the page header, but I'm not
sure we can do that.

> +* If we didn't pin the visibility(and frozen) map page and the page has
> +* become all visible(and frozen) while we were busy locking the buffer,
> +* or during some subsequent window during which we had it unlocked,
> +* we'll have to unlock and re-lock, to avoid holding the buffer lock
> +* across an I/O.  That's a bit unfortunate, especially since we'll now
> +* have to recheck whether the tuple has been locked or updated under us,
> +* but hopefully it won't happen very often.
>  */
>
> s/(and frozen)/ or frozen/
>
>
> + * Reply XLOG_HEAP3_FROZENMAP record.
> s/Reply/Replay/

Understood.

>
> +   /*
> +* XLogReplayBufferExtended locked the buffer. But frozenmap_set
> +* will handle locking itself.
> +*/
> +   LockBuffer(fmbuffer, BUFFER_LOCK_UNLOCK);
>
> Doesn't this create a race condition?
>
>
> Are you sure the bit in finish_heap_swap() is safe? If so, we should add
> the same for the visibility map too (it certainly better be all visible
> if it's frozen...)

We cannot ensure a page is all visible even if we execute VACUUM FULL,
because dead tuples could remain, e.g. when another process inserts and
updates the same tuple in the same transaction before VACUUM FULL.
I was thinking that the FrozenMap was free of the influence of delete
operations. But as I said at the top of this mail, the FrozenMap bit needs
to be cleared on deletion.
So I will remove the related code as you mentioned.

>
>
>
> +       /*
> +        * Current block is all-visible.
> +        * If frozen map represents that it's all frozen and this
> +        * function is called for freezing tuples, we can skip to
> +        * vacuum block.
> +        */
>
> I would state this as "Even if scan_all is true, we can skip blocks that
> are marked as frozen."
>
> +       if (frozenmap_test(onerel, blkno, &fmbuffer) && scan_all)
>
> I suspect it's faster to reverse those tests (scan_all &&
> frozenmap_test())... but why do we even need to look at scan_all? AFAICT
> if a block is frozen we can skip it unconditionally.

In the current patch, a tuple that is frozen and dead could remain in a page
that is marked all frozen.
I.e., it is possible for a page to be marked frozen even though it is not all
visible.
But I'm thinking of changing that.

>
>
> +       /*
> +        * If the un-frozen tuple is remaining in current page and
> +        * current page is marked as ALL_FROZEN, we should clear it.
> +        */
>
> That needs to NEVER happen. If it does then we're going to consider tuples
> as visible/frozen that shouldn't be. We should probably throw an error here,
> because it means the heap is now corrupted. At the minimum it needs to be an
> assert().

I understood. I'll fix it.

> Note that I haven'

Re: [HACKERS] INSERT ... ON CONFLICT IGNORE (and UPDATE) 3.0

2015-04-21 Thread Andres Freund
On 2015-04-19 21:37:51 -0700, Peter Geoghegan wrote:
> Attached patch, V3.4, implements what I believe you and Heikki have in
> mind here.

I'm not 100% sure Heikki and I are on exactly the same page here :P

I'm looking at git diff $(git merge-base upstream/master HEAD).. where
HEAD is e1a5822d164db0.

* The logical stuff looks much saner.

* Please add tests for the logical decoding stuff. Probably both a plain
  regression test and an isolationtester test in
  contrib/test_decoding. Including one that does spooling to disk.

* I don't like REORDER_BUFFER_CHANGE_INTERNAL_INSERT/DELETE as names. Why not
  _SPECINSERT and _SPECDELETE or such?

* Iff we're going to have the XLOG_HEAP_AFFIRM record, I'd rather have
  that guide the logical decoding code. Seems slightly cleaner.

* Still not a fan of the name 'arbiter' for the OC indexes.

* Gram.y needs a bit more discussion:
  * Can anybody see a problem with changing the precedence of DISTINCT &
ON to nonassoc? Right now I don't see a problem given both are
reserved keywords already.
The reason the conflict exists AFAICS is because something like
INSERT INTO foo SELECT DISTINCT ON CONFLICT IGNORE;
is allowed by the grammar. The need for the nonassoc could be
avoided by requiring DISTINCT to be followed by a column. We
currently *do* enforce that, just not in the parser (c.f.
transformDistinctClause). That requires one more production in
simple_select, and a nonoptional distinct clause.
I've queued up a commit cleaning this up in my repo, feel free to
merge and polish.
  * UpdateInsertStmt is a horrible name. OnConflictUpdateStmt maybe?
  * '(' index_params where_clause ')' is imo rather strange. The where
clause is inside the parens? That's quite different from the
original index clause.
* SPEC_IGNORE,  /* INSERT of "ON CONFLICT IGNORE" */ looks like
  a wrongly copied comment.
* The indentation in ReorderBufferCommit is clearly getting out of hand,
  I've queued up a commit cleaning this up in my repo, feel free to merge.
* I don't think we use the term 'ordinary table' in error messages so
  far.
* I still think it's unacceptable to redefine
  XLOG_HEAP_LAST_MULTI_INSERT as XLOG_HEAP_SPECULATIVE_TUPLE like you
  did. I'll try to find something better.
* I wonder if we now couldn't avoid changing heap_delete's API - we can
  always super delete if we find a speculative insertion now. It'd be
  nice not to break out-of-core callers if not necessary.
* breinbaas on IRC just mentioned that it'd be nice to have upsert as a
  link in the insert. Given that that's the pervasive term that doesn't
  seem absurd.

I think this is getting closer to a commit. Let's get this done.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PATCH: Add 'pid' column to pg_replication_slots

2015-04-21 Thread Andres Freund
On 2015-04-21 10:53:08 -0400, Robert Haas wrote:
> On Tue, Apr 21, 2015 at 6:17 AM, Craig Ringer  wrote:
> >> I don't really like the 'pid' field for pg_replication_slots. How about
> >> naming it 'active_in' or such?
> >
> > It was originally named active_pid, but changed based on feedback from
> > others that 'pid' would be consistent with pg_stat_activity and
> > pg_replication_slots. I have no strong opinion on the name, though I'd
> > prefer it reflect that the field does in fact represent a process ID.
> 
> Agreed.  I don't like the as-committed name of active_in either.  It's
> not at all clear what that means.

I like it being called active_*, that makes the correlation to active
clear. active_pid then?
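With that rename the view would end up looking something like this (just a
sketch, assuming the rename goes in):

    SELECT slot_name, plugin, active, active_pid FROM pg_replication_slots;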

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PATCH: Add 'pid' column to pg_replication_slots

2015-04-21 Thread Robert Haas
On Tue, Apr 21, 2015 at 6:17 AM, Craig Ringer  wrote:
>> I don't really like the 'pid' field for pg_replication_slots. How about
>> naming it 'active_in' or such?
>
> It was originally named active_pid, but changed based on feedback from
> others that 'pid' would be consistent with pg_stat_activity and
> pg_replication_slots. I have no strong opinion on the name, though I'd
> prefer it reflect that the field does in fact represent a process ID.

Agreed.  I don't like the as-committed name of active_in either.  It's
not at all clear what that means.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel Seq Scan

2015-04-21 Thread Amit Kapila
On Mon, Apr 20, 2015 at 10:08 PM, Robert Haas  wrote:
>
> On Tue, Apr 7, 2015 at 11:58 PM, Amit Kapila 
wrote:
> > One disadvantage of retaining parallel-paths could be that it can
> > increase the number of combinations planner might need to evaluate
> > during planning (in particular during join path evaluation) unless we
> > do some special handling to avoid evaluation of such combinations.
>
> Yes, that's true.  But the overhead might not be very much.  In the
> common case, many baserels and joinrels will have no parallel paths
> because the non-parallel paths is known to be better anyway.  Also, if
> parallelism does seem to be winning, we're probably planning a query
> that involves accessing a fair amount of data,

Am I understanding right that by the above you mean we should retain both the
parallel and non-parallel paths only if the parallel path wins over the
non-parallel path?

If yes, then I can understand the advantage of retaining both parallel and
non-parallel paths; otherwise, could you explain some more
why you think it is advantageous to retain the parallel path even when it
loses to the serial path in the beginning?


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Fix broken Install.bat when target directory contains a space

2015-04-21 Thread Michael Paquier
On Tue, Apr 21, 2015 at 4:33 PM, Asif Naeem  wrote:

> The v2 patch looks good to me, just a minor concern on usage message i.e.
>
> C:\PG\postgresql\src\tools\msvc>install
>> Invalid command line options.
>> Usage: "install.bat <targetdir> [installtype]"
>> installtype: client
>
>
> It seems that there are two install options i.e. client, all (any other
> string other than client is being considered or treated as all), the
> following install command works i.e.
>
> install C:\PG\postgresql\inst option_does_not_exist
>
>
> As your patch affects this area of code, I thought to share these findings
> with you. BTW, it is a minor thing that can be handled in another patch.
>

Well, that's the same behavior that this script has been having for ages.
Let's just update the usage message to mention both "all" and "client". I
see no point in breaking a behavior that has been like that for ages, and
the main point of this patch is to fix the install path issue.


> If you like please feel free to change status to ready for committer.
>

Well, I don't think that the patch author should do that. So I won't do it
by myself.

Attached is an updated patch.
Regards,
-- 
Michael
diff --git a/src/tools/msvc/install.bat b/src/tools/msvc/install.bat
index bed08f1..b84fef0 100644
--- a/src/tools/msvc/install.bat
+++ b/src/tools/msvc/install.bat
@@ -1,10 +1,11 @@
 @echo off
 REM src/tools/msvc/install.bat
 
-if NOT "%1"=="" GOTO RUN_INSTALL
+if NOT [%1]==[] GOTO RUN_INSTALL
 
 echo Invalid command line options.
-echo Usage: "install.bat <targetdir>"
+echo Usage: "install.bat <targetdir> [installtype]"
+echo installtype: all client
 echo.
 REM exit fix for pre-2003 shell especially if used on buildfarm
 if "%XP_EXIT_FIX%" == "yes" exit 1
@@ -20,7 +21,7 @@ CALL bldenv.bat
 del bldenv.bat
 :nobuildenv
 
-perl install.pl "%1" %2
+perl install.pl %1 %2
 
 REM exit fix for pre-2003 shell especially if used on buildfarm
 if "%XP_EXIT_FIX%" == "yes" exit %ERRORLEVEL%
diff --git a/src/tools/msvc/install.pl b/src/tools/msvc/install.pl
index 97e297e..62ef21e 100755
--- a/src/tools/msvc/install.pl
+++ b/src/tools/msvc/install.pl
@@ -15,6 +15,6 @@ Install($target, $insttype);
 sub Usage
 {
 	print "Usage: install.pl  [installtype]\n";
-	print "installtype: client\n";
+	print "installtype: all client\n";
 	exit(1);
 }

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Replication identifiers, take 4

2015-04-21 Thread Andres Freund
On 2015-04-20 10:28:02 +0200, Andres Freund wrote:
> On 2015-04-20 11:26:29 +0300, Heikki Linnakangas wrote:
> > I just realized that it talks about "replication identifier" as the new
> > fundamental concept. The system table is called "pg_replication_identifier".
> > But that's like talking about "index identifiers", instead of just indexes,
> > and calling the system table pg_index_oid.
> >
> > The important concept this patch actually adds is the *origin* of each
> > transaction. That term is already used in some parts of the patch. I think
> > we should roughly do a search-replace of "replication identifier" ->
> > "replication origin" to the patch. Or even "transaction origin".
>
> Sounds good to me.

I'm working on changing this (I've implemented the missing WAL
bits). I'd like to discuss the new terms for a sec, before I go and
revise the docs.

I'm now calling the feature 'replication progress tracking'. There's
"replication origins" and there's progress tracking infrastructure that
tracks how far data from a "replication origin" has replicated.

Catalog wise there's an actual table 'pg_replication_origin' that maps
between 'roident' and 'roname'. There's a pg_replication_progress view
(used to be named pg_replication_identifier_progress). I'm not sure if
the latter name isn't too generic? Maybe
pg_logical_replication_progress?

I've now named the functions:

* pg_replication_origin_create
* pg_replication_origin_drop
* pg_replication_origin_get (map from name to id)
* pg_replication_progress_setup_origin : configure session to replicate
  from a specific origin
* pg_replication_progress_reset_origin
* pg_replication_progress_setup_tx_details : configure per transaction
  details (LSN and timestamp currently)
* pg_replication_progress_is_replaying : Is an origin configured for the session?
* pg_replication_progress_advance : "manually" set the replication
  progress to a value. Primarily useful for copying values from other
  systems and such.
* pg_replication_progress_get : How far has replay progressed for a
  certain origin?
* pg_get_replication_progress : SRF returning the replay progress for
  all origins.
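To make that a bit more concrete, a session applying changes from a remote
node might do something along these lines (just a sketch of the proposed API;
the argument types are guesses and 'node_a' is a made-up origin name):

    SELECT pg_replication_origin_create('node_a');
    SELECT pg_replication_progress_setup_origin('node_a');
    -- before applying each remote transaction:
    SELECT pg_replication_progress_setup_tx_details('0/12345678', now());
    -- later, e.g. from a monitoring session:
    SELECT pg_replication_progress_get('node_a');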

Any comments?

Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel Seq Scan

2015-04-21 Thread Amit Kapila
On Tue, Apr 21, 2015 at 6:34 AM, Amit Langote  wrote:
> On 2015-04-21 AM 03:29, Robert Haas wrote:
> > On Wed, Apr 8, 2015 at 3:38 AM, Amit Langote wrote:
> >> On 08-04-2015 PM 12:46, Amit Kapila wrote:
> >>> Going forward, I think we can improve the same if we decide not to
shutdown
> >>> parallel workers till postmaster shutdown once they are started and
> >>> then just allocate them during executor-start phase.
> >>
> >> I wonder if it makes sense to invent the notion of a global pool of
workers
> >> with configurable number of workers that are created at postmaster
start and
> >> destroyed at shutdown and requested for use when a query uses
parallelizable
> >> nodes.
> >
> > Short answer: Yes, but not for the first version of this feature.
> >
>
> Agreed.
>
> Perhaps, Amit has worked (is working) on "reuse the same workers for
> subsequent operations within the same query"
>

What I am planning to do is destroy the resources (parallel context) once
we have fetched all the tuples from the Funnel node, so that we don't block
all resources till the end of execution.  We can't really call that reuse;
rather, it will allow multiple nodes in the same statement to use workers when
there is a restriction on the total number of workers (max_worker_processes)
that can be used.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Logical Decoding follows timelines

2015-04-21 Thread Simon Riggs
On 21 April 2015 at 05:49, Michael Paquier 
wrote:

> On Fri, Feb 13, 2015 at 4:57 PM, Michael Paquier wrote:
> > Moved patch to CF 2015-02 to not lose track of it, also because it does
> not
> > seem it received a proper review.
>
> This patch does not apply anymore, so attached is a rebased version.
> The comments mentioned here have not been addressed:
> http://www.postgresql.org/message-id/54a7bf61.9080...@vmware.com
> Also, what kind of tests have been done? Logical decoding cannot be
> used while a node is in recovery.
>

Returned with Feedback, I think. I have a new approach to be coded for next
release.

-- 
Simon Riggs    http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Parallel Seq Scan

2015-04-21 Thread Amit Kapila
On Tue, Apr 21, 2015 at 2:29 PM, David Rowley  wrote:
>
> I've also been thinking about how, instead of having to have a special
> PartialSeqScan node which contains a bunch of code to store tuples in a
> shared memory queue, could we not have a "TupleBuffer", or
> "ParallelTupleReader" node, one of which would always be the root node of a
> plan branch that's handed off to a worker process. This node would just try
> to keep its shared tuple store full, and perhaps once it fills it could
> have a bit of a sleep and be woken up when there's a bit more space on the
> queue. When no more tuples were available from the node below this, then
> the worker could exit. (providing there was no rescan required)
>
> I think between the Funnel node and a ParallelTupleReader we could
> actually parallelise plans that don't even have parallel safe nodes. Let
> me explain:
>
> Let's say we have a 4 way join, and the join order must be {a,b}, {c,d} =>
> {a,b,c,d}. Assuming the cost of joining a to b and c to d are around the
> same, the Parallelizer may notice this and decide to inject a Funnel and
> then ParallelTupleReader just below the node for c join d and have c join d
> in parallel. Meanwhile the main worker process could be executing the root
> node, as normal. This way the main worker wouldn't have to go to the
> trouble of joining c to d itself as the worker would have done all that
> hard work.
>
> I know the current patch is still very early in the evolution of
> PostgreSQL's parallel query, but how would that work with the current
> method of selecting which parts of the plan to parallelise?
>

The Funnel node is quite generic and can handle the case you describe if we
add a Funnel on top of the join node (c join d).
It currently passes the plannedstmt to the worker, which can contain any
type of plan (though we need to add some more code to make it work if we
want to execute any node other than a Result or PartialSeqScan node).


> I really think the plan needs to be a complete plan before it can be best
> analysed on how to divide the workload between workers, and also, it would
> be quite useful to know how many workers are going to be able to lend a
> hand in order to know best how to divide the plan up as evenly as possible.
>
>
I think there is some advantage in changing an already-built plan into a
parallel plan based on resources, and there is some literature about that,
but I think we will lose much more by not considering parallelism during
planning time.  If I remember correctly, some of the other databases do
tackle this problem of a shortage of resources during execution, as I
mentioned upthread, but I don't think that requires a Parallel Planner as a
separate layer.
I believe it is important to have some way to handle a shortage of resources
during execution, but that can be done at a later stage.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] installcheck missing in src/bin/pg_rewind/Makefile

2015-04-21 Thread Heikki Linnakangas

On 04/21/2015 07:13 AM, Michael Paquier wrote:

Hi all,

As mentioned in $subject, the TAP tests of pg_rewind are currently not
run by buildfarm machines as the buildfarm client uses installcheck to
run the tests in src/bin. A patch is attached to correct the problem.


Thanks, applied.

(I left out the installcheck target originally because I somehow thought 
that it would run the tests against an existing cluster, which wouldn't 
work. But of course that's not how it works. It uses the installed 
binaries, but creates a new temporary cluster for the tests.)


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Streaming replication and WAL archive interactions

2015-04-21 Thread Heikki Linnakangas

On 04/21/2015 12:04 PM, Michael Paquier wrote:

On Tue, Apr 21, 2015 at 4:38 PM, Heikki Linnakangas  wrote:


Note that even though we don't archive the partial last segment on the
previous timeline, the same WAL is copied to the first segment on the new
timeline. So the WAL isn't lost.


But if the failed master has archived those segments safely, we may need
them, no? I am not sure we can ignore a user who would want to do a PITR
with recovery_target_timeline pointing to the one of the failed master.


I think it would be acceptable. If you want to maintain an 
up-to-the-second archive, you can use pg_receivexlog. Mind you, if the 
standby wasn't promoted, the partial segment would not be present in the 
archive anyway. And you can copy the WAL segment manually from 
000200XX to pg_xlog/000100XX before 
starting PITR.


Another thought is that we could archive the partial file, but with a 
different name to avoid confusing it with the full segment. For example, 
we could archive a partial 00010012 segment as 
"00020012.0128.partial", where 0128 indicates 
how far that file is valid (this naming is similar to how the backup 
history files are named). Recovery wouldn't automatically pick up those 
files, but the DBA could easily copy the partial file into pg_xlog with 
the full segment's name, if he wants to do PITR to that piece of WAL.



Are there use cases where you'd want that, rather than the new "shared"
mode? I wanted to keep the 'on' mode for backwards-compatibility, but if
that causes more problems, it might be better to just remove it and force
the admin to choose what kind of a setup he has, with "shared" or "always".


The 'on' mode is still useful IMO to get a behavior as close as possible to
what previous releases did.


But would you ever want the old behaviour, rather than the new shared or 
always behaviour?


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PATCH: Add 'pid' column to pg_replication_slots

2015-04-21 Thread Andres Freund
On April 21, 2015 1:17:32 PM GMT+03:00, Craig Ringer  
wrote:
>On 21 April 2015 at 15:19, Andres Freund  wrote:
>
>> On 2015-04-07 18:41:59 +0800, Craig Ringer wrote:
>> > @@ -331,8 +331,8 @@ ReplicationSlotAcquire(const char *name)
>> >   volatile ReplicationSlot *vslot = s;
>> >
>> >   SpinLockAcquire(&s->mutex);
>> > - active = vslot->active;
>> > - vslot->active = true;
>> > + active = vslot->active_pid != 0;
>> > + vslot->active_pid = MyProcPid;
>> >   SpinLockRelease(&s->mutex);
>> >   slot = s;
>> >   break;
>>
>> Uh. You're overwriting the existing pid here. Not good if the slot is
>> currently in use.
>>
>
>Isn't that the point? We're acquiring the slot there, per the comment:

Read the rest of the function. We're checking for conflicts...

--- 
Please excuse brevity and formatting - I am writing this on my mobile phone.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PATCH: Add 'pid' column to pg_replication_slots

2015-04-21 Thread Craig Ringer
On 21 April 2015 at 15:19, Andres Freund  wrote:

> On 2015-04-07 18:41:59 +0800, Craig Ringer wrote:
> > @@ -331,8 +331,8 @@ ReplicationSlotAcquire(const char *name)
> >   volatile ReplicationSlot *vslot = s;
> >
> >   SpinLockAcquire(&s->mutex);
> > - active = vslot->active;
> > - vslot->active = true;
> > + active = vslot->active_pid != 0;
> > + vslot->active_pid = MyProcPid;
> >   SpinLockRelease(&s->mutex);
> >   slot = s;
> >   break;
>
> Uh. You're overwriting the existing pid here. Not good if the slot is
> currently in use.
>

Isn't that the point? We're acquiring the slot there, per the comment:

"Find a previously created slot and mark it as used by this backend."


>   namecpy(&plugin, &slot->data.plugin);
> >
> > - active = slot->active;
> > + active_pid = slot->active_pid != 0;
>
> That doesn't look right.
>

No, that's certainly not right. I also could've sworn I sorted it out, but
that must've been another site, because sure enough it's still there.

> I don't really like the 'pid' field for pg_replication_slots. How about
> naming it 'active_in' or such?
>

It was originally named active_pid, but changed based on feedback from
others that 'pid' would be consistent with pg_stat_activity and
pg_replication_slots. I have no strong opinion on the name, though I'd
prefer it reflect that the field does in fact represent a process ID.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


[HACKERS] Update docs in fdwhandler.sgml

2015-04-21 Thread Etsuro Fujita
Hi,

Since we now allow CHECK constraints to be placed on foreign tables, not
only NOT NULL, I think it'd be better to update docs on considerations
about constraints on foreign tables in fdwhandler.sgml, so as to provide
more general considerations.  Please find attached a patch.
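
For example, with something like the following (the table, server, and
constraint are made up for illustration):

    CREATE FOREIGN TABLE ft_orders (
        id       int,
        quantity int CHECK (quantity > 0)
    ) SERVER remote_srv OPTIONS (table_name 'orders');

the planner may assume that rows visible in ft_orders satisfy the constraint,
so it is up to the user to ensure that the remote data really does.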

Best regards,
Etsuro Fujita
*** a/doc/src/sgml/fdwhandler.sgml
--- b/doc/src/sgml/fdwhandler.sgml
***
*** 242,254  IterateForeignScan (ForeignScanState *node);
  
  
   Note that PostgreSQL's executor doesn't care
!  whether the rows returned violate any NOT NULL
!  constraints that were defined on the foreign table columns — but
!  the planner does care, and may optimize queries incorrectly if
!  NULL values are present in a column declared not to contain
!  them.  If a NULL value is encountered when the user has
!  declared that none should be present, it may be appropriate to raise an
!  error (just as you would need to do in the case of a data type mismatch).
  
  
  
--- 242,254 
  
  
   Note that PostgreSQL's executor doesn't care
!  whether the rows returned violate any constraints that were defined on
!  the foreign table — but the planner does care, and may optimize
!  queries incorrectly if there are rows visible in the foreign table that
!  do not satisfy a declared constraint.  If a constraint is violated when
!  the user has declared that the constraint should hold true, it may be
!  appropriate to raise an error (just as you would need to do in the case
!  of a data type mismatch).
  
  
  

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Streaming replication and WAL archive interactions

2015-04-21 Thread Michael Paquier
On Tue, Apr 21, 2015 at 4:38 PM, Heikki Linnakangas  wrote:

> On 04/21/2015 09:53 AM, Michael Paquier wrote:
>
>> On Thu, Apr 16, 2015 at 8:57 PM, Heikki Linnakangas wrote:
>>
>>> Oh, hang on, that's not necessarily true. On promotion, the standby
>>>
>> archives
>>
>>> the last, partial WAL segment from the old timeline. That's just wrong
>>> (http://www.postgresql.org/message-id/52fcd37c.3070...@vmware.com), and
>>> in
>>> fact I somehow thought I changed that already, but apparently not. So
>>>
>> let's
>>
>>> stop doing that.
>>>
>>
>> Er. Are you planning to prevent the standby from archiving the last
>> partial
>> segment from the old timeline at promotion?
>>
>
> Yes.
>
>  I thought from previous discussions that we should do it as master
>> (be it crashed, burned, burried or dead) may not have the occasion to
>> do it. By preventing its archiving you close the door to the case
>> where master did not have the occasion to archive it.
>>
>
> The current situation is a mess:
>
> 1. Even though we archive the last segment in the standby, there is no
> guarantee that the master had archived all the previous segments already.
>
> 2. If the master is not totally dead, it might try to archive the same file
> with more WAL in it, at the same time or just afterwards, or even just
> before the standby has completed promotion. Which copy do you keep in the
> archive? Having to deal with that makes the archive_command more
> complicated.
>
> Note that even though we don't archive the partial last segment on the
> previous timeline, the same WAL is copied to the first segment on the new
> timeline. So the WAL isn't lost.
>

But if the failed master has archived those segments safely, we may need
them, no? I am not sure we can ignore a user who would want to do a PITR
with recovery_target_timeline pointing to the one of the failed master.


>
>  People may be surprised that a base backup taken from a node that has
>> archive_mode = on set (that's the case in a very large number of cases)
>> will not be able to work as-is as node startup will fail as follows:
>> FATAL:  archive_mode='on' cannot be used in archive recovery
>> HINT:  Use 'shared' or 'always' mode instead.
>>
>
> Hmm, good point.
>
>  One idea would be to simply ignore the fact that archive_mode = on on
>> nodes
>> in recovery instead of dropping an error. Note that I like the fact that it
>> drops an error as that's clear, I just point out the fact that people may be
>> surprised that base backups are not working anymore now in this case.
>>
>
> By "ignore", what behaviour do you mean? Would "on" be equivalent to
> "shared", "always", or something else?
>

I meant something backward-compatible, with files marked as .done when they
are finished replaying... But now my words *are* weird as on != off ;)

Or we could keep the current behaviour with archive_mode=on (except for the
> last segment thing, which is just wrong), where the standby only archives
> the new timeline, and nothing from the previous timelines.
>

I guess this would solve the issue here then, which is not a bad thing in
itself:
http://www.postgresql.org/message-id/20140918180734.361021e1@erg
We would need to check if the situation improves with the 'always' mode btw.


> Are there use cases where you'd want that, rather than the new "shared"
> mode? I wanted to keep the 'on' mode for backwards-compatibility, but if
> that causes more problems, it might be better to just remove it and force
> the admin to choose what kind of a setup he has, with "shared" or "always".
>

The 'on' mode is still useful IMO to get a behavior as close as possible to
what previous releases did.
Regards,
-- 
Michael


Re: [HACKERS] Parallel Seq Scan

2015-04-21 Thread David Rowley
On 21 April 2015 at 06:26, Robert Haas  wrote:

> On Wed, Apr 8, 2015 at 3:34 AM, David Rowley  wrote:
> > In summary it sounds like with my idea we get:
> >
> > Pros
> > * Optimal plan if no workers are available at execution time.
> > * Parallelism possible if the chosen optimal plan happens to support
> > parallelism, e.g not index scan.
> > * No planning overhead
>
> The third one isn't really true.  You've just moved some of the
> planning to execution time.
>
>
Hmm, sorry, I meant no planner overhead during normal planning.
I was more driving along the lines of the fact that low cost queries don't
have to pay the price for the planner considering parallel paths. This
"parallelizer" that I keep talking about would only be asked to do anything
if the root node's cost was above some GUC like parallel_cost_threshold,
and likely a default for this would be some cost that would translate into
a query that took, say roughly anything over 1 second. This way super fast
1 millisecond plans don't have to suffer from extra time taken to consider
parallel paths.  Once we're processing queries that are above this parallel
threshold then the cost of the parallelizer invocation would be drowned out
by the actual execution cost anyway.



> > Cons:
> > * The plan "Parallelizer" must make changes to the plan just before
> > execution time, which ruins the 1 to 1 ratio of plan/executor nodes by
> the
> > time you inject Funnel nodes.
> >
> > If we parallelise during planning time:
> >
> > Pros
> > * More chance of getting a parallel friendly plan which could end up
> being
> > very fast if we get enough workers at executor time.
>
> This, to me, is by far the biggest "con" of trying to do something at
> execution time.  If planning doesn't take into account the gains that
> are possible from parallelism, then you'll only be able to come up
> with the best parallel plan when it happens to be a parallelized
> version of the best serial plan.  So long as the only parallel
> operator is parallel seq scan, that will probably be a common
> scenario.  But once we assemble a decent selection of parallel
> operators, and a reasonably intelligent parallel query optimizer, I'm
> not so sure it'll still be true.
>
>
I agree with that. It's a tough one.
I was hoping that this might be offset by the fact that we won't have to
pay the high price when the planner spits out a parallel plan and the
executor has no spare workers to execute it as intended, and also that we
wouldn't have to be nearly as conservative with the max_parallel_degree
GUC, which could just be set to the number of logical CPUs in the machine;
we could then use that value minus the number of active backends during
execution.


> > Cons:
> > * May produce non optimal plans if no worker processes are available
> during
> > execution time.
> > * Planning overhead for considering parallel paths.
> > * The parallel plan may blow out buffer caches due to increased I/O of
> > parallel plan.
> >
> > Of course please say if I've missed any pro or con.
>
> I think I generally agree with your list; but we might not agree on
> the relative importance of the items on it.
>
>
I've also been thinking about how, instead of having to have a special
PartialSeqScan node which contains a bunch of code to store tuples in a
shared memory queue, could we not have a "TupleBuffer", or
"ParallelTupleReader" node, one of which would always be the root node of a
plan branch that's handed off to a worker process. This node would just try
to keep its shared tuple store full, and perhaps once it fills it could
have a bit of a sleep and be woken up when there's a bit more space on the
queue. When no more tuples were available from the node below this, then
the worker could exit. (providing there was no rescan required)

I think between the Funnel node and a ParallelTupleReader we could actually
parallelise plans that don't even have parallel safe nodes. Let me
explain:

Let's say we have a 4 way join, and the join order must be {a,b}, {c,d} =>
{a,b,c,d}. Assuming the cost of joining a to b and c to d are around the
same, the Parallelizer may notice this and decide to inject a Funnel and
then ParallelTupleReader just below the node for c join d and have c join d
in parallel. Meanwhile the main worker process could be executing the root
node, as normal. This way the main worker wouldn't have to go to the
trouble of joining c to d itself as the worker would have done all that
hard work.

I know the current patch is still very early in the evolution of
PostgreSQL's parallel query, but how would that work with the current
method of selecting which parts of the plan to parallelise? I really think
the plan needs to be a complete plan before it can be best analysed on how
to divide the workload between workers, and also, it would be quite useful
to know how many workers are going to be able to lend a hand in order to
know best how to divide the plan up as evenly as possible.

Apologies if this seems l

Re: [HACKERS] Optimization for updating foreign tables in Postgres FDW

2015-04-21 Thread Kyotaro HORIGUCHI
Hi, thank you. My understanding became a bit clearer.

At Tue, 21 Apr 2015 15:35:41 +0900, Etsuro Fujita  
wrote in <5535efbd.8030...@lab.ntt.co.jp>
> On 2015/04/21 10:07, Kyotaro HORIGUCHI wrote:
> > At Mon, 20 Apr 2015 16:40:52 +0900, Etsuro Fujita
> >  wrote in
> > <5534ad84.3020...@lab.ntt.co.jp>
> >> However, I'd
> >> like to propose to rename "Foreign Update" ("Foreign Delete") of
> >> ModifyTable simply to "Update" ("Delete") not only because (1) that
> >> solves the duplication problem but also because (2) ISTM that is
> >> consistent with the non-inherited updates in both of the
> >> non-pushed-down-update case and the pushed-down-update case.  Here are
> >> examples for (2).
> 
> > Update node without "Foreign" that runs "Remote SQL" looks to me
> > somewhat unusual..
> 
> I think that has a similarity with the existing EXPLAIN outputs for
> non-inherited non-pushed-down updates, as shown in the below example.
> 
> >> postgres=# explain verbose update ft1 set c1 = trunc(random() * 9 +
> >> 1)::int;
> >>QUERY PLAN
> >>
> --
> >>   Update on public.ft1  (cost=100.00..226.03 rows=2730 width=6)
> >> Remote SQL: UPDATE public.t1 SET c1 = $2 WHERE ctid = $1
> >> ->  Foreign Scan on public.ft1  (cost=100.00..226.03 rows=2730 width=6)
> >>   Output: (trunc(((random() * '9'::double precision) +
> >> '1'::double precision)))::integer, ctid
> >>   Remote SQL: SELECT ctid FROM public.t1 FOR UPDATE
> >> (5 rows)

Mmm.. It also looks confusing, which needs to be fixed. Now
foreign tables are updated in two ways: one is a ModifyTable on the
foreign relation, and the other is a ForeignScan node performing the update
operation. Though I think I understand the path to this
form, I suppose they should be merged into one type of node,
perhaps the ForeignScan node. Even if that is hardly achievable for
now, the explain representation should be uniform.

If we make a ModifyTable on a foreign relation use the representation
"Foreign Update", the explain results of queries modifying
foreign tables would look like:

Foreign Update on public.ft1 (...)
  Remote SQL: UPDATE public.t1 
  -> Foreign Scan on public.ft1...

Foreign Update on public.ft1 (...
  Foreign Update on public.ft1 (...
Remote SQL: ...

If Foreign Update has only one internal representation, the two identical
Foreign Updates can (ideally) easily be eliminated during
planning, and explain would naturally show the following result.

Foreign Update on public.ft1 (...
  Remote SQL: ...

But if not, as is currently the case, printing the result needs a somewhat
complicated calculation.


> > |  Update on public.p  (cost=0.00..360.08 rows=4311 width=14)
> > |    Update on public.p
> > |    ->  Seq Scan on public.p  (cost=0.00..0.00 rows=1 width=14)
> > |          Output: p.a, (p.b + 1), p.ctid
> > |    ->  Foreign Update on public.ft1  (cost=100.00..180.04 rows=2155 width=14)
> > |          Remote SQL: UPDATE public.t1 SET b = (b + 1)
> > |    ->  Foreign Update on public.ft2  (cost=100.00..180.04 rows=2155 width=14)
> > |          Remote SQL: UPDATE public.t2 SET b = (b + 1)
> 
> On that point, I agree with Tom that that would cause the problem that
> the user has to guess at which of the child plans goes with which
> target relation of ModifyTable [1].
> 
> [1]
> http://www.postgresql.org/message-id/22505.1426986...@sss.pgh.pa.us

Yeah, that seems to make the plan clearer to understand. Combining Tom's
suggestion with mine would result in the following explain
output.

Update on public.p  (cost=0.00..360.08 rows=4311 width=14)
  Update on public.p
  ->  Seq Scan on public.p  (cost=0.00..0.00 rows=1 width=14)
Output: p.a, (p.b + 1), p.ctid
  Foreign Update on public.ft1 (cost=100.00..180.04 rows=2155 width=14)
 Remote SQL: UPDATE public.t1 SET b = (b + 1)
  Foreign Update on public.ft2 (cost=100.00..180.04 rows=2155  width=14)
 Remote SQL: UPDATE public.t2 SET b = (b + 1)

And when not pushed down it would look like,

Update on public.p  (cost=0.00..360.08 rows=4311 width=14)
  Update on public.p
->  Seq Scan on public.p  (cost=0.00..0.00 rows=1 width=14)
  Output: p.a, (p.b + 1), p.ctid
  Foreign Update on public.ft1 (cost=100.00..180.04 rows=2155 width=14)
 Remote SQL: UPDATE public.t1 SET b = $2 WHERE ctid = $1
 -> Foreign Scan on public.ft1 (cost=)
  Output: a, b, ctid
  Remote SQL: SELECT a, ctid FROM public.t1 FOR UPDATE
  Foreign Update on public.ft2 (cost=100.00..180.04 rows=2155  width=14)
 Remote SQL: UPDATE public.t2 SET b = (b + 1)
 -> Foreign Scan on public.ft2 (cost=)
  Output: a, b, ctid
  Remote SQL: SELECT a, ctid FROM public.t2 FOR UPDATE

These look quite reasonable *to me* :)

Of course, the same discussion is applicable to Foreign Delete.

What do you think about this?

Any furt

Re: [HACKERS] Replication identifiers, take 4

2015-04-21 Thread Simon Riggs
On 20 April 2015 at 09:28, Andres Freund  wrote:

> On 2015-04-20 11:26:29 +0300, Heikki Linnakangas wrote:
> > I just realized that it talks about "replication identifier" as the new
> > fundamental concept. The system table is called
> "pg_replication_identifier".
> > But that's like talking about "index identifiers", instead of just
> indexes,
> > and calling the system table pg_index_oid.
> >
> > The important concept this patch actually adds is the *origin* of each
> > transaction. That term is already used in some parts of the patch. I
> think
> > we should roughly do a search-replace of "replication identifier" ->
> > "replication origin" to the patch. Or even "transaction origin".
>
> Sounds good to me.
>

+1

-- 
Simon Riggshttp://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Streaming replication and WAL archive interactions

2015-04-21 Thread Heikki Linnakangas

On 04/21/2015 09:53 AM, Michael Paquier wrote:

On Thu, Apr 16, 2015 at 8:57 PM, Heikki Linnakangas wrote:

Oh, hang on, that's not necessarily true. On promotion, the standby

archives

the last, partial WAL segment from the old timeline. That's just wrong
(http://www.postgresql.org/message-id/52fcd37c.3070...@vmware.com), and in
fact I somehow thought I changed that already, but apparently not. So

let's

stop doing that.


Er. Are you planning to prevent the standby from archiving the last partial
segment from the old timeline at promotion?


Yes.


I thought from previous discussions that we should do it as master
(be it crashed, burned, burried or dead) may not have the occasion to
do it. By preventing its archiving you close the door to the case
where master did not have the occasion to archive it.


The current situation is a mess:

1. Even though we archive the last segment in the standby, there is no 
guarantee that the master had archived all the previous segments already.


2. If the master is not totally dead, it might try to archive the same 
file with more WAL in it, at the same time or just afterwards, or even 
just before the standby has completed promotion. Which copy do you keep 
in the archive? Having to deal with that makes the archive_command more 
complicated.


Note that even though we don't archive the partial last segment on the 
previous timeline, the same WAL is copied to the first segment on the 
new timeline. So the WAL isn't lost.



People may be surprised that a base backup taken from a node that has
archive_mode = on set (that's the case in a very large number of cases)
will not be able to work as-is as node startup will fail as follows:
FATAL:  archive_mode='on' cannot be used in archive recovery
HINT:  Use 'shared' or 'always' mode instead.


Hmm, good point.


One idea would be to simply ignore the fact that archive_mode = on on nodes
in recovery instead of dropping an error. Note that I like the fact that it
drops an error as that's clear, I just point out the fact that people may be
surprised that base backups are not working anymore now in this case.


By "ignore", what behaviour do you mean? Would "on" be equivalent to 
"shared", "always", or something else?


Or we could keep the current behaviour with archive_mode=on (except for 
the last segment thing, which is just wrong), where the standby only 
archives the new timeline, and nothing from the previous timelines. Are
there use cases where you'd want that, rather than the new "shared" mode?
I wanted to keep the 'on' mode for backwards-compatibility, but if that 
causes more problems, it might be better to just remove it and force the 
admin to choose what kind of a setup he has, with "shared" or "always".



Creating a dependency between the pgstat machinery and the WAL sender looks
weak to me. For example with this patch a master cannot stop, as it waits
indefinitely:
LOG:  using stale statistics instead of current ones because stats
collector is not responding
LOG:  sending archival report:


Hmm, yeah, having the walsender wait for the stats file to appear is not
good.



You could scan archive_status/ but that would be costly if there are many
entries to scan and I think that walsender should be highly responsive. Or
you could directly store the name of the lastly archived WAL segment marked
as .done in let's say archive_status/last_archived. An entry for that in
the control file does not seem the right place as a node may not have
archive_mode enabled that's why I am not mentioning it.


The ways that the archiver process can communicate with the rest of the
system are limited, for the sake of robustness. Writing to the control
file is definitely not OK. I think using the stats collector is OK for
this, but we'll have to arrange it so that the walsender doesn't block
on it, and we should probably not force a new stats file so often. A stats
file that is 5-10 seconds old would be perfectly fine for this purpose.


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fix broken Install.bat when target directory contains a space

2015-04-21 Thread Asif Naeem
The v2 patch looks good to me, just a minor concern on usage message i.e.

C:\PG\postgresql\src\tools\msvc>install
> Invalid command line options.
> Usage: "install.bat <targetdir> [installtype]"
> installtype: client


It seems that there are two install options i.e. client, all (any other
string other than client is being considered or treated as all), the
following install command works i.e.

install C:\PG\postgresql\inst option_does_not_exist


As your patch affects this area of code, I thought to share these findings
with you. BTW, it is a minor thing that can be handled in another patch. If
you like, please feel free to change the status to ready for committer. Thanks.


On Fri, Apr 17, 2015 at 10:36 AM, Michael Paquier  wrote:

> On Thu, Apr 16, 2015 at 5:40 PM, Asif Naeem wrote:
> > Along with fixing the space in installation path, it is also changing the
> > behavior of install script, why not just "if NOT [%1]==[] GOTO RUN_INSTALL",
> > checking for "/?" seems good for help message but it seem not well handled
> > in the script, it is still saying "Invalid command line options.", along
> > with this, option "/?" seems not handled by any other .bat build script.
> > Other than this, with the patch applied, is it an acceptable behavior that
> > (2) shows usage message as 'Usage: install.pl <targetdir> [installtype]' but
> > (3) shows usage message as 'Usage: "install.bat <targetdir>"'. Thanks.
>
> Thanks for your review!
>
> OK, let's remove the use of /? then, consistency with the other
> scripts is a good argument for its removal. Attached is an updated
> patch that does the following regarding missing arguments:
> >install
> Invalid command line options.
> Usage: "install.bat <targetdir> [installtype]"
> installtype: client
> >install
> Installing version 9.5 for release in /?
> Copying build output files...Could not copy release\postgres\postgres.exe
> to /?\bin\postgres.exe
>  at Install.pm line 40.
> Install::lcopy("release\\postgres\\postgres.exe",
> "/?\\bin\\postgres.exe") called at Install.pm line 324
> Install::CopySolutionOutput("release", "/?") called at Install.pm line 93
> Install::Install("/?", undef) called at install.pl line 13
>
> This patch fixes of course the issue with spaces included in the
> target path. I updated as well the Usage in install.bat to be
> consistent with install.pl.
> Regards,
> --
> Michael
>


Re: [HACKERS] PATCH: Add 'pid' column to pg_replication_slots

2015-04-21 Thread Andres Freund
On 2015-04-07 18:41:59 +0800, Craig Ringer wrote:
> @@ -331,8 +331,8 @@ ReplicationSlotAcquire(const char *name)
>   volatile ReplicationSlot *vslot = s;
>  
>   SpinLockAcquire(&s->mutex);
> - active = vslot->active;
> - vslot->active = true;
> + active = vslot->active_pid != 0;
> + vslot->active_pid = MyProcPid;
>   SpinLockRelease(&s->mutex);
>   slot = s;
>   break;

Uh. You're overwriting the existing pid here. Not good if the slot is
currently in use.

>   namecpy(&plugin, &slot->data.plugin);
>  
> - active = slot->active;
> + active_pid = slot->active_pid != 0;

That doesn't look right.

> --- a/src/include/replication/slot.h
> +++ b/src/include/replication/slot.h
> @@ -84,13 +84,15 @@ typedef struct ReplicationSlot
>   /* is this slot defined */
>   boolin_use;
>  
> - /* is somebody streaming out changes for this slot */
> - boolactive;
> +/* field 'active' removed in 9.5; see 'active_pid' instead */
>  
>   /* any outstanding modifications? */
>   booljust_dirtied;
>   booldirty;
>  
> + /* Who is streaming out changes for this slot? 0 for nobody */
> + pid_t   active_pid;
> +

That's a horrible idea. That way we end up with dozens of indirections
over time.

I don't really like the 'pid' field for pg_replication_slots. How about
naming it 'active_in' or such?

Other than these I plan to push this soon.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers