date:20190322

Re: Ordered Partitioned Table Scans

2019-03-22 Thread Tom Lane

David Rowley  writes:
> On Sat, 23 Mar 2019 at 05:40, Tom Lane  wrote:
>> BTW, another thing we could possibly do to answer this objection is to
>> give the ordered-Append node an artificially pessimistic startup cost,
>> such as the sum or the max of its children's startup costs.  That's
>> pretty ugly and unprincipled, but maybe it's better than not having the
>> ability to generate the plan shape at all?

> I admit to having thought of that while trying to get to sleep last
> night, but I was too scared to even suggest it.  It's pretty much how
> MergeAppend would cost it anyway.  I agree it's not pretty to lie
> about the startup cost, but it does kinda seem silly to fall back on a
> more expensive MergeAppend when we know fine well Append is cheaper.

Yeah.  I'm starting to think that this might actually be the way to go,
and here's why: my argument here is basically that a child plan that
has a large startup cost is going to screw up our ability to estimate
whether the parent Append is really a fast-start plan or not.  Now, if
we have to insert a Sort to make a child plan be correctly ordered, that
clearly is a case where the child could have a large startup cost ...
but what if a correctly-ordered child has a large startup cost for
some other reason?  Simply refusing to insert Sort nodes won't keep us
out of the weeds if that's true.  However, if we stick in a hack like
the one suggested above, that will keep us from being too optimistic
about the fast-start properties of the Append node no matter whether
the problem arises from an added Sort node or is intrinsic to the
child plan.

It may well be that as things stand today, this scenario is only
hypothetical, because we can only prove that a plain-Append plan
is correctly sorted if it's arising from a suitably partitioned table,
and the child plans in such cases will all be IndexScans with
minimal startup cost.  But we should look ahead to scenarios where
that's not true.  (mumble maybe a foreign table as a partition
is already a counterexample? mumble)

regards, tom lane

Re: Removing \cset from pgbench

2019-03-22 Thread Fabien COELHO



Hola Álvaro,


What I'm going to do now is to write a patch to remove the \cset part of
the commit and post it, intending to push at some point next week.


Per your request, here is a patch which removes \cset from pgbench. Sigh.

--
Fabien.diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index 24833f46bc..b45e2f0896 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -963,48 +963,6 @@ pgbench  options  d
   
 
   
-   
-
- \cset [prefix]
-
-
-
- 
-  This command may be used to end SQL queries, replacing an embedded
-  semicolon (\;) within a compound SQL command.
- 
-
- 
-  When this command is used, the preceding SQL query is expected to
-  return one row, the columns of which are stored into variables named after
-  column names, and prefixed with prefix if provided.
- 
-
- 
-  The following example sends four queries as one compound SQL command,
-  inducing one message sent at the protocol level.
-  The result of the first query is stored into variable one,
-  the results of the third query are stored into variables z_three
-  and z_four,
-  whereas the results of the other queries are discarded.
-
--- compound of four queries
-SELECT 1 AS one \cset
-SELECT 2 AS two \;
-SELECT 3 AS three, 4 AS four \cset z_
-SELECT 5;
-
- 
-
- 
-  
-\cset does not work when empty SQL queries appear
-within a compound SQL command.
-  
- 
-
-   
-

 
  \gset [prefix]
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 4789ab92ee..43976d1d5d 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -490,7 +490,6 @@ typedef enum MetaCommand
 	META_SETSHELL,/* \setshell */
 	META_SHELL,	/* \shell */
 	META_SLEEP,	/* \sleep */
-	META_CSET,	/* \cset */
 	META_GSET,	/* \gset */
 	META_IF,	/* \if */
 	META_ELIF,	/* \elif */
@@ -522,7 +521,7 @@ static const char *QUERYMODE[] = {"simple", "extended", "prepared"};
  *string itself.  For SQL commands, after post-processing
  *argv[0] is the same as 'lines' with variables substituted.
  * nqueries		In a multi-command SQL line, the number of queries.
- * varprefix 	SQL commands terminated with \gset or \cset have this set
+ * varprefix 	SQL commands terminated with \gset have this set
  *to a non NULL value.  If nonempty, it's used to prefix the
  *variable name that receives the value.
  * varprefix_max Allocated size of the varprefix array.
@@ -2614,8 +2613,6 @@ getMetaCommand(const char *cmd)
 		mc = META_ELSE;
 	else if (pg_strcasecmp(cmd, "endif") == 0)
 		mc = META_ENDIF;
-	else if (pg_strcasecmp(cmd, "cset") == 0)
-		mc = META_CSET;
 	else if (pg_strcasecmp(cmd, "gset") == 0)
 		mc = META_GSET;
 	else
@@ -4293,30 +4290,11 @@ free_command(Command *command)
 	}
 	/*
 	 * It should also free expr recursively, but this is currently not needed
-	 * as only \{g,c}set commands (which do not have an expression) are freed.
+	 * as only gset commands (which do not have an expression) are freed.
 	 */
 	pg_free(command);
 }
 
-/*
- * append "more" text to current compound command which had been interrupted
- * by \cset.
- */
-static void
-append_sql_command(Command *my_command, char *more, int numqueries)
-{
-	Assert(my_command->type == SQL_COMMAND && my_command->lines.len > 0);
-
-	more = skip_sql_comments(more);
-	if (more == NULL)
-		return;
-
-	/* append command text, embedding a ';' in place of the \cset */
-	appendPQExpBuffer(&my_command->lines, ";\n%s", more);
-
-	my_command->nqueries += numqueries;
-}
-
 /*
  * Once an SQL command is fully parsed, possibly by accumulating several
  * parts, complete other fields of the Command structure.
@@ -4549,7 +4527,7 @@ process_backslash_command(PsqlScanState sstate, const char *source)
 			syntax_error(source, lineno, my_command->first_line, my_command->argv[0],
 		 "unexpected argument", NULL, -1);
 	}
-	else if (my_command->meta == META_CSET || my_command->meta == META_GSET)
+	else if (my_command->meta == META_GSET)
 	{
 		if (my_command->argc > 2)
 			syntax_error(source, lineno, my_command->first_line, my_command->argv[0],
@@ -4637,7 +4615,6 @@ ParseScript(const char *script, const char *desc, int weight)
 	PQExpBufferData line_buf;
 	int			alloc_num;
 	int			index;
-	bool		saw_cset = false;
 	int			lineno;
 	int			start_offset;
 
@@ -4683,22 +4660,12 @@ ParseScript(const char *script, const char *desc, int weight)
 
 		semicolons = psql_scan_get_escaped_semicolons(sstate);
 
-		if (saw_cset)
-		{
-			/* the previous multi-line command ended with \cset */
-			append_sql_command(ps.commands[index - 1], line_buf.data,
-			   semicolons + 1);
-			saw_cset = false;
-		}
-		else
-		{
-			/* If we collected a new SQL command, process that */
-			command = create_sql_command(&line_buf, desc, semicolons + 1);
+		/* If we collected a new SQL command, process that */
+

reorder pg_rewind control file sync

2019-03-22 Thread Fabien COELHO



Bonjour Michaël,

On Sat, 23 Mar 2019, Michael Paquier wrote:

On Fri, Mar 22, 2019 at 03:18:26PM +0100, Fabien COELHO wrote:
Attached is a quick patch about "pg_rewind", so that the control file 
is updated after everything else is committed to disk.


Could you start a new thread about that please?  This one has already 
been used for too many things.


Here it is.

The attached patch reorders the cluster fsyncing and control file changes 
in "pg_rewind" so that the later is done after all data are committed to 
disk, so as to reflect the actual cluster status, similarly to what is 
done by "pg_checksums", per discussion in the thread about offline 
enabling of checksums:


https://www.postgresql.org/message-id/20181221201616.gd4...@nighthawk.caipicrew.dd-dns.de

--
Fabien.diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c
index 3dcadb9b40..c1e6d7cd07 100644
--- a/src/bin/pg_rewind/pg_rewind.c
+++ b/src/bin/pg_rewind/pg_rewind.c
@@ -351,9 +351,12 @@ main(int argc, char **argv)
 
 	progress_report(true);
 
-	pg_log(PG_PROGRESS, "\ncreating backup label and updating control file\n");
+	pg_log(PG_PROGRESS, "\ncreating backup label\n");
 	createBackupLabel(chkptredo, chkpttli, chkptrec);
 
+	pg_log(PG_PROGRESS, "syncing target data directory\n");
+	syncTargetDirectory();
+
 	/*
 	 * Update control file of target. Make it ready to perform archive
 	 * recovery when restarting.
@@ -362,6 +365,7 @@ main(int argc, char **argv)
 	 * source server. Like in an online backup, it's important that we recover
 	 * all the WAL that was generated while we copied the files over.
 	 */
+	pg_log(PG_PROGRESS, "updating control file\n");
 	memcpy(&ControlFile_new, &ControlFile_source, sizeof(ControlFileData));
 
 	if (connstr_source)
@@ -377,11 +381,9 @@ main(int argc, char **argv)
 	ControlFile_new.minRecoveryPoint = endrec;
 	ControlFile_new.minRecoveryPointTLI = endtli;
 	ControlFile_new.state = DB_IN_ARCHIVE_RECOVERY;
+
 	update_controlfile(datadir_target, progname, &ControlFile_new, do_sync);
 
-	pg_log(PG_PROGRESS, "syncing target data directory\n");
-	syncTargetDirectory();
-
 	printf(_("Done!\n"));
 
 	return 0;

Re: Transaction commits VS Transaction commits (with parallel) VS query mean time

2019-03-22 Thread Alvaro Herrera

On 2019-Mar-23, Amit Kapila wrote:

> On Fri, Mar 22, 2019 at 10:04 AM Alvaro Herrera
>  wrote:

> > Not count them if they're implementation details; otherwise count them.
> > For example, IMO autovacuum transactions should definitely be counted
> > (as one transaction, even if they run parallel vacuum).
> 
> It appears to me that the definition of what we want to display in
> xact_commit/xact_rollback (for pg_stat_database view) is slightly
> vague.  For ex. does it mean that we will show only transactions
> started by the user or does it also includes the other transactions
> started internally (which you call implementation detail) to perform
> the various operations?  I think users would be more interested in the
> transactions initiated by them.

Yes, you're probably right.


> I think some users might also be interested in the write transactions
> happened in the system, basically, those have consumed xid.

Well, do they really want to *count* these transactions, or is it enough
to keep an eye on the "age" of some XID column?  Other than for XID
freezing purposes, I don't see such internal transactions as very
interesting.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Transaction commits VS Transaction commits (with parallel) VS query mean time

2019-03-22 Thread Amit Kapila

On Fri, Mar 22, 2019 at 10:04 AM Alvaro Herrera
 wrote:
> On 2019-Mar-21, Robert Haas wrote:
>
> > I agree that it's a little funny to count the parallel worker commit
> > as a separate transaction, since in a certain sense it is part of the
> > same transaction.
>
> Right.  So you don't count it.  This seems crystal clear.  Doing the
> other thing is outright wrong, there's no doubt about that.
>

Agreed, this is a clear problem and we can just fix this and move
ahead, but it is better if we spend some more time to see if we can
come up with something better.  We might want to just fix this and
then deal with the second part of the problem separately along with
some other similar cases.

> > But if you do that, then as already noted you have to next decide what
> > to do about other transactions that parallel workers use internally.
>
> You don't count those ones either.
>

Yeah, we can do that but it is not as clear as the previous one.  The
reason is that we do similarly start/commit transaction for various
operations like autovacuum, cluster, drop index concurrently, etc.
So, it doesn't sound good to me to just change this for parallel query
and leave others as it is.

> > And then you have to decide what to do about other background
> > transactions.
>
> Not count them if they're implementation details; otherwise count them.
> For example, IMO autovacuum transactions should definitely be counted
> (as one transaction, even if they run parallel vacuum).
>

It appears to me that the definition of what we want to display in
xact_commit/xact_rollback (for pg_stat_database view) is slightly
vague.  For ex. does it mean that we will show only transactions
started by the user or does it also includes the other transactions
started internally (which you call implementation detail) to perform
the various operations?  I think users would be more interested in the
transactions initiated by them.  I think some users might also be
interested in the write transactions happened in the system,
basically, those have consumed xid.

I think we should decide what we want to do for all other internal
transactions that are started before fixing the second part of this
problem ("other transactions that parallel workers use internally").

Thanks, Robert and Alvaro to chime in as this needs some discussion to
decide what behavior we want to provide to users.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Ordered Partitioned Table Scans

2019-03-22 Thread David Rowley

On Sat, 23 Mar 2019 at 05:40, Tom Lane  wrote:
> BTW, another thing we could possibly do to answer this objection is to
> give the ordered-Append node an artificially pessimistic startup cost,
> such as the sum or the max of its children's startup costs.  That's
> pretty ugly and unprincipled, but maybe it's better than not having the
> ability to generate the plan shape at all?

I admit to having thought of that while trying to get to sleep last
night, but I was too scared to even suggest it.  It's pretty much how
MergeAppend would cost it anyway.  I agree it's not pretty to lie
about the startup cost, but it does kinda seem silly to fall back on a
more expensive MergeAppend when we know fine well Append is cheaper.
Probably the danger would be that someone pulls it out thinking its a
bug. So we'd need to clearly comment why we're doing it.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Ordered Partitioned Table Scans

2019-03-22 Thread David Rowley

On Sat, 23 Mar 2019 at 04:56, Tom Lane  wrote:
>
> David Rowley  writes:
> > Append has the additional
> > saving of not having to determine to perform a sort on the top row
> > from each subpath.
>
> Uh, what?  sorted-Append and MergeAppend would need pre-sorts on
> exactly the same set of children.

I was talking about the binary heap code that MergeAppend uses to
decide which subplan to pull from next.

> In cases where, say, the first child requires no sort but also doesn't
> emit very many rows, while the second child requires an expensive sort,
> the planner will have a ridiculously optimistic opinion of the cost of
> fetching slightly more rows than are available from the first child.
> This might lead it to wrongly choose a merge join over a hash for example.

umm.. Yeah, that's a good point.  I seemed to have failed to consider
that the fast startup plan could lower the cost of a merge join with a
limit.  I agree with that concern. I also find it slightly annoying
since we already make other plan shapes that can suffer from similar
problems, e.g Index scan + filter + limit, but I agree we don't need
any more of those as they're pretty painful when they hit.

I'll change the patch around to pull out the code you've mentioned.

Thanks for spelling out your point to me.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: psql display of foreign keys

2019-03-22 Thread Alvaro Herrera

On 2019-Mar-22, Alvaro Herrera wrote:

> On 2019-Mar-05, Amit Langote wrote:
> 
> > On 2019/03/05 4:41, Alvaro Herrera wrote:
> > > Here's the patch I'm really interested about :-)
> > 
> > Thanks for the updated patch.  I applied it and rebased the
> > foreign-keys-referencing-partitioned-tables patch on top.  Here's
> > something I think you may have missed:
> 
> I missed that indeed!  Thanks for noticing.  Here's an updated and
> rebased version of this patch.

I forgot to "git add" the new changes to the expected file.  Here's v8
with that fixed.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>From 1672b926da95e7b181efc3a2670ee979427fee0b Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Fri, 22 Mar 2019 19:13:16 -0300
Subject: [PATCH v8] fix psql display of FKs

---
 src/bin/psql/describe.c   | 140 --
 src/test/regress/expected/foreign_key.out |  26 ++--
 2 files changed, 118 insertions(+), 48 deletions(-)

diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index fd8ebee8cd3..734a718af21 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -1479,6 +1479,7 @@ describeOneTableDetails(const char *schemaname,
 		bool		rowsecurity;
 		bool		forcerowsecurity;
 		bool		hasoids;
+		bool		ispartition;
 		Oid			tablespace;
 		char	   *reloptions;
 		char	   *reloftype;
@@ -1502,7 +1503,25 @@ describeOneTableDetails(const char *schemaname,
 		printfPQExpBuffer(&buf,
 		  "SELECT c.relchecks, c.relkind, c.relhasindex, c.relhasrules, "
 		  "c.relhastriggers, c.relrowsecurity, c.relforcerowsecurity, "
-		  "false AS relhasoids, %s, c.reltablespace, "
+		  "false AS relhasoids, c.relispartition, %s, c.reltablespace, "
+		  "CASE WHEN c.reloftype = 0 THEN '' ELSE c.reloftype::pg_catalog.regtype::pg_catalog.text END, "
+		  "c.relpersistence, c.relreplident, am.amname\n"
+		  "FROM pg_catalog.pg_class c\n "
+		  "LEFT JOIN pg_catalog.pg_class tc ON (c.reltoastrelid = tc.oid)\n"
+		  "LEFT JOIN pg_catalog.pg_am am ON (c.relam = am.oid)\n"
+		  "WHERE c.oid = '%s';",
+		  (verbose ?
+		   "pg_catalog.array_to_string(c.reloptions || "
+		   "array(select 'toast.' || x from pg_catalog.unnest(tc.reloptions) x), ', ')\n"
+		   : "''"),
+		  oid);
+	}
+	else if (pset.sversion >= 11)
+	{
+		printfPQExpBuffer(&buf,
+		  "SELECT c.relchecks, c.relkind, c.relhasindex, c.relhasrules, "
+		  "c.relhastriggers, c.relrowsecurity, c.relforcerowsecurity, "
+		  "c.relhasoids, c.relispartition, %s, c.reltablespace, "
 		  "CASE WHEN c.reloftype = 0 THEN '' ELSE c.reloftype::pg_catalog.regtype::pg_catalog.text END, "
 		  "c.relpersistence, c.relreplident, am.amname\n"
 		  "FROM pg_catalog.pg_class c\n "
@@ -1520,7 +1539,7 @@ describeOneTableDetails(const char *schemaname,
 		printfPQExpBuffer(&buf,
 		  "SELECT c.relchecks, c.relkind, c.relhasindex, c.relhasrules, "
 		  "c.relhastriggers, c.relrowsecurity, c.relforcerowsecurity, "
-		  "c.relhasoids, %s, c.reltablespace, "
+		  "c.relhasoids, false, %s, c.reltablespace, "
 		  "CASE WHEN c.reloftype = 0 THEN '' ELSE c.reloftype::pg_catalog.regtype::pg_catalog.text END, "
 		  "c.relpersistence, c.relreplident\n"
 		  "FROM pg_catalog.pg_class c\n "
@@ -1537,7 +1556,7 @@ describeOneTableDetails(const char *schemaname,
 		printfPQExpBuffer(&buf,
 		  "SELECT c.relchecks, c.relkind, c.relhasindex, c.relhasrules, "
 		  "c.relhastriggers, false, false, c.relhasoids, "
-		  "%s, c.reltablespace, "
+		  "false, %s, c.reltablespace, "
 		  "CASE WHEN c.reloftype = 0 THEN '' ELSE c.reloftype::pg_catalog.regtype::pg_catalog.text END, "
 		  "c.relpersistence, c.relreplident\n"
 		  "FROM pg_catalog.pg_class c\n "
@@ -1554,7 +1573,7 @@ describeOneTableDetails(const char *schemaname,
 		printfPQExpBuffer(&buf,
 		  "SELECT c.relchecks, c.relkind, c.relhasindex, c.relhasrules, "
 		  "c.relhastriggers, false, false, c.relhasoids, "
-		  "%s, c.reltablespace, "
+		  "false, %s, c.reltablespace, "
 		  "CASE WHEN c.reloftype = 0 THEN '' ELSE c.reloftype::pg_catalog.regtype::pg_catalog.text END, "
 		  "c.relpersistence\n"
 		  "FROM pg_catalog.pg_class c\n "
@@ -1571,7 +1590,7 @@ describeOneTableDetails(const char *schemaname,
 		printfPQExpBuffer(&buf,
 		  "SELECT c.relchecks, c.relkind, c.relhasindex, c.relhasrules, "
 		  "c.relhastriggers, false, false, c.relhasoids, "
-		  "%s, c.reltablespace, "
+		  "false, %s, c.reltablespace, "
 		  "CASE WHEN c.reloftype = 0 THEN '' ELSE c.reloftype::pg_catalog.regtype::pg_catalog.text END\n"
 		  "FROM pg_catalog.pg_class c\n "
 		  "LEFT JOIN pg_catalog.pg_class tc ON (c.reltoastrelid = tc.oid)\n"
@@ -1587,7 +1606,7 @@ describeOneTableDetails(const char *schemaname,
 		printfPQExpBuffer(&buf,

Re: Progress reporting for pg_verify_checksums

2019-03-22 Thread Michael Paquier

On Fri, Mar 22, 2019 at 02:23:17PM +0100, Michael Banck wrote:
> The current version prints a newline when it progress reporting is
> toggled off. Do you mean there is a hazard that this happens right when
> we are printing the progress, so end up with a partly garbage line? I
> don't think that'd be so bad to warrant further code complexitiy, after
> all, the user explicitly wanted the progress to stop.
> 
> But maybe I am misunderstanding?

The latest patch does not apply because of mainly ed308d7.  Could you
send a rebase?

FWIW, I would remove the signal toggling stuff from the patch and keep
the logic simple at this stage.  Not having a solution which works on
Windows is perhaps not a strong reason to not include it, but it's a
sign that we could perhaps design something better, and that's
annoying.  Personally I think that I would just use --progress all the
time and not use the signaling part at all, my 2c.
--
Michael


signature.asc
Description: PGP signature

Re: Contribution to Perldoc for TestLib module in Postgres

2019-03-22 Thread Michael Paquier

On Fri, Mar 22, 2019 at 04:59:58PM +0530, Ramanarayana wrote:
> Please find the first version of the patch for review. I was not sure what
> some of the functions are used for and marked them with TODO.

This is only adding some documentation to an internal perl module we
ship, so it is far from being a critical part and we could still get
that into v12, still there are many patches waiting for integration
into v12 and this has showed up very late.  Could you please register
this patch to the commit fest once you have a patch you think is fit
for merging?  Here is the next commit fest link:
https://commitfest.postgresql.org/23/
--
Michael


signature.asc
Description: PGP signature

Re: Special role for subscriptions

2019-03-22 Thread Michael Paquier

On Fri, Mar 22, 2019 at 08:41:06PM +0800, Andrey Borodin wrote:
> 22 марта 2019 г., в 19:17, Petr Jelinek  
> написал(а):
>> I still don't like that we are running the subscription workers as
>> superuser even for subscriptions created by regular user. That has
>> plenty of privilege escalation issues in terms of how user functions are
>> run (we execute triggers, index expressions etc, in that worker).
>
> Yes, this is important concern, thanks! I think it is not a big deal
> to run worker without superuser privileges too.

FWIW, the argument from Petr is very scary.  So please let me think
that it is a pretty big deal.

> Yes, this patch is a pure security implication and nothing else.

And this is especially *why* it needs careful screening.

>> Independently from the willingness of any committer to work on this
>> at current CF, the topic of subscription security relaxation
>> really worth efforts. 

Perhaps, still it seems that we are still discussing about the concept
and that we have no clear agreement on what to do.  This is not a good
sign 8 days before the end of the last commit fest.
--
Michael


signature.asc
Description: PGP signature

Re: propagating replica identity to partitions

2019-03-22 Thread Michael Paquier

On Fri, Mar 22, 2019 at 07:55:11PM +0100, Peter Eisentraut wrote:
> If you are operating on a partitioned table and set the replica identity
> to the primary key or a partitioned index of that partitioned table,
> then I think, by definition of what it means to be a partitioned index,
> that applies to the whole partition hierarchy.
> 
> Aside from that theoretical consideration, what would be the practical
> use of not doing that?

FWIW I agree about the part of inheriting the replica identity of the
parent when defining a partition on it.  That's also..  Instinctive.

> I'm slightly baffled that we would even allow having different owners on
> different partitions, but that seems to be a separate discussion.

Different owners can make sense for multiple layers of partitions
where the children have less restrictions than the children.  Imagine
for example a table listing the population of a country, with children
partitioned by regions, and grand-children partitioned by cities.  The
top-most parent could be owned by a minister, and lower levels apply
to the region administrator, down to the city administrators.
--
Michael


signature.asc
Description: PGP signature

Re: Introduce MIN/MAX aggregate functions to pg_lsn

2019-03-22 Thread Michael Paquier

On Fri, Mar 22, 2019 at 04:49:57PM -0300, Fabrízio de Royes Mello wrote:
> So attached patch aims to introduce MIN/MAX aggregate functions to pg_lsn

Fine by me.  This looks helpful for monitoring.

Please make sure to register it to the next commit fest:
https://commitfest.postgresql.org/23/
It is too late for Postgres 12 unfortunately.
--
Michael


signature.asc
Description: PGP signature

Re: compiler warning in pgcrypto imath.c

2019-03-22 Thread Michael Paquier

On Fri, Mar 22, 2019 at 08:20:53PM -0400, Jeff Janes wrote:
> PostgreSQL 12devel on aarch64-unknown-linux-gnu, compiled by gcc
> (Ubuntu/Linaro 7.3.0-27ubuntu1~18.04) 7.3.0, 64-bit

Adding Noah in CC as he has done the update of imath lately.

> The attached patch adds PG_USED_FOR_ASSERTS_ONLY to silence it.  Perhaps
> there is a better way, given that we want to change imath.c as little as
> possible from its upstream?

Maybe others have better ideas, but marking the variable with
PG_USED_FOR_ASSERTS_ONLY as you propose seems like the least invasive
method of all.
--
Michael


signature.asc
Description: PGP signature

compiler warning in pgcrypto imath.c

2019-03-22 Thread Jeff Janes

When compiling on an AWS 64 bit Arm machine, I get this compiler warning:

imath.c: In function 's_ksqr':
imath.c:2590:6: warning: variable 'carry' set but not used
[-Wunused-but-set-variable]
  carry;
  ^

With this version():

PostgreSQL 12devel on aarch64-unknown-linux-gnu, compiled by gcc
(Ubuntu/Linaro 7.3.0-27ubuntu1~18.04) 7.3.0, 64-bit

The attached patch adds PG_USED_FOR_ASSERTS_ONLY to silence it.  Perhaps
there is a better way, given that we want to change imath.c as little as
possible from its upstream?

Cheers,

Jeff


pgcrypto_warning.patch
Description: Binary data

Re: Offline enabling/disabling of data checksums

2019-03-22 Thread Michael Paquier

On Fri, Mar 22, 2019 at 03:18:26PM +0100, Fabien COELHO wrote:
> Attached is a quick patch about "pg_rewind", so that the control file is
> updated after everything else is committed to disk.

Could you start a new thread about that please?  This one has already
been used for too many things.
--
Michael


signature.asc
Description: PGP signature

Re: Offline enabling/disabling of data checksums

2019-03-22 Thread Michael Paquier

On Sat, Mar 23, 2019 at 08:16:07AM +0900, Michael Paquier wrote:
> And committed the main part.  I'll look after the --no-sync part in a
> bit.

--no-sync is committed as well now.
--
Michael


signature.asc
Description: PGP signature

Re: Offline enabling/disabling of data checksums

2019-03-22 Thread Michael Paquier

On Fri, Mar 22, 2019 at 07:02:36PM +0100, Fabien COELHO wrote:
> Indeed it does, and it is done in update_controlfile if the last argument is
> true. Basically update_controlfile latest version always fsync the control
> file, unless explicitely told not to do so. The options to do that are
> really there only to speed up non regression tests.

For the control file, it would not really matter much, and the cost
would be really coming from syncing the data directory, still for
correctness it is better to have a full all-or-nothing switch.  Small
buildfarm machines also like the --no-sync flavors a lot.
--
Michael


signature.asc
Description: PGP signature

Re: Offline enabling/disabling of data checksums

2019-03-22 Thread Michael Paquier

On Fri, Mar 22, 2019 at 02:59:31PM +0100, Fabien COELHO wrote:
> On write(), the error message is not translatable whereas it is for all
> others.

Fixed.

> I agree that a BIG STRONG warning is needed about not to start the cluster
> under pain of possible data corruption. I still think that preventing this
> is desirable, preferably before v12.

For now the docs mention that in a paragraph as Michael Banck has
suggested.  Not sure that this deserves a warning portion.

> Otherwise, my remaining non showstopper (committer's opinion matters more)
> issues:
> 
> Doc: A postgres cluster is like an Oracle instance. I'd use "replicated
> setup" instead of "cluster", and "cluster" instead of "instance. I'll try to
> improve the text, possibly over the week-end.

Right.  I have reworded that using your suggestions.

And committed the main part.  I'll look after the --no-sync part in a
bit.
--
Michael


signature.asc
Description: PGP signature

Re: Error message inconsistency

2019-03-22 Thread Fabrízio de Royes Mello

On Fri, Mar 22, 2019 at 2:25 PM Simon Riggs  wrote:
>
> As noted by a PostgreSQL user to me, error messages for NOT NULL
constraints are inconsistent - they do not mention the relation name in the
message, as all other variants of this message do. e.g.
>
> postgres=# create table nn (id integer not null);
> CREATE TABLE
> postgres=# insert into nn values (NULL);
> ERROR: null value in column "id" violates not-null constraint
> DETAIL: Failing row contains (null).
>
> postgres=# create table nn2 (id integer check (id is not null));
> CREATE TABLE
> postgres=# insert into nn2 values (NULL);
> ERROR: new row for relation "nn2" violates check constraint "nn2_id_check"
> DETAIL: Failing row contains (null).
>
> I propose the attached patch as a fix, changing the wording (of the first
case) to
> ERROR: null value in column "id" for relation "nn" violates not-null
constraint
>
> It causes breakage in multiple tests, which is easy to fix once/if we
agree to change.
>

I totally agree with that change because I already get some negative
feedback from users about this message too.

Regards,

--
   Fabrízio de Royes Mello Timbira - http://www.timbira.com.br/
   PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento

Fix foreign key constraint check for partitioned tables

2019-03-22 Thread Hadi Moshayedi

Yesterday while doing some tests, I noticed that the following doesn't work
properly:

create role test_role with login;
create table ref(a int primary key);
grant references on ref to test_role;
set role test_role;
create table t1(a int, b int) partition by list (a);
alter table t1 add constraint t1_b_key foreign key (b) references ref(a);

In postgres 11.2, this results in the following error:

ERROR:  could not open file "base/12537/16390": No such file or directory

and in the master branch it simply crashes.

It seems that validateForeignKeyConstraint() in tablecmds.c cannot
use RI_Initial_Check() to check the foreign key constraint, so it tries to
open the relation and scan it and verify each row by a call
to RI_FKey_check_ins(). Opening and scanning the relation fails, because it
is a partitioned table and has no storage.

The attached patch fixes the problem by skipping foreign key constraint
check for relations with no storage. In partitioned table case, it will be
verified by scanning the partitions, so we are safe to skip the parent
table.

-- Hadi


fix-foreign-key-check.patch
Description: Binary data

RE: Planning counters in pg_stat_statements (using pgss_store)

2019-03-22 Thread legrand legrand

Hi,
Here is a rebased and corrected version .

Columns naming has not been modified, I would propose to change it to:
 - plans: ok
 - planning_time --> plan_time
 - calls: ok
 - total_time --> exec_time
 - {min,max,mean,stddev}_time: ok
 - new total_time (being the sum of plan_time and exec_time)

Regards
PAscal




pgss_add_planning_counters_v1.diff
Description: pgss_add_planning_counters_v1.diff

Re: psql display of foreign keys

2019-03-22 Thread Alvaro Herrera

On 2019-Mar-05, Amit Langote wrote:

> On 2019/03/05 4:41, Alvaro Herrera wrote:
> > Here's the patch I'm really interested about :-)
> 
> Thanks for the updated patch.  I applied it and rebased the
> foreign-keys-referencing-partitioned-tables patch on top.  Here's
> something I think you may have missed:

I missed that indeed!  Thanks for noticing.  Here's an updated and
rebased version of this patch.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>From 3e523dd2b08569324874b0e3fd848beea6e7779b Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Fri, 22 Mar 2019 19:13:16 -0300
Subject: [PATCH v6] fix psql display of FKs

---
 src/bin/psql/describe.c   | 140 --
 src/test/regress/expected/foreign_key.out |  10 +-
 2 files changed, 110 insertions(+), 40 deletions(-)

diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index fd8ebee8cd3..734a718af21 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -1479,6 +1479,7 @@ describeOneTableDetails(const char *schemaname,
 		bool		rowsecurity;
 		bool		forcerowsecurity;
 		bool		hasoids;
+		bool		ispartition;
 		Oid			tablespace;
 		char	   *reloptions;
 		char	   *reloftype;
@@ -1502,7 +1503,25 @@ describeOneTableDetails(const char *schemaname,
 		printfPQExpBuffer(&buf,
 		  "SELECT c.relchecks, c.relkind, c.relhasindex, c.relhasrules, "
 		  "c.relhastriggers, c.relrowsecurity, c.relforcerowsecurity, "
-		  "false AS relhasoids, %s, c.reltablespace, "
+		  "false AS relhasoids, c.relispartition, %s, c.reltablespace, "
+		  "CASE WHEN c.reloftype = 0 THEN '' ELSE c.reloftype::pg_catalog.regtype::pg_catalog.text END, "
+		  "c.relpersistence, c.relreplident, am.amname\n"
+		  "FROM pg_catalog.pg_class c\n "
+		  "LEFT JOIN pg_catalog.pg_class tc ON (c.reltoastrelid = tc.oid)\n"
+		  "LEFT JOIN pg_catalog.pg_am am ON (c.relam = am.oid)\n"
+		  "WHERE c.oid = '%s';",
+		  (verbose ?
+		   "pg_catalog.array_to_string(c.reloptions || "
+		   "array(select 'toast.' || x from pg_catalog.unnest(tc.reloptions) x), ', ')\n"
+		   : "''"),
+		  oid);
+	}
+	else if (pset.sversion >= 11)
+	{
+		printfPQExpBuffer(&buf,
+		  "SELECT c.relchecks, c.relkind, c.relhasindex, c.relhasrules, "
+		  "c.relhastriggers, c.relrowsecurity, c.relforcerowsecurity, "
+		  "c.relhasoids, c.relispartition, %s, c.reltablespace, "
 		  "CASE WHEN c.reloftype = 0 THEN '' ELSE c.reloftype::pg_catalog.regtype::pg_catalog.text END, "
 		  "c.relpersistence, c.relreplident, am.amname\n"
 		  "FROM pg_catalog.pg_class c\n "
@@ -1520,7 +1539,7 @@ describeOneTableDetails(const char *schemaname,
 		printfPQExpBuffer(&buf,
 		  "SELECT c.relchecks, c.relkind, c.relhasindex, c.relhasrules, "
 		  "c.relhastriggers, c.relrowsecurity, c.relforcerowsecurity, "
-		  "c.relhasoids, %s, c.reltablespace, "
+		  "c.relhasoids, false, %s, c.reltablespace, "
 		  "CASE WHEN c.reloftype = 0 THEN '' ELSE c.reloftype::pg_catalog.regtype::pg_catalog.text END, "
 		  "c.relpersistence, c.relreplident\n"
 		  "FROM pg_catalog.pg_class c\n "
@@ -1537,7 +1556,7 @@ describeOneTableDetails(const char *schemaname,
 		printfPQExpBuffer(&buf,
 		  "SELECT c.relchecks, c.relkind, c.relhasindex, c.relhasrules, "
 		  "c.relhastriggers, false, false, c.relhasoids, "
-		  "%s, c.reltablespace, "
+		  "false, %s, c.reltablespace, "
 		  "CASE WHEN c.reloftype = 0 THEN '' ELSE c.reloftype::pg_catalog.regtype::pg_catalog.text END, "
 		  "c.relpersistence, c.relreplident\n"
 		  "FROM pg_catalog.pg_class c\n "
@@ -1554,7 +1573,7 @@ describeOneTableDetails(const char *schemaname,
 		printfPQExpBuffer(&buf,
 		  "SELECT c.relchecks, c.relkind, c.relhasindex, c.relhasrules, "
 		  "c.relhastriggers, false, false, c.relhasoids, "
-		  "%s, c.reltablespace, "
+		  "false, %s, c.reltablespace, "
 		  "CASE WHEN c.reloftype = 0 THEN '' ELSE c.reloftype::pg_catalog.regtype::pg_catalog.text END, "
 		  "c.relpersistence\n"
 		  "FROM pg_catalog.pg_class c\n "
@@ -1571,7 +1590,7 @@ describeOneTableDetails(const char *schemaname,
 		printfPQExpBuffer(&buf,
 		  "SELECT c.relchecks, c.relkind, c.relhasindex, c.relhasrules, "
 		  "c.relhastriggers, false, false, c.relhasoids, "
-		  "%s, c.reltablespace, "
+		  "false, %s, c.reltablespace, "
 		  "CASE WHEN c.reloftype = 0 THEN '' ELSE c.reloftype::pg_catalog.regtype::pg_catalog.text END\n"
 		  "FROM pg_catalog.pg_class c\n "
 		  "LEFT JOIN pg_catalog.pg_class tc ON (c.reltoastrelid = tc.oid)\n"
@@ -1587,7 +1606,7 @@ describeOneTableDetails(const char *schemaname,
 		printfPQExpBuffer(&buf,
 		  "SELECT c.relchecks, c.relkind, c.relhasindex, c.relhasrules, "
 		  "c.relhastriggers, false, false, c.relhasoids, "
-		  "%s, c.reltablespace\

Re: speeding up planning with partitions

2019-03-22 Thread Tom Lane

I wrote:
> ... I also
> don't like the undocumented way that you've got build_base_rel_tlists
> working on something that's not the final tlist, and then going back
> and doing more work of the same sort later.  I wonder whether we can
> postpone build_base_rel_tlists until after the other stuff happens,
> so that it can continue to do all of the tlist-distribution work.

On further reflection: we don't really need the full build_base_rel_tlists
pushups for these extra variables.  The existence of the original rowmark
variable will be sufficient, for example, to make sure that we don't
decide the parent partitioned table can be thrown away as a useless join.
All we need to do is add the new Var to the parent's reltarget and adjust
its attr_needed, and we can do that very late in query_planner.  So now
I'm thinking leave build_base_rel_tlists() where it is, and then fix up
the tlist/reltarget data on the fly when we discover that new child rels
need more rowmark types than we had before.  (This'd be another reason to
keep the tlist in root->processed_tlist throughout, likely.)

regards, tom lane

Re: Ordered Partitioned Table Scans

2019-03-22 Thread Julien Rouhaud

On Fri, Mar 22, 2019 at 7:19 PM Robert Haas  wrote:
>
> On Fri, Mar 22, 2019 at 12:40 PM Tom Lane  wrote:
> > Robert Haas  writes:
> > > On Fri, Mar 22, 2019 at 11:56 AM Tom Lane  wrote:
> > >> In cases where, say, the first child requires no sort but also doesn't
> > >> emit very many rows, while the second child requires an expensive sort,
> > >> the planner will have a ridiculously optimistic opinion of the cost of
> > >> fetching slightly more rows than are available from the first child.
> > >> This might lead it to wrongly choose a merge join over a hash for 
> > >> example.
> >
> > > I think this is very much a valid point, especially in view of the
> > > fact that we already choose supposedly fast-start plans too often.  I
> > > don't know whether it's a death sentence for this patch, but it should
> > > at least make us stop and think hard.
> >
> > BTW, another thing we could possibly do to answer this objection is to
> > give the ordered-Append node an artificially pessimistic startup cost,
> > such as the sum or the max of its children's startup costs.  That's
> > pretty ugly and unprincipled, but maybe it's better than not having the
> > ability to generate the plan shape at all?
>
> Yeah, I'm not sure whether that's a good idea or not.  I think one of
> the problems with a cost-based optimizer is that once you start
> putting things in with the wrong cost because you think it will give
> the right answer, you're sorta playing with fire, because you can't
> necessarily predict how things are going are going to turn out in more
> complex scenarios.  On the other hand, it may sometimes be the right
> thing to do.

I've been bitten too many times with super inefficient plans of the
form "let's use the wrong index instead of the good one because I'll
probably find there the tuple I want very quickly", due to LIMIT
assuming an even distribution.   Since those queries can end up taking
dozens of minutes instead of less a ms, without a lot of control to
fix this kind of problem I definitely don't want to introduce another
similar source of pain for users.

However, what we're talking here is still a corner case.  People
having partitioned tables with a mix of partitions with and without
indexes suitable for ORDER BY x LIMIT y queries should already have at
best average performance.  And trying to handle this case cannot hurt
cases where all partitions have suitable indexes, so that may be an
acceptable risk?

I also have mixed feeling about this artificial startup cost penalty,
but if we go this way we can for sure cumulate the startup cost of all
sorts that we think we'll have to perform (according to each path's
estimated rows and the given limit_tuples).  That probably won't be
enough though, especially with fractional paths.

Re: rename labels in heapam.c?

2019-03-22 Thread Tom Lane

Andres Freund  writes:
> On 2019-03-22 17:09:23 -0400, Tom Lane wrote:
>> Is it practical to get rid of the goto's altogether?  If not,
>> renaming would be an improvement.

> I don't think it'd be easy.

Fair enough.  I just wanted to be sure we considered getting rid of
the pig before we put lipstick on it.  I concur that it's not worth
major refactoring to do so.

regards, tom lane

Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-22 Thread Peter Geoghegan

On Thu, Mar 21, 2019 at 10:28 AM Peter Geoghegan  wrote:
> I've committed the first 4 patches. Many thanks to Heikki for his very
> valuable help! Thanks also to the other reviewers.
>
> I'll likely push the remaining two patches on Sunday or Monday.

I noticed that if I initidb and run "make installcheck" with and
without the "split after new tuple" optimization patch, the largest
system catalog indexes shrink quite noticeably:

Master
==
pg_depend_depender_index 1456 kB
pg_depend_reference_index 1416 kB
pg_class_tblspc_relfilenode_index 224 kB

Patch
=
pg_depend_depender_index 1088 kB   -- ~25% smaller
pg_depend_reference_index 1136 kB   -- ~20% smaller
pg_class_tblspc_relfilenode_index 160 kB -- 28% smaller

This is interesting to me because it is further evidence that the
problem that the patch targets is reasonably common. It's also
interesting to me because we benefit despite the fact there are a lot
of duplicates in parts of these indexes; we vary our strategy at
different parts of the key space, which works well. We pack pages
tightly where they're full of duplicates, using the "single value"
strategy that I've already committed, whereas the apply the "split
after new tuple" optimization in parts of the index with localized
monotonically increasing insertions. If there were no duplicates in
the indexes, then they'd be about 40% smaller, which is exactly what
we see with the TPC-C indexes (they're all unique indexes, with very
few physical duplicates). Looks like the duplicates are mostly
bootstrap mode entries. Lots of the pg_depend_depender_index
duplicates look like "(classid, objid, objsubid)=(0, 0, 0)", for
example.

I also noticed one further difference: the pg_shdepend_depender_index
index grew from 40 kB to 48 kB. I guess that might count as a
regression, though I'm not sure that it should. I think that we would
do better if the volume of data in the underlying table was greater.
contrib/pageinspect shows that a small number of the leaf pages in the
improved cases are not very filled at all, which is more than made up
for by the fact that many more pages are packed as if they were
created by a rightmost split (262 items 24 byte tuples is exactly
consistent with that). IOW, I suspect that the extra page in
pg_shdepend_depender_index is due to a "local minimum".

-- 
Peter Geoghegan

Re: rename labels in heapam.c?

2019-03-22 Thread Andres Freund

Hi,

On 2019-03-22 17:09:23 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > For the umpteenth time I was annoyed by the names of labels in
> > heapam.c. It's really not useful to see a 'goto l1;' etc.
> 
> Yeah, those label names are uninformative as can be.
> 
> > How about renaming l1 to retry_delete_locked, l2 to retry_update_locked,
> > l3 to retry_lock_tuple_locked etc? Especially with the subsidiary
> > functions for updates and locking, it's not always clear from context
> > where the goto jumps to.
> 
> Is it practical to get rid of the goto's altogether?  If not,
> renaming would be an improvement.

I don't think it'd be easy. We could probably split
heap_{insert,delete,update} into sub-functions and then have the
toplevel function just loop over invocations of those, but that seems
like a pretty significant refactoring of the code.  Since renaming the
labels isn't going to make that harder, I'm inclined to do that, rather
than wait for a refactoring that, while a good idea, isn't likely to
happen that soon.

Greetings,

Andres Freund

Re: rename labels in heapam.c?

2019-03-22 Thread Tom Lane

Andres Freund  writes:
> For the umpteenth time I was annoyed by the names of labels in
> heapam.c. It's really not useful to see a 'goto l1;' etc.

Yeah, those label names are uninformative as can be.

> How about renaming l1 to retry_delete_locked, l2 to retry_update_locked,
> l3 to retry_lock_tuple_locked etc? Especially with the subsidiary
> functions for updates and locking, it's not always clear from context
> where the goto jumps to.

Is it practical to get rid of the goto's altogether?  If not,
renaming would be an improvement.

regards, tom lane

Re: speeding up planning with partitions

2019-03-22 Thread Tom Lane

Amit Langote  writes:
> [ v34 patch set ]

I had a bit of a look through this.  I went ahead and pushed 0008 and
0009, as they seem straightforward and independent of the rest.  (BTW,
0009 makes 0003's dubious optimization in set_relation_partition_info
quite unnecessary.)  As for the rest:


0001: OK in principle, but I wonder why you implemented it by adding
another recursive scan of the jointree rather than just iterating
over the baserel array, as in make_one_rel() for instance.  Seems
like this way is more work, and more code that we'll have to touch
if we ever change the jointree representation.

I also feel like you used a dartboard while deciding where to insert the
call in query_planner(); dropping it into the middle of a sequence of
equivalence-class-related operations seems quite random.  Maybe we could
delay that step all the way to just before make_one_rel, since the other
stuff in between seems to only care about baserels?  For example,
it'd certainly be better if we could run remove_useless_joins before
otherrel expansion, so that we needn't do otherrel expansion at all
on a table that gets thrown away as being a useless join.

BTW, it strikes me that we could take advantage of the fact that
baserels must all appear before otherrels in the rel array, by
having loops over that array stop early if they're only interested
in baserels.  We could either save the index of the last baserel,
or just extend the loop logic to fall out upon hitting an otherrel.
Seems like this'd save some cycles when there are lots of partitions,
although perhaps loops like that would be fast enough to not matter.


0002: I seriously dislike this from a system-structure viewpoint.
For one thing it makes root->processed_tlist rather moot, since it
doesn't get set till after most of the work is done.  (I don't know
if there are any FDWs out there that look at processed_tlist, but
they'd be broken by this if so.  It'd be better to get rid of the
field if we can, so that at least such breakage is obvious.  Or,
maybe, use root->processed_tlist as the communication field rather
than having the tlist be a param to query_planner?)  I also
don't like the undocumented way that you've got build_base_rel_tlists
working on something that's not the final tlist, and then going back
and doing more work of the same sort later.  I wonder whether we can
postpone build_base_rel_tlists until after the other stuff happens,
so that it can continue to do all of the tlist-distribution work.


0003: I haven't studied this in great detail, but it seems like there's
some pretty random things in it, like this addition in
inheritance_planner:

+   /* grouping_planner doesn't need to see the target children. */
+   subroot->append_rel_list = list_copy(orig_append_rel_list);

That sure seems like a hack unrelated to the purpose of the patch ... and
since list_copy is a shallow copy, I'm not even sure it's safe.  Won't
we be mutating the contained (and shared) AppendRelInfos as we go along?


0004 and 0005: I'm not exactly thrilled about loading more layers of
overcomplication into inheritance_planner, especially not when the
complication then extends out into new global planner flags with
questionable semantics.  We should be striving to get rid of that
function, as previously discussed, and I fear this is just throwing
more roadblocks in the way of doing so.  Can we skip these and come
back to the question after we replace inheritance_planner with
expand-at-the-bottom?


0006: Seems mostly OK, but I'm not very happy about the additional
table_open calls it's throwing into random places in the planner.
Those can be rather expensive.  Why aren't we collecting the appropriate
info during the initial plancat.c consultation of the relcache?


0007: Not really sold on this; it seems like it's going to be a net
negative for cases where there aren't a lot of partitions.

regards, tom lane

rename labels in heapam.c?

2019-03-22 Thread Andres Freund

Hi,

For the umpteenth time I was annoyed by the names of labels in
heapam.c. It's really not useful to see a 'goto l1;' etc.

How about renaming l1 to retry_delete_locked, l2 to retry_update_locked,
l3 to retry_lock_tuple_locked etc? Especially with the subsidiary
functions for updates and locking, it's not always clear from context
where the goto jumps to.

Greetings,

Andres Freund

Re: [PATCH v20] GSSAPI encryption support

2019-03-22 Thread Robbie Harwood

Stephen Frost  writes:

> One of the things that I really didn't care for in this patch was the
> use of the string buffers, without any real checks (except for "oh,
> you tried to allocated over 1G"...) to make sure that the other side
> of the connection wasn't feeding us ridiculous packets, and with the
> resizing of the buffers, etc, that really shouldn't be necessary.
> After chatting with Robbie about these concerns while reading through
> the code, we agreed that we should be able to use fixed buffer sizes
> and use the quite helpful gss_wrap_size_limit() to figure out how much
> data we can encrypt without going over our fixed buffer size.  As
> Robbie didn't have time to implement those changes this past week, I
> did so, and added a bunch more comments and such too, and have now
> gone through and done more testing.  Robbie has said that he should
> have time this upcoming week to review the changes that I made, and
> I'm planning to go back and review other parts of the patch more
> closely now as well.

In general this looks good - there are a couple minor comments inline,
but it's fine.

I wanted to note a couple things about this approach.  It now uses one
more buffer than before (in contrast to the previous approach, which
reused a buffer for received data that was encrypted and decrypted).
Since these are static fixed size buffers, this increases the total
steady-state memory usage by 16k as opposed to re-using the buffer.
This may be fine; I don't know how tight RAM is here.

> Note that there's an issue with exporting the context to get the
> encryption algorithm used that I've asked Robbie to look into, so
> that's no longer done and instead we just print that the connection is
> encrypted, if it is.  If we can't figure out a way to make that work
> then obviously I'll pull out that code, and if we can get it to work
> then I'll update it to be done through libpq properly, as I had
> suggested earlier.  That's really more of a nice to have in any case
> though, so I may just exclude it for now anyway if it ends up adding
> any complications.

Correct.  Unfortunately I'd overlooked that the lucid interface won't
meet our needs (destroys the context).  So the two options here are:
SASL SSF (and I'll separately push more mechs to add support for that),
or nothing at all.

If you want a patch for that I can make one, but I think there was code
already... just needed a ./configure check program for whether the OID
is defined.

> +ssize_t
> +be_gssapi_write(Port *port, void *ptr, size_t len)
> +{
> + size_t  bytes_to_encrypt = len;
> + size_t  bytes_encrypted = 0;
> +
> + /*
> +  * Loop through encrypting data and sending it out until
> +  * secure_raw_write() complains (which would likely mean that the socket
> +  * is non-blocking and the requested send() would block, or there was 
> some
> +  * kind of actual error) and then return.
> +  */
> + while (bytes_to_encrypt || PqGSSSendPointer)
> + {

I guess it's not a view everyone will share, but I think this block is
too long.  Maybe a helper function around secure_raw_write() would help?
(The check-and-send part at the start.)

I have similar nits about the other functions that don't fit on my
(86-line high) screen, though I guess a lot of this is due to project
style using a lot of vertical space.

> + if (GSS_ERROR(major))
> + pg_GSS_error(FATAL, gettext_noop("GSSAPI context error"),

I'd prefer this to be a different error message than the init/accept
errors - maybe something like "GSSAPI size check error"?

> + if (GSS_ERROR(major))
> + pg_GSS_error(libpq_gettext("GSSAPI context error"), 
> conn,

Same here.

Again, these are nits, and I think I'm okay with the changes.

Thanks,
--Robbie

signature.asc
Description: PGP signature

Introduce MIN/MAX aggregate functions to pg_lsn

2019-03-22 Thread Fabrízio de Royes Mello

Hi all,

Before we introduce pg_lsn datatype the LSN was expressed as a TEXT type,
so a simple query using MIN/MAX functions works as expected. Query like:

SELECT min(restart_lsn) FROM pg_replication_slots;
SELECT min(sent_lsn) FROM pg_stat_replication ;

So attached patch aims to introduce MIN/MAX aggregate functions to pg_lsn
datatype.

Regards,

-- 
   Fabrízio de Royes Mello Timbira - http://www.timbira.com.br/
   PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 1a01473..490f3a8 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -14663,7 +14663,7 @@ NULL baz(3 rows)

max(expression)
   
-  any numeric, string, date/time, network, or enum type,
+  any numeric, string, date/time, network, lsn, or enum type,
  or arrays of these types
   same as argument type
   Yes
@@ -14681,7 +14681,7 @@ NULL baz(3 rows)

min(expression)
   
-  any numeric, string, date/time, network, or enum type,
+  any numeric, string, date/time, network, lsn, or enum type,
  or arrays of these types
   same as argument type
   Yes
diff --git a/src/backend/utils/adt/pg_lsn.c b/src/backend/utils/adt/pg_lsn.c
index 7242d3c..ab393bc 100644
--- a/src/backend/utils/adt/pg_lsn.c
+++ b/src/backend/utils/adt/pg_lsn.c
@@ -155,6 +155,30 @@ pg_lsn_ge(PG_FUNCTION_ARGS)
 	PG_RETURN_BOOL(lsn1 >= lsn2);
 }
 
+Datum
+pg_lsn_larger(PG_FUNCTION_ARGS)
+{
+	XLogRecPtr	lsn1 = PG_GETARG_LSN(0);
+	XLogRecPtr	lsn2 = PG_GETARG_LSN(1);
+	XLogRecPtr	result;
+
+	result = ((lsn1 > lsn2) ? lsn1 : lsn2);
+
+	PG_RETURN_LSN(result);
+}
+
+Datum
+pg_lsn_smaller(PG_FUNCTION_ARGS)
+{
+	XLogRecPtr	lsn1 = PG_GETARG_LSN(0);
+	XLogRecPtr	lsn2 = PG_GETARG_LSN(1);
+	XLogRecPtr	result;
+
+	result = ((lsn1 < lsn2) ? lsn1 : lsn2);
+
+	PG_RETURN_LSN(result);
+}
+
 /* btree index opclass support */
 Datum
 pg_lsn_cmp(PG_FUNCTION_ARGS)
diff --git a/src/include/catalog/pg_aggregate.dat b/src/include/catalog/pg_aggregate.dat
index 044695a..242d843 100644
--- a/src/include/catalog/pg_aggregate.dat
+++ b/src/include/catalog/pg_aggregate.dat
@@ -146,6 +146,9 @@
 { aggfnoid => 'max(inet)', aggtransfn => 'network_larger',
   aggcombinefn => 'network_larger', aggsortop => '>(inet,inet)',
   aggtranstype => 'inet' },
+{ aggfnoid => 'max(pg_lsn)', aggtransfn => 'pg_lsn_larger',
+  aggcombinefn => 'pg_lsn_larger', aggsortop => '>(pg_lsn,pg_lsn)',
+  aggtranstype => 'pg_lsn' },
 
 # min
 { aggfnoid => 'min(int8)', aggtransfn => 'int8smaller',
@@ -208,6 +211,9 @@
 { aggfnoid => 'min(inet)', aggtransfn => 'network_smaller',
   aggcombinefn => 'network_smaller', aggsortop => '<(inet,inet)',
   aggtranstype => 'inet' },
+{ aggfnoid => 'min(pg_lsn)', aggtransfn => 'pg_lsn_smaller',
+  aggcombinefn => 'pg_lsn_smaller', aggsortop => '<(pg_lsn,pg_lsn)',
+  aggtranstype => 'pg_lsn' },
 
 # count
 { aggfnoid => 'count(any)', aggtransfn => 'int8inc_any',
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index acf1131..cfc9b86 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -6189,6 +6189,9 @@
 { oid => '3564', descr => 'maximum value of all inet input values',
   proname => 'max', prokind => 'a', proisstrict => 'f', prorettype => 'inet',
   proargtypes => 'inet', prosrc => 'aggregate_dummy' },
+{ oid => '8125', descr => 'maximum value of all pg_lsn input values',
+  proname => 'max', prokind => 'a', proisstrict => 'f', prorettype => 'pg_lsn',
+  proargtypes => 'pg_lsn', prosrc => 'aggregate_dummy' },
 
 { oid => '2131', descr => 'minimum value of all bigint input values',
   proname => 'min', prokind => 'a', proisstrict => 'f', prorettype => 'int8',
@@ -6253,6 +6256,9 @@
 { oid => '3565', descr => 'minimum value of all inet input values',
   proname => 'min', prokind => 'a', proisstrict => 'f', prorettype => 'inet',
   proargtypes => 'inet', prosrc => 'aggregate_dummy' },
+{ oid => '8126', descr => 'minimum value of all pg_lsn input values',
+  proname => 'min', prokind => 'a', proisstrict => 'f', prorettype => 'pg_lsn',
+  proargtypes => 'pg_lsn', prosrc => 'aggregate_dummy' },
 
 # count has two forms: count(any) and count(*)
 { oid => '2147',
@@ -8355,6 +8361,12 @@
 { oid => '3413', descr => 'hash',
   proname => 'pg_lsn_hash_extended', prorettype => 'int8',
   proargtypes => 'pg_lsn int8', prosrc => 'pg_lsn_hash_extended' },
+{ oid => '8123', descr => 'larger of two',
+  proname => 'pg_lsn_larger', prorettype => 'pg_lsn',
+  proargtypes => 'pg_lsn pg_lsn', prosrc => 'pg_lsn_larger' },
+{ oid => '8124', descr => 'smaller of two',
+  proname => 'pg_lsn_smaller', prorettype => 'pg_lsn',
+  proargtypes => 'pg_lsn pg_lsn', prosrc => 'pg_lsn_smaller' },
 
 # enum related procs
 { oid => '3504', descr => 'I/O',
diff --git a/src/test/regress/expected/pg_lsn.out b/src/test/regress/expected/pg_lsn.out
index 2854cfd..64d41df 100644
--- a/sr

Re: propagating replica identity to partitions

2019-03-22 Thread Peter Eisentraut

On 2019-03-22 17:52, Alvaro Herrera wrote:
> To recap: my proposed change is to make
> ALTER TABLE ... REPLICA IDENTITY
> when applied on a partitioned table affect all of its partitions instead
> of expecting the user to invoke the command for each partition.

If you are operating on a partitioned table and set the replica identity
to the primary key or a partitioned index of that partitioned table,
then I think, by definition of what it means to be a partitioned index,
that applies to the whole partition hierarchy.

Aside from that theoretical consideration, what would be the practical
use of not doing that?

> At the
> same time, I am proposing not to change to have recursive behavior other
> forms of ALTER TABLE in one commit, such as TABLESPACE and OWNER TO,
> which currently do not have recursive behavior.

I'm slightly baffled that we would even allow having different owners on
different partitions, but that seems to be a separate discussion.

In general, it seems sensible that if you operate on a partitioned
table, the whole partition hierarchy is affected unless told otherwise.
There may be sensible exceptions, but it seems useful as the default.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: shared-memory based stats collector

2019-03-22 Thread Andres Freund

Ping?  Unless there's a new version pretty soon, we're going to have to
move this to the next CF, I think.

Re: [HACKERS] CLUSTER command progress monitor

2019-03-22 Thread Robert Haas

On Tue, Mar 19, 2019 at 2:47 PM Robert Haas  wrote:
> how close you were getting to rewriting the entire heap.  This is the
> one thing I found but did not fix; any chance you could make this
> change and update the documentation to match?

Hi, is anybody working on this?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Psql patch to show access methods info

2019-03-22 Thread Sergey Cherkashin

Taking into account the wishes of all the reviewers, the current
position of the patch is as follows:

The \dA command displays a list of access methods.

# \dA
List of access methods
  Name  | Type  |   Handler
+---+--
 brin   | index | brinhandler
 btree  | index | bthandler
 gin| index | ginhandler
 gist   | index | gisthandler
 hash   | index | hashhandler
 heap   | table | heap_tableam_handler
 spgist | index | spghandler
(7 rows)

With + it shows description:
# \dA+
 List of access methods
  Name  |
Type  |   Handler|  Description   
+---+--+---
-
 brin   | index | brinhandler  | block range index (BRIN)
access method
 btree  | index | bthandler| b-tree index access method
 gin| index | ginhandler   | GIN index access method
 gist   | index | gisthandler  | GiST index access method
 hash   | index | hashhandler  | hash index access method
 heap   | table | heap_tableam_handler | heap table access method
 spgist | index | spghandler   | SP-GiST index access method
(7 rows)

The functionality of the \dAp command has been moved to \dA NAME.
Now the user can query the properties of a particular AM (or several,
using the search pattern) as follows:

# \dA h*
 Index access
method properties
  AM  | Can order | Support unique indexes | Support indexes with
multiple columns | Support exclusion constraints | Can include non-key
columns 
--+---++---
+---+
-
 hash | no| no |
no| yes   |
no
(1 row)

 Table access method properties
 Name | Type  |   Handler|   Description
--+---+--+--
 heap | table | heap_tableam_handler | heap table access method
(1 row)

Note that for heap, as well as for future table AM, a separate table is
displayed, since it is not clear which properties can be displayed for
them.

The \dAoc command has been renamed to \dAc.
The command displays information about operator classes. The "Input
type" field was left, because the user may first be interested in what
type of data opclass can work with,
and in the second - how it will keep this type inside. Nikita also
chose to leave the opfamily field as additional information.

# \dAc btree name
 Index access method operator classes
  AM   | Input type | Storage type | Operator class | Default? 
---++--++--
 btree | name   | cstring  | name_ops   | yes
(1 row)

# \dAc+ btree record
Index access method operator classes
  AM   | Input type | Storage type |  Operator class  | Default? |
Operator family  | Owner 
---++--+--+--+-
-+---
 btree | record |  | record_image_ops | no   |
record_image_ops | zloj
 btree | record |  | record_ops   | yes  |
record_ops   | zloj
(2 rows)

The \dAfo command has been renamed to \dAo.
\dAo displays information about operators as follows:

# \dAo gin jsonb_ops
 List operators of family related to access method
 AM  | Opfamily Schema | Opfamily Name |  Operator  
-+-+---+
 gin | pg_catalog  | jsonb_ops | @> (jsonb, jsonb)
 gin | pg_catalog  | jsonb_ops | ? (jsonb, text)
 gin | pg_catalog  | jsonb_ops | ?| (jsonb, text[])
 gin | pg_catalog  | jsonb_ops | ?& (jsonb, text[])
(4 rows)

# \dAo+ gist circle_ops
 List operators of family related to access
method
  AM  | Opfamily Schema | Opfamily Name |   Operator   |
Strategy | Purpose  | Sort family 
--+-+---+--+---
---+--+-
 gist | pg_catalog  | circle_ops| << (circle,
circle)  |1 | search   | 
 ... 
 gist | pg_catalog  | circle_ops| <-> (circle,
point)  |   15 | ordering | float_ops

The \dAop command has been renamed to \dAp.
It displays list of support procedures associated with access method
operator families.
# \dAp hash array_ops 
List of operator family procedures
  AM  | Family schema | Family name |   Left   |  Right   | Number 
--+---+-+--+--+
 hash | pg_catalog| array_ops   | anyarray | anyarray |  1
 hash | pg_catalog| array_ops   | anyarray | anyarray |  2
(2 rows)

# \dAp+ hash array_ops 
   List of operator

Re: Concurrency bug with vacuum full (cluster) and toast

2019-03-22 Thread Robert Haas

On Tue, Mar 19, 2019 at 1:37 PM Alexander Korotkov
 wrote:
> Thank you for pointing, but none of the threads you pointed describe
> this exact problem.  Now I see this bug have a set of cute siblings :)

Yeah.  I really thought this precise issue -- the interlocking between
the VACUUM of the main table and the VACUUM of the TOAST table -- had
been discussed somewhere before.  But I couldn't find that discussion.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: New vacuum option to do only freezing

2019-03-22 Thread Robert Haas

On Fri, Mar 8, 2019 at 12:14 AM Masahiko Sawada  wrote:
> IIUC we've discussed the field-and-value style vacuum option. I
> suggested that since we have already the disable_page_skipping option
> the disable_page_skipping option would be more natural style and
> consistent. I think "VACUUM (INDEX_CLEANUP false)" seems consistent
> with its reloption but not with other vacuum options. So why does only
> this option (and probably up-coming new options) need to support new
> style? Do we need the same change to the existing options?

Well, it's too late to change to change DISABLE_PAGE_SKIPPING to work
some other way; it's been released, and we're stuck with it at this
point.  However, I generally believe that it is preferable to phrase
options positively then negatively, so that for example one writes
EXPLAIN (ANALYZE, TIMING OFF) not EXPLAIN (ANALYZE, NO_TIMING).  So
I'd like to do it that way for the new options that we're proposing to
add.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Ordered Partitioned Table Scans

2019-03-22 Thread Robert Haas

On Fri, Mar 22, 2019 at 12:40 PM Tom Lane  wrote:
> Robert Haas  writes:
> > On Fri, Mar 22, 2019 at 11:56 AM Tom Lane  wrote:
> >> In cases where, say, the first child requires no sort but also doesn't
> >> emit very many rows, while the second child requires an expensive sort,
> >> the planner will have a ridiculously optimistic opinion of the cost of
> >> fetching slightly more rows than are available from the first child.
> >> This might lead it to wrongly choose a merge join over a hash for example.
>
> > I think this is very much a valid point, especially in view of the
> > fact that we already choose supposedly fast-start plans too often.  I
> > don't know whether it's a death sentence for this patch, but it should
> > at least make us stop and think hard.
>
> BTW, another thing we could possibly do to answer this objection is to
> give the ordered-Append node an artificially pessimistic startup cost,
> such as the sum or the max of its children's startup costs.  That's
> pretty ugly and unprincipled, but maybe it's better than not having the
> ability to generate the plan shape at all?

Yeah, I'm not sure whether that's a good idea or not.  I think one of
the problems with a cost-based optimizer is that once you start
putting things in with the wrong cost because you think it will give
the right answer, you're sorta playing with fire, because you can't
necessarily predict how things are going are going to turn out in more
complex scenarios.  On the other hand, it may sometimes be the right
thing to do.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Offline enabling/disabling of data checksums

2019-03-22 Thread Fabien COELHO




Hello Christoph,


-   pg_log(PG_PROGRESS, "syncing target data directory\n");
-   syncTargetDirectory();


Doesn't the control file still need syncing?


Indeed it does, and it is done in update_controlfile if the last argument 
is true. Basically update_controlfile latest version always fsync the 
control file, unless explicitely told not to do so. The options to do that 
are really there only to speed up non regression tests.


--
Fabien.

Re: Problem with default partition pruning

2019-03-22 Thread Thibaut Madelaine



Le 22/03/2019 à 07:38, Amit Langote a écrit :
> Hosoya-san,
>
> On 2019/03/22 15:02, Yuzuko Hosoya wrote:
>> I understood Amit's proposal.  But I think the issue Thibaut reported would 
>> occur regardless of whether clauses have OR clauses or not as follows.
>> I tested a query which should output "One-Time Filter: false".
>>
>> # explain select * from test2_0_20 where id = 25;
>>   QUERY PLAN   
>> ---
>>  Append  (cost=0.00..25.91 rows=6 width=36)
>>->  Seq Scan on test2_10_20_def  (cost=0.00..25.88 rows=6 width=36)
>>  Filter: (id = 25)
>>
> Good catch, thanks.
>
>> As Amit described in the previous email, id = 25 contradicts test2_0_20's
>> partition constraint, so I think this clause should be ignored and we can
>> also handle this case in the similar way as Amit proposal.
>>
>> I attached v1-delta-2.patch which fix the above issue.  
>>
>> What do you think about it?
> It looks fine to me.  You put the code block to check whether a give
> clause contradicts the partition constraint in its perfect place. :)
>
> Maybe we should have two patches as we seem to be improving two things:
>
> 1. Patch to fix problems with default partition pruning originally
> reported by Hosoya-san
>
> 2. Patch to determine if a given clause contradicts a sub-partitioned
> table's partition constraint, fixing problems unearthed by Thibaut's tests
>
> About the patch that Horiguchi-san proposed upthread, I think it has merit
> that it will make partprune.c code easier to reason about, but I think we
> should pursue it separately.
>
> Thanks,
> Amit

Hosoya-san, very good idea to run queries directly on tables partitions!

I tested your last patch and if I didn't mix up patches on the end of a
too long week, I get a problem when querying the sub-sub partition:

test=# explain select * from test2_0_10 where id = 25;
 QUERY PLAN

 Seq Scan on test2_0_10  (cost=0.00..25.88 rows=6 width=36)
   Filter: (id = 25)
(2 rows)


Cordialement,

Thibaut

Error message inconsistency

2019-03-22 Thread Simon Riggs

As noted by a PostgreSQL user to me, error messages for NOT NULL
constraints are inconsistent - they do not mention the relation name in the
message, as all other variants of this message do. e.g.

postgres=# create table nn (id integer not null);
CREATE TABLE
postgres=# insert into nn values (NULL);
ERROR: null value in column "id" violates not-null constraint
DETAIL: Failing row contains (null).

postgres=# create table nn2 (id integer check (id is not null));
CREATE TABLE
postgres=# insert into nn2 values (NULL);
ERROR: new row for relation "nn2" violates check constraint "nn2_id_check"
DETAIL: Failing row contains (null).

I propose the attached patch as a fix, changing the wording (of the first
case) to
ERROR: null value in column "id" for relation "nn" violates not-null
constraint

It causes breakage in multiple tests, which is easy to fix once/if we agree
to change.

Thanks

-- 
Simon Riggshttp://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


rationalize_constraint_error_messages.v1.patch
Description: Binary data

Re: Enable data checksums by default

2019-03-22 Thread Andres Freund

On 2019-03-22 18:01:32 +0100, Tomas Vondra wrote:
> On 3/22/19 5:41 PM, Andres Freund wrote:
> > Hi,
> > 
> > On 2019-03-22 17:32:10 +0100, Tomas Vondra wrote:
> >> On 3/22/19 5:10 PM, Andres Freund wrote:
> >>> IDK, being able to verify in some form that backups aren't corrupted on
> >>> an IO level is mighty nice. That often does allow to detect the issue
> >>> while one still has older backups around.
> >>>
> >>
> >> Yeah, I agree that's a valuable capability. I think the question is how
> >> effective it actually is considering how much the storage changed over
> >> the past few years (which necessarily affects the type of failures
> >> people have to deal with).
> > 
> > I'm not sure I understand? How do the changes around storage
> > meaningfully affect the need to have some trust in backups and
> > benefiting from earlier detection?
> > 
> 
> Having trusted in backups is still desirable - nothing changes that,
> obviously. The question I was posing was rather "Are checksums still
> effective on current storage systems?"
> 
> I'm wondering if the storage systems people use nowadays may be failing
> in ways that are not reliably detectable by checksums. I don't have any
> data to either support or reject that hypothesis, though.

I don't think it's useful to paint unsubstantiated doom-and-gloom
pictures.


> >> It's not clear to me what can checksums do about zeroed pages (and/or
> >> truncated files) though.
> > 
> > Well, there's nothing fundamental about needing added pages be
> > zeroes. We could expand them to be initialized with actual valid
> > checksums instead of
> > /* new buffers are zero-filled */
> > MemSet((char *) bufBlock, 0, BLCKSZ);
> > /* don't set checksum for all-zero page */
> > smgrextend(smgr, forkNum, blockNum, (char *) bufBlock, false);
> > 
> > the problem is that it's hard to do so safely without adding a lot of
> > additional WAL logging. A lot of filesystems will journal metadata
> > changes (like the size of the file), but not contents. So after a crash
> > the tail end might appear zeroed out, even if we never wrote
> > zeroes. That's obviously solvable by WAL logging, but that's not cheap.
> > 
> 
> Hmmm. I'd say a filesystem that does not guarantee having all the data
> after an fsync is outright broken, but maybe that's what checksums are
> meant to protect against.

There's no fsync here. smgrextend(with-valid-checksum);crash; - the OS
will probably have journalled the file size change, but not the
contents. After a crash it's thus likely that the data page will appear
zeroed.  Which prevents us from erroring out when encountering a zeroed
page, even though that'd be very good for error detection capabilities,
because storage systems will show corrupted data as zeroes in a number
of cases.

Greetings,

Andres Freund

Re: Enable data checksums by default

2019-03-22 Thread Tomas Vondra

On 3/22/19 5:41 PM, Andres Freund wrote:
> Hi,
> 
> On 2019-03-22 17:32:10 +0100, Tomas Vondra wrote:
>> On 3/22/19 5:10 PM, Andres Freund wrote:
>>> IDK, being able to verify in some form that backups aren't corrupted on
>>> an IO level is mighty nice. That often does allow to detect the issue
>>> while one still has older backups around.
>>>
>>
>> Yeah, I agree that's a valuable capability. I think the question is how
>> effective it actually is considering how much the storage changed over
>> the past few years (which necessarily affects the type of failures
>> people have to deal with).
> 
> I'm not sure I understand? How do the changes around storage
> meaningfully affect the need to have some trust in backups and
> benefiting from earlier detection?
> 

Having trusted in backups is still desirable - nothing changes that,
obviously. The question I was posing was rather "Are checksums still
effective on current storage systems?"

I'm wondering if the storage systems people use nowadays may be failing
in ways that are not reliably detectable by checksums. I don't have any
data to either support or reject that hypothesis, though.

> 
>> It's not clear to me what can checksums do about zeroed pages (and/or
>> truncated files) though.
> 
> Well, there's nothing fundamental about needing added pages be
> zeroes. We could expand them to be initialized with actual valid
> checksums instead of
>   /* new buffers are zero-filled */
>   MemSet((char *) bufBlock, 0, BLCKSZ);
>   /* don't set checksum for all-zero page */
>   smgrextend(smgr, forkNum, blockNum, (char *) bufBlock, false);
> 
> the problem is that it's hard to do so safely without adding a lot of
> additional WAL logging. A lot of filesystems will journal metadata
> changes (like the size of the file), but not contents. So after a crash
> the tail end might appear zeroed out, even if we never wrote
> zeroes. That's obviously solvable by WAL logging, but that's not cheap.
> 

Hmmm. I'd say a filesystem that does not guarantee having all the data
after an fsync is outright broken, but maybe that's what checksums are
meant to protect against.

> It might still be a good idea to just write a page with an initialized
> header / checksum at that point, as that ought to still detect a number
> of problems we can't detect right now.
> 

Sounds reasonable.

cheers

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: propagating replica identity to partitions

2019-03-22 Thread Alvaro Herrera

On 2019-Mar-22, Robert Haas wrote:

> On Thu, Mar 21, 2019 at 5:01 PM Alvaro Herrera  
> wrote:
> > I already argued that TABLESPACE and OWNER TO are documented to work
> > that way, and have been for a long time, whereas REPLICA IDENTITY has
> > never been.  If you want to change long-standing behavior, be my guest,
> > but that's not my patch.  On the other hand, there's no consensus that
> > those should be changed, whereas there no opposition specifically
> > against changing this one, and in fact it was reported as a bug to me by
> > actual users.
> 
> Well, you have a commit bit, and I cannot prevent you from using it,
> and nobody else is backing me up here, but it doesn't change my
> opinion.

Can I ask for more opinions here?  Should I apply this behavior change
to pg12 or not?  If there's no consensus that it should be changed, I
wouldn't change it.

To recap: my proposed change is to make
ALTER TABLE ... REPLICA IDENTITY
when applied on a partitioned table affect all of its partitions instead
of expecting the user to invoke the command for each partition.  At the
same time, I am proposing not to change to have recursive behavior other
forms of ALTER TABLE in one commit, such as TABLESPACE and OWNER TO,
which currently do not have recursive behavior.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Enable data checksums by default

2019-03-22 Thread Andres Freund

Hi,

On 2019-03-22 17:32:10 +0100, Tomas Vondra wrote:
> On 3/22/19 5:10 PM, Andres Freund wrote:
> > IDK, being able to verify in some form that backups aren't corrupted on
> > an IO level is mighty nice. That often does allow to detect the issue
> > while one still has older backups around.
> > 
> 
> Yeah, I agree that's a valuable capability. I think the question is how
> effective it actually is considering how much the storage changed over
> the past few years (which necessarily affects the type of failures
> people have to deal with).

I'm not sure I understand? How do the changes around storage
meaningfully affect the need to have some trust in backups and
benefiting from earlier detection?

> It's not clear to me what can checksums do about zeroed pages (and/or
> truncated files) though.

Well, there's nothing fundamental about needing added pages be
zeroes. We could expand them to be initialized with actual valid
checksums instead of
/* new buffers are zero-filled */
MemSet((char *) bufBlock, 0, BLCKSZ);
/* don't set checksum for all-zero page */
smgrextend(smgr, forkNum, blockNum, (char *) bufBlock, false);

the problem is that it's hard to do so safely without adding a lot of
additional WAL logging. A lot of filesystems will journal metadata
changes (like the size of the file), but not contents. So after a crash
the tail end might appear zeroed out, even if we never wrote
zeroes. That's obviously solvable by WAL logging, but that's not cheap.

It might still be a good idea to just write a page with an initialized
header / checksum at that point, as that ought to still detect a number
of problems we can't detect right now.

Greetings,

Andres Freund

Re: Ordered Partitioned Table Scans

2019-03-22 Thread Tom Lane

Robert Haas  writes:
> On Fri, Mar 22, 2019 at 11:56 AM Tom Lane  wrote:
>> In cases where, say, the first child requires no sort but also doesn't
>> emit very many rows, while the second child requires an expensive sort,
>> the planner will have a ridiculously optimistic opinion of the cost of
>> fetching slightly more rows than are available from the first child.
>> This might lead it to wrongly choose a merge join over a hash for example.

> I think this is very much a valid point, especially in view of the
> fact that we already choose supposedly fast-start plans too often.  I
> don't know whether it's a death sentence for this patch, but it should
> at least make us stop and think hard.

BTW, another thing we could possibly do to answer this objection is to
give the ordered-Append node an artificially pessimistic startup cost,
such as the sum or the max of its children's startup costs.  That's
pretty ugly and unprincipled, but maybe it's better than not having the
ability to generate the plan shape at all?

regards, tom lane

Re: Ordered Partitioned Table Scans

2019-03-22 Thread Robert Haas

On Fri, Mar 22, 2019 at 12:21 PM Tom Lane  wrote:
> Once again: this objection is not a "death sentence for this patch".
> I simply wish to suppress the option to generate an ordered Append
> when some of the inputs would require an added sort step.  As long
> as we have pre-ordered paths for all children, go for it.

I stand corrected.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: propagating replica identity to partitions

2019-03-22 Thread Robert Haas

On Thu, Mar 21, 2019 at 5:01 PM Alvaro Herrera  wrote:
> I already argued that TABLESPACE and OWNER TO are documented to work
> that way, and have been for a long time, whereas REPLICA IDENTITY has
> never been.  If you want to change long-standing behavior, be my guest,
> but that's not my patch.  On the other hand, there's no consensus that
> those should be changed, whereas there no opposition specifically
> against changing this one, and in fact it was reported as a bug to me by
> actual users.

Well, you have a commit bit, and I cannot prevent you from using it,
and nobody else is backing me up here, but it doesn't change my
opinion.

I think it is HIGHLY irresponsible for you to try to characterize
clear behavior changes in this area as "bug fixes."  The fact that a
user say that something is a bug does not make it a bug -- and you
have been a committer long enough to know the difference between
repairing a defect and inventing entirely new behavior.  Yet you keep
doing the latter and calling the former.

Commit 33e6c34c32677a168bee4bc6c335aa8d73211a56 is a clear behavior
change for partitioned indexes and yet has the innocuous subject line
"Fix tablespace handling for partitioned indexes."  Unbelievably, it
was back-patched into 11.1.  Everyone except you agreed that it
created an inconsistency between tables and indexes, so commit
ca4103025dfe26eaaf6a500dec9170fbb176eebc repaired that by doing the
same thing for tables.  That one wasn't back-patched, but it was still
described as "Fix tablespace handling for partitioned tables", even
though I said repeatedly that it wasn't a fix, because the actual
behavior was the design behavior, and as the major reviewer and
committer of those patches I ought to know.  That latter patch also
broke stuff which it looks like you haven't fixed yet:

https://www.postgresql.org/message-id/cakjs1f_1c260not_vbj067az3jxptxvrohdvmlebmudx1ye...@mail.gmail.com

That email thread even includes clear definitional concerns about
whether this behavior is even properly designed:

https://www.postgresql.org/message-id/20190306161744.22jdkg37fyi2zyke%40alap3.anarazel.de

But I assume that's not going to stop you from propagating the same
kinds of problems into more places.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Enable data checksums by default

2019-03-22 Thread Tomas Vondra

On 3/22/19 5:10 PM, Andres Freund wrote:
> Hi,
> 
> On 2019-03-22 12:07:22 -0400, Tom Lane wrote:
>> Christoph Berg  writes:
>>> I think, the next step in that direction would be to enable data
>>> checksums by default. They make sense in most setups,
>>
>> Well, that is exactly the point that needs some proof, not just
>> an unfounded assertion.
>>
>> IMO, the main value of checksums is that they allow the Postgres
>> project to deflect blame.  That's nice for us but I'm not sure
>> that it's a benefit for users.  I've seen little if any data to
>> suggest that checksums actually catch enough problems to justify
>> the extra CPU costs and the risk of false positives.
> 

I'm not sure about checksums being an effective tool to deflect blame.
Considering the recent fsync retry issues - due to the assumption that
we can just retry fsync we might have lost some of the writes, resulting
in torn pages and checksum failures. I'm sure we could argue about how
much sense the fsync behavior makes, but I doubt checksum failures are
enough to deflect blame here.

> IDK, being able to verify in some form that backups aren't corrupted on
> an IO level is mighty nice. That often does allow to detect the issue
> while one still has older backups around.
> 

Yeah, I agree that's a valuable capability. I think the question is how
effective it actually is considering how much the storage changed over
the past few years (which necessarily affects the type of failures
people have to deal with).

> My problem is more that I'm not confident the checks are mature
> enough. The basebackup checks are atm not able to detect random data,
> and neither basebackup nor backend checks detect zeroed out files/file
> ranges.
> 

Yep :-( The pg_basebackup vulnerability to random garbage in a page
header is unfortunate, we better improve that.

It's not clear to me what can checksums do about zeroed pages (and/or
truncated files) though.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Ordered Partitioned Table Scans

2019-03-22 Thread Tom Lane

Robert Haas  writes:
> On Fri, Mar 22, 2019 at 11:56 AM Tom Lane  wrote:
>> In cases where, say, the first child requires no sort but also doesn't
>> emit very many rows, while the second child requires an expensive sort,
>> the planner will have a ridiculously optimistic opinion of the cost of
>> fetching slightly more rows than are available from the first child.
>> This might lead it to wrongly choose a merge join over a hash for example.

> I think this is very much a valid point, especially in view of the
> fact that we already choose supposedly fast-start plans too often.  I
> don't know whether it's a death sentence for this patch, but it should
> at least make us stop and think hard.

Once again: this objection is not a "death sentence for this patch".
I simply wish to suppress the option to generate an ordered Append
when some of the inputs would require an added sort step.  As long
as we have pre-ordered paths for all children, go for it.

regards, tom lane

Re: Ordered Partitioned Table Scans

2019-03-22 Thread Robert Haas

On Fri, Mar 22, 2019 at 11:56 AM Tom Lane  wrote:
> In cases where, say, the first child requires no sort but also doesn't
> emit very many rows, while the second child requires an expensive sort,
> the planner will have a ridiculously optimistic opinion of the cost of
> fetching slightly more rows than are available from the first child.
> This might lead it to wrongly choose a merge join over a hash for example.

I think this is very much a valid point, especially in view of the
fact that we already choose supposedly fast-start plans too often.  I
don't know whether it's a death sentence for this patch, but it should
at least make us stop and think hard.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Enable data checksums by default

2019-03-22 Thread Andres Freund

Hi,

On 2019-03-22 12:07:22 -0400, Tom Lane wrote:
> Christoph Berg  writes:
> > I think, the next step in that direction would be to enable data
> > checksums by default. They make sense in most setups,
> 
> Well, that is exactly the point that needs some proof, not just
> an unfounded assertion.
> 
> IMO, the main value of checksums is that they allow the Postgres
> project to deflect blame.  That's nice for us but I'm not sure
> that it's a benefit for users.  I've seen little if any data to
> suggest that checksums actually catch enough problems to justify
> the extra CPU costs and the risk of false positives.

IDK, being able to verify in some form that backups aren't corrupted on
an IO level is mighty nice. That often does allow to detect the issue
while one still has older backups around.

My problem is more that I'm not confident the checks are mature
enough. The basebackup checks are atm not able to detect random data,
and neither basebackup nor backend checks detect zeroed out files/file
ranges.

Greetings,

Andres Freund

Re: Enable data checksums by default

2019-03-22 Thread Tom Lane

Christoph Berg  writes:
> I think, the next step in that direction would be to enable data
> checksums by default. They make sense in most setups,

Well, that is exactly the point that needs some proof, not just
an unfounded assertion.

IMO, the main value of checksums is that they allow the Postgres
project to deflect blame.  That's nice for us but I'm not sure
that it's a benefit for users.  I've seen little if any data to
suggest that checksums actually catch enough problems to justify
the extra CPU costs and the risk of false positives.

regards, tom lane

Re: Ordered Partitioned Table Scans

2019-03-22 Thread Tom Lane

David Rowley  writes:
> Thanks for explaining.  I see where you're coming from now.  I think
> this point would carry more weight if using the Append instead of the
> MergeAppend were worse in some cases as we could end up using an
> inferior plan accidentally.  However, that's not the case. The Append
> plan should always perform better both for startup and pulling a
> single row all the way to pulling the final row.  The underlying
> subplans are the same in each case, but Append has the additional
> saving of not having to determine to perform a sort on the top row
> from each subpath.

Uh, what?  sorted-Append and MergeAppend would need pre-sorts on
exactly the same set of children.  It's true that the Append path
might not have to actually execute some of those sorts, if it's
able to stop in an earlier child.  The problem here is basically
that it's hard to predict whether that will happen.

> Append always wins over MergeAppend, so I
> don't quite understand your reasoning here.

The problem is that the planner is likely to favor a "fast-start"
Append *too much*, and prefer it over some other plan altogether.

In cases where, say, the first child requires no sort but also doesn't
emit very many rows, while the second child requires an expensive sort,
the planner will have a ridiculously optimistic opinion of the cost of
fetching slightly more rows than are available from the first child.
This might lead it to wrongly choose a merge join over a hash for example.

Yes, there are cases where Append-with-some-sorts is preferable to
MergeAppend-with-some-sorts, and maybe I'd even believe that it
always is.  But I don't believe that it's necessarily preferable
to plans that don't require a sort at all, and I'm afraid that we
are likely to find the planner making seriously bad choices when
it's presented with such situations.  I'd rather we leave that
case out for now, until we have some better way of modelling it.

The fact that the patch also requires a lot of extra hacking just
to support this case badly doesn't make me any more favorably
disposed.

regards, tom lane

Re: Ordered Partitioned Table Scans

2019-03-22 Thread Simon Riggs

On Fri, 22 Mar 2019 at 11:39, Tom Lane  wrote:

> Simon Riggs  writes:
> > I agree that the issue of mixing sorts at various points will make
> nonsense
> > of the startup cost/total cost results.
>
> Right.
>
> > I don't see LIMIT costing being broken as a reason to restrict this
> > optimization. I would ask that we allow improvements to the important use
> > case of ORDER BY/LIMIT, then spend time on making LIMIT work correctly.
>
> There's not time to reinvent LIMIT costing for v12.  I'd be happy to
> see some work done on that in the future, and when it does get done,
> I'd be happy to see Append planning extended to allow this case.
> I just don't think it's wise to ship one without the other.
>

I was hoping to motivate you to look at this personally, and soon. LIMIT is
so broken that any improvements count as bug fixes in my book.

-- 
Simon Riggshttp://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Ordered Partitioned Table Scans

2019-03-22 Thread Tom Lane

Simon Riggs  writes:
> I agree that the issue of mixing sorts at various points will make nonsense
> of the startup cost/total cost results.

Right.

> I don't see LIMIT costing being broken as a reason to restrict this
> optimization. I would ask that we allow improvements to the important use
> case of ORDER BY/LIMIT, then spend time on making LIMIT work correctly.

There's not time to reinvent LIMIT costing for v12.  I'd be happy to
see some work done on that in the future, and when it does get done,
I'd be happy to see Append planning extended to allow this case.
I just don't think it's wise to ship one without the other.

regards, tom lane

Re: [proposal] Add an option for returning SQLSTATE in psql error message

2019-03-22 Thread Tom Lane

Peter Eisentraut  writes:
> But now that I read the patch again, I'm not sure why this needs to
> touch libpq.  The formatting of error messages in psql should be handled
> in psql.

Maybe in an ideal world that'd be the case, but psql has always just
depended on PQerrorMessage().  I don't think this patch should be
tasked with changing that.  I'm not even sure that doing so would be
a good idea: I think the idea there is that whatever we believe is a
nice error-reporting option for psql ought to be easily available to
other libpq clients too.

(Note: I haven't read the patch and don't mean to be saying that it's
necessarily OK.  But I'm fine with the idea that something like this
involves touching libpq more than psql.)

regards, tom lane

Re: Ordered Partitioned Table Scans

2019-03-22 Thread David Rowley

On Sat, 23 Mar 2019 at 04:12, Tom Lane  wrote:
> The reason why I'm skeptical about LIMIT with a plan of the form
> append-some-sorts-together is that there are going to be large
> discontinuities in the cost-vs-number-of-rows-returned graph,
> ie, every time you finish one child plan and start the next one,
> there'll be a hiccup while the sort happens.  This is something
> that our plan cost approximation (startup cost followed by linear
> output up to total cost) can't even represent.  Sticking a
> LIMIT on top of that is certainly not going to lead to any useful
> estimate of the actual cost, meaning that if you get a correct
> plan choice it'll just be by luck, and if you don't there'll be
> nothing to be done about it.

Thanks for explaining.  I see where you're coming from now.  I think
this point would carry more weight if using the Append instead of the
MergeAppend were worse in some cases as we could end up using an
inferior plan accidentally.  However, that's not the case. The Append
plan should always perform better both for startup and pulling a
single row all the way to pulling the final row.  The underlying
subplans are the same in each case, but Append has the additional
saving of not having to determine to perform a sort on the top row
from each subpath.

I also think that cost-vs-number-of-rows-returned is not any worse
than a SeqScan where the required rows are unevenly distributed
throughout the table. In fact, the SeqScan case is much worse as we
could end up choosing that over an index scan, which could be
significantly better, but as mentioned above, and benchmarked in the
initial post of this thread, Append always wins over MergeAppend, so I
don't quite understand your reasoning here.  I could understand it if
Append needed the sorts but MergeAppend did not, but they both need
the sorts if there's not a path that already provides the required
ordering.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Offline enabling/disabling of data checksums

2019-03-22 Thread Christoph Berg

Re: Fabien COELHO 2019-03-22 
> Attached is a quick patch about "pg_rewind", so that the control file is
> updated after everything else is committed to disk.

>   update_controlfile(datadir_target, progname, &ControlFile_new, do_sync);
>  
> - pg_log(PG_PROGRESS, "syncing target data directory\n");
> - syncTargetDirectory();

Doesn't the control file still need syncing?

Christoph

Re: Ordered Partitioned Table Scans

2019-03-22 Thread Simon Riggs

On Fri, 22 Mar 2019 at 11:12, Tom Lane  wrote:

> David Rowley  writes:
> > On Sat, 9 Mar 2019 at 10:52, Tom Lane  wrote:
> >>> This can be a huge win for queries of the form "ORDER BY partkey LIMIT
> >>> x".  Even if the first subpath(s) aren't natively ordered, not all of
> >>> the sorts should actually be performed.
>
> >> [ shrug... ] We've got no realistic chance of estimating such situations
> >> properly, so I'd have no confidence in a plan choice based on such a
> >> thing.  Nor do I believe that this case is all that important.
>
> > Wondering if you can provide more details on why you think there's no
> > realistic chance of the planner costing these cases correctly?
>
> The reason why I'm skeptical about LIMIT with a plan of the form
> append-some-sorts-together is that there are going to be large
> discontinuities in the cost-vs-number-of-rows-returned graph,
> ie, every time you finish one child plan and start the next one,
> there'll be a hiccup while the sort happens.  This is something
> that our plan cost approximation (startup cost followed by linear
> output up to total cost) can't even represent.  Sticking a
> LIMIT on top of that is certainly not going to lead to any useful
> estimate of the actual cost, meaning that if you get a correct
> plan choice it'll just be by luck, and if you don't there'll be
> nothing to be done about it.
>
> If we don't incorporate that, then the situation that the planner
> will have to model is a MergeAppend with possibly some sorts in
> front of it, and it will correctly cost that as if all the sorts
> happen before any output occurs, and so you can hope that reasonable
> plan choices will ensue.
>
> I believe that the cases where a LIMIT query actually ought to go
> for a fast-start plan are where *all* the child rels have fast-start
> (non-sort) paths --- which is exactly the cases we'd get if we only
> allow "sorted" Appends when none of the inputs require a sort.
> Imagining that we'll get good results in cases where some of them
> need to be sorted is just wishful thinking.
>
> > It would be unfortunate to reject this patch based on a sentence that
> > starts with "[ shrug... ]", especially so when three people have stood
> > up and disagreed with you.
>
> I don't want to reject the patch altogether, I just want it to not
> include a special hack to allow Append rather than MergeAppend in such
> cases.  I am aware that I'm probably going to be shouted down on this
> point, but that will not change my opinion that the shouters are wrong.
>

I agree that the issue of mixing sorts at various points will make nonsense
of the startup cost/total cost results.

What you say about LIMIT is exactly right. But ISTM that LIMIT itself is
the issue there and it need more smarts to correctly calculate costs.

I don't see LIMIT costing being broken as a reason to restrict this
optimization. I would ask that we allow improvements to the important use
case of ORDER BY/LIMIT, then spend time on making LIMIT work correctly.

-- 
Simon Riggshttp://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: [proposal] Add an option for returning SQLSTATE in psql error message

2019-03-22 Thread Peter Eisentraut

On 2019-03-21 19:01, David Steele wrote:
> What do you think, Peter?  Is the extra test valuable or will it cause 
> unpredictable outputs as Tom and Michael predict?

Yes, I'm OK with that.

But now that I read the patch again, I'm not sure why this needs to
touch libpq.  The formatting of error messages in psql should be handled
in psql.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: pg_basebackup ignores the existing data directory permissions

2019-03-22 Thread Peter Eisentraut

On 2019-03-22 05:00, Michael Paquier wrote:
> On Fri, Mar 22, 2019 at 02:45:24PM +1100, Haribabu Kommi wrote:
>> How about letting the pg_basebackup to decide group permissions of the
>> standby directory irrespective of the primary directory permissions.
>>
>> Default - permissions are same as primary
>> --allow-group-access - standby directory have group access permissions
>> --no-group--access - standby directory doesn't have group permissions
>>
>> The last two options behave irrespective of the primary directory
>> permissions.
> 
> Yes, I'd imagine that we would want to be able to define three
> different behaviors, by either having a set of options, or a sinple
> option with a switch, say --group-access:
> - "inherit" causes the permissions to be inherited from the source
> node, and that's the default.
> - "none" enforces the default 0700/0600.
> - "group" enforces group read access.

Yes, we could use those three behaviors.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Enable data checksums by default

2019-03-22 Thread Christoph Berg

Lately, PostgreSQL has moved many defaults from "bare minimum" more to
the "user friendly by default" side, e.g. hot_standby & replication in
the default configuration, parallelism, and generally higher defaults
for resource knobs like *_mem, autovacuum_* and so on.

I think, the next step in that direction would be to enable data
checksums by default. They make sense in most setups, and people who
plan to run very performance-critical systems where checksums might be
too much need to tune many knobs anyway, and can as well choose to
disable them manually, instead of having everyone else have to enable
them manually. Also, disabling is much easier than enabling.

One argument against checksums used to be that we lack tools to fix
problems with them. But ignore_checksum_failure and the pg_checksums
tool fix that.

The attached patch flips the default in initdb. It also adds a new
option -k --no-data-checksums that wasn't present previously. Docs are
updated to say what the new default is, and the testsuite exercises
the -K option.

Christoph
>From cca069298f1cfc5c95e9e7dde08407a0c05f538a Mon Sep 17 00:00:00 2001
From: Christoph Berg 
Date: Fri, 22 Mar 2019 14:55:32 +0100
Subject: [PATCH] Enable data checksums by default

---
 doc/src/sgml/config.sgml   |  4 ++--
 doc/src/sgml/ref/initdb.sgml   | 11 +++
 doc/src/sgml/wal.sgml  |  6 +++---
 src/bin/initdb/initdb.c| 11 ---
 src/bin/initdb/t/001_initdb.pl | 22 --
 5 files changed, 40 insertions(+), 14 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index d383de2512..bfbe3afbb2 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2678,8 +2678,8 @@ include_dir 'conf.d'

 

-If data checksums are enabled, hint bit updates are always WAL-logged
-and this setting is ignored. You can use this setting to test how much
+If data checksums are enabled (which is the default), hint bit updates are always WAL-logged
+and this setting is ignored. If data checksums are disabled, you can use this setting to test how much
 extra WAL-logging would occur if your database had data checksums
 enabled.

diff --git a/doc/src/sgml/ref/initdb.sgml b/doc/src/sgml/ref/initdb.sgml
index 84fb37c293..c22f64ed95 100644
--- a/doc/src/sgml/ref/initdb.sgml
+++ b/doc/src/sgml/ref/initdb.sgml
@@ -219,6 +219,17 @@ PostgreSQL documentation
 may incur a noticeable performance penalty. This option can only
 be set during initialization, and cannot be changed later. If
 set, checksums are calculated for all objects, in all databases.
+The default is to use data checksums.
+   
+  
+ 
+
+ 
+  -K
+  --no-data-checksums
+  
+   
+Do not use checksums on data pages.

   
  
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index 4eb8feb903..2790e021e5 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -193,10 +193,10 @@
 
 
  
-  Data pages are not currently checksummed by default, though full page images
+  Data pages are checksummed by default. Additionally, full page images
   recorded in WAL records will be protected; see initdb
-  for details about enabling data page checksums.
+  linkend="app-initdb-no-data-checksums">initdb
+  for details about disabling data page checksums which might be beneficial for performance.
  
 
 
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 4886090132..e7aef0b2c5 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -140,7 +140,7 @@ static bool noclean = false;
 static bool do_sync = true;
 static bool sync_only = false;
 static bool show_setting = false;
-static bool data_checksums = false;
+static bool data_checksums = true;
 static char *xlog_dir = NULL;
 static char *str_wal_segment_size_mb = NULL;
 static int	wal_segment_size_mb;
@@ -2408,7 +2408,8 @@ usage(const char *progname)
 	printf(_("  --wal-segsize=SIZEsize of WAL segments, in megabytes\n"));
 	printf(_("\nLess commonly used options:\n"));
 	printf(_("  -d, --debug   generate lots of debugging output\n"));
-	printf(_("  -k, --data-checksums  use data page checksums\n"));
+	printf(_("  -k, --data-checksums  use data page checksums (default)\n"));
+	printf(_("  -K, --no-data-checksums   do not use data page checksums\n"));
 	printf(_("  -L DIRECTORY  where to find the input files\n"));
 	printf(_("  -n, --no-cleando not clean up after errors\n"));
 	printf(_("  -N, --no-sync do not wait for changes to be written safely to disk\n"));
@@ -3100,6 +3101,7 @@ main(int argc, char *argv[])
 		{"waldir", required_argument, NULL, 'X'},
 		{"wal-segsize", required_argument, NULL, 12},
 		{"data-checksums", no_argument, NULL, 'k'},
+		{"no-data-checksums", no_argument, NULL, 'K'},
 		{"a

Re: Removing unneeded self joins

2019-03-22 Thread David Rowley

On Sat, 23 Mar 2019 at 03:39, Alexander Kuzmenkov
 wrote:
> The bug you mention later is an implementation bug that can be fixed (I
> will expand on that below). Besides this, do you think current self-join
> detection algorithm has fundamental correctness problems? I am not aware
> of such problems and this algorithm reflects my understanding of what
> constitutes a removable self-join, so it would be helpful if you could
> explain what exactly is wrong with it.

I did explain what is wrong with it, and also showed an example of why
it is broken.  I didn't see anything that looks fundamentally broken,
just the implementation needs more work.

> Your alternative idea sounds plausible, but I won't know it's correct
> for sure until I implement it, which is a nontrivial amount of work. I
> am also concerned that we will have to redo calculations similar to
> innerrel_is_unique(), because having near-zero overhead is a hard
> prerequisite for this patch, as was discussed upthread.

Are you worried about bypassing the unique rel cache is going to be too costly?

> > Here's an example of what can go wrong with your current code:
>
> This is a bug indeed. Unique index search is not exhaustive, so if many
> indexes match the join quals, we might not find the same index for both
> sides. I think this can be overcome if we switch to exhaustive search in
> relation_has_unique_index_for, and then try to find matching indexes in
> is_unique_self_join. Please see the new patch for the fix.

I really don't think modifying relation_has_unique_index_for to
collect details for up to 5 indexes is a good fix for this.  It looks
like it's still possible to trigger this, just the example would need
to be more complex. Also, you've likely just more than doubled the
runtime of a successful match in relation_has_unique_index_for().
Previously, on average we'd have found that match halfway through the
list of indexes. Now it's most likely you'll end up looking at every
index, even after a match has been found. That does not seem well
aligned to keeping the CPU overhead for the patch low.

What is wrong with just weeding out join quals that don't have the
equivalent expression on either side before passing them to
relation_has_unique_index_for()? That'll save you from getting false
matches like I showed. To make that work it just seems mostly like
you'd mostly just need to swap the order of operations in the patch,
but you'd also likely end up needing to rip out all the UniqueRelInfo
code, since I don't think that'll be needed any longer. Likely that
means your entire patch would be limited to analyzejoins.c, although
I'm unsure what of the eclass editing code should be moved into
equivclass.c.

> > I also think this way would give you the subquery GROUP BY / DISTINCT
> > self join removal for just about free.
>
>
> Could you explain how exactly we can generalize join removal to the
> DISTINCT/GROUP BY case? I understand the removable self-joins as having
> the same row on both sides of join, as identified by ctid, but I'm not
> sure how to apply this to subqueries. Your earlier DISTINCT example
> looked like it needed a different kind of join removal, with different
> conditions for when it applies (please see my previous letter for details).

Well, by removing the requirement that the unique proofs have to come
from a unique index. I don't think you need to ensure this works for
your patch, it would be nice if it did for free. I just don't think
your implementation should block it from ever working.

> Another thing I'd like to mention is that this patch splits in two
> independent parts, the detection of self-joins and their removal. While
> we are looking for agreement on the detection part, could you also
> review the removal part? I'm sure it has its own share of problems.

I'd rather focus on the detection method before reviewing the removal
code.  If there's some blocker in the detection code then the removal
code is not useful.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Ordered Partitioned Table Scans

2019-03-22 Thread Tom Lane

David Rowley  writes:
> On Sat, 9 Mar 2019 at 10:52, Tom Lane  wrote:
>>> This can be a huge win for queries of the form "ORDER BY partkey LIMIT
>>> x".  Even if the first subpath(s) aren't natively ordered, not all of
>>> the sorts should actually be performed.

>> [ shrug... ] We've got no realistic chance of estimating such situations
>> properly, so I'd have no confidence in a plan choice based on such a
>> thing.  Nor do I believe that this case is all that important.

> Wondering if you can provide more details on why you think there's no
> realistic chance of the planner costing these cases correctly?

The reason why I'm skeptical about LIMIT with a plan of the form
append-some-sorts-together is that there are going to be large
discontinuities in the cost-vs-number-of-rows-returned graph,
ie, every time you finish one child plan and start the next one,
there'll be a hiccup while the sort happens.  This is something
that our plan cost approximation (startup cost followed by linear
output up to total cost) can't even represent.  Sticking a
LIMIT on top of that is certainly not going to lead to any useful
estimate of the actual cost, meaning that if you get a correct
plan choice it'll just be by luck, and if you don't there'll be
nothing to be done about it.

If we don't incorporate that, then the situation that the planner
will have to model is a MergeAppend with possibly some sorts in
front of it, and it will correctly cost that as if all the sorts
happen before any output occurs, and so you can hope that reasonable
plan choices will ensue.

I believe that the cases where a LIMIT query actually ought to go
for a fast-start plan are where *all* the child rels have fast-start
(non-sort) paths --- which is exactly the cases we'd get if we only
allow "sorted" Appends when none of the inputs require a sort.
Imagining that we'll get good results in cases where some of them
need to be sorted is just wishful thinking.

> It would be unfortunate to reject this patch based on a sentence that
> starts with "[ shrug... ]", especially so when three people have stood
> up and disagreed with you.

I don't want to reject the patch altogether, I just want it to not
include a special hack to allow Append rather than MergeAppend in such
cases.  I am aware that I'm probably going to be shouted down on this
point, but that will not change my opinion that the shouters are wrong.

regards, tom lane

Re: Removing unneeded self joins

2019-03-22 Thread Alexander Kuzmenkov


On 3/21/19 01:54, David Rowley wrote:

I really just don't think checking the unique indexes match is going
to cut it. You should be looking for a unique index where the join
clauses match on either side of the join, not looking independently
and then checking the indexes are the same ones.



The bug you mention later is an implementation bug that can be fixed (I 
will expand on that below). Besides this, do you think current self-join 
detection algorithm has fundamental correctness problems? I am not aware 
of such problems and this algorithm reflects my understanding of what 
constitutes a removable self-join, so it would be helpful if you could 
explain what exactly is wrong with it.


Your alternative idea sounds plausible, but I won't know it's correct 
for sure until I implement it, which is a nontrivial amount of work. I 
am also concerned that we will have to redo calculations similar to 
innerrel_is_unique(), because having near-zero overhead is a hard 
prerequisite for this patch, as was discussed upthread.


In short, I am reluctant to implement a new approach to detection until 
I understand why the current one is fundamentally broken.




Here's an example of what can go wrong with your current code:



This is a bug indeed. Unique index search is not exhaustive, so if many 
indexes match the join quals, we might not find the same index for both 
sides. I think this can be overcome if we switch to exhaustive search in 
relation_has_unique_index_for, and then try to find matching indexes in 
is_unique_self_join. Please see the new patch for the fix.




I also think this way would give you the subquery GROUP BY / DISTINCT
self join removal for just about free.



Could you explain how exactly we can generalize join removal to the 
DISTINCT/GROUP BY case? I understand the removable self-joins as having 
the same row on both sides of join, as identified by ctid, but I'm not 
sure how to apply this to subqueries. Your earlier DISTINCT example 
looked like it needed a different kind of join removal, with different 
conditions for when it applies (please see my previous letter for details).



Another thing I'd like to mention is that this patch splits in two 
independent parts, the detection of self-joins and their removal. While 
we are looking for agreement on the detection part, could you also 
review the removal part? I'm sure it has its own share of problems.


--
Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

>From c40b4b59df27e24c2238f422d5c6fde6fe1a91d3 Mon Sep 17 00:00:00 2001
From: Alexander Kuzmenkov 
Date: Fri, 22 Mar 2019 17:25:01 +0300
Subject: [PATCH v13] Remove unique self joins.

---
 src/backend/nodes/outfuncs.c  |  14 +-
 src/backend/optimizer/path/indxpath.c |  46 +-
 src/backend/optimizer/path/joinpath.c |   6 +-
 src/backend/optimizer/plan/analyzejoins.c | 998 --
 src/backend/optimizer/plan/planmain.c |   7 +-
 src/backend/optimizer/util/pathnode.c |   3 +-
 src/backend/optimizer/util/plancat.c  |   7 +-
 src/backend/optimizer/util/relnode.c  |  26 +-
 src/backend/utils/cache/relcache.c|   4 +-
 src/include/nodes/nodes.h |   1 +
 src/include/nodes/pathnodes.h |  25 +-
 src/include/optimizer/pathnode.h  |   5 +
 src/include/optimizer/paths.h |   4 +-
 src/include/optimizer/planmain.h  |  10 +-
 src/test/regress/expected/equivclass.out  |  32 +
 src/test/regress/expected/insert_conflict.out |   2 +-
 src/test/regress/expected/join.out| 255 ++-
 src/test/regress/expected/stats_ext.out   |  23 +-
 src/test/regress/sql/equivclass.sql   |  17 +
 src/test/regress/sql/join.sql | 108 +++
 20 files changed, 1478 insertions(+), 115 deletions(-)

diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 910a738..ba03fdf 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -2270,7 +2270,8 @@ _outRelOptInfo(StringInfo str, const RelOptInfo *node)
 	WRITE_OID_FIELD(userid);
 	WRITE_BOOL_FIELD(useridiscurrent);
 	/* we don't try to print fdwroutine or fdw_private */
-	/* can't print unique_for_rels/non_unique_for_rels; BMSes aren't Nodes */
+	WRITE_NODE_FIELD(unique_for_rels);
+	/* can't print non_unique_for_rels; BMSes aren't Nodes */
 	WRITE_NODE_FIELD(baserestrictinfo);
 	WRITE_UINT_FIELD(baserestrict_min_security);
 	WRITE_NODE_FIELD(joininfo);
@@ -2343,6 +2344,14 @@ _outStatisticExtInfo(StringInfo str, const StatisticExtInfo *node)
 }
 
 static void
+_outUniqueRelInfo(StringInfo str, const UniqueRelInfo *node)
+{
+	WRITE_NODE_TYPE("UNIQUERELINFO");
+
+	WRITE_BITMAPSET_FIELD(outerrelids);
+}
+
+static void
 _outEquivalenceClass(StringInfo str, const EquivalenceClass *node)
 {
 	/*
@@ -4095,6 +4104,9 @@ outNode(StringInfo str, const void *obj)
 			case T_Statistic

Re: Ordered Partitioned Table Scans

2019-03-22 Thread David Rowley

On Sat, 9 Mar 2019 at 10:52, Tom Lane  wrote:
>
> Julien Rouhaud  writes:
> > On Fri, Mar 8, 2019 at 9:15 PM Tom Lane  wrote:
> >> I think you should remove all that
> >> and restrict this optimization to the case where all the subpaths are
> >> natively ordered --- if we have to insert Sorts, it's hardly going to move
> >> the needle to worry about simplifying the parent MergeAppend to Append.
>
> > This can be a huge win for queries of the form "ORDER BY partkey LIMIT
> > x".  Even if the first subpath(s) aren't natively ordered, not all of
> > the sorts should actually be performed.
>
> [ shrug... ] We've got no realistic chance of estimating such situations
> properly, so I'd have no confidence in a plan choice based on such a
> thing.  Nor do I believe that this case is all that important.

Hi Tom,

Wondering if you can provide more details on why you think there's no
realistic chance of the planner costing these cases correctly?   It
would be unfortunate to reject this patch based on a sentence that
starts with "[ shrug... ]", especially so when three people have stood
up and disagreed with you.

I've explained why I think you're wrong. Would you be able to explain
to me why you think I'm wrong?

You also mentioned that you didn't like the fact I'd changed the API
for create_append_plan().  Could you suggest why you think passing
pathkeys in is the wrong thing to do?  The Append path obviously needs
pathkeys so that upper paths know what order the path guarantees.
Passing pathkeys in allows us to verify that pathkeys are a valid
thing to have for the AppendPath.  They're not valid in the Parallel
Append case, for example, so setting them afterwards does not seem
like an improvement.  They also allow us to cost the cheaper startup
cost properly, however, you did seem to argue that you have no
confidence in cheap startup plans, which I'm still confused by.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Offline enabling/disabling of data checksums

2019-03-22 Thread Fabien COELHO




Done the switch for this case.  For pg_rewind actually I think that this 
is an area where its logic could be improved a bit. So first the data 
folder is synced, and then the control file is updated.


Attached is a quick patch about "pg_rewind", so that the control file is 
updated after everything else is committed to disk.


--
Fabien.diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c
index 3dcadb9b40..c1e6d7cd07 100644
--- a/src/bin/pg_rewind/pg_rewind.c
+++ b/src/bin/pg_rewind/pg_rewind.c
@@ -351,9 +351,12 @@ main(int argc, char **argv)
 
 	progress_report(true);
 
-	pg_log(PG_PROGRESS, "\ncreating backup label and updating control file\n");
+	pg_log(PG_PROGRESS, "\ncreating backup label\n");
 	createBackupLabel(chkptredo, chkpttli, chkptrec);
 
+	pg_log(PG_PROGRESS, "syncing target data directory\n");
+	syncTargetDirectory();
+
 	/*
 	 * Update control file of target. Make it ready to perform archive
 	 * recovery when restarting.
@@ -362,6 +365,7 @@ main(int argc, char **argv)
 	 * source server. Like in an online backup, it's important that we recover
 	 * all the WAL that was generated while we copied the files over.
 	 */
+	pg_log(PG_PROGRESS, "updating control file\n");
 	memcpy(&ControlFile_new, &ControlFile_source, sizeof(ControlFileData));
 
 	if (connstr_source)
@@ -377,11 +381,9 @@ main(int argc, char **argv)
 	ControlFile_new.minRecoveryPoint = endrec;
 	ControlFile_new.minRecoveryPointTLI = endtli;
 	ControlFile_new.state = DB_IN_ARCHIVE_RECOVERY;
+
 	update_controlfile(datadir_target, progname, &ControlFile_new, do_sync);
 
-	pg_log(PG_PROGRESS, "syncing target data directory\n");
-	syncTargetDirectory();
-
 	printf(_("Done!\n"));
 
 	return 0;

Re: Offline enabling/disabling of data checksums

2019-03-22 Thread Fabien COELHO



Bonjour Michaël,


Does that look fine to you?


Mostly.

Patch v9 part 1 applies cleanly, compiles, global and local check ok, doc 
build ok.


On write(), the error message is not translatable whereas it is for all 
others.


I agree that a BIG STRONG warning is needed about not to start the cluster 
under pain of possible data corruption. I still think that preventing this 
is desirable, preferably before v12.



Otherwise, my remaining non showstopper (committer's opinion matters more) 
issues:


Doc: A postgres cluster is like an Oracle instance. I'd use "replicated 
setup" instead of "cluster", and "cluster" instead of "instance. I'll try 
to improve the text, possibly over the week-end.


Enabling/disabling an already enabled/disabled cluster should not be a 
failure for me, because the cluster is left under the prescribed state.


Patch v9 part 2 applies cleanly, compiles, global and local check ok, doc 
build ok.


Ok for me.

--
Fabien.

Re: Performance issue in foreign-key-aware join estimation

2019-03-22 Thread Tom Lane

David Rowley  writes:
> On Fri, 22 Mar 2019 at 10:10, Tom Lane  wrote:
>> I'm unsure how hard we should push to get something like this into v12.
>> I'm concerned that its dependency on list_nth might result in performance
>> regressions in some cases; ...

> However, there's always a danger we find some show-stopper with your
> list reimplementation patch, in which case I wouldn't really like to
> be left with list_skip_forward() in core.

We could always do something like what we've already done with
simple_rel_array and simple_rte_array, ie, replace the eq_classes
List with a manually-managed pointer array.  Given the small number
of places that touch that list, it wouldn't be too awful --- but
still, I'd only consider that if the List-reimplementation patch
goes down in flames.

regards, tom lane

Re: speeding up planning with partitions

2019-03-22 Thread David Rowley

On Fri, 22 Mar 2019 at 20:39, Amit Langote
 wrote:
> The problem is that make_partitionedrel_pruneinfo() does some work which
> involves allocating arrays containing nparts elements and then looping
> over the part_rels array to fill values in those arrays that will be used
> by run-time pruning.  It does all that even *before* it has checked that
> run-time pruning will be needed at all, which if it is not, it will simply
> discard the result of the aforementioned work.
>
> Patch 0008 of 0009 rearranges the code such that we check whether we will
> need run-time pruning at all before doing any of the work that's necessary
> for run-time to work.

I had a quick look at the make_partitionedrel_pruneinfo() code earlier
before this patch appeared and I agree that something like this could
be done. I've not gone over the patch in detail though.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Progress reporting for pg_verify_checksums

2019-03-22 Thread Michael Banck

Hi,

Am Dienstag, den 19.03.2019, 09:04 +0900 schrieb Kyotaro HORIGUCHI:
> At Mon, 18 Mar 2019 23:14:01 +0100 (CET), Fabien COELHO  
> wrote in 
> > >>> + /* we handle SIGUSR1 only, and toggle the value of show_progress
> > >>> */
> > >>> + if (signum == SIGUSR1)
> > >>> + show_progress = !show_progress;
> > >>
> > >> SIGUSR1 *toggles* progress.
> > >
> > > Not sure what you mean here,
> > 
> > Probably it is meant to simplify the comment?
> 
> Sorry. I meant that "it can be turned off and a perhaps-garbage
> line is left alone after turning off. Don't we need to erase it?".

The current version prints a newline when it progress reporting is
toggled off. Do you mean there is a hazard that this happens right when
we are printing the progress, so end up with a partly garbage line? I
don't think that'd be so bad to warrant further code complexitiy, after
all, the user explicitly wanted the progress to stop.

But maybe I am misunderstanding?


Michael
  
-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.ba...@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz

Re: partitioned tables referenced by FKs

2019-03-22 Thread Jesper Pedersen


Hi Alvaro,

On 3/21/19 6:18 PM, Alvaro Herrera wrote:

On 2019-Mar-21, Jesper Pedersen wrote:

pgbench -M prepared -f select.sql 

I'm seeing 82.64% spent in GetCachedPlan(). plan_cache_mode is auto.


Hmm, I can't reproduce this at all ...  I don't even see GetCachedPlan
in the profile.  Do you maybe have some additional patch in your tree?



No, with 7df159a62 and v7 compiled with "-O0 -fno-omit-frame-pointer" I 
still see it.


plan_cache_mode = auto
  2394 TPS w/ GetCachePlan() @ 81.79%

plan_cache_mode = force_generic_plan
 10984 TPS w/ GetCachePlan() @ 23.52%

perf sent off-list.

Best regards,
 Jesper

Re: GiST VACUUM

2019-03-22 Thread Andrey Borodin




> 22 марта 2019 г., в 19:37, Heikki Linnakangas  написал(а):
> 
> On 21/03/2019 19:04, Heikki Linnakangas wrote:
>> Attached is the latest patch version, to be applied on top of the
>> IntegerSet patch.
> 
> And committed. Thanks Andrey!
> 
> - Heikki

Cool! Thank you very much! At the beginning I could not image how many caveats 
are there.

> 22 марта 2019 г., в 18:28, Heikki Linnakangas  написал(а):
> 
> * Currently, a search needs to scan all items on a page. If the keys are 
> small, that can be pretty slow. Divide each page further into e.g. 4 
> sub-pages, with a "bounding box" key for each sub-page, to speed up search.
BTW, I already have intra-page indexing patch. But now it obviously need a 
rebase :) Along with gist amcheck patch.

Thanks!

Best regards, Andrey Borodin.

Re: insensitive collations

2019-03-22 Thread Peter Eisentraut

On 2019-03-18 00:19, Peter Eisentraut wrote:
> On 2019-03-11 21:36, Peter Eisentraut wrote:
>> Patches here.  This will allow all the existing collation customization
>> options as well as the ones being proposed in this thread to work in
>> older ICU versions.
> 
> This has been committed, and here is an updated main patch, which now
> has a chance to pass the cfbot builds.

The main patch has also been committed.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Transaction commits VS Transaction commits (with parallel) VS query mean time

2019-03-22 Thread Robert Haas

On Fri, Mar 22, 2019 at 12:34 AM Alvaro Herrera
 wrote:
> > And then you have to decide what to do about other background
> > transactions.
>
> Not count them if they're implementation details; otherwise count them.
> For example, IMO autovacuum transactions should definitely be counted
> (as one transaction, even if they run parallel vacuum).

Hmm, interesting.  autovacuum isn't an implementation detail?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Special role for subscriptions

2019-03-22 Thread Andrey Borodin

Hi!


> 22 марта 2019 г., в 19:17, Petr Jelinek  
> написал(а):
> 
> I still don't like that we are running the subscription workers as
> superuser even for subscriptions created by regular user. That has
> plenty of privilege escalation issues in terms of how user functions are
> run (we execute triggers, index expressions etc, in that worker).
Yes, this is important concern, thanks! I think it is not a big deal to run 
worker without superuser privileges too.

> Regardless of my complain above, patch with this big security
> implications that has arrived in middle of last CF should not be merged
> in that last CF IMHO.
Yes, this patch is a pure security implication and nothing else.
This thread was started in November with around twenty messages before this CF. 
Our wiki states that "in our community -- if no one objects, then there is 
implicit approval. Within reason!"
I do not really think argument "last version of the patch arrived at last CF" 
applies here. But I understand that it is not easy to setup consensus on the 
problem at hand.
Independently from the willingness of any committer to work on this at current 
CF, the topic of subscription security relaxation really worth efforts.


Best regards, Andrey Borodin.

Re: partitioned tables referenced by FKs

2019-03-22 Thread Jesper Pedersen


Hi Alvaro,

On 3/21/19 4:49 PM, Alvaro Herrera wrote:

I think the attached is a better solution, which I'll go push shortly.



I see you pushed this, plus 0002 as well.

Thanks !

Best regards,
 Jesper

Re: jsonpath

2019-03-22 Thread Alexander Korotkov

On Fri, Mar 22, 2019 at 5:38 AM John Naylor  wrote:
> On Thu, Mar 21, 2019 at 9:59 PM Alexander Korotkov
>  wrote:
> > Attaches patches improving jsonpath parser.  First one introduces
> > cosmetic changes, while second gets rid of backtracking.  I'm also
> > planning to add high-level comment for both grammar and lexer.
>
> The cosmetic changes look good to me. I just noticed a couple things
> about the comments.
>
> 0001:
>
> +/* Check if current scanstring value constitutes a keyword */
>
> 'is a keyword' is better. 'Constitutes' implies parts of a whole.
>
> + * Resize scanstring for appending of given length.  Reinitilize if required.
>
> s/Reinitilize/Reinitialize/
>
> The first sentence is not entirely clear to me.

Thank you, fixed.

> 0002:
>
> These two rules are not strictly necessary:
>
> {unicode}+\\ {
> /* throw back the \\, and treat as unicode */
> yyless(yyleng - 1);
> parseUnicode(yytext, yyleng);
> }
>
> {hex_char}+\\ {
> /* throw back the \\, and treat as hex */
> yyless(yyleng - 1);
> parseHexChars(yytext, yyleng);
> }
>
> ...and only seem to be there because of how these are written:
>
> {unicode}+ { parseUnicode(yytext, yyleng); }
> {hex_char}+ { parseHexChars(yytext, yyleng); }
> {unicode}*{unicodefail} { yyerror(NULL, "Unicode
> sequence is invalid"); }
> {hex_char}*{hex_fail} { yyerror(NULL, "Hex character
> sequence is invalid"); }
>
> I don't understand the reasoning here -- is it a micro-optimization?
> The following is simpler, allow the rules I mentioned to be removed,
> and make check still passes. I would prefer it unless there is a
> performance penalty, in which case a comment to describe the
> additional complexity would be helpful.
>
> {unicode} { parseUnicode(yytext, yyleng); }
> {hex_char} { parseHexChars(yytext, yyleng); }
> {unicodefail} { yyerror(NULL, "Unicode sequence is invalid"); 
> }
> {hex_fail} { yyerror(NULL, "Hex character sequence is
> invalid"); }

These rules are needed for unicode.  Sequential escaped unicode
characters might be connected by hi surrogate value.  See
jsonpath_encoding regression test in attached patch.

Regarding hex, I made it so for the sake of uniformity.  But I changed
my mind and decided that simpler flex rules are more important.  So,
now they are considered one-by-one.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


0001-cosmetic-changes-jsonpath-parser-2.patch
Description: Binary data


0002-get-rid-of-backtracking-2.patch
Description: Binary data

Re: speeding up planning with partitions

2019-03-22 Thread Jesper Pedersen


Hi Amit,

On 3/22/19 3:39 AM, Amit Langote wrote:

I took a stab at this.  I think rearranging the code in
make_partitionedrel_pruneinfo() slightly will take care of this.

The problem is that make_partitionedrel_pruneinfo() does some work which
involves allocating arrays containing nparts elements and then looping
over the part_rels array to fill values in those arrays that will be used
by run-time pruning.  It does all that even *before* it has checked that
run-time pruning will be needed at all, which if it is not, it will simply
discard the result of the aforementioned work.

Patch 0008 of 0009 rearranges the code such that we check whether we will
need run-time pruning at all before doing any of the work that's necessary
for run-time to work.



Yeah, this is better :) Increasing number of partitions leads to a TPS 
within the noise level.


Passes check-world.

Best regards,
 Jesper

Re: GiST VACUUM

2019-03-22 Thread Heikki Linnakangas


On 22/03/2019 13:37, Heikki Linnakangas wrote:

On 21/03/2019 19:04, Heikki Linnakangas wrote:

Attached is the latest patch version, to be applied on top of the
IntegerSet patch.


And committed. Thanks Andrey!


This caused the buildfarm to go pink... I was able to reproduce it, by 
running the regression tests in one terminal, and repeatedly running 
"VACUUM;" in another terminal. Strange that it seems to happen all the 
time on the buildfarm, but never happened to me locally.


Anyway, I'm investigating.

- Heikki

Re: jsonpath

2019-03-22 Thread Nikita Glukhov


On 21.03.2019 16:58, Alexander Korotkov wrote:


On Tue, Mar 19, 2019 at 8:10 PM Alexander Korotkov
 wrote:
Attaches patches improving jsonpath parser.  First one introduces
cosmetic changes, while second gets rid of backtracking.  I'm also
planning to add high-level comment for both grammar and lexer.


Parsing of integers now is wrong: neither JSON specification, nor our json[b]
allow leading zeros in integers and floats starting with a dot.

=# SELECT json '.1';
ERROR:  invalid input syntax for type json
LINE 1: SELECT jsonb '.1';
 ^
DETAIL:  Token "." is invalid.
CONTEXT:  JSON data, line 1: 

=# SELECT json '00';
ERROR:  invalid input syntax for type json
LINE 1: SELECT jsonb '00';
 ^
DETAIL:  Token "00" is invalid.
CONTEXT:  JSON data, line 1: 00

=# SELECT json '00.1';
ERROR:  invalid input syntax for type json
LINE 1: SELECT jsonb '00.1';
 ^
DETAIL:  Token "00" is invalid.
CONTEXT:  JSON data, line 1: 00...


In JavaScript integers with leading zero are treated as octal numbers,
but leading dot is a allowed:

v8 > 0123
83
v8 > 000.1
Uncaught SyntaxError: Unexpected number
v8> .1
0.1

Attached patch 0003 fixes this issue.

--
Nikita Glukhov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

>From 5b0844d77f4fe65d666072356f7c0e42d13c9a63 Mon Sep 17 00:00:00 2001
From: Nikita Glukhov 
Date: Fri, 22 Mar 2019 14:38:46 +0300
Subject: [PATCH 3/3] Fix parsing of numbers in jsonpath

---
 src/backend/utils/adt/jsonpath_scan.l  |   6 +-
 src/test/regress/expected/jsonpath.out | 117 +++--
 2 files changed, 55 insertions(+), 68 deletions(-)

diff --git a/src/backend/utils/adt/jsonpath_scan.l b/src/backend/utils/adt/jsonpath_scan.l
index 844ea5e..a6b67ea 100644
--- a/src/backend/utils/adt/jsonpath_scan.l
+++ b/src/backend/utils/adt/jsonpath_scan.l
@@ -77,9 +77,9 @@ any			[^\?\%\$\.\[\]\{\}\(\)\|\&\!\=\<\>\@\#\,\*:\-\+\/\\\"\' \t\n\r\f]
 blank		[ \t\n\r\f]
 
 digit		[0-9]
-integer		{digit}+
-decimal		{digit}*\.{digit}+
-decimalfail	{digit}+\.
+integer		(0|[1-9]{digit}*)
+decimal		{integer}\.{digit}+
+decimalfail	{integer}\.
 real		({integer}|{decimal})[Ee][-+]?{digit}+
 realfail1	({integer}|{decimal})[Ee]
 realfail2	({integer}|{decimal})[Ee][-+]
diff --git a/src/test/regress/expected/jsonpath.out b/src/test/regress/expected/jsonpath.out
index b7de491..a99643f 100644
--- a/src/test/regress/expected/jsonpath.out
+++ b/src/test/regress/expected/jsonpath.out
@@ -547,23 +547,20 @@ select '$ ? (@.a < +1)'::jsonpath;
 (1 row)
 
 select '$ ? (@.a < .1)'::jsonpath;
-jsonpath 
--
- $?(@."a" < 0.1)
-(1 row)
-
+ERROR:  bad jsonpath representation
+LINE 1: select '$ ? (@.a < .1)'::jsonpath;
+   ^
+DETAIL:  syntax error, unexpected '.' at or near "."
 select '$ ? (@.a < -.1)'::jsonpath;
- jsonpath 
---
- $?(@."a" < -0.1)
-(1 row)
-
+ERROR:  bad jsonpath representation
+LINE 1: select '$ ? (@.a < -.1)'::jsonpath;
+   ^
+DETAIL:  syntax error, unexpected '.' at or near "."
 select '$ ? (@.a < +.1)'::jsonpath;
-jsonpath 
--
- $?(@."a" < 0.1)
-(1 row)
-
+ERROR:  bad jsonpath representation
+LINE 1: select '$ ? (@.a < +.1)'::jsonpath;
+   ^
+DETAIL:  syntax error, unexpected '.' at or near "."
 select '$ ? (@.a < 0.1)'::jsonpath;
 jsonpath 
 -
@@ -619,23 +616,20 @@ select '$ ? (@.a < +1e1)'::jsonpath;
 (1 row)
 
 select '$ ? (@.a < .1e1)'::jsonpath;
-   jsonpath

- $?(@."a" < 1)
-(1 row)
-
+ERROR:  bad jsonpath representation
+LINE 1: select '$ ? (@.a < .1e1)'::jsonpath;
+   ^
+DETAIL:  syntax error, unexpected '.' at or near "."
 select '$ ? (@.a < -.1e1)'::jsonpath;
-jsonpath
-
- $?(@."a" < -1)
-(1 row)
-
+ERROR:  bad jsonpath representation
+LINE 1: select '$ ? (@.a < -.1e1)'::jsonpath;
+   ^
+DETAIL:  syntax error, unexpected '.' at or near "."
 select '$ ? (@.a < +.1e1)'::jsonpath;
-   jsonpath

- $?(@."a" < 1)
-(1 row)
-
+ERROR:  bad jsonpath representation
+LINE 1: select '$ ? (@.a < +.1e1)'::jsonpath;
+   ^
+DETAIL:  syntax error, unexpected '.' at or near "."
 select '$ ? (@.a < 0.1e1)'::jsonpath;
jsonpath
 ---
@@ -691,23 +685,20 @@ select '$ ? (@.a < +1e-1)'::jsonpath;
 (1 row)
 
 select '$ ? (@.a < .1e-1)'::jsonpath;
- jsonpath 
---
- $?(@."a" < 0.01)
-(1 row)
-
+ERROR:  bad jsonpath representation
+LINE 1: select '$ ? (@.a < .1e-1)'::jsonpath;
+   ^
+DETAIL:  syntax error, unexpected '.' at or near "."
 select '$ ? (@.a < -.1e-1)'::jsonpath;
- jsonpath  

- $?(@."a" < -0.01)
-(1 row)
-
+ERROR:  bad jsonpath representation
+LINE 1: select '$ ? (@.a < -.1e-1)'::jsonpath;
+   ^
+DETAIL:  syntax error, unexpected '.' at or near "."
 select '$ ? (@.a < +.1e-1)'::jsonpath;
- json

Re: GiST VACUUM

2019-03-22 Thread Heikki Linnakangas


On 21/03/2019 19:04, Heikki Linnakangas wrote:

Attached is the latest patch version, to be applied on top of the
IntegerSet patch.


And committed. Thanks Andrey!

- Heikki

Re: Sparse bit set data structure

2019-03-22 Thread Heikki Linnakangas


Committed, thanks for the review!

I made one last-minute change: Instead of allocating a memory context in 
intset_create(), it is left to the caller. The GiST vacuum code created 
two integer sets, and it made more sense for it to use the same context 
for both. Also, the VACUUM tid list patch would like to use a memory 
context with very large defaults, so that malloc() will decide to 
mmap() it, so that the memory can be returned to the OS. Because the 
desired memory allocation varies between callers, so it's best to leave 
it to the caller.


On 20/03/2019 19:50, Julien Rouhaud wrote:

I forgot to mention a minor gripe about the intset_binsrch_uint64 /
intset_binsrch_leaf function, which are 99% duplicates.  But I don't
know if fixing that (something like passing the array as a void * and
passing a getter function?) is worth the trouble.


Yeah. I felt that it's simpler to have two almost-identical functions, 
even though it's a bit repetitive, than try to have one function serve 
both uses. The 'nextkey' argument is always passed as false to 
intset_binsrch_leaf() function, but I kept it anyway, to keep the 
functions identical. The compiler will surely optimize it away, so it 
makes no difference to performance.


On 20/03/2019 04:48, Andrey Borodin wrote:

1. I'm not sure it is the best interface for iteration

uint64
intset_iterate_next(IntegerSet *intset, bool *found)

we will use it like

while
{
 bool found;
 BlockNumber x = (BlockNumber) intset_iterate_next(is, &found);
 if (!found)
 break;
 // do stuff
}

we could use it like

BlockNumber x;
while(intset_iterate_next(is, &x))
{
 // do stuff
}

But that's not a big difference.


Agreed, I changed it that way.

- Heikki

Re: Contribution to Perldoc for TestLib module in Postgres

2019-03-22 Thread Ramanarayana

Hi,

Please find the first version of the patch for review. I was not sure what
some of the functions are used for and marked them with TODO.

Cheers
Ram 4.0


v1_perldoc_testlib.patch
Description: Binary data

Re: Special role for subscriptions

2019-03-22 Thread Petr Jelinek

Hi,

On 22/03/2019 03:15, Andrey Borodin wrote:
> 
>> 22 марта 2019 г., в 9:28, Michael Paquier  написал(а):
>>
>> On Thu, Mar 21, 2019 at 10:06:03AM -0300, Euler Taveira wrote:
>>> It will be really strange but I can live with that. Another idea is
>>> CREATE bit to create subscriptions (without replicate) and SUBSCRIBE
>>> bit to replicate tables. It is not just a privilege to create a
>>> subscription but also to modify tables that a role doesn't have
>>> explicit permission. Let's allocate another AclItem?
>>
>> By the way, as the commit fest is coming to its end in a couple of
>> days, and that we are still discussing how the thing should be shaped,
>> I would recommend to mark the patch as returned with feedback.  Any
>> objections with that?
> 
> It seems to me that we have consensus that:
> 1. We need special role to create subscription
> 2. This role can create subscription with some security checks
> 3. We have complete list of possible security checks
> 4. We have code that implements most of these checks (I believe 
> pg_subscription_role_v2.patch is enough, but we can tighten checks a little 
> more)
> 

I still don't like that we are running the subscription workers as
superuser even for subscriptions created by regular user. That has
plenty of privilege escalation issues in terms of how user functions are
run (we execute triggers, index expressions etc, in that worker).

> Do we have any objection on these points?
> 
> If not, it is RFC, it should not be returned.
> 

Regardless of my complain above, patch with this big security
implications that has arrived in middle of last CF should not be merged
in that last CF IMHO.

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services

Re: Offline enabling/disabling of data checksums

2019-03-22 Thread Michael Paquier

On Fri, Mar 22, 2019 at 10:04:02AM +0100, Michael Banck wrote:
> How about this:
> 
> + 
> +  Notes
> +  
> +   When enabling checksums in a cluster, the operation can potentially take a
> +   long time if the data directory is large.  During this operation, the 
> +   cluster or other programs that write to the data directory must not be 
> +   started or else data-loss will occur.
> +  

Sounds fine to me.  Will add an extra paragraph on that.

> Maybe it could be formulated like "If pg_checksums is aborted or killed
> in its operation while enabling or disabling checksums, the cluster
> will have the same state with respect of checksums as before the
> operation and pg_checksums needs to be restarted."?

We could use that as well.  With the current patch, and per the
suggestions from Fabien, this holds true as the control file is
updated and flushed last.
--
Michael


signature.asc
Description: PGP signature

Re: Performance issue in foreign-key-aware join estimation

2019-03-22 Thread David Rowley

Thanks for having a hack at this.

On Fri, 22 Mar 2019 at 10:10, Tom Lane  wrote:
> I'm unsure how hard we should push to get something like this into v12.
> I'm concerned that its dependency on list_nth might result in performance
> regressions in some cases; it's a lot easier to believe that this will
> be mostly-a-win with the better List infrastructure we're hoping to get
> into v13.  If you want to keep playing with it, OK, but I'm kind of
> tempted to shelve it for now.

Yeah,  mentioned a similar concern above.  The best I could come up
with to combat it was the list_skip_forward function. It wasn't
particularly pretty and was only intended as a stop-gap until List
become array-based. I think it should stop any regression.  I'm okay
with waiting until we get array based Lists, if you think that's best,
but it's a bit sad to leave this regression in yet another major
release.

However, there's always a danger we find some show-stopper with your
list reimplementation patch, in which case I wouldn't really like to
be left with list_skip_forward() in core.

If there's any consensus we want this for v12, then I'll happily look
over your patch, otherwise, I'll look sometime before July.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: GiST VACUUM

2019-03-22 Thread Heikki Linnakangas


On 22/03/2019 10:00, Andrey Borodin wrote:

22 марта 2019 г., в 1:04, Heikki Linnakangas 
написал(а):

PS. for Gist, we could almost use the LSN / NSN mechanism to detect
the case that a deleted page is reused: Add a new field to the GiST
page header, to store a new "deleteNSN" field. When a page is
deleted, the deleted page's deleteNSN is set to the LSN of the
deletion record. When the page is reused, the deleteNSN field is
kept unchanged. When you follow a downlink during search, if you
see that the page's deleteNSN > parent's LSN, you know that it was
concurrently deleted and recycled, and should be ignored. That
would allow reusing deleted pages immediately. Unfortunately that
would require adding a new field to the gist page header/footer,
which requires upgrade work :-(. Maybe one day, we'll bite the
bullet. Something to keep in mind, if we have to change the page
format anyway, for some reason.


Yeah, the same day we will get rid of invalid tuples. I can make a
patch for v13. Actually, I have a lot of patches that I want in GiST
in v13. Or v14.


Cool! Here's my wishlist:

* That deleteNSN thing
* Add a metapage to blk #0.
* Add a "level"-field to page header.
* Currently, a search needs to scan all items on a page. If the keys are 
small, that can be pretty slow. Divide each page further into e.g. 4 
sub-pages, with a "bounding box" key for each sub-page, to speed up search.


- Heikki

Re: pg_upgrade version checking questions

2019-03-22 Thread Christoph Berg

Re: Peter Eisentraut 2019-03-22 
<57769959-8960-a9ca-fc9c-4dbb12629...@2ndquadrant.com>
> I'm still in favor of defaulting --new-bindir appropriately.  It seems
> silly not to.  We know where the directory is, we don't have to ask anyone.

Fwiw I've been wondering why I have to pass that option every time
I've been using pg_upgrade. +1 on making it optional/redundant.

Christoph

Re: [HACKERS] Can ICU be used for a database's default sort order?

2019-03-22 Thread Peter Eisentraut

On 2019-03-21 18:46, Marius Timmer wrote:
> as I mentioned three weeks ago the patch from October 2018 did not apply
> on the master. In the meantime I rebased it. Additionally I fixed some
> Makefiles because a few icu-libs were missing. Now this patch applies
> and compiles successfully on my machine. After installing running "make
> installcheck-world" results in some failures (for example "select"). I
> will take a closer look at those failures and review the whole patch in
> the next few days. I just wanted to avoid that you have to do the same
> rebasing stuff. The new patch is attached to this mail.

As I said previously in this thread, this patch needs some fundamental
design work.  I don't think it's worth doing a code review on the patch
as it is.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: pg_upgrade version checking questions

2019-03-22 Thread Peter Eisentraut

On 2019-03-19 16:51, Tom Lane wrote:
> I'm not really getting your point here.  Packagers ordinarily tie
> those versions together anyway, I'd expect --- there's no upside
> to not doing so, and plenty of risk if one doesn't, because of
> exactly the sort of coordinated-changes hazard I'm talking about here.

The RPM packages do that, but the Debian packages do not.

>>> 3. Actually, I'm kind of wondering why pg_upgrade has a --new-bindir
>>> option at all, rather than just insisting on finding the new-version
>>> executables in the same directory it is in.  This seems like, at best,
>>> a hangover from before it got into core.  Even if you don't want to
>>> remove the option, we could surely provide a useful default setting
>>> based on find_my_exec.
> 
>> Previously discussed here:
>> https://www.postgresql.org/message-id/flat/1304710184.28821.9.camel%40vanquo.pezone.net
>>  (Summary: right)
> 
> Mmm.  The point that a default is of no particular use to scripts is
> still valid.  Shall we then remove the useless call to find_my_exec?

I'm still in favor of defaulting --new-bindir appropriately.  It seems
silly not to.  We know where the directory is, we don't have to ask anyone.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Offline enabling/disabling of data checksums

2019-03-22 Thread Michael Banck

Hi,

Am Freitag, den 22.03.2019, 17:37 +0900 schrieb Michael Paquier:
> On Fri, Mar 22, 2019 at 09:13:43AM +0100, Michael Banck wrote:
> > Don't we need a big warning that the cluster must not be started during
> > operation of pg_checksums as well, now that we don't disallow it?
> 
> The same applies to pg_rewind and pg_basebackup, so I would classify
> that as a pilot error.  

How would it apply to pg_basebackup? The cluster is running while the
base backup is taken and I believe the control file is written at the
end so you can't start another instance off the backup directory until
the base backup has finished.

It would apply to pg_rewind, but pg_rewind's runtime is not scaling with
cluster size, does it? pg_checksums will run for hours on large clusters
so the window of errors is much larger and I don't think you can easily
compare the two.

> How would you formulate that in the docs if you add it.

(I would try to make sure you can't start the cluster but that seems off
the table for now)

How about this:

+ 
+  Notes
+  
+   When enabling checksums in a cluster, the operation can potentially take a
+   long time if the data directory is large.  During this operation, the 
+   cluster or other programs that write to the data directory must not be 
+   started or else data-loss will occur.
+  
+
+  
+   When disabling or enabling checksums in a cluster of multiple instances,
[...]

Also, the following is not very clear to me:

+   If the event of a crash of the operating system while enabling or

s/If/In/

+   disabling checksums, the data folder may have checksums in an inconsistent
+   state, in which case it is recommended to check the state of checksums
+   in the data folder.

How is the user supposed to check the state of checksums? Do you mean
that if the user intended to enable checksums and the box dies in
between, they should check whether checksums are actually enabled and
re-run if not?  Because it could also mean running pg_checksums --check
on the cluster, which wouldn't work in that case as the control file has
not been updated yet.

Maybe it could be formulated like "If pg_checksums is aborted or killed
in its  operation while enabling or disabling checksums, the cluster
will have the same state with respect of checksums as before the
operation and pg_checksums needs to be restarted."?

Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.ba...@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz

Re: Offline enabling/disabling of data checksums

2019-03-22 Thread Michael Paquier

On Fri, Mar 22, 2019 at 09:13:43AM +0100, Michael Banck wrote:
> Don't we need a big warning that the cluster must not be started during
> operation of pg_checksums as well, now that we don't disallow it?

The same applies to pg_rewind and pg_basebackup, so I would classify
that as a pilot error.  How would you formulate that in the docs if
you add it.
--
Michael


signature.asc
Description: PGP signature

Re: Fwd: Add tablespace tap test to pg_rewind

2019-03-22 Thread Shaoqi Bai

On Fri, Mar 22, 2019 at 1:34 PM Michael Paquier  wrote:

> On Thu, Mar 21, 2019 at 11:41:01PM +0800, Shaoqi Bai wrote:
> > Have updated the patch doing as you suggested
>
> +   RewindTest::setup_cluster($test_mode, ['-g']);
> +   RewindTest::start_master();
>
> There is no need to test for group permissions here, 002_databases.pl
> already looks after that.
>
>
Deleted the test for group permissions in updated patch.


> +   # Check for symlink -- needed only on source dir, only allow symlink
> +   # when under pg_tblspc
> +   # (note: this will fall through quietly if file is already gone)
> +   if (-l $srcpath)
> So you need that but in RecursiveCopy.pm because of init_from_backup
> when creating the standby, which makes sense when it comes to
> pg_tblspc.  I am wondering about any side effects though, and if it
> would make sense to just remove the restriction for symlinks in
> _copypath_recurse().
> --
> Michael
>

Checking the RecursiveCopy::copypath being called, only _backup_fs and
init_from_backup called it.
After runing cmd make -C src/bin check in updated patch, seeing no failure.


0002-Add-new-test-with-integration-in-RewindTest.pm.patch
Description: Binary data


0001-Refactors-the-code-for-the-new-option-in-PostgresNod.patch
Description: Binary data

Re: selecting from partitions and constraint exclusion

2019-03-22 Thread Amit Langote

Hi David,

Thanks for checking.

On 2019/03/20 19:41, David Rowley wrote:
> On Wed, 20 Mar 2019 at 17:37, Amit Langote
>  wrote:
>> That's because get_relation_constraints() no longer (as of PG 11) includes
>> the partition constraint for SELECT queries.  But that's based on an
>> assumption that partitions are always accessed via parent, so partition
>> pruning would make loading the partition constraint unnecessary.  That's
>> not always true, as shown in the above example.
>>
>> Should we fix that?  I'm attaching a patch here.
> 
> Perhaps we should. The constraint_exclusion documents [1] just mention:
> 
>> Controls the query planner's use of table constraints to optimize queries.
> 
> and I'm pretty sure you could class the partition constraint as a
> table constraint.

Yes.

> As for the patch:
> 
> + if ((root->parse->commandType == CMD_SELECT && !IS_OTHER_REL(rel)) ||
> 
> Shouldn't this really be checking rel->reloptkind == RELOPT_BASEREL
> instead of !IS_OTHER_REL(rel) ?

Hmm, thought I'd use the macro if we have one, but I'll change as you
suggest if that's what makes the code easier to follow.  As you might
know, we can only get "simple" relations here.

> For the comments:
> 
> + * For selects, we only need those if the partition is directly mentioned
> + * in the query, that is not via parent.  In case of the latter, partition
> + * pruning, which uses the parent table's partition bound descriptor,
> + * ensures that we only consider partitions whose partition constraint
> + * satisfy the query quals (or, the two don't contradict each other), so
> + * loading them is pointless.
> + *
> + * For updates and deletes, we always need those for performing partition
> + * pruning using constraint exclusion, but, only if pruning is enabled.
> 
> You mention "the latter", normally you'd only do that if there was a
> former, but in this case there's not.

I was trying to go for "accessing partition directly" as the former and
"accessing it via the parent" as the latter, but maybe the sentence as
written cannot be read that way.

> How about just making it:
> 
> /*
>  * Append partition predicates, if any.
>  *
>  * For selects, partition pruning uses the parent table's partition bound
>  * descriptor, so there's no need to include the partition constraint for
>  * this case.  However, if the partition is referenced directly in the query
>  * then no partition pruning will occur, so we'll include it in the case.
>  */
> if ((root->parse->commandType != CMD_SELECT && enable_partition_pruning) ||
> (root->parse->commandType == CMD_SELECT && rel->reloptkind ==
> RELOPT_BASEREL))

OK, I will use this text.

> For the tests, it seems excessive to create some new tables for this.
> Won't the tables in the previous test work just fine?

OK, I have revised the tests to use existing tables.

I'll add this to July fest to avoid forgetting about this.

Thanks,
Amit
From 2fadf7c9cb35f3993e2d9cf91f8cd580fe5f59fb Mon Sep 17 00:00:00 2001
From: amit 
Date: Wed, 20 Mar 2019 13:27:37 +0900
Subject: [PATCH v2] Fix planner to load partition constraint in some cases

For select queries that access a partition directly, we should
load the partition constraint so that constraint exclusion can
use it to exclude the partition based on query quals.  When
partitions are accessed indirectly via the parent table, it's
unnecessary to load the partition constraint, because partition
pruning will only select those partitions whose partition
constraint satisfies query quals, making it unnecessary to run
constraint exclusion on partitions.
---
 src/backend/optimizer/util/plancat.c  | 10 +--
 src/test/regress/expected/partition_prune.out | 41 +++
 src/test/regress/sql/partition_prune.sql  | 18 
 3 files changed, 66 insertions(+), 3 deletions(-)

diff --git a/src/backend/optimizer/util/plancat.c 
b/src/backend/optimizer/util/plancat.c
index 30f4dc151b..ce38a50afb 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -1270,10 +1270,14 @@ get_relation_constraints(PlannerInfo *root,
 * Append partition predicates, if any.
 *
 * For selects, partition pruning uses the parent table's partition 
bound
-* descriptor, instead of constraint exclusion which is driven by the
-* individual partition's partition constraint.
+* descriptor, so there's no need to include the partition constraint 
for
+* this case.  However, if the partition is referenced directly in the
+* query then no partition pruning will occur, so we'll include it in 
that
+* case.
 */
-   if (enable_partition_pruning && root->parse->commandType != CMD_SELECT)
+   if ((root->parse->commandType != CMD_SELECT && 
enable_partition_pruning) ||
+   (root->parse->commandType == CMD_SELECT &&
+rel->reloptkind == RELOPT_BASEREL))
{
List   *pcqual = Rel

Re: Offline enabling/disabling of data checksums

2019-03-22 Thread Michael Banck

Hi,

Am Freitag, den 22.03.2019, 09:27 +0900 schrieb Michael Paquier:
> I have added in the docs a warning about a host crash while doing the 
> operation, with a recommendation to check the state of the checksums
> on the data folder should it happen, and the previous portion of the
> docs about clusters.  Your suggestion sounds adapted.  I would be
> tempted to add a bigger warning in pg_rewind or pg_basebackup about
> that, but that's a different story for another time.
> 
> Does that look fine to you?

Don't we need a big warning that the cluster must not be started during
operation of pg_checksums as well, now that we don't disallow it?


Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.ba...@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz

Re: Special role for subscriptions

2019-03-22 Thread Evgeniy Efimkin

Hi!
> These are basically that the truncate, insert, delete and insert
> rights for the role creating the subscription. Why would we actually
> need that?
It's for security reasons. Because possible to attack target server. If 
publication have system tables for instance pg_authid

> pg_subscription_users and these should be able to dump subscriptions,
> so you have at least one problem.
But in system_views.sql we give grant on subconninfo column and pg_dump 
required superuser privilege only for postgesql under 12 version. Old version 
pg_dump still works but require superuser for dump subscription.

 
Efimkin Evgeny

1 2 >

1 - 100 of 103 matches

Mail list logo