Re: BUG #15668: Server crash in transformPartitionRangeBounds

2019-03-06 Thread Michael Paquier
On Wed, Mar 06, 2019 at 04:00:42PM +0900, Amit Langote wrote:
> Thanks for looking at this.  Your patch seems better, because it allows us
> to keep the error message consistent with the message one would get with
> list-partitioned syntax.

Thanks for confirming.  I think that it would also be nice to add more
test coverage for such error patterns with all the strategies.  It would
be good to fix that first; I can take care of it.

That said, I don't find the error "missing FROM-clause entry for
table" very convincing when it is raised for a partition bound that
uses a column defined in the relation.  Adding more error classes to
the CRERR_NO_RTE set would perhaps be nice, but I am not sure how
elegant that can be made when looking at expressions for partition
bounds.
--
Michael




Re: Prevent extension creation in temporary schemas

2019-03-06 Thread Chris Travers
On Wed, Mar 6, 2019 at 3:19 AM Michael Paquier  wrote:

> On Tue, Mar 05, 2019 at 12:47:54PM +, Chris Travers wrote:
> > I tried installing a test extension into a temp schema.  I found
> > this was remarkably difficult to do because pg_temp did not work (I
> > had to create a temporary table and then locate the actual table it
> > was created in).  While that might also be a bug it is not in the
> > scope of this patch so mostly noting in terms of future work.
>
> pgcrypto works in this case.
>

So the issue here is in finding the pg_temp schema to install into.  The
extension itself is less of an issue.

The point of my note above is that there are other sharp corners that have
to be rounded off in order to make this work really well.

>
> > After creating the extension I did as follows:
> > \dx in the current session shows the extension
> > \dx in a stock psql shows the extension in a separate session
> > \dx with a patched psql in a separate session does not show the
> > extension.
> >
> > In terms of the scope of this patch, I think this correctly and
> > fully solves the problem at hand.
>
> I was just looking at this patch this morning with fresh eyes, and I
> think that I have found one argument to *not* apply it.  Imagine the
> following in one session:
> =# create extension pgcrypto with schema pg_temp_3;
> CREATE EXTENSION
> =# \dx
>   List of installed extensions
>Name   | Version |   Schema   | Description
> ----------+---------+------------+------------------------------
>  pgcrypto | 1.3 | pg_temp_3  | cryptographic functions
>  plpgsql  | 1.0 | pg_catalog | PL/pgSQL procedural language
> (2 rows)
>
> That's all good, we see that the session which created this extension
> has it listed.  Now let's use in parallel a second session:
> =# create extension pgcrypto with schema pg_temp_4;
> ERROR:  42710: extension "pgcrypto" already exists
> LOCATION:  CreateExtension, extension.c:1664
> =# \dx
>   List of installed extensions
>Name   | Version |   Schema   | Description
> ----------+---------+------------+------------------------------
>  plpgsql  | 1.0 | pg_catalog | PL/pgSQL procedural language
> (1 row)
>
> This is actually also good, because the extension of the temporary
> schema of the first session does not show up.  Now I think that this
> can bring some confusion to the user actually, because the extension
> becomes not listed via \dx, but trying to create it with a different
> schema fails.
>

OK, so at present I see three distinct issues here, for which maybe three
different patches will be needed over time.

The issues are:

1.  create extension pgcrypto with schema pg_temp; fails because there is
no schema actually named pg_temp.
2.  If you work around this, the \dx shows temporary extensions in other
sessions.  This is probably the most minor issue of the three.
3.  You cannot create the same extension in two different schemas.
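For readers unfamiliar with the naming wrinkle behind issue #1, a rough sketch of the mapping involved (hypothetical helper name; the real resolution happens inside the server's namespace code):

```python
def physical_temp_schema(backend_id: int) -> str:
    """Each backend's temporary schema is physically named pg_temp_<N>;
    "pg_temp" is only an alias resolved per session at lookup time."""
    return f"pg_temp_{backend_id}"

# CREATE EXTENSION ... WITH SCHEMA pg_temp fails because the catalog wants
# the physical name, so one must first discover it -- e.g. by creating a
# temporary table and checking which schema it landed in.
assert physical_temp_schema(3) == "pg_temp_3"
```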

My expectation is that this may be a situation where other sharp corners
are discovered over time.  My experience is that where things are difficult
to do in PostgreSQL and hence not common, these sharp corners exist
(domains vs constraints in table-based composite types for example,
multiple inheritance being another).

It is much easier to review patches if they make small, well-defined
changes to the code.  I don't really have an opinion on whether this should
be applied as is, or moved to the next commitfest in the hope that we can
fix issue #3 there too.  But I would recommend not fixing the pg_temp
naming (#1 above) until at least the other two are fixed.  There is no
sense in making this easy yet.  And I would prefer to review or write
patches that address these issues one at a time rather than try to get
them all reviewed and included together.

But I don't think there is likely to be a lot of user confusion here.  It
is hard enough to install extensions in temporary schemas that those who do
can be expected to know more than \dx commands.

> Thoughts?


-- 
Best Regards,
Chris Travers
Head of Database

Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com
Saarbrücker Straße 37a, 10405 Berlin


Re: Prevent extension creation in temporary schemas

2019-03-06 Thread Chris Travers
On Wed, Mar 6, 2019 at 9:33 AM Chris Travers wrote:

>> Thoughts?
To reiterate, my experience with PostgreSQL is that people doing
particularly exotic work in PostgreSQL can expect to hit equally exotic
bugs.  I have a list that I will not bore people with here.

I think there is a general consensus here that creating extensions in temp
schemas is something we would like to support.  So I think we should fix
these bugs before we make it easy to do.  And this patch addresses one of
those.


-- 
Best Regards,
Chris Travers
Head of Database

Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com
Saarbrücker Straße 37a, 10405 Berlin


pg_basebackup against older server versions

2019-03-06 Thread Devrim Gündüz

Hi,

Apologies if this has been discussed before: when I run pg_basebackup from
git head against a v11 server, it treats v11 as v12: it does not create
recovery.conf, adds the recovery parameters to postgresql.auto.conf, and
also creates a standby.signal file.  Is this expected, or a bug?

Regards,
-- 
Devrim Gündüz
Open Source Solution Architect, Red Hat Certified Engineer
Twitter: @DevrimGunduz , @DevrimGunduzTR




Re: Make drop database safer

2019-03-06 Thread Ashwin Agrawal
On Mon, Feb 11, 2019 at 3:55 PM Ashwin Agrawal  wrote:

>
> Thanks for the response and inputs.
>
> On Sat, Feb 9, 2019 at 4:51 AM Andres Freund  wrote:
>
>> Hi,
>>
>> On 2019-02-08 16:36:13 -0800, Alexandra Wang wrote:
>> > Current sequence of operations for drop database (dropdb())
>> > 1. Start Transaction
>> > 2. Make catalog changes
>> > 3. Drop database buffers
>> > 4. Forget database fsync requests
>> > 5. Checkpoint
>> > 6. Delete database directory
>> > 7. Commit Transaction
>> >
>> > Problem
>> > This sequence is unsafe on a couple of fronts.  For example, if drop
>> > database aborts (i.e., the system crashes or is shut down) right after
>> > the buffers are dropped in step 3 or step 4, the database will still
>> > exist and be fully accessible, but will have lost the data from the
>> > dirty buffers.  This seems very bad.
>> >
>> > The operation can abort after step 5 as well, in which case the
>> > entries remain in the catalog but the database is not accessible.
>> > That is bad too, but not as severe as the case above, where the
>> > database exists but some data goes magically missing.
>> >
>> > Repro:
>> > ```
>> > CREATE DATABASE test;
>> > \c test
>> > CREATE TABLE t1(a int); CREATE TABLE t2(a int); CREATE TABLE t3(a int);
>> > \c postgres
>> > DROP DATABASE test;  <<== kill the session after DropDatabaseBuffers()
>> > (make sure to issue a checkpoint before killing the session)
>> > ```
>> >
>> > Proposed ways to fix
>> > 1. CommitTransactionCommand() right after step 2.  This makes it fully
>> > safe, as the catalog will have the database dropped.  Files may still
>> > exist on disk in some cases, which is okay.  This also makes it
>> > consistent with the approach used in movedb().
>>
>> To me this seems bad. The current failure mode obviously isn't good, but
>> the data obviously isn't valuable, and just losing track of an entire
>> database's worth of data seems worse.
>>
>
> So, based on that response, it seems that not losing track of the files
> associated with the database is the design choice we wish to achieve.
> Hence the catalog having an entry while the data directory has been
> deleted is acceptable behavior and doesn't need to be solved.
>
>> > 2. An alternative way to make it safer is to perform the Checkpoint
>> > (step 5) just before dropping the database buffers, to avoid the
>> > unsafe window.  Caveats of this solution:
>> > - Performs IO for data which, in the success case, will get deleted
>> >   anyway
>> > - Still doesn't cover the case where the catalog has the database
>> >   entry but the files are removed from disk
>>
>> That seems like an unacceptable slowdown.
>>
>
> Given that dropping a database should be an infrequent operation, and the
> only additional IO cost is for the buffers of that database itself (the
> Checkpoint is performed in a later step anyway), is it really an
> unacceptable slowdown compared to the safety it brings?
>
>
>>
>> > 3. A fancier approach is to use the pending-delete mechanism used by
>> > relation drops to perform these non-catalog activities at commit.
>> > The pending-delete structure can easily gain a boolean to convey
>> > dropping a database directory instead of a file.  Given that drop
>> > database can't be performed inside a transaction, it doesn't need to
>> > be done this way, but it makes for one consistent approach to on-disk
>> > removal.
>>
>> ISTM we'd need to do something like this.
>>
>
> Given the above design choice to retain the link to the database files
> until they are actually deleted, we are not seeing why the pending delete
> approach is any better than approach 1.  It allows us to track the
> database OID in the commit transaction xlog record, but any checkpoint
> after that still loses the reference to the database.  That is the same
> situation as in approach 1, where a separate XLOG_DBASE_DROP xlog record
> is written just after committing the transaction.
> When we proposed approach 3, we thought it was functionally the same as
> approach 1, just different in implementation.  But your preference for
> this approach, while calling approach 1 bad, reads as though the pending
> delete approach is functionally different; we would like to hear more
> about how.
>
> Considering the design choice we must meet, it seems approach 2, moving
> the Checkpoint from step 5 to before step 3, would give us the desired
> safety and retain the desired link to the database until we actually
> delete its files.
>

Awaiting a response on this thread; the inputs are highly appreciated.
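A minimal sketch of the reordering being argued for (approach 2) — purely illustrative, not PostgreSQL source:

```python
# Current dropdb() sequence as described upthread.
current = [
    "start transaction",
    "make catalog changes",
    "drop database buffers",   # crash after this point loses dirty data
    "forget database fsync requests",
    "checkpoint",
    "delete database directory",
    "commit transaction",
]

# Approach 2: move the checkpoint ahead of dropping the buffers, so a crash
# in that window no longer discards dirty data for a database that still
# exists and is still accessible.
proposed = list(current)
proposed.remove("checkpoint")
proposed.insert(proposed.index("drop database buffers"), "checkpoint")

assert proposed.index("checkpoint") < proposed.index("drop database buffers")
```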


Re: pg_basebackup against older server versions

2019-03-06 Thread Michael Paquier
On Wed, Mar 06, 2019 at 11:55:12AM +0300, Devrim Gündüz wrote:
> Apologies if this has been discussed before: when I run pg_basebackup from
> git head against a v11 server, it treats v11 as v12: it does not create
> recovery.conf, adds the recovery parameters to postgresql.auto.conf, and
> also creates a standby.signal file.  Is this expected, or a bug?

You are right, this is a bug.  Compatibility with past server versions
should be preserved, and we have made an effort to do so in the past (see
the switch to pg_wal/, for example).  Fortunately, maintaining the
compatibility is simple enough, as the connection information is close by:
we just need to write recovery.conf instead of postgresql.auto.conf, and
avoid creating standby.signal.
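The compatibility switch described here could look roughly like this (illustrative Python; the constant's numeric form is an assumption mirroring libpq's PQserverVersion() convention, where v11.5 is reported as 110005):

```python
MINIMUM_VERSION_FOR_RECOVERY_GUC = 120000  # assumed numeric form of "v12"

def recovery_file_for(server_version_num: int) -> str:
    """Pre-v12 servers expect recovery.conf; v12 and later use
    postgresql.auto.conf plus an empty standby.signal file."""
    if server_version_num < MINIMUM_VERSION_FOR_RECOVERY_GUC:
        return "recovery.conf"
    return "postgresql.auto.conf"

assert recovery_file_for(110005) == "recovery.conf"
assert recovery_file_for(120000) == "postgresql.auto.conf"
```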
--
Michael




bug in update tuple routing with foreign partitions

2019-03-06 Thread Amit Langote
Hi,

(added Fujita-san)

I noticed a bug in how UPDATE tuple routing initializes the ResultRelInfos
to use for partition routing targets.  Specifically, the bug occurs when
the UPDATE targets include a foreign partition that is locally modified (as
opposed to being modified directly on the remote server) and its
ResultRelInfo (called a subplan result rel in the source code) is reused
for tuple routing:

-- setup
create extension postgres_fdw ;
create server loopback foreign data wrapper postgres_fdw;
create user mapping for current_user server loopback;
create table p (a int) partition by list (a);
create table p1 partition of p for values in (1);
create table p2base (a int check (a = 2));
create foreign table p2 partition of p for values in (2) server loopback
options (table_name 'p2base');
insert into p values (1), (2);

-- see in the plan that foreign partition p2 is locally modified
explain verbose update p set a = 2 from (values (1), (2)) s(x) where a =
s.x returning *;
                               QUERY PLAN
─────────────────────────────────────────────────────────────────────────
 Update on public.p  (cost=0.05..236.97 rows=50 width=42)
   Output: p1.a, "*VALUES*".column1
   Update on public.p1
   Foreign Update on public.p2
 Remote SQL: UPDATE public.p2base SET a = $2 WHERE ctid = $1 RETURNING a
   ->  Hash Join  (cost=0.05..45.37 rows=26 width=42)
 Output: 2, p1.ctid, "*VALUES*".*, "*VALUES*".column1
 Hash Cond: (p1.a = "*VALUES*".column1)
 ->  Seq Scan on public.p1  (cost=0.00..35.50 rows=2550 width=10)
   Output: p1.ctid, p1.a
 ->  Hash  (cost=0.03..0.03 rows=2 width=32)
   Output: "*VALUES*".*, "*VALUES*".column1
   ->  Values Scan on "*VALUES*"  (cost=0.00..0.03 rows=2 width=32)
 Output: "*VALUES*".*, "*VALUES*".column1
   ->  Hash Join  (cost=100.05..191.59 rows=24 width=42)
 Output: 2, p2.ctid, "*VALUES*".*, "*VALUES*".column1
 Hash Cond: (p2.a = "*VALUES*".column1)
 ->  Foreign Scan on public.p2  (cost=100.00..182.27 rows=2409 width=10)
   Output: p2.ctid, p2.a
   Remote SQL: SELECT a, ctid FROM public.p2base FOR UPDATE
 ->  Hash  (cost=0.03..0.03 rows=2 width=32)
   Output: "*VALUES*".*, "*VALUES*".column1
   ->  Values Scan on "*VALUES*"  (cost=0.00..0.03 rows=2 width=32)
 Output: "*VALUES*".*, "*VALUES*".column1


-- as opposed to directly on remote side (because there's no local join)
explain verbose update p set a = 2 returning *;
                             QUERY PLAN
─────────────────────────────────────────────────────────────────────
 Update on public.p  (cost=0.00..227.40 rows=5280 width=10)
   Output: p1.a
   Update on public.p1
   Foreign Update on public.p2
   ->  Seq Scan on public.p1  (cost=0.00..35.50 rows=2550 width=10)
 Output: 2, p1.ctid
   ->  Foreign Update on public.p2  (cost=100.00..191.90 rows=2730 width=10)
 Remote SQL: UPDATE public.p2base SET a = 2 RETURNING a
(8 rows)

Running the first update query crashes:

update p set a = 2 from (values (1), (2)) s(x) where a = s.x returning
tableoid::regclass, *;
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.

The problem seems to occur because ExecSetupPartitionTupleRouting thinks
it can reuse p2's ResultRelInfo for tuple routing.  In this case, it can't
be reused, because its ri_FdwState contains information that
postgresExecForeignUpdate will need later; yet it is reused today.
Because it's assigned to be used for tuple routing, its ri_FdwState gets
overwritten by postgresBeginForeignInsert, which the tuple routing code
invokes on the aforementioned ResultRelInfo.  The crash occurs when
postgresExecForeignUpdate() can't find the information it expects in
ri_FdwState.

The solution is to teach ExecSetupPartitionTupleRouting to avoid using a
subplan result rel if its ri_FdwState is already set.  I was wondering
whether it should also check that ri_usesFdwDirectModify is true, but in
that case ri_FdwState is unused, so it's perhaps safe for the tuple
routing code to scribble on it.
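The proposed guard can be sketched as follows (a hypothetical Python model, not the C code; the real check would live in ExecSetupPartitionTupleRouting):

```python
class ResultRelInfo:
    """Toy stand-in for the executor struct; only the field of interest."""
    def __init__(self, ri_FdwState=None):
        self.ri_FdwState = ri_FdwState

def can_reuse_for_routing(rel: ResultRelInfo) -> bool:
    """Reuse a subplan result rel for tuple routing only when no FDW state
    is attached, so postgresBeginForeignInsert cannot clobber state that
    postgresExecForeignUpdate still needs."""
    return rel.ri_FdwState is None

assert can_reuse_for_routing(ResultRelInfo())
assert not can_reuse_for_routing(ResultRelInfo(ri_FdwState={"update": "..."}))
```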

I have attached two patches, one for PG 11, where the problem first
appeared, and another for HEAD.  The patch for PG 11 is significantly
bigger due to having to handle the complexities of mapping subplan result
rel indexes to leaf partition indexes.  Most of that complexity is gone in
HEAD due to 3f2393edefa5, so the patch for HEAD is much simpler.  I've
also added a test in postgres_fdw.sql to exercise this case.

Thanks,
Amit
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out 
b/contrib/postgres_fdw/expected/postgres_fdw.out
index cdd788f825..b108aad897 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/p

Re: Proposal for Signal Detection Refactoring

2019-03-06 Thread Chris Travers
Here's a new patch.  No rush on it.  I am moving it to the next commitfest
anyway because, as code documentation, I think it is a low priority this
late in the release cycle.

The changes mostly address Andres's feedback above.

-- 
Best Regards,
Chris Travers
Head of Database

Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com
Saarbrücker Straße 37a, 10405 Berlin


signal_readme.patch


patch tester symbols

2019-03-06 Thread Erik Rijkers

Hi,

Context: I'm trying to compile the jsonpath v36 patches (these apply 
OK), and on top of those the jsonfunctions and jsontable patch series.


That fails for me (on 
0001-Implementation-of-SQL-JSON-path-language-v36.patch), and now I'm 
wondering why that does not agree with what the patch-tester page shows 
( http://commitfest.cputube.org/ ).


The patch-tester page does not explain what the colors and symbols mean.
Of course one can guess that 'red' and 'cross' are bad, and 'green' and
'check' are good.


But some questions remain:
- Some symbols' color is 'filled-in' (solid), and some are not.  What 
does that mean?

- For each patch there are three symbols; what do those three stand for?
- I suppose there is a regular schedule of apply and compile of each 
patch.  How often does it happen?  Can I see how recent a particular 
reported state is?


Can you throw some light on this?

thanks,

Erik Rijkers



Re: pg_basebackup against older server versions

2019-03-06 Thread Sergei Kornilov
Hello

My fault. I thought pg_basebackup worked only with the same major version,
sorry.  How about the attached patch?

regards, Sergei

diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index 3d2d4cd0b9..e1aa2c1cfc 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -66,6 +66,11 @@ typedef struct TablespaceList
  */
 #define MINIMUM_VERSION_FOR_TEMP_SLOTS 10
 
+/*
+ * recovery.conf has been moved into GUC system in version 12
+ */
+#define MINIMUM_VERSION_FOR_RECOVERY_GUC 12
+
 /*
  * Different ways to include WAL
  */
@@ -974,6 +979,7 @@ ReceiveTarFile(PGconn *conn, PGresult *res, int rownum)
 	bool		basetablespace = PQgetisnull(res, rownum, 0);
 	bool		in_tarhdr = true;
 	bool		skip_file = false;
+	bool		is_recovery_guc_supported = true;
 	bool		is_postgresql_auto_conf = false;
 	bool		found_postgresql_auto_conf = false;
 	int			file_padding_len = 0;
@@ -984,6 +990,9 @@ ReceiveTarFile(PGconn *conn, PGresult *res, int rownum)
 	gzFile		ztarfile = NULL;
 #endif
 
+	if (PQserverVersion(conn) < MINIMUM_VERSION_FOR_RECOVERY_GUC)
+		is_recovery_guc_supported = false;
+
 	if (basetablespace)
 	{
 		/*
@@ -1130,11 +1139,15 @@ ReceiveTarFile(PGconn *conn, PGresult *res, int rownum)
 			{
 char		header[512];
 
-if (!found_postgresql_auto_conf)
+/*
+ * we need write new postgresql.auto.conf if missing or
+ * recovery.conf for old systems
+ */
+if (!found_postgresql_auto_conf || !is_recovery_guc_supported)
 {
 	int			padding;
 
-	tarCreateHeader(header, "postgresql.auto.conf", NULL,
+	tarCreateHeader(header, is_recovery_guc_supported ? "postgresql.auto.conf" : "recovery.conf", NULL,
 	recoveryconfcontents->len,
 	pg_file_create_mode, 04000, 02000,
 	time(NULL));
@@ -1147,13 +1160,16 @@ ReceiveTarFile(PGconn *conn, PGresult *res, int rownum)
 		WRITE_TAR_DATA(zerobuf, padding);
 }
 
-tarCreateHeader(header, "standby.signal", NULL,
-0,	/* zero-length file */
-pg_file_create_mode, 04000, 02000,
-time(NULL));
+if (is_recovery_guc_supported)
+{
+	tarCreateHeader(header, "standby.signal", NULL,
+	0,	/* zero-length file */
+	pg_file_create_mode, 04000, 02000,
+	time(NULL));
 
-WRITE_TAR_DATA(header, sizeof(header));
-WRITE_TAR_DATA(zerobuf, 511);
+	WRITE_TAR_DATA(header, sizeof(header));
+	WRITE_TAR_DATA(zerobuf, 511);
+}
 			}
 
 			/* 2 * 512 bytes empty data at end of file */
@@ -1253,15 +1269,25 @@ ReceiveTarFile(PGconn *conn, PGresult *res, int rownum)
 		 * look at the file metadata: we may want append
 		 * recovery info into postgresql.auto.conf and skip
 		 * standby.signal file.  In both cases we must
-		 * calculate tar padding
+		 * calculate tar padding. Also we need skip
+		 * recovery.conf file for old cluster versions
 		 */
-		skip_file = (strcmp(&tarhdr[0], "standby.signal") == 0);
-		is_postgresql_auto_conf = (strcmp(&tarhdr[0], "postgresql.auto.conf") == 0);
+		if (is_recovery_guc_supported)
+		{
+			skip_file = (strcmp(&tarhdr[0], "standby.signal") == 0);
+			is_postgresql_auto_conf = (strcmp(&tarhdr[0], "postgresql.auto.conf") == 0);
+		}
+		else
+		{
+			skip_file = (strcmp(&tarhdr[0], "recovery.conf") == 0);
+		}
 
 		filesz = read_tar_number(&tarhdr[124], 12);
 		file_padding_len = ((filesz + 511) & ~511) - filesz;
 
-		if (is_postgresql_auto_conf && writerecoveryconf)
+		if (is_recovery_guc_supported &&
+			is_postgresql_auto_conf &&
+			writerecoveryconf)
 		{
 			/* replace tar header */
 			char		header[512];
@@ -1313,7 +1339,9 @@ ReceiveTarFile(PGconn *conn, PGresult *res, int rownum)
 		pos += bytes2write;
 		filesz -= bytes2write;
 	}
-	else if (is_postgresql_auto_conf && writerecoveryconf)
+	else if (is_recovery_guc_supported &&
+			 is_postgresql_auto_conf &&
+			 writerecoveryconf)
 	{
 		/* append recovery config to postgresql.auto.conf */
 		int			padding;
@@ -1690,6 +1718,9 @@ GenerateRecoveryConf(PGconn *conn)
 		exit(1);
 	}
 
+	if (PQserverVersion(conn) < MINIMUM_VERSION_FOR_RECOVERY_GUC)
+		appendPQExpBufferStr(recoveryconfcontents, "standby_mode = 'on'\n");
+
 	connOptions = PQconninfo(conn);
 	if (connOptions == NULL)
 	{
@@ -1765,9 +1796,11 @@ WriteRecoveryConf(void)
 	char		filename[MAXPGPATH];
 	FILE	   *cf;
 
-	snprintf(filename, MAXPGPATH, "%s/%s", basedir, "postgresql.auto.conf");
+	snprintf(filename, MAXPGPATH, "%s/%s", basedir,
+			 PQserverVersion(conn) < MINIMUM_VERSION_FOR_RECOVERY_GUC ?
+			 "recovery.conf" : "postgresql.auto.conf");
 
-	cf = fopen(filename, "a");
+	cf = fopen(filename, PQserverVersion(conn) < MINIMUM_VERSION_FOR_RECOVERY_GUC ? "w" : "a");
 	if (cf == NULL)
 	{
 		fprintf(stderr, _("%s: could not open file \"%s\": %s\n"), progname, filename, strerror(errno));
@@ -17

RE: proposal: pg_restore --convert-to-text

2019-03-06 Thread José Arthur Benetasso Villanova




On Thu, 28 Feb 2019, Imai, Yoshikazu wrote:

> Is there no need to rewrite the Description in the docs to state that we
> should specify either the -d or the -f option?
> (It might also be better to note that if the -l option is given, neither
> -d nor -f is necessarily needed.)

Since the default part of the text was removed, this looks OK to me.



I also have a simple question about the code.

I thought the below if-else condition

+   if (filename && strcmp(filename, "-") == 0)
+   fn = fileno(stdout);
+   else if (filename)
    fn = -1;
    else if (AH->FH)

can also be written in the form below.

    if (filename)
    {
        if (strcmp(filename, "-") == 0)
            fn = fileno(stdout);
        else
            fn = -1;
    }
    else if (AH->FH)

I think the former one looks prettier, but which one is preferred?


Aside from the above question, I tested the code against an up-to-date
repository.  It compiled, worked as expected, and passed all tests.


--
Jose Arthur Benetasso Villanova



openLogOff is not needed anymore

2019-03-06 Thread Antonin Houska
... obviously due to commit c24dcd0. The following patch removes it.

diff --git a/src/backend/access/transam/xlog.c 
b/src/backend/access/transam/xlog.c
index ecd12fc53a..0fdd82a287 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -771,13 +771,11 @@ static const char *xlogSourceNames[] = {"any", "archive", "pg_wal", "stream"};
 
 /*
  * openLogFile is -1 or a kernel FD for an open log file segment.
- * When it's open, openLogOff is the current seek offset in the file.
- * openLogSegNo identifies the segment.  These variables are only
- * used to write the XLOG, and so will normally refer to the active segment.
+ * openLogSegNo identifies the segment.  These variables are only used to
+ * write the XLOG, and so will normally refer to the active segment.
  */
 static int openLogFile = -1;
 static XLogSegNo openLogSegNo = 0;
-static uint32 openLogOff = 0;
 
 /*
  * These variables are used similarly to the ones above, but for reading
@@ -2447,7 +2445,6 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
/* create/use new log file */
use_existent = true;
openLogFile = XLogFileInit(openLogSegNo, &use_existent, true);
-   openLogOff = 0;
}
 
/* Make sure we have the current logfile open */
@@ -2456,7 +2453,6 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
XLByteToPrevSeg(LogwrtResult.Write, openLogSegNo,
wal_segment_size);
openLogFile = XLogFileOpen(openLogSegNo);
-   openLogOff = 0;
}
 
/* Add current page to the set of pending pages-to-dump */
@@ -2508,15 +2504,13 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 errmsg("could not write to log file %s "
        "at offset %u, length %zu: %m",
        XLogFileNameP(ThisTimeLineID, openLogSegNo),
-       openLogOff, nbytes)));
+       startoffset, nbytes)));
}
nleft -= written;
from += written;
startoffset += written;
} while (nleft > 0);
 
-   /* Update state for write */
-   openLogOff += nbytes;
npages = 0;
 
/*
@@ -2602,7 +2596,6 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
XLByteToPrevSeg(LogwrtResult.Write, openLogSegNo,
wal_segment_size);
openLogFile = XLogFileOpen(openLogSegNo);
-   openLogOff = 0;
}
 
issue_xlog_fsync(openLogFile, openLogSegNo);


-- 
Antonin Houska
https://www.cybertec-postgresql.com



Re: pg_dump is broken for partition tablespaces

2019-03-06 Thread David Rowley
On Wed, 6 Mar 2019 at 20:19, Michael Paquier  wrote:
>
> On Wed, Mar 06, 2019 at 07:45:06PM +1300, David Rowley wrote:
> > Can anyone see any fundamental reason that we should not create a
> > partitioned table by doing CREATE TABLE followed by ATTACH PARTITION?
> >  If not, I'll write a patch that fixes it that way.
>
> The part for partitioned indexes is already battle-proven, so if the
> part for partitioned tables can be consolidated the same way that
> would be really nice.

I think Andres is also going to need it to work this way for the
pluggable storage patch too.

Looking closer at this, I discovered that when pg_dump is in binary
upgrade mode, it crafts the pg_dump output this way anyway...
Obviously, the column orders can't go changing magically in that case,
since we're about to plug the old heap into the new table.  Thanks to the
removal of the special case, this patch turned out to remove more code
than it adds.

Patch attached.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


v1-0001-Make-pg_dump-emit-ATTACH-PARTITION-instead-of-PAR.patch


Re: Inheriting table AMs for partitioned tables

2019-03-06 Thread David Rowley
On Wed, 6 Mar 2019 at 07:19, Robert Haas  wrote:
>
> On Tue, Mar 5, 2019 at 12:59 PM Andres Freund  wrote:
> > Based on this mail I'm currently planning to simply forbid specifying
> > USING for partitioned tables. Then we can argue about this later.
>
> +1.  I actually think that might be the right thing in the long-term,
> but it undeniably avoids committing to any particular decision in the
> short term, which seems good.

I've not really been following the storage AM patch, but given that a
partition's TABLESPACE is inherited from its partitioned table, I'd
find it pretty surprising if USING didn't do the same.  They're
both storage options, so I think having them behave differently is
going to cause some confusion.

I think the patch I just submitted to [1] should make it pretty easy
to make this work the same as TABLESPACE does.

[1] 
https://www.postgresql.org/message-id/CAKJS1f_iyBpAuYBPQv_GGeME%3Dg9Rpr8yWjCaYV4E685yQ1uzkw%40mail.gmail.com

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: WIP: Avoid creation of the free space map for small tables

2019-03-06 Thread John Naylor
On Fri, Jan 25, 2019 at 9:50 AM Amit Kapila  wrote:
> Once we agree on the code, we need to test below scenarios:
> (a) upgrade from all supported versions to the latest version
> (b) upgrade standby with and without using rsync.

Although the code hasn't been reviewed yet, I went ahead and tested
(a) on v21 of the pg_upgrade patch [1]. To do this I dumped out a 9.4
instance with the regression database and restored it to all supported
versions. To make it work with pg_upgrade, I first had to drop tables
with oids, drop functions referring to C libraries, and drop the
later-removed '=>' operator. Then I pg_upgrade'd in copy mode from all
versions to HEAD with the patch applied. pg_upgrade worked without
error, and the following query returned 0 bytes on all the new
clusters:

select sum(pg_relation_size(oid, 'fsm')) as total_fsm_size
from pg_class where relkind in ('r', 't')
and pg_relation_size(oid, 'main') <= 4 * 8192;

The complementary query (> 4 * 8192) returned the same number of bytes
as in the original 9.4 instance.

To make it easy to find: the latest regression test patch, which is
intended to be independent of block size, is in [2].

[1] 
https://www.postgresql.org/message-id/CACPNZCu4cOdm3uGnNEGXivy7Gz8UWyQjynDpdkPGabQ18_zK6g%40mail.gmail.com

[2] 
https://www.postgresql.org/message-id/CACPNZCsWa%3Ddd0K%2BFiODwM%3DLEsepAHVJCoSx2Avew%3DxBEX3Ywjw%40mail.gmail.com

-- 
John Naylor    https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: patch tester symbols

2019-03-06 Thread Nikita Glukhov

On 06.03.2019 13:11, Erik Rijkers wrote:


Hi,

Context: I'm trying to compile the jsonpath v36 patches (these apply 
OK), and on top of those the jsonfunctions and jsontable patch series.


That fails for me (on 
0001-Implementation-of-SQL-JSON-path-language-v36.patch), and now I'm 
wondering why that does not agree with what the patch-tester page 
shows ( http://commitfest.cputube.org/ ).


Patch 0001-Implementation-of-SQL-JSON-path-language-v36.patch from the 
"SQL/JSON: functions" patch set is simply a squash of all 6 jsonpath 
patches on which this patch set depends. I included it in this patch 
set just for testing in the patch-tester (I guess the patch-tester cannot 
handle patch dependencies). If you apply the SQL/JSON patch sets step by 
step, then you need to apply only patch 0002 from "SQL/JSON: functions" 
and "SQL/JSON: JSON_TABLE".


--
Nikita Glukhov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: patch tester symbols

2019-03-06 Thread Erik Rijkers

On 2019-03-06 13:00, Nikita Glukhov wrote:

On 06.03.2019 13:11, Erik Rijkers wrote:


Context: I'm trying to compile the jsonpath v36 patches (these apply 
OK), and on top of those the jsonfunctions and jsontable patch series.



Patch 0001-Implementation-of-SQL-JSON-path-language-v36.patch from the
"SQL/JSON: functions" patch set is simply a squash of all 6 jsonpath
patches on which this patch set depends. I included it in this patch
set just for testing in the patch-tester (I guess the patch-tester
cannot handle patch dependencies). If you apply the SQL/JSON patch sets
step by step, then you need to apply only patch 0002 from "SQL/JSON:
functions" and "SQL/JSON: JSON_TABLE".


Ah, that explains it. I suppose I should have guessed that.

Applied in that way it built fine (apply, compile, check-world OK).

Thank you,

Erik Rijkers



Re: Problems with plan estimates in postgres_fdw

2019-03-06 Thread Etsuro Fujita

(2019/02/22 17:17), Etsuro Fujita wrote:

(2019/02/21 6:19), Jeff Janes wrote:



With your tweaks, I'm still not seeing the ORDER-less LIMIT get pushed
down when using use_remote_estimate in a simple test case, either with
this set of patches, nor in the V4 set. However, without
use_remote_estimate, the LIMIT is now getting pushed with these patches
when it does not in HEAD.


Good catch! I think the root cause of that is: when using
use_remote_estimate, remote estimates obtained through remote EXPLAIN
are rounded off to two decimal places, which I completely overlooked.
Will fix. I think I can post a new version early next week.


Done.  Please see [1].  Sorry for the delay.

Best regards,
Etsuro Fujita

[1] https://www.postgresql.org/message-id/5C7FC458.4020400%40lab.ntt.co.jp




Re: bug in update tuple routing with foreign partitions

2019-03-06 Thread Etsuro Fujita

Hi Amit,

(2019/03/06 18:33), Amit Langote wrote:

I noticed a bug with how UPDATE tuple routing initializes ResultRelInfos
to use for partition routing targets.  Specifically, the bug occurs when
UPDATE targets include a foreign partition that is locally modified (as
opposed to being modified directly on the remote server) and its
ResultRelInfo (called subplan result rel in the source code) is reused for
tuple routing:


Will look into this.

Best regards,
Etsuro Fujita




Re: Patch to document base64 encoding

2019-03-06 Thread Karl O. Pinc
On Tue, 5 Mar 2019 23:23:20 -0600
"Karl O. Pinc"  wrote:

> Added documentation for hex and escape encodings,
> including output formats and what are acceptable
> inputs.

Attached: doc_base64_v6.patch

Added index entries for encode and decode functions
above the encoding documentation.  As index entries
are currently generated this does not make any
difference.  But this paragraph is very far
from the other encode/decode index targets on
the page.

Regards,

Karl 
Free Software:  "You don't pay back, you pay forward."
 -- Robert A. Heinlein
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 6765b0d584..bf724d3dee 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -1752,6 +1752,9 @@
 
  decode
 
+
+  base64 encoding
+
 decode(string text,
 format text)

@@ -1769,16 +1772,25 @@
 
  encode
 
+
+ base64 encoding
+
+
+ hex encoding
+
+
+ escape encoding
+
 encode(data bytea,
 format text)

text

 Encode binary data into a textual representation.  Supported
-formats are: base64, hex, escape.
-escape converts zero bytes and high-bit-set bytes to
-octal sequences (\nnn) and
-doubles backslashes.
+formats are:
+base64,
+hex,
+escape.

encode('123\000\001', 'base64')
MTIzAAE=
@@ -2365,6 +2377,90 @@
 format treats a NULL as a zero-element array.

 
+   
+ encode
+   
+   
+ decode
+   
+   
+ base64 encoding
+   
+   
+hex encoding
+   
+   
+escape encoding
+   
+
+   
+ The string and the binary encode
+ and decode functions support the following
+ encodings:
+
+ 
+   
+ base64
+ 
+   
+ The base64 encoding is that
+ of https://tools.ietf.org/html/rfc2045#section-6.8";>RFC
+ 2045 section 6.8.  As per the RFC, encoded lines are
+ broken at 76 characters.  However instead of the MIME CRLF
+ end-of-line marker, only a newline is used for end-of-line.
+   
+   
+ The carriage-return, newline, space, and tab characters are
+ ignored by decode.  Otherwise, an error is
+ raised when decode is supplied invalid
+ base64 data — including when trailing padding is incorrect.
+   
+ 
+   
+
+   
+ hex
+ 
+   
+ hex represents each 4 bits of data as a single
+ hexadecimal digit, 0
+ through f.  Encoding outputs
+ the a-f hex digits in lower
+ case.  Because the smallest unit of data is 8 bits there are
+ always an even number of characters returned
+ by encode.
+   
+   
+ The decode function
+ accepts a-f characters in
+ either upper or lower case.  An error is raised
+ when decode is supplied invalid hex data
+ — including when given an odd number of characters.
+   
+ 
+   
+
+   
+ escape
+ 
+   
+ escape converts zero bytes and high-bit-set
+ bytes to octal sequences
+ (\nnn) and doubles
+ backslashes.  Encoding always produces 4 characters for each
+ high-bit-set input byte.
+   
+   
+ The decode function accepts any number of
+ octal digits after a \ character.  An error is
+ raised when decode is supplied a
+ single \ not followed by an octal digit.
+   
+ 
+   
+ 
+   
+

See also the aggregate function string_agg in
.
@@ -3577,16 +3673,25 @@ SELECT format('Testing %3$s, %2$s, %s', 'one', 'two', 'three');
 
  encode
 
+
+ base64 encoding
+
+
+ hex encoding
+
+
+ escape encoding
+
encode(data bytea,
format text)
   
   text
   
Encode binary data into a textual representation.  Supported
-   formats are: base64, hex, escape.
-   escape converts zero bytes and high-bit-set bytes to
-   octal sequences (\nnn) and
-   doubles backslashes.
+   formats are:
+   base64,
+   hex,
+   escape.
   
   encode('123\000456'::bytea, 'escape')
   123\000456


Batch insert in CTAS/MatView code

2019-03-06 Thread Paul Guo
Hello, Postgres hackers,

The copy code has used batch insert with function heap_multi_insert() to
speed up. It seems that Create Table As or Materialized View could leverage
that code also to boost the performance also. Attached is a patch to
implement that. That was done by Taylor (cc-ed) and me.

The patch also modifies heap_multi_insert() slightly, doing some further
code-level optimization by using static memory instead of a memory
context and dynamic allocation. For ModifyTable inserts there seem to be
more limitations on batch insert (triggers, etc.?), but perhaps we could
still use batch insert in the cases where it is possible?

By the way, while looking at the code, I noticed that there are 9 local
arrays with large length in toast_insert_or_update() which seems to be a
risk of stack overflow. Maybe we should put it as static or global.

Here is a quick simple performance testing on a mirrorless Postgres
instance with the SQLs below. The tests cover tables with small column
length, large column length and toast.

-- tuples with small size.
drop table if exists t1;
create table t1 (a int);

insert into t1 select * from generate_series(1, 1000);
drop table if exists t2;
\timing
create table t2 as select * from t1;
\timing

-- tuples that are untoasted and data that is 1664 bytes wide
drop table if exists t1;
create table t1 (a name, b name, c name, d name, e name, f name, g name, h
name, i name, j name, k name, l name, m name, n name, o name, p name, q
name, r name, s name, t name, u name, v name, w name, x name, y name, z
name);

insert into t1 select 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j',
'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y',
'z' from generate_series(1, 50);
drop table if exists t2;
\timing
create table t2 as select * from t1;
\timing

-- tuples that are toastable.
drop table if exists t1;
create table t1 (a text, b text, c text, d text, e text, f text, g text, h
text, i text, j text, k text, l text, m text, n text, o text, p text, q
text, r text, s text, t text, u text, v text, w text, x text, y text, z
text);

insert into t1 select i, i, i, i, i, i, i, i, i, i, i, i, i, i, i, i, i, i,
i, i, i, i, i, i, i, i from (select repeat('123456789', 1) from
generate_series(1,2000)) i;
drop table if exists t2;
\timing
create table t2 as select * from t1;
\timing

Here are the timing results:

With the patch,

Time: 4728.142 ms (00:04.728)
Time: 14203.983 ms (00:14.204)
Time: 1008.669 ms (00:01.009)

Baseline,
Time: 11096.146 ms (00:11.096)
Time: 13106.741 ms (00:13.107)
Time: 1100.174 ms (00:01.100)

For the toast and large-column cases the improvement is under 10%, but for
the small-column case the improvement is very good. Actually, if I hardcode
the batch count as 4, all test cases improve, though the gain for the
small-column case is smaller than with the current patch. The number 4 is
quite case-specific, so I cannot hardcode it in the patch. Of course we
could tune it further, but the current value seems to be a good trade-off?

Thanks.


0001-Heap-batch-insert-for-CTAS.patch
Description: Binary data


Re: pg_dump is broken for partition tablespaces

2019-03-06 Thread Tom Lane
David Rowley  writes:
> As far as I can see, the biggest fundamental difference with doing
> things this way will be that the column order of partitions will be
> preserved, where before it would inherit the order of the partitioned
> table.  I'm a little unsure if doing this column reordering was an
> intended side-effect or not.

Well, if the normal behavior results in changing the column order,
it'd be necessary to do things differently in --binary-upgrade mode
anyway, because there we *must* preserve column order.  I don't know
if what you're describing represents a separate bug for pg_upgrade runs,
but it might.  Is there any test case for the situation left behind by
the core regression tests?

regards, tom lane



Server Crash in logical decoding if used inside --single mode

2019-03-06 Thread tushar

Hi,

Steps to reproduce on Master Sources -

.) Perform initdb ( ./initdb -D data)

.) set wal_level=logical in postgresql.conf file

.)Connect to psql in single-user mode  ( ./postgres --single  -D data  
postgres)


.)Create logical replication slot followed by select * from 
pg_logical_slot_get_changes


backend>  SELECT * FROM pg_create_logical_replication_slot('m7', 
'test_decoding','f');

     1: slot_name    (typeid = 19, len = 64, typmod = -1, byval = f)
     2: lsn    (typeid = 3220, len = 8, typmod = -1, byval = t)
    
2019-03-06 17:27:42.080 GMT [1132] LOG:  logical decoding found 
consistent point at 0/163B9C8
2019-03-06 17:27:42.080 GMT [1132] DETAIL:  There are no running 
transactions.
2019-03-06 17:27:42.080 GMT [1132] STATEMENT:   SELECT * FROM 
pg_create_logical_replication_slot('m7', 'test_decoding','f');


     1: slot_name = "m7"    (typeid = 19, len = 64, typmod = -1, byval = f)
     2: lsn = "0/163BA00"    (typeid = 3220, len = 8, typmod = -1, 
byval = t)

    
backend> select * from pg_logical_slot_get_changes('m7',null,null);
     1: lsn    (typeid = 3220, len = 8, typmod = -1, byval = t)
     2: xid    (typeid = 28, len = 4, typmod = -1, byval = t)
     3: data    (typeid = 25, len = -1, typmod = -1, byval = f)
    
2019-03-06 17:28:04.979 GMT [1132] LOG:  starting logical decoding for 
slot "m7"
2019-03-06 17:28:04.979 GMT [1132] DETAIL:  Streaming transactions 
committing after 0/163BA00, reading WAL from 0/163B9C8.
2019-03-06 17:28:04.979 GMT [1132] STATEMENT:  select * from 
pg_logical_slot_get_changes('m7',null,null);


2019-03-06 17:28:04.979 GMT [1132] LOG:  logical decoding found 
consistent point at 0/163B9C8
2019-03-06 17:28:04.979 GMT [1132] DETAIL:  There are no running 
transactions.
2019-03-06 17:28:04.979 GMT [1132] STATEMENT:  select * from 
pg_logical_slot_get_changes('m7',null,null);


TRAP: FailedAssertion("!(slot != ((void *)0) && slot->active_pid != 0)", 
File: "slot.c", Line: 428)

Aborted (core dumped)

Stack trace -

(gdb) bt
#0  0x003746e325e5 in raise () from /lib64/libc.so.6
#1  0x003746e33dc5 in abort () from /lib64/libc.so.6
#2  0x008a96ad in ExceptionalCondition (conditionName=<optimized out>, errorType=<optimized out>, fileName=<optimized out>, lineNumber=<optimized out>)

    at assert.c:54
#3  0x00753253 in ReplicationSlotRelease () at slot.c:428
#4  0x00734dbd in pg_logical_slot_get_changes_guts 
(fcinfo=0x2771e48, confirm=true, binary=false) at logicalfuncs.c:355
#5  0x00640aa5 in ExecMakeTableFunctionResult 
(setexpr=0x27704f8, econtext=0x27703a8, argContext=<optimized out>, expectedDesc=0x2792c10, randomAccess=false)

    at execSRF.c:233
#6  0x00650c43 in FunctionNext (node=0x2770290) at 
nodeFunctionscan.c:95
#7  0x0063fbad in ExecScanFetch (node=0x2770290, 
accessMtd=0x650950 <FunctionNext>, recheckMtd=0x6501d0 
<FunctionRecheck>) at execScan.c:93
#8  ExecScan (node=0x2770290, accessMtd=0x650950 <FunctionNext>, 
recheckMtd=0x6501d0 <FunctionRecheck>) at execScan.c:143
#9  0x006390a7 in ExecProcNode (queryDesc=0x276fc28, 
direction=<optimized out>, count=0, execute_once=144) at 
../../../src/include/executor/executor.h:241
#10 ExecutePlan (queryDesc=0x276fc28, direction=<optimized out>, 
count=0, execute_once=144) at execMain.c:1643
#11 standard_ExecutorRun (queryDesc=0x276fc28, direction=<optimized out>, count=0, execute_once=144) at execMain.c:362
#12 0x0079e30b in PortalRunSelect (portal=0x27156b8, 
forward=<optimized out>, count=0, dest=<optimized out>) at 
pquery.c:929
#13 0x0079f671 in PortalRun (portal=0x27156b8, 
count=9223372036854775807, isTopLevel=true, run_once=true, 
dest=0xa826a0, altdest=0xa826a0, completionTag=0x7ffc04af2690 "")

    at pquery.c:770
#14 0x0079ba7b in exec_simple_query (query_string=0x27236f8 
"select * from pg_logical_slot_get_changes('m7',null,null);\n") at 
postgres.c:1215
#15 0x0079d044 in PostgresMain (argc=<optimized out>, 
argv=<optimized out>, dbname=0x26c9010 "postgres", username=<optimized out>) at postgres.c:4256

#16 0x006874eb in main (argc=5, argv=0x26a7e20) at main.c:224
(gdb) q

--
regards,tushar
EnterpriseDB  https://www.enterprisedb.com/
The Enterprise PostgreSQL Company




Re: openLogOff is not needed anymore

2019-03-06 Thread Robert Haas
On Wed, Mar 6, 2019 at 6:10 AM Antonin Houska  wrote:
> ... obviously due to commit c24dcd0. The following patch removes it.

Committed, after a short struggle to get the patch out of the email
body in a usable form.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] CLUSTER command progress monitor

2019-03-06 Thread Robert Haas
On Tue, Mar 5, 2019 at 8:03 PM Alvaro Herrera  wrote:
> One, err, small issue with that idea is that we need the param numbers
> not to conflict for any "progress update providers" that are to be used
> simultaneously by any command.

Is that really an issue?  I think progress reporting -- at least with
the current infrastructure -- is only ever going to be possible for
utility commands, not queries.  And those really shouldn't have very
many sorts going on at once.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Patch to document base64 encoding

2019-03-06 Thread Karl O. Pinc
On Wed, 6 Mar 2019 07:09:48 -0600
"Karl O. Pinc"  wrote:

> On Tue, 5 Mar 2019 23:23:20 -0600
> "Karl O. Pinc"  wrote:
> 
> > Added documentation for hex and escape encodings,
> > including output formats and what are acceptable
> > inputs.  

Attached: doc_base64_v7.patch

Improved escape decoding sentence.

Regards,

Karl 
Free Software:  "You don't pay back, you pay forward."
 -- Robert A. Heinlein
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 6765b0d584..989344c98c 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -1752,6 +1752,9 @@
 
  decode
 
+
+  base64 encoding
+
 decode(string text,
 format text)

@@ -1769,16 +1772,25 @@
 
  encode
 
+
+ base64 encoding
+
+
+ hex encoding
+
+
+ escape encoding
+
 encode(data bytea,
 format text)

text

 Encode binary data into a textual representation.  Supported
-formats are: base64, hex, escape.
-escape converts zero bytes and high-bit-set bytes to
-octal sequences (\nnn) and
-doubles backslashes.
+formats are:
+base64,
+hex,
+escape.

encode('123\000\001', 'base64')
MTIzAAE=
@@ -2365,6 +2377,90 @@
 format treats a NULL as a zero-element array.

 
+   
+ encode
+   
+   
+ decode
+   
+   
+ base64 encoding
+   
+   
+hex encoding
+   
+   
+escape encoding
+   
+
+   
+ The string and the binary encode
+ and decode functions support the following
+ encodings:
+
+ 
+   
+ base64
+ 
+   
+ The base64 encoding is that
+ of https://tools.ietf.org/html/rfc2045#section-6.8";>RFC
+ 2045 section 6.8.  As per the RFC, encoded lines are
+ broken at 76 characters.  However instead of the MIME CRLF
+ end-of-line marker, only a newline is used for end-of-line.
+   
+   
+ The carriage-return, newline, space, and tab characters are
+ ignored by decode.  Otherwise, an error is
+ raised when decode is supplied invalid
+ base64 data — including when trailing padding is incorrect.
+   
+ 
+   
+
+   
+ hex
+ 
+   
+ hex represents each 4 bits of data as a single
+ hexadecimal digit, 0
+ through f.  Encoding outputs
+ the a-f hex digits in lower
+ case.  Because the smallest unit of data is 8 bits there are
+ always an even number of characters returned
+ by encode.
+   
+   
+ The decode function
+ accepts a-f characters in
+ either upper or lower case.  An error is raised
+ when decode is supplied invalid hex data
+ — including when given an odd number of characters.
+   
+ 
+   
+
+   
+ escape
+ 
+   
+ escape converts zero bytes and high-bit-set
+ bytes to octal sequences
+ (\nnn) and doubles
+ backslashes.  Encoding always produces 4 characters for each
+ high-bit-set input byte.
+   
+   
+ The decode function accepts fewer than three
+ octal digits after a \ character.  An error is
+ raised when decode is supplied a
+ single \ not followed by an octal digit.
+   
+ 
+   
+ 
+   
+

See also the aggregate function string_agg in
.
@@ -3577,16 +3673,25 @@ SELECT format('Testing %3$s, %2$s, %s', 'one', 'two', 'three');
 
  encode
 
+
+ base64 encoding
+
+
+ hex encoding
+
+
+ escape encoding
+
encode(data bytea,
format text)
   
   text
   
Encode binary data into a textual representation.  Supported
-   formats are: base64, hex, escape.
-   escape converts zero bytes and high-bit-set bytes to
-   octal sequences (\nnn) and
-   doubles backslashes.
+   formats are:
+   base64,
+   hex,
+   escape.
   
   encode('123\000456'::bytea, 'escape')
   123\000456


Re: openLogOff is not needed anymore

2019-03-06 Thread Antonin Houska
Robert Haas  wrote:

> On Wed, Mar 6, 2019 at 6:10 AM Antonin Houska  wrote:
> > ... obviously due to commit c24dcd0. The following patch removes it.
> 
> Committed, after a short struggle to get the patch out of the email
> body in a usable form.

It was just convenient for me to use the vc-diff emacs command and copy & paste
the diff into the message, because I also use emacs as my email client. I thought
it would work, but perhaps something went wrong with the spaces.

I'll always attach diffs to the message next time. Sorry for the problem.

-- 
Antonin Houska
https://www.cybertec-postgresql.com



Re: proposal: variadic argument support for least, greatest function

2019-03-06 Thread Andrew Dunstan


On 3/4/19 9:39 AM, David Steele wrote:
> On 3/1/19 3:59 AM, Chapman Flack wrote:
>> The following review has been posted through the commitfest application:
>> make installcheck-world:  tested, passed
>> Implements feature:   tested, passed
>> Spec compliant:   not tested
>> Documentation:    tested, passed
>>
>> For completeness, I'll mark this reviewed again. It passes
>> installcheck-world, the latest patch addresses the suggestions from
>> me, and is improved on the code-structure matters that Tom pointed
>> out. I don't know if it will meet Tom's threshold for desirability
>> overall, but that sounds like a committer call at this point, so I'll
>> change it to RfC.
>
> Both committers who have looked at this patch (Tom, and Andres in his
> patch roundup [1]) recommend that it be rejected.
>
> If no committer steps up in the next week I think we should mark it as
> rejected.
>
>

Having reviewed the thread, I'm with Andres and Tom. Maybe though we
should have a note somewhere to the effect that you can't use VARIADIC
with these.


cheers


andrew

-- 
Andrew Dunstan  https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] CLUSTER command progress monitor

2019-03-06 Thread Alvaro Herrera
On 2019-Mar-06, Robert Haas wrote:

> On Tue, Mar 5, 2019 at 8:03 PM Alvaro Herrera  
> wrote:
> > One, err, small issue with that idea is that we need the param numbers
> > not to conflict for any "progress update providers" that are to be used
> > simultaneously by any command.
> 
> Is that really an issue?  I think progress reporting -- at least with
> the current infrastructure -- is only ever going to be possible for
> utility commands, not queries.  And those really shouldn't have very
> many sorts going on at once.

Well, I don't think it is, but I thought it was worth pointing out
explicitly.

-- 
Álvaro Herrera  https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: proposal: variadic argument support for least, greatest function

2019-03-06 Thread Chapman Flack
On 3/6/19 10:12 AM, Andrew Dunstan wrote:

> Having reviewed the thread, I'm with Andres and Tom. Maybe though we
> should have a note somewhere to the effect that you can't use VARIADIC
> with these.

Perhaps such a note belongs hoisted into the functions-conditional
section of the manual, making a general observation that these things
are conditional *expressions* that may resemble functions, but in
particular, COALESCE, GREATEST, and LEAST cannot be called with
keyword VARIADIC and an array argument, as they could if they were
ordinary functions.

Regards,
-Chap



Re: [Patch] pg_rewind: options to use restore_command from recovery.conf or command line

2019-03-06 Thread Andrey Borodin
Hi!


> 20 февр. 2019 г., в 17:06, Alexey Kondratov  
> написал(а):
> 
>> 
>> I will work out this one with postgres -C and come back till the next 
>> commitfest. I found that something similar is already used in pg_ctl and 
>> there is a mechanism for finding valid executables in exec.c. So it does not 
>> seem to be a big deal at the first sight.
>> 
> 
> I have reworked the patch; please find the new version attached. It is three 
> times smaller than the previous one and now touches pg_rewind's code only. Tests 
> are also slightly refactored in order to remove duplicated code. Execution of 
> postgres -C is used for restore_command retrieval (if -r is passed), as 
> suggested. Otherwise everything works as before.
> 

> 

The new patch is much smaller (less than 400 lines) and works as advertised.
There's a typo "retreive" there.

These lines look a little suspicious:
char  postgres_exec_path[MAXPGPATH],
  postgres_cmd[MAXPGPATH],
  cmd_output[MAX_RESTORE_COMMAND];
Is there supposed to be any difference between MAXPGPATH and MAX_RESTORE_COMMAND?

Besides this, patch looks fine to me.

Best regards, Andrey Borodin.


Re: patch to allow disable of WAL recycling

2019-03-06 Thread Robert Haas
On Wed, Feb 27, 2019 at 6:12 PM Alvaro Herrera  wrote:
> I think the idea of it being a generic tunable for assorted behavior
> changes, rather than specific to WAL recycling, is a good one.  I'm
> unsure about your proposed name -- maybe "wal_cow_filesystem" is better?

I *really* dislike this.  For one thing, it means that users don't
have control over the behaviors individually.  For another, the
documentation is now quite imprecise about what the option actually
does, while expecting users to figure out whether those behaviors are
acceptable or preferable in their environment.  It lists recycling of
WAL files and zero-filling of those files as examples of behavior
changes, but it does not say that those are the only changes, or even
that they are made in all cases.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: A separate table level option to control compression

2019-03-06 Thread Robert Haas
On Tue, Mar 5, 2019 at 5:30 PM Andrew Dunstan
 wrote:
> This is a nice idea, and I'm a bit surprised it hasn't got more
> attention. The patch itself seems very simple and straightforward,
> although it could probably do with having several sets of eyeballs on it.

I haven't needed this for anything, but it seems reasonable to me.
The documentation seems to need some wordsmithing.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: patch to allow disable of WAL recycling

2019-03-06 Thread Andrew Dunstan


On 3/6/19 10:38 AM, Robert Haas wrote:
> On Wed, Feb 27, 2019 at 6:12 PM Alvaro Herrera  
> wrote:
>> I think the idea of it being a generic tunable for assorted behavior
>> changes, rather than specific to WAL recycling, is a good one.  I'm
>> unsure about your proposed name -- maybe "wal_cow_filesystem" is better?
> I *really* dislike this.  For one thing, it means that users don't
> have control over the behaviors individually.  For another, the
> documentation is now quite imprecise about what the option actually
> does, while expecting users to figure out whether those behaviors are
> acceptable or preferable in their environment.  It lists recycling of
> WAL files and zero-filling of those files as examples of behavior
> changes, but it does not say that those are the only changes, or even
> that they are made in all cases.
>

So you want two options, like wal_recycle_files and wal_zero_fill, both
defaulting to true? Is there a reasonable use case for turning one off
without the other?


Alternatively, we could remove the "for example" wording, which I agree
is unfortunate.


cheers


andrew

-- 
Andrew Dunstan  https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Make drop database safer

2019-03-06 Thread Tomas Vondra




On 2/12/19 12:55 AM, Ashwin Agrawal wrote:


Thanks for the response and inputs.

On Sat, Feb 9, 2019 at 4:51 AM Andres Freund wrote:


Hi,

On 2019-02-08 16:36:13 -0800, Alexandra Wang wrote:
 > Current sequence of operations for drop database (dropdb())
 > 1. Start Transaction
 > 2. Make catalog changes
 > 3. Drop database buffers
 > 4. Forget database fsync requests
 > 5. Checkpoint
 > 6. Delete database directory
 > 7. Commit Transaction
 >
 > Problem
 > This sequence is unsafe from couple of fronts. Like if drop database,
 > aborts (means system can crash/shutdown can happen) right after
buffers are
 > dropped step 3 or step 4. The database will still exist and fully
 > accessible but will lose the data from the dirty buffers. This
seems very
 > bad.
 >
 > Operation can abort after step 5 as well in which can the entries
remain in
 > catalog but the database is not accessible. Which is bad as well
but not as
 > severe as above case mentioned, where it exists but some stuff goes
 > magically missing.
 >
 > Repo:
 > ```
 > CREATE DATABASE test;
 > \c test
 > CREATE TABLE t1(a int); CREATE TABLE t2(a int); CREATE TABLE t3(a
int);
 > \c postgres
 > DROP DATABASE test; <<== kill the session after
DropDatabaseBuffers()
 > (make sure to issue checkpoint before killing the session)
 > ```
 >
 > Proposed ways to fix
 > 1. CommitTransactionCommand() right after step 2. This makes it
fully safe
 > as the catalog will have the database dropped. Files may still
exist on
 > disk in some cases which is okay. This also makes it consistent
with the
 > approach used in movedb().

To me this seems bad. The current failure mode obviously isn't good, but
the data obviously isn't valuable, and just losing track of an entire
database worth of data seems worse.


So, based on that response, it seems that not losing track of the files 
associated with the database is the design choice we wish to achieve. Hence, 
the catalog having an entry while the data directory is deleted is acceptable 
behavior and doesn't need to be solved.




What about adding an 'is dropped' flag to pg_database, setting it to true at 
the beginning of DROP DATABASE, and committing? And ensuring no one can connect 
to such a database, making DROP DATABASE the only allowed operation?


ISTM we could then continue doing the same thing we do today, without 
any extra checkpoints etc.



 > 2. Alternative way to make it safer is perform Checkpoint (step 5) just

 > before dropping database buffers, to avoid the unsafe nature.
Caveats of
 > this solution is:
 > - Performs IO for data which in success case anyways will get deleted
 > - Still doesn't cover the case where catalog has the database
entry but
 > files are removed from disk

That seems like an unacceptable slowdown.


Given that dropping a database should be an infrequent operation, and the only 
additional IO cost is for the buffers of that database itself (a checkpoint is 
performed in a later step anyway), is it really an unacceptable slowdown 
compared to the safety it brings?




That's probably true, although I do know quite a few systems that create 
and drop databases fairly often. And the implied explicit checkpoints 
are quite painful, so I'd vote not to make this worse.


FWIW I don't recall why exactly we need the checkpoints, except perhaps 
to ensure the file copies see the most recent data (in CREATE DATABASE) 
and evict stuff for the to-be-dropped database from shared buffers. I 
wonder if we could do that without a checkpoint somehow ...




 > 3. One more fancier approach is to use pending delete mechanism
used by
 > relation drops, to perform these non-catalog related activities
at commit.
 > Easily, the pending delete structure can be added boolean to convey
 > database directory dropping instead of file. Given drop database
can't be
 > performed inside transaction, not needed to be done this way, but
this
 > makes it one consistent approach used to deal with on-disk removal.

ISTM we'd need to do something like this.


Given the above design choice to retain the link to the database files until 
they are actually deleted, we don't see why the pending-delete approach is any 
better than approach 1. This approach allows us to track the database OID in 
the commit-transaction xlog record, but any checkpoint after that still loses 
the reference to the database. The same is true for approach 1, where a 
separate XLOG_DBASE_DROP xlog record is written just after committing the 
transaction.
When we proposed approach 3, we thought it was functionally the same as 
approach 1, differing only in implementation. But your preference for this 
approach, while calling approach 1 bad, reads as though the pending-delete 
approach is functionally different; we would like to hear more about how.





Re: PostgreSQL vs SQL/XML Standards

2019-03-06 Thread Chapman Flack
This CF entry shows Pavel and me as reviewers, but the included
patches were also produced by one or the other of us, so additional
review by someone who isn't us seems appropriate. :)

Would it make sense to remove one or both of us from the 'reviewers'
field in the app, to make it more obviously 'available' for reviewing?

One of the patches deals only with docs, so even someone who isn't
sure about reviewing XML functionality, but takes an interest in
documentation, would have a valuable role looking at that.

Regards,
-Chap



Re: pg_dump is broken for partition tablespaces

2019-03-06 Thread Andres Freund
Hi,

On 2019-03-06 19:45:06 +1300, David Rowley wrote:
> Over on [1] Andres pointed out that the pg_dump support for the new to
> PG12 tablespace inheritance feature is broken.  This is the feature
> added in ca4103025dfe26 to allow a partitioned table to have a
> tablespace that acts as the default tablespace for newly attached
> partitions. The idea being that you can periodically change the
> default tablespace for new partitions to a tablespace that sits on a
> disk partition with more free space without affecting the default
> tablespace for normal non-partitioned tables. Anyway...

I'm also concerned that the current catalog representation isn't
right. As I said:

> I also find it far from clear that:
> 
>  
>   The tablespace_name is the 
> name
>   of the tablespace in which the new table is to be created.
>   If not specified,
>is consulted, or
>if the table is temporary.  For
>   partitioned tables, since no storage is required for the table itself,
>   the tablespace specified here only serves to mark the default tablespace
>   for any newly created partitions when no other tablespace is explicitly
>   specified.
>  
> 
> is handled correctly. The above says that the *specified* tablespaces -
> which seems to exclude the default tablespace - is what's going to
> determine what partitions use as their default tablespace. But in fact
> that's not true: the partitioned table's pg_class.reltablespace is set to
> what default_tablespace was at the time of creation.

I still think the feature as is doesn't seem to have very well defined
behaviour.



> If we instead did:
> 
> CREATE TABLE public.listp1 (a integer
> );
> 
> ALTER TABLE public.listp ATTACH PARTITION public.listp1 FOR VALUES IN (1);

Isn't that a bit more expensive, because now the table needs to be
scanned for matching the value? That's probably negligible though, given
it'd probably always be empty.


Greetings,

Andres Freund



Re: patch to allow disable of WAL recycling

2019-03-06 Thread Robert Haas
On Wed, Mar 6, 2019 at 10:55 AM Andrew Dunstan
 wrote:
> > I *really* dislike this.  For one thing, it means that users don't
> > have control over the behaviors individually.  For another, the
> > documentation is now quite imprecise about what the option actually
> > does, while expecting users to figure out whether those behaviors are
> > acceptable or preferable in their environment.  It lists recycling of
> > WAL files and zero-filling of those files as examples of behavior
> > changes, but it does not say that those are the only changes, or even
> > that they are made in all cases.
>
> So you want two options, like wal_recycle_files and wal_zero_fill, both
defaulting to true? Is there a reasonable use case for turning one off
> without the other?

I don't know whether there's a use case for that, and that's one of
the things that worries me.  I know, though, that if we have two
parameters, then if there is a use case for it, people will be able to
meet that use case without submitting a patch.  On the other hand, if
we had convincing evidence that those two things should always go
together, that would be fine, too.  But I don't see that anyone has
made an argument that such a thing is necessarily true outside of ZFS.

I actually wouldn't find it very surprising if disabling WAL recycling
is sometimes beneficial even on ext4.  The fact that we haven't found
such cases on this thread doesn't mean they don't exist.  On the other
hand I think the wal_zero_fill behavior is not about performance but
about reliability, so you can't afford to turn that on just because
non-recycling happens to be faster on your machine.

> Alternatively, we could remove the 'for example' wording, which I agree
> is unfortunate.

Yeah.  We seem to sometimes like to avoid documenting specifics for
fear that, should they change, we'd have to update the documentation.
But I think that just makes the documentation less useful.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: patch to allow disable of WAL recycling

2019-03-06 Thread Andrew Dunstan


On 3/6/19 11:30 AM, Robert Haas wrote:
> On Wed, Mar 6, 2019 at 10:55 AM Andrew Dunstan
>  wrote:
>>> I *really* dislike this.  For one thing, it means that users don't
>>> have control over the behaviors individually.  For another, the
>>> documentation is now quite imprecise about what the option actually
>>> does, while expecting users to figure out whether those behaviors are
>>> acceptable or preferable in their environment.  It lists recycling of
>>> WAL files and zero-filling of those files as examples of behavior
>>> changes, but it does not say that those are the only changes, or even
>>> that they are made in all cases.
>> So you want two options, like wal_recycle_files and wal_zero_fill, both
>> defaulting to true? Is there a reasonable use case for turning one off
>> without the other?
> I don't know whether there's a use case for that, and that's one of
> the things that worries me.  I know, though, that if we have two
> parameters, then if there is a use case for it, people will be able to
> meet that use case without submitting a patch.  On the other hand, if
> we had convincing evidence that those two things should always go
> together, that would be fine, too.  But I don't see that anyone has
> made an argument that such a thing is necessarily true outside of ZFS.
>
> I actually wouldn't find it very surprising if disabling WAL recycling
> is sometimes beneficial even on ext4.  The fact that we haven't found
> such cases on this thread doesn't mean they don't exist.  On the other
> hand I think the wal_zero_fill behavior is not about performance but
> about reliability, so you can't afford to turn that on just because
> non-recycling happens to be faster on your machine.
>
>


Well, let's put the question another way. Is there any reason to allow
skipping zero filling if we are recycling? That seems possibly
dangerous. I can imagine turning off recycling but leaving on
zero-filling, although I don't have a concrete use case for it ATM.


cheers


andrew


-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: patch to allow disable of WAL recycling

2019-03-06 Thread Alvaro Herrera
On 2019-Mar-06, Robert Haas wrote:

> On Wed, Feb 27, 2019 at 6:12 PM Alvaro Herrera  
> wrote:
> > I think the idea of it being a generic tunable for assorted behavior
> > changes, rather than specific to WAL recycling, is a good one.  I'm
> > unsure about your proposed name -- maybe "wal_cow_filesystem" is better?
> 
> I *really* dislike this.  For one thing, it means that users don't
> have control over the behaviors individually.  For another, the
> documentation is now quite imprecise about what the option actually
> does, while expecting users to figure out whether those behaviors are
> acceptable or preferable in their environment.  It lists recycling of
> WAL files and zero-filling of those files as examples of behavior
> changes, but it does not say that those are the only changes, or even
> that they are made in all cases.

I can understand this argument.  Is there really a reason to change
those two behaviors separately?  The reason I wrote the documentation
weaselly is that it seems pointless to have to update it whenever we
implement more things controlled by the same GUC option (which we might,
if we learn new things about how to use COW filesystems later on).
AFAIR Jerry's wording was more precise about what the parameter did.  If
the only reason to change those behaviors is to make WAL work better on
COW filesystems, then I don't see the point in splitting the GUC in two,
or documenting in minute detail what it does.

That said -- if there *is* such a reason, we can certainly split them up
and indicate to COW-filesystem users to change them both together.  I
don't think it's a big deal, but OTOH I see no reason to complicate
matters needlessly.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: patch to allow disable of WAL recycling

2019-03-06 Thread Robert Haas
On Wed, Mar 6, 2019 at 11:37 AM Andrew Dunstan
 wrote:
> Well, let's put the question another way. Is there any reason to allow
> skipping zero filling if we are recycling? That seems possibly
> dangerous. I can imagine turning off recycling but leaving on
> zero-filling, although I don't have a concrete use case for it ATM.

I think the short answer is that we don't know.  Any filesystem where
just writing the last byte of the file is good enough to guarantee
that all the intervening space is allocated can skip zero-filling.
Any system where creating new WAL files is faster than recycling old
ones can choose to do it that way.  I don't know how you can make a
categorical argument that there can't be a system where one of those
things -- either one -- is true and the other is false.  At least to
me, they seem like basically unrelated issues.
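The "writing the last byte" behavior is easy to see outside Postgres. A rough Python sketch (not WAL code -- the file name and size are arbitrary, and whether the intervening space is physically allocated is a per-filesystem property, which is exactly the point):

```python
import os
import tempfile

def allocate_by_last_byte(path, size):
    # Give the file an apparent size of `size` by seeking to the end
    # and writing a single byte; on many filesystems the intervening
    # range is left as a hole (a sparse file) rather than allocated.
    with open(path, "wb") as f:
        f.seek(size - 1)
        f.write(b"\0")

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "segment")
    allocate_by_last_byte(path, 16 * 1024 * 1024)
    st = os.stat(path)
    print(st.st_size)  # 16777216 -- the apparent size is the full 16 MiB
    # st.st_blocks * 512 may be far smaller than st.st_size on
    # filesystems that punch holes, which is why "good enough to
    # guarantee allocation" cannot be assumed universally.
```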

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: patch to allow disable of WAL recycling

2019-03-06 Thread Robert Haas
On Wed, Mar 6, 2019 at 11:41 AM Alvaro Herrera  wrote:
> I can understand this argument.  Is there really a reason to change
> those two behaviors separately?

See my previous reply to Andrew, but also, I think you're putting the
burden of proof in the wrong place.  You could equally well ask "Is
there really a reason for work_mem to be different for sorts and index
builds?  For sorts and hashes?  For foreground and background
vacuums?"  Well, now we've got work_mem, maintenance_work_mem,
autovacuum_work_mem, and at least in my experience, that's not
necessarily fine-grained enough -- people can't predict whether their
maintenance_work_mem setting is OK because they don't know if
somebody's going to be running a foreground VACUUM or a CREATE INDEX
while it's in flight.  See also the bit about hash_mem in
https://www.postgresql.org/message-id/CAH2-WzmNwV=LfDRXPsmCqgmm91mp=2b4FvXNF=ccvmrb8yf...@mail.gmail.com
-- see also commit a1b395b6a26ae80cde17fdfd2def8d351872f399's
introduction of pending_list_cleanup_size, yet another place where we
started to decouple something that was inadvisably tied to work_mem.

There have been other cases, too, where we've bound unrelated things
together into a single parameter, and my feeling is that most of those
have turned out a mess.  Separate behaviors ought to be controlled by
separate settings, even though it means we'll end up with more
settings.  Two settings each of which does one clear and well-defined
thing can even be easier to understand than one setting that does
multiple loosely-related things.

> The reason I wrote the documentation
> weaselly is that it seems pointless to have to update it whenever we
> implement more things controlled by the same GUC option (which we might,
> if we learn new things about how to use COW filesystems later on).
> AFAIR Jerry's wording was more precise about what the parameter did.  If
> the only reason to change those behaviors is to make WAL work better on
> COW filesystems, then I don't see the point in splitting the GUC in two,
> or documenting in minute detail what it does.

Really?  What about somebody who has a different experience from
Jerry?  They turn the parameter on in release N and it's good and then
the behavior changes in release N+1 and now it sucks and they read the
documentation and it tells them nothing about what has actually
changed.  They can neither get the behavior back that they liked nor
can they understand what behavior they're actually getting that is
causing a problem, because it's not documented.

I do not think our track record is very good when it comes to deciding
which things users "need to know about."  Users need to know what's
really happening.  The idea that we're just going to have a magic flag
here that is going to change all of the things that you want changed
when you're running on a copy-on-write filesystem and it's all going
to work great so that nobody cares about the details does not sound
very likely to be correct.  We don't even know that the same
combination of behavior is performant or safe on every filesystem out
there, let alone that future things that come along are going to have
similar properties.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [PATCH] kNN for btree

2019-03-06 Thread Nikita Glukhov

Attached 9th version of the patches.

On 03.03.2019 12:46, Anastasia Lubennikova wrote:


The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   tested, passed
Documentation:tested, passed

Hi,
thank you for your work on this patch.


Thank you for you review.


Patch #1 is ready for commit.
It only fixes lack of refactoring after INCLUDE index patch.

Patches 2-7 are ready for committer's review.
I found no significant problems in algorithm or implementation.
Here are few suggestions to improve readability:

1) patch 0002.
* Returned matched index clause exression.
* Number of matched index column is returned in *indexcol_p.

Typos in comment:
exPression
index columnS


"exPression" is fixed.
But there should be "column" because only single index column is matched.


2) patch 0002.
+   /*
+    * We allow any column of this index to match each pathkey; they
+    * don't have to match left-to-right as you might expect.
+    */

Before refactoring this comment had a line about gist and sp-gist specific:

-    * We allow any column of the index to match each pathkey; they
-    * don't have to match left-to-right as you might expect.  This is
-    * correct for GiST, and it doesn't matter for SP-GiST because
-    * that doesn't handle multiple columns anyway, and no other
-    * existing AMs support amcanorderbyop.  We might need different
-    * logic in future for other implementations.

Now, when the code was moved to a separate function, it is not clear why the
same logic is ok for b-tree and other index methods.  If I got it right, it
doesn't affect the correctness of the algorithm, because b-tree specific
checks are performed in another function.  Still it would be better to
explain it in the comment for future readers.


It seems that match_orderbyop_pathkey() is simply the wrong place for this
comment.  I moved it into match_orderbyop_pathkeys(), which is the
implementation of ammatchorderby() for GiST and SP-GiST.  Also I added the
old sentence about its correctness for GiST and SP-GiST.


3) patch 0004
  if (!so->distanceTypeByVal)
  {
      so->state.currDistance = PointerGetDatum(NULL);
      so->state.markDistance = PointerGetDatum(NULL);
  }

Why do we reset these fields only for !distanceTypeByVal?


These fields should be initialized (it is initialization, not reset) only for
by-ref types because before writing new distance values to these fields,
the previous by-ref values are pfree'd.  The corresponding comment was added.


4) patch 0004
+ * _bt_next_item() -- Advance to next tuple on current page;
+ * or if there's no more, try to step to the next page with data.
+ *
+ * If there are no more matching records in the given direction
+ */

Looks like the last sentence of the comment is unfinished.


Yes, "false is returned" is missing. Fixed.


5) patch 0004
_bt_start_knn_scan()

so->currRightIsNearest = _bt_compare_current_dist(so, rstate, lstate);
/* Reset right flag if the left item is nearer. */
right = so->currRightIsNearest;

Does this comment relate to the string above it?


No, it relates only to the string below. The 'right' flag later determines the
selected scan direction, so 'currRightIsNearest' should be assigned to it.
This comment was rewritten.


6) patch 0004
_bt_parallel_seize()

+ scanPage = state == &so->state
+ ? &btscan->btps_scanPage
+ : &btscan->btps_knnScanPage;

This code looks a bit tricky to me. Why do we need to pass the BTScanState
to _bt_parallel_seize() at all? Wouldn't it be enough to choose the page before
the function call and pass it?


If we pass the page, then we have to pass it through the whole function
call tree:
_bt_parallel_seize()
  _bt_steppage()
_bt_next_item()
  _bt_next_nearest()
_bt_load_first_page()
  _bt_init_knn_scan()
  _bt_readnextpage()
_bt_parallel_readpage()
  _bt_first()

I decided simply to add a flag 'isKnn' to BTScanState, so the code now looks like
this:
  scanPage = state->isKnn
? &btscan->btps_scanPage
: &btscan->btps_knnScanPage;

I can also offer to add an 'id' (0/1) to BTScanState instead; then the code will
look like this:
  scanPage = &btscan->btps_scanPages[state->id];


7) patch 0006
+  Upgrade notes for version 1.6
+
+  
+   In version 1.6 btree_gist switched to using in-core
+   distance operators, and its own implementations were removed.  References to
+   these operators in btree_gist opclasses will be updated
+   automatically during the extension upgrade, but if the user has created
+   objects referencing these operators or functions, then these objects must be
+   dropped manually before updating the extension.

Is this comment still relevant after the latest changes?
Function

Re: [HACKERS] Incomplete startup packet errors

2019-03-06 Thread Robert Haas
On Tue, Mar 5, 2019 at 5:35 PM Andrew Dunstan
 wrote:
> OK, I think we have agreement on Tom's patch. Do we want to backpatch
> it? It's a change in behaviour, but I find it hard to believe anyone
> relies on the existence of these annoying messages, so my vote would be
> to backpatch it.

I don't think it's a bug fix, so I don't think it should be
back-patched.  I think trying to guess which behavior changes are
likely to bother users is an unwise strategy -- it's very hard to know
what will actually bother people, and it's very easy to let one's own
desire to get a fix out the door lead to an unduly rosy view of the
situation.  Plus, all patches carry some risk, because all developers
make mistakes; the fewer things we back-patch, the fewer regressions
we'll introduce.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: patch to allow disable of WAL recycling

2019-03-06 Thread Alvaro Herrera
I want your dictating software.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Online verification of checksums

2019-03-06 Thread Robert Haas
On Sat, Mar 2, 2019 at 4:38 PM Tomas Vondra
 wrote:
> FWIW I don't think this qualifies as torn page - i.e. it's not a full
> read with a mix of old and new data. This is partial write, most likely
> because we read the blocks one by one, and when we hit the last page
> while the table is being extended, we may only see the first 4kB. And if
> we retry very fast, we may still see only the first 4kB.

I see the distinction you're making, and you're right.  The problem
is, whether in this case or whether for a real torn page, we don't
seem to have a way to distinguish between a state that occurs
transiently due to lack of synchronization and a situation that is
permanent and means that we have corruption.  And that worries me,
because it means we'll either report bogus complaints that will scare
easily-panicked users (and anybody who is running this tool has a good
chance of being in the "easily-panicked" category ...), or else we'll
skip reporting real problems.  Neither is good.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Online verification of checksums

2019-03-06 Thread Robert Haas
On Sat, Mar 2, 2019 at 5:45 AM Michael Banck  wrote:
> On Friday, 2019-03-01 at 18:03 -0500, Robert Haas wrote:
> > On Tue, Sep 18, 2018 at 10:37 AM Michael Banck
> >  wrote:
> > > I have added a retry for this as well now, without a pg_sleep() as well.
> > > This catches around 80% of the half-reads, but a few slip through. At
> > > that point we bail out with exit(1), and the user can try again, which I
> > > think is fine?
> >
> > Maybe I'm confused here, but catching 80% of torn pages doesn't sound
> > robust at all.
>
> The chance that pg_verify_checksums hits a torn page (at least in my
> tests, see below) is already pretty low, a couple of times per 1000
> runs. Maybe 4 out of 5 times, the page is read fine on retry and we march
> on. Otherwise, we now just issue a warning and skip the file (or so was
> the idea, see below), do you think that is not acceptable?

Yeah.  Consider a paranoid customer with 100 clusters who runs this
every day on every cluster.  They're going to see failures every day
or three and go ballistic.

I suspect that better retry logic might help here.  I mean, I would
guess that 10 retries at 1 second intervals or something of that sort
would be enough to virtually eliminate false positives while still
allowing us to report persistent -- and thus real -- problems.  But if
even that is going to produce false positives with any measurable
probability different from zero, then I think we have a problem,
because I neither like a verification tool that ignores possible signs
of trouble nor one that "cries wolf" when things are fine.
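The retry scheme sketched above -- a bounded number of re-reads with a delay, reporting only persistent failures -- could look roughly like the following; read_page and verify_checksum are illustrative stand-ins, not real pg_verify_checksums functions:

```python
import time

def check_page(read_page, verify_checksum, retries=10, delay=1.0,
               sleep=time.sleep):
    """Return True if the page eventually verifies; only a failure that
    persists across all retries is reported as real corruption."""
    for attempt in range(retries + 1):
        page = read_page()
        if verify_checksum(page):
            return True      # transient torn/partial read resolved itself
        if attempt < retries:
            sleep(delay)     # give a concurrent writer time to finish
    return False             # persistent mismatch: report it

# A page that is torn on the first two reads but fine afterwards:
reads = iter([b"torn", b"torn", b"good"])
ok = check_page(lambda: next(reads), lambda p: p == b"good",
                retries=10, delay=0, sleep=lambda s: None)
print(ok)  # True
```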

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: patch to allow disable of WAL recycling

2019-03-06 Thread Robert Haas
On Wed, Mar 6, 2019 at 12:13 PM Alvaro Herrera  wrote:
> I want your dictating software.

I'm afraid this is just me and a keyboard, but sadly for me you're not
the first person to accuse me of producing giant walls of text.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: proposal: variadic argument support for least, greatest function

2019-03-06 Thread Pavel Stehule
On Wed, Mar 6, 2019 at 16:24 Chapman Flack  wrote:

> On 3/6/19 10:12 AM, Andrew Dunstan wrote:
>
> > Having reviewed the thread, I'm with Andres and Tom. Maybe though we
> > should have a note somewhere to the effect that you can't use VARIADIC
> > with these.
>
> Perhaps such a note belongs hoisted into the functions-conditional
> section of the manual, making a general observation that these things
> are conditional *expressions* that may resemble functions, but in
> particular, COALESCE, GREATEST, and LEAST cannot be called with
> keyword VARIADIC and an array argument, as they could if they were
> ordinary functions.
>

+1

Pavel


> Regards,
> -Chap
>


Re: Online verification of checksums

2019-03-06 Thread Andres Freund
On 2019-03-06 12:33:49 -0500, Robert Haas wrote:
> On Sat, Mar 2, 2019 at 5:45 AM Michael Banck  
> wrote:
> > On Friday, 2019-03-01 at 18:03 -0500, Robert Haas wrote:
> > > On Tue, Sep 18, 2018 at 10:37 AM Michael Banck
> > >  wrote:
> > > > I have added a retry for this as well now, without a pg_sleep() as well.
> > > > This catches around 80% of the half-reads, but a few slip through. At
> > > > that point we bail out with exit(1), and the user can try again, which I
> > > > think is fine?
> > >
> > > Maybe I'm confused here, but catching 80% of torn pages doesn't sound
> > > robust at all.
> >
> > The chance that pg_verify_checksums hits a torn page (at least in my
> > tests, see below) is already pretty low, a couple of times per 1000
> > runs. Maybe 4 out of 5 times, the page is read fine on retry and we march
> > on. Otherwise, we now just issue a warning and skip the file (or so was
> > the idea, see below), do you think that is not acceptable?
> 
> Yeah.  Consider a paranoid customer with 100 clusters who runs this
> every day on every cluster.  They're going to see failures every day
> or three and go ballistic.

+1


> I suspect that better retry logic might help here.  I mean, I would
> guess that 10 retries at 1 second intervals or something of that sort
> would be enough to virtually eliminate false positives while still
> allowing us to report persistent -- and thus real -- problems.  But if
> even that is going to produce false positives with any measurable
> probability different from zero, then I think we have a problem,
> because I neither like a verification tool that ignores possible signs
> of trouble nor one that "cries wolf" when things are fine.

To me the right way seems to be to IO lock the page via PG after such a
failure, and then retry. Which should be relatively easily doable for
the basebackup case, but obviously harder for the pg_verify_checksums
case.

Greetings,

Andres Freund



Re: Server Crash in logical decoding if used inside --single mode

2019-03-06 Thread Alvaro Herrera
On 2019-Mar-06, tushar wrote:

> backend> select * from pg_logical_slot_get_changes('m7',null,null);
 [...]
> TRAP: FailedAssertion("!(slot != ((void *)0) && slot->active_pid != 0)",
> File: "slot.c", Line: 428)
> Aborted (core dumped)

See argumentation in
https://www.postgresql.org/message-id/flat/3b2f809f-326c-38dd-7a9e-897f957a4eb1%40enterprisedb.com
-- essentially, we don't really care to support this case.

If you want to submit a patch to report an error before crashing, that's
fine, but don't ask for this functionality to actually work.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Block level parallel vacuum

2019-03-06 Thread Robert Haas
On Wed, Mar 6, 2019 at 1:26 AM Masahiko Sawada  wrote:
> Okay, attached the latest version of patch set. I've incorporated all
> comments I got and separated the patch for making vacuum options a
> Node (0001 patch). And the patch doesn't use parallel_workers. It
> might be proposed in the another form again in the future if
> requested.

Why make it a Node?  I mean I think a struct makes sense, but what's
the point of giving it a NodeTag?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: patch to allow disable of WAL recycling

2019-03-06 Thread Alvaro Herrera
On 2019-Mar-06, Robert Haas wrote:

> On Wed, Mar 6, 2019 at 12:13 PM Alvaro Herrera  
> wrote:
> > I want your dictating software.
> 
> I'm afraid this is just me and a keyboard, but sadly for me you're not
> the first person to accuse me of producing giant walls of text.

Well, I don't have a problem reading long texts; my problem is that I'm
unable to argue as quickly.

I do buy your argument, though (if reluctantly); in particular I was
worried to offer a parameter (to turn off zero-filling of segments) that
would enable dangerous behavior, but then I realized we also have
fsync=off of which the same thing can be said.  So I agree we should
have two GUCs, properly explained, with a warning where appropriate.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: pgsql: tableam: introduce table AM infrastructure.

2019-03-06 Thread Alvaro Herrera
On 2019-Mar-06, Andres Freund wrote:

> tableam: introduce table AM infrastructure.

Thanks for doing this!!

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Protect syscache from bloating with negative cache entries

2019-03-06 Thread Robert Haas
On Fri, Mar 1, 2019 at 3:33 AM Kyotaro HORIGUCHI
 wrote:
> > > It is artificial (or actually won't be repeatedly executed in a
> > > session) but anyway what can get benefit from
> > > catalog_cache_memory_target would be a kind of extreme.
> >
> > I agree.  So then let's not have it.
>
> Ah... Yeah! I see. Andres' concern was that crucial syscache
> entries might be blown away during a long idle time. If that
> happens, it's enough to just turn it off in almost all such
> cases.

+1.

> In the attached v18,
>catalog_cache_memory_target is removed,
>removed some leftover of removing the hard limit feature,
>separated catcache clock update during a query into 0003.
>attached 0004 (monitor part) in order just to see how it is working.
>
> v18-0001-Add-dlist_move_tail:
>   Just adds dlist_move_tail
>
> v18-0002-Remove-entries-that-haven-t-been-used-for-a-certain-:
>   Revised pruning feature.

OK, so this is getting simpler, but I'm wondering why we need
dlist_move_tail() at all.  It is a well-known fact that maintaining
LRU ordering is expensive and it seems to be unnecessary for our
purposes here.  Can't CatCacheCleanupOldEntries just use a single-bit
flag on the entry?  If the flag is set, clear it.  If the flag is
clear, drop the entry.  When an entry is used, set the flag.  Then,
entries will go away if they are not used between consecutive calls to
CatCacheCleanupOldEntries.  Sure, that might be slightly less accurate
in terms of which entries get thrown away, but I bet it makes no real
difference.
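The one-bit scheme described here is essentially a second-chance (clock) policy. A hedged Python model, with illustrative names rather than the real catcache API:

```python
def prune(cache):
    """cache maps key -> [value, referenced_flag].  Entries whose flag
    is clear are dropped; surviving entries get their flag cleared, so
    an entry not used between two prune calls goes away."""
    for key in list(cache):
        value, referenced = cache[key]
        if referenced:
            cache[key][1] = False   # give it a second chance
        else:
            del cache[key]          # unused since last prune: evict

def touch(cache, key):
    cache[key][1] = True            # set on every access

cache = {"a": ["x", True], "b": ["y", True]}
prune(cache)          # clears both flags, drops nothing
touch(cache, "a")
prune(cache)          # "b" was not touched, so it is evicted
print(sorted(cache))  # ['a']
```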

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: patch to allow disable of WAL recycling

2019-03-06 Thread Robert Haas
On Wed, Mar 6, 2019 at 1:02 PM Alvaro Herrera  wrote:
> Well, I don't have a problem reading long texts; my problem is that I'm
> unable to argue as quickly.

That's my secret weapon... except that it's not much of a secret.

> I do buy your argument, though (if reluctantly); in particular I was
> worried to offer a parameter (to turn off zero-filling of segments) that
> would enable dangerous behavior, but then I realized we also have
> fsync=off of which the same thing can be said.  So I agree we should
> have two GUCs, properly explained, with a warning where appropriate.

OK, thanks.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Patch to document base64 encoding

2019-03-06 Thread Fabien COELHO




Attached: doc_base64_v7.patch


Patch applies cleanly, doc compiles, navigation tested and ok.

"... section 6.8" -> "... Section 6.8" (capital S).

"The string and the binary encode and decode functions..." sentence looks 
strange to me, especially with the English article that I do not really 
master, so maybe it is ok. I'd have written something more 
straightforward, eg: "Functions encode and decode support the following 
encodings:", and also I'd use a direct "Function decode ..." 
rather than "The decode function ..." (twice).


Maybe I'd use the exact same grammatical structure for all 3 cases, 
starting with "The whatever encoding converts bla bla bla" instead of 
varying the sentences.


Otherwise, all explanations look both precise and useful to me.

--
Fabien.



Re: performance issue in remove_from_unowned_list()

2019-03-06 Thread Alvaro Herrera
On 2019-Feb-08, Tomas Vondra wrote:

> I'm wondering if we should just get rid of all such optimizations, and
> make the unowned list doubly-linked (WIP patch attached, needs fixing
> the comments etc.).

+1 for that approach.

Did you consider using a dlist?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: New vacuum option to do only freezing

2019-03-06 Thread Robert Haas
On Tue, Mar 5, 2019 at 11:29 PM Masahiko Sawada  wrote:
> Attached updated patch incorporated all of comments. Also I've added
> new reloption vacuum_index_cleanup as per discussion on the "reloption
> to prevent VACUUM from truncating empty pages at the end of relation"
> thread. Autovacuums also can skip index cleanup when the reloption is
> set to false. Since setting this to false might lead to some problems
> I've made autovacuums report the number of dead tuples and dead
> itemids we left.

It seems to me that the disable_index_cleanup should be renamed
index_cleanup and the default should be changed to true, for
consistency with the reloption (and, perhaps, other patches).

- num_tuples = live_tuples = tups_vacuumed = nkeep = nunused = 0;
+ num_tuples = live_tuples = tups_vacuumed  = nkeep = nunused =
+ nleft_dead_itemids = nleft_dead_tuples = 0;

I would suggest leaving the existing line alone (and not adding an
extra space to it as the patch does now) and just adding a second
initialization on the next line as a separate statement. a = b = c = d
= e = 0 isn't such great coding style that we should stick to it
rigorously even when it ends up having to be broken across lines.

+ /* Index vacuum must be enabled in two-pass vacuum */
+ Assert(!skip_index_vacuum);

I am a big believer in naming consistency.  Please, let's use the same
name everywhere!  If it's going to be index_cleanup, then call the
reloption vacuum_index_cleanup, and call the option index_cleanup, and
call the variable index_cleanup.  Picking a different subset of
cleanup, index, vacuum, skip, and disable for each new name makes it
harder to understand.

- * If there are no indexes then we can vacuum the page right now
- * instead of doing a second scan.
+ * If there are no indexes or index vacuum is disabled we can
+ * vacuum the page right now instead of doing a second scan.

This comment is wrong.  That wouldn't be safe.  And that's probably
why it's not what the code does.

- /* If no indexes, make log report that lazy_vacuum_heap would've made */
+ /*
+ * If no index or index vacuum is disabled, make log report that
+ * lazy_vacuum_heap would've make.
+ */
  if (vacuumed_pages)

Hmm, does this really do what the comment claims?  It looks to me like
we only increment vacuumed_pages when we call lazy_vacuum_page(), and
we (correctly) don't do that when index cleanup is disabled, but then
here this claims that if (vacuumed_pages) will be true in that case.

I wonder if it would be cleaner to rename vacrelstate->hasindex to
'useindex' and set it to false if there are no indexes or index
cleanup is disabled.  But that might actually be worse, not sure.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Tighten error control for OpenTransientFile/CloseTransientFile

2019-03-06 Thread Georgios Kokolatos
Overall the patch looks good and, according to the previous discussion, fulfils
its purpose.

It might be worthwhile to also check for errors on close in SaveSlotToPath().

pgstat_report_wait_end();

CloseTransientFile(fd);

/* rename to permanent file, fsync file and directory */
if (rename(tmppath, path) != 0)

Re: performance issue in remove_from_unowned_list()

2019-03-06 Thread Robert Haas
On Wed, Mar 6, 2019 at 1:53 PM Alvaro Herrera  wrote:
> On 2019-Feb-08, Tomas Vondra wrote:
> > I'm wondering if we should just get rid of all such optimizations, and
> > make the unowned list doubly-linked (WIP patch attached, needs fixing
> > the comments etc.).
>
> +1 for that approach.

+1 for me, too.

> Did you consider using a dlist?

Maybe that is worthwhile, but this is a smaller change, which I think
should count for quite a bit.  Nothing keeps somebody from doing the
dlist change as a separate patch, if desired.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: performance issue in remove_from_unowned_list()

2019-03-06 Thread Tomas Vondra



On 3/6/19 7:52 PM, Alvaro Herrera wrote:
> On 2019-Feb-08, Tomas Vondra wrote:
> 
>> I'm wondering if we should just get rid of all such optimizations, and
>> make the unowned list doubly-linked (WIP patch attached, needs fixing
>> the comments etc.).
> 
> +1 for that approach.
> 
> Did you consider using a dlist?
> 

I'm not sure. I might have considered it, but decided to go with a
simpler / less invasive fix demonstrating the effect. And maybe make it
more acceptable for backpatch, if we want that. Which we probably don't,
so I agree dlist might be a better choice.

cheers

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Online verification of checksums

2019-03-06 Thread Tomas Vondra
On 3/6/19 6:26 PM, Robert Haas wrote:
> On Sat, Mar 2, 2019 at 4:38 PM Tomas Vondra
>  wrote:
>> FWIW I don't think this qualifies as torn page - i.e. it's not a full
>> read with a mix of old and new data. This is partial write, most likely
>> because we read the blocks one by one, and when we hit the last page
>> while the table is being extended, we may only see the first 4kB. And if
>> we retry very fast, we may still see only the first 4kB.
> 
> I see the distinction you're making, and you're right.  The problem
> is, whether in this case or whether for a real torn page, we don't
> seem to have a way to distinguish between a state that occurs
> transiently due to lack of synchronization and a situation that is
> permanent and means that we have corruption.  And that worries me,
> because it means we'll either report bogus complaints that will scare
> easily-panicked users (and anybody who is running this tool has a good
> chance of being in the "easily-panicked" category ...), or else we'll
> skip reporting real problems.  Neither is good.
> 

Sure, I'd also prefer having a tool that reliably detects all cases of
data corruption, and I certainly do share your concerns about false
positives and false negatives.

But maybe we shouldn't expect a tool meant to verify checksums to detect
various other issues.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Pluggable Storage - Andres's take

2019-03-06 Thread Andres Freund
Hi,

On 2019-03-05 23:07:21 -0800, Andres Freund wrote:
> My next steps are:
> - final polish & push the basic DDL and pg_dump patches

Done and pushed. Some collation-dependent fallout; I'm hoping I've just
fixed that.

> - cleanup & polish the ON CONFLICT refactoring

Here's a cleaned up version of that patch.  David, Alvaro, you also
played in that area, any objections? I think this makes that part of the
code easier to read actually. Robert, thanks for looking at that patch
already.

Greetings,

Andres Freund
>From e4caa7de3006370f52b4dafe204d45f9d99fa5a4 Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Wed, 6 Mar 2019 11:30:33 -0800
Subject: [PATCH v16] Don't reuse slots between root and partition in ON
 CONFLICT ... UPDATE.

Until now the slot to store the conflicting tuple, and the result
of the ON CONFLICT SET, were reused between partitions. That
necessitated changing the slot's descriptor when switching partitions.

Besides the overhead of switching descriptors on a slot (which
requires memory allocations and prevents JITing), that's importantly
also problematic for tableam, where individual partitions might belong
to different tableams, needing different kinds of slots.

In passing also fix ExecOnConflictUpdate to clear the existing slot at
exit. Otherwise that slot could continue to hold a pin till the query
ends, which could be far too long if the input data set is large, and
there are no further conflicts. While previously also problematic, it's
now more important as there will be more such slots when partitioned.

Author: Andres Freund
Reviewed-By: Robert Haas
Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n...@alap3.anarazel.de
---
 src/backend/executor/execPartition.c   | 52 +--
 src/backend/executor/nodeModifyTable.c | 70 +-
 src/include/nodes/execnodes.h  |  5 +-
 3 files changed, 64 insertions(+), 63 deletions(-)

diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index e41801662b3..4491ee69912 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -723,28 +723,55 @@ ExecInitPartitionInfo(ModifyTableState *mtstate, EState *estate,
 		if (node->onConflictAction == ONCONFLICT_UPDATE)
 		{
 			TupleConversionMap *map;
+			TupleDesc	leaf_desc;
 
 			map = leaf_part_rri->ri_PartitionInfo->pi_RootToPartitionMap;
+			leaf_desc = RelationGetDescr(leaf_part_rri->ri_RelationDesc);
 
 			Assert(node->onConflictSet != NIL);
 			Assert(rootResultRelInfo->ri_onConflict != NULL);
 
+			leaf_part_rri->ri_onConflict = makeNode(OnConflictSetState);
+
+			/*
+			 * Need a separate existing slot for each partition, as the
+			 * partition could be of a different AM, even if the tuple
+			 * descriptors match.
+			 */
+			leaf_part_rri->ri_onConflict->oc_Existing =
+ExecInitExtraTupleSlot(mtstate->ps.state,
+	   leaf_desc,
+	   &TTSOpsBufferHeapTuple);
+
 			/*
 			 * If the partition's tuple descriptor matches exactly the root
-			 * parent (the common case), we can simply re-use the parent's ON
+			 * parent (the common case), we can re-use most of the parent's ON
 			 * CONFLICT SET state, skipping a bunch of work.  Otherwise, we
 			 * need to create state specific to this partition.
 			 */
 			if (map == NULL)
-leaf_part_rri->ri_onConflict = rootResultRelInfo->ri_onConflict;
+			{
+/*
+ * It's safe to reuse these from the partition root, as we
+ * only process one tuple at a time (therefore we won't
+ * overwrite needed data in slots), and the results of
+ * projections are independent of the underlying
+ * storage. Projections and where clauses themselves don't
+ * store state / are independent of the underlying storage.
+ */
+leaf_part_rri->ri_onConflict->oc_ProjSlot =
+	rootResultRelInfo->ri_onConflict->oc_ProjSlot;
+leaf_part_rri->ri_onConflict->oc_ProjInfo =
+	rootResultRelInfo->ri_onConflict->oc_ProjInfo;
+leaf_part_rri->ri_onConflict->oc_WhereClause =
+	rootResultRelInfo->ri_onConflict->oc_WhereClause;
+			}
 			else
 			{
 List	   *onconflset;
 TupleDesc	tupDesc;
 bool		found_whole_row;
 
-leaf_part_rri->ri_onConflict = makeNode(OnConflictSetState);
-
 /*
  * Translate expressions in onConflictSet to account for
  * different attribute numbers.  For that, map partition
@@ -778,20 +805,17 @@ ExecInitPartitionInfo(ModifyTableState *mtstate, EState *estate,
 /* Finally, adjust this tlist to match the partition. */
 onconflset = adjust_partition_tlist(onconflset, map);
 
-/*
- * Build UPDATE SET's projection info.  The user of this
- * projection is responsible for setting the slot's tupdesc!
- * We set aside a tupdesc that's good for the common case of a
- * partition that's tupdesc-equal to the partitioned table;
- * partitions of different tupdescs must generate their own.
- */
+/* create the tuple slot for the 

Re: Online verification of checksums

2019-03-06 Thread Tomas Vondra



On 3/6/19 6:42 PM, Andres Freund wrote:
> On 2019-03-06 12:33:49 -0500, Robert Haas wrote:
>> On Sat, Mar 2, 2019 at 5:45 AM Michael Banck  
>> wrote:
>>> Am Freitag, den 01.03.2019, 18:03 -0500 schrieb Robert Haas:
 On Tue, Sep 18, 2018 at 10:37 AM Michael Banck
  wrote:
> I have added a retry for this as well now, without a pg_sleep() as well.
> This catches around 80% of the half-reads, but a few slip through. At
> that point we bail out with exit(1), and the user can try again, which I
> think is fine?

 Maybe I'm confused here, but catching 80% of torn pages doesn't sound
 robust at all.
>>>
>>> The chance that pg_verify_checksums hits a torn page (at least in my
>>> tests, see below) is already pretty low, a couple of times per 1000
>>> runs. Maybe 4 out 5 times, the page is read fine on retry and we march
>>> on. Otherwise, we now just issue a warning and skip the file (or so was
>>> the idea, see below), do you think that is not acceptable?
>>
>> Yeah.  Consider a paranoid customer with 100 clusters who runs this
>> every day on every cluster.  They're going to see failures every day
>> or three and go ballistic.
> 
> +1
> 
> 
>> I suspect that better retry logic might help here.  I mean, I would
>> guess that 10 retries at 1 second intervals or something of that sort
>> would be enough to virtually eliminate false positives while still
>> allowing us to report persistent -- and thus real -- problems.  But if
>> even that is going to produce false positives with any measurable
>> probability different from zero, then I think we have a problem,
>> because I neither like a verification tool that ignores possible signs
>> of trouble nor one that "cries wolf" when things are fine.
> 
> To me the right way seems to be to IO lock the page via PG after such a
> failure, and then retry. Which should be relatively easily doable for
> the basebackup case, but obviously harder for the pg_verify_checksums
> case.
> 

Yes, if we could ensure the retry happens after completing the current
I/O on the page (without actually initiating a read into shared buffers)
that would work I think - both for partial reads and torn pages.

Not sure how to integrate it into the CLI tool, though. Perhaps it
could require connection info so that it can execute a function when
run in online mode?

cheers

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Online verification of checksums

2019-03-06 Thread Andres Freund
Hi,

On 2019-03-06 20:37:39 +0100, Tomas Vondra wrote:
> Not sure how to integrate it into the CLI tool, though. Perhaps it
> could require connection info so that it can execute a function when
> run in online mode?

To me the right fix would be to simply have this run as part of the
cluster / in a function. I don't see much point in running this outside
of the cluster.

Greetings,

Andres Freund



Re: performance issue in remove_from_unowned_list()

2019-03-06 Thread Tomas Vondra
On 3/6/19 8:04 PM, Robert Haas wrote:
> On Wed, Mar 6, 2019 at 1:53 PM Alvaro Herrera  
> wrote:
>> On 2019-Feb-08, Tomas Vondra wrote:
>>> I'm wondering if we should just get rid of all such optimizations, and
>>> make the unowned list doubly-linked (WIP patch attached, needs fixing
>>> the comments etc.).
>>
>> +1 for that approach.
> 
> +1 for me, too.
> 
>> Did you consider using a dlist?
> 
> Maybe that is worthwhile, but this is a smaller change, which I think
> should count for quite a bit.  Nothing keeps somebody from doing the
> dlist change as a separate patch, if desired.
> 

Yeah, although now that I think about it I wouldn't expect the dlist
version to be much more complicated. We access next_unowned_reln in two
or three places, IIRC, so switching to dlist would be trivial, I think.

What worries me more is that I observe the opposite behavior to what's
described in the comments for b4166911 (which is from 2018, so not that
old) and 279628a0a7 (from 2013). So what changed since then? Seems fishy ...


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Make drop database safer

2019-03-06 Thread Ashwin Agrawal
On Wed, Mar 6, 2019 at 7:56 AM Tomas Vondra 
wrote:

>
>
> On 2/12/19 12:55 AM, Ashwin Agrawal wrote:
> >
> > Thanks for the response and inputs.
> >
> > On Sat, Feb 9, 2019 at 4:51 AM Andres Freund  > > wrote:
> >
> > Hi,
> >
> > On 2019-02-08 16:36:13 -0800, Alexandra Wang wrote:
> >  > Current sequence of operations for drop database (dropdb())
> >  > 1. Start Transaction
> >  > 2. Make catalog changes
> >  > 3. Drop database buffers
> >  > 4. Forget database fsync requests
> >  > 5. Checkpoint
> >  > 6. Delete database directory
> >  > 7. Commit Transaction
> >  >
> > > Problem
> > > This sequence is unsafe on a couple of fronts. For instance, if drop
> > > database aborts (meaning a crash/shutdown can happen) right after
> > > buffers are dropped in step 3 or step 4, the database will still
> > > exist and be fully accessible but will lose the data from the dirty
> > > buffers. This seems very bad.
> > >
> > > The operation can abort after step 5 as well, in which case the
> > > entries remain in the catalog but the database is not accessible.
> > > That is bad as well, but not as severe as the case above, where the
> > > database exists but some stuff goes magically missing.
> > >
> > > Repro:
> > > ```
> > > CREATE DATABASE test;
> > > \c test
> > > CREATE TABLE t1(a int); CREATE TABLE t2(a int); CREATE TABLE t3(a int);
> > > \c postgres
> > > DROP DATABASE test; <<== kill the session after DropDatabaseBuffers()
> > > (make sure to issue checkpoint before killing the session)
> > > ```
> > >
> > > Proposed ways to fix
> > > 1. CommitTransactionCommand() right after step 2. This makes it fully
> > > safe as the catalog will have the database dropped. Files may still
> > > exist on disk in some cases, which is okay. This also makes it
> > > consistent with the approach used in movedb().
> >
> > To me this seems bad. The current failure mode obviously isn't good,
> > but the data obviously isn't valuable, and just losing track of an
> > entire database's worth of data seems worse.
> >
> >
> > So, based on that response, it seems that not losing track of the
> > files associated with the database is the design choice we wish to
> > achieve. Hence the catalog having an entry while the data directory
> > has been deleted is fine behavior and doesn't need to be solved.
> >
>
> What about adding 'is dropped' flag to pg_database, set it to true at
> the beginning of DROP DATABASE and commit? And ensure no one can connect
> to such database, making DROP DATABASE the only allowed operation?
>
> ISTM we could then continue doing the same thing we do today, without
> any extra checkpoints etc.
>

Sure, adding an 'is dropped' flag column to pg_database and committing that
update before dropping database buffers can give us the safety and also
allows us to keep the link. But it seems a very heavy-duty way to gain the
desired behavior, with an extra column in the pg_database catalog table
specifically just to protect against this failure window. If this solution
gets at least one more vote as an okay route to take, I am fine
implementing it.

I am surprised, though, that keeping the link to a database's worth of data
and not losing track of it is preferred for dropdb() but not cared about in
movedb(). In movedb(), the transaction commits first and then the old
database directory is deleted, which was the first solution proposed for
dropdb().

> FWIW I don't recall why exactly we need the checkpoints, except perhaps
> to ensure the file copies see the most recent data (in CREATE DATABASE)
> and evict stuff for the to-be-dropped database from shared buffers. I
> wonder if we could do that without a checkpoint somehow ...
>

The checkpoint during CREATE DATABASE is understandable. But yes, during
dropdb() it seems unnecessary. The only rationale seems to be for Windows,
based on the comment in the code: "On Windows, this also ensures that
background procs don't hold any open files, which would cause rmdir() to
fail." I think we can avoid the checkpoint in dropdb() on all platforms
other than Windows. Even for Windows, if we have a way to easily ensure no
background procs have open files without a checkpoint, then it can be
avoided there as well.

> > Considering the design choice we must meet, seems approach 2, moving
> > Checkpoint from step 5 before step 3 would give us the safety desired
> > and retain the desired link to the database till we actually delete the
> > files for it.
> >
>
> Ummm? That essentially means this order:
>
> 1. Start Transaction
> 2. Make catalog changes
> 5. Checkpoint
> 3. Drop database buffers
> 4. Forget database fsync requests
> 6. Delete database directory
> 7. Commit Transaction
>
> I don't see how that actually fixes any of the issues? Can you explain?
>

Since checkpoint is performed, all the dirty buffers ma

Re: Optimization of some jsonb functions

2019-03-06 Thread Andrew Dunstan


On 3/5/19 5:24 AM, David Steele wrote:
> On 2/22/19 2:05 AM, Nikita Glukhov wrote:
>> Attached set of patches with some jsonb optimizations that were made
>> during
>> comparison of performance of ordinal jsonb operators and jsonpath
>> operators.
>
> This patch was submitted just before the last commitfest for PG12 and
> seems to have potential for breakage.
>
> I have updated the target to PG13.
>
>

I think that's overly cautious. The first one I looked at, to optimize
JsonbExtractScalar, is very small, self-contained, and I think low risk.
I haven't looked at the others in detail, but I think at least some part
of this is reasonably committable.


I'll try to look at the others fairly shortly.


cheers


andrew



-- 
Andrew Dunstan  https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Should we increase the default vacuum_cost_limit?

2019-03-06 Thread Andrew Dunstan


On 3/6/19 1:38 PM, Jeremy Schneider wrote:
> On 3/5/19 14:14, Andrew Dunstan wrote:
>> This patch is tiny, seems perfectly reasonable, and has plenty of
>> support. I'm going to commit it shortly unless there are last minute
>> objections.
> +1
>

done.


cheers


andrew

-- 
Andrew Dunstan  https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Online verification of checksums

2019-03-06 Thread Tomas Vondra
On 3/6/19 8:41 PM, Andres Freund wrote:
> Hi,
> 
> On 2019-03-06 20:37:39 +0100, Tomas Vondra wrote:
>> Not sure how to integrate it into the CLI tool, though. Perhaps it
>> could require connection info so that it can execute a function when
>> run in online mode?
> 
> To me the right fix would be to simply have this run as part of the
> cluster / in a function. I don't see much point in running this outside
> of the cluster.
> 

Not sure. AFAICS that would require a single transaction, and if we
happen to add some sort of throttling (which is a feature request I'd
expect pretty soon to make it usable on live clusters) that might be
quite long-running. So, not great.

If we want to run it from the server itself, then I guess a background
worker would be a better solution. Incidentally, that's something I was
toying with some time ago; see [1].


[1] https://github.com/tvondra/scrub

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Incomplete startup packet errors

2019-03-06 Thread Andrew Dunstan


On 3/6/19 12:12 PM, Robert Haas wrote:
> On Tue, Mar 5, 2019 at 5:35 PM Andrew Dunstan
>  wrote:
>> OK, I think we have agreement on Tom's patch. Do we want to backpatch
>> it? It's a change in behaviour, but I find it hard to believe anyone
>> relies on the existence of these annoying messages, so my vote would be
>> to backpatch it.
> I don't think it's a bug fix, so I don't think it should be
> back-patched.  I think trying to guess which behavior changes are
> likely to bother users is an unwise strategy -- it's very hard to know
> what will actually bother people, and it's very easy to let one's own
> desire to get a fix out the door lead to an unduly rosy view of the
> situation.  Plus, all patches carry some risk, because all developers
> make mistakes; the fewer things we back-patch, the fewer regressions
> we'll introduce.
>

OK, no back-patching it is.


cheers


andrew


-- 
Andrew Dunstan  https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Protect syscache from bloating with negative cache entries

2019-03-06 Thread Tom Lane
Robert Haas  writes:
> OK, so this is getting simpler, but I'm wondering why we need
> dlist_move_tail() at all.  It is a well-known fact that maintaining
> LRU ordering is expensive and it seems to be unnecessary for our
> purposes here.

Yeah ... LRU maintenance was another thing that used to be in the
catcache logic and was thrown out as far too expensive.  Your idea
of just using a clock sweep instead seems plausible.

regards, tom lane



Re: Should we increase the default vacuum_cost_limit?

2019-03-06 Thread Tomas Vondra



On 3/6/19 12:10 AM, David Rowley wrote:
> Thanks for chipping in on this.
> 
> On Wed, 6 Mar 2019 at 01:53, Tomas Vondra  
> wrote:
>> But on the other hand it feels a bit weird that we increase this one
>> value and leave all the other (also very conservative) defaults alone.
> 
> Which others did you have in mind? Like work_mem, shared_buffers?  If
> so, I mentioned in the initial post that I don't see vacuum_cost_limit
> as in the same category as those.  It's not like PostgreSQL won't
> start on a tiny server if vacuum_cost_limit is too high, but you will
> have issues with too big a shared_buffers, for example.   I think if
> we insist that this patch is a review of all the "how big is your
> server" GUCs then that's raising the bar significantly and
> unnecessarily for what I'm proposing here.
> 

On second thought, I think you're right. It's still true that you need
to bump up various other GUCs on reasonably current hardware, but it's
true that vacuum_cost_limit is special enough to be increased separately.

so +1 (I see Andrew already pushed it, but anyway ...)

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Binary upgrade from <12 to 12 creates toast table for partitioned tables

2019-03-06 Thread Andres Freund
Hi,

After my tableam patch Andrew's buildfarm animal started failing in the
cross-version upgrades:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2019-03-06%2019%3A32%3A24

But I actually don't think that's really the fault of the tableam patch.
The reason for the assertion failure is that we assert
Assert(relation->rd_rel->relam != InvalidOid);
for tables that have storage.  The table in question is a toast table.

The reason that table doesn't have an AM is that it's the toast table
for a partitioned relation, and for toast tables we just copy the AM
from the main table.

The backtrace shows:

#7  0x558b4bdbecfc in create_toast_table (rel=0x7fdab64289e0, toastOid=0, 
toastIndexOid=0, reloptions=0, lockmode=8, check=false)
at /home/andres/src/postgresql/src/backend/catalog/toasting.c:263
263 toast_relid = heap_create_with_catalog(toast_relname,
(gdb) p *rel->rd_rel
$2 = {oid = 80244, relname = {
data = 
"partitioned_table\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"},
 relnamespace = 2200, reltype = 80246, reloftype = 0, relowner = 10, relam = 0, 
relfilenode = 0, 
  reltablespace = 0, relpages = 0, reltuples = 0, relallvisible = 0, 
reltoastrelid = 0, relhasindex = false, relisshared = false, relpersistence = 
112 'p', 
  relkind = 112 'p', relnatts = 2, relchecks = 0, relhasrules = false, 
relhastriggers = false, relhassubclass = false, relrowsecurity = false, 
  relforcerowsecurity = false, relispopulated = true, relreplident = 100 'd', 
relispartition = false, relrewrite = 0, relfrozenxid = 0, relminmxid = 0}

that we're trying to create a toast table for a partitioned table. Which
seems wrong to me, given that recent commits made partitioned tables
have no storage.

The reason we're creating storage is that we're upgrading from a
version of PG where partitioned tables *did* have storage. And thus the
dump looks like:

-- For binary upgrade, must preserve pg_type oid
SELECT pg_catalog.binary_upgrade_set_next_pg_type_oid('80246'::pg_catalog.oid);


-- For binary upgrade, must preserve pg_type array oid
SELECT 
pg_catalog.binary_upgrade_set_next_array_pg_type_oid('80245'::pg_catalog.oid);


-- For binary upgrade, must preserve pg_type toast oid
SELECT 
pg_catalog.binary_upgrade_set_next_toast_pg_type_oid('80248'::pg_catalog.oid);


-- For binary upgrade, must preserve pg_class oids
SELECT 
pg_catalog.binary_upgrade_set_next_heap_pg_class_oid('80244'::pg_catalog.oid);
SELECT 
pg_catalog.binary_upgrade_set_next_toast_pg_class_oid('80247'::pg_catalog.oid);
SELECT 
pg_catalog.binary_upgrade_set_next_index_pg_class_oid('80249'::pg_catalog.oid);

CREATE TABLE "public"."partitioned_table" (
"a" integer,
"b" "text"
)
PARTITION BY LIST ("a");


and create_toast_table() has logic like:

	{
		/*
		 * In binary-upgrade mode, create a TOAST table if and only if
		 * pg_upgrade told us to (ie, a TOAST table OID has been provided).
		 *
		 * This indicates that the old cluster had a TOAST table for the
		 * current table.  We must create a TOAST table to receive the old
		 * TOAST file, even if the table seems not to need one.
		 *
		 * Contrariwise, if the old cluster did not have a TOAST table, we
		 * should be able to get along without one even if the new version's
		 * needs_toast_table rules suggest we should have one.  There is a lot
		 * of daylight between where we will create a TOAST table and where
		 * one is really necessary to avoid failures, so small cross-version
		 * differences in the when-to-create heuristic shouldn't be a problem.
		 * If we tried to create a TOAST table anyway, we would have the
		 * problem that it might take up an OID that will conflict with some
		 * old-cluster table we haven't seen yet.
		 */
		if (!OidIsValid(binary_upgrade_next_toast_pg_class_oid) ||
			!OidIsValid(binary_upgrade_next_toast_pg_type_oid))
			return false;


I think we probably should have pg_dump suppress emitting information
about the toast table of partitioned tables?

While I'm not hugely bothered by binary upgrade mode creating
inconsistent states - there's plenty of ways to crash the server that
way - it probably also would be a good idea to have heap_create()
elog(ERROR) when accessmtd is invalid.

Greetings,

Andres Freund



Re: Tighten error control for OpenTransientFile/CloseTransientFile

2019-03-06 Thread Robert Haas
On Fri, Mar 1, 2019 at 5:06 PM Joe Conway  wrote:
> Seems like it would be better to modify the arguments to
> CloseTransientFile() to include the filename being closed, errorlevel,
> and fail_on_error or something similar. Then all the repeated ereport
> stanzas could be eliminated.

Hmm.  I'm not sure that really saves much in terms of notation, and
it's less flexible.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-06 Thread Robert Haas
On Tue, Mar 5, 2019 at 3:03 PM Peter Geoghegan  wrote:
> I agree that the parts covered by the first patch in the series are
> very unlikely to need changes, but I hesitate to commit it weeks ahead
> of the other patches.

I know I'm stating the obvious here, but we don't have many weeks left
at this point.  I have not reviewed any code, but I have been
following this thread and I'd really like to see this work go into
PostgreSQL 12, assuming it's in good enough shape.  It sounds like
really good stuff.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Binary upgrade from <12 to 12 creates toast table for partitioned tables

2019-03-06 Thread Andrew Dunstan


On 3/6/19 3:41 PM, Andres Freund wrote:
> Hi,
>
> After my tableam patch Andrew's buildfarm animal started failing in the
> cross-version upgrades:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2019-03-06%2019%3A32%3A24


Incidentally, I just fixed a bug that was preventing the module from
picking up its log files. The latest report has them all.


cheers


andrew


-- 
Andrew Dunstan  https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Binary upgrade from <12 to 12 creates toast table for partitioned tables

2019-03-06 Thread Andres Freund
Hi,

On 2019-03-06 16:46:07 -0500, Andrew Dunstan wrote:
> 
> On 3/6/19 3:41 PM, Andres Freund wrote:
> > Hi,
> >
> > After my tableam patch Andrew's buildfarm animal started failing in the
> > cross-version upgrades:
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2019-03-06%2019%3A32%3A24
> 
> 
> Incidentally, I just fixed a bug that was preventing the module from
> picking up its log files. The latest report has them all.

Ah, cool. I was wondering about that...

One thing I noticed is that it might need a more aggressive separator;
that report has:

\0\0\0rls_tbl_force\0\0\0\0ROW SECURITY\0\0\0\0\0K\0\0\0ALTER TABLE 
"regress_rls_schema"."rls_tbl_force" ENABLE ROW LEVEL 
SECURITY;\0\0\0\0\0\0\0\0\0\0\0\0regress_rls_schema\0\0\0\0\0\0\0 
\0\0\0buildfarm\0\0\0\0false\0\0\0\0350\0\0\0\0\0\0\0\0\0\0\0=
 REL_10_STABLE-pg_upgrade_dump_16384.log ==


Greetings,

Andres Freund



Re: shared-memory based stats collector

2019-03-06 Thread Robert Haas
On Sun, Feb 24, 2019 at 11:53 PM Kyotaro HORIGUCHI
 wrote:
> The two above are fixed in the attached v17.

Andres just drew my attention to patch 0004 in this series, which is
definitely not OK.  That patch allows the postmaster to use dynamic
shared memory, claiming: "Shared memory baesd stats collector needs it
to work on postmaster and no problem found to do that. Just allow it."

But if you just look a little further down in the code from where that
assertion is located, you'll find this:

/* Lock the control segment so we can register the new segment. */
LWLockAcquire(DynamicSharedMemoryControlLock, LW_EXCLUSIVE);

It is a well-established principle that the postmaster must not
acquire any locks, because if it did, a corrupted shared memory
segment could take down not only individual backends but the
postmaster as well.  So this is entirely not OK in the postmaster.  I
think there might be other reasons as well why this is not OK that
aren't occurring to me at the moment, but that one is enough by
itself.

But even if for some reason that were OK, I'm pretty sure that any
design that involves the postmaster interacting with the data stored
in shared memory by the stats collector is an extremely bad idea.
Again, the postmaster is supposed to have as little interaction with
shared memory as possible, precisely so that it doesn't crash and
burn when some other process corrupts shared memory.  Dynamic shared
memory is included in that.  So, really, the LWLock here is just the
tip of the iceberg: the postmaster not only CAN'T safely run this
code, but it shouldn't WANT to do so.

And I'm kind of baffled that it does.  I haven't looked at the other
patches, but it seems to me that, while a shared-memory stats
collector is a good idea in general to avoid the I/O and CPU costs of
with reading and writing temporary files, I don't see why the
postmaster would need to be involved in any of that.  Whatever the
reason, though, I'm pretty sure that's GOT to be changed for this
patch set to have any chance of being accepted.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Should we increase the default vacuum_cost_limit?

2019-03-06 Thread David Rowley
On Thu, 7 Mar 2019 at 08:54, Andrew Dunstan
 wrote:
>
>
> On 3/6/19 1:38 PM, Jeremy Schneider wrote:
> > On 3/5/19 14:14, Andrew Dunstan wrote:
> >> This patch is tiny, seems perfectly reasonable, and has plenty of
> >> support. I'm going to commit it shortly unless there are last minute
> >> objections.
> > +1
> >
>
> done.

Thanks!

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: pg_dump is broken for partition tablespaces

2019-03-06 Thread David Rowley
On Thu, 7 Mar 2019 at 03:36, Tom Lane  wrote:
>
> David Rowley  writes:
> > As far as I can see, the biggest fundamental difference with doing
> > things this way will be that the column order of partitions will be
> > preserved, where before it would inherit the order of the partitioned
> > table.  I'm a little unsure if doing this column reordering was an
> > intended side-effect or not.
>
> Well, if the normal behavior results in changing the column order,
> it'd be necessary to do things differently in --binary-upgrade mode
> anyway, because there we *must* preserve column order.  I don't know
> if what you're describing represents a separate bug for pg_upgrade runs,
> but it might.  Is there any test case for the situation left behind by
> the core regression tests?

After having written the patch, I noticed that binary upgrade mode
does the CREATE TABLE then ATTACH PARTITION in order to preserve the
order.

After changing it, nothing failed in make check-world with TAP tests enabled.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: pg_dump is broken for partition tablespaces

2019-03-06 Thread David Rowley
On Thu, 7 Mar 2019 at 05:17, Andres Freund  wrote:
> I'm also concerned that the the current catalog representation isn't
> right. As I said:
>
> > I also find it far from clear that:
> > 
> >  
> >   The tablespace_name is 
> > the name
> >   of the tablespace in which the new table is to be created.
> >   If not specified,
> >is consulted, or
> >if the table is temporary.  For
> >   partitioned tables, since no storage is required for the table itself,
> >   the tablespace specified here only serves to mark the default 
> > tablespace
> >   for any newly created partitions when no other tablespace is 
> > explicitly
> >   specified.
> >  
> > 
> > is handled correctly. The above says that the *specified* tablespaces -
> > which seems to exclude the default tablespace - is what's going to
> > determine what partitions use as their default tablespace. But in fact
> > that's not true, the partitioned table's pg_class.retablespace is set to
> > what default_tablespaces was at the time of the creation.

Do you think it's fine to reword the docs to make this point more
clear, or do you see this as a fundamental problem with the patch?

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Patch to document base64 encoding

2019-03-06 Thread Karl O. Pinc
On Wed, 6 Mar 2019 19:30:16 +0100 (CET)
Fabien COELHO  wrote:

> "... section 6.8" -> "... Section 6.8" (capital S).

Fixed.

> "The string and the binary encode and decode functions..." sentence
> looks strange to me, especially with the English article that I do
> not really master, so maybe it is ok. I'd have written something more 
> straightforward, eg: "Functions encode and decode support the
> following encodings:",

It is an atypical construction because I want to draw attention to the
fact that this is documentation not only for the encode() and decode() in
section 9.4, String Functions and Operators, but also for
the encode() and decode() in section 9.5, Binary String Functions
and Operators.  Although I can't think of a better approach,
it makes me uncomfortable that documentation written in
one section applies equally to functions in a different section.

Do you think it would be useful to hyperlink the word "binary"
to section 9.5?

The idiomatic phrasing would be "Both the string and the binary
encode and decode functions..." but the word "both" adds
no information.  Shorter is better.

> and also I'd use a direct "Function
> <...>decode ..." rather than "The decode
> function ..." (twice).

The straightforward English would be "Decode accepts...".  The problem
is that this begins the sentence with the name of a function.
This does not work very well when the function name is all lower case,
and can have other problems where clarity is lost depending 
on documentation output formatting.

I don't see a better approach.

> Maybe I'd use the exact same grammatical structure for all 3 cases, 
> starting with "The <>whatever encoding converts bla bla bla"
> instead of varying the sentences.

Agreed.  Good idea.  The first paragraph of each term has to 
do with encoding and the second with decoding.  
Uniformity in starting the second paragraphs helps make 
this clear, even though the first paragraphs are not uniform.
With this I am not concerned that the first paragraphs
do not have a common phrasing that's very explicit about
being about encoding.

Adjusted.

> Otherwise, all explanations look both precise and useful to me.

When writing I was slightly concerned about being overly precise;
permanently committing to behavior that might (possibly) be an artifact
of implementation.  E.g., that hex decoding accepts both
upper and lower case A-F characters, what input is ignored
and what raises an error, etc.  But it seems best
to document existing behavior, all of which has existed so long
anyway that changing it would be disruptive.  If anybody cares
they can object.

I wrote the docs by reading the code and did only a little
actual testing to be sure that what I wrote is correct.
I also did not check for regression tests which confirm
the behavior I'm documenting.  (It wouldn't hurt to have
such regression tests, if they don't already exist.
But writing regression tests is more than I want to take on 
with this patch.  Feel free to come up with tests.  :-)

I'm confident that the behavior I documented is how PG behaves
but you should know what I did in case you want further
validation.

Attached: doc_base64_v8.patch

Regards,

Karl 
Free Software:  "You don't pay back, you pay forward."
 -- Robert A. Heinlein
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 6765b0d584..e756bf53ba 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -1752,6 +1752,9 @@
 
  decode
 
+
+  base64 encoding
+
 decode(string text,
 format text)

@@ -1769,16 +1772,25 @@
 
  encode
 
+
+ base64 encoding
+
+
+ hex encoding
+
+
+ escape encoding
+
 encode(data bytea,
 format text)

text

 Encode binary data into a textual representation.  Supported
-formats are: base64, hex, escape.
-escape converts zero bytes and high-bit-set bytes to
-octal sequences (\nnn) and
-doubles backslashes.
+formats are:
+base64,
+hex,
+escape.

encode('123\000\001', 'base64')
MTIzAAE=
@@ -2365,6 +2377,90 @@
 format treats a NULL as a zero-element array.

 
+   
+ encode
+   
+   
+ decode
+   
+   
+ base64 encoding
+   
+   
+hex encoding
+   
+   
+escape encoding
+   
+
+   
+ The string and the binary encode
+ and decode functions support the following
+ encodings:
+
+ 
+   
+ base64
+ 
+   
+ The base64 encoding is that
+ of https://tools.ietf.org/html/rfc2045#section-6.8";>RFC
+ 2045 Section 6.8.  As per the RFC, encoded lines are
+ broken at 76 characters.  However instead of the MIME CRLF
+ end-of-line marker, only a newline is used for end-of-line.
+   
+   
+   

Re: pg_dump is broken for partition tablespaces

2019-03-06 Thread Andres Freund
Hi,

On 2019-03-07 11:31:15 +1300, David Rowley wrote:
> On Thu, 7 Mar 2019 at 05:17, Andres Freund  wrote:
> > I'm also concerned that the the current catalog representation isn't
> > right. As I said:
> >
> > > I also find it far from clear that:
> > > 
> > >  
> > >   The tablespace_name is 
> > > the name
> > >   of the tablespace in which the new table is to be created.
> > >   If not specified,
> > >is consulted, or
> > >if the table is temporary.  
> > > For
> > >   partitioned tables, since no storage is required for the table 
> > > itself,
> > >   the tablespace specified here only serves to mark the default 
> > > tablespace
> > >   for any newly created partitions when no other tablespace is 
> > > explicitly
> > >   specified.
> > >  
> > > 
> > > is handled correctly. The above says that the *specified* tablespaces -
> > > which seems to exclude the default tablespace - is what's going to
> > > determine what partitions use as their default tablespace. But in fact
> > > that's not true, the partitioned table's pg_class.retablespace is set to
> > > what default_tablespaces was at the time of the creation.
> 
> Do you think it's fine to reword the docs to make this point more
> clear, or do you see this as a fundamental problem with the patch?

Hm, both? I mean I wouldn't necessarily characterize it as a "fundamental"
problem, but ...

I don't think the argument that the user intended to explicitly set a
tablespace holds much water if it was just set via default_tablespace,
rather than an explicit TABLESPACE.  I think iff you really want
something like this feature, you'd have to mark a partition's
reltablespace as 0 unless an *explicit* assignment of the tablespace
happened. In which case you also would need to explicitly emit a
TABLESPACE for the partitioned table in pg_dump, to restore that.
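If I follow, the distinction could be sketched like this (ts1 is a hypothetical tablespace, and the reltablespace behavior shown is the *proposed* one, not what the patch currently does):

```sql
SET default_tablespace = ts1;

-- No explicit TABLESPACE clause: under the proposal, the partitioned
-- table's pg_class.reltablespace would be left as 0, and pg_dump would
-- emit no TABLESPACE clause when dumping it.
CREATE TABLE p (a int) PARTITION BY RANGE (a);

-- Explicit TABLESPACE clause: reltablespace is set to ts1's OID, and
-- pg_dump would emit TABLESPACE ts1 to restore that choice.
CREATE TABLE q (a int) PARTITION BY RANGE (a) TABLESPACE ts1;
```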

Greetings,

Andres Freund



Re: shared-memory based stats collector

2019-03-06 Thread Andres Freund
Hi,

> From 88740269660d00d548910c2f3aa631878c7cf0d4 Mon Sep 17 00:00:00 2001
> From: Kyotaro Horiguchi 
> Date: Thu, 21 Feb 2019 12:42:07 +0900
> Subject: [PATCH 4/6] Allow dsm to use on postmaster.
> 
> DSM is inhibited to be used on postmaster. Shared memory baesd stats
> collector needs it to work on postmaster and no problem found to do
> that. Just allow it.

Maybe I'm missing something, but why? postmaster doesn't actually need
to process stats messages in any way?


> From 774b1495136db1ad6d174ab261487fdf6cb6a5ed Mon Sep 17 00:00:00 2001
> From: Kyotaro Horiguchi 
> Date: Thu, 21 Feb 2019 12:44:56 +0900
> Subject: [PATCH 5/6] Shared-memory based stats collector
> 
> Previously activity statistics is shared via files on disk. Every
> backend sends the numbers to the stats collector process via a socket.
> It makes snapshots as a set of files on disk with a certain interval
> then every backend reads them as necessary. It worked fine for
> comparatively small set of statistics but the set is under the
> pressure to growing up and the file size has reached the order of
> megabytes. To deal with larger statistics set, this patch let backends
> directly share the statistics via shared memory.

Btw, you can make the life of a committer easier by collecting the
reviewers and co-authors of a patch yourself...


This desperately needs an introductory comment in pgstat.c or such
explaining how the new scheme works.



> +LWLock   StatsMainLock;
> +#define  StatsLock (&StatsMainLock)

Wait, what? You can't just define a lock this way. That's process local
memory, locking that doesn't do anything useful.


> +/* Shared stats bootstrap information */
> +typedef struct StatsShmemStruct {

Please note that in PG's coding style the { comes in the next line.


> +/*
> + *  Backends store various database-wide info that's waiting to be flushed 
> out
> + *  to shared memory in these variables.
> + */
> +static int   n_deadlocks = 0;
> +static size_tn_tmpfiles = 0;
> +static size_tn_tmpfilesize = 0;
> +
> +/*
> + * have_recovery_conflicts represents the existence of any kind if conflict
> + */
> +static bool  have_recovery_conflicts = false;
> +static int   n_conflict_tablespace = 0;
> +static int   n_conflict_lock = 0;
> +static int   n_conflict_snapshot = 0;
> +static int   n_conflict_bufferpin = 0;
> +static int   n_conflict_startup_deadlock = 0;

Probably worthwhile to group those into a struct, even just to make
debugging easier.



>  
> -/* --
> - * pgstat_init() -
> - *
> - *   Called from postmaster at startup. Create the resources required
> - *   by the statistics collector process.  If unable to do so, do not
> - *   fail --- better to let the postmaster start with stats collection
> - *   disabled.
> - * --
> - */
> -void
> -pgstat_init(void)
> +static void
> +pgstat_postmaster_shutdown(int code, Datum arg)

You can't have a function like that without explaining why it's there.

> + /* trash the stats on crash */
> + if (code == 0)
> + pgstat_write_statsfiles();
>  }

And especially not without documenting what that code is supposed to
mean.



>  pgstat_report_stat(bool force)
>  {
> - /* we assume this inits to all zeroes: */
> - static const PgStat_TableCounts all_zeroes;
> - static TimestampTz last_report = 0;
> -
> + static TimestampTz last_flush = 0;
> + static TimestampTz pending_since = 0;
>   TimestampTz now;
> - PgStat_MsgTabstat regular_msg;
> - PgStat_MsgTabstat shared_msg;
> - TabStatusArray *tsa;
> - int i;
> + pgstat_flush_stat_context cxt = {0};
> + boolhave_other_stats = false;
> + boolpending_stats = false;
> + longelapsed;
> + longsecs;
> + int usecs;
> +
> + /* Do we have anything to flush? */
> + if (have_recovery_conflicts || n_deadlocks != 0 || n_tmpfiles != 0)
> + have_other_stats = true;
>  
>   /* Don't expend a clock check if nothing to do */
>   if ((pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
>   pgStatXactCommit == 0 && pgStatXactRollback == 0 &&
> - !have_function_stats)
> - return;
> + !have_other_stats && !have_function_stats)
> + return 0;

"other" seems like a pretty mysterious category. Seems better to either
name precisely, or just use the underlying variables for checks.




> +/* ---
> + * Subroutines for pgstat_flush_stat.
> + * ---
> + */
> +
> +/*
> + * snapshot_statentry() - Find an entry from source dshash with cache.
> + *

Is snapshot_statentry() really a subroutine for pgstat_flush_stat()?

> +static void *
> +snapshot_statentry(pgstat_snapshot_cxt *cxt, Oid key)
> +{
> + char *lentry = NULL;
> + size_t keysize = cxt->dsh_params->key_size;
> + size_t dsh_entry

Re: Pluggable Storage - Andres's take

2019-03-06 Thread David Rowley
On Thu, 7 Mar 2019 at 08:33, Andres Freund  wrote:
> Here's a cleaned up version of that patch.  David, Alvaro, you also
> played in that area, any objections? I think this makes that part of the
> code easier to read actually. Robert, thanks for looking at that patch
> already.

I only had a quick look and don't have a grasp of what the patch
series is doing to tuple slots, but I didn't see anything I found
alarming during the read.


-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Pluggable Storage - Andres's take

2019-03-06 Thread Andres Freund
Hi,

On 2019-03-07 11:56:57 +1300, David Rowley wrote:
> On Thu, 7 Mar 2019 at 08:33, Andres Freund  wrote:
> > Here's a cleaned up version of that patch.  David, Alvaro, you also
> > played in that area, any objections? I think this makes that part of the
> > code easier to read actually. Robert, thanks for looking at that patch
> > already.
> 
> I only had a quick look and don't have a grasp of what the patch
> series is doing to tuple slots, but I didn't see anything I found
> alarming during the read.

Thanks for looking.

Re slots - the deal basically is that going forward low level
operations, like fetching a row from a table etc, have to be done by a
slot that's compatible with the "target" table. You can get compatible
slot callbacks by calling table_slot_callbacks(), or directly create one
by calling table_gimmegimmeslot() (likely to be renamed :)).

The problem here was that the partition root's slot was used to fetch /
store rows from a child partition. By moving mt_existing into
ResultRelInfo that's not the case anymore.

- Andres



Re: patch to allow disable of WAL recycling

2019-03-06 Thread Jerry Jelinek
On Wed, Mar 6, 2019 at 11:02 AM Alvaro Herrera 
wrote:

> On 2019-Mar-06, Robert Haas wrote:
>
> > On Wed, Mar 6, 2019 at 12:13 PM Alvaro Herrera 
> wrote:
> > > I want your dictating software.
> >
> > I'm afraid this is just me and a keyboard, but sadly for me you're not
> > the first person to accuse me of producing giant walls of text.
>
> Well, I don't have a problem reading long texts; my problem is that I'm
> unable to argue as quickly.
>
> I do buy your argument, though (if reluctantly); in particular I was
> worried to offer a parameter (to turn off zero-filling of segments) that
> would enable dangerous behavior, but then I realized we also have
> fsync=off of which the same thing can be said.  So I agree we should
> have two GUCs, properly explained, with a warning where appropriate.
>
> --
> Álvaro Herrerahttps://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>

It sounds like everyone is in agreement that I should get rid of the
single COW GUC tunable and provide two different tunables instead. I will
update the patch to go back to the original name (wal_recycle) for the
original WAL recycling behavior. The default value of that will be true to
provide the existing behavior. This matches my original proposal from last
year. I will add a new tunable (wal_init_zero) which will control the
zero-fill behavior for the WAL file. Again, the default value will be true
and provide the existing behavior. Both of these could (should) be set to
false for a COW filesystem like ZFS.
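Under that plan, a ZFS (or other COW filesystem) setup would look something like this in postgresql.conf — GUC names as proposed in this thread, both defaulting to on to preserve existing behavior:

```
wal_recycle = off        # don't rename old WAL segments into place for reuse
wal_init_zero = off      # don't zero-fill newly created WAL segments
```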

If anyone objects to this new approach, let me know, otherwise I'll start
preparing an updated patch.

Thanks for all of the feedback,
Jerry


Re: Protect syscache from bloating with negative cache entries

2019-03-06 Thread Tomas Vondra
On 3/6/19 9:17 PM, Tom Lane wrote:
> Robert Haas  writes:
>> OK, so this is getting simpler, but I'm wondering why we need
>> dlist_move_tail() at all.  It is a well-known fact that maintaining
>> LRU ordering is expensive and it seems to be unnecessary for our
>> purposes here.
> 
> Yeah ... LRU maintenance was another thing that used to be in the
> catcache logic and was thrown out as far too expensive.  Your idea
> of just using a clock sweep instead seems plausible.
> 

I agree clock sweep might be sufficient, although the benchmarks done in
this thread so far do not suggest the LRU approach is very expensive.

A simple true/false flag, as proposed by Robert, would mean we can only
do the cleanup once per the catalog_cache_prune_min_age interval, so
with the default value (5 minutes) the entries might be between 5 and 10
minutes old. That's probably acceptable, although for higher values the
range gets wider and wider ...

Which part of the LRU approach is supposedly expensive? Updating the
lastaccess field or moving the entries to tail? I'd guess it's the
latter, so perhaps we can keep some sort of age field, update it less
frequently (once per minute?), and do the clock sweep?

BTW wasn't one of the cases this thread aimed to improve a session that
accesses a lot of objects in a short period of time? That balloons the
syscache, and while this patch evicts the entries from memory, we never
actually release the memory back (because AllocSet just moves it into
the freelists) and it's unlikely to get swapped out (because other
chunks on those memory pages are likely to be still used). I've proposed
to address that by recreating the context if it gets too bloated, and I
think Alvaro agreed with that. But I haven't seen any further discussion
about that.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-03-06 Thread Bruce Momjian
On Wed, Mar  6, 2019 at 10:49:17AM -0800, Jeremy Schneider wrote:
> Might it make sense to generalize a little bit to secret management? It
> would be *great* if PostgreSQL could have a standard "secrets" API which
> could then use plugins or extensions to provide an internal
> implementation (software or hardware based) and/or plug in to an
> external secret management service, whether an OSS package installed on
> the box or some 3rd party service off the box.
> 
> The two obvious use cases are encryption keys (mentioned here) and
> passwords for things like logical replication, FDWs, dblinks, other
> extensions, etc. Aside from adding new encryption key secrets, the way
> PostgreSQL handles the existing secrets it already has today leaves room
> for improvement.

See this email for a possible implementation:


https://www.postgresql.org/message-id/20190222035816.uozqvc4wjyag3...@momjian.us

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-06 Thread Peter Geoghegan
On Wed, Mar 6, 2019 at 1:37 PM Robert Haas  wrote:
> I know I'm stating the obvious here, but we don't have many weeks left
> at this point.  I have not reviewed any code, but I have been
> following this thread and I'd really like to see this work go into
> PostgreSQL 12, assuming it's in good enough shape.  It sounds like
> really good stuff.

Thanks!

Barring any objections, I plan to commit the first 3 patches (plus the
amcheck "relocate" patch) within 7 - 10 days (that's almost
everything). Heikki hasn't reviewed 'Add high key "continuescan"
optimization' yet, and it seems like he should take a look at that
before I proceed with it. But that seems like the least controversial
enhancement within the entire patch series, so I'm not very worried
about it.

I'm currently working on v15, which has comment-only revisions
requested by Heikki. I expect to continue to work with him to make
sure that he is happy with the presentation. I'll also need to
revalidate the performance of the patch series following recent minor
changes to the logic for choosing a split point. That can take days.
This is why I don't want to commit the first patch without committing
at least the first three all at once -- it increases the amount of
performance validation work I'll have to do considerably. (I have to
consider both v4 and v3 indexes already, which seems like enough
work.)

Two of the later patches (one of which I plan to push as part of the
first batch of commits) use heuristics to decide where to split the
page. As a Postgres contributor, I have learned to avoid inventing
heuristics, so this automatically makes me a bit uneasy. However, I
don't feel so bad about it here, on reflection. The on-disk size of
the TPC-C indexes is reduced by 35% with the 'Add "split after new
tuple" optimization' patch (I think that the entire database is
usually about 12% smaller). There simply isn't a fundamentally better
way to get the same benefit, and I'm sure that nobody will argue that
we should just accept the fact that the most influential database
benchmark of all time has a big index bloat problem with Postgres.
That would be crazy.

That said, it's not impossible that somebody will shout at me because
my heuristics made their index bloated. I can't see how that could
happen, but I am prepared. I can always adjust the heuristics when new
information comes to light. I have fairly thorough test cases that
should allow me to do this without regressing anything else. This is a
risk that can be managed sensibly.

There is no gnawing ambiguity about the on-disk changes laid down in
the second patch (nor the first patch), though. Making on-disk changes
is always a bit scary, but making the keys unique is clearly a big
improvement architecturally, as it brings nbtree closer to the Lehman
& Yao design without breaking anything for v3 indexes (v3 indexes
simply aren't allowed to use a heap TID in their scankey). Unique keys
also allow amcheck to relocate every tuple in the index from the root
page, using the same code path as regular index scans. We'll be
relying on the uniqueness of keys within amcheck from the beginning,
before anybody teaches nbtree to perform retail index tuple deletion.

-- 
Peter Geoghegan



Re: pgsql: Removed unused variable, openLogOff.

2019-03-06 Thread Michael Paquier
On Wed, Mar 06, 2019 at 02:47:16PM +, Robert Haas wrote:
> Removed unused variable, openLogOff.

Is that right for the report if data is written in chunks?  The same
patch was proposed a couple of weeks ago, and I commented about
it as follows:
https://www.postgresql.org/message-id/20190129043439.gb3...@paquier.xyz
--
Michael




RE: speeding up planning with partitions

2019-03-06 Thread Imai, Yoshikazu
Amit-san,

On Wed, Mar 6, 2019 at 5:38 AM, Amit Langote wrote:
> On 2019/03/06 11:09, Imai, Yoshikazu wrote:
> > [0004 or 0005]
> > There are redundant process in add_appendrel_other_rels (or
> expand_xxx_rtentry()?).
> > I modified add_appendrel_other_rels like below and it actually worked.
> >
> >
> > add_appendrel_other_rels(PlannerInfo *root, RangeTblEntry *rte, Index
> > rti) {
> > ListCell   *l;
> > RelOptInfo *rel;
> >
> > /*
> >  * Add inheritance children to the query if not already done. For
> child
> >  * tables that are themselves partitioned, their children will be
> added
> >  * recursively.
> >  */
> > if (rte->rtekind == RTE_RELATION
> && !root->contains_inherit_children)
> > {
> > expand_inherited_rtentry(root, rte, rti);
> > return;
> > }
> >
> > rel = find_base_rel(root, rti);
> >
> > foreach(l, root->append_rel_list)
> > {
> > AppendRelInfo  *appinfo = lfirst(l);
> > Index   childRTindex = appinfo->child_relid;
> > RangeTblEntry  *childrte;
> > RelOptInfo *childrel;
> >
> > if (appinfo->parent_relid != rti)
> > continue;
> >
> > Assert(childRTindex < root->simple_rel_array_size);
> > childrte = root->simple_rte_array[childRTindex];
> > Assert(childrte != NULL);
> > build_simple_rel(root, childRTindex, rel);
> >
> > /* Child may itself be an inherited relation. */
> > if (childrte->inh)
> > add_appendrel_other_rels(root, childrte, childRTindex);
> > }
> > }
> 
> If you don't intialize part_rels here, then it will break any code in
> the planner that expects a partitioned rel with non-zero number of
> partitions to have part_rels set to non-NULL.  At the moment, that
> includes the code that implements partitioning-specific optimizations
> such partitionwise aggregate and join, run-time pruning etc.  No existing
> regression tests cover the cases where source inherited roles
> participates in those optimizations, so you didn't find anything that
> broke.  For an example, consider the following update query:
> 
> update p set a = p1.a + 1 from p p1 where p1.a = (select 1);
> 
> Planner will create a run-time pruning aware Append node for p (aliased
> p1), for which it will need to use the part_rels array.  Note that p in
> this case is a source relation which the above code initializes.
> 
> Maybe we should add such regression tests.

Ah, now I understand that the code below expand_inherited_rtentry()
initializes the part_rels of child queries after the first child target, and
those part_rels are used in partitioning-specific optimizations. Thanks for
the explanation.

--
Yoshikazu Imai


Re: Tighten error control for OpenTransientFile/CloseTransientFile

2019-03-06 Thread Michael Paquier
On Wed, Mar 06, 2019 at 02:54:52PM +, Georgios Kokolatos wrote:
> Overall the patch looks good and according to the previous
> discussion fulfils its purpose. 
> 
> It might be worthwhile to also check for errors on close in
> SaveSlotToPath().

Thanks for the feedback, added.  I have spent some time
double-checking this stuff, and noticed that the new errors in
StartupReplicationOrigin() and CheckPointReplicationOrigin() should be
switched from ERROR to PANIC to be consistent.  One message in
dsm_impl_mmap() was not consistent either.

Are there any objections if I commit this patch?
--
Michael
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 9905593661..7b39283c89 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -1991,7 +1991,10 @@ qtext_load_file(Size *buffer_size)
 		return NULL;
 	}
 
-	CloseTransientFile(fd);
+	if (CloseTransientFile(fd))
+		ereport(LOG,
+(errcode_for_file_access(),
+ errmsg("could not close file \"%s\": %m", PGSS_TEXT_FILE)));
 
 	*buffer_size = stat.st_size;
 	return buf;
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index f5cf9ffc9c..bce4274362 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -1202,7 +1202,10 @@ heap_xlog_logical_rewrite(XLogReaderState *r)
  errmsg("could not fsync file \"%s\": %m", path)));
 	pgstat_report_wait_end();
 
-	CloseTransientFile(fd);
+	if (CloseTransientFile(fd))
+		ereport(ERROR,
+(errcode_for_file_access(),
+ errmsg("could not close file \"%s\": %m", path)));
 }
 
 /* ---
@@ -1300,7 +1303,11 @@ CheckPointLogicalRewriteHeap(void)
 		(errcode_for_file_access(),
 		 errmsg("could not fsync file \"%s\": %m", path)));
 			pgstat_report_wait_end();
-			CloseTransientFile(fd);
+
+			if (CloseTransientFile(fd))
+ereport(ERROR,
+		(errcode_for_file_access(),
+		 errmsg("could not close file \"%s\": %m", path)));
 		}
 	}
 	FreeDir(mappings_dir);
diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index 3623352b9c..974d42fc86 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -599,7 +599,7 @@ SimpleLruDoesPhysicalPageExist(SlruCtl ctl, int pageno)
 
 	SlruFileName(ctl, path, segno);
 
-	fd = OpenTransientFile(path, O_RDWR | PG_BINARY);
+	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
 	if (fd < 0)
 	{
 		/* expected: file doesn't exist */
@@ -621,7 +621,13 @@ SimpleLruDoesPhysicalPageExist(SlruCtl ctl, int pageno)
 
 	result = endpos >= (off_t) (offset + BLCKSZ);
 
-	CloseTransientFile(fd);
+	if (CloseTransientFile(fd))
+	{
+		slru_errcause = SLRU_CLOSE_FAILED;
+		slru_errno = errno;
+		return false;
+	}
+
 	return result;
 }
 
@@ -654,7 +660,7 @@ SlruPhysicalReadPage(SlruCtl ctl, int pageno, int slotno)
 	 * SlruPhysicalWritePage).  Hence, if we are InRecovery, allow the case
 	 * where the file doesn't exist, and return zeroes instead.
 	 */
-	fd = OpenTransientFile(path, O_RDWR | PG_BINARY);
+	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
 	if (fd < 0)
 	{
 		if (errno != ENOENT || !InRecovery)
diff --git a/src/backend/access/transam/timeline.c b/src/backend/access/transam/timeline.c
index c96c8b60ba..cbd9b2cee1 100644
--- a/src/backend/access/transam/timeline.c
+++ b/src/backend/access/transam/timeline.c
@@ -370,7 +370,11 @@ writeTimeLineHistory(TimeLineID newTLI, TimeLineID parentTLI,
 			}
 			pgstat_report_wait_end();
 		}
-		CloseTransientFile(srcfd);
+
+		if (CloseTransientFile(srcfd))
+			ereport(ERROR,
+	(errcode_for_file_access(),
+	 errmsg("could not close file \"%s\": %m", path)));
 	}
 
 	/*
@@ -416,7 +420,6 @@ writeTimeLineHistory(TimeLineID newTLI, TimeLineID parentTLI,
 (errcode_for_file_access(),
  errmsg("could not close file \"%s\": %m", tmppath)));
 
-
 	/*
 	 * Now move the completed history file into place with its final name.
 	 */
@@ -495,7 +498,6 @@ writeTimeLineHistoryFile(TimeLineID tli, char *content, int size)
 (errcode_for_file_access(),
  errmsg("could not close file \"%s\": %m", tmppath)));
 
-
 	/*
 	 * Now move the completed history file into place with its final name.
 	 */
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 64679dd2de..21986e48fe 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1297,7 +1297,11 @@ ReadTwoPhaseFile(TransactionId xid, bool missing_ok)
 	}
 
 	pgstat_report_wait_end();
-	CloseTransientFile(fd);
+
+	if (CloseTransientFile(fd))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not close file \"%s\": %m", path)));
 
 	hdr = (TwoPhaseFileHeader *) buf;
 	if (hdr->magic != TWOPHASE_MAGIC)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 0fdd82a287..c7047738b6 100644
--- 

Re: [bug fix] Produce a crash dump before main() on Windows

2019-03-06 Thread Kyotaro HORIGUCHI
Hello.

At Tue, 5 Mar 2019 16:45:53 +1100, Haribabu Kommi  
wrote in 
> > I don't have an idea about which is better behavior, but does
> > this work for you?
> >
> >
> > https://docs.microsoft.com/en-us/windows/desktop/wer/collecting-user-mode-dumps
> >
> 
> No, this option is not generating local dumps for postgresql, but this
> option is useful to
> generate dumps for the applications that don't have a specific WER
> reporting.

Mmm. Right. It just turns on WER for specific programs. It
doesn't override SetErrorMode(). Sorry for the noise.


> Adding some doc changes for users, to refer to what the issue on
> Windows can be if the PostgreSQL server doesn't start and there is no
> core file available.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Online verification of checksums

2019-03-06 Thread Michael Paquier
On Wed, Mar 06, 2019 at 08:53:57PM +0100, Tomas Vondra wrote:
> Not sure. AFAICS that would require a single transaction, and if we
> happen to add some sort of throttling (which is a feature request I'd
> expect pretty soon to make it usable on live clusters) that might be
> quite long-running. So, not great.
> 
> If we want to run it from the server itself, then I guess a background
> worker would be a better solution. Incidentally, that's something I've
> been toying with some time ago, see [1].

It does not prevent having a SQL function which acts as a wrapper on
top of the whole routine logic, does it?  I think that it would be
nice to have the possibility to target a specific relation and a
specific page, as well as being able to fully check a relation at
once.  It gets easier to check page ranges this way, and the
throttling can be part of the function doing a full-relation check.
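(Just to sketch the shape such a SQL-level wrapper could take — the
function names and signatures below are purely hypothetical, not an
existing API:)

```sql
-- Hypothetical names, for illustration only:
-- verify a single page of one relation
SELECT pg_verify_checksum_page('my_table'::regclass, 0);
-- verify a whole relation; throttling could live inside the function
SELECT pg_verify_checksum_relation('my_table'::regclass);
```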
--
Michael


signature.asc
Description: PGP signature


Re: few more wait events to add to docs

2019-03-06 Thread Michael Paquier
On Wed, Mar 06, 2019 at 11:08:12AM -0800, Jeremy Schneider wrote:
> LWLock order in documentation:
> 1) CamelCase LWLocks: individually named - see lwlocknames.txt
> 2) lowercase LWLocks: tranches
> 2a) SLRUs - see SimpleLruInit() callers on doxygen
> 2b) Shared Buffer (buffer_content, buffer_io)
> 2c) Individually Named - see RegisterLWLockTranches() in lwlock.c

Hm, OK.  Perhaps I lack some user insight on the matter.  Thanks for
the feedback!  Still, there are some areas where we could make the
micro-ordering better.  For example, replication_slot_io and buffer_io
are I/O-specific, yet they end up in the middle of the page, while we
want buffer_content close by.

One thing that I think we could do is reorganize the section for
"Lock" at least alphabetically.  Each item does not really rely on the
others.  What do you think?

> If anything, I think we might just want to add comments to
> RegisterLWLockTranches() and lwlocknames.txt with links to the doc file
> that needs to be updated whenever a new tranche is added.

Yes, that would surely help.

> Not sure of the best place for a comment on SLRUs (is SimpleLruInit a
> good place?)... but I'm kind of hopeful that we're not adding many more
> new SLRUs anyway, and that people will bias toward leveraging the buffer
> cache when possible.

A reference at the top of SimpleLruInit() sounds good to me.
--
Michael


signature.asc
Description: PGP signature


Re: pg_dump is broken for partition tablespaces

2019-03-06 Thread David Rowley
On Thu, 7 Mar 2019 at 11:37, Andres Freund  wrote:
>
> On 2019-03-07 11:31:15 +1300, David Rowley wrote:
> > Do you think it's fine to reword the docs to make this point more
> > clear, or do you see this as a fundamental problem with the patch?
>
> Hm, both? I mean I wouldn't necessarily characterize it as "fundamental"
> problem, but ...

Okay, so if I understand you correctly, you're complaining about the
fact that if the user does:

CREATE TABLE p (a int) PARTITION BY LIST(a) TABLESPACE pg_default;

that the user intended that all future partitions go to pg_default and
not whatever default_tablespace is set to at the time?

If so, that seems like a genuine concern.

I see in heap_create() we do;

/*
* Never allow a pg_class entry to explicitly specify the database's
* default tablespace in reltablespace; force it to zero instead. This
* ensures that if the database is cloned with a different default
* tablespace, the pg_class entry will still match where CREATE DATABASE
* will put the physically copied relation.
*
* Yes, this is a bit of a hack.
*/
if (reltablespace == MyDatabaseTableSpace)
reltablespace = InvalidOid;

which will zero pg_class.reltablespace if the specified tablespace
happens to match pg_database.dattablespace. This causes future
partitions to think that no tablespace was specified and therefore
DefineRelation() consults the default_tablespace.

I see that this same problem exists for partitioned indexes too:

create table listp (a int) partition by list(a);
create index on listp (a) tablespace pg_default;
set default_tablespace = n;
create table listp1 partition of listp for values in(1);
\d listp1
                Table "public.listp1"
 Column |  Type   | Collation | Nullable | Default
--------+---------+-----------+----------+---------
 a      | integer |           |          |
Partition of: listp FOR VALUES IN (1)
Indexes:
    "listp1_a_idx" btree (a), tablespace "n"
Tablespace: "n"

If I understand what you're saying correctly, then the listp1_a_idx
should have been created in pg_default since that's what the default
partitioned index tablespace was set to.

> I don't think the argument that the user intended to explicitly set a
> tablespace holds much water if it was just set via default_tablespace,
> rather than an explicit TABLESPACE.  I think iff you really want
> something like this feature, you'd have to mark a partition's
> reltablespace as 0 unless an *explicit* assignment of the tablespace
> happened. In which case you also would need to explicitly emit a
> TABLESPACE for the partitioned table in pg_dump, to restore that.

I think emitting an explicit tablespace in pg_dump for partitioned
tables (when non-zero) might have issues for pg_restore's
--no-tablespaces option.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: pgsql: tableam: introduce table AM infrastructure.

2019-03-06 Thread Michael Paquier
On Wed, Mar 06, 2019 at 03:03:44PM -0300, Alvaro Herrera wrote:
> On 2019-Mar-06, Andres Freund wrote:
>> tableam: introduce table AM infrastructure.
> 
> Thanks for doing this!!

+1.
--
Michael


signature.asc
Description: PGP signature

