Re: range_agg

2019-08-20 Thread Jeff Davis
On Sat, 2019-08-17 at 10:47 -0700, Paul A Jungwirth wrote:
> So I'm wondering how seriously I should take this for multiranges? I
> guess if a range type did support typmods, it would just delegate to
> the underlying element type for their meaning, and so a multirange
> should delegate it too? Is there any historical discussion around
> typemods on range types?

I did find a few references:


https://www.postgresql.org/message-id/1288029716.8645.4.camel%40jdavis-ux.asterdata.local

https://www.postgresql.org/message-id/2011091334.GB11603%40fetter.org
https://www.postgresql.org/message-id/1296974485.27157.136.camel@jdavis

I'd be interested in ways that we can use a typmod-like concept to
improve the type system. Unfortunately, typmod is just not
sophisticated enough to do very much because it's lost through function
calls. Improving that would be a separate and challenging project.

So, I wouldn't spend a lot of time on typmod for multiranges.

Regards,
Jeff Davis






Re: POC: Cleaning up orphaned files using undo logs

2019-08-20 Thread Robert Haas
On Tue, Aug 13, 2019 at 8:11 AM Robert Haas  wrote:
> > We can probably check the fxid queue and error queue to get that
> > value.  However, I am not sure if that is sufficient because incase we
> > perform the request in the foreground, it won't be present in queues.
>
> Oh, I forgot about that requirement.  I think I can fix it so it does
> that fairly easily, but it will require a little bit of redesign which
> I won't have time to do this week.

Here's a version with a quick (possibly buggy) prototype of the
oldest-FXID support.  It also includes a bunch of comment changes,
pgindent, and a few other tweaks.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


v2-0001-New-undo-request-manager-now-with-UndoRequestMana.patch
Description: Binary data


Re: Serialization questions

2019-08-20 Thread Richard Guo
On Wed, Aug 21, 2019 at 9:30 AM Alex  wrote:

>
> first issue "set default_transaction_isolation to 'serializable';" on the
> both sessions,  then run:
>
> Session 1:   begin;  select * from t;  (2 rows selected);
> Session 2:   delete from t;   (committed automatically)
> Session 1:  commit;  (commit successfully).
>
> It looks like the reads in session 1 have no impact on session 2 at all,
> which conflicts with the documentation
>


This behavior makes sense to me. The effect is the same as executing the
two sessions in a serial order: session 1 first, then session 2.

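For contrast, here is a minimal sketch (hypothetical table and data, with
default_transaction_isolation still 'serializable' in both sessions) of a
case where the conflict monitoring does fire, because each session writes
something the other one read:

create table doctors (name text primary key, on_call bool);
insert into doctors values ('alice', true), ('bob', true);

Session 1:   begin;  select count(*) from doctors where on_call;  (sees 2)
Session 2:   begin;  select count(*) from doctors where on_call;  (sees 2)
Session 1:   update doctors set on_call = false where name = 'alice';  commit;
Session 2:   update doctors set on_call = false where name = 'bob';  commit;
             (fails, at the update or at commit, with "could not serialize
             access due to read/write dependencies among transactions")

Here no serial order can explain both results, so one transaction must be
rolled back.  In your example only session 2 wrote anything, so putting
session 1 before session 2 explains everything and no error is needed.
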
Thanks
Richard


Re: Add "password_protocol" connection parameter to libpq

2019-08-20 Thread Jeff Davis
On Mon, 2019-08-19 at 14:51 +0900, Michael Paquier wrote:
> On Fri, Aug 16, 2019 at 02:11:57PM -0400, Jonathan S. Katz wrote:
> > To be pedantic, +1 on the channel_binding param.
> 
> Seems like we are moving in this direction then.  I don't object to
> the introduction of this parameter.

OK, new patch attached. Seems like everyone is in agreement that we
need a channel_binding param.

Regards,
Jeff Davis

From 0eb8e76d08a64979c3070761efab95d61c4ff887 Mon Sep 17 00:00:00 2001
From: Jeff Davis 
Date: Tue, 20 Aug 2019 18:59:23 -0700
Subject: [PATCH] Add libpq parameter 'channel_binding'.

Allow clients to require channel binding to enhance security against
untrusted servers.

Author: Jeff Davis
Discussion: https://postgr.es/m/227015d8417f2b4fef03f8966dbfa5cbcc4f44da.camel%40j-davis.com
---
 doc/src/sgml/libpq.sgml   | 19 +++
 src/interfaces/libpq/fe-auth.c| 30 +-
 src/interfaces/libpq/fe-connect.c | 28 
 src/interfaces/libpq/libpq-int.h  |  2 ++
 4 files changed, 78 insertions(+), 1 deletion(-)

diff --git a/doc/src/sgml/libpq.sgml b/doc/src/sgml/libpq.sgml
index e7295abda28..1785cb1bcc5 100644
--- a/doc/src/sgml/libpq.sgml
+++ b/doc/src/sgml/libpq.sgml
@@ -1118,6 +1118,25 @@ postgresql://%2Fvar%2Flib%2Fpostgresql/dbname
   
  
 
+ 
+  channel_binding
+  
+  
+A setting of require means that the connection must
+employ channel binding; and that the client will not respond to
+requests from the server for cleartext password or MD5
+authentication. The default setting is prefer,
+which employs channel binding if available.
+  
+  
+Channel binding is a method for the server to authenticate itself to
+the client. It is only supported over SSL connections
+with PostgreSQL 11.0 or later servers using
+the scram-sha-256 authentication method.
+  
+  
+ 
+
  
   connect_timeout
   
diff --git a/src/interfaces/libpq/fe-auth.c b/src/interfaces/libpq/fe-auth.c
index ab227421b3b..5df30913337 100644
--- a/src/interfaces/libpq/fe-auth.c
+++ b/src/interfaces/libpq/fe-auth.c
@@ -423,6 +423,13 @@ pg_SASL_init(PGconn *conn, int payloadlen)
 
 	initPQExpBuffer(&mechanism_buf);
 
+	if (conn->channel_binding[0] == 'r' && !conn->ssl_in_use)
+	{
+		printfPQExpBuffer(&conn->errorMessage,
+		  libpq_gettext("Channel binding required, but SSL not in use\n"));
+		goto error;
+	}
+
 	if (conn->sasl_state)
 	{
 		printfPQExpBuffer(&conn->errorMessage,
@@ -488,7 +495,8 @@ pg_SASL_init(PGconn *conn, int payloadlen)
 goto error;
 			}
 		}
-		else if (strcmp(mechanism_buf.data, SCRAM_SHA_256_NAME) == 0 &&
+		else if (conn->channel_binding[0] != 'r' &&
+ strcmp(mechanism_buf.data, SCRAM_SHA_256_NAME) == 0 &&
  !selected_mechanism)
 			selected_mechanism = SCRAM_SHA_256_NAME;
 	}
@@ -565,6 +573,9 @@ pg_SASL_init(PGconn *conn, int payloadlen)
 	if (initialresponse)
 		free(initialresponse);
 
+	if (strcmp(selected_mechanism, SCRAM_SHA_256_PLUS_NAME) == 0)
+		conn->channel_bound = true;
+
 	return STATUS_OK;
 
 error:
@@ -791,6 +802,12 @@ pg_fe_sendauth(AuthRequest areq, int payloadlen, PGconn *conn)
 	switch (areq)
 	{
 		case AUTH_REQ_OK:
+			if (conn->channel_binding[0] == 'r' && !conn->channel_bound)
+			{
+printfPQExpBuffer(&conn->errorMessage,
+  libpq_gettext("Channel binding required but not offered by server\n"));
+return STATUS_ERROR;
+			}
 			break;
 
 		case AUTH_REQ_KRB4:
@@ -919,6 +936,17 @@ pg_fe_sendauth(AuthRequest areq, int payloadlen, PGconn *conn)
 			{
 char	   *password;
 
+if (conn->channel_binding[0] == 'r')
+{
+	if (areq == AUTH_REQ_PASSWORD)
+		printfPQExpBuffer(&conn->errorMessage,
+		  libpq_gettext("Channel binding required but server requested cleartext password authentication\n"));
+	else
+		printfPQExpBuffer(&conn->errorMessage,
+		  libpq_gettext("Channel binding required but server requested MD5 authentication\n"));
+	return STATUS_ERROR;
+}
+
 conn->password_needed = true;
 password = conn->connhost[conn->whichhost].password;
 if (password == NULL)
diff --git a/src/interfaces/libpq/fe-connect.c b/src/interfaces/libpq/fe-connect.c
index fa5af18ffac..e9627c06ec1 100644
--- a/src/interfaces/libpq/fe-connect.c
+++ b/src/interfaces/libpq/fe-connect.c
@@ -124,6 +124,7 @@ static int	ldapServiceLookup(const char *purl, PQconninfoOption *options,
 #define DefaultTty		""
 #define DefaultOption	""
 #define DefaultAuthtype		  ""
+#define DefaultChannelBinding	"prefer"
 #define DefaultTargetSessionAttrs	"any"
 #ifdef USE_SSL
 #define DefaultSSLMode "prefer"
@@ -211,6 +212,10 @@ static const internalPQconninfoOption PQconninfoOptions[] = {
 		"Database-Password-File", "", 64,
 	offsetof(struct pg_conn, pgpassfile)},
 
+	{"channel_binding", "PGCHANNELBINDING", NULL, NULL,
+	 "Channel-Binding", "", 7, /* sizeof("require") */
+	offsetof(struct 

Re: Serialization questions

2019-08-20 Thread Tatsuo Ishii
> Before understanding how postgres implements the serializable isolation
> level (I have see many paper related to it), I have question about how it
> should be.
> 
> 
> I mainly read the ideas from
> https://www.postgresql.org/docs/11/transaction-iso.html.
> 
> 
> In fact, this isolation level works exactly the same as Repeatable Read
> except that it monitors for conditions which could make execution of a
> concurrent set of serializable transactions behave in a manner inconsistent
> with all possible serial (one at a time) executions of those transactions.
> 
> 
> in repeatable read,  every statement will use the transaction start
> timestamp,  so is it in serializable isolation level?
> 
> 
> When relying on Serializable transactions to prevent anomalies, it is
> important that any data read from a permanent user table not be considered
> valid until the transaction which read it has successfully committed. This
> is true even for read-only transactions ...
> 
> 
> What does the "not be considered valid" mean?  and if it is a read-only
> transaction (assume T1),  I think it is ok to let other transaction do
> anything with the read set of T1, since it is invisible to T1(use the
> transaction start time as statement timestamp).

There are some test cases and a link to the paper explaining the
read-only transaction anomaly in the source tree:

src/test/isolation/specs/read-only-anomaly-2.spec
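
For readers who do not want to dig up the paper, here is a condensed version
of the scenario those specs exercise (illustrative table and amounts, not the
exact spec contents):

create table bank_account (id text primary key, balance int);
insert into bank_account values ('checking', 0), ('savings', 0);

T2 (serializable):            select sum(balance) from bank_account;  -- sees 0
T1 (serializable):            update bank_account set balance = balance + 20
                              where id = 'savings';  commit;
T3 (serializable, read only): select * from bank_account;
                              -- sees checking = 0, savings = 20;  commit;
T2:                           update bank_account set balance = balance - 11
                              where id = 'checking';
                              -- withdraw 10 plus a fee of 1, because the
                              -- total T2 saw was 0;  commit;

If all three were allowed to commit, no serial order of T1, T2, T3 could
produce both T3's output and T2's overdraft fee, so under SERIALIZABLE one of
them is rolled back with a serialization failure.  That is why the results of
even a read-only transaction should not be considered valid until it has
committed.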

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp




Re: POC: Cleaning up orphaned files using undo logs

2019-08-20 Thread Robert Haas
On Mon, Aug 19, 2019 at 2:04 AM Dilip Kumar  wrote:
> Currently, In UnpackedUndoRecord we store all members directly which
> are set by the caller.  We store pointers to some header which are
> allocated internally by the undo layer and the caller need not worry
> about setting them.  So now you are suggesting to put other headers
> also as structures in UnpackedUndoRecord.  I as such don't have much
> problem in doing that but I think initially Robert designed
> UnpackedUndoRecord structure this way so it will be good if Robert
> provides his opinion on this.

I don't believe that's what is being suggested.  It seems to me that
the thing Andres is complaining about here has roots in the original
sketch that I did for this code.  The oldest version I can find is
here:

https://github.com/EnterpriseDB/zheap/commit/7d194824a18f0c5e85c92451beab4bc6f044254c

In this version, and I think still in the current version, there is a
two-stage marshaling strategy.  First, the individual fields from the
UnpackedUndoRecord get copied into global variables (yes, that was my
fault, too, at least in part!) which are structures. Then, the
structures get copied into the target buffer. The idea of that design
was to keep the code simple, but it didn't really work out, because
things got a lot more complicated between the time I wrote those 3244
lines of code and the >3000 lines of code that live in this patch
today. One thing that changed was that we moved more and more in the
direction of considering individual fields as separate objects to be
separately included or excluded, whereas when I wrote that code I
thought we were going to have groups of related fields that stood or
fell together. That idea turned out to be wrong. (There is the
even-larger question here of whether we ought to take Heikki's
suggestion and make this whole thing a lot more generic, but let's
start by discussing how the design that we have today could be
better-implemented.)

If I understand Andres correctly, he's arguing that we ought to get
rid of the two-stage marshaling strategy.  During decoding, he wants
data to go directly from the buffer that contains it to the
UnpackedUndoRecord without ever being stored in the UnpackUndoContext.
During insertion, he wants data to go directly from the
UnpackedUndoRecord to the buffer that contains it.  Or, at least, if
there has to be an intermediate format, he wants it to be just a chunk
of raw bytes, rather than a bunch of individual fields like we have in
UndoPackContext currently.  I think that's a reasonable goal.  I'm not
as concerned about it as he is from a performance point of view, but I
think it would make the code look nicer, and that would be good.  If
we save CPU cycles along the way, that is also good.

In broad outline, what this means is:

1. Any field in the UndoPackContext that starts with urec_ goes away.
2. Instead of something like InsertUndoBytes((char *)
&(ucontext->urec_fxid), ...) we'd write InsertUndoBytes((char *)
&uur->uur_fxid, ...).
3. Similarly instead of ReadUndoBytes((char *) &ucontext->urec_fxid,
...) we'd write ReadUndoBytes((char *) &uur->uur_fxid, ...).
4. It seems slightly trickier to handle the cases where we've got a
structure instead of individual fields, like urec_hd.  But those could
be broken down into field-by-field reads and writes, e.g. in this case
one call for urec_type and a second for urec_info.
5. For uur_group and uur_logswitch, the code would need to allocate
those subsidiary structures before copying into them.

To me, that seems like it ought to be a pretty straightforward change
that just makes things simpler.  We'd probably need to pass the
UnpackedUndoRecord to BeginUnpackUndo instead of FinishUnpackUndo, and
keep a pointer to it in the UnpackUndoContext, but that seems fine.
FinishUnpackUndo would end up just about empty, maybe entirely empty.

Is that a reasonable idea?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Cleanup isolation specs from unused steps

2019-08-20 Thread Michael Paquier
On Tue, Aug 20, 2019 at 09:54:56AM -0400, Alvaro Herrera wrote:
> On 2019-Aug-20, Tom Lane wrote:
>> If you can warn in both cases, that'd be OK perhaps.  But Alvaro's
>> description of the intended use of dry-run makes it sound like
>> it would be expected for there to be unreferenced steps, since there'd
>> be no permutations yet in the input.

v2 of the patch warns of any unused steps in dry-run mode.  If no
permutations are defined in the input spec, all steps get used
(actually that's not a factorial as the per-session ordering is
preserved), so I would expect no warnings to be generated and the
patch does that.  If the test includes some permutations, then I would
expect dry-run to complain about the steps which are defined in the
spec but not used.  The patch also does that.  Do you see a problem
with that?

> Well, Heikki/Kevin's original intention was that if no permutations are
> specified, then all possible permutations are generated internally and
> the spec is run with that.  The intended use of dry-run was to do just
> that (generate all possible permutations) and print that list, so that
> it could be trimmed down by the test author.  In this mode of operation,
> all steps are always used, so there'd be no warning printed.  However,
> when a test file has a largish number of steps then the list is, um, a
> bit long.  Before the deadlock-test hacking, you could run with such a
> list anyway and any permutations that caused a blockage would be
> reported right away as an invalid permutation -- quick enough.
> Currently it sleeps for absurdly long on those cases, so this is no
> longer feasible.
> 
> This is why I say that the current dry-run mode could be removed with no
> loss of useful functionality.

Hmm.  Even if one does not do something deadlock-specific, the list
printed could still be useful, no?  This for example works now that I
look at it:
./isolationtester -n < specs/my_spec.spec

I am wondering if we should not actually keep dry_run, but rename it
to something like --print-permutations to print the set of
permutations which would be run as part of the spec, and also have an
option which is able to print out all permutations possible, like
--print-all-permutations.  Simply ripping out the mode would be fine
by me as it does not seem to be used, but keeping it around does not
induce really much extra maintenance cost.
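
For concreteness, with a toy spec like this (session and step names
invented):

setup { CREATE TABLE t (x int); }
teardown { DROP TABLE t; }

session "s1"
step "s1_ins" { INSERT INTO t VALUES (1); }
step "s1_sel" { SELECT * FROM t; }

session "s2"
step "s2_del" { DELETE FROM t; }

permutation "s1_ins" "s2_del"

the patched dry-run would warn that "s1_sel" is defined but never used,
while the same spec with the permutation line removed would generate all
orderings, use every step, and stay silent.
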
--
Michael




Re: Serialization questions

2019-08-20 Thread Alex
On Tue, Aug 20, 2019 at 4:47 PM Alex  wrote:

> Before understanding how postgres implements the serializable isolation
> level (I have see many paper related to it), I have question about how it
> should be.
>
>
> I mainly read the ideas from
> https://www.postgresql.org/docs/11/transaction-iso.html.
>
>
> In fact, this isolation level works exactly the same as Repeatable Read
> except that it monitors for conditions which could make execution of a
> concurrent set of serializable transactions behave in a manner inconsistent
> with all possible serial (one at a time) executions of those transactions.
>
>
>
> in repeatable read,  every statement will use the transaction start
> timestamp,  so is it in serializable isolation level?
>
>
> When relying on Serializable transactions to prevent anomalies, it is
> important that any data read from a permanent user table not be considered
> valid until the transaction which read it has successfully committed. This
> is true even for read-only transactions ...
>
>
> What does the "not be considered valid" mean?  and if it is a read-only
> transaction (assume T1),  I think it is ok to let other transaction do
> anything with the read set of T1, since it is invisible to T1(use the
> transaction start time as statement timestamp).
>

first issue "set default_transaction_isolation to 'serializable';" on the
both sessions,  then run:

Session 1:   begin;  select * from t;  (2 rows selected);
Session 2:   delete from t;   (committed automatically)
Session 1:  commit;  (commit successfully).

It looks like the reads in session 1 have no impact on session 2 at all,
which conflicts with the documentation.


> Thanks
>


Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions

2019-08-20 Thread Ian Barwick

On 8/16/19 12:22 AM, Tom Lane wrote:
> Stephen Frost  writes:
>> * Tom Lane (t...@sss.pgh.pa.us) wrote:
>>> In hopes of moving this along, I've pushed Ian's last code change,
>>> as there seems to be no real argument about that anymore.
>>>
>>> As for the doc changes, how about the attached revision of what
>>> I wrote previously?  It gives some passing mention to what ALTER
>>> SYSTEM will do, without belaboring it or going into things that
>>> are really implementation details.
>>
>> It's certainly better than what we have now.
>
> Hearing no other comments, I've pushed that and marked this issue closed.

Thanks!


Regards

Ian Barwick


--
 Ian Barwick   https://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: REL_12_STABLE crashing with assertion failure in ExtractReplicaIdentity

2019-08-20 Thread Andres Freund
Hi,

On 2019-08-17 01:43:45 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > On 2019-08-16 09:44:15 -0700, Hadi Moshayedi wrote:
> >> It seems that sometimes when DELETE cascades to referencing tables we fail
> >> to acquire locks on replica identity index.
>
> > I suspect this "always" has been broken, it's just that we previously
> > didn't have checks in place that detect these cases. I don't think it's
> > likely to cause actual harm, due to the locking on the table itself when
> > dropping indexes etc.  But we still should fix it.
>
> Yeah ... see the discussion leading up to 9c703c169,
>
> https://www.postgresql.org/message-id/flat/19465.1541636036%40sss.pgh.pa.us
>
> We didn't pull the trigger on removing the CMD_DELETE exception here,
> but I think the handwriting has been on the wall for some time.

ISTM there's a few different options here:

1a) We build all index infos, unconditionally. As argued in the thread
you reference, future tableams may eventually require that anyway,
by doing more proactive index maintenance somehow. Currently there's
however no support for such AMs via tableam (mostly because I wasn't
sure how exactly that'd look, and none of the already in-development
AMs needed it).

2a) We separate acquisition of index locks from ExecOpenIndices(), and
acquire index locks even for CMD_DELETE. Do so either during
executor startup, or as part of AcquireExecutorLocks() (the latter
on the basis that parsing/planning would have required the locks
already).

There's also corresponding *b) options, where we only do additional work
for CMD_DELETE if wal_level = logical, and the table has a replica
identity requiring use of the index during deletions. But I think
that's clearly enough a bad idea that we can just dismiss it out of
hand.


3)  Remove the CheckRelationLockedByMe() assert from
ExtractReplicaIdentity(), at least for 12. I don't think this is an
all that convincing option, but it'd reduce churn relatively late in
beta.

4)  Add a index_open(RowExclusiveLock) to ExtractReplicaIdentity(). That
seems very unconvincing to me, because we'd do so for every row.


I think there's some appeal in going towards 2), because batching lock
acquisition into a more central place has the chance to yield some
speedups on its own, but more importantly would allow for batched
operations one day. Centralizing lock acquisition also seems like it
might make things easier to understand than today, where a lot of
different parts of the system acquire the locks, even just for
execution.  But it also seems to be likely too invasive for 12 - making
me think that 1a) is the way to go for now.

Comments?

Greetings,

Andres Freund




Re: Fix typos and inconsistencies for HEAD (take 11)

2019-08-20 Thread Alvaro Herrera
On 2019-Aug-19, Michael Paquier wrote:

> On Mon, Aug 19, 2019 at 07:04:04AM +0300, Alexander Lakhin wrote:
> > 11.23 TupleLockUpdate -> LockTupleNoKeyExclusive
> 
> Not sure about this one, so discarded for now.  Alvaro?

Yeah, Alexander's proposed change is correct.  I just pushed it.

> > 11.33 visca -> vice
> 
> This one is interesting latin.

Well, it's a regular Haasism.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Inadequate executor locking of indexes

2019-08-20 Thread Andres Freund
Hi,

On 2018-11-23 17:41:26 +1300, David Rowley wrote:
> Ideally, the locking code would realise we already hold a stronger
> lock and skip the lock, but I don't see how that's realistically
> possible without probing the hash table for all stronger lock types
> first, which would likely damage the performance of locking.

That seems kind of a self-made problem to me. I don't think there's
anything really forcing us to have the locallock hashtable contain the
lockmode.  It'd not be a trivial change, but we could have the locallock
entry have enough information to allow us to avoid hitting the shared
table if we hold a stronger lock already.  The biggest complexity
probably would be that we'd need code to downgrade the shared lock we
currently hold, when the more heavyweight lock is released.


This made me look at LockAcquireExtended() just now. When we acquire a
lock that's weaker than one already held, and there's another backend
waiting for a conflicting lock, that only works if NOWAIT isn't
used. That's because only ProcSleep() gets around to checking whether we
already hold a stronger lock, but LockAcquireExtended() bails out for
NOWAIT long before that.  None of that is documented in
LockAcquireExtended(). Isn't that a bit weird?
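
Spelled out at the SQL level (hypothetical table t):

Session 1:  begin;  lock table t in access exclusive mode;
Session 2:  lock table t in share mode;          -- queues behind session 1
Session 1:  lock table t in row exclusive mode nowait;

The row exclusive request is weaker than the access exclusive lock session 1
already holds, but it conflicts with the share request session 2 is waiting
for, so if I read it correctly the nowait path errors out before ever
reaching the already-hold-a-stronger-lock handling in ProcSleep(), whereas
the same request without NOWAIT is granted immediately.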

Greetings,

Andres Freund




Re: Improve default partition

2019-08-20 Thread Alvaro Herrera
On 2019-Aug-20, Rafia Sabih wrote:

> This does not sound very convenient. I was thinking of having some
> mechanism for such insertions which automatically creates a default
> partition and gives a notice for the user to know that it is going to
> the default partition. Basically, always having a default partition.
> After all default is something that remains there by default, isn't
> it?

There are shortcomings to having a default partition, so I don't think
this would be a universally welcome feature.

I think auto-creation of partitions might be useful in some cases, but
not by default, since there are caveats (possibility of deadlocks), and
in any case not auto-creation of the default partition.

The major problem I see with the default partition is that there's no
way to create a partition that overlaps existing contents of the default
partition, short of blocking everyone.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: errbacktrace

2019-08-20 Thread Peter Eisentraut
On 2019-08-13 15:24, Alvaro Herrera wrote:
> On 2019-Aug-13, Peter Eisentraut wrote:
> 
>> I have changed the configuration setting to backtrace_functions plural,
>> so that you can debug more than one location at once.  I had originally
>> wanted to do that but using existing functions like
>> SplitIdentifierString() resulted in lots of complications with error
>> handling (inside error handling!).  So here I just hand-coded the list
>> splitting.  Seems simple enough.
> 
> Hmm ... but is that the natural way to write this?  I would have thought
> you'd split the list at config-read time (the assign hook for the GUC)
> and turn it into a List of simple strings.  Then you don't have to
> loop strtok() on each errfinish().

The memory management of that seems too complicated.  The "extra"
mechanism of the check/assign hooks only supports one level of malloc.
Using a List seems impossible.  I don't know if you can safely do a
malloc-ed array of malloc-ed strings either.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: POC: Cleaning up orphaned files using undo logs

2019-08-20 Thread Robert Haas
On Tue, Aug 20, 2019 at 5:02 AM Thomas Munro  wrote:
> 3.  UndoLogDiscard() uses DiscardBuffer() to invalidate any currently
> unpinned buffers, and marks as BM_DISCARDED any that happen to be
> pinned right now, so they can't be immediately invalidated.  Such
> buffers are never written back and are eligible for reuse on the next
> clock sweep, even if they're written into by a backend that managed to
> do that when we were trying to discard.

This is definitely more concurrent, but it might be *too much*
concurrency. Suppose that backend #1 is inserting a row and updating
the transaction header for the previous transaction; meanwhile,
backend #2 is discarding the previous transaction. It could happen
that backend #1 locks the transaction header for the previous
transaction and is all set to log the insertion ... but then gets
context-switched out.  Now backend #2 swoops in and logs the discard.
Backend #1 now wakes up and finishes logging a change to a page that,
according to the logic of the WAL stream, no longer exists.

It's probably possible to make this work by ignoring WAL references to
discarded pages during replay, but that seems a bit dangerous. At
least, it loses some sanity checking that you might like to have.

It seems to me that you can avoid this if you require a backend
that wants to set BM_DISCARDED to acquire at least a shared content
lock before doing so.  If you do that, then once a backend acquires
content lock(s) on the page(s) containing the transaction header for
the purposes of updating it, it can notice that the BM_DISCARDED flag
is set and choose not to update those pages after all.  I think that
would be a smart design choice.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: POC: Cleaning up orphaned files using undo logs

2019-08-20 Thread Andres Freund
On 2019-08-20 09:44:23 -0700, Andres Freund wrote:
> On 2019-08-20 21:02:18 +1200, Thomas Munro wrote:
> > Aside from code changes based on review (and I have more to come of
> > those), the attached experimental patchset (also at
> > https://github.com/EnterpriseDB/zheap/tree/undo) has a new protocol
> > that, I hope, allows for better concurrency, reliability and
> > readability, and removes a bunch of TODO notes about questionable
> > interlocking.  However, I'm not quite done figuring out if the bufmgr
> > interaction is right and will be manageable on the undoaccess side, so
> > I'm hoping to get some feedback, not asking for anyone to rebase on
> > top of it yet.
> > 
> > Previously, there were two LWLocks used to make sure that all
> > discarding was prevented while anyone was reading or writing data in
> > any part of an undo log, and (probably more problematically) vice
> > versa.  Here's a new approach that removes that blocking:

Oh, one more point I forgot to add: Cool!




Re: POC: Cleaning up orphaned files using undo logs

2019-08-20 Thread Andres Freund
Hi,

On 2019-08-20 17:11:38 +0530, Dilip Kumar wrote:
> On Wed, Aug 14, 2019 at 10:35 PM Andres Freund  wrote:
> > On 2019-08-14 14:48:07 +0530, Dilip Kumar wrote:
> > > On Wed, Aug 14, 2019 at 12:27 PM Andres Freund  wrote:
> 
> > I don't think we can normally pin the undo buffers properly at that
> > stage. Without knowing the correct contents of the table page - which we
> > can't know without holding some form of lock preventing modifications -
> > we can't know how big our undo records are going to be. And we can't
> > just have buffers that don't exist on disk in shared memory, and we
> > don't want to allocate undo that we then don't need. So I think what
> > we'd have to do at that stage, is to "pre-allocate" buffers for the
> > maximum amount of UNDO needed, but mark the associated bufferdesc as not
> > yet valid. These buffers would have a pincount > 0, but BM_TAG_VALID
> > would not be set.
> >
> > So at the start of a function that will need to insert undo we'd need to
> > pre-reserve the maximum number of buffers we could potentially
> > need. That reservation stage would
> >
> > a) pin the page with the current end of the undo
> > b) if needed pin the page of older undo that we need to update (e.g. to
> >update the next pointer)
> > c) perform clock sweep etc to acquire (find or create) enough clean to
> >hold the maximum amount of undo needed. These buffers would be marked
> >as !BM_TAG_VALID | BUF_REFCOUNT_ONE.
> >
> > I assume that we'd make a) cheap by keeping it pinned for undo logs that
> > a backend is actively attached to. b) should only be needed once in a
> > transaction, so it's not too bad. c) we'd probably need to amortize
> > across multiple undo insertions, by keeping the unused buffers pinned
> > until the end of the transaction.
> >
> > I assume that having the infrastructure c) might also make some code
> > for already in postgres easier. There's obviously some issues around
> > guaranteeing that the maximum number of such buffers isn't high.
> 
> 
> I have analyzed this further, I think there is a problem if the
> record/s will not fit into the current undo log and we will have to
> switch the log.  Because before knowing the actual record length we
> are not sure whether the undo log will switch or not and which undo
> log we will get.  And, without knowing the logno (rnode) how we are
> going to pin the buffers?  Am I missing something?

That's precisely why I was suggesting (at the start of the quoted block
above) to not associate the buffers with pages at that point. Instead
just have clean, pinned, *unassociated* buffers. Which can be
re-associated without any IO.


> Apart from this while analyzing the other code I have noticed that in
> the current PG code we have few occurrences where try to read buffer
> under the buffer lock held.

> 1.
> In gistplacetopage
> {
> ...
> for (; ptr; ptr = ptr->next)
> {
> /* Allocate new page */
> ptr->buffer = gistNewBuffer(rel);
> GISTInitBuffer(ptr->buffer, (is_leaf) ? F_LEAF : 0);
> ptr->page = BufferGetPage(ptr->buffer);
> ptr->block.blkno = BufferGetBlockNumber(ptr->buffer);
> }
> 2. During page split we find new buffer while holding the lock on the
> current buffer.
> 
> That doesn't mean that we can't do better but I am just referring to
> the existing code where we already have such issues.

Those are pretty clearly edge-cases, whereas the undo case at hand is a
very common path. Note again that heapam.c goes to considerable trouble
to never do this for common cases.

Greetings,

Andres Freund




Re: Improve default partition

2019-08-20 Thread Dmitry Dolgov
> On Tue, Aug 20, 2019 at 4:45 PM Rafia Sabih  wrote:
>
> This does not sound very convenient. I was thinking of having some
> mechanism for such insertions which automatically creates a default
> partition and gives a notice for the user to know that it is going to
> the default partition.

If I remember correctly, there is a side effect: it becomes impossible to
create any new partition that intersects with values already present in the
default partition (one has to detach the default partition, move the rows out,
and attach it back). That may not always be a desired outcome.

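For the record, that detach/move/attach dance looks roughly like this
(hypothetical range-partitioned table "measurements" with a logdate column):

begin;
alter table measurements detach partition measurements_default;
create table measurements_2019_08 partition of measurements
    for values from ('2019-08-01') to ('2019-09-01');
insert into measurements select * from measurements_default
    where logdate >= '2019-08-01' and logdate < '2019-09-01';
delete from measurements_default
    where logdate >= '2019-08-01' and logdate < '2019-09-01';
alter table measurements attach partition measurements_default default;
commit;

which holds strong locks on the partitioned table for the whole time.
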
Btw, there is a somewhat similar discussion in the ongoing thread [1].

[1]: 
https://www.postgresql.org/message-id/flat/alpine.DEB.2.21.1907150711080.22273%40lancre




Re: Global temporary tables

2019-08-20 Thread Pavel Stehule
út 20. 8. 2019 v 18:42 odesílatel Konstantin Knizhnik <
k.knizh...@postgrespro.ru> napsal:

>
>
> On 20.08.2019 19:06, Pavel Stehule wrote:
>
>
>
> As I wrote at the beginning of this thread, one of the problems with
>> temporary table sis that it is not possible to use them at replica.
>> Global temp tables allows to share metadata between master and replica.
>>
>
> I am not sure if I understand to last sentence. Global temp tables should
> be replicated on replica servers. But the content should not be replicated.
> This should be session specific.
>
>
> Obviously.
> When we run OLAP queries at replica, it will be great if we can do
>
> insert into temp_table (select ...);
>
> With local temp tables it is not possible just because you can not create
> temp table at replica.
> But global temp table can be created at master and populated with data at
> replica.
>

yes


>
>
>> I perform small investigation: how difficult it will be to support
>> inserts in temp tables at replica.
>> First my impression was that it can be done in tricky but simple way.
>>
>> By making small changes changing just three places:
>> 1.  Prohibit non-select statements in read-only transactions
>> 2. Xid assignment (return FrozenTransactionId)
>> 3. Transaction commit/abort
>>
>> I managed to provide normal work with global temp tables at replica.
>> But there is one problem with this approach: it is not possible to undo
>> changes in temp tables so rollback doesn't work.
>>
>> I tried another solution, but assigning some dummy Xids to standby
>> transactions.
>> But this approach require much more changes:
>> - Initialize page for such transaction in CLOG
>> - Mark transaction as committed/aborted in XCLOG
>> - Change snapshot check in visibility function
>>
>> And still I didn't find safe way to cleanup CLOG space.
>> Alternative solution is to implement "local CLOG" for such transactions.
>> The straightforward solution is to use hashtable. But it may cause memory
>> overflow if we have long living backend which performs huge number of
>> transactions.
>> Also in this case we need to change visibility check functions.
>>
>> So I have implemented simplest solution with frozen xid and force backend
>> termination in case of transaction rollback (so user will no see
>> inconsistent behavior).
>> Attached please find global_private_temp_replica.patch which implements
>> this approach.
>> It will be nice if somebody can suggest better solution for temporary
>> tables at replica.
>>
>
> This is another hard issue. Probably backend temination should be
> acceptable solution. I don't understand well to this area, but if replica
> allows writing (to global temp tables), then replica have to have local
> CLOG.
>
>
> There are several problems:
>
> 1. How to choose XID for writing transaction at standby.  The simplest
> solution is to just add 0x7fff to the current XID.
> It eliminates possibility of conflict with normal XIDs (received from
> master).
> But requires changes in visibility functions. Visibility check function do
> not know OID of tuple owner, just XID stored in the tuple header. It should
> make a decision just based on this XID.
>
> 2. How to perform cleanup of not needed XIDs. Right now there is quite
> complex logic of how to free CLOG pages.
>

> 3. How to implement visibility rules to such XIDs.
>

in theory every session can have own CLOG. When you finish session, you can
truncate this file.

>
>
> CLOG for global temp tables can be more simple then standard CLOG. Data
> are not shared, and life of data (and number of transactions) can be low.
>
> Another solution is wait on ZHeap storage and replica can to have own UNDO
> log.
>
> I thought about implementation of special table access method for
> temporary tables.
>

+1


> I am trying to understand now if  it is the only possible approach or
> there are simpler solutions.
>
>
> --
> Konstantin Knizhnik
> Postgres Professional: http://www.postgrespro.com
> The Russian Postgres Company
>
>


Re: POC: Cleaning up orphaned files using undo logs

2019-08-20 Thread Andres Freund
Hi,

On 2019-08-20 21:02:18 +1200, Thomas Munro wrote:
> Aside from code changes based on review (and I have more to come of
> those), the attached experimental patchset (also at
> https://github.com/EnterpriseDB/zheap/tree/undo) has a new protocol
> that, I hope, allows for better concurrency, reliability and
> readability, and removes a bunch of TODO notes about questionable
> interlocking.  However, I'm not quite done figuring out if the bufmgr
> interaction is right and will be manageable on the undoaccess side, so
> I'm hoping to get some feedback, not asking for anyone to rebase on
> top of it yet.
> 
> Previously, there were two LWLocks used to make sure that all
> discarding was prevented while anyone was reading or writing data in
> any part of an undo log, and (probably more problematically) vice
> versa.  Here's a new approach that removes that blocking:
> 
> 1.  Anyone is allowed to try to read or write data at any UndoRecPtr
> that has been allocated, through the buffer pool (though you'd usually
> want to check it with UndoRecPtrIsDiscarded() first, and only rely on
> the system I'm describing to deal with races).
> 
> 2.  ReadBuffer() might return InvalidBuffer.  This can happen for a
> cache miss, if the smgrread implementation wants to indicate that the
> buffer has been discarded/truncated and that is expected (md.c won't
> ever do that, but undofile.c can).

Hm. This gives me a bit of a stomach ache. It somehow feels like a weird
form of signalling.  Can't quite put my finger on why it makes me feel
queasy.


> 3.  UndoLogDiscard() uses DiscardBuffer() to invalidate any currently
> unpinned buffers, and marks as BM_DISCARDED any that happen to be
> pinned right now, so they can't be immediately invalidated.  Such
> buffers are never written back and are eligible for reuse on the next
> clock sweep, even if they're written into by a backend that managed to
> do that when we were trying to discard.

Hm. When is it legitimate for a backend to write into such a buffer? I
guess that's about updating the previous transaction's next pointer? Or
progress info?


> 5.  Separating begin from discard allows the WAL logging for
> UndoLogDiscard() to do filesystem actions before logging, and other
> effects after logging, which have several nice properties if you work
> through the various crash scenarios.

Hm. ISTM we always need to log before doing some filesystem operation
(see also my recent complaint Robert and I are discussing at the bottom
of [1]). It's just that we can have a separate stage afterwards?

[1] 
https://www.postgresql.org/message-id/CA%2BTgmoZc5JVYORsGYs8YnkSxUC%3DcLQF1Z%2BfcpH2TTKvqkS7MFg%40mail.gmail.com



> So now I'd like to get feedback on the sanity of this scheme.  I'm not
> saying it doesn't have bugs right now -- I've been trying to figure
> out good ways to test it and I'm not quite there yet -- but the
> concept.  One observation I have is that there were already code paths
> in undoaccess.c that can tolerate InvalidBuffer in recovery, due to
> the potentially different discard timing for DO vs REDO.  I think
> that's a point in favour of this scheme, but I can see that it's
> inconvenient to have to deal with InvalidBuffer whenever you read.

FWIW, I'm far from convinced that those are currently quite right. See
discussion pointed to above.



> I pulled in the latest code from undoprocessing as of today, and I
> might be a bit confused about "Defect and enhancement in multi-log
> support" some of which I have squashed into the  make undolog patch.
> BTW undoprocessing builds with initialized variable warnings in xact.c
> on clang today.

I've complained about the existence of that commit multiple times
now. So far without any comments.

Greetings,

Andres Freund




Re: Global temporary tables

2019-08-20 Thread Konstantin Knizhnik



On 20.08.2019 19:06, Pavel Stehule wrote:
>> As I wrote at the beginning of this thread, one of the problems
>> with temporary table sis that it is not possible to use them at
>> replica.
>> Global temp tables allows to share metadata between master and
>> replica.
>
> I am not sure if I understand to last sentence. Global temp tables
> should be replicated on replica servers. But the content should not be
> replicated. This should be session specific.

Obviously.
When we run OLAP queries at replica, it will be great if we can do

insert into temp_table (select ...);

With local temp tables it is not possible just because you can not
create temp table at replica.
But global temp table can be created at master and populated with data
at replica.

>> I perform small investigation: how difficult it will be to support
>> inserts in temp tables at replica.
>> First my impression was that it can be done in tricky but simple way.
>>
>> By making small changes changing just three places:
>> 1.  Prohibit non-select statements in read-only transactions
>> 2. Xid assignment (return FrozenTransactionId)
>> 3. Transaction commit/abort
>>
>> I managed to provide normal work with global temp tables at replica.
>> But there is one problem with this approach: it is not possible to
>> undo changes in temp tables so rollback doesn't work.
>>
>> I tried another solution, but assigning some dummy Xids to standby
>> transactions.
>> But this approach require much more changes:
>> - Initialize page for such transaction in CLOG
>> - Mark transaction as committed/aborted in XCLOG
>> - Change snapshot check in visibility function
>>
>> And still I didn't find safe way to cleanup CLOG space.
>> Alternative solution is to implement "local CLOG" for such
>> transactions.
>> The straightforward solution is to use hashtable. But it may cause
>> memory overflow if we have long living backend which performs huge
>> number of transactions.
>> Also in this case we need to change visibility check functions.
>>
>> So I have implemented simplest solution with frozen xid and force
>> backend termination in case of transaction rollback (so user will
>> no see inconsistent behavior).
>> Attached please find global_private_temp_replica.patch which
>> implements this approach.
>> It will be nice if somebody can suggest better solution for
>> temporary tables at replica.
>
> This is another hard issue. Probably backend temination should be
> acceptable solution. I don't understand well to this area, but if
> replica allows writing (to global temp tables), then replica have to
> have local CLOG.

There are several problems:

1. How to choose XID for writing transaction at standby.  The simplest
solution is to just add 0x7fff to the current XID.
It eliminates possibility of conflict with normal XIDs (received from
master).
But requires changes in visibility functions. Visibility check function
do not know OID of tuple owner, just XID stored in the tuple header. It
should make a decision just based on this XID.

2. How to perform cleanup of not needed XIDs. Right now there is quite
complex logic of how to free CLOG pages.

3. How to implement visibility rules to such XIDs.

> CLOG for global temp tables can be more simple then standard CLOG.
> Data are not shared, and life of data (and number of transactions) can
> be low.
>
> Another solution is wait on ZHeap storage and replica can to have own
> UNDO log.

I thought about implementation of special table access method for
temporary tables.
I am trying to understand now if  it is the only possible approach or
there are simpler solutions.



--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: Global temporary tables

2019-08-20 Thread Pavel Stehule
út 20. 8. 2019 v 16:51 odesílatel Konstantin Knizhnik <
k.knizh...@postgrespro.ru> napsal:

>
>
> On 19.08.2019 18:53, Pavel Stehule wrote:
>
>
>
>
>> Certainly, default (small) temp buffer size plays roles.
>> But it this IPC host this difference is not so important.
>> Result with local temp tables and temp_buffers = 1GB: 859k TPS.
>>
>
> It is little bit unexpected result.I understand so it partially it is
> generic problem access to smaller dedicated caches versus access to bigger
> shared cache.
>
> But it is hard to imagine so access to local cache is 10% slower than
> access to shared cache. Maybe there is some bottle neck - maybe our
> implementation of local buffers are suboptimal.
>
>
> It may be caused by system memory allocator - in case of using shared
> buffers we do not need to ask OS to allocate more memory.
>

maybe, but shared buffers you have a overhead with searching free buffers
and some overhead with synchronization processes.

>
>
> Using local buffers for global temporary tables can be interesting from
> another reason - it uses temporary files, and temporary files can be
> forwarded on ephemeral IO on Amazon cloud (with much better performance
> than persistent IO).
>
>
>
> My assumption is that temporary tables almost always fit in memory. So in
> most cases there is on need to write data to file at all.
>
>
> As I wrote at the beginning of this thread, one of the problems with
> temporary table sis that it is not possible to use them at replica.
> Global temp tables allows to share metadata between master and replica.
>

I am not sure if I understand to last sentence. Global temp tables should
be replicated on replica servers. But the content should not be replicated.
This should be session specific.


> I perform small investigation: how difficult it will be to support inserts
> in temp tables at replica.
> First my impression was that it can be done in tricky but simple way.
>
> By making small changes changing just three places:
> 1.  Prohibit non-select statements in read-only transactions
> 2. Xid assignment (return FrozenTransactionId)
> 3. Transaction commit/abort
>
> I managed to provide normal work with global temp tables at replica.
> But there is one problem with this approach: it is not possible to undo
> changes in temp tables so rollback doesn't work.
>
> I tried another solution, but assigning some dummy Xids to standby
> transactions.
> But this approach require much more changes:
> - Initialize page for such transaction in CLOG
> - Mark transaction as committed/aborted in XCLOG
> - Change snapshot check in visibility function
>
> And still I didn't find safe way to cleanup CLOG space.
> Alternative solution is to implement "local CLOG" for such transactions.
> The straightforward solution is to use hashtable. But it may cause memory
> overflow if we have long living backend which performs huge number of
> transactions.
> Also in this case we need to change visibility check functions.
>
> So I have implemented simplest solution with frozen xid and force backend
> termination in case of transaction rollback (so user will no see
> inconsistent behavior).
> Attached please find global_private_temp_replica.patch which implements
> this approach.
> It will be nice if somebody can suggest better solution for temporary
> tables at replica.
>

This is another hard issue. Probably backend termination should be
acceptable solution. I don't understand well to this area, but if replica
allows writing (to global temp tables), then replica have to have local
CLOG.

CLOG for global temp tables can be more simple then standard CLOG. Data are
not shared, and life of data (and number of transactions) can be low.

Another solution is wait on ZHeap storage and replica can to have own UNDO
log.



>
>
>
>
>
>
>
> --
> Konstantin Knizhnik
> Postgres Professional: http://www.postgrespro.com
> The Russian Postgres Company
>
>


Re: POC: Cleaning up orphaned files using undo logs

2019-08-20 Thread Andres Freund
Hi,

On 2019-08-20 09:08:29 -0400, Robert Haas wrote:
> On Sat, Aug 17, 2019 at 1:28 PM Andres Freund  wrote:
> > The primary one in the context here is that if we do *not* have to lock
> > the buffers all ahead of time, we can simplify the interface. We
> > certainly can't lock the buffers over IO (due to buffer reclaim) as
> > we're doing right now, so we'd need another phase, called by the "user"
> > during undo insertion. But if we do not need to lock the buffers before
> > the insertion over all starts, the inserting location doesn't have to
> > care.
> >
> > Secondarily, all the reasoning for needing to lock all buffers ahead of
> > time was imo fairly unconvincing. Following the "recipe" for WAL
> > insertions is a good idea when writing a new run-of-the-mill WAL
> > inserting location - but when writing a new fundamental facility, that
> > already needs to modify how WAL works, then I find that much less
> > convincing.
> 
> So, you seem to be talking about something here which is different
> than what I thought we were talking about.  One question is whether we
> need to lock all of the buffers "ahead of time," and I think the
> answer to that question is probably "no." Since nobody else can be
> writing to those buffers, and probably also nobody can be reading them
> except maybe for some debugging tool, it should be fine if we enter
> the critical section and then lock them at the point when we write the
> bytes. I mean, there shouldn't be any contention, and I don't see any
> other problems.

Right. As long as we are as restrictive about the number of buffers per
undo record, and number of records per WAL insertions, I don't see any
need to go further than that.


> > Well, in the version of code that I was reviewing here, I don't there is
> > such a limit (there is a limit for buffers per undo record, but no limit
> > on the number of records inserted together). I think Dilip added a limit
> > since.  And we have the issue of a lot of IO happening while holding
> > content locks on several pages.  So I don't think it's a straw man at
> > all.
> 
> Hmm, what do you mean by "a lot of IO happening while holding content
> locks on several pages"? We might XLogInsert() but there shouldn't be
> any buffer I/O going on at that point.

That's my primary complaint with how the code is structured right
now. Right now we potentially perform IO while holding exclusive content
locks, often multiple ones even. When acquiring target pages for undo,
currently always already hold the table page exclusive locked, and if
there's more than one buffer for undo, we'll also hold the previous
buffers locked. And acquiring a buffer will often have to write out a
dirty buffer to the OS, and a lot of times that will then also require
the kernel to flush data out.  That's imo an absolute no-go for the
general case.


> If there is, I think that should be redesigned.  We should collect
> buffer pins first, without locking.  Then lock.  Then write.  Or maybe
> lock-and-write, but only after everything's pinned.

Right. It's easy enough to do that for the locks on undo pages
themselves. The harder part is the content lock on the "table page" - we
don't accurately know how many undo buffers we will need, without
holding the table lock (or preventing modifications in some other
manner).

I tried to outline the problem and potential solutions in more detail
in:
https://www.postgresql.org/message-id/20190814065745.2faw3hirvfhbrdwe%40alap3.anarazel.de


> The fact of calling XLogInsert() while holding buffer locks is not
> great, but I don't think it's any worse here than in any other part of
> the system, because the undo buffers aren't going to be suffering
> concurrent access from any other backend, and because there shouldn't
> be more than a few of them.

Yea. That's obviously undesirable, but also fundamentally required at
least in the general case. And it's not at all specific to undo.

> [ WAL logging protocol ]

> I didn't intend to suggest that you don't want this to be documented.
> What I intended to suggest was that you seem to want to deviate from
> the documented rules, and it seems to me that we shouldn't do that
> unless we change the rules first, and I don't know what you think the
> rules should be or why those rules are safe.

IDK. We have at least five different places that at the very least bend
the rules - but with a comment explaining why it's safe in the specific
case. Personally I don't really think the generic guideline needs to
list every potential edge-case.


> > I think what primarily makes me concerned is that it's not clear to me
> > what guarantees that discard is the only reason for the block to
> > potentially be missing. I contrast to most other similar cases where WAL
> > replay simply re-creates the objects when trying to replay an action
> > affecting such an object, here we simply skip over the WAL logged
> > operation. So if e.g. the entire underlying UNDO file got lost, we
> > neither 

Re: POC: Cleaning up orphaned files using undo logs

2019-08-20 Thread Robert Haas
On Mon, Aug 19, 2019 at 8:22 AM Amit Kapila  wrote:
> One point to remember in this regard is that we do need to modify the
> LSN in undo pages after writing WAL, so all the undo pages need to be
> locked by that time or we again need to take the lock on them.

Uh, but a big part of the point of setting the LSN on the pages is to
keep them from being written out before the corresponding WAL is
flushed to disk. If you released and reacquired the lock, the page
could be written out during the window in the middle.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Change ereport level for QueuePartitionConstraintValidation

2019-08-20 Thread Sergei Kornilov
Hello

> Sergei, can we enlist you to submit a patch for this? Namely reduce the
> log level to DEBUG1 and add a TAP test in src/test/modules/alter_table/
> that verifies that the message is or isn't emitted, as appropriate.

I created this patch.
I test for the presence of the message. I also check for the "verifying table"
message (generated at DEBUG1 from ATRewriteTable), so with manually damaged
logic in NotNullImpliedByRelConstraints or ConstraintImpliedByRelConstraint,
"make check" may still pass but the new test fails during "make check-world",
as we want.

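For anyone reproducing what the test checks by hand, with the patch applied
the psql exchange looks roughly like this (same table and constraint names as
in the test):

set client_min_messages = debug1;
create table atacc1 (test_a int, test_b int);
insert into atacc1 values (1, 2);
alter table atacc1 alter test_a set not null;
-- DEBUG:  verifying table "atacc1"
alter table atacc1 alter test_a drop not null;
alter table atacc1 add constraint atacc1_constr_a_valid
    check (test_a is not null);
alter table atacc1 alter test_a set not null;
-- message: existing constraints on column "atacc1"."test_a" are sufficient
--          to prove that it does not contain nulls
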
regards, Sergei

diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 60d6d7be1b..ab3fc91fae 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -20,6 +20,7 @@ SUBDIRS = \
 		  test_rls_hooks \
 		  test_shm_mq \
 		  unsafe_tests \
-		  worker_spi
+		  worker_spi \
+		  alter_table
 
 $(recurse)
diff --git a/src/test/modules/alter_table/.gitignore b/src/test/modules/alter_table/.gitignore
new file mode 100644
index 00..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/alter_table/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/alter_table/Makefile b/src/test/modules/alter_table/Makefile
new file mode 100644
index 00..17bd882428
--- /dev/null
+++ b/src/test/modules/alter_table/Makefile
@@ -0,0 +1,17 @@
+# src/test/modules/alter_table/Makefile
+
+MODULE = alter_table
+PGFILEDESC = "alter_table - regression testing for some ALTER TABLE commands"
+
+TAP_TESTS = 1
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/alter_table
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/alter_table/t/001_constraint_validation.pl b/src/test/modules/alter_table/t/001_constraint_validation.pl
new file mode 100644
index 00..3c5b65e796
--- /dev/null
+++ b/src/test/modules/alter_table/t/001_constraint_validation.pl
@@ -0,0 +1,213 @@
+# copy of some alter_table test to prove table verification optimisations
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 42;
+use Data::Dumper;
+
+# Initialize master node
+my $node = get_new_node('master');
+$node->init();
+$node->append_conf('postgresql.conf', 'client_min_messages = DEBUG1');
+$node->start;
+
+sub run_sql_command
+{
+	my $stderr;
+	$node->psql(
+		'postgres',
+		shift,
+		stderr=> \$stderr,
+		on_error_die  => 1,
+		on_error_stop => 1
+	);
+	return $stderr
+}
+
+sub is_table_verified
+{
+	return index(shift, 'DEBUG:  verifying table') != -1;
+}
+
+my $output;
+
+note "test alter table set not null";
+
+run_sql_command('create table atacc1 (test_a int, test_b int);');
+run_sql_command('insert into atacc1 values (1, 2);');
+
+$output = run_sql_command('alter table atacc1 alter test_a set not null;');
+ok( is_table_verified($output), 'column test_a without constraint will scan table');
+
+run_sql_command('alter table atacc1 alter test_a drop not null;');
+run_sql_command('alter table atacc1 add constraint atacc1_constr_a_valid check(test_a is not null);');
+
+# normal run will verify table data
+$output = run_sql_command('alter table atacc1 alter test_a set not null;');
+ok(!is_table_verified($output), 'with constraint will not scan table');
+ok( $output =~ m/existing constraints on column "atacc1"."test_a" are sufficient to prove that it does not contain nulls/ , 'test_a proved by constraints');
+
+run_sql_command('alter table atacc1 alter test_a drop not null;');
+
+# we have check only for test_a column, so we need verify table for test_b
+$output = run_sql_command('alter table atacc1 alter test_b set not null, alter test_a set not null;');
+ok( is_table_verified($output), 'table was scanned');
+# we may miss debug message for test_a constraint because we need verify table due test_b
+ok( ! ($output =~ m/existing constraints on column "atacc1"."test_b" are sufficient to prove that it does not contain nulls/ ) , 'test_b not proved by wrong constraints');
+
+run_sql_command('alter table atacc1 alter test_a drop not null, alter test_b drop not null;');
+# both column has check constraints
+run_sql_command('alter table atacc1 add constraint atacc1_constr_b_valid check(test_b is not null);');
+$output = run_sql_command('alter table atacc1 alter test_b set not null, alter test_a set not null;');
+ok(!is_table_verified($output), 'table was not scanned for both columns');
+ok( $output =~ m/existing constraints on column "atacc1"."test_a" are sufficient to prove that it does not contain nulls/ , 'test_a proved by constraints');
+ok( $output =~ m/existing constraints on column "atacc1"."test_b" are sufficient to prove that it does not contain nulls/ , 'test_b proved by constraints');
+run_sql_command('drop table atacc1;');
+
+note "test alter table attach partition";
+
+run_sql_command('CREATE 

Re: pg_upgrade fails with non-standard ACL

2019-08-20 Thread Stephen Frost
Greetings,

* Anastasia Lubennikova (a.lubennik...@postgrespro.ru) wrote:
> 14.08.2019 3:28, Stephen Frost wrote:
> >* Bruce Momjian (br...@momjian.us) wrote:
> >>As much as it would be nice if the release notes covered all that, and
> >>we updated pg_upgrade to somehow handle them, it just isn't realistic.
> >>As we can see here, the problems often take years to show up, and even
> >>then there were probably many other people who had the problem who never
> >>reported it to us.
> >Yeah, the possible changes when you think about column-level privileges
> >as well really gets to be quite large..
> >
> >This is why my thinking is that we should come up with additional
> >default roles, which aren't tied to specific catalog structures but
> >instead are for a more general set of capabilities which we manage and
> >users can either decide to use, or not.  If they decide to work with the
> >individual functions then they have to manage the upgrade process if and
> >when we make changes to those functions.
> 
> Idea of having special roles looks good to me, though, I don't see
> how to define what grants are needed for each role. Let's say, we
> define role "backupuser" that obviously must have grants on
> pg_start_backup()
> and pg_stop_backup(). Should it also have access to pg_is_in_recovery()
> or for example version()?

pg_is_in_recovery() and version() are already allowed to be executed by
public, and I don't see any particular reason to change that, so I don't
believe those would need to be explicitly GRANT'd to this new role.

I would think the specific set would be those listed under:

https://www.postgresql.org/docs/current/functions-admin.html#FUNCTIONS-ADMIN-BACKUP

which currently require superuser access.
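
As a rough sketch of the effect (the role name here is made up, and the
function signatures are the ones from a recent release, so they may differ
across versions; the real thing would presumably be a predefined role set up
at initdb time rather than created by hand):

CREATE ROLE backup_admin NOLOGIN;  -- stand-in for a built-in default role
GRANT EXECUTE ON FUNCTION pg_start_backup(text, boolean, boolean) TO backup_admin;
GRANT EXECUTE ON FUNCTION pg_stop_backup(boolean, boolean) TO backup_admin;
-- a site then only grants the role to its backup user, e.g.:
GRANT backup_admin TO backup;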

This isn't a new idea, btw, there was a great deal of discussion three
years ago around all of this.  Particularly relevant is this:

https://www.postgresql.org/message-id/20160104175516.GC3685%40tamriel.snowman.net

> >It'd be pretty neat if pg_upgrade could connect to the old and new
> >clusters concurrently and then perform a generic catalog comparison
> >between them and identify all objects which have been changed and
> >determine if there's any non-default ACLs or dependencies on the catalog
> >objects which are different between the clusters.  That seems like an
> >awful lot of work though, and I'm not sure there's really any need,
> >given that we don't change the catalog for a given major version- we
> >could just generate the list using some queries whenever we release a
> >new major version and update pg_upgrade with it.
> >
> >>The only positive is that when pg_upgrade does fail, at least we have a
> >>system that clearly points to the cause, but unfortunately usually at
> >>run-time, not at --check time.
> >Getting it to be at check time would certainly be a great improvement.
> >
> >Solving this in pg_upgrade does seem like it's probably the better
> >approach rather than trying to do it in pg_dump.  Unfortunately, that
> >likely means that all we can do is have pg_upgrade point out to the user
> >when something will fail, with recommendations on how to address it, but
> >that's also something users are likely used to and willing to accept,
> >and puts the onus on them to consider their ACL decisions when we're
> >making catalog changes, and it keeps these issues out of pg_dump.
> 
> I wrote a prototype to check API and ACL compatibility (see attachment).
> In the current implementation it fetches the list of system procedures for
> both old and new clusters
> and reports deleted functions or ACL changes during pg_upgrade check:
> 
> 
> Performing Consistency Checks
> -
> ...
> Checking for system functions API compatibility
> dbname postgres : check procsig is equal pg_stop_backup(), procacl not equal
> {anastasia=X/anastasia,backup=X/anastasia} vs {anastasia=X/anastasia}
> dbname postgres : procedure pg_stop_backup(exclusive boolean, OUT lsn
> pg_lsn, OUT labelfile text, OUT spcmapfile text) doesn't exist in
> new_cluster
> dbname postgres : procedure pg_switch_xlog() doesn't exist in new_cluster
> dbname postgres : procedure pg_xlog_replay_pause() doesn't exist in
> new_cluster
> dbname postgres : procedure pg_xlog_replay_resume() doesn't exist in
> new_cluster
> ...
> 
> I think it's a good first step in the right direction.
> Now I have questions about implementation details:
> 
> 1) How exactly should we report this incompatibility to a user?
> I think it's fine to leave the warnings and also write some hint for the
> user by analogy with other checks.
> "Reset ACL on the problem functions to default in the old cluster to
> continue"
> 
> Is it enough?

Not really sure what else we could do there..?  Did you have something
else in mind?  We could possibly provide the specific commands to run,
that seems like about the only other thing we could possibly do.
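
For the pg_stop_backup() case from your output above, the suggested command
would presumably be something along the lines of

REVOKE EXECUTE ON FUNCTION pg_stop_backup() FROM backup;

that is, one REVOKE per function/grantee pair that the check flags.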

> 2) This approach can be extended to other catalog modifications, you
> mentioned above.
> 

Re: understand the pg locks in a simple case

2019-08-20 Thread Alex
On Tue, Aug 20, 2019 at 4:59 PM Heikki Linnakangas  wrote:

> On 20/08/2019 10:23, Alex wrote:
> > I have trouble understanding the pg locks in the following simple
> > situation.
> >
> >
> > Session 1:
> >
> >
> > begin;  update t set a = 1 where a = 10;
> >
> >
> > Session 2:
> >
> >
> > begin;  update t set a = 2 where a = 10;
> >
> >
> > They update the same row and session 2 is blocked by session 1 without
> > surprise.
> >
> >
> > The pretty straightforward implementation would be:
> >
> > Session 1 locks the tuple in *ExclusiveLock* mode.
> >
> > When session 2 tries to lock it in exclusive mode, it is blocked.
> >
> >
> > But when I check pg_locks for session 1, I can see *no tuple
> > lock* there; when I check session 2, I can see a
> > *tuple (ExclusiveLock) is granted*, but it is waiting for a
> > transactionid.
> >
> >
> > Since every tuple has txn information, it is not hard to implement
> > it this way, but is there any benefit over the straightforward way?
> > With the current implementation, what is the point
> > of the tuple (ExclusiveLock) for session 2?
>
> The reason that tuple locking works with XIDs, rather than directly
> acquiring a lock on the tuple, is that the memory allocated for the lock
> manager is limited. One transaction can lock millions of tuples, and if
> it had to hold a normal lock on every tuple, you would run out of memory
> very quickly.
>

Thank you!

So can I understand it as: we don't need a lock on every tuple we update,
since
1) the number of locks may be huge, and if we did so it would consume a lot
of memory;
2) the tuple header, which includes xid info, is unavoidable anyway due to MVCC
requirements, and it can be used here, so we save the individual locks.

And in my example above, when session 2 is waiting for an xid lock, it is
*granted* a tuple lock in ExclusiveLock mode; what is the purpose
of this lock?


> So it may seem that we don't need heavy-weight locks on individual
> tuples at all. But we still need them to establish the order in which backends
> are waiting. The locking protocol is:
>
> 1. Check if a tuple's xmax is set.
> 2. If it's set, obtain a lock on the tuple's TID.
> 3. Wait on the transaction to finish, by trying to acquire lock on the XID.
> 4. Update the tuple, release the lock on the XID, and on the TID.
>
> It gets more complicated if there are multixids, or you update a row you
> have earlier locked in a weaker mode, but that's the gist of it.
>
> We could skip the lock on the tuple's TID, but then if you have multiple
> backends trying to update or lock a row, it would not be
> deterministic who gets the lock first. For example:
>
> Session A: BEGIN; UPDATE foo SET col='a' WHERE id = 123;
> Session B: UPDATE foo SET col='b' WHERE id = 123; 
> Session C: UPDATE foo SET col='c' WHERE id = 123; 
> Session A: ROLLBACK;
>
> Without the lock on the TID, it would be indeterministic, whether
> session B or C gets to update the tuple, when A rolls back. With the
> above locking protocol, B will go first. B will acquire the lock on the
> TID, and block on the XID lock, while C will block on the TID lock held
> by B. If there were more backends trying to do the same, they would
> queue for the TID lock, too.
>
> - Heikki
>


Re: Global temporary tables

2019-08-20 Thread Konstantin Knizhnik



On 19.08.2019 18:53, Pavel Stehule wrote:




Certainly, the default (small) temp buffer size plays a role.
But it this IPC host this difference is not so important.
Result with local temp tables and temp_buffers = 1GB: 859k TPS.


It is a little bit unexpected result. I understand that it is partially a
generic problem of access to smaller dedicated caches versus access to a
bigger shared cache.


But it is hard to imagine that access to a local cache is 10% slower than
access to the shared cache. Maybe there is some bottleneck; maybe our
implementation of local buffers is suboptimal.


It may be caused by the system memory allocator: in the case of shared
buffers we do not need to ask the OS to allocate more memory.




Using local buffers for global temporary tables can be interesting
for another reason: they use temporary files, and temporary files can
be placed on ephemeral IO on the Amazon cloud (with much better
performance than persistent IO).





My assumption is that temporary tables almost always fit in memory. So
in most cases there is no need to write data to a file at all.



As I wrote at the beginning of this thread, one of the problems with
temporary tables is that it is not possible to use them on a replica.

Global temp tables allow sharing metadata between master and replica.
I performed a small investigation into how difficult it would be to support
inserts into temp tables on a replica.

My first impression was that it could be done in a tricky but simple way.

By making small changes in just three places:
1. The check that prohibits non-select statements in read-only transactions
2. Xid assignment (return FrozenTransactionId)
3. Transaction commit/abort

I managed to get global temp tables working normally on a replica.
But there is one problem with this approach: it is not possible to undo
changes in temp tables, so rollback doesn't work.


I tried another solution, assigning some dummy Xids to standby
transactions.

But this approach requires many more changes:
- Initialize a page for such a transaction in the CLOG
- Mark the transaction as committed/aborted in the CLOG
- Change the snapshot check in the visibility functions

And I still didn't find a safe way to clean up the CLOG space.
An alternative solution is to implement a "local CLOG" for such transactions.
The straightforward solution is to use a hash table. But it may cause
memory overflow if we have a long-living backend which performs a huge
number of transactions.

Also in this case we need to change the visibility check functions.

So I have implemented the simplest solution, with a frozen xid and forced
backend termination in case of transaction rollback (so the user will not see
inconsistent behavior).
Attached please find global_private_temp_replica.patch which implements 
this approach.
It will be nice if somebody can suggest better solution for temporary 
tables at replica.
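
To illustrate the intended behavior (the table name is made up, and this only
restates the description above), on a hot standby with the patch applied:

BEGIN;
INSERT INTO my_global_temp VALUES (1);  -- allowed, even though standby transactions are read-only
SELECT * FROM my_global_temp;           -- sees the locally inserted row
COMMIT;
-- ROLLBACK instead of COMMIT terminates the backend in this prototype,
-- since the changes cannot be undone without real xids.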








--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index 9726020..c99701d 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -1028,7 +1028,7 @@ gistGetFakeLSN(Relation rel)
 {
 	static XLogRecPtr counter = FirstNormalUnloggedLSN;
 
-	if (rel->rd_rel->relpersistence == RELPERSISTENCE_TEMP)
+	if (RelationHasSessionScope(rel))
 	{
 		/*
 		 * Temporary relations are only accessible in our session, so a simple
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index f1ff01e..e92d324 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -673,6 +673,7 @@ heapam_relation_copy_data(Relation rel, const RelFileNode *newrnode)
 			 * init fork of an unlogged relation.
 			 */
 			if (rel->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT ||
+rel->rd_rel->relpersistence == RELPERSISTENCE_SESSION ||
 (rel->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED &&
  forkNum == INIT_FORKNUM))
 log_smgrcreate(newrnode, forkNum);
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 9c1f7de..97cc9e4 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -763,7 +763,11 @@ _bt_getbuf(Relation rel, BlockNumber blkno, int access)
 		/* Read an existing block of the relation */
 		buf = ReadBuffer(rel, blkno);
 		LockBuffer(buf, access);
-		_bt_checkpage(rel, buf);
+		/* Session temporary relation may be not yet initialized for this backend. */
+		if (blkno == BTREE_METAPAGE && GlobalTempRelationPageIsNotInitialized(rel, BufferGetPage(buf)))
+			_bt_initmetapage(BufferGetPage(buf), P_NONE, 0);
+		else
+			_bt_checkpage(rel, buf);
 	}
 	else
 	{
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index 5b759ec..fda7573 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -69,9 +69,10 @@ 

Re: POC: Cleaning up orphaned files using undo logs

2019-08-20 Thread Robert Haas
On Tue, Aug 20, 2019 at 2:42 AM Amit Kapila  wrote:
> > Well, my main point, which so far has largely been ignored, was that we
> > may not acquire page locks when we still need to search for victim
> > buffers later. If we don't need to lock the pages up-front, but only do
> > so once we're actually copying the records into the undo pages, then we
> > don't need a separate phase to acquire the locks. We can still hold all of
> > the page locks at the same time, as long as we just acquire them at the
> > later stage.
>
> Okay, IIUC, this means that we should have a separate phase where we
> call LockUndoBuffers (or something like that) before
> InsertPreparedUndo and after PrepareUndoInsert.  The LockUndoBuffers
> will lock all the buffers pinned during PrepareUndoInsert.  We can
> probably call LockUndoBuffers before entering the critical section to
> avoid any kind of failure in critical section.  If so, that sounds
> reasonable to me.

I'm kind of scratching my head here, because this is clearly different
than what Andres said in the quoted text to which you were replying.
He clearly implied that we should acquire the buffer locks within the
critical section during InsertPreparedUndo, and you responded by
proposing to do it outside the critical section in a separate step.
Regardless of which way is actually better, when somebody says "hey,
let's do A!" and you respond by saying "sounds good, I'll go implement
B!" that's not really helping us to get toward a solution.

FWIW, although I also thought of doing what you are describing here, I
think Andres's proposal is probably preferable, because it's simpler.
There's not really any reason why we can't take the buffer locks from
within the critical section, and that way callers don't have to deal
with the extra step.

> >  My secondary point was that *none* of this actually is
> > documented, even if it's entirely unobvious to the reader that the
> > relevant code can only run during WAL insertion, due to being pretty far
> > removed from that.
>
> I think this can be clearly mentioned in README or someplace else.

It also needs to be adequately commented in the files and functions involved.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Improve default partition

2019-08-20 Thread Rafia Sabih
Hello all,

I was just playing around with table partitioning and noticed
1. When one inserts into a parent table with no partitions defined
yet, it errors out
2. Similarly, if we try to insert into a parent table a value which is
not covered by any partition, and no default partition is defined, it
errors out (see the example below)
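
A minimal sketch of both cases (table and partition names are made up):

CREATE TABLE t (a int) PARTITION BY LIST (a);
INSERT INTO t VALUES (1);   -- case 1: fails, no partition of relation "t" found for the row
CREATE TABLE t_one PARTITION OF t FOR VALUES IN (1);
INSERT INTO t VALUES (2);   -- case 2: fails for the same reason, value 2 is not covered
CREATE TABLE t_def PARTITION OF t DEFAULT;
INSERT INTO t VALUES (2);   -- succeeds, the row goes to the default partition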

This does not sound very convenient. I was thinking of having some
mechanism for such insertions which automatically creates a default
partition and gives a notice for the user to know that it is going to
the default partition. Basically, always having a default partition.
After all default is something that remains there by default, isn't
it?

I will be happy to know your thoughts on this.

-- 
Regards,
Rafia Sabih




Re: POC: Cleaning up orphaned files using undo logs

2019-08-20 Thread Robert Haas
On Mon, Aug 19, 2019 at 5:16 PM Andres Freund  wrote:
> Well, my main point, which so far has largely been ignored, was that we
> may not acquire page locks when we still need to search for victim
> buffers later. If we don't need to lock the pages up-front, but only do
> so once we're actually copying the records into the undo pages, then we
> don't need a separate phase to acquire the locks. We can still hold all of
> the page locks at the same time, as long as we just acquire them at the
> later stage.

+1 for that approach.  I am in complete agreement.

> My secondary point was that *none* of this actually is
> documented, even if it's entirely unobvious to the reader that the
> relevant code can only run during WAL insertion, due to being pretty far
> removed from that.

+1 also for properly documenting stuff.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




ICU for global collation

2019-08-20 Thread Peter Eisentraut
Here is an initial patch to add the option to use ICU as the global
collation provider, a long-requested feature.

To activate, use something like

initdb --collation-provider=icu --locale=...

A trick here is that since we need to also still set the normal POSIX
locales, the --locale value needs to be valid as both a POSIX locale and
an ICU locale.  If that doesn't work out, there is also a way to specify
it separately, e.g.,

initdb --collation-provider=icu --locale=en_US.utf8 --icu-locale=en

This complexity is unfortunate, but I don't see a way around it right now.

There are also options for createdb and CREATE DATABASE to do this for a
particular database only.

Besides this, the implementation is quite small: When starting up a
database, we create an ICU collator object, store it in a global
variable, and then use it when appropriate.  All the ICU code for
creating and invoking those collators already exists of course.

For the version tracking, I use the pg_collation row for the "default"
collation.  Again, this mostly reuses existing code and concepts.

Nondeterministic collations are not supported for the global collation,
because then LIKE and regular expressions don't work and that breaks
some system views.  This needs some separate research.

To test, run the existing regression tests against a database
initialized with ICU.  Perhaps some options for pg_regress could
facilitate that.

I fear that the Localization chapter in the documentation will need a
bit of a rewrite after this, because the hitherto separately treated
concepts of locale and collation are fusing together.  I haven't done
that here yet, but that would be the plan for later.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 071a955d44b588d4633030e7a7d06c5cfb4ff838 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Tue, 20 Aug 2019 16:02:46 +0200
Subject: [PATCH v1] Add option to use ICU as global collation provider

This adds the option to use ICU as the default collation provider for
either the whole cluster or a database.  New options for initdb,
createdb, and CREATE DATABASE are used to select this.
---
 doc/src/sgml/ref/createdb.sgml   |   9 ++
 doc/src/sgml/ref/initdb.sgml |  23 
 src/backend/access/hash/hashfunc.c   |  18 ++-
 src/backend/commands/dbcommands.c|  52 -
 src/backend/regex/regc_pg_locale.c   |   7 +-
 src/backend/utils/adt/formatting.c   |   6 +
 src/backend/utils/adt/like.c |  20 +++-
 src/backend/utils/adt/like_support.c |   2 +
 src/backend/utils/adt/pg_locale.c| 168 ---
 src/backend/utils/adt/varchar.c  |  22 +++-
 src/backend/utils/adt/varlena.c  |  26 -
 src/backend/utils/init/postinit.c|  21 
 src/bin/initdb/Makefile  |   2 +
 src/bin/initdb/initdb.c  |  63 --
 src/bin/initdb/t/001_initdb.pl   |  18 ++-
 src/bin/pg_dump/pg_dump.c|  16 +++
 src/bin/psql/describe.c  |   8 ++
 src/bin/scripts/Makefile |   2 +
 src/bin/scripts/createdb.c   |   9 ++
 src/bin/scripts/t/020_createdb.pl|  19 ++-
 src/include/catalog/pg_database.dat  |   2 +-
 src/include/catalog/pg_database.h|   3 +
 src/include/utils/pg_locale.h|   6 +
 23 files changed, 417 insertions(+), 105 deletions(-)

diff --git a/doc/src/sgml/ref/createdb.sgml b/doc/src/sgml/ref/createdb.sgml
index 8fc8128bf9..5b73afad91 100644
--- a/doc/src/sgml/ref/createdb.sgml
+++ b/doc/src/sgml/ref/createdb.sgml
@@ -85,6 +85,15 @@ Options
   
  
 
+ 
+  
--collation-provider={libc|icu}
+  
+   
+Specifies the collation provider for the database.
+   
+  
+ 
+
  
   -D tablespace
   --tablespace=tablespace
diff --git a/doc/src/sgml/ref/initdb.sgml b/doc/src/sgml/ref/initdb.sgml
index da5c8f5307..9ad7b2e112 100644
--- a/doc/src/sgml/ref/initdb.sgml
+++ b/doc/src/sgml/ref/initdb.sgml
@@ -165,6 +165,18 @@ Options
   
  
 
+ 
+  
--collation-provider={libc|icu}
+  
+   
+This option sets the collation provider for databases created in the
+new cluster.  It can be overridden in the CREATE
+DATABASE command when new databases are subsequently
+created.  The default is libc.
+   
+  
+ 
+
  
   -D directory
   --pgdata=directory
@@ -209,6 +221,17 @@ Options
   
  
 
+ 
+  
--icu-locale=locale
+  
+   
+Specifies the ICU locale if the ICU collation provider is used.  If
+this is not specified, the value from the --locale
+option is used.
+   
+  
+ 
+
  
   -k
   --data-checksums
diff --git a/src/backend/access/hash/hashfunc.c 
b/src/backend/access/hash/hashfunc.c
index 6ec1ec3df3..2f8f220549 100644
--- a/src/backend/access/hash/hashfunc.c
+++ b/src/backend/access/hash/hashfunc.c
@@ -255,8 

Re: Cleanup isolation specs from unused steps

2019-08-20 Thread Alvaro Herrera
On 2019-Aug-20, Tom Lane wrote:

> If you can warn in both cases, that'd be OK perhaps.  But Alvaro's
> description of the intended use of dry-run makes it sound like
> it would be expected for there to be unreferenced steps, since there'd
> be no permutations yet in the input.

Well, Heikki/Kevin's original intention was that if no permutations are
specified, then all possible permutations are generated internally and
the spec is run with that.  The intended use of dry-run was to do just
that (generate all possible permutations) and print that list, so that
it could be trimmed down by the test author.  In this mode of operation,
all steps are always used, so there'd be no warning printed.  However,
when a test file has a largish number of steps then the list is, um, a
bit long.  Before the deadlock-test hacking, you could run with such a
list anyway and any permutations that caused a blockage would be
reported right away as an invalid permutation -- quick enough.
Currently it sleeps for absurdly long on those cases, so this is no
longer feasible.

This is why I say that the current dry-run mode could be removed with no
loss of useful functionality.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: configure still looking for crypt()?

2019-08-20 Thread Tom Lane
Peter Eisentraut  writes:
> I noticed that configure is still looking for crypt() and crypt.h.
> Isn't that long obsolete?
> If so, I suggest to remove it with the attached patch.

+1

regards, tom lane




Re: Cleanup isolation specs from unused steps

2019-08-20 Thread Tom Lane
Alvaro Herrera  writes:
> On 2019-Aug-20, Michael Paquier wrote:
>> On Mon, Aug 19, 2019 at 10:23:19AM -0700, Melanie Plageman wrote:
>>> Could you do the check that all steps have been used in dry_run mode
>>> instead of when running the tests for real?

>> Sure, I was hesitating to do so.  I have no issue in moving the check
>> into run_testspec().  So done as attached.

> I created the dry-run mode to be able to easily generate the set of
> possible permutations for a new test, then edit the result and put it
> back in the spec file; but after the deadlock tests were added (with
> necessary hacking of the lock-detection in isolationtester) that manner
> of operation became almost completely useless.  Maybe we need to rethink
> what facilities isolationtester offers -- possibly making dry-run have a
> completely different behavior than currently, which I doubt anybody is
> using.

Hm, does that mean that this version of the patch would fail to warn
during a normal run?  Doesn't sound good, since as Alvaro says,
hardly anyone uses dry-run.

If you can warn in both cases, that'd be OK perhaps.  But Alvaro's
description of the intended use of dry-run makes it sound like
it would be expected for there to be unreferenced steps, since there'd
be no permutations yet in the input.

regards, tom lane




Re: pg_upgrade fails with non-standard ACL

2019-08-20 Thread Anastasia Lubennikova

14.08.2019 3:28, Stephen Frost wrote:

* Bruce Momjian (br...@momjian.us) wrote:


As much as it would be nice if the release notes covered all that, and
we updated pg_upgrade to somehow handle them, it just isn't realistic.
As we can see here, the problems often take years to show up, and even
then there were probably many other people who had the problem who never
reported it to us.

Yeah, the possible changes when you think about column-level privileges
as well really gets to be quite large..

This is why my thinking is that we should come up with additional
default roles, which aren't tied to specific catalog structures but
instead are for a more general set of capabilities which we manage and
users can either decide to use, or not.  If they decide to work with the
individual functions then they have to manage the upgrade process if and
when we make changes to those functions.


Idea of having special roles looks good to me, though, I don't see
how to define what grants are needed for each role. Let's say, we
define role "backupuser" that obviously must have grants on 
pg_start_backup()

and pg_stop_backup(). Should it also have access to pg_is_in_recovery()
or for example version()?


It'd be pretty neat if pg_upgrade could connect to the old and new
clusters concurrently and then perform a generic catalog comparison
between them and identify all objects which have been changed and
determine if there's any non-default ACLs or dependencies on the catalog
objects which are different between the clusters.  That seems like an
awful lot of work though, and I'm not sure there's really any need,
given that we don't change the catalog for a given major version- we
could just generate the list using some queries whenever we release a
new major version and update pg_upgrade with it.


The only positive is that when pg_upgrade does fail, at least we have a
system that clearly points to the cause, but unfortunately usually at
run-time, not at --check time.

Getting it to be at check time would certainly be a great improvement.

Solving this in pg_upgrade does seem like it's probably the better
approach rather than trying to do it in pg_dump.  Unfortunately, that
likely means that all we can do is have pg_upgrade point out to the user
when something will fail, with recommendations on how to address it, but
that's also something users are likely used to and willing to accept,
and puts the onus on them to consider their ACL decisions when we're
making catalog changes, and it keeps these issues out of pg_dump.



I wrote a prototype to check API and ACL compatibility (see attachment).
In the current implementation it fetches the list of system procedures 
for both old and new clusters

and reports deleted functions or ACL changes during pg_upgrade check:


Performing Consistency Checks
-
...
Checking for system functions API compatibility
dbname postgres : check procsig is equal pg_stop_backup(), procacl not 
equal {anastasia=X/anastasia,backup=X/anastasia} vs {anastasia=X/anastasia}
dbname postgres : procedure pg_stop_backup(exclusive boolean, OUT lsn 
pg_lsn, OUT labelfile text, OUT spcmapfile text) doesn't exist in 
new_cluster

dbname postgres : procedure pg_switch_xlog() doesn't exist in new_cluster
dbname postgres : procedure pg_xlog_replay_pause() doesn't exist in 
new_cluster
dbname postgres : procedure pg_xlog_replay_resume() doesn't exist in 
new_cluster

...

I think it's a good first step in the right direction.
Now I have questions about implementation details:

1) How exactly should we report this incompatibility to a user?
I think it's fine to leave the warnings and also write some hint for the 
user by analogy with other checks.
"Reset ACL on the problem functions to default in the old cluster to 
continue"


Is it enough?

2) This approach can be extended to the other catalog modifications you
mentioned above.
I don't see what else can break pg_upgrade in the same way. However, I
don't mind
implementing more checks while I work on this issue, if you can point
me to them.
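
For finding the problem functions in the first place, something along these
lines could be a starting point (built-in objects have OIDs below
FirstNormalObjectId = 16384, and a NULL proacl means the default ACL):

SELECT p.oid::regprocedure AS function, p.proacl
FROM pg_proc p
WHERE p.oid < 16384 AND p.proacl IS NOT NULL;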


--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

commit 36e23da269d635ebfbc9891d37e650f8c3bfa6de
Author: Anastasia 
Date:   Mon Aug 19 19:38:51 2019 +0300

attempt to check proc signatures and ACL in pg_upgrade

diff --git a/src/bin/pg_upgrade/check.c b/src/bin/pg_upgrade/check.c
index b4cf6da..c12340a 100644
--- a/src/bin/pg_upgrade/check.c
+++ b/src/bin/pg_upgrade/check.c
@@ -90,6 +90,7 @@ check_and_dump_old_cluster(bool live_check)
 
 	get_loadable_libraries();
 
+	get_catalog_procedures(&old_cluster);
 
 	/*
 	 * Check for various failure cases
@@ -148,6 +149,8 @@ check_new_cluster(void)
 	check_databases_are_compatible();
 
 	check_loadable_libraries();
+	get_catalog_procedures(&new_cluster);
+	check_catalog_procedures(&old_cluster, &new_cluster);
 
 	if (user_opts.transfer_mode == TRANSFER_MODE_LINK)
 		check_hard_link();
diff --git 

Re: POC: Cleaning up orphaned files using undo logs

2019-08-20 Thread Robert Haas
On Sat, Aug 17, 2019 at 1:28 PM Andres Freund  wrote:
> The primary one in the context here is that if we do *not* have to lock
> the buffers all ahead of time, we can simplify the interface. We
> certainly can't lock the buffers over IO (due to buffer reclaim) as
> we're doing right now, so we'd need another phase, called by the "user"
> during undo insertion. But if we do not need to lock the buffers before
> the insertion over all starts, the inserting location doesn't have to
> care.
>
> Secondarily, all the reasoning for needing to lock all buffers ahead of
> time was imo fairly unconvincing. Following the "recipe" for WAL
> insertions is a good idea when writing a new run-of-the-mill WAL
> inserting location - but when writing a new fundamental facility, that
> already needs to modify how WAL works, then I find that much less
> convincing.

So, you seem to be talking about something here which is different
than what I thought we were talking about.  One question is whether we
need to lock all of the buffers "ahead of time," and I think the
answer to that question is probably "no." Since nobody else can be
writing to those buffers, and probably also nobody can be reading them
except maybe for some debugging tool, it should be fine if we enter
the critical section and then lock them at the point when we write the
bytes. I mean, there shouldn't be any contention, and I don't see any
other problems.

The other question is whether need to hold all of the buffer locks at
the same time, and that seems a lot more problematic to me.  It's hard
to say exactly whether this unsafe, because it depends on exactly what
you think we're doing here, and I don't see that you've really spelled
that out.  The normal thing to do is call PageSetLSN() on every page
before releasing the buffer lock, and that means holding all the
buffer locks until after we've called XLogInsert(). Now, you could
argue that we should skip setting the page LSN because the data ahead
of the insertion pointer is effectively invisible anyway, but I bet
that causes problems with checksums, at least, since they rely on the
page LSN being accurate to know whether to emit WAL when a buffer is
written. You could argue that we could do the XLogInsert() first and
only after that lock and dirty the pages one by one, but I think that
might break checkpoint interlocking, since it would then be possible
for the checkpoint scan to pass over a buffer that does not appear to
need writing for the current checkpoint but later gets dirtied and
stamped with an LSN that would have caused it to be written had it
been there at the time the checkpoint scan reached it. I really can't
rule out the possibility that there's some way to make something in
this area work, but I don't know what it is, and I think it's a fairly
risky area to go tinkering.

> Well, in the version of code that I was reviewing here, I don't think there is
> such a limit (there is a limit for buffers per undo record, but no limit
> on the number of records inserted together). I think Dilip added a limit
> since.  And we have the issue of a lot of IO happening while holding
> content locks on several pages.  So I don't think it's a straw man at
> all.

Hmm, what do you mean by "a lot of IO happening while holding content
locks on several pages"? We might XLogInsert() but there shouldn't be
any buffer I/O going on at that point.  If there is, I think that
should be redesigned.  We should collect buffer pins first, without
locking.  Then lock.  Then write.  Or maybe lock-and-write, but only
after everything's pinned.  The fact of calling XLogInsert() while
holding buffer locks is not great, but I don't think it's any worse
here than in any other part of the system, because the undo buffers
aren't going to be suffering concurrent access from any other backend,
and because there shouldn't be more than a few of them.

> > 2. The write-ahead logging protocol says that you're supposed to lock
> > all the buffers at once.  See src/backend/access/transam/README.  If
> > you want to go patch that file, then this patch can follow whatever
> > the locking rules in the patched version are.  But until then, the
> > patch should follow *the actual rules* not some other protocol based
> > on a hand-wavy explanation in an email someplace. Otherwise, you've
> > got the same sort of undocumented disaster-waiting-to-happen that you
> > keep complaining about in other parts of this patch.  We need fewer of
> > those, not more!
>
> But that's not what I'm asking for? I don't even know where you take
> from that I don't want this to be documented. I'm mainly asking for a
> comment explaining why the current behaviour is what it is. Because I
> don't think an *implicit* "normal WAL logging rules" is sufficient
> explanation, because all the locking here happens one or two layers away
> from the WAL logging site - so it's absolutely *NOT* obvious that that's
> the explanation. And I don't think any of the locking sites actually 

mingw32 floating point diff

2019-08-20 Thread Peter Eisentraut
Running the regression tests on mingw32, I get the following diff in
circle.out:

@@ -111,8 +111,8 @@
   WHERE (c1.f1 < c2.f1) AND ((c1.f1 <-> c2.f1) > 0)
   ORDER BY distance, area(c1.f1), area(c2.f1);
  five |  one   |  two   | distance
---+++--
-  | <(3,5),0>  | <(1,2),3>  | 0.60555127546399
+--+++---
+  | <(3,5),0>  | <(1,2),3>  | 0.605551275463989
   | <(3,5),0>  | <(5,1),3>  | 1.47213595499958
   | <(100,200),10> | <(100,1),115>  |   74
   | <(100,200),10> | <(1,2),100>| 111.370729772479

I only get this on master/PG12, but not on PG11, so I suspect that the
new floating-point output routines could be the root of the issue.

This happens only with the 32-bit build (mingw32), but not with a 64-bit
build (mingw64).

Any suggestions on how to analyze this further?

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




configure still looking for crypt()?

2019-08-20 Thread Peter Eisentraut
I noticed that configure is still looking for crypt() and crypt.h.
Isn't that long obsolete?

If so, I suggest to remove it with the attached patch.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 749310380a47fc6e51f35b56fd612adf76491c6c Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Tue, 20 Aug 2019 14:27:09 +0200
Subject: [PATCH] Remove configure crypt detection

---
 configure |   71 +--
 configure.in  |3 -
 src/backend/libpq/crypt.c |3 -
 src/include/pg_config.h.in|6 -
 src/include/pg_config.h.win32 |6 -
 src/include/port.h|3 -
 src/port/crypt.c  | 1067 -
 src/tools/msvc/Mkvcbuild.pm   |2 +-
 8 files changed, 2 insertions(+), 1159 deletions(-)
 delete mode 100644 src/port/crypt.c

diff --git a/configure b/configure
index d1dc54697d..f14709ed1e 100755
--- a/configure
+++ b/configure
@@ -11209,62 +11209,6 @@ if test "$ac_res" != no; then :
 
 fi
 
-{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for library containing 
crypt" >&5
-$as_echo_n "checking for library containing crypt... " >&6; }
-if ${ac_cv_search_crypt+:} false; then :
-  $as_echo_n "(cached) " >&6
-else
-  ac_func_search_save_LIBS=$LIBS
-cat confdefs.h - <<_ACEOF >conftest.$ac_ext
-/* end confdefs.h.  */
-
-/* Override any GCC internal prototype to avoid an error.
-   Use char because int might match the return type of a GCC
-   builtin and then its argument prototype would still apply.  */
-#ifdef __cplusplus
-extern "C"
-#endif
-char crypt ();
-int
-main ()
-{
-return crypt ();
-  ;
-  return 0;
-}
-_ACEOF
-for ac_lib in '' crypt; do
-  if test -z "$ac_lib"; then
-ac_res="none required"
-  else
-ac_res=-l$ac_lib
-LIBS="-l$ac_lib  $ac_func_search_save_LIBS"
-  fi
-  if ac_fn_c_try_link "$LINENO"; then :
-  ac_cv_search_crypt=$ac_res
-fi
-rm -f core conftest.err conftest.$ac_objext \
-conftest$ac_exeext
-  if ${ac_cv_search_crypt+:} false; then :
-  break
-fi
-done
-if ${ac_cv_search_crypt+:} false; then :
-
-else
-  ac_cv_search_crypt=no
-fi
-rm conftest.$ac_ext
-LIBS=$ac_func_search_save_LIBS
-fi
-{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_search_crypt" >&5
-$as_echo "$ac_cv_search_crypt" >&6; }
-ac_res=$ac_cv_search_crypt
-if test "$ac_res" != no; then :
-  test "$ac_res" = "none required" || LIBS="$ac_res $LIBS"
-
-fi
-
 { $as_echo "$as_me:${as_lineno-$LINENO}: checking for library containing 
shm_open" >&5
 $as_echo_n "checking for library containing shm_open... " >&6; }
 if ${ac_cv_search_shm_open+:} false; then :
@@ -12760,7 +12704,7 @@ $as_echo "#define HAVE_STDBOOL_H 1" >>confdefs.h
 fi
 
 
-for ac_header in atomic.h copyfile.h crypt.h fp_class.h getopt.h ieeefp.h 
ifaddrs.h langinfo.h mbarrier.h poll.h sys/epoll.h sys/ipc.h sys/prctl.h 
sys/procctl.h sys/pstat.h sys/resource.h sys/select.h sys/sem.h sys/shm.h 
sys/sockio.h sys/tas.h sys/un.h termios.h ucred.h utime.h wchar.h wctype.h
+for ac_header in atomic.h copyfile.h fp_class.h getopt.h ieeefp.h ifaddrs.h 
langinfo.h mbarrier.h poll.h sys/epoll.h sys/ipc.h sys/prctl.h sys/procctl.h 
sys/pstat.h sys/resource.h sys/select.h sys/sem.h sys/shm.h sys/sockio.h 
sys/tas.h sys/un.h termios.h ucred.h utime.h wchar.h wctype.h
 do :
   as_ac_Header=`$as_echo "ac_cv_header_$ac_header" | $as_tr_sh`
 ac_fn_c_check_header_mongrel "$LINENO" "$ac_header" "$as_ac_Header" 
"$ac_includes_default"
@@ -15782,19 +15726,6 @@ done
 
 fi
 
-ac_fn_c_check_func "$LINENO" "crypt" "ac_cv_func_crypt"
-if test "x$ac_cv_func_crypt" = xyes; then :
-  $as_echo "#define HAVE_CRYPT 1" >>confdefs.h
-
-else
-  case " $LIBOBJS " in
-  *" crypt.$ac_objext "* ) ;;
-  *) LIBOBJS="$LIBOBJS crypt.$ac_objext"
- ;;
-esac
-
-fi
-
 ac_fn_c_check_func "$LINENO" "dlopen" "ac_cv_func_dlopen"
 if test "x$ac_cv_func_dlopen" = xyes; then :
   $as_echo "#define HAVE_DLOPEN 1" >>confdefs.h
diff --git a/configure.in b/configure.in
index 33edfd765b..805cf8617b 100644
--- a/configure.in
+++ b/configure.in
@@ -1118,7 +1118,6 @@ AC_SEARCH_LIBS(dlopen, dl)
 AC_SEARCH_LIBS(socket, [socket ws2_32])
 AC_SEARCH_LIBS(shl_load, dld)
 AC_SEARCH_LIBS(getopt_long, [getopt gnugetopt])
-AC_SEARCH_LIBS(crypt, crypt)
 AC_SEARCH_LIBS(shm_open, rt)
 AC_SEARCH_LIBS(shm_unlink, rt)
 AC_SEARCH_LIBS(clock_gettime, [rt posix4])
@@ -1273,7 +1272,6 @@ AC_HEADER_STDBOOL
 AC_CHECK_HEADERS(m4_normalize([
atomic.h
copyfile.h
-   crypt.h
fp_class.h
getopt.h
ieeefp.h
@@ -1692,7 +1690,6 @@ else
 fi
 
 AC_REPLACE_FUNCS(m4_normalize([
-   crypt
dlopen
fls
getopt
diff --git a/src/backend/libpq/crypt.c b/src/backend/libpq/crypt.c
index 6e273dc9bb..784fb227aa 100644
--- a/src/backend/libpq/crypt.c
+++ b/src/backend/libpq/crypt.c
@@ -14,9 +14,6 @@
 #include "postgres.h"
 
 #include 
-#ifdef HAVE_CRYPT_H
-#include <crypt.h>
-#endif
 
 #include 

Re: POC: Cleaning up orphaned files using undo logs

2019-08-20 Thread Dilip Kumar
On Wed, Aug 14, 2019 at 10:35 PM Andres Freund  wrote:
>
> Hi,
>
> On 2019-08-14 14:48:07 +0530, Dilip Kumar wrote:
> > On Wed, Aug 14, 2019 at 12:27 PM Andres Freund  wrote:

> I don't think we can normally pin the undo buffers properly at that
> stage. Without knowing the correct contents of the table page - which we
> can't know without holding some form of lock preventing modifications -
> we can't know how big our undo records are going to be. And we can't
> just have buffers that don't exist on disk in shared memory, and we
> don't want to allocate undo that we then don't need. So I think what
> we'd have to do at that stage, is to "pre-allocate" buffers for the
> maximum amount of UNDO needed, but mark the associated bufferdesc as not
> yet valid. These buffers would have a pincount > 0, but BM_TAG_VALID
> would not be set.
>
> So at the start of a function that will need to insert undo we'd need to
> pre-reserve the maximum number of buffers we could potentially
> need. That reservation stage would
>
> a) pin the page with the current end of the undo
> b) if needed pin the page of older undo that we need to update (e.g. to
>update the next pointer)
> c) perform clock sweep etc to acquire (find or create) enough clean to
>hold the maximum amount of undo needed. These buffers would be marked
>as !BM_TAG_VALID | BUF_REFCOUNT_ONE.
>
> I assume that we'd make a) cheap by keeping it pinned for undo logs that
> a backend is actively attached to. b) should only be needed once in a
> transaction, so it's not too bad. c) we'd probably need to amortize
> across multiple undo insertions, by keeping the unused buffers pinned
> until the end of the transaction.
>
> I assume that having the infrastructure c) might also make some code
> for already in postgres easier. There's obviously some issues around
> guaranteeing that the maximum number of such buffers isn't high.


I have analyzed this further, I think there is a problem if the
record/s will not fit into the current undo log and we will have to
switch the log.  Because before knowing the actual record length we
are not sure whether the undo log will switch or not and which undo
log we will get.  And, without knowing the logno (rnode) how we are
going to pin the buffers?  Am I missing something?

Thomas do you think we can get around this problem?

Apart from this, while analyzing the other code I have noticed that in
the current PG code we have a few occurrences where we try to read a buffer
while already holding a buffer lock.
1.
In gistplacetopage
{
...
for (; ptr; ptr = ptr->next)
{
/* Allocate new page */
ptr->buffer = gistNewBuffer(rel);
GISTInitBuffer(ptr->buffer, (is_leaf) ? F_LEAF : 0);
ptr->page = BufferGetPage(ptr->buffer);
ptr->block.blkno = BufferGetBlockNumber(ptr->buffer);
}
2. During page split we find new buffer while holding the lock on the
current buffer.

That doesn't mean that we can't do better but I am just referring to
the existing code where we already have such issues.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Make SQL/JSON error code names match SQL standard

2019-08-20 Thread Alexander Korotkov
On Tue, Aug 20, 2019 at 11:49 AM Peter Eisentraut
 wrote:
> I propose the attached patch to make the new SQL/JSON error code names
> match the SQL standard.  The existing minor differences don't seem
> necessary.

Thank you for noticing!
+1 for pushing this

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Zedstore - compressed in-core columnar storage

2019-08-20 Thread Heikki Linnakangas

On 20/08/2019 05:04, Justin Pryzby wrote:

it looks like zedstore
with lz4 gets ~4.6x for our largest customer's largest table.  zfs using
compress=gzip-1 gives 6x compression across all their partitioned
tables,
and I'm surprised it beats zedstore .


I did a quick test, with 10 million random IP addresses, in text format. 
I loaded it into a zedstore table ("create table ips (ip text) using 
zedstore"), and poked around a little bit to see how the space is used.


postgres=# select lokey, nitems, ncompressed, totalsz, uncompressedsz, 
freespace from  pg_zs_btree_pages('ips') where attno=1 and level=0 limit 10;

 lokey | nitems | ncompressed | totalsz | uncompressedsz | freespace
---++-+-++---
 1 |  4 |   4 |6785 |   7885 |  1320
   537 |  5 |   5 |7608 |   8818 |   492
  1136 |  4 |   4 |6762 |   7888 |  1344
  1673 |  5 |   5 |7548 |   8776 |   540
  2269 |  4 |   4 |6841 |   7895 |  1256
  2807 |  5 |   5 |7555 |   8784 |   540
  3405 |  5 |   5 |7567 |   8772 |   524
  4001 |  4 |   4 |6791 |   7899 |  1320
  4538 |  5 |   5 |7596 |   8776 |   500
  5136 |  4 |   4 |6750 |   7875 |  1360
(10 rows)

There's on average about 10% of free space on the pages. We're losing 
quite a bit to ZFS compression right there. I'm sure there's some 
free space on the heap pages as well, but ZFS compression will squeeze 
it out.
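
(Rough numbers from the same pg_zs_btree_pages() output, assuming 8 kB pages:

select round(avg(freespace) * 100.0 / 8192, 1) as avg_free_pct
from pg_zs_btree_pages('ips') where attno=1 and level=0;

which is roughly where the 10% figure above comes from.)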


The compression ratio is indeed not very good. I think one reason is 
that zedstore does LZ4 in relatively small chunks, while ZFS surely 
compresses large blocks in one go. Looking at the above, there is on 
average 125 datums packed into each "item" (avg(hikey-lokey) / nitems). 
I did a quick test with the "lz4" command-line utility, compressing flat 
files containing random IP addresses.


$ lz4 /tmp/125-ips.txt
Compressed filename will be : /tmp/125-ips.txt.lz4
Compressed 1808 bytes into 1519 bytes ==> 84.02% 


$ lz4 /tmp/550-ips.txt
Compressed filename will be : /tmp/550-ips.txt.lz4
Compressed 7863 bytes into 6020 bytes ==> 76.56% 


$ lz4 /tmp/750-ips.txt
Compressed filename will be : /tmp/750-ips.txt.lz4
Compressed 10646 bytes into 8035 bytes ==> 75.47%

The first case is roughly what we do in zedstore currently: we compress 
about 125 datums as one chunk. The second case is roughly what we would 
get if we collected 8k worth of datums and compressed them all as 
one chunk. And the third case simulates the case where we would allow the 
input to be larger than 8k, so that the compressed chunk just fits on an 
8k page. Not too much difference between the second and third case, but 
it's pretty clear that we're being hurt by splitting the input into such 
small chunks.


The downside of using a larger compression chunk size is that random 
access becomes more expensive. Need to give the on-disk format some more 
thought. Although I actually don't feel too bad about the current 
compression ratio, perfect can be the enemy of good.


- Heikki




Serialization questions

2019-08-20 Thread Alex
Before understanding how postgres implements the serializable isolation
level (I have seen many papers related to it), I have a question about how it
should behave.


I mainly read the ideas from
https://www.postgresql.org/docs/11/transaction-iso.html.


In fact, this isolation level works exactly the same as Repeatable Read
except that it monitors for conditions which could make execution of a
concurrent set of serializable transactions behave in a manner inconsistent
with all possible serial (one at a time) executions of those transactions.


In repeatable read, every statement will use the transaction start
timestamp; is that also the case at the serializable isolation level?


When relying on Serializable transactions to prevent anomalies, it is
important that any data read from a permanent user table not be considered
valid until the transaction which read it has successfully committed. This
is true even for read-only transactions ...


What does the "not be considered valid" mean?  and if it is a read-only
transaction (assume T1),  I think it is ok to let other transaction do
anything with the read set of T1, since it is invisible to T1(use the
transaction start time as statement timestamp).


Thanks


Re: Make SQL/JSON error code names match SQL standard

2019-08-20 Thread Pavel Stehule
út 20. 8. 2019 v 10:49 odesílatel Peter Eisentraut <
peter.eisentr...@2ndquadrant.com> napsal:

> I propose the attached patch to make the new SQL/JSON error code names
> match the SQL standard.  The existing minor differences don't seem
> necessary.
>

+1

Pavel

>
> --
> Peter Eisentraut  http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>


Re: Fixing typos and inconsistencies

2019-08-20 Thread Michael Paquier
On Mon, Aug 19, 2019 at 10:19:19PM +0300, Alexander Lakhin wrote:
> Now that the unicums checking is finished, I would like to share the
> script I used to find them.
> Maybe it can be useful to recheck the source tree from time to time...
> I don't think that the check could be fully automated, but with some
> eyeballing it allows to maintain a more consistent state.

Cool, thanks for sharing.

We still have a couple of things left though; one which stood out
recently was a reference to TupleLockUpdate in heapam.c.  This relates
to multixacts, so I am not actually sure which lock level this was
referring to.
--
Michael




Re: understand the pg locks in a simple case

2019-08-20 Thread Heikki Linnakangas

On 20/08/2019 10:23, Alex wrote:
I have trouble understanding the pg locks in the following simple 
situation.



Session 1:


begin;  update t set a = 1 where a = 10;


Session 2:


begin;  update t set a = 2 where a = 10;


They update the same row and session 2 is blocked by session 1 without 
surprise.



The pretty straightforward implementation would be:

Session 1 locks the tuple in *ExclusiveLock* mode.

When session 2 tries to lock it in exclusive mode, it is blocked.


But when I check pg_locks for session 1, I can see *no tuple
lock* there; when I check session 2, I can see a
*tuple (ExclusiveLock) is granted*, but it is waiting for a transactionid.



Since every tuple has txn information, it is not hard to implement
it this way, but is there any benefit over the straightforward way?
With the current implementation, what is the point
of the tuple (ExclusiveLock) for session 2?


The reason that tuple locking works with XIDs, rather than directly 
acquiring a lock on the tuple, is that the memory allocated for the lock 
manager is limited. One transaction can lock millions of tuples, and if 
it had to hold a normal lock on every tuple, you would run out of memory 
very quickly.


So it may seem that we don't need heavy-weight locks on individual 
tuples at all. But we still need them to establish the order in which backends 
are waiting. The locking protocol is:


1. Check if a tuple's xmax is set.
2. If it's set, obtain a lock on the tuple's TID.
3. Wait on the transaction to finish, by trying to acquire lock on the XID.
4. Update the tuple, release the lock on the XID, and on the TID.

It gets more complicated if there are multixids, or you update a row you 
have earlier locked in a weaker mode, but that's the gist of it.


We could skip the lock on the tuple's TID, but then if you have multiple 
backends trying to update or lock a row, it would not be 
deterministic who gets the lock first. For example:


Session A: BEGIN; UPDATE foo SET col='a' WHERE id = 123;
Session B: UPDATE foo SET col='b' WHERE id = 123; 
Session C: UPDATE foo SET col='c' WHERE id = 123; 
Session A: ROLLBACK;

Without the lock on the TID, it would be indeterministic, whether 
session B or C gets to update the tuple, when A rolls back. With the 
above locking protocol, B will go first. B will acquire the lock on the 
TID, and block on the XID lock, while C will block on the TID lock held 
by B. If there were more backends trying to do the same, they would 
queue for the TID lock, too.
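
If you want to watch this from a third session, the waiters show up in
pg_locks roughly like this (column list trimmed for readability):

select locktype, relation::regclass, page, tuple, transactionid, mode, granted
from pg_locks where locktype in ('tuple', 'transactionid');

The blocked updater holds the granted tuple lock and waits on the
transactionid lock, as described above.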


- Heikki




Re: understand the pg locks in a simple case

2019-08-20 Thread Laurenz Albe
Alex wrote:
> But when I check the pg_locks: session 1.  I can see no tuple lock
> there,  when I check the session 2,   I can see a
> tuple(ExclusiveLock) is granted,  but it is waiting for a
> transactionid. 
> 
> since every tuple has txn information,  so it is not hard to
> implement it this way.  but is there any benefits over the the
> straight way?   with the current implementation, what is the point
> of tuple(ExclusiveLock) for session 2?

From what I can tell the reason is that pg_locks reports the
information from the shared memory locking table, and tuple locks
are *not* maintained there, but in the "xmax" of the row itself.

To see all tuple locks in pg_locks would require a sequential
scan of all tables which have certain locks on them, which is not
going to happen.

Yours,
Laurenz Albe





Make SQL/JSON error code names match SQL standard

2019-08-20 Thread Peter Eisentraut
I propose the attached patch to make the new SQL/JSON error code names
match the SQL standard.  The existing minor differences don't seem
necessary.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 026299f325e7d2afbcd6c77b58e2968e88e5f700 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Tue, 20 Aug 2019 10:46:32 +0200
Subject: [PATCH] Make SQL/JSON error code names match SQL standard

---
 src/backend/utils/adt/jsonpath_exec.c | 36 +--
 src/backend/utils/errcodes.txt| 22 
 2 files changed, 29 insertions(+), 29 deletions(-)

diff --git a/src/backend/utils/adt/jsonpath_exec.c b/src/backend/utils/adt/jsonpath_exec.c
index 8ab563fded..22ec578d77 100644
--- a/src/backend/utils/adt/jsonpath_exec.c
+++ b/src/backend/utils/adt/jsonpath_exec.c
@@ -336,7 +336,7 @@ jsonb_path_match(PG_FUNCTION_ARGS)
 
 		if (!silent)
 			ereport(ERROR,
-					(errcode(ERRCODE_SINGLETON_JSON_ITEM_REQUIRED),
+					(errcode(ERRCODE_SINGLETON_SQL_JSON_ITEM_REQUIRED),
 					 errmsg("single boolean result is expected")));
 
 	PG_RETURN_NULL();
@@ -601,7 +601,7 @@ executeItemOptUnwrapTarget(JsonPathExecContext *cxt, JsonPathItem *jsp,
 						return jperError;
 
 					ereport(ERROR,
-							(errcode(ERRCODE_JSON_MEMBER_NOT_FOUND), \
+							(errcode(ERRCODE_SQL_JSON_MEMBER_NOT_FOUND), \
 							 errmsg("JSON object does not contain key \"%s\"",
 									pnstrdup(key.val.string.val,
 											 key.val.string.len))));
@@ -613,7 +613,7 @@ executeItemOptUnwrapTarget(JsonPathExecContext *cxt, JsonPathItem *jsp,
 				{
 					Assert(found);
 					RETURN_ERROR(ereport(ERROR,
-										 (errcode(ERRCODE_JSON_MEMBER_NOT_FOUND),
+										 (errcode(ERRCODE_SQL_JSON_MEMBER_NOT_FOUND),
 										  errmsg("jsonpath member accessor can only be applied to an object"))));
 				}
 				break;
@@ -642,7 +642,7 @@ executeItemOptUnwrapTarget(JsonPathExecContext *cxt, JsonPathItem *jsp,
 				res = executeNextItem(cxt, jsp, NULL, jb, found, true);
 			else if (!jspIgnoreStructuralErrors(cxt))
 				RETURN_ERROR(ereport(ERROR,
-									 (errcode(ERRCODE_JSON_ARRAY_NOT_FOUND),
+									 (errcode(ERRCODE_SQL_JSON_ARRAY_NOT_FOUND),
 									  errmsg("jsonpath wildcard array accessor can only be applied to an array"))));
 			break;
 
@@ -690,7 +690,7 @@ executeItemOptUnwrapTarget(JsonPathExecContext *cxt, JsonPathItem *jsp,
 							 index_from > index_to ||
 							 index_to >= size))
 						RETURN_ERROR(ereport(ERROR,
-											 (errcode(ERRCODE_INVALID_JSON_SUBSCRIPT),
+											 (errcode(ERRCODE_INVALID_SQL_JSON_SUBSCRIPT),
 											  errmsg("jsonpath array subscript is out of bounds"))));
 
 					if (index_from < 0)
@@ -747,7 +747,7 @@ executeItemOptUnwrapTarget(JsonPathExecContext *cxt, JsonPathItem *jsp,
 			else if (!jspIgnoreStructuralErrors(cxt))
 			{
 				RETURN_ERROR(ereport(ERROR,
-									 (errcode(ERRCODE_JSON_ARRAY_NOT_FOUND),
+									 (errcode(ERRCODE_SQL_JSON_ARRAY_NOT_FOUND),
 									  errmsg("jsonpath array accessor can only be applied to an array"))));
 			}
 			break;
@@ -801,7 +801,7 @@ executeItemOptUnwrapTarget(JsonPathExecContext *cxt, JsonPathItem *jsp,
 			{

Re: [HACKERS] WAL logging problem in 9.4.3?

2019-08-20 Thread Kyotaro Horiguchi
Hello.

At Mon, 19 Aug 2019 18:59:59 +0900 (Tokyo Standard Time), Kyotaro Horiguchi 
 wrote in 
<20190819.185959.118543656.horikyota@gmail.com>
> > The comment material being deleted is still correct, so don't delete it.
> > Moreover, the code managing rd_firstRelfilenodeSubid has a similar bug.  The
> > attached patch adds an assertion that RelationNeedsWAL() and the
> > pendingDeletes array have the same opinion about the relfilenode, and it
> > expands a test case to fail that assertion.
> 
> (Un?)Fortunately, that doesn't fail.. (with rebased version on
> the recent master) I'll recheck that tomorrow.

I saw the assertion failure.  It's part of the intended
behavior. In this patch, the relcache doesn't hold the whole history
of relfilenodes, so we cannot remove useless pending syncs
perfectly. On the other hand, they are harmless except that they
cause an extra sync of files that are removed immediately. So I
chose not to remove pending syncs once they are registered.

If we want consistency here, we need to record the creator subxid in
the PendingRelOps (PendingRelDelete) struct and do rather more work at
subtransaction end.

> > > --- a/src/include/utils/rel.h
> > > +++ b/src/include/utils/rel.h
> > > @@ -74,11 +74,13 @@ typedef struct RelationData
> > >   SubTransactionId rd_createSubid;/* rel was created in current 
> > > xact */
> > >   SubTransactionId rd_newRelfilenodeSubid;/* new relfilenode 
> > > assigned in
> > >   
> > >  * current xact */
> > > + SubTransactionId rd_firstRelfilenodeSubid;  /* new relfilenode 
> > > assigned
> > > + 
> > >  * first in current xact */
> > 
> > In general, to add a field like this, run "git grep -n 'rd_.*Subid'" and 
> > audit
> > all the lines printed.  Many bits of code need to look at all three,
> > e.g. RelationClose().
> 
> Agreed. I'll recheck that.
> 
> >  This field needs to be 100% reliable.  In other words,
> > it must equal InvalidSubTransactionId if and only if the relfilenode matches
> > the relfilenode that would be in place if the top transaction rolled back.
> 
> I don't get this. I think the variable moves as you suggested. It
> is handled same way with fd_new* in AtEOSubXact_cleanup but the
> difference is in assignment but rollback. rd_fist* won't change
> after the first assignment so rollback of the subid means
> relfilenode is also rolled back to the initial value at the
> beginning of the top transaction.

So I'll add this in the next version to see how it looks.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




understand the pg locks in a simple case

2019-08-20 Thread Alex
I have trouble understanding PostgreSQL's locking in the following simple
situation.


Session 1:


begin;  update t set a = 1 where a = 10;


Session 2:


begin; update t set a = 2 where a = 10;


They update the same row, and session 2 is blocked by session 1, as
expected.


The straightforward implementation would be:

Session 1 locks the tuple in *ExclusiveLock* mode.

When session 2 tries to lock it in exclusive mode, it is blocked.


But when I check pg_locks for session 1, I can see *no tuple lock*
there; when I check session 2, I can see a *tuple (ExclusiveLock) is
granted*, but it is waiting for a transactionid.


Since every tuple carries transaction information, it is not hard to
implement it this way.  But is there any benefit over the straightforward
way?  With the current implementation, what is the point of the
tuple (ExclusiveLock) for session 2?


Re: Fixing typos and inconsistencies

2019-08-20 Thread Thomas Munro
On Tue, Aug 20, 2019 at 3:05 PM Alexander Lakhin  wrote:
> Now that the unicums checking is finished, I would like to share the
> script I used to find them.
> Maybe it can be useful to recheck the source tree from time to time...
> I don't think that the check could be fully automated, but with some
> eyeballing it allows to maintain a more consistent state.

Very clever!  Thanks for doing that, and sorry for all my typos.

-- 
Thomas Munro
https://enterprisedb.com




Re: POC: Cleaning up orphaned files using undo logs

2019-08-20 Thread Amit Kapila
On Tue, Aug 20, 2019 at 2:46 AM Andres Freund  wrote:
>
> On 2019-08-19 17:52:24 +0530, Amit Kapila wrote:
> > On Sat, Aug 17, 2019 at 10:58 PM Andres Freund  wrote:
> > >
> > > > Well, I don't understand why you're on about this.  We've discussed it
> > > > a number of times but I'm still confused.
> > >
> > > There's two reasons here:
> > >
> > > The primary one in the context here is that if we do *not* have to lock
> > > the buffers all ahead of time, we can simplify the interface. We
> > > certainly can't lock the buffers over IO (due to buffer reclaim) as
> > > we're doing right now, so we'd need another phase, called by the "user"
> > > during undo insertion. But if we do not need to lock the buffers before
> > > the insertion over all starts, the inserting location doesn't have to
> > > care.
> > >
> > > Secondarily, all the reasoning for needing to lock all buffers ahead of
> > > time was imo fairly unconvincing. Following the "recipe" for WAL
> > > insertions is a good idea when writing a new run-of-the-mill WAL
> > > inserting location - but when writing a new fundamental facility, that
> > > already needs to modify how WAL works, then I find that much less
> > > convincing.
> > >
> >
> > One point to remember in this regard is that we do need to modify the
> > LSN in undo pages after writing WAL, so all the undo pages need to be
> > locked by that time or we again need to take the lock on them.
>
> Well, my main point, which so far has largely been ignored, was that we
> may not acquire page locks when we still need to search for victim
> buffers later. If we don't need to lock the pages up-front, but only do
> so once we're actually copying the records into the undo pages, then we
> don't a separate phase to acquire the locks. We can still hold all of
> the page locks at the same time, as long as we just acquire them at the
> later stage.
>

Okay, IIUC, this means that we should have a separate phase where we
call LockUndoBuffers (or something like that) before
InsertPreparedUndo and after PrepareUndoInsert.  The LockUndoBuffers
will lock all the buffers pinned during PrepareUndoInsert.  We can
probably call LockUndoBuffers before entering the critical section to
avoid any kind of failure inside the critical section.  If so, that sounds
reasonable to me.
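
So the overall flow would be something like (LockUndoBuffers being just a
placeholder name at this point): PrepareUndoInsert pins the needed undo
buffers and may do I/O to read them in; LockUndoBuffers then locks all of
the pinned buffers, still outside the critical section; inside the
critical section we call InsertPreparedUndo to copy the records into the
locked pages, write the WAL record, and stamp the undo pages with the
returned LSN; then we exit the critical section and unlock/release the
buffers.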

>  My secondary point was that *none* of this actually is
> documented, even if it's entirely unobvious to the reader that the
> relevant code can only run during WAL insertion, due to being pretty far
> removed from that.
>

I think this can be clearly mentioned in README or someplace else.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: FETCH FIRST clause PERCENT option

2019-08-20 Thread Kyotaro Horiguchi
Hi,

At Wed, 7 Aug 2019 10:20:09 +0300, Surafel Temesgen  
wrote in 
> Hi
> On Wed, Aug 7, 2019 at 6:11 AM Kyotaro Horiguchi 
> wrote:
> 
> >
> > I have some comments.
> >
> > This patch uses distinct parameters for exact number and
> > percentage. On the othe hand planner has a notion of
> > tuple_fraction covering the both. The same handling is also used
> > for tuple number estimation. Using the same scheme will make
> > things far simpler. See the comment of grouping_planner().
> >
> >
> Its because of data type difference .In planner the data type is the same

I meant that, by using tuple_fraction, one variable can
represent both the absolute limit and the relative limit. That
simplifies the parse structs.
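
For reference, the convention there is roughly: tuple_fraction = 0 means
fetch all tuples, 0 < tuple_fraction < 1 means that fraction of the
total, and tuple_fraction >= 1 means that absolute number of tuples (see
the comment above grouping_planner()).  So LIMIT 10 could be carried as
10.0 and FETCH FIRST 10 PERCENT as 0.10 in one and the same field.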

> > In executor part, addition to LimiteState.position, this patch
> > uses LimiteState.backwardPosition to count how many tuples we're
> > back from the end of the current tuplestore. But things will get
> > simpler by just asking the tuplestore for the number of holding
> > tuples.
> >
> >
> backwardPosition hold how many tuple returned in backward scan

I meant that we don't need to hold the number in estate.

> > +slot = node->subSlot;
> >
> > Why is this needed? The variable is properly set before use and
> > the assignment is bogus in the first place.
> >
> >
> its because Tuplestore needs initialized slot.

I meant that the initialized slot is overwritten before first use.

> > The new code block in LIMIT_RESCAN in ExecLimit is useless since
> > it is exatctly the same with existing code block. Why didn't you
> > use the existing if block?
> >
> 
> But they test different scenario

I meant that the two different scenario can share the code block.

> > >if (node->limitOption == PERCENTAGE)
> > +{
> > +node->perExactCount = ceil(node->percent *
> > node->position / 100.0);
> > +
> > +
> >
> > node->position is the number of tuples returned to upper node so
> > far (not the number of tuples this node has read from the lower
> > node so far).  I don't understand what the expression means.
> >
> 
> node->position hold the number of tuples this node has read from the lower
> node so far. see LIMIT_RESCAN state

Really? node->position is incremented when
tuplestore_gettupleslot_heaptuple() succeeds and returns the
tuple to the caller immediately...

> > +if (node->perExactCount == node->perExactCount + 1)
> > +node->perExactCount++;
> >
> > What? The condition never be true. As the result, the following
> > if block below won't run.
> >
> 
> it became true according to the number of tuple returned in from the lower
> node so far
> and percentage specification.

Mmm. How can X possibly be equal to (X + 1)?

> > >/*
> > + * Return the tuple up to the number of exact count in
> > OFFSET
> > + * clause without percentage value consideration.
> > + */
> > +if (node->perExactCount > 0)
> > +{
> > +
> >
> >
> >
> >
> > +/*
> > + * We may needed this tuple in backward scan so put it into
> > + * tuplestore.
> > + */
> > +if (node->limitOption == PERCENTAGE)
> > +{
> > +tuplestore_puttupleslot(node->tupleStore, slot);
> > +tuplestore_advance(node->tupleStore, true);
> > +}
> >
> > "needed"->"need" ? The comment says that this is needed for
> > backward scan, but it is actually required even in forward
> > scan. More to the point, tuplestore_advance lacks comment.
> 
> 
> ok
> 
> 
> >
> > Anyway, the code in LIMIT_RESCAN is broken in some ways. For
> > example, if it is used as the inner scan of a NestLoop, the
> > tuplestore accumulates tuples by every scan. You will see that
> > the tuplestore stores up to 1000 tuples (10 times of the inner)
> > by the following query.
> >
> 
> It this because in percentage we scan the whole table

It's useless and rather harmful that the tuplestore holds an
indefinite number of duplicate sets of all the tuples from the
lower node. We must reuse the tuples already stored in the tuplestore
or clear it before the next round.
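
For example (not necessarily the query from my previous mail, just the
shape of the problem), with the patch applied and some table t, the
inner side below is rescanned once per outer row, so the tuplestore
must be reused or cleared between rescans rather than refilled:

    SELECT *
    FROM generate_series(1, 10) AS o(n),
         LATERAL (SELECT * FROM t
                  FETCH FIRST 50 PERCENT ROWS ONLY) AS i;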

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [HACKERS] WAL logging problem in 9.4.3?

2019-08-20 Thread Noah Misch
On Mon, Aug 19, 2019 at 06:59:59PM +0900, Kyotaro Horiguchi wrote:
> At Sat, 17 Aug 2019 20:52:30 -0700, Noah Misch  wrote in 
> <20190818035230.gb3021...@rfd.leadboat.com>
> > For two-phase commit, PrepareTransaction() needs to execute pending syncs.
> 
> Now TwoPhaseFileHeader has two new members for (commit-time)
> pending syncs. Pending-syncs are useless on wal-replay, but that
> is needed for commit-prepared.

There's no need to modify TwoPhaseFileHeader or the COMMIT PREPARED sql
command, which is far too late to be syncing new relation files.  (A crash may
have already destroyed their data.)  PrepareTransaction(), which implements
the PREPARE TRANSACTION command, is the right place for these syncs.

A failure in these new syncs needs to prevent the transaction from being
marked committed.  Hence, in CommitTransaction(), these new syncs need to
happen after the last step that could assign a new relfilenode and
before RecordTransactionCommit().  I suspect it's best to do it after
PreCommit_on_commit_actions() and before AtEOXact_LargeObject().
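In code order, that is roughly: ..., PreCommit_on_commit_actions(), then
the new pending-sync execution, then AtEOXact_LargeObject(true), ...,
RecordTransactionCommit().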

> > On Thu, Jul 25, 2019 at 10:39:36AM +0900, Kyotaro Horiguchi wrote:
> > > --- a/src/backend/catalog/storage.c
> > > +++ b/src/backend/catalog/storage.c
> > > @@ -428,21 +450,34 @@ smgrDoPendingDeletes(bool isCommit)
> ...
> > > + smgrimmedsync(srel, MAIN_FORKNUM);
> > 
> > This should sync all forks, not just MAIN_FORKNUM.  Code that writes WAL for
> > FSM_FORKNUM and VISIBILITYMAP_FORKNUM checks RelationNeedsWAL().  There may 
> > be
> > no bug today, but it's conceptually wrong to make RelationNeedsWAL() return
> > false due to this code, use RelationNeedsWAL() for multiple forks, and then
> > not actually sync all forks.
> 
> I agree that all forks needs syncing, but FSM and VM are checking
> RelationNeedsWAL(modified). To make sure, are you suggesting to
> sync all forks instead of emitting WAL for them, or suggesting
> that VM and FSM to emit WALs even when the modified
> RelationNeedsWAL returns false (+ sync all forks)?

I hadn't thought that far.  What do you think is best?

> > The https://postgr.es/m/559fa0ba.3080...@iki.fi design had another component
> > not appearing here.  It said, "Instead, at COMMIT, we'd fsync() the 
> > relation,
> > or if it's smaller than some threshold, WAL-log the contents of the whole 
> > file
> > at that point."  Please write the part to WAL-log the contents of small 
> > files
> > instead of syncing them.
> 
> I'm not sure the point of the behavior. I suppose that the "log"
> is a sequence of new_page records. It also needs to be synced and
> it is always larger than the file to be synced. I can't think of
> an appropriate threshold without the point.

Yes, it would be a sequence of new-page records.  FlushRelationBuffers() locks
every buffer header containing a buffer of the current database.  The belief
has been that writing one page to xlog is cheaper than FlushRelationBuffers()
in a busy system with large shared_buffers.
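
Concretely, one possible shape: at commit, for each relation registered
for a pending sync, check its size with smgrnblocks(); if it is at or
below some threshold (perhaps a new GUC), write the pages out as
log_newpage() records instead of syncing; otherwise smgrimmedsync()
every fork.  The threshold then trades one full-page-sized WAL record
per block against scanning shared_buffers in FlushRelationBuffers()
plus the fsync itself.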

> > > --- a/src/backend/commands/copy.c
> > > +++ b/src/backend/commands/copy.c
> > > @@ -2725,28 +2722,9 @@ CopyFrom(CopyState cstate)
> > >* If it does commit, we'll have done the table_finish_bulk_insert() at
> > >* the bottom of this routine first.
> > >*
> > > -  * As mentioned in comments in utils/rel.h, the in-same-transaction test
> > > -  * is not always set correctly, since in rare cases 
> > > rd_newRelfilenodeSubid
> > > -  * can be cleared before the end of the transaction. The exact case is
> > > -  * when a relation sets a new relfilenode twice in same transaction, yet
> > > -  * the second one fails in an aborted subtransaction, e.g.
> > > -  *
> > > -  * BEGIN;
> > > -  * TRUNCATE t;
> > > -  * SAVEPOINT save;
> > > -  * TRUNCATE t;
> > > -  * ROLLBACK TO save;
> > > -  * COPY ...
> > 
> > The comment material being deleted is still correct, so don't delete it.
> > Moreover, the code managing rd_firstRelfilenodeSubid has a similar bug.  The
> > attached patch adds an assertion that RelationNeedsWAL() and the
> > pendingDeletes array have the same opinion about the relfilenode, and it
> > expands a test case to fail that assertion.
> 
> (Un?)Fortunately, that doesn't fail.. (with rebased version on
> the recent master) I'll recheck that tomorrow.

Did you build with --enable-cassert?

> > > --- a/src/include/utils/rel.h
> > > +++ b/src/include/utils/rel.h
> > > @@ -74,11 +74,13 @@ typedef struct RelationData
> > >   SubTransactionId rd_createSubid;/* rel was created in current 
> > > xact */
> > >   SubTransactionId rd_newRelfilenodeSubid;/* new relfilenode 
> > > assigned in
> > >   
> > >  * current xact */
> > > + SubTransactionId rd_firstRelfilenodeSubid;  /* new relfilenode 
> > > assigned
> > > + 
> > >  * first in current xact