Re: Internal error XX000 with enable_partition_pruning=on, pg 11 beta1 on Debian

2018-08-07 Thread Amit Langote
On 2018/08/08 8:09, Tom Lane wrote:
> Rushabh Lathia  writes:
>> Consider the below case:
> 
> I initially thought the rule might be messing stuff up, but you can get
> the same result without the rule by writing out the transformed query
> by hand:
> 
> regression=# explain UPDATE pt_p1 SET a = 3 from pt
>   WHERE pt.a = 2 and pt.a = pt_p1.a;
> ERROR:  child rel 2 not found in append_rel_array
> 
> With enable_partition_pruning=off this goes through without an error.
> 
> I suspect the join pruning stuff is getting confused by the overlap
> between the two partitioning trees involved in the join; although the
> fact that one of them is the target rel must be related too, because
> if you just write a SELECT for this join it's fine.
> 
> I rather doubt that this case worked before 1b54e91fa ... no time
> to look closer today, though.

The code pointed out by Rushabh indeed seems to be the culprit in this case.

/*
 * The prunequal is presented to us as a qual for 'parentrel'.
 * Frequently this rel is the same as targetpart, so we can skip
 * an adjust_appendrel_attrs step.  But it might not be, and then
 * we have to translate.  We update the prunequal parameter here,
 * because in later iterations of the loop for child partitions,
 * we want to translate from parent to child variables.
 */
if (parentrel != subpart)
{
    int         nappinfos;
    AppendRelInfo **appinfos = find_appinfos_by_relids(root,
                                                       subpart->relids,
                                                       &nappinfos);

    prunequal = (List *) adjust_appendrel_attrs(root, (Node *)
                                                prunequal,
                                                nappinfos,
                                                appinfos);
    pfree(appinfos);
}

This code is looking for the case where we have to translate prunequal's
Vars from the UNION ALL parent's varno to the actual partitioned parent's
varno, but the detection test (if (parentrel != subpart)) turns out to be
unreliable, as shown by this report.  I think the test should be
if (parentrel->relid != subpart->relid).  Comparing pointers as is done now
is fine without inheritance_planner being involved, but not when it *is*
involved, because of the following piece of code in it that overwrites
RelOptInfos:

/*
 * We need to collect all the RelOptInfos from all child plans into
 * the main PlannerInfo, since setrefs.c will need them.  We use the
 * last child's simple_rel_array (previous ones are too short), so we
 * have to propagate forward the RelOptInfos that were already built
 * in previous children.
 */
Assert(subroot->simple_rel_array_size >= save_rel_array_size);
for (rti = 1; rti < save_rel_array_size; rti++)
{
    RelOptInfo *brel = save_rel_array[rti];

    if (brel)
        subroot->simple_rel_array[rti] = brel;
}
save_rel_array_size = subroot->simple_rel_array_size;
save_rel_array = subroot->simple_rel_array;
save_append_rel_array = subroot->append_rel_array;

With this, the RelOptInfos that were used to create the Paths currently
under the subroot's final rel's best path are no longer accessible through
that subroot, because they have been overwritten by the corresponding ones
from save_rel_array.  Subsequently, create_modifytable_plan() passes the
'subroot' to create_plan_recurse(), which recurses down to
make_partitionedrel_pruneinfo() via create_append_plan().  The 'parentrel'
RelOptInfo that's fetched off of the AppendPath is no longer reachable from
'subroot', because it has been overwritten as mentioned above.  'subpart',
the RelOptInfo (of the same RT index) fetched from 'subroot', is thus not
the same as 'parentrel'.  So the if (parentrel != subpart) test is
mistakenly satisfied, leading to the failure to find the needed
AppendRelInfo, which makes sense, because 'subpart' is not really a child
of anything.

Attached is a patch which modifies the if test to compare relids instead
of RelOptInfo pointers.

Thanks,
Amit
diff --git a/src/backend/partitioning/partprune.c b/src/backend/partitioning/partprune.c
index 752810d0e4..67f0dc5e59 100644
--- a/src/backend/partitioning/partprune.c
+++ b/src/backend/partitioning/partprune.c
@@ -384,7 +384,7 @@ make_partitionedrel_pruneinfo(PlannerInfo *root, RelOptInfo *parentrel,
 		 * because in later iterations of the loop for child partitions,
 		 * we want to translate from parent to child variables.
 		 */
-		if (parentrel != subpart)
+		if (parentrel->relid != subpart->relid)
 		{
 			int			nappinfos;
 			AppendRelInfo **appinfos = find_appinfos_by_relids(root,


Re: Early WIP/PoC for inlining CTEs

2018-08-07 Thread Andres Freund
Hi,

On 2018-08-08 16:55:22 +1200, Thomas Munro wrote:
> On Fri, Jul 27, 2018 at 8:10 PM, David Fetter  wrote:
> > On Fri, Jul 27, 2018 at 02:55:26PM +1200, Thomas Munro wrote:
> >> On Thu, Jul 26, 2018 at 7:14 AM, David Fetter  wrote:
> >> > Please find attached the next version, which passes 'make check'.
> >>
> >> ... but not 'make check-world' (contrib/postgres_fdw's EXPLAIN is 
> >> different).
> >
> > Please find attached a patch that does.
> >
> > It doesn't always pass make installcheck-world, but I need to sleep
> > rather than investigate that at the moment.
> 
> One observation I wanted to share: CTE scans inhibit parallelism today
> (something we might eventually want to fix with shared tuplestores).
> This patch therefore allows parallelism in some WITH queries, which
> seems like a very valuable thing.

Might be interesting to see how big a difference it makes for
TPC-DS. Currently the results are bad (as in many queries don't finish
in a relevant time) because it uses CTEs so widely, and there are often
predicates outside the CTE that could be pushed down.

- Andres



Re: Early WIP/PoC for inlining CTEs

2018-08-07 Thread Thomas Munro
On Fri, Jul 27, 2018 at 8:10 PM, David Fetter  wrote:
> On Fri, Jul 27, 2018 at 02:55:26PM +1200, Thomas Munro wrote:
>> On Thu, Jul 26, 2018 at 7:14 AM, David Fetter  wrote:
>> > Please find attached the next version, which passes 'make check'.
>>
>> ... but not 'make check-world' (contrib/postgres_fdw's EXPLAIN is different).
>
> Please find attached a patch that does.
>
> It doesn't always pass make installcheck-world, but I need to sleep
> rather than investigate that at the moment.

One observation I wanted to share: CTE scans inhibit parallelism today
(something we might eventually want to fix with shared tuplestores).
This patch therefore allows parallelism in some WITH queries, which
seems like a very valuable thing.   Example:

postgres=# create table foo as select generate_series(1, 100) i;
SELECT 100
postgres=# create table bar as select generate_series(1, 100) i;
SELECT 100
postgres=# create table baz as select generate_series(1, 100) i;
SELECT 100
postgres=# analyze;
ANALYZE

=== unpatched master ===

postgres=# explain analyze with cte as (select * from foo join bar
using (i)) select count(*) from cte join baz using (i);
QUERY PLAN
---
 Aggregate  (cost=149531.00..149531.01 rows=1 width=8) (actual
time=4400.951..4400.951 rows=1 loops=1)
   CTE cte
 ->  Hash Join  (cost=30832.00..70728.00 rows=100 width=4)
(actual time=551.243..1961.319 rows=100 loops=1)
   Hash Cond: (foo.i = bar.i)
   ->  Seq Scan on foo  (cost=0.00..14425.00 rows=100
width=4) (actual time=0.048..219.238 rows=100 loops=1)
   ->  Hash  (cost=14425.00..14425.00 rows=100 width=4)
(actual time=550.477..550.478 rows=100 loops=1)
 Buckets: 131072  Batches: 16  Memory Usage: 3227kB
 ->  Seq Scan on bar  (cost=0.00..14425.00
rows=100 width=4) (actual time=0.031..213.238 rows=100
loops=1)
   ->  Hash Join  (cost=30832.00..76303.00 rows=100 width=0)
(actual time=1090.162..4279.945 rows=100 loops=1)
 Hash Cond: (cte.i = baz.i)
 ->  CTE Scan on cte  (cost=0.00..2.00 rows=100
width=4) (actual time=551.247..2564.529 rows=100 loops=1)
 ->  Hash  (cost=14425.00..14425.00 rows=100 width=4)
(actual time=538.833..538.833 rows=100 loops=1)
   Buckets: 131072  Batches: 16  Memory Usage: 3227kB
   ->  Seq Scan on baz  (cost=0.00..14425.00 rows=100
width=4) (actual time=0.039..208.658 rows=100 loops=1)
 Planning Time: 0.291 ms
 Execution Time: 4416.732 ms
(16 rows)

=== 0001-Inlining-CTEs-v0005.patch  ===

postgres=# explain analyze with cte as (select * from foo join bar
using (i)) select count(*) from cte join baz using (i);

QUERY PLAN

 Finalize Aggregate  (cost=57854.78..57854.79 rows=1 width=8) (actual
time=1441.663..1441.664 rows=1 loops=1)
   ->  Gather  (cost=57854.57..57854.78 rows=2 width=8) (actual
time=1440.506..1474.974 rows=3 loops=1)
 Workers Planned: 2
 Workers Launched: 2
 ->  Partial Aggregate  (cost=56854.57..56854.58 rows=1
width=8) (actual time=1435.017..1435.018 rows=1 loops=3)
   ->  Parallel Hash Join  (cost=30856.01..55812.90
rows=416667 width=0) (actual time=1135.164..1393.437 rows=33
loops=3)
 Hash Cond: (foo.i = baz.i)
 ->  Parallel Hash Join  (cost=15428.00..32202.28
rows=416667 width=8) (actual time=457.786..753.374 rows=33
loops=3)
   Hash Cond: (foo.i = bar.i)
   ->  Parallel Seq Scan on foo
(cost=0.00..8591.67 rows=416667 width=4) (actual time=0.094..87.666
rows=33 loops=3)
   ->  Parallel Hash  (cost=8591.67..8591.67
rows=416667 width=4) (actual time=217.222..217.222 rows=33
loops=3)
 Buckets: 131072  Batches: 16  Memory
Usage: 3520kB
 ->  Parallel Seq Scan on bar
(cost=0.00..8591.67 rows=416667 width=4) (actual time=0.061..84.631
rows=33 loops=3)
 ->  Parallel Hash  (cost=8591.67..8591.67
rows=416667 width=4) (actual time=227.240..227.241 rows=33
loops=3)
   Buckets: 131072  Batches: 16  Memory Usage: 3520kB
   ->  Parallel Seq Scan on baz
(cost=0.00..8591.67 rows=416667 width=4) (actual time=0.060..84.270
rows=33 loops=3)
 Planning Time: 0.407 ms
 Execution Time: 1475.113 ms
(18 rows)

-- 
Thomas Munro
http://www.enterprisedb.com



Why do we expand tuples in execMain.c?

2018-08-07 Thread Andres Freund
Hi,

I noticed
if (HeapTupleHeaderGetNatts(tuple.t_data) <
    RelationGetDescr(erm->relation)->natts)
{
    copyTuple = heap_expand_tuple(&tuple,
                                  RelationGetDescr(erm->relation));
}
else
{
    /* successful, copy tuple */
    copyTuple = heap_copytuple(&tuple);
}

in EvalPlanQualFetchRowMarks, and I'm somewhat confused why it's there?
If it's required here, why isn't it required in dozens of other places?

Greetings,

Andres Freund



Re: Shared buffer access rule violations?

2018-08-07 Thread Peter Geoghegan
On Tue, Aug 7, 2018 at 6:43 PM, Asim R P  wrote:
> Please find attached a patch to mark a shared buffer as read-write or
> read-only using mprotect().  The idea is to catch violations of shared
> buffer access rules.  This patch was useful to detect the access
> violations reported in this thread.  The mprotect() calls are enabled
> by -DMPROTECT_BUFFER compile time flag.

I wonder if it would be a better idea to enable Valgrind's memcheck to
mark buffers as read-only or read-write. We've considered doing
something like that for years, but for whatever reason nobody followed
through.

Using Valgrind would have the advantage of making it possible to mark
memory as undefined or as noaccess. I can imagine doing even fancier
things by making that distinction within buffers. For example, marking
the hole in the middle of a page/buffer as undefined, while still
marking the entire buffer noaccess when there is no pin held.
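
Just to sketch what that could look like (helper names made up; the
client-request macros come from <valgrind/memcheck.h> and are cheap no-ops
when the binary isn't actually running under Valgrind):

#include "postgres.h"			/* for BLCKSZ */

#ifdef USE_VALGRIND
#include <valgrind/memcheck.h>
#endif

/*
 * Hypothetical helpers: hide a buffer's block from memcheck while no pin
 * is held, and expose it again once it is pinned.  The hole in the middle
 * of a page could similarly be marked with VALGRIND_MAKE_MEM_UNDEFINED
 * while the rest of the buffer stays accessible.
 */
static inline void
BufferBlockNoAccess(void *block)
{
#ifdef USE_VALGRIND
	VALGRIND_MAKE_MEM_NOACCESS(block, BLCKSZ);
#else
	(void) block;
#endif
}

static inline void
BufferBlockAccessible(void *block)
{
#ifdef USE_VALGRIND
	VALGRIND_MAKE_MEM_DEFINED(block, BLCKSZ);
#else
	(void) block;
#endif
}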

-- 
Peter Geoghegan



Re: Shared buffer access rule violations?

2018-08-07 Thread Asim R P
Please find attached a patch to mark a shared buffer as read-write or
read-only using mprotect().  The idea is to catch violations of shared
buffer access rules.  This patch was useful to detect the access
violations reported in this thread.  The mprotect() calls are enabled
by -DMPROTECT_BUFFER compile time flag.

Asim
From ed7dcf633600b3a527ed52ffacd1b779da8b0235 Mon Sep 17 00:00:00 2001
From: Asim R P 
Date: Wed, 18 Jul 2018 18:32:40 -0700
Subject: [PATCH 1/2] Facility to detect shared buffer access violations

Using mprotect() to allow/disallow read/write access to one or more
shared buffers, this patch enables detection of shared buffer access
violations.  A new compile time flag "MPROTECT_BUFFERS"
(CFLAGS=-DMPROTECT_BUFFERS) is introduced to enable the detection.

A couple of violations have already been caught and fixed using this
facility: 130beba36d6dd46b8c527646f9f2433347cbfb11
---
 src/backend/storage/buffer/buf_init.c |  31 +++
 src/backend/storage/buffer/bufmgr.c   | 122 ++
 src/backend/storage/ipc/shmem.c   |  18 
 3 files changed, 153 insertions(+), 18 deletions(-)

diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index 144a2cee6f..22e3d2821c 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -14,6 +14,11 @@
  */
 #include "postgres.h"
 
+#ifdef MPROTECT_BUFFERS
+#include 
+#include "miscadmin.h"
+#endif
+
 #include "storage/bufmgr.h"
 #include "storage/buf_internals.h"
 
@@ -24,6 +29,28 @@ LWLockMinimallyPadded *BufferIOLWLockArray = NULL;
 WritebackContext BackendWritebackContext;
 CkptSortItem *CkptBufferIds;
 
+#ifdef MPROTECT_BUFFERS
+/*
+ * Protect the entire shared buffers region such that neither read nor write
+ * is allowed.  Protection will change for specific buffers when accessed
+ * through buffer manager's interface.  The intent is to catch violation of
+ * buffer access rules.
+ */
+static void
+ProtectMemoryPoolBuffers()
+{
+	Size bufferBlocksTotalSize = mul_size((Size)NBuffers, (Size) BLCKSZ);
+	if (IsUnderPostmaster && IsNormalProcessingMode() &&
+		mprotect(BufferBlocks, bufferBlocksTotalSize, PROT_NONE))
+	{
+		ereport(ERROR,
+(errmsg("unable to set memory level to %d, error %d, "
+		"allocation size %ud, ptr %ld", PROT_NONE,
+		errno, (unsigned int) bufferBlocksTotalSize,
+		(long int) BufferBlocks)));
+	}
+}
+#endif
 
 /*
  * Data Structures:
@@ -149,6 +176,10 @@ InitBufferPool(void)
 	/* Initialize per-backend file flush context */
 	WritebackContextInit(&BackendWritebackContext,
 						 &backend_flush_after);
+
+#ifdef MPROTECT_BUFFERS
+	ProtectMemoryPoolBuffers();
+#endif
 }
 
 /*
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 01eabe5706..2ef3c75b6a 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -31,6 +31,9 @@
 #include "postgres.h"
 
 #include 
+#ifdef MPROTECT_BUFFERS
+#include 
+#endif
 #include 
 
 #include "access/xlog.h"
 static PrivateRefCountEntry *GetPrivateRefCountEntry(Buffer buffer, bool do_move);
 static inline int32 GetPrivateRefCount(Buffer buffer);
 static void ForgetPrivateRefCountEntry(PrivateRefCountEntry *ref);
 
+#ifdef MPROTECT_BUFFERS
+#define ShouldMemoryProtect(buf) (IsUnderPostmaster &&		  \
+  IsNormalProcessingMode() && \
+  !BufferIsLocal(buf->buf_id+1) && \
+  !BufferIsInvalid(buf->buf_id+1))
+
+static inline void BufferMProtect(volatile BufferDesc *bufHdr, int protectionLevel)
+{
+	if (ShouldMemoryProtect(bufHdr))
+	{
+		if (mprotect(BufHdrGetBlock(bufHdr), BLCKSZ, protectionLevel))
+		{
+			ereport(ERROR,
+	(errmsg("unable to set memory level to %d, error %d, "
+			"block size %d, ptr %ld", protectionLevel,
			errno, BLCKSZ, (long int) BufHdrGetBlock(bufHdr))));
+		}
+	}
+}
+#endif
+
+static inline void ReleaseContentLock(volatile BufferDesc *buf)
+{
+	LWLockRelease(BufferDescriptorGetContentLock(buf));
+
+#ifdef MPROTECT_BUFFERS
+	/* make the buffer read-only after releasing content lock */
+	if (!LWLockHeldByMe(BufferDescriptorGetContentLock(buf)))
+		BufferMProtect(buf, PROT_READ);
+#endif
+}
+
+
+static inline void AcquireContentLock(volatile BufferDesc *buf, LWLockMode mode)
+{
+#ifdef MPROTECT_BUFFERS
+	const bool newAcquisition =
+		!LWLockHeldByMe(BufferDescriptorGetContentLock(buf));
+
+	LWLockAcquire(BufferDescriptorGetContentLock(buf), mode);
+
+	/* new acquisition of content lock, allow read/write memory access */
+	if (newAcquisition)
+		BufferMProtect(buf, PROT_READ|PROT_WRITE);
+#else
+	LWLockAcquire(BufferDescriptorGetContentLock(buf), mode);
+#endif
+}
+
+static inline bool ConditionalAcquireContentLock(volatile BufferDesc *buf,
+ LWLockMode mode)
+{
+#ifdef MPROTECT_BUFFERS
+	const bool newAcquisition =
+		!LWLockHeldByMe(BufferDescriptorGetContentLock(buf));
+
+	if (LWLockConditionalAcquire(BufferDescriptorGetContentLock(buf), mode))
+	{
+		/* new 

Re: Internal error XX000 with enable_partition_pruning=on, pg 11 beta1 on Debian

2018-08-07 Thread Tom Lane
Rushabh Lathia  writes:
> Consider the below case:

I initially thought the rule might be messing stuff up, but you can get
the same result without the rule by writing out the transformed query
by hand:

regression=# explain UPDATE pt_p1 SET a = 3 from pt
  WHERE pt.a = 2 and pt.a = pt_p1.a;
ERROR:  child rel 2 not found in append_rel_array

With enable_partition_pruning=off this goes through without an error.

I suspect the join pruning stuff is getting confused by the overlap
between the two partitioning trees involved in the join; although the
fact that one of them is the target rel must be related too, because
if you just write a SELECT for this join it's fine.

I rather doubt that this case worked before 1b54e91fa ... no time
to look closer today, though.

regards, tom lane



Re: Page freezing, FSM, and WAL replay

2018-08-07 Thread Tom Lane
Michael Paquier  writes:
> On Tue, Aug 07, 2018 at 12:19:13PM -0400, Alvaro Herrera wrote:
>> Now that the minors have been tagged, I'm considering pushing this
>> shortly to 9.6 - master.

> I may be missing something, but the next round of minor releases is not
> tagged yet, and only stamped.  So I think that you should hold on a bit
> more.

I've not heard any complaints from packagers, so I'm expecting to apply
the tags shortly.  I don't see any reason Alvaro can't go ahead when
he's ready.

regards, tom lane



Re: Make foo=null a warning by default.

2018-08-07 Thread Bruce Momjian
On Mon, Jul 16, 2018 at 11:37:28AM -0400, Tom Lane wrote:
> Heikki Linnakangas  writes:
> > On 16/07/18 18:10, Tom Lane wrote:
> >> TBH I'm not really excited about investing any work in this area at all.
> >> Considering how seldom we hear any questions about transform_null_equals
> >> anymore[1], I'm wondering if we couldn't just rip the "feature" out
> >> entirely.
> 
> > Yeah, I was wondering about that too. But Fabien brought up a completely 
> > new use-case for this: people learning SQL. For beginners who don't 
> > understand the behavior of NULLs yet, I can see a warning or error being 
> > useful training wheels. Perhaps a completely new "training_wheels=on" 
> > option, which could check for many other beginner errors, too, would 
> > be better for that.
> 
> Agreed --- but what we'd want that to do seems only vaguely related to
> the existing behavior of transform_null_equals.  As an example, we
> intentionally made transform_null_equals *not* trigger on
> 
>   CASE x WHEN NULL THEN ...
> 
> but a training-wheels warning for that would likely be reasonable.
> 
> For that matter, many of the old threads about this are complaining
> about nulls that aren't simple literals in the first place.  I wonder
> whether a training-wheels feature that whined *at runtime* about null
> WHERE-qual or case-test results would be more useful than a parser
> check.

I will again say I would love to see this as part of a wholesale
"novice" mode which warns of generally bad SQL practices.  I don't see
this one item alone as sufficiently useful.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +



Re: Page freezing, FSM, and WAL replay

2018-08-07 Thread Michael Paquier
On Tue, Aug 07, 2018 at 12:19:13PM -0400, Alvaro Herrera wrote:
> Now that the minors have been tagged, I'm considering pushing this
> shortly to 9.6 - master.

I may be missing something, but the next round of minor releases is not
tagged yet, and only stamped.  So I think that you should hold on a bit
more.
--
Michael


signature.asc
Description: PGP signature


Re: Negotiating the SCRAM channel binding type

2018-08-07 Thread Heikki Linnakangas

On 07/08/18 22:34, Bruce Momjian wrote:
> On Thu, Jul 12, 2018 at 11:26:30AM +0300, Heikki Linnakangas wrote:
>> On 12/07/18 07:14, Michael Paquier wrote:
>>> On Wed, Jul 11, 2018 at 03:01:03PM +0300, Heikki Linnakangas wrote:
>>>> I started digging into this more closely, and ran into a little problem. If
>>>> channel binding is not used, the client sends a flag to the server to
>>>> indicate if it's because the client doesn't support channel binding, or
>>>> because it thought that the server doesn't support it. This is the
>>>> gs2-cbind-flag. How should the flag be set, if the server supports channel
>>>> binding type A, while client supports only type B? The purpose of the flag
>>>> is to detect downgrade attacks, where a man-in-the-middle removes the PLUS
>>>> variants from the list of supported mechanisms. If we treat incompatible
>>>> channel binding types as "client doesn't support channel binding", then the
>>>> downgrade attack becomes possible (the attacker can replace the legit PLUS
>>>> variants with bogus channel binding types that the client surely doesn't
>>>> support). If we treat it as "server doesn't support channel binding", then
>>>> we won't downgrade to the non-channel-binding variant, in the legitimate
>>>> case that the client and server both support channel binding, but with
>>>> incompatible types.
>>>>
>>>> What we'd really want, is to include the full list of server's supported
>>>> mechanisms as part of the exchange, not just a boolean "y/n" flag. But
>>>> that's not what the spec says :-(.
>>>
>>> Let's not break the spec :)  I understand from it that the client is in
>>> charge of the choice, so we are rather free to choose the reaction the
>>> client should have.  In the second phase of the exchange, the client
>>> communicates back to the server the channel binding it has decided to
>>> choose, it is not up to the server to pick up one if the client thinks
>>> that it can use multiple ones.
>>
>> The case where the client and the server both support the same channel
>> binding type, we're OK. The problematic case is when e.g. the server only
>> supports tls-unique and the client only supports tls-server-end-point. What
>> we would (usually) like to happen, is to fall back to not using channel
>> binding. But it's not clear how to make that work, and still protect from
>> downgrade attacks. If the client responds "y", meaning "the client supports
>> channel binding, but it looks like the server doesn't", then the server will
>> reject the authentication, because it did actually support channel binding.
>> Just not the same one that the client did. If the client could reply "y",
>> and also echo back the server's list of supported channel bindings in the
>> same message, that would solve the problem. But the spec doesn't do that.
>
> I know this is an academic question now, but I am not sure this is true.
> A man-in-the-middle attacker could say they don't support channel
> binding to the real client and real server and pretend to be the real
> server.  What would work is to hash the secret in with the supported
> channel binding list.  This is how TLS works --- all previous messages
> are combined with the secret into a transmitted hash to prevent a MITM
> from changing it.


Yeah, that is what I meant. Currently, when the client chooses not to use
channel binding, it then sends a single flag, y/n, to indicate whether it
thinks the server supports channel binding or not. That flag is included
in the hashes used in the authentication, so if a MITM tries to change
it, the authentication will fail. If, instead of a single flag, it
included the list of channel binding types supported by the server, that
would solve the problem with supporting multiple channel binding types.
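
To make that concrete: the gs2-cbind-flag is the very first field of the
SCRAM client-first-message, so (with a made-up nonce, and the empty n=
username that libpq sends because the user name comes from the startup
packet) the three cases look roughly like

    n,,n=,r=rOprNGfwEbeRWgbNEkqO                        client cannot do channel binding
    y,,n=,r=rOprNGfwEbeRWgbNEkqO                        client can, but thinks the server cannot
    p=tls-server-end-point,,n=,r=rOprNGfwEbeRWgbNEkqO   channel binding in use

The gs2-header is echoed, base64-encoded, in the c= attribute of the
client-final-message, so it is covered by the signatures that both sides
verify; the idea above would extend that covered data with the server's
list of channel binding types.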


- Heikki



Re: Negotiating the SCRAM channel binding type

2018-08-07 Thread Bruce Momjian
On Thu, Jul 12, 2018 at 11:26:30AM +0300, Heikki Linnakangas wrote:
> On 12/07/18 07:14, Michael Paquier wrote:
> >On Wed, Jul 11, 2018 at 03:01:03PM +0300, Heikki Linnakangas wrote:
> >>I started digging into this more closely, and ran into a little problem. If
> >>channel binding is not used, the client sends a flag to the server to
> >>indicate if it's because the client doesn't support channel binding, or
> >>because it thought that the server doesn't support it. This is the
> >>gs2-cbind-flag. How should the flag be set, if the server supports channel
> >>binding type A, while client supports only type B? The purpose of the flag
> >>is to detect downgrade attacks, where a man-in-the-middle removes the PLUS
> >>variants from the list of supported mechanisms. If we treat incompatible
> >>channel binding types as "client doesn't support channel binding", then the
> >>downgrade attack becomes possible (the attacker can replace the legit PLUS
> >>variants with bogus channel binding types that the client surely doesn't
> >>support). If we treat it as "server doesn't support channel binding", then
> >>we won't downgrade to the non-channel-binding variant, in the legitimate
> >>case that the client and server both support channel binding, but with
> >>incompatible types.
> >>
> >>What we'd really want, is to include the full list of server's supported
> >>mechanisms as part of the exchange, not just a boolean "y/n" flag. But
> >>that's not what the spec says :-(.
> >
> >Let's not break the spec :)  I understand from it that the client is in
> >charge of the choice, so we are rather free to choose the reaction the
> >client should have.  In the second phase of the exchange, the client
> >communicates back to the server the channel binding it has decided to
> >choose, it is not up to the server to pick up one if the client thinks
> >that it can use multiple ones.
> 
> The case where the client and the server both support the same channel
> binding type, we're OK. The problematic case is when e.g. the server only
> supports tls-unique and the client only supports tls-server-end-point. What
> we would (usually) like to happen, is to fall back to not using channel
> binding. But it's not clear how to make that work, and still protect from
> downgrade attacks. If the client responds "y", meaning "the client supports
> channel binding, but it looks like the server doesn't", then the server will
> reject the authentication, because it did actually support channel binding.
> Just not the same one that the client did. If the client could reply "y",
> and also echo back the server's list of supported channel bindings in the
> same message, that would solve the problem. But the spec doesn't do that.


I know this is an academic question now, but I am not sure this is true.
A man-in-the-middle attacker could say they don't support channel
binding to the real client and real server and pretend to be the real
server.  What would work is to hash the secret in with the supported
channel binding list.  This is how TLS works --- all previous messages
are combined with the secret into a transmitted hash to prevent a MITM
from changing it.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +



Re: Page freezing, FSM, and WAL replay

2018-08-07 Thread Alvaro Herrera
On 2018-Aug-02, Alvaro Herrera wrote:

> After considering several possible solutions, I propose to have
> heap_xlog_visible compute free space for any page being marked frozen;
> Pavan adds to that to have heap_xlog_clean compute free space for all
> pages also.  This means that if we later promote this standby and VACUUM
> skips all-frozen pages, their FSM numbers are going to be up-to-date
> anyway.  Patch attached.

Now that the minors have been tagged, I'm considering pushing this
shortly to 9.6 - master.

Thanks

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: pg_dump: sortDumpableObjectsByTypeName() doesn't always do that

2018-08-07 Thread Stephen Frost
Greetings,

* Tom Lane (t...@sss.pgh.pa.us) wrote:
> Jacob Champion  writes:
> > Thanks! We have made some progress towards a repro but we're having
> > problems putting it into the pg_dump suite.
> 
> Yeah, I find the pg_dump test suite to be pretty much unreadable too.
> Don't worry about it --- I don't think we need to memorialize a test
> case for this.

This would be a similar case to the test:

COPY fk_reference_test_table second

and really wouldn't be very difficult to add, you just have to realize
that the regexp has be set up to match the output of pg_dump and that's
it's a multi-line one, as that test case is.

For my 2c, this seems like a pretty reasonable test to add and wouldn't
be very hard to do.  If you'd like help, please reach out.

Thanks!

Stephen


signature.asc
Description: PGP signature


Re: pg_dump: sortDumpableObjectsByTypeName() doesn't always do that

2018-08-07 Thread Tom Lane
Jacob Champion  writes:
> Thanks! We have made some progress towards a repro but we're having
> problems putting it into the pg_dump suite.

Yeah, I find the pg_dump test suite to be pretty much unreadable too.
Don't worry about it --- I don't think we need to memorialize a test
case for this.

> Here's a simple case for you:

It is nice to have that in case anyone wants to verify the fix manually,
though.  Thanks!

regards, tom lane



Re: pg_dump: sortDumpableObjectsByTypeName() doesn't always do that

2018-08-07 Thread Jacob Champion
On Tue, Aug 7, 2018 at 10:24 AM Tom Lane  wrote:
> I don't see any reason to insist on a test case before pushing this
> fix, so I did that.  (As I expected, the fix doesn't change any existing
> regression test results.)

Thanks! We have made some progress towards a repro but we're having
problems putting it into the pg_dump suite. Here's a simple case for
you:

CREATE USER me;
CREATE SCHEMA a;
ALTER DEFAULT PRIVILEGES IN SCHEMA a GRANT UPDATE ON TABLES TO me;
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON SEQUENCES TO me;
ALTER DEFAULT PRIVILEGES GRANT SELECT ON SEQUENCES TO me;

We dumped this from a newly created database (commit e0ee9305 from
master) and got the following order:

ALTER DEFAULT PRIVILEGES ... IN SCHEMA public REVOKE ALL ON SEQUENCES...
ALTER DEFAULT PRIVILEGES ... IN SCHEMA public GRANT SELECT ON SEQUENCES...
...
ALTER DEFAULT PRIVILEGES ... GRANT SELECT ON SEQUENCES...
...
ALTER DEFAULT PRIVILEGES ... IN SCHEMA a REVOKE ALL ON TABLES...
ALTER DEFAULT PRIVILEGES ... IN SCHEMA a GRANT UPDATE ON TABLES...

In this case, schema a should dump before public. If you switch the
first two ALTER DEFAULT PRIVILEGES lines in the reproduction SQL, you
should get a different (correct) ordering.

--Jacob



Re: Typo in doc or wrong EXCLUDE implementation

2018-08-07 Thread Bruce Momjian


This email was sent to docs, but I think it is a hackers issue.  The
person is asking why exclusion constraints aren't marked as UNIQUE
indexes that can be used for referential integrity.  I think the reason
is that non-equality exclusion constraints, like those preventing overlap,
don't uniquely identify a specific value, and I don't think we want to
auto-UNIQUE just for equality exclusion constraints.

---

On Tue, Jul 10, 2018 at 09:34:36AM +, PG Doc comments form wrote:
> The following documentation comment has been logged on the website:
> 
> Page: https://www.postgresql.org/docs/10/static/sql-createtable.html
> Description:
> 
> Hi.
> 
> https://www.postgresql.org/docs/current/static/sql-createtable.html#sql-createtable-exclude
> If all of the specified operators test for equality, this is equivalent to a
> UNIQUE constraint
> 
> Exclusion constraints are implemented using an index
> 
> 
> ALTER TABLE person
>   add constraint person_udx_person_id2
>   EXCLUDE USING gist (
> person_id WITH = 
>   ) 
> ;
> 
> tucha=> ALTER TABLE "person_x_person" ADD CONSTRAINT
> "person_x_person_fk_parent_person_id"
> tucha->   FOREIGN KEY ("parent_person_id")
> tucha->   REFERENCES "person" ("person_id")
> tucha->   ON DELETE CASCADE ON UPDATE NO ACTION DEFERRABLE;
> ERROR:  there is no unique constraint matching given keys for referenced
> table "person"
> 
> because gist does not support unique indexes, I try with 'btree'
> 
> 
> ALTER TABLE person
>   add constraint person_udx_person_id2
>   EXCLUDE USING btree (
> person_id WITH =
>   )
> ;
> 
> \d person
> ...
> "person_udx_person_id2" EXCLUDE USING btree (person_id WITH =)
> 
> tucha=> ALTER TABLE "person_x_person" ADD CONSTRAINT
> "person_x_person_fk_parent_person_id"
> tucha->   FOREIGN KEY ("parent_person_id")
> tucha->   REFERENCES "person" ("person_id")
> tucha->   ON DELETE CASCADE ON UPDATE NO ACTION DEFERRABLE;
> ERROR:  there is no unique constraint matching given keys for referenced
> table "person"
> 
> Why does Postgres not add the unique flag, despite "this is equivalent to a
> UNIQUE constraint"?
> I thought it should be:
> "person_udx_person_id2" UNIQUE EXCLUDE USING btree (person_id WITH =)
> 
> PS. 
> > For example, you can specify a constraint that no two rows in the table
> contain overlapping circles (see Section 8.8) by using the && operator.
> 
> Also I expect that this:
> ALTER TABLE person
>   add constraint person_udx_person_id
>   EXCLUDE USING gist (
> person_id WITH =,
> tstzrange(valid_from, valid_till, '[)' ) WITH &&
>   )
> 
> should also set the UNIQUE flag for the exclusion constraint, so we can use it in an FK


-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +



Re: [PATCH] Include application_name in "connection authorized" log message

2018-08-07 Thread Tom Lane
Don Seiler  writes:
> On Tue, Aug 7, 2018 at 11:29 AM, Tom Lane  wrote:
>> Well, if you're going to insist on that part, it's probably not worth
>> making the application_name GUC have inconsistent behavior.

> OK so just to make sure I understand:

> 1. We want to make a generic, central ascii-lobotomizing function similar
> to check_application_name that we can re-use there and for other checks (eg
> user name).
> 2. Change check_application_name to call this function (or just call this
> function instead of check_application_name()?)

check_application_name's API is dictated by the GUC check-hook interface,
and doesn't really make sense for this other use.  So the first part of
that, not the second.

> 3. Call this function when storing the value in the port struct.

I'm not sure where exactly is the most sensible place to call it,
but trying to minimize the number of places that know about this
kluge seems like a good principle.

regards, tom lane



Re: pg_dump: sortDumpableObjectsByTypeName() doesn't always do that

2018-08-07 Thread Tom Lane
Jacob Champion  writes:
> On Mon, Aug 6, 2018 at 12:45 PM Tom Lane  wrote:
>> Ah, gotcha.  But whether the behavior is sane or not, it'd be reproducible
>> for any specific input dataset on any specific platform (unless you've got
>> a quicksort that actually uses randomized pivots; but ours doesn't, and
>> I think that pg_dump does use src/port/qsort.c).  So that partially
>> answers Andrew's question as to why we've not seen instability in the
>> buildfarm's results.

> I completely missed that the qsort in use was part of libpgport; that
> should make it much easier to repro. We'll give it a shot.

I don't see any reason to insist on a test case before pushing this
fix, so I did that.  (As I expected, the fix doesn't change any existing
regression test results.)

regards, tom lane



Re: [PATCH] Include application_name in "connection authorized" log message

2018-08-07 Thread Don Seiler
On Tue, Aug 7, 2018 at 11:29 AM, Tom Lane  wrote:

> Stephen Frost  writes:
> > * Tom Lane (t...@sss.pgh.pa.us) wrote:
> >> But having said that, I don't exactly see why you couldn't force it
> >> with an ultimately-redundant SetConfigOption call to put the value
> >> in place before the ereport happens.  The GUC machinery is surely
> >> functional before we do authorization.
>
> > If that's the approach you think makes the most sense, I wouldn't object
> > to it.  I will point out that we'd end up with the application name in
> > the log line if it's also included in log_line_prefix, but that's what
> > happens with "user" anyway, isn't it?, so that doesn't seem to be a big
> > deal.  I do think it's still good to have application_name explicitly
> > in the log message for users who want to just log application_name on
> > connection and not have it on every single log line.
>
> Well, if you're going to insist on that part, it's probably not worth
> making the application_name GUC have inconsistent behavior.
>
> regards, tom lane
>


OK so just to make sure I understand:

1. We want to make a generic, central ascii-lobotomizing function similar
to check_application_name that we can re-use there and for other checks (eg
user name).
2. Change check_application_name to call this function (or just call this
function instead of check_application_name()?)
3. Call this function when storing the value in the port struct.

Please let me know if I'm missing/misunderstanding anything.

Don.

-- 
Don Seiler
www.seiler.us


Re: Internal error XX000 with enable_partition_pruning=on, pg 11 beta1 on Debian

2018-08-07 Thread Rushabh Lathia
Hi,

Consider the below case:

CREATE TABLE pt (a INT, b INT, c INT) PARTITION BY RANGE(a);
CREATE TABLE pt_p1 PARTITION OF pt FOR VALUES FROM (1) to (6) PARTITION BY
RANGE (b);
CREATE TABLE pt_p1_p1 PARTITION OF pt_p1 FOR VALUES FROM (11) to (44);
CREATE TABLE pt_p1_p2 PARTITION OF pt_p1 FOR VALUES FROM (44) to (66);
INSERT INTO pt (a,b,c) VALUES
(1,11,111),(2,22,222),(3,33,333),(4,44,444),(5,55,555);

-- rule on root partition to first level child,
CREATE RULE pt_rule_ptp1 AS ON UPDATE TO pt DO INSTEAD UPDATE pt_p1 SET a =
new.a WHERE a = old.a;

-- Below command end up with error
UPDATE pt SET a = 3 WHERE a = 2;
ERROR:  child rel 1 not found in append_rel_array

Here the update on the partitioned table fails if it has a rule defined on
it to redirect records to the child table.

While looking further, I found the test started failing with below commit:

commit 1b54e91faabf3764b6786915881e514e42dccf89
Author: Tom Lane 
Date:   Wed Aug 1 19:42:46 2018 -0400

Fix run-time partition pruning for appends with multiple source rels.


The error comes from the below code, where it tries to adjust the appendrel
attributes and ends up with an error from find_appinfos_by_relids().

/*
 * The prunequal is presented to us as a qual for 'parentrel'.
 * Frequently this rel is the same as targetpart, so we can skip
 * an adjust_appendrel_attrs step.  But it might not be, and then
 * we have to translate.  We update the prunequal parameter here,
 * because in later iterations of the loop for child partitions,
 * we want to translate from parent to child variables.
 */
if (parentrel != subpart)
{
    int         nappinfos;
    AppendRelInfo **appinfos = find_appinfos_by_relids(root,
                                                       subpart->relids,
                                                       &nappinfos);

    prunequal = (List *) adjust_appendrel_attrs(root, (Node *)
                                                prunequal,
                                                nappinfos,
                                                appinfos);

    pfree(appinfos);
}


Regards,

On Thu, Aug 2, 2018 at 8:36 PM, Alvaro Herrera 
wrote:

> On 2018-Aug-01, Tom Lane wrote:
>
> > David Rowley  writes:
> > > On 20 July 2018 at 01:03, David Rowley 
> wrote:
> > >> I've attached a patch intended for master which is just v2 based on
> > >> post 5220bb7533.
> >
> > I've pushed the v3 patch with a lot of editorial work (e.g. cleaning
> > up comments you hadn't).
>
> Thanks Tom, much appreciated.
>
> --
> Álvaro Herrerahttps://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>
>


-- 
Rushabh Lathia
www.EnterpriseDB.com


Re: [PATCH] Include application_name in "connection authorized" log message

2018-08-07 Thread Tom Lane
Stephen Frost  writes:
> * Tom Lane (t...@sss.pgh.pa.us) wrote:
>> But having said that, I don't exactly see why you couldn't force it
>> with an ultimately-redundant SetConfigOption call to put the value
>> in place before the ereport happens.  The GUC machinery is surely
>> functional before we do authorization.

> If that's the approach you think makes the most sense, I wouldn't object
> to it.  I will point out that we'd end up with the application name in
> the log line if it's also included in log_line_prefix, but that's what
> happens with "user" anyway, isn't it?, so that doesn't seem to be a big
> deal.  I do think it's still good to have appplication_name explicitly
> in the log message for users who want to just log application_name on
> connection and not have it on every single log line.

Well, if you're going to insist on that part, it's probably not worth
making the application_name GUC have inconsistent behavior.

regards, tom lane



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

2018-08-07 Thread Fabien COELHO



Hello Marina,


v10-0001-Pgbench-errors-use-the-RandomState-structure-for.patch
- a patch for the RandomState structure (this is used to reset a client's 
random seed during the repeating of transactions after serialization/deadlock 
failures).


About this v10 part 1:

Patch applies cleanly, compile, global & local make check both ok.

The random state is cleanly separated so that it will be easy to reset it
on client error handling.  ISTM that the pgbench side is deterministic with
the separation of the seeds for different uses.

Code is clean, comments are clear.

I'm wondering what the rationale is for the "xseed" field name? In
particular, what does the "x" stand for?


--
Fabien.



Re: Standby trying "restore_command" before local WAL

2018-08-07 Thread Stephen Frost
Greetings,

* Tomas Vondra (tomas.von...@2ndquadrant.com) wrote:
> That's how I read this part of RestoreArchivedFile:
> 
> https://github.com/postgres/postgres/blob/master/src/backend/access/transam/xlogarchive.c#L110
> 
> The very first thing it does is checking if the local file exists, and if it
> does then calling unlink().

Oh, yeah, once we've decided to go down that route we're doing that,
wasn't thinking it through completely, sorry for the noise.

> >>We'd need invoking some other command before restore_command, which would do
> >>this comparison and decide whether to use local or remote WAL.
> >
> >Ugh, that sounds like a lot of additional complication that I'm not sure
> >would be all that useful or sensible for this particular case, if that's
> >what it would require.  I'd rather we try to figure out some way to get
> >rid of restore_command entirely instead of bolting on more things around
> >it.
> 
> The simpler the better of course.
> 
> All I'm saying is that (assuming my understanding of RestoreArchivedFile is
> correct) we can't just do that in the current restore_command. We do need a
> way to ask the archive for some metadata/checksums, and restore_command is
> too late.

Yeah, I see what you mean there.  An alternative might be to *not*
unlink the file, in case the restore command wants to check if it's
valid and matches what's in the archive, and instead restore the
requested file somewhere else and then perform an unlink/mv after
the restore command has been run.

I'm still not entirely sure that this is a particularly good idea, but
it would at least avoid having another command for a user to have to
configure.

Thanks!

Stephen


signature.asc
Description: PGP signature


Re: Standby trying "restore_command" before local WAL

2018-08-07 Thread Stephen Frost
Greetings,

* Tomas Vondra (tomas.von...@2ndquadrant.com) wrote:
> On 08/06/2018 09:32 PM, Stephen Frost wrote:
> >* Tomas Vondra (tomas.von...@2ndquadrant.com) wrote:
> >>On 08/06/2018 06:11 PM, Stephen Frost wrote:
> >WAL checksums are per WAL record, not across the whole file...  And,
> >yes, this seems like something we could probably write code to ensure
> >we're doing correctly, possibly even without changing the existing WAL
> >format or maybe we would have to, but either way, seems like additional
> >code that would need to be written and some serious thought put into it
> >to make sure it's correct.  I'm all for that, just to be clear, but what
> >I don't think we can do is move forward with a change to just prefer
> >pg_wal over restore command.
> 
> While the checksums are per-record, the record also contains xl_prev, so
> it's effectively a chain of checksums covering the whole file. The other
> thing you can do is look at the first record of the next segment and use the
> xl_prev to detect if the previous segment was not partial.
> 
> But I do agree doing this properly may require writing some new code and
> checking that those cases are detected correctly.

Right, might be possible but isn't something we necessarily guarantee
will always work correctly today in this situation where we've got old
WAL files that were pulled over a period of time (imagine that we copy
WAL file ABC before PG was done with it, but we don't get around to
copying WAL file DEF until much later after ABC has been completed and
DEF is only partial, or other such scenarios).  Remember, the scenario
we're talking about here is where someone did a pg_start_backup, took
their time copying all the files in PGDATA, including pg_wal, and then
at some later point ran pg_stop_backup.  We have no idea when those WAL
files were copied and they could have been anywhere between the
pg_start_backup point and the pg_stop_backup point.

> (Note: There was a discussion about replacing xl_prev with LSN of the
> current record, and IMHO that would work just as well, although it might be
> a bit more expensive for some operations.)

I haven't thought very much about this so don't have much of an opinion
on it one way or the other at this point.

> >CRC's are per WAL record, and possibly some WAL records might not be ok
> >to replay, or at least we need to make sure that we replay the right set
> >of WAL in the right order even when there are partial WAL files being
> >given to PG (that aren't named that way...).  The more I think about
> >this, I think we really need to avoid partial WAL files entirely- what
> >are we going to do when we get to the end of one?  We'd need to request
> >the full one from the restore command anyway, so seems like we should
> >just go ahead and get it from the archive, the question is if there's an
> >easy/cheap way to detect partial WAL files in pg_wal.
> 
> As explained above, I don't think this is actually a problem. The checksums
> do cover the whole file thanks to chaining, and there are ways to detect
> partial segments. IMHO it's fine if we replay a segment and then find out it
> was partial and that we need to fetch it from archive anyway and re-apply it
> - it should not be very common case, except when the user does something
> silly.

As long as we *do* go off and try to fetch that WAL file and replay it,
and don't assume that the end of that partial WAL file means the end of
WAL replay, then I think you may be right and that it'd be fine, but it
does seem a bit risky to me.

> >As I mention above, seems like what we should really be doing is having
> >a way to know when a WAL file is full but still in pg_wal for some
> >reason (hasn't been replayed yet due to intential lag on the replica, or
> >unintentional lag on the replica, etc), and perhaps we even only do that
> >on replicas or have tools like pg_basebackup and pgbackrest do that when
> >they're doing a restore, so that it's clear to PG that all these files
> >are full WAL and they can replay them completely.
> 
> IMHO we can deduce if from first xl_prev of the next segment (assuming we
> have the next segment available locally, which we likely have except for the
> last one).

See above for why I don't think it's actually that simple..

> >I was actually just thinking about having pgbackrest do exactly that. :)
> >We already have the full sha256 checksum of every WAL file in the
> >pgbackrest repository, it seems like it wouldn't be too hard to
> >calculate the checksum of the files in pg_wal and ask the repo what the
> >checksums are for those WAL files and then use the files in pg_wal if
> >they checksums match.  I can already imagine the argument coming back
> >though- that'd require additional testing to ensure it's done correctly
> >and I'm really not sure that it'd end up being all that much better
> >except in some very special circumstances where there's a low bandwidth
> >connection to the repo and often a lot of WAL left over in 

Percona Live Europe Open Source Database Conference 2018 Call For Papers

2018-08-07 Thread Lorraine Pocklington
Hi,

The Percona Live Europe Open Source Database Conference is being held
this year in Frankfurt from November 5-7, 2018. In case you are not
familiar with the conference, our brief is to offer a platform for the
presentation of technical and business papers on any topic that’s relevant
to open source databases and their application. The conference is well
established and has a great reputation amongst the MySQL and MariaDB
communities for the depth of learning and the open exchange of ideas that
it offers.

Core to the underlying principles of the conference is that the event
provides an unbiased representation of open source databases. Past events
have included PostgreSQL talks, but this year for the first time we are
featuring a track dedicated to PostgreSQL. The theme for this year is
“Connect. Accelerate. Innovate.” and we’d warmly welcome submissions from
the PostgreSQL community.

Proposals are reviewed by our independent conference committee and if your
paper is chosen, then you will enjoy the benefit of free access to all
three days of the event. Please find more information about the kind of
content we’d love to see this year on the conference website. You can also
access information about our past events if you would like to find out
more. If you have any questions, please don’t hesitate to get in touch, I’d
be happy to help.

Kind regards,
Lorraine

-- 
*Lorraine Pocklington*
*Community Manager*
*Percona*
www.percona.com
Email: lorraine.pockling...@percona.com
+44 (0)7805 640 077
Skype: live:lorraine.pocklington_1
Timezone: GMT (UK)


Re: [Patch] Create a new session in postmaster by calling setsid()

2018-08-07 Thread Paul Guo
On Thu, Aug 2, 2018 at 10:30 PM, Tom Lane  wrote:

> Paul Guo  writes:
> > [ make the postmaster execute setsid() too ]
>
> I'm a bit skeptical of this proposal.  Forcing the postmaster to
> dissociate from its controlling terminal is a good thing in some
> scenarios, but probably less good in others, and manual postmaster
> starts are probably mostly in the latter class.
>
> I wonder whether having "pg_ctl start" do a setsid() would accomplish
> the same result with less chance of breaking debugging usages.
>
> regards, tom lane
>

Yes, considering the case of starting the postmaster manually, we cannot
create a new session in the postmaster itself, so pg_ctl seems to be a good
place for the setsid() call.  Attached is a newer patch.  Thanks.
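
To illustrate the mechanism, here is a minimal standalone sketch of the
fork + setsid() + exec sequence (not the patch itself; the binary path and
arguments are placeholders):

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

int
main(void)
{
	pid_t		pid = fork();

	if (pid < 0)
	{
		fprintf(stderr, "fork failed: %s\n", strerror(errno));
		return 1;
	}
	if (pid == 0)
	{
		/* child: become a session leader, detaching from the controlling tty */
		if (setsid() < 0)
		{
			fprintf(stderr, "setsid failed: %s\n", strerror(errno));
			_exit(1);
		}
		/* then exec the server, as pg_ctl's start_postmaster() would */
		execlp("postgres", "postgres", "-D", "/path/to/data", (char *) NULL);
		_exit(1);				/* reached only if exec fails */
	}
	/* parent returns immediately, like pg_ctl after launching the server */
	return 0;
}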


0001-Create-a-new-session-for-postmaster-in-pg_ctl-by-cal.patch
Description: Binary data


Re: [PATCH] Include application_name in "connection authorized" log message

2018-08-07 Thread Stephen Frost
Greetings Tom,

* Tom Lane (t...@sss.pgh.pa.us) wrote:
> Stephen Frost  writes:
> > * Tom Lane (t...@sss.pgh.pa.us) wrote:
> >> Moreover, if you don't check it then the appname recorded
> >> by log_connections would not match appearances for the same session
> >> later in the log, which puts the entire use-case for this patch into
> >> question.  So no, this concern must not be dismissed.
> 
> > If the call to check_application_name() fails then I had been expecting
> > the connection to fail.  If we continue to let the connection go on then
> > we've already got an issue as someone might pass in an application name
> > that isn't being set to the GUC and isn't being therefore used in the
> > existing log_line_prefix where it should be.
> 
> No, check_application_name doesn't reject funny names, it just silently
> modifies them in-place.

Blargh, that doesn't seem particularly good to me, but I guess no one
has been complaining about it either.

> >> However ... I've not looked at the patch, but I thought the idea was to
> >> allow assignment of that GUC to occur before the log_connections log entry
> >> is emitted, so that it'd show up in the entry's log_line_prefix.  Wouldn't
> >> check_application_name happen automatically at that point?
> 
> > We log that message quite early and it certainly didn't look trivial to
> > set up the GUC to be already in place at that point, so the plan was to
> > simply spit out what gets passed in (as we were doing for "user", if I'm
> > remembering that code correctly...).
> 
> Hm.  Well, the code isn't exactly complicated, you could duplicate it.
> Or maybe better refactor to allow it to be called from $wherever.
> Looks like check_cluster_name, for one, could also share use of
> an ascii-lobotomizing subroutine.

Yeah, I'd be alright with either of those approaches.
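
Something along these lines is what I have in mind for the shared
subroutine (a sketch only, name made up; it just does what
check_application_name does today):

/*
 * Overwrite, in place, any byte outside the printable-ASCII range with
 * '?'.  check_application_name and check_cluster_name could both call
 * this, as could the code that logs the startup-packet value.
 */
static void
clean_ascii_string(char *str)
{
	char	   *p;

	for (p = str; *p != '\0'; p++)
	{
		if (*p < 32 || *p > 126)
			*p = '?';
	}
}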

> But having said that, I don't exactly see why you couldn't force it
> with an ultimately-redundant SetConfigOption call to put the value
> in place before the ereport happens.  The GUC machinery is surely
> functional before we do authorization.

If that's the approach you think makes the most sense, I wouldn't object
to it.  I will point out that we'd end up with the application name in
the log line if it's also included in log_line_prefix, but that's what
happens with "user" anyway, isn't it?, so that doesn't seem to be a big
deal.  I do think it's still good to have application_name explicitly
in the log message for users who want to just log application_name on
connection and not have it on every single log line.

Thanks!

Stephen


signature.asc
Description: PGP signature


Re: Negotiating the SCRAM channel binding type

2018-08-07 Thread Stephen Frost
Greetings,

* Michael Paquier (mich...@paquier.xyz) wrote:
> On Tue, Aug 07, 2018 at 02:32:27PM +0530, Robert Haas wrote:
> > On Sun, Aug 5, 2018 at 4:30 PM, Heikki Linnakangas  wrote:
> >> Well, it'd be useless for users, there is no reason to switch off channel
> >> binding if both the client and server support it. It might not add any
> >> security you care about, but it won't do any harm either. The
> >> non-channel-binding codepath is still exercised with non-SSL connections.
> > 
> > Is that true?  What if it makes a connection fail that you wanted to
> > succeed?  Suppose we discover a bug that makes connections using
> > channel binding fail on Thursdays.
> 
> Well, as things stand today on HEAD, if channel binding has a bug, this
> makes the SCRAM authentication not able to work over SSL, hence you need
> to either drop SSL, SCRAM or patch libpq so as the client tells the
> server that it does not want to use channel binding.  None of those are
> appealing.  Before 7729113, the client still had the choice to enforce
> channel binding to not be used over SSL, which I think is a win to
> bypass any potential future bugs.  On top of that, we can test
> automatically for *any* future SSL implementations that (SSL + SCRAM +
> no channel binding) actually works properly, which is, it seems at least
> to me, a good thing to get more confidence when a new SSL implementation
> is added.

Uh, really?  We can come up with all kinds of potential bugs that might
exist in the world but I don't think we should be writing in options for
everything that might fail due to some bug existing that we don't know
about.  Also, we aren't going to release support for a new SSL library
in a minor release, so if we end up needing this option due to some SSL
library that we really want to support not having channel binding
support then we can add the option then (or realize that maybe we
shouldn't be bothering with adding support for an SSL implementation
that doesn't support channel binding- it's not exactly a new thing
these days and there are very good reasons for having it).

Now- if we thought that maybe there was some connection pooling solution
that could be made to work with SSL+SCRAM if channel binding is turned
off, then that's a use-case that maybe we should try and support, but
this notion that we need to be able to turn it off because there might
be a bug is hogwash, imv.  Now, I haven't seen a pooling solution
actually figure out a way to do SSL+SCRAM even without channel binding,
and there might not even be a way, so I'm currently a -1 on adding an
option to disable it, but if someone turned up tomorrow with an credible
approach to doing that, then I'd +1 adding the option.

Thanks!

Stephen


signature.asc
Description: PGP signature


Re: [PATCH] Include application_name in "connection authorized" log message

2018-08-07 Thread Stephen Frost
Greetings Don,

* Don Seiler (d...@seiler.us) wrote:
> On Tue, Aug 7, 2018 at 8:46 AM, Stephen Frost  wrote:
> > * Don Seiler (d...@seiler.us) wrote:
> > > On Mon, Jul 30, 2018 at 5:20 AM, Peter Eisentraut <
> > > peter.eisentr...@2ndquadrant.com> wrote:
> > >
> > > > On 13/07/2018 20:20, Don Seiler wrote:
> > > > > See attached for latest revision.
> > > >
> > > > This doesn't compile with SSL enabled because there is a comma missing.
> > >
> > > Hmm I'll check this out tonight. Sorry I wasn't able to get to this until
> > > now.
> >
> > Thanks.
> 
> Attaching a new patch fixing the missing comma. Sorry for missing this,
> obviously my previous test builds didn't have SSL enabled. Mea culpa.

No worries, it happens. :)  Thanks for working on a new patch.

> I'll send a new one if we require a change for the check_application_name()
> concern. I need the git rebase/autosquash practice anyway.

Sounds like maybe we should be looking at whether the concerns raised by Tom
might be applicable to the other strings being written out by that log
message, and if not, why not, and whether we can do whatever we're doing for
those for application_name as well.

Thanks!

Stephen


signature.asc
Description: PGP signature


Re: [PATCH] Include application_name in "connection authorized" log message

2018-08-07 Thread Stephen Frost
Greetings,

* Tom Lane (t...@sss.pgh.pa.us) wrote:
> Stephen Frost  writes:
> > * Don Seiler (d...@seiler.us) wrote:
> >> Is the concern that any user can set their client's application name value
> >> to any string they want? Is there a reason we can't call
> >> check_application_name() before setting it in the Port struct in
> >> postmaster.c?
> 
> > I've not looked very closely, but I don't think it's necessarily a big
> > deal to print out the application name provided by the client system
> > into the log before we run check_application_name(), as long as there
> > isn't any risk that printing it out or passing it around without
> > performing that check will cause incorrect operation or such.
> 
> I think the issue is exactly that putting a malformed appname into the
> postmaster log could confuse log-reading apps (eg by causing encoding
> problems).  

Evidently, I need to go look at this, since it seems like this would be
the exact same risk with "user", which is part of why I wasn't terribly
concerned about it here.  Considering the concern raised here, if we're
serious about that issue, then I wonder if we have an existing issue.

> Moreover, if you don't check it then the appname recorded
> by log_connections would not match appearances for the same session
> later in the log, which puts the entire use-case for this patch into
> question.  So no, this concern must not be dismissed.

If the call to check_application_name() fails then I had been expecting
the connection to fail.  If we continue to let the connection go on, then
we've already got an issue, as someone might pass in an application name
that isn't being set as the GUC and therefore isn't being used in the
existing log_line_prefix where it should be.
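
For reference, a rough sketch of what the application_name check hook does
today (this is from memory, not the exact guc.c source; it sanitizes rather
than rejects, replacing any non-printable-ASCII byte with '?'):

#include "postgres.h"
#include "utils/guc.h"

static bool
check_application_name(char **newval, void **extra, GucSource source)
{
    char       *p;

    /* Allow only clean ASCII characters in the application name */
    for (p = *newval; *p; p++)
    {
        if (*p < 32 || *p > 126)
            *p = '?';
    }

    return true;
}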

> However ... I've not looked at the patch, but I thought the idea was to
> allow assignment of that GUC to occur before the log_connections log entry
> is emitted, so that it'd show up in the entry's log_line_prefix.  Wouldn't
> check_application_name happen automatically at that point?

We log that message quite early and it certainly didn't look trivial to
set up the GUC to be already in place at that point, so the plan was to
simply spit out what gets passed in (as we were doing for "user", if I'm
remembering that code correctly...).

Thanks!

Stephen


signature.asc
Description: PGP signature


Re: [PATCH] Include application_name in "connection authorized" log message

2018-08-07 Thread Don Seiler
On Tue, Aug 7, 2018 at 8:46 AM, Stephen Frost  wrote:

> Greetings,
>
> * Don Seiler (d...@seiler.us) wrote:
> > On Mon, Jul 30, 2018 at 5:20 AM, Peter Eisentraut <
> > peter.eisentr...@2ndquadrant.com> wrote:
> >
> > > On 13/07/2018 20:20, Don Seiler wrote:
> > > > See attached for latest revision.
> > >
> > > This doesn't compile with SSL enabled because there is a comma missing.
> >
> > Hmm I'll check this out tonight. Sorry I wasn't able to get to this until
> > now.
>
> Thanks.
>

Attaching a new patch fixing the missing comma. Sorry for missing this,
obviously my previous test builds didn't have SSL enabled. Mea culpa.

I'll send a new one if we require a change for the check_application_name()
concern. I need the git rebase/autosquash practice anyway.

Thanks!
Don.

-- 
Don Seiler
www.seiler.us


0001-Changes-to-add-application_name-to-Port-struct-so-we.patch
Description: Binary data


Re: garbage variable in GNUmakefile.in

2018-08-07 Thread Tom Lane
Daniel Gustafsson  writes:
> I happened to notice that there is a variable 'garbage' in GNUmakefile.in,
> which was defined in 32163099d7c43a0244f463eb4e525c711e6e97a3 ~18 years ago,
> but which seems to not be used?  Is 'garbage' a special make variable or is
> it a leftover which can be removed?

It's not a special make variable, so far as I can find in the gmake manual.

It looks like the idea was to list filename patterns that could be subject
to automatic removal during "make clean", but Peter backed off listing it
in the clean target and then forgot to remove the variable.

I couldn't find any contemporaneous discussion in the archives, but I do
vaguely recall that at some point --- maybe not just then --- we talked
about having such auto removal and decided it was a bad idea.  Editor
backup filenames, for instance, vary depending on your weapon of choice;
and core file names vary too, these days.  So there's a project policy
that we wouldn't enable any such removal by the makefiles.  (These days
people would probably turn to git commands for such cleanup, anyway.)

In short, yeah, I think we could nuke that.  But maybe Peter remembers
it differently.

regards, tom lane



Re: [PATCH] Include application_name in "connection authorized" log message

2018-08-07 Thread Tom Lane
Stephen Frost  writes:
> * Don Seiler (d...@seiler.us) wrote:
>> Is the concern that any user can set their client's application name value
>> to any string they want? Is there a reason we can't call
>> check_application_name() before setting it in the Port struct in
>> postmaster.c?

> I've not looked very closely, but I don't think it's necessarily a big
> deal to print out the application name provided by the client system
> into the log before we run check_application_name(), as long as there
> isn't any risk that printing it out or passing it around without
> performing that check will cause incorrect operation or such.

I think the issue is exactly that putting a malformed appname into the
postmaster log could confuse log-reading apps (eg by causing encoding
problems).  Moreover, if you don't check it then the appname recorded
by log_connections would not match appearances for the same session
later in the log, which puts the entire use-case for this patch into
question.  So no, this concern must not be dismissed.

However ... I've not looked at the patch, but I thought the idea was to
allow assignment of that GUC to occur before the log_connections log entry
is emitted, so that it'd show up in the entry's log_line_prefix.  Wouldn't
check_application_name happen automatically at that point?

regards, tom lane



Re: [HACKERS] logical decoding of two-phase transactions

2018-08-07 Thread Nikhil Sontakke
Hi Arseny,

> - Decoding transactions at PREPARE record changes rules of the "we ship
>   all commits after lsn 'x'" game. Namely, it will break initial
>   tablesync: what if the consistent snapshot was formed *after* PREPARE, but
>   before COMMIT PREPARED, and the plugin decides to employ 2pc? Instead
>   of getting initial contents + a continuous stream of changes, the receiver
>   will miss the prepared xact contents and raise 'prepared xact doesn't
>   exist' error. I think the starting point to address this is to forbid
>   two-phase decoding of xacts with lsn of PREPARE less than
>   snapbuilder's start_decoding_at.
>

It will be the job of the plugin to return a consistent answer for
every GID that is encountered. In this case, the plugin will decode
the transaction at COMMIT PREPARED time and not at PREPARE time.
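
As a rough sketch, a plugin-side policy could look something like this (the
callback name and signature here are hypothetical, since the callback API in
this patch set is still under discussion; the only real requirement is that
the answer is stable for a given GID):

#include "postgres.h"
#include "replication/logical.h"

/*
 * Hypothetical filter callback: return true to decode the transaction at
 * PREPARE time, false to decode it at COMMIT PREPARED time instead.  The
 * answer must be the same every time it is asked for the same GID.
 */
static bool
my_filter_prepare(LogicalDecodingContext *ctx, TransactionId xid,
                  const char *gid)
{
    /* e.g. only use two-phase decoding for GIDs this receiver manages */
    return strncmp(gid, "mysub_", 6) == 0;
}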

> - Currently we will call abort_prepared cb even if we failed to actually
>   prepare xact due to concurrent abort. I think it is confusing for
>   users. We should either handle this by remembering not to invoke
>   abort_prepared in these cases or at least document this behaviour,
>   leaving this problem to the receiver side.
>

The point is, when we reach the "ROLLBACK PREPARED", we have no idea
if the "PREPARE" was aborted by this rollback happening concurrently.
So it's possible that the 2PC has been successfully decoded and we
would have to send the rollback to the other side which would need to
check if it needs to roll back locally.


> - I find it suspicious that DecodePrepare completely ignores actions of
>   SnapBuildCommitTxn. For example, to execute invalidations, the latter
>   sets the base snapshot if our xact (or subxacts) did DDL and the snapshot
>   is not set yet. I can't think of a concrete example
>   where this would burn at the moment, but it should be considered.
>

I had discussed this area with Petr and we didn't see any issues either, at the time.

> Now, the bikeshedding.
>
> First patch:
> - I am one of those people upthread who don't think that converting
>   flags to bitmask is beneficial -- especially given that many of them
>   are mutually exclusive, e.g. xact can't be committed and aborted at
>   the same time. Apparently you have left this to the committer though.
>

Hmm, there seems to be divided opinion on this. I am willing to go
back to using the booleans if there's opposition and if the committer
so wishes. Note that this patch will end up adding 4 or 5 more booleans
in that case (we add new ones for prepare, commit prepared, abort,
rollback prepared, etc.).

>
> Second patch:
> - Applying gives me
> Applying: Support decoding of two-phase transactions at PREPARE
> .git/rebase-apply/patch:871: trailing whitespace.
>
> +  row. The change_cb callback may access system or
> +  user catalog tables to aid in the process of outputting the row
> +  modification details. In case of decoding a prepared (but yet
> +  uncommitted) transaction or decoding of an uncommitted transaction, this
> +  change callback is ensured sane access to catalog tables regardless of
> +  simultaneous rollback by another backend of this very same transaction.
>
> I don't think we should explain this, at least in such words. As
> mentioned upthread, we should warn about allowed systable_* accesses
> instead. Same for message_cb.
>

Looks like you are looking at an earlier patchset. The latest patchset
has removed the above.

>
> +   /*
> +* Tell the reorderbuffer about the surviving subtransactions. We need to
> +* do this because the main transaction itself has not committed since we
> +* are in the prepare phase right now. So we need to be sure the snapshot
> +* is setup correctly for the main transaction in case all changes
> +* happened in subtransanctions
> +*/
>
> While we do certainly need to associate subxacts here, the explanation
> looks weird to me. I would leave just the 'Tell the reorderbuffer about
> the surviving subtransactions' as in DecodeCommit.
>
>
> }
> -
> /*
>  * There's a speculative insertion remaining, just clean in up, it
>  * can't have been successful, otherwise we'd gotten a confirmation
>
> Spurious newline deletion.
>
>
> - I would rename ReorderBufferCommitInternal to ReorderBufferReplay:
>   we replay the xact there, not commit.
>
> - If xact is empty, we will not prepare it (and call cb),
>   even if the output plugin asked us. However, we will call
>   commit_prepared cb.
>
> - ReorderBufferTxnIsPrepared and ReorderBufferPrepareNeedSkip do the
>   same and should be merged with comments explaining that the answer
>   must be stable.
>
> - filter_prepare_cb callback existence is checked in both decode.c and
>   in filter_prepare_cb_wrapper.
>
> +   /*
> +* The transaction may or may not exist (during restarts for example).
> +* Anyways, 2PC transactions do not contain any 

Re: [PATCH] Include application_name in "connection authorized" log message

2018-08-07 Thread Stephen Frost
Greetings,

* Don Seiler (d...@seiler.us) wrote:
> On Mon, Jul 30, 2018 at 5:20 AM, Peter Eisentraut <
> peter.eisentr...@2ndquadrant.com> wrote:
> 
> > On 13/07/2018 20:20, Don Seiler wrote:
> > > See attached for latest revision.
> >
> > This doesn't compile with SSL enabled because there is a comma missing.
> 
> Hmm I'll check this out tonight. Sorry I wasn't able to get to this until
> now.

Thanks.

> > This implementation doesn't run the application_name through
> > check_application_name(), so it could end up logging application_name
> > values that are otherwise not acceptable.  I'm not sure of the best way
> > to fix that.
> 
> Is the concern that any user can set their client's application name value
> to any string they want? Is there a reason we can't call
> check_application_name() before setting it in the Port struct in
> postmaster.c?

I've not looked very closely, but I don't think it's necessarily a big
deal to print out the application name provided by the client system
into the log before we run check_application_name(), as long as there
isn't any risk that printing it out or passing it around without
performing that check will cause incorrect operation or such.

I'm guessing it's not easy because check_application_name() isn't easily
available from where we're wanting to print it out or earlier, but,
again, haven't looked closely.  If you aren't sure or run into issues,
feel free to ping me on slack and I'll be happy to help.

Thanks!

Stephen


signature.asc
Description: PGP signature


Re: Reopen logfile on SIGHUP

2018-08-07 Thread Alexander Kuzmenkov
I think the latest v4 patch addresses the concerns raised upthread. I'm 
marking the commitfest entry as RFC.


--

Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: [PATCH] Include application_name in "connection authorized" log message

2018-08-07 Thread Don Seiler
On Mon, Jul 30, 2018 at 5:20 AM, Peter Eisentraut <
peter.eisentr...@2ndquadrant.com> wrote:

> On 13/07/2018 20:20, Don Seiler wrote:
> > See attached for latest revision.
>
> This doesn't compile with SSL enabled because there is a comma missing.
>

Hmm I'll check this out tonight. Sorry I wasn't able to get to this until
now.


> This implementation doesn't run the application_name through
> check_application_name(), so it could end up logging application_name
> values that are otherwise not acceptable.  I'm not sure of the best way
> to fix that.


Is the concern that any user can set their client's application name value
to any string they want? Is there a reason we can't call
check_application_name() before setting it in the Port struct in
postmaster.c?

Don.

-- 
Don Seiler
www.seiler.us


Re: ATTACH/DETACH PARTITION CONCURRENTLY

2018-08-07 Thread Andres Freund
On 2018-08-08 01:23:51 +1200, David Rowley wrote:
> On 8 August 2018 at 00:47, Andres Freund  wrote:
> > On 2018-08-08 00:40:12 +1200, David Rowley wrote:
> >> 1. Obtain a ShareUpdateExclusiveLock on the partitioned table rather
> >> than an AccessExclusiveLock.
> >> 2. Do all the normal partition attach partition validation.
> >> 3. Insert pg_partition record with partvalid = true.
> >> 4. Invalidate relcache entry for the partitioned table
> >> 5. Any loops over a partitioned table's PartitionDesc must check
> >> PartitionIsValid(). This will return true if the current snapshot
> >> should see the partition or not. The partition is valid if partisvalid
> >> = true and the xmin precedes or is equal to the current snapshot.
> >
> > How does this protect against other sessions actively using the relcache
> > entry? Currently it is *NOT* safe to receive invalidations for
> > e.g. partitioning contents afaics.
> 
> I'm not proposing that sessions running older snapshots can't see that
> there's a new partition. The code I have uses PartitionIsValid() to
> test if the partition should be visible to the snapshot. The
> PartitionDesc will always contain details for all partitions stored in
> pg_partition whether they're valid to the current snapshot or not.  I
> did it this way as there's no way to invalidate the relcache based on
> a point in transaction, only a point in time.

I don't think that solves the problem that an arriving relcache
invalidation would trigger a rebuild of rd_partdesc, while it actually
is referenced by running code.

You'd need to build infrastructure to prevent that.

One approach would be to make sure that everything relying on
rd_partdesc staying the same stores its value in a local variable, and
then *not* free the old version of rd_partdesc (etc) when the refcount >
0, but delay that to the RelationClose() that makes the refcount reach
0. That'd be the start of a framework for more such concurrent
handling.
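
A bare-bones sketch of that idea (the field and function names marked
"invented" are illustrative only; this is not what any current patch does):

#include "postgres.h"
#include "nodes/pg_list.h"
#include "utils/partcache.h"
#include "utils/rel.h"

/*
 * Sketch: when an invalidation forces the partition descriptor to be
 * rebuilt while the relation is still referenced, keep the old descriptor
 * around instead of freeing it, and only release the retained ones once
 * the reference count drops to zero (e.g. from RelationClose()).
 */
static void
RebuildPartitionDescPreservingOld(Relation rel)
{
    PartitionDesc olddesc = rel->rd_partdesc;

    rel->rd_partdesc = BuildNewPartitionDesc(rel);      /* invented */

    if (RelationHasReferenceCountZero(rel))
        FreeOldPartitionDesc(olddesc);                  /* invented */
    else
        rel->rd_pdoldlist = lappend(rel->rd_pdoldlist, olddesc);  /* invented field */
}

static void
ReleaseRetainedPartitionDescs(Relation rel)
{
    ListCell   *lc;

    foreach(lc, rel->rd_pdoldlist)
        FreeOldPartitionDesc((PartitionDesc) lfirst(lc));   /* invented */
    list_free(rel->rd_pdoldlist);
    rel->rd_pdoldlist = NIL;
}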

Regards,

Andres Freund



Re: ATTACH/DETACH PARTITION CONCURRENTLY

2018-08-07 Thread David Rowley
On 8 August 2018 at 00:47, Andres Freund  wrote:
> On 2018-08-08 00:40:12 +1200, David Rowley wrote:
>> 1. Obtain a ShareUpdateExclusiveLock on the partitioned table rather
>> than an AccessExclusiveLock.
>> 2. Do all the normal partition attach partition validation.
>> 3. Insert pg_partition record with partvalid = true.
>> 4. Invalidate relcache entry for the partitioned table
>> 5. Any loops over a partitioned table's PartitionDesc must check
>> PartitionIsValid(). This will return true if the current snapshot
>> should see the partition or not. The partition is valid if partisvalid
>> = true and the xmin precedes or is equal to the current snapshot.
>
> How does this protect against other sessions actively using the relcache
> entry? Currently it is *NOT* safe to receive invalidations for
> e.g. partitioning contents afaics.

I'm not proposing that sessions running older snapshots can't see that
there's a new partition. The code I have uses PartitionIsValid() to
test if the partition should be visible to the snapshot. The
PartitionDesc will always contain details for all partitions stored in
pg_partition whether they're valid to the current snapshot or not.  I
did it this way as there's no way to invalidate the relcache based on
a point in transaction, only a point in time.

I'm open to better ideas, of course.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



garbage variable in GNUmakefile.in

2018-08-07 Thread Daniel Gustafsson
I happened to notice that there is a variable 'garbage' in GNUmakefile.in,
which was defined in 32163099d7c43a0244f463eb4e525c711e6e97a3 ~18 years ago,
but which seems to not be used?  Is 'garbage' a special make variable or is
it a leftover which can be removed?

cheers ./daniel



garbage.diff
Description: Binary data


Re: ATTACH/DETACH PARTITION CONCURRENTLY

2018-08-07 Thread Simon Riggs
On 7 August 2018 at 13:47, Andres Freund  wrote:
> Hi,
>
> On 2018-08-08 00:40:12 +1200, David Rowley wrote:
>> 1. Obtain a ShareUpdateExclusiveLock on the partitioned table rather
>> than an AccessExclusiveLock.
>> 2. Do all the normal partition attach partition validation.
>> 3. Insert pg_partition record with partvalid = true.
>> 4. Invalidate relcache entry for the partitioned table
>> 5. Any loops over a partitioned table's PartitionDesc must check
>> PartitionIsValid(). This will return true if the current snapshot
>> should see the partition or not. The partition is valid if partisvalid
>> = true and the xmin precedes or is equal to the current snapshot.
>
> How does this protect against other sessions actively using the relcache
> entry? Currently it is *NOT* safe to receive invalidations for
> e.g. partitioning contents afaics.

I think you may be right in the general case, but ISTM possible to
invalidate/refresh just the list of partitions.

If so, that idea would seem to require some new, as-yet not invented mechanism.

-- 
Simon Riggs    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: ATTACH/DETACH PARTITION CONCURRENTLY

2018-08-07 Thread Andres Freund
Hi,

On 2018-08-08 00:40:12 +1200, David Rowley wrote:
> 1. Obtain a ShareUpdateExclusiveLock on the partitioned table rather
> than an AccessExclusiveLock.
> 2. Do all the normal partition attach partition validation.
> 3. Insert pg_partition record with partvalid = true.
> 4. Invalidate relcache entry for the partitioned table
> 5. Any loops over a partitioned table's PartitionDesc must check
> PartitionIsValid(). This will return true if the current snapshot
> should see the partition or not. The partition is valid if partisvalid
> = true and the xmin precedes or is equal to the current snapshot.

How does this protect against other sessions actively using the relcache
entry? Currently it is *NOT* safe to receive invalidations for
e.g. partitioning contents afaics.

- Andres



Re: ATTACH/DETACH PARTITION CONCURRENTLY

2018-08-07 Thread David Rowley
On 3 August 2018 at 01:25, David Rowley  wrote:
> 1. Do all the normal partition attach partition validation.
> 2. Insert a record into pg_partition with partisvalid=false
> 3. Obtain a session-level ShareUpdateExclusiveLock on the partitioned table.
> 4. Obtain a session-level AccessExclusiveLock on the partition being attached.
> 5. Commit.
> 6. Start a new transaction.
> 7. Wait for snapshots older than our own to be released.
> 8. Mark the partition as valid
> 9. Invalidate relcache for the partitioned table.
> 10. release session-level locks.

So I was thinking about this again and realised this logic is broken.
All it takes is a snapshot that starts after the ATTACH PARTITION
started and before it completed.  This snapshot will have the new
partition attached while it's possibly still open which could lead to
non-repeatable reads in a repeatable read transaction.  The window for
this to occur is possibly quite large given that the ATTACH
CONCURRENTLY can wait a long time for older snapshots to finish.

Here's my updated thinking for an implementation which seems to get
around the above problem:

ATTACH PARTITION CONCURRENTLY:

1. Obtain a ShareUpdateExclusiveLock on the partitioned table rather
than an AccessExclusiveLock.
2. Do all the normal partition attach partition validation.
3. Insert pg_partition record with partvalid = true.
4. Invalidate relcache entry for the partitioned table
5. Any loops over a partitioned table's PartitionDesc must check
PartitionIsValid(). This will return true if the current snapshot
should see the partition or not. The partition is valid if partisvalid
= true and the xmin precedes or is equal to the current snapshot.

#define PartitionIsValid(pd, i) (((pd)->is_valid[(i)] \
&& TransactionIdPrecedesOrEquals((pd)->xmin[(i)], GetCurrentTransactionId())) \
|| (!(pd)->is_valid[(i)] \
&& TransactionIdPrecedesOrEquals(GetCurrentTransactionId(), (pd)->xmin[(i)])))

DETACH PARTITION CONCURRENTLY:

1. Obtain ShareUpdateExclusiveLock on partition being detached
(instead of the AccessShareLock that non-concurrent detach uses)
2. Update the pg_partition record, set partvalid = false.
3. Commit
4. New transaction.
5. Wait for transactions which hold a snapshot older than the one held
when updating pg_partition to complete.
6. Delete the pg_partition record.
7. Perform other cleanup, relpartitionparent = 0, pg_depend etc.

Handling a DETACH PARTITION CONCURRENTLY failure after step 3 (above):

1. Make vacuum of a partition check for pg_partition.partvalid =
false, if xmin of tuple is old enough, perform a partition cleanup by
doing steps 6+7 above.

A VACUUM FREEZE must run before transaction wraparound, so this means
a partition can never reattach itself when the transaction counter
wraps.

I believe I've got the attach and detach working correctly now and
also have isolation tests that appear to prove it works. I've also written
the failed detach cleanup code into vacuum. Unusually, since foreign
tables can also be partitions, this required teaching auto-vacuum to
look at foreign tables, but only in the sense of checking for failed
detached partitions. It also requires adding vacuum support for
foreign tables.  It feels a little bit weird to modify auto-vacuum
to look at foreign tables, but I really couldn't see another way to do
this.

I'm now considering whether this all holds together in the event the
pg_partition tuple of an invalid partition becomes frozen. The problem
would be that PartitionIsValid() could return the wrong value due to
TransactionIdPrecedesOrEquals(GetCurrentTransactionId(),
(pd)->xmin[(i)]). This code is trying to keep the detached partition
visible to older snapshots, but if pd->xmin[i] becomes frozen, then
the partition would become invisible.  However, I think this won't be
a problem since a VACUUM FREEZE would only freeze tuples that are also
old enough to have had failed detaches cleaned up earlier in the vacuum
process.

Also, we must disallow a DEFAULT partition from being attached to a
partition with a failed DETACH CONCURRENTLY as it wouldn't be very
clear what the default partition's partition qual would be, as this is
built based on the quals of all attached partitions.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Cached plans and statement generalization

2018-08-07 Thread Konstantin Knizhnik




On 07.08.2018 13:02, Yamaji, Ryo wrote:


I want to confirm one point.
If I review the autoprepare patch, are you ready to register the patch at a
commitfest in the near future?  I fear that the autoprepare patch might not
get registered at a commitfest (for example, because you are busy) and
therefore never get applied to PostgreSQL.  If you are not ready to register
the patch, I would like to register it at the commitfest instead of you.


I have registered the patch for the next commitfest.
For some reason it doesn't find the latest autoprepare-10.patch and
still refers to autoprepare-6.patch as the latest attachment.






I agree it may be more useful to limit the amount of memory used by prepared
queries, rather than the number of prepared statements.
But it is just more difficult to calculate and maintain (I am not sure
that just looking at CacheMemoryContext is enough for it).
Also, if the working set of queries (frequently repeated queries) doesn't
fit in memory, then autoprepare will be almost useless (because with
high probability a prepared query will be thrown away from the cache before
it can be reused).  So limiting queries from the "application side" seems
to be more practical.

I see.  But I fear that the autoprepare process may use an unpredictable
amount of memory when autoprepare_limit is specified as a number of prepared
statements.  I think there are scenarios in which the autoprepare process
uses a lot of memory (e.g. it needs to prepare a lot of long queries), so
that other processes (e.g. other backend processes in PostgreSQL, or
processes other than PostgreSQL) cannot use that memory.  I hope to be able
to specify a limit on the amount of memory in the future.


Right now each prepared statement has two memory contexts: one for the raw
parse tree used as the hash table key and another for the cached plan itself.
Maybe it will be possible to combine them.  To calculate the memory consumed
by cached plans, it would be necessary to gather memory usage statistics
for all these contexts (which requires traversal of all of the contexts'
chunks) and sum them.  That is much more complex and expensive than the
current check: (++autoprepare_cached_plans > autoprepare_limit),
although I do not think that it will have a measurable impact on
performance...
Maybe there should be some faster way to estimate the memory consumed by
prepared statements.
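
Just to illustrate the kind of bookkeeping I mean, here is a sketch with
invented names (only autoprepare_cached_plans and autoprepare_limit exist in
the patch; the byte accounting and the eviction helper are hypothetical):

#include "postgres.h"

static int  autoprepare_cached_plans = 0;   /* exists in the patch */
static int  autoprepare_limit = 128;        /* exists in the patch (GUC) */
static Size autoprepare_used_memory = 0;    /* invented accumulator */
static Size autoprepare_memory_limit = 16 * 1024 * 1024;   /* invented GUC */

/* invented: drop the least recently used entry, decrementing both counters */
static void autoprepare_evict_lru_entry(void);

static void
autoprepare_note_plan_added(Size plan_bytes)
{
    autoprepare_used_memory += plan_bytes;
    autoprepare_cached_plans++;

    while (autoprepare_cached_plans > autoprepare_limit ||
           autoprepare_used_memory > autoprepare_memory_limit)
        autoprepare_evict_lru_entry();
}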


So, the current autoprepare_limit allows limiting the number of autoprepared
statements and prevents memory overflow caused by execution of a large
number of different statements.
The question is whether we need a more precise mechanism which will take
into account the difference between small and large queries.  Certainly a
simple query can require 10-100 times less memory than a complex one.  But
memory contexts themselves (even with a small block size) somehow minimize
the difference in memory footprint of different queries, because of chunked
allocation.




--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Constraint documentation

2018-08-07 Thread Lætitia Avrot
Hi Peter,

I understand what you're pointing at and I agree that it could be a good
thing to be able to dump/restore a table without problems.

My point was that check constraints weren't supposed to be used that way
in theory (or maybe I'm mistaken?), so I thought maybe we should just
inform the user that this kind of use of a check constraint is a misuse of
that feature.

Maybe it's not the right way to say it. I can remove the part about pg_dump
if it's too confusing...

Regards,

Lætitia



Le mer. 27 juin 2018 à 08:44, Peter Eisentraut <
peter.eisentr...@2ndquadrant.com> a écrit :

> On 6/26/18 09:49, Lætitia Avrot wrote:
> > +   
> > +
> > + Check constraints are not designed to enforce business rules
> across tables.
> > + Avoid using check constraints with a function accessing other
> tables and
> > + use  instead. Although PostgreSQL won't
> prevent you
> > + from doing so, beware that dumps generated by
> pg_dump
> > + or pg_dumpall may be hard
> > + to restore because of such checks, as the underlying dependencies
> are not
> > + taken into account.
> > +
> > +   
>
> In a way, I think this is attacking the wrong problem.  It is saying
> that you should use triggers instead of check constraints in certain
> circumstances.  But triggers are also used as an implementation detail
> of constraints.  While it currently doesn't exist, a deferrable check
> constraint would probably be implemented as a trigger.  It's not the
> triggerness that fixes this problem.  The problem is more generally that
> if a function uses a table, then pg_dump can't know about the ordering.
> It happens to work for triggers because triggers are dumped after all
> tables, as a performance optimization, and we could very well dump check
> constraints differently as well.
>
> --
> Peter Eisentraut  http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>


-- 
*Think! Do you really need to print this email ? *
*There is no Planet B.*


Re: Standby trying "restore_command" before local WAL

2018-08-07 Thread Tomas Vondra




On 08/06/2018 09:32 PM, Stephen Frost wrote:

Greetings,

* Tomas Vondra (tomas.von...@2ndquadrant.com) wrote:

On 08/06/2018 06:11 PM, Stephen Frost wrote:

* Tomas Vondra (tomas.von...@2ndquadrant.com) wrote:

On 08/06/2018 05:19 PM, Stephen Frost wrote:

* David Steele (da...@pgmasters.net) wrote:

I think for the stated scenario (known good standby that has been
shutdown gracefully) it makes perfect sense to trust the contents of
pg_wal.  Call this scenario #1.

An alternate scenario (#2) is that the data directory was copied using a
basic copy tool and the pg_wal directory was not excluded from the copy.
  This means the contents of pg_wal will be in an inconsistent state.
The files that are there might be partials (not with the extension,
though) and you can easily have multiple partials.  You will almost
certainly not have everything you need to get to consistency.


Yeah. But as Simon said, we do have fairly strong protections against applying
corrupted WAL - every record is CRC-checked. So why not fall back to the
restore_command only if the locally available WAL is not fully consistent?


"Corrupted" doesn't necessairly only mean "the data file was munged by
the storage somehow."  In this case, corrupted could be an old and only
partial WAL file, in which case we'd possibly be missing WAL that needs
to be replayed to bring the cluster back to a valid state, no?


Why wouldn't that be detected by checksums? Also, why wouldn't this be
handled correctly by the logic I proposed, i.e. falling back to the remote WAL
segment if the file is incomplete/broken in some sense?


WAL checksums are per WAL record, not across the whole file...  And,
yes, this seems like something we could probably write code to ensure
we're doing correctly, possibly even without changing the existing WAL
format (or maybe we would have to), but either way it seems like additional
code that would need to be written, and some serious thought put into it
to make sure it's correct.  I'm all for that, just to be clear, but what
I don't think we can do is move forward with a change to just prefer
pg_wal over the restore command.



While the checksums are per-record, the record also contains xl_prev, so
it's effectively a chain of checksums covering the whole file. The other
thing you can do is look at the first record of the next segment and use
its xl_prev to detect whether the previous segment was partial.
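
Roughly, the kind of check I have in mind is sketched below (against the
PG11-era xlogreader API; error handling, segment-boundary lookup and reader
setup are omitted):

#include "postgres.h"
#include "access/xlogreader.h"

/*
 * Sketch: read the records of a segment one by one and verify that each
 * record's xl_prev points at the start of the record we just read, i.e.
 * that the chain of CRC-protected records is unbroken up to 'upto'.
 */
static bool
wal_chain_is_intact(XLogReaderState *reader, XLogRecPtr start, XLogRecPtr upto)
{
    XLogRecPtr  prev_start = InvalidXLogRecPtr;
    XLogRecPtr  fetch = start;
    char       *errormsg;

    for (;;)
    {
        XLogRecord *record = XLogReadRecord(reader, fetch, &errormsg);

        if (record == NULL)
            return false;       /* CRC failure, short read, ... */

        if (prev_start != InvalidXLogRecPtr && record->xl_prev != prev_start)
            return false;       /* chain broken: stale or partial file */

        prev_start = reader->ReadRecPtr;

        if (reader->EndRecPtr >= upto)
            return true;

        fetch = InvalidXLogRecPtr;  /* continue with the following record */
    }
}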


But I do agree doing this properly may require writing some new code and 
checking that those cases are detected correctly.


(Note: There was a discussion about replacing xl_prev with LSN of the 
current record, and IMHO that would work just as well, although it might 
be a bit more expensive for some operations.)



But there's another good scenario (#3): where the pg_wal directory was
preloaded with all the WAL required to make the cluster consistent or
all the WAL that was available at restore time.  In this case, it would
be make sense to prefer the contents of pg_wal and only switch to
restore_command after that has been exhausted.

So, the choice of whether to prefer locally-stored or
restore_command-fetched WAL is context-dependent, in my mind.


Agreed.


Maybe, not sure.


The argument that David makes above in scenario #2 certainly looks
entirely likely to me and I don't think we've got any real protections
against that.  The current common use-cases happen to work around the
risk because tools like pg_basebackup ignore the existing pg_wal
directory when doing the backup and instead populate it with exactly the
correct WAL that's needed, and in cases where a restore command is
specified will always pull back only valid WAL, but I don't think we can
decide that this scenario (#2 from above):


I'm probably missing something, but why couldn't we detect that using CRC.
Or make sure we can detect that, e.g. by adding some additional info into
each WAL segment?


CRCs are per WAL record, and possibly some WAL records might not be ok
to replay, or at least we need to make sure that we replay the right set
of WAL in the right order even when there are partial WAL files being
given to PG (that aren't named that way...).  The more I think about
this, the more I think we really need to avoid partial WAL files entirely:
what are we going to do when we get to the end of one?  We'd need to request
the full one from the restore command anyway, so it seems like we should
just go ahead and get it from the archive; the question is whether there's an
easy/cheap way to detect partial WAL files in pg_wal.



As explained above, I don't think this is actually a problem. The
checksums do cover the whole file thanks to chaining, and there are ways
to detect partial segments. IMHO it's fine if we replay a segment and
then find out it was partial and that we need to fetch it from the archive
anyway and re-apply it - it should not be a very common case, except when
the user does something silly.



Ideally we could have a default that is safe in each scenario with

Re: partition tree inspection functions

2018-08-07 Thread Jesper Pedersen

Hi,

On 08/07/2018 03:32 AM, Amit Langote wrote:

Do we need a pg_partition_level that expects the individual partition OID
to be passed to it or can we do with the information we get from the
revised pg_partition_children?  In earlier revisions,
pg_partition_children returned only the partition OIDs, so we needed to
provide pg_partition_* functions for getting the parent, root parent,
level, etc. separately.  I mean to ask whether there is a need for having these
functions separately if the revised pg_partition_children already outputs
that information?



I'm thinking of the case where we only have information about a leaf 
partition, and we need its root partition and the actual level in the 
partition tree.


If we had

 SELECT pg_partition_root_parent('leafpart');

and

 SELECT pg_partition_level('leafpart');

-- we don't need the pg_partition_level('leafpart', 'parentpart') 
function now.


We can use pg_partition_children() for the rest.
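
(As a stop-gap, both of those can already be computed from pg_inherits with a
recursive query; a sketch, where 'leafpart' is a placeholder for the leaf's
name and plain single-parent partitioning is assumed:

WITH RECURSIVE ancestry AS (
    SELECT 'leafpart'::regclass AS relid, 0 AS level
  UNION ALL
    SELECT i.inhparent::regclass, a.level + 1
    FROM ancestry a
    JOIN pg_inherits i ON i.inhrelid = a.relid
)
SELECT relid AS root, level
FROM ancestry
ORDER BY level DESC
LIMIT 1;

That returns the root table and the leaf's depth below it, but dedicated
functions would of course be nicer.)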


pg_partition_leaf_children()'s output can be obtained as follows, after
adding isleaf column to pg_partition_children's output:



Yes, this is great.

Thanks !

Best regards,
 Jesper



Re: Negotiating the SCRAM channel binding type

2018-08-07 Thread Michael Paquier
On Tue, Aug 07, 2018 at 02:32:27PM +0530, Robert Haas wrote:
> On Sun, Aug 5, 2018 at 4:30 PM, Heikki Linnakangas  wrote:
>> Well, it'd be useless for users, there is no reason to switch off channel
>> binding if both the client and server support it. It might not add any
>> security you care about, but it won't do any harm either. The
>> non-channel-binding codepath is still exercised with non-SSL connections.
> 
> Is that true?  What if it makes a connection fail that you wanted to
> succeed?  Suppose we discover a bug that makes connections using
> channel binding fail on Thursdays.

Well, as things stand today on HEAD, if channel binding has a bug, this
makes SCRAM authentication unable to work over SSL, hence you need
to either drop SSL, drop SCRAM, or patch libpq so that the client tells the
server that it does not want to use channel binding.  None of those are
appealing.  Before 7729113, the client still had the choice to enforce
that channel binding not be used over SSL, which I think is a win to
bypass any potential future bugs.  On top of that, we can test
automatically, for *any* future SSL implementation, that (SSL + SCRAM +
no channel binding) actually works properly, which is, it seems at least
to me, a good thing to get more confidence when a new SSL implementation
is added.
--
Michael


signature.asc
Description: PGP signature


RE: [HACKERS] Cached plans and statement generalization

2018-08-07 Thread Yamaji, Ryo
> -Original Message-
> From: Konstantin Knizhnik [mailto:k.knizh...@postgrespro.ru]
> Sent: Friday, August 3, 2018 7:02 AM
> To: Yamaji, Ryo/山地 亮 
> Cc: PostgreSQL mailing lists 
> Subject: Re: [HACKERS] Cached plans and statement generalization
>
> Thank you.
> Unfortunately expression_tree_walker is not consistent with copyObject:
> I tried to use this walker to destroy raw parse tree of autoprepared
> statement, but looks like some nodes are not visited by
> expression_tree_walker. I have to create separate memory context for each
> autoprepared statement.
> New version  of the patch is attached.

Thank you for attaching the patch.
The improvement to the plan cache context from the new patch was confirmed.


> This patch will be included in next release of PgPro EE.
> Concerning next commit fest - I am not sure.
> At previous commitfest it was returned with feedback that it "has received no 
> review or comments since last May".
> May be your review will help to change this situation.

I want to confirm one point.
If I review the autoprepare patch, are you ready to register the patch at a
commitfest in the near future?  I fear that the autoprepare patch might not
get registered at a commitfest (for example, because you are busy) and
therefore never get applied to PostgreSQL.  If you are not ready to register
the patch, I would like to register it at the commitfest instead of you.


> I agree it may be more useful to limit the amount of memory used by prepared
> queries, rather than the number of prepared statements.
> But it is just more difficult to calculate and maintain (I am not sure
> that just looking at CacheMemoryContext is enough for it).
> Also, if the working set of queries (frequently repeated queries) doesn't
> fit in memory, then autoprepare will be almost useless (because with
> high probability a prepared query will be thrown away from the cache before
> it can be reused).  So limiting queries from the "application side" seems
> to be more practical.

I see.  But I fear that the autoprepare process may use an unpredictable
amount of memory when autoprepare_limit is specified as a number of prepared
statements.  I think there are scenarios in which the autoprepare process
uses a lot of memory (e.g. it needs to prepare a lot of long queries), so
that other processes (e.g. other backend processes in PostgreSQL, or
processes other than PostgreSQL) cannot use that memory.  I hope to be able
to specify a limit on the amount of memory in the future.


Re: [HACKERS] logical decoding of two-phase transactions

2018-08-07 Thread Arseny Sher


Andres Freund  writes:

>> - On decoding of aborted xacts. The idea to throw an error once we
>>   detect the abort is appealing, however I think you will have problems
>>   with subxacts in the current implementation. What if subxact issues
>>   DDL and then aborted, but main transaction successfully committed?
>
> I don't see a fundamental issue here. I've not reviewed the current
> patchset meaningfully, however. Do you see a fundamental issue here?

Hmm, yes, this is not an issue for this patch because after reading the
PREPARE record we know all aborted subxacts and won't try to decode
their changes. However, this will come up again once we decide to decode
in-progress transactions. Checking for all subxids is expensive;
moreover, WAL doesn't provide all of them until commit... it might be
easier to prevent vacuuming of aborted stuff while decoding needs it.
A matter for another patch, anyway.

>> - Currently we will call abort_prepared cb even if we failed to actually
>>   prepare xact due to concurrent abort. I think it is confusing for
>>   users. We should either handle this by remembering not to invoke
>>   abort_prepared in these cases or at least document this behaviour,
>>   leaving this problem to the receiver side.
>
> What precisely do you mean by "concurrent abort"?

With the current patch, the following is possible:
* We start decoding some prepared xact;
* The xact aborts (ABORT PREPARED) for any reason;
* The decoding process notices this on a catalog scan and calls the abort()
  callback;
* Later the decoding process reads the abort record and calls the
  abort_prepared callback.

--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: Negotiating the SCRAM channel binding type

2018-08-07 Thread Robert Haas
On Sun, Aug 5, 2018 at 4:30 PM, Heikki Linnakangas  wrote:
> Well, it'd be useless for users, there is no reason to switch off channel
> binding if both the client and server support it. It might not add any
> security you care about, but it won't do any harm either. The
> non-channel-binding codepath is still exercised with non-SSL connections.

Is that true?  What if it makes a connection fail that you wanted to
succeed?  Suppose we discover a bug that makes connections using
channel binding fail on Thursdays.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Fix hints on CREATE PROCEDURE errors

2018-08-07 Thread Amit Langote
Hi.

On 2018/08/07 3:32, Jeremy Evans wrote:
> When testing out CREATE PROCEDURE with 11 beta 2, I noticed, the hints
> in the errors reference DROP FUNCTION, which doesn't work for
> procedures.

Good catch.

> DROP ROUTINE works for both functions and procedures, so
> this patch should work for both.

Not sure if we should fix it the way your patch does, but it seems you
forgot to include changes to the expected output of some of the regression
tests affected by this patch.  Attached updated patch includes those.

Thanks,
Amit
diff --git a/src/backend/catalog/pg_proc.c b/src/backend/catalog/pg_proc.c
index 9b4015d0d4..84ee8d686d 100644
--- a/src/backend/catalog/pg_proc.c
+++ b/src/backend/catalog/pg_proc.c
@@ -409,7 +409,7 @@ ProcedureCreate(const char *procedureName,
ereport(ERROR,

(errcode(ERRCODE_INVALID_FUNCTION_DEFINITION),
 errmsg("cannot change return type of 
existing function"),
-errhint("Use DROP FUNCTION %s first.",
+errhint("Use DROP ROUTINE %s first.",
 
format_procedure(HeapTupleGetOid(oldtup);
 
/*
@@ -434,7 +434,7 @@ ProcedureCreate(const char *procedureName,

(errcode(ERRCODE_INVALID_FUNCTION_DEFINITION),
 errmsg("cannot change return 
type of existing function"),
 errdetail("Row type defined by 
OUT parameters is different."),
-errhint("Use DROP FUNCTION %s 
first.",
+errhint("Use DROP ROUTINE %s 
first.",
 
format_procedure(HeapTupleGetOid(oldtup);
}
 
@@ -477,7 +477,7 @@ ProcedureCreate(const char *procedureName,

(errcode(ERRCODE_INVALID_FUNCTION_DEFINITION),
 errmsg("cannot change 
name of input parameter \"%s\"",

old_arg_names[j]),
-errhint("Use DROP 
FUNCTION %s first.",
+errhint("Use DROP 
ROUTINE %s first.",
 
format_procedure(HeapTupleGetOid(oldtup);
}
}
@@ -501,7 +501,7 @@ ProcedureCreate(const char *procedureName,
ereport(ERROR,

(errcode(ERRCODE_INVALID_FUNCTION_DEFINITION),
 errmsg("cannot remove 
parameter defaults from existing function"),
-errhint("Use DROP FUNCTION %s 
first.",
+errhint("Use DROP ROUTINE %s 
first.",
 
format_procedure(HeapTupleGetOid(oldtup);
 
proargdefaults = SysCacheGetAttr(PROCNAMEARGSNSP, 
oldtup,
@@ -527,7 +527,7 @@ ProcedureCreate(const char *procedureName,
ereport(ERROR,

(errcode(ERRCODE_INVALID_FUNCTION_DEFINITION),
 errmsg("cannot change 
data type of existing parameter default value"),
-errhint("Use DROP 
FUNCTION %s first.",
+errhint("Use DROP 
ROUTINE %s first.",
 
format_procedure(HeapTupleGetOid(oldtup);
newlc = lnext(newlc);
}
diff --git a/src/test/regress/expected/polymorphism.out 
b/src/test/regress/expected/polymorphism.out
index 67e70c8c14..ec88621dc3 100644
--- a/src/test/regress/expected/polymorphism.out
+++ b/src/test/regress/expected/polymorphism.out
@@ -1081,7 +1081,7 @@ select dfunc(10,20);
 create or replace function dfunc(a variadic int[]) returns int as
 $$ select array_upper($1, 1) $$ language sql;
 ERROR:  cannot remove parameter defaults from existing function
-HINT:  Use DROP FUNCTION dfunc(integer[]) first.
+HINT:  Use DROP ROUTINE dfunc(integer[]) first.
 \df dfunc
  List of functions
  Schema | Name  | Result data type |   Argument data types 
  | Type 
@@ -1294,13 +1294,13 @@ returns record as $$
   select $1, $2;
 $$ language sql;
 ERROR:  cannot change name of input parameter 

Re: partition tree inspection functions

2018-08-07 Thread Amit Langote
Hi,

On 2018/08/03 21:35, Jesper Pedersen wrote:
> Hi Amit,
> 
> On 08/03/2018 04:28 AM, Amit Langote wrote:
>> That's a good idea, thanks.
>>
>> Actually, by the time I sent the last version of the patch or maybe few
>> versions before that, I too had started thinking if we shouldn't just have
>> a SETOF RECORD function like you've outlined here, but wasn't sure of the
>> fields it should have.  (relid, parentid, level) seems like a good start,
>> or maybe that's just what we need.
>>
> 
> I think there should be a column that identifies leaf partitions (bool
> isleaf), otherwise it isn't obvious in complex scenarios.

Ah, getting isleaf directly from pg_partition_children would be better
than an application figuring that out by itself.

>> Note that the level that's returned for each table is computed wrt the
>> root table passed to the function and not the actual root partition.
> 
> If you are given a leaf partition as input, then you will have to keep
> executing the query until you find the root, and count those. So, I think
> it should be either be the level to the root, or there should be another
> column that lists that (rootlevel).

The function pg_partition_children is for getting the partitions found under a
given root table.  If you pass a leaf partition to it, then there is
nothing under it, just the leaf partition itself, and its level wrt itself is
0.  That's what Robert said too, to which you replied:

On 2018/08/03 22:11, Jesper Pedersen wrote:
> We had the 2 pg_partition_level() functions and
> pg_partition_leaf_children() in v8, so it would be good to get those back.

Do we need a pg_partition_level that expects the individual partition OID
to be passed to it or can we do with the information we get from the
revised pg_partition_children?  In earlier revisions,
pg_partition_children returned only the partition OIDs, so we needed to
provide pg_partition_* functions for getting the parent, root parent,
level, etc. separately.  I mean to ask whether there is a need for having these
functions separately if the revised pg_partition_children already outputs
that information?

pg_partition_leaf_children()'s output can be obtained as follows, after
adding isleaf column to pg_partition_children's output:

select * from pg_partition_children('') where isleaf;


Attached updated patch adds isleaf to pg_partition_children's output.

Thanks,
Amit
From a5c9388b984ed7e70d3ed074c9ef3eb6de975d4a Mon Sep 17 00:00:00 2001
From: amit 
Date: Fri, 3 Aug 2018 17:06:05 +0900
Subject: [PATCH v10] Add pg_partition_children to report partitions

It returns set of records one for each partition containing the
partition name, parent name, and level in the partition tree with
given table as root
---
 doc/src/sgml/func.sgml   | 37 
 src/backend/catalog/partition.c  | 89 
 src/include/catalog/pg_proc.dat  |  9 +++
 src/test/regress/expected/partition_info.out | 84 ++
 src/test/regress/parallel_schedule   |  2 +-
 src/test/regress/serial_schedule |  1 +
 src/test/regress/sql/partition_info.sql  | 35 +++
 7 files changed, 256 insertions(+), 1 deletion(-)
 create mode 100644 src/test/regress/expected/partition_info.out
 create mode 100644 src/test/regress/sql/partition_info.sql

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index edc9be92a6..6b10aa3b3d 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -19995,6 +19995,43 @@ postgres=# SELECT * FROM 
pg_walfile_name_offset(pg_stop_backup());
 The function returns the number of new collation objects it created.

 
+   
+Partitioning Information Functions
+
+ 
+  Name Return Type 
Description
+ 
+
+  
+  
+   
pg_partition_children(regclass)
+   setof record
+   
+List name, parent name, level, and whether it's a leaf partition for
+each partition contained in the partition tree with given root table,
+including the root table itself
+   
+  
+ 
+
+   
+
+   
+To check the total size of the data contained in
+measurement table described in
+, one could use the
+following query:
+   
+
+
+select pg_size_pretty(sum(pg_relation_size(relid))) as total_size from 
pg_partition_children('measurement');
+ total_size 
+
+ 24 kB
+(1 row)
+
+
+
   
 
   
diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index 558022647c..5c919a5bee 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -23,12 +23,15 @@
 #include "catalog/partition.h"
 #include "catalog/pg_inherits.h"
 #include "catalog/pg_partitioned_table.h"
+#include "catalog/pg_type.h"
+#include "funcapi.h"
 #include "nodes/makefuncs.h"
 #include "optimizer/clauses.h"
 #include "optimizer/prep.h"
 #include "optimizer/var.h"
 #include "partitioning/partbounds.h"
 #include "rewrite/rewriteManip.h"
+#include 

Re: [WIP] [B-Tree] Retail IndexTuple deletion

2018-08-07 Thread Andrey Lepikhov

Hi,
I wrote a background worker (hcleaner) to demonstrate an application of
Retail IndexTuple deletion (see the attached patch).
Like autovacuum it uses the concept of one launcher and many workers,
but each worker corresponds to one database.


Short description:
The backend collects dirty block numbers in a hash table at the point in
the code immediately after the heap_page_prune() call. The backend sends a
batch of dirty block numbers (not one-by-one!) over a socket at the end of
the transaction or when the hash table is full.
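
To give an idea of the backend-side collection, here is a bare-bones sketch
(the names and the batch-send function are hypothetical; the real code is in
the attached patch):

#include "postgres.h"
#include "storage/block.h"
#include "utils/hsearch.h"

typedef struct DirtyBlockKey
{
    Oid         relid;
    BlockNumber blkno;
} DirtyBlockKey;

static HTAB *dirty_blocks = NULL;

/* hypothetical: ships the whole batch to the launcher over the socket */
static void hcleaner_send_batch(HTAB *blocks);

static void
remember_dirty_block(Oid relid, BlockNumber blkno)
{
    DirtyBlockKey key;
    bool        found;

    if (dirty_blocks == NULL)
    {
        HASHCTL     ctl;

        memset(&ctl, 0, sizeof(ctl));
        ctl.keysize = sizeof(DirtyBlockKey);
        ctl.entrysize = sizeof(DirtyBlockKey);
        dirty_blocks = hash_create("hcleaner dirty blocks", 1024, &ctl,
                                   HASH_ELEM | HASH_BLOBS);
    }

    key.relid = relid;
    key.blkno = blkno;
    (void) hash_search(dirty_blocks, &key, HASH_ENTER, &found);

    if (hash_get_num_entries(dirty_blocks) >= 1024)
        hcleaner_send_batch(dirty_blocks);  /* also flushed at end of xact */
}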

The launcher transfers block numbers to the corresponding workers.
A worker collects dead tuples from a block, cleans the index relations, and
cleans the heap block. It uses conditional locking with a waiting-list
approach if the heap block is busy.


hcleaner has experimental status, but make check-world passes.
For benchmarking I used a Xiaomi notebook with an 8th-gen Intel Core i5 processor.

BENCHMARK
--

test: pgbench -i -s 100 && pgbench -c 25 -j 8 -M prepared -P 60 -T 3600
autovacuum = off

master:
---
number of transactions actually processed: 6373215
latency average = 14.122 ms
latency stddev = 9.458 ms
tps = 1770.291436 (including connections establishing)
tps = 1770.293191 (excluding connections establishing)

VACUUM verbose pgbench_accounts:

INFO: vacuuming "public.pgbench_accounts"
INFO: scanned index "pgbench_accounts_pkey" to remove 237496 row versions
DETAIL: CPU: user: 4.67 s, system: 0.27 s, elapsed: 8.05 s
INFO: "pgbench_accounts": removed 237496 row versions in 167652 pages
DETAIL: CPU: user: 7.54 s, system: 3.40 s, elapsed: 26.10 s
INFO: index "pgbench_accounts_pkey" now contains 1000 row versions 
in 27422 pages

DETAIL: 237496 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO: "pgbench_accounts": found 165275 removable, 1000 nonremovable 
row versions in 167840 out of 167840 pages

DETAIL: 0 dead row versions cannot be removed yet, oldest xmin: 6373796
There were 82282 unused item pointers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
CPU: user: 20.33 s, system: 5.88 s, elapsed: 51.38 s.

patched:

number of transactions actually processed: 6338593
latency average = 14.199 ms
latency stddev = 13.988 ms
tps = 1760.685922 (including connections establishing)
tps = 1760.688038 (excluding connections establishing)

VACUUM verbose pgbench_accounts:

INFO: vacuuming "public.pgbench_accounts"
INFO: scanned index "pgbench_accounts_pkey" to remove 1804 row versions
DETAIL: CPU: user: 1.84 s, system: 0.05 s, elapsed: 3.34 s
INFO: "pgbench_accounts": removed 1804 row versions in 1468 pages
DETAIL: CPU: user: 0.06 s, system: 0.03 s, elapsed: 1.42 s
INFO: index "pgbench_accounts_pkey" now contains 1000 row versions 
in 27422 pages

DETAIL: 1618 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO: "pgbench_accounts": found 168561 removable, 1000 nonremovable 
row versions in 169466 out of 169466 pages

DETAIL: 0 dead row versions cannot be removed yet, oldest xmin: 6339174
There were 75478 unused item pointers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
CPU: user: 12.27 s, system: 4.03 s, elapsed: 31.43 s.

--
Andrey Lepikhov
Postgres Professional
https://postgrespro.com
The Russian Postgres Company
>From 4bc8e4806b1bb1ea13dcd9e9201d392e3741adb9 Mon Sep 17 00:00:00 2001
From: "Andrey V. Lepikhov" 
Date: Tue, 7 Aug 2018 11:22:13 +0500
Subject: [PATCH 5/5] Background Cleaner

---
 .../postgres_fdw/expected/postgres_fdw.out|  30 ++--
 contrib/postgres_fdw/sql/postgres_fdw.sql |   4 +-
 src/backend/access/heap/pruneheap.c   |   2 +
 src/backend/access/nbtree/nbtree.c|   3 +-
 src/backend/access/transam/xact.c |   4 +
 src/backend/commands/vacuumlazy.c |  45 +++--
 src/backend/postmaster/Makefile   |   2 +-
 src/backend/postmaster/pgstat.c   |  36 
 src/backend/postmaster/postmaster.c   | 160 +-
 src/backend/storage/ipc/ipci.c|   3 +
 src/backend/storage/lmgr/lwlocknames.txt  |   1 +
 src/backend/storage/lmgr/proc.c   |   5 +-
 src/backend/tcop/postgres.c   |  12 ++
 src/backend/utils/hash/Makefile   |   2 +-
 src/backend/utils/init/miscinit.c |   3 +-
 src/backend/utils/init/postinit.c |  11 +-
 src/include/commands/vacuum.h |   3 +
 src/include/pgstat.h  |  11 ++
 src/include/storage/pmsignal.h|   2 +
 src/test/regress/expected/rules.out   |   2 +-
 src/test/regress/expected/triggers.out|   2 +-
 src/test/regress/input/constraints.source |  12 +-
 src/test/regress/output/constraints.source|  34 ++--
 

Re: Optimizer misses big in 10.4 with BRIN index

2018-08-07 Thread Arcadiy Ivanov

On 07/26/2018 07:27 AM, Tomas Vondra wrote:
Arcadiy, can you provide plans with parallel query disabled? Or even 
better, produce a test case that reproduces this (using synthetic 
data, anonymized data or something like that, if needed).
So I basically spent most of the time trying to create a reproducible
case, and I have to say I failed. I can, however, reproduce this with a
specific large data set with a specific data distribution, but not an
artificial one. Unfortunately the data is restricted, so I have to obfuscate
the data structures. That said, I am able to reproduce this sporadically on a
local machine after exporting from RDS 10.4. On some days reproduction always
happens, on others it's hard to reproduce at all. So here it goes in
sequential order:




** Data is imported into database
==
db0=# \d+ schema0.table0
                                        Table "schema0.table0"
 Column |          Type          | Collation | Nullable | Default | Storage  | Stats target | Description
--------+------------------------+-----------+----------+---------+----------+--------------+-------------
 field1 | character varying(128) |           | not null |         | extended |              |
 field2 | character varying(128) |           | not null |         | extended |              |
 field3 | character varying(128) |           | not null |         | extended |              |
 field4 | bigint                 |           | not null |         | plain    |              |
 field5 | bigint                 |           | not null |         | plain    |              |
 field6 | bigint                 |           | not null |         | plain    |              |
 data   | jsonb                  |           | not null |         | extended |              |
Indexes:
    "date_idx" brin (((data ->> 'date'::text)::numeric)) WITH (autosummarize='true')
Options: autovacuum_vacuum_scale_factor=0.0, autovacuum_vacuum_threshold=5000, autovacuum_analyze_scale_factor=0.0, autovacuum_analyze_threshold=5000, fillfactor=75
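
For anyone trying to build a synthetic reproducer, the obfuscated
definition above corresponds to roughly the following DDL. This is only
a sketch reconstructed from the \d+ output shown here; the names are the
placeholders used in this message, and the data itself is the part that
cannot be shared:

-- assumes the schema0 schema already exists
CREATE TABLE schema0.table0 (
    field1 varchar(128) NOT NULL,
    field2 varchar(128) NOT NULL,
    field3 varchar(128) NOT NULL,
    field4 bigint NOT NULL,
    field5 bigint NOT NULL,
    field6 bigint NOT NULL,
    data   jsonb NOT NULL
) WITH (autovacuum_vacuum_scale_factor = 0.0,
        autovacuum_vacuum_threshold = 5000,
        autovacuum_analyze_scale_factor = 0.0,
        autovacuum_analyze_threshold = 5000,
        fillfactor = 75);

-- BRIN index on the extracted date, with on-the-fly summarization enabled
CREATE INDEX date_idx ON schema0.table0
    USING brin (((data ->> 'date')::numeric)) WITH (autosummarize = true);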



** Data distribution in the table
==
db0# SELECT (data->>'date')::numeric as date, count(1) FROM 
schema0.table0 GROUP BY date ORDER BY date;

    date    |  count
------------+---------
 1527580800 |       5
 1527753600 |      72
 1527840000 |      36
 1528012800 |      18
 1528099200 |      72
 1528185600 |    7266
 1528272000 |     336
 1528358400 |     230
 1528444800 |      90
 1528704000 |      76
 1528790400 |       4
 1528876800 |    9060
 1528934400 |       6
 1528963200 |      33
 1529193600 |       6
 1529280000 |      18
 1529294400 |  541344
 1529380800 | 1113121
 1529467200 |  794082
 1529553600 |  604752
 1529812800 |  135252
 1529899200 |   63222
 1529985600 |   31134
 1530072000 |   25392
 1530158400 |      90
 1530417600 |      48
 1530504000 |     144
 1530518400 |       1
 1530590400 |    1958
 1530604800 |       1
 1530763200 |   48614
 1530849600 |       9
 1531022400 |       3
 1531108800 |       3
 1531195200 |       1
*** This is the range we're interested in
 1531281600 |    3713
***
 1531627200 |     300
 1531713600 |    1113
 1531800000 |      30
 1531886400 |   36000
 1532232000 |   24900
 1532577600 |  409364
 1532664000 |    2050
 1532836800 |  334782
 1532923200 |    6771
(45 rows)

** No analysis was run after import. Let's see how we do.
===
db0=# EXPLAIN (analyze, verbose, costs, buffers)
      SELECT ctid, (data->>'date')::numeric as date
      FROM schema0.table0
      WHERE (data->>'date')::numeric >= '1531267200'
        AND (data->>'date')::numeric <= '1531353600'
      ORDER BY ctid, (data->>'date')::numeric DESC
      LIMIT 1000;
                                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1102962.86..1103079.54 rows=1000 width=38) (actual time=7234.091..7234.697 rows=1000 loops=1)
   Output: ctid, (((data ->> 'date'::text))::numeric)
   Buffers: shared hit=9 read=350293
   ->  Gather Merge  (cost=1102962.86..1105002.57 rows=17482 width=38) (actual time=7234.089..7234.589 rows=1000 loops=1)
         Output: ctid, (((data ->> 'date'::text))::numeric)
         Workers Planned: 2
         Workers Launched: 2
         Buffers: shared hit=9 read=350293
         ->  Sort  (cost=1101962.84..1101984.69 rows=8741 width=38) (actual time=7227.938..7227.985 rows=964 loops=3)
               Output: ctid, (((data ->> 'date'::text))::numeric)
               Sort Key: table0.ctid, (((table0.data ->> 'date'::text))::numeric) DESC
               Sort Method: quicksort  Memory: 102kB
               Buffers: shared hit=144 read=1048851
               Worker 0: actual time=7223.770..7223.820 rows=1272 loops=1
                 Buffers: shared hit=68 read=348293
               Worker 1: actual
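
(The worker detail above is where the archived message cuts off.)  For
the plans with parallel query disabled that Tomas asked for, a
session-local setting along these lines is one way to capture them; the
GUC itself is standard, but the exact configuration used on the original
machine is not shown here:

-- disable parallel plans for this session only, then repeat the same query
SET max_parallel_workers_per_gather = 0;

EXPLAIN (ANALYZE, VERBOSE, COSTS, BUFFERS)
SELECT ctid, (data->>'date')::numeric AS date
FROM schema0.table0
WHERE (data->>'date')::numeric >= '1531267200'
  AND (data->>'date')::numeric <= '1531353600'
ORDER BY ctid, (data->>'date')::numeric DESC
LIMIT 1000;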

Re: BUG #15212: Default values in partition tables don't work as expected and allow NOT NULL violation

2018-08-07 Thread Amit Langote
Thanks Ashutosh, and sorry that I somehow missed replying to this.

On 2018/07/13 22:50, Ashutosh Bapat wrote:
> On Thu, Jul 12, 2018 at 2:29 PM, Amit Langote wrote:
>> I have modified the comments around this code in the updated patch.
> 
> +/*
> + * Each member in 'saved_schema' contains a ColumnDef containing
> + * partition-specific options for the column.  Below, we merge the
> + * information from each into the ColumnDef of the same name found in
> + * the original 'schema' list before deleting it from the list.  So,
> + * once we've finished processing all the columns from the original
> + * 'schema' list, there shouldn't be any ColumnDefs left that came
> + * from the 'saved_schema' list.
> + */
> 
> This is conveyed by the prologue of the function.
> 
> 
> +/*
> + * Combine using OR so that the NOT NULL constraint
> + * in the parent's definition doesn't get lost.
> + */
> This too is specified in prologue as
>  * Constraints (including NOT NULL constraints) for the child table
>  * are the union of all relevant constraints, from both the child schema
>  * and parent tables.
> 
> So, I don't think we need any special mention of OR, "union" conveys
> the intention.

OK, I have removed both comments.

>>> On a side note, I see
>>> coldef->constraints = restdef->constraints;
>>> Shouldn't we merge the constraints instead of just overwriting those?
>>
>> Actually, I checked that coldef->constraints is always NIL in this case.
>> coldef (ColumnDef) is constructed from a member in the parent's TupleDesc
>> earlier in that function and any constraints that may be present in the
>> parent's definition of the column are saved in a separate variable that's
>> returned to the caller as one list containing "old"/inherited constraints.
>>  So, no constraints are getting lost here.
> 
> In case of multiple inheritance coldef->constraints is "union" of
> constraints from all the parents as described in the prologue.

AFAICS, the prologue doesn't mention *just* coldef->constraints.  It talks
about both the constraints that are to be specified using various
ColumnDef fields (is_not_null, raw_default, cooked_default, etc.) and
constraints that are to be returned in the *supconstr output list.  Both
types of constraints are obtained by union'ing corresponding values from
all parents and child's own definition.
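
To make that union'ing concrete for plain (multiple) inheritance, here
is a small SQL-level sketch, not taken from the patch, of what the merge
looks like: NOT NULL and CHECK constraints from both parents end up on
the single merged column in the child:

create table p1 (a int not null, b int check (b > 0));
create table p2 (a int, b int not null);

-- c gets one copy of each column: a is NOT NULL (from p1), while b is
-- both NOT NULL (from p2) and subject to the CHECK (b > 0) from p1.
create table c () inherits (p1, p2);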

> But in
> case of partitioning there is only one parent and hence
> coldef->constraints is NULL and hence just overwriting it works. I
> think we need comments mentioning this special case.

That's what I wrote in this comment:

    /*
     * Parent's constraints, if any, have been saved in
     * 'constraints', which are passed back to the caller,
     * so it's okay to overwrite the variable like this.
     */

> Also, I am not sure whether we really need all conditions related to
> raw_default and cooked_default. Do you have any testcase showing the
> need for those changes?

Without the patch (example below is tested on PG 10, but same is true with
PG 11 and HEAD too):

create table parent (a int, b int default -1) partition by list (a);
create table child partition of parent (b not null) for values in (0);

\d parent
                    Table "public.parent"
 Column │  Type   │ Collation │ Nullable │    Default
────────┼─────────┼───────────┼──────────┼───────────────
 a      │ integer │           │          │
 b      │ integer │           │          │ '-1'::integer
Partition key: LIST (a)
Number of partitions: 1 (Use \d+ to list them.)

\d child
   Table "public.child"
 Column │  Type   │ Collation │ Nullable │ Default
┼─┼───┼──┼─
 a  │ integer │   │  │
 b  │ integer │   │ not null │
Partition of: parent FOR VALUES IN (0)

Note that child didn't inherit -1 as the default for b.  That happens
because the partition-specific ColumnDef for b, created to carry the NOT
NULL constraint, has its raw_default and cooked_default fields set to
NULL.  The current code blindly assigns that ColumnDef's raw_default and
cooked_default to the corresponding fields of the inherited ColumnDef.
Overriding raw_default like that is fine, because the partition-specified
default gets priority, but not cooked_default, which may contain the
inherited default.
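
Given the unpatched \d output above (b in child is NOT NULL but has lost
the default), the symptom is easy to see with a couple of inserts.  This
is just an illustration using the same example tables, not output copied
from a particular run:

-- child.b has no default, so omitting it fails outright:
insert into child (a) values (0);
-- ERROR:  null value in column "b" violates not-null constraint

-- whereas routing through the parent still applies the parent's -1 default:
insert into parent (a) values (0);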

That's not an issue if the child's definition doesn't specify any
constraints of its own:

create table parent (a int, b int default -1) partition by list (a);
create table child partition of parent for values in (0);

\d child
                   Table "public.child"
 Column │  Type   │ Collation │ Nullable │    Default
────────┼─────────┼───────────┼──────────┼───────────────
 a      │ integer │           │          │
 b      │ integer │           │