Re: [bug fix] Cascaded standby cannot start after a clean shutdown

2018-06-11 Thread Michael Paquier
On Sun, Mar 18, 2018 at 08:49:01AM +0900, Michael Paquier wrote:
> On Fri, Mar 16, 2018 at 06:02:25AM +, Tsunakawa, Takayuki wrote:
>> Ouch, you're right.  If memory allocation fails, the startup process
>> would emit a LOG message and continue to fetch new WAL records.  Then,
>> I'm completely happy with your patch.
> 
> Thanks for double-checking, Tsunakawa-san.

As this is one of those small bug fixes we can actually do something
about, attached is an updated patch with a commit description and all
the rest.  In the end, I have moved the size check into
allocate_recordbuf().  Even if the calculated size is not going to be
higher than total_len, it feels safer in the long term for two reasons:
- The allocation size calculation could be changed, or let's say made
smarter.
- New callers of allocate_recordbuf() would not be protected against the
problem in the backend, which could become an issue if the WAL reader
facility is extended.

Thoughts?
--
Michael
From 7688dceee5590b8430a90cfae2b98917f710e824 Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Tue, 12 Jun 2018 14:52:35 +0900
Subject: [PATCH] Prevent hard failures of standbys caused by recycled WAL
 segments

When a standby stops reading WAL from a WAL stream, it writes data to
the current WAL segment without first zeroing the page being written,
which can cause the WAL reader to read junk data from a past recycled
segment and then try to build a record out of it.  While the sanity
checks in place provide most of the protection needed, in some rare
circumstances, with chances increasing when a record header crosses a
page boundary, the startup process can fail violently on an allocation
failure.  This is unhelpful, as it would in the worst case require a
manual restart of the instance, potentially impacting the availability
of the cluster.

The chances of seeing such failures are higher if the connection between
the standby and its root node is flaky, causing WAL pages to be written
only halfway through.  A couple of approaches have been discussed, like
zeroing new WAL pages within the WAL receiver itself, but this has the
disadvantage of impacting the performance of existing instances, as it
breaks the sequential writes done by the WAL receiver.  This commit
deals with the problem using a simpler approach, which has no
performance impact and does not reduce the detection of the problem: if
a record is found with a length higher than 1GB in the backend, then do
not attempt any allocation and report a soft failure instead.  It is
possible that the allocation call succeeds and that an unnecessary
amount of memory gets allocated, but follow-up checks on the record
would just fail, making the allocation short-lived anyway.

This patch owes a great deal to Tsunakawa Takayuki for reporting the
failure first, and then discussing a couple of potential approaches to
the problem.

Reported-by: Tsunakawa Takayuki
Reviewed-by: Tsunakawa Takayuki
Author: Michael Paquier
Discussion: https://postgr.es/m/0A3221C70F24FB45833433255569204D1F8B57AD@G01JPEXMBYT05
---
 src/backend/access/transam/xlogreader.c | 25 +
 1 file changed, 25 insertions(+)

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index a7953f323b..5929ce30af 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -25,6 +25,10 @@
 #include "common/pg_lzcompress.h"
 #include "replication/origin.h"
 
+#ifndef FRONTEND
+#include "utils/memutils.h"
+#endif
+
 static bool allocate_recordbuf(XLogReaderState *state, uint32 reclength);
 
 static bool ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
@@ -160,6 +164,27 @@ allocate_recordbuf(XLogReaderState *state, uint32 reclength)
 	newSize += XLOG_BLCKSZ - (newSize % XLOG_BLCKSZ);
 	newSize = Max(newSize, 5 * Max(BLCKSZ, XLOG_BLCKSZ));
 
+#ifndef FRONTEND
+
+	/*
+	 * Note that in very unlucky circumstances, the random data read from a
+	 * recycled segment can cause this routine to be called with a size
+	 * resulting in a hard failure at allocation.  For frontend utilities
+	 * this likely means that the end of WAL has been reached, but for a
+	 * standby this would cause the instance to stop abruptly with a hard
+	 * failure, preventing it from retrying to fetch WAL from one of its
+	 * sources, which could allow it to move on with replay without a
+	 * manual restart.  If the data comes from a past recycled segment and
+	 * is still valid, then the allocation would succeed but follow-up
+	 * checks on the record would just fail, making it short-lived.  If the
+	 * allocation fails because of a memory shortage, then this is not a
+	 * hard failure either, per the guarantee given by MCXT_ALLOC_NO_OOM.
+	 */
+	if (!AllocSizeIsValid(newSize))
+		return false;
+
+#endif
+
 	if (state->readRecordBuf)
 		pfree(state->readRecordBuf);
 	state->readRecordBuf =
-- 
2.17.1





Re: SCRAM with channel binding downgrade attack

2018-06-11 Thread Michael Paquier
On Mon, Jun 11, 2018 at 04:54:45PM +0200, Magnus Hagander wrote:
> I'm wondering if that means we should then also not do it specifically for
> scram in this version. Otherwise we're likely to end up with a parameter
> that only has a "lifetime" of one version, and that seems like a bad idea.
> If nothing else we should clearly think out what the path is to make sure
> that doesn't happen. (e.g. we don't want a
> scram_channel_binding_mode=require in this version, if the next one is
> going to replace it with something like heikkis suggested
> allowed_authentication_methods=SCRAM-SHA-256-PLUS or whatever we end up
> coming up with there).

Conceptually, it depends on whether we consider SCRAM and
SCRAM+channel_binding as two separate authentication methods.  However,
it seems to me that both are really the same thing, as they use the same
protocol, so it would be confusing for the user to be able to define
both SCRAM-SHA-256 and SCRAM-SHA-256-PLUS within the same list.  Hence I
would tend to think that this future "allowed_authentication_methods"
should list only scram-sha-256 as an option, which counts for SCRAM both
with and without channel binding.

Thinking harder about this thread, it is also possible that in the
future we add support for channel binding for some other SASL mechanism,
in which case I would tend to rename scram_channel_binding_type and
scram_channel_binding_mode to simply channel_binding_type and
channel_binding_mode, without any concept of SCRAM attached to them.  So
in short, I'd like to keep both enforcement mechanisms as separate
concepts.  One future compatibility issue is how to deal with, for
example, cases like allowed_authentication_methods="md5" combined with
channel_binding_mode=require, where an allowed authentication method
does not support channel binding.  It seems to me that this should
result in an error.
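
To be concrete, the combinations being discussed would look something
like this from libpq (illustrative only -- none of these parameter names
are committed, and they may still change):

    PGconn *conn;

    /* SCRAM with a given channel binding type, made mandatory */
    conn = PQconnectdb("host=example.com dbname=postgres "
                       "scram_channel_binding_type=tls-server-end-point "
                       "scram_channel_binding_mode=require");

    /* future conflicting combination that should simply error out */
    conn = PQconnectdb("host=example.com dbname=postgres "
                       "allowed_authentication_methods=md5 "
                       "channel_binding_mode=require");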

Opinions of others are welcome.
--
Michael




Re: Fix some error handling for read() and errno

2018-06-11 Thread Michael Paquier
On Mon, Jun 11, 2018 at 03:17:15PM -0700, Andres Freund wrote:
> On 2018-06-11 18:11:05 -0400, Alvaro Herrera wrote:
>> As for the messages, I propose to make them more regular, i.e. always
>> use the wording "could not read from file "%s": read %d, expected %zu", 
>> avoiding variations such as
>>  not enough data in file \"%s\": %d read, %d expected"
>>  could not read compressed file \"%s\": read %d out of %zu
>>  could not read any data from log segment %s, offset %u, length %lu
>> and others that appear in other places.  (In the last case, I even go as
>> far as proposing "read %d, expected %zu" where the first %d is
>> constant zero.)  Extra points if the sprintf format specifiers are always
>> the same (say %zu) instead of, say, %d in odd places.
> 
>> I would go as far as suggesting to remove qualifiers that indicate what
>> the file is for (such as "relation mapping file"); relying on the path
>> as indicator of what's going wrong should be sufficient, since it's an
>> error that affects internals anyway, not anything that users can do much
>> about.  Keep variations to a minimum, to ease translator's work;
>> sometimes it's hard enough to come up with good translations for things
>> like "relation mapping file" in the first place, and they don't help the
>> end user.
> 
> If we go there, why not wrap the read/write/etc calls into a wrapper,
> including the error handling?

This thread has spotted 17 code paths involved: 13 in the backend and 4
in src/bin.  For the frontend part, the error handling of each of them
is slightly different (pg_rewind uses pg_fatal, pg_waldump uses
fatal_error), so I don't think that another wrapper would bring more
clarity there.  For the backend part, 7 of them work on transient files
and 6 do not, which would make the wrapper a tad more complex if we
needed, for example, a set of input flags to control whether the fd
passed down by the caller should be closed with CloseTransientFile()
before emitting an error.  If the balance were way more in favor of one
or the other, though...

> I'm not quite convinced that "relation mapping file" doesn't provide any
> information. It's likely to be easier to search for than a specific
> filename, particularly if there's oids or such in the name...

Agreed.  I also quite like the message mentioning 2PC files directly as
well.  I think that we could gain by making all of the trailing messages
more consistent, as the markers used and the style of each message are
slightly different, so I would suggest something like this instead to
gain in consistency:

    if (readBytes < 0)
        ereport(elevel, "could not blah: %m");
    else
        ereport(elevel, "could not blah: %d read, expected %zu");

My point is that if we use the same markers and the same trailing
messages, then those become easier to grep for, and callers are still
free to phrase the head of the error message the way they want depending
on the situation.
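
For what it's worth, if we did go with Andres' wrapper idea on the
backend side, I would expect a thin helper along these lines (only a
sketch -- the function name and the exact error codes are made up here):

    /*
     * Hypothetical helper: read exactly rq_len bytes from an open file,
     * reporting failures with a consistent message shape.
     */
    static void
    read_file_part(int fd, const char *path, void *buf, size_t rq_len,
                   int elevel)
    {
        int         readBytes;

        errno = 0;
        readBytes = read(fd, buf, rq_len);
        if (readBytes < 0)
            ereport(elevel,
                    (errcode_for_file_access(),
                     errmsg("could not read file \"%s\": %m", path)));
        else if ((size_t) readBytes != rq_len)
            ereport(elevel,
                    (errcode(ERRCODE_DATA_CORRUPTED),
                     errmsg("could not read file \"%s\": read %d, expected %zu",
                            path, readBytes, rq_len)));
    }

Transient-file callers would still need to close their file descriptor
before erroring out, which is the part that makes a one-size-fits-all
wrapper less attractive, as mentioned above.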
--
Michael




Re: Supporting tls-server-end-point as SCRAM channel binding for OpenSSL 1.0.0 and 1.0.1

2018-06-11 Thread Michael Paquier
On Mon, Jun 11, 2018 at 10:47:23AM -0400, Peter Eisentraut wrote:
> I think we'll just have to wait for an updated RFC on channel bindings
> for TLS 1.3.
> 
> Perhaps we should change PostgreSQL 11 to not advertise channel binding
> when TLS 1.3 is used?

Yeah, that's what we should do, and I would vote for doing nothing as
long as we are not sure how the TLS 1.3 channel binding story gets
shaped in the end, as we could as well wind up able to use only
tls-server-end-point, in which case -PLUS can still be advertised.

From a technical point of view, the decision-making can happen with
Port->ssl->version by looking for TLS1_3_VERSION, which is new as of
OpenSSL 1.1.1, so that's very fresh (1.1.1 beta 5 is out as of today).
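
In code that would be something like the following, although this is
only an untested sketch (SSL_version() is used instead of poking at the
struct directly, and the surrounding flag is invented):

    #ifdef TLS1_3_VERSION
        /* Do not advertise SCRAM-SHA-256-PLUS over a TLS 1.3 connection. */
        if (port->ssl != NULL &&
            SSL_version(port->ssl) >= TLS1_3_VERSION)
            advertise_binding = false;
    #endif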
--
Michael




Re: Checkpoint not retrying failed fsync?

2018-06-11 Thread Thomas Munro
On Sun, Apr 8, 2018 at 9:17 PM, Thomas Munro
 wrote:
> New patch attached.

For Linux users (read: almost all users) this patch on its own is a
bit like rearranging the deck chairs on the Titanic.  Because retrying
on failure is useless, among other problems, Craig and Andres are
working on converting IO errors into PANICs and overhauling the
request queue[1].

I was about to mark this patch "rejected" and forget about it, since
Craig's patch makes it redundant.  But then I noticed that Craig's
patch doesn't actually remove the retry behaviour completely: it
promotes only EIO and ENOSPC to PANIC.  If you somehow manage to get
any other errno from fsync(), the checkpoint will fail and you'll be
exposed to this bug: PostgreSQL forgets that the segment was dirty, so
the next checkpoint might fsync nothing at all and then bogusly report
success.  So, I think either Craig's patch should PANIC on *any* error
from fsync(), or we should commit this patch too so that we remember
which segments are dirty next time around.
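
The first option would boil down to something like this in the fsync
path (rough sketch only, not Craig's actual patch):

    /* Promote any fsync() failure to PANIC instead of retrying later. */
    if (pg_fsync(fd) != 0)
        ereport(PANIC,
                (errcode_for_file_access(),
                 errmsg("could not fsync file \"%s\": %m", path)));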

[1] 
https://www.postgresql.org/message-id/flat/CAMsr%2BYFPeKVaQ57PwHqmRNjPCPABsdbV%3DL85he2dVBcr6yS1mA%40mail.gmail.com

-- 
Thomas Munro
http://www.enterprisedb.com



Re: Does logical replication supports cross platform servers?

2018-06-11 Thread Craig Ringer
On 12 June 2018 at 11:04, Haribabu Kommi  wrote:

> Hi All,
>
> I am not able to find any docs suggesting that PostgreSQL logical
> replication supports
> cross platform servers (windows --> Linux or vice-versa).
>
>
It should. I don't think there's buildfarm coverage yet, though.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Problem while updating a foreign table pointing to a partitioned table on foreign server

2018-06-11 Thread Kyotaro HORIGUCHI
Thanks for the discussion.

At Thu, 7 Jun 2018 19:16:57 +0530, Ashutosh Bapat 
 wrote in 

> On Tue, Jun 5, 2018 at 3:40 PM, Kyotaro HORIGUCHI
>  wrote:
> > Hello.
> >
> > At Mon, 04 Jun 2018 20:58:28 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
> >  wrote in 
> > <20180604.205828.208262556.horiguchi.kyot...@lab.ntt.co.jp>
> >> It fails on some join-pushdown cases since it doesn't add tid
> >> columns to join tlist.  I suppose that build_tlist_to_deparse
> >> needs something but I'll consider further tomorrow.
> >
> > I made it work with a few exceptions and bumped.  PARAM_EXEC
> > doesn't work at all in a case where Sort exists between
> > ForeignUpdate and ForeignScan.
> >
> > =
> > explain (verbose, costs off)
> > update bar set f2 = f2 + 100
> > from
> >   ( select f1 from foo union all select f1+3 from foo ) ss
> > where bar.f1 = ss.f1;
> >   QUERY PLAN
> > -
> >  Update on public.bar
> >Update on public.bar
> >Foreign Update on public.bar2
> >  Remote SQL: UPDATE public.loct2 SET f2 = $3 WHERE tableoid = $1 AND 
> > ctid = $2
> > ...
> >->  Merge Join
> >  Output: bar2.f1, (bar2.f2 + 100), bar2.f3, (ROW(foo.f1))
> >  Merge Cond: (bar2.f1 = foo.f1)
> >  ->  Sort
> >Output: bar2.f1, bar2.f2, bar2.f3, bar2.tableoid, bar2.ctid
> >Sort Key: bar2.f1
> >->  Foreign Scan on public.bar2
> >  Output: bar2.f1, bar2.f2, bar2.f3, bar2.tableoid, 
> > bar2.ctid
> >  Remote SQL: SELECT f1, f2, f3, ctid, tableoid FROM 
> > public.loct2 FOR UPDATE
> > =
> 
> What's the problem that you faced?

The required condition for PARAM_EXEC to work is that the executor
ensures the correspondence between the setter and the reader of a
param, as ExecNestLoop does.  The Sort node breaks the correspondence
between the tuple obtained from the Foreign Scan and the one that
ForeignUpdate is updating.  Specifically, Foreign Update updates the
first tuple using the tableoid of the last tuple from the Foreign
Scan.

> > Even if this worked fine, it cannot be back-patched.  We need extra
> > storage that moves together with tuples, or to prevent Sort-like
> > nodes from being inserted there.
> 
> I think your approach still has the same problem that it's abusing the
> tableOid field in the heap tuple to store tableoid from the remote as
> well as local table. That's what Robert and Tom objected to [1], [2]

That's a misunderstanding.  PARAM_EXEC conveys remote tableoids outside
of the tuples, and each tuple stores the correct (= local)
tableoid.

> > At Fri, 1 Jun 2018 10:21:39 -0400, Ashutosh Bapat 
> >  wrote in 
> > 
> >> I am not suggesting to commit 0003 in my patch set, but just 0001 and
> >> 0002 which just raise an error when multiple rows get updated when
> >> only one row is expected to be updated.
> >
> > So I agree to commit the two, at least in order to prevent doing the
> > wrong thing silently.
> 
> I haven't heard any committer's opinion on this one yet.
> 
> [1] 
> https://www.postgresql.org/message-id/CA+TgmobUHHZiDR=hcu4n30yi9_pe175ittbfk6t8jxzwkra...@mail.gmail.com
> [2] https://www.postgresql.org/message-id/7936.1526590932%40sss.pgh.pa.us

Agreed.  We need some committer comments to proceed.

I have demonstrated and actually shown a problem with the PARAM_EXEC
case.  (It seems a bit silly that I found the problem only after it
became almost workable, though.)  If tuples were not copied we would be
able to use their addresses to identify them, but they actually are
copied.  (Anyway, we shouldn't do that.)

A. Just detecting and reporting/erroring the problematic case.

B. Giving to Sort-like nodes an ability to convert PARAMS into
   junk columns.

C. Adding a space for 64bit tuple identifier in a tuple header.

D. Somehow inhibiting tuple-storing nodes like Sort in between. (This
  would likely break something that currently works.)


B seems to have the potential to fix this, but I don't have a concrete
design for it yet.  With C, I see 2 bits of room in infomask2, and we
could use one of the bits to indicate that the tuple carries an extra
64-bit tuple identifier.  It could be propagated to the desired place,
but I'm not sure it is in an acceptable shape.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Needless additional partition check in INSERT?

2018-06-11 Thread David Rowley
On 12 June 2018 at 09:13, Alvaro Herrera  wrote:
> Hearing no complaints I pushed it with the proposed shape.

Thanks for working on it and pushing.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



RE: [bug fix] ECPG: freeing memory for pgtypes crashes on Windows

2018-06-11 Thread Tsunakawa, Takayuki
> From: Thomas Munro [mailto:thomas.mu...@enterprisedb.com]
> I would like to get this committed soon, but I'm not sure about backpatching
> -- see below.  Here's a new version with a couple of minor changes:

Thank you for taking care of this patch.


> 1.  Small changes to the documentation.

I agree with Tom on this.


> 2.  I see that you added #include <pgtypes.h> to pgtypes_numeric.h and
> pgtypes_interval.h.  They have functions returning char *.  I
> experimented with removing those and including <pgtypes.h> directly in
> the .pgc test files, but then I saw why you did that: it changes all the
> line numbers in the expected output files making the patch much larger.
> No strong objection there.  But I figured we should at least be consistent,
> so I added #include <pgtypes.h> to pgtypes_timestamp.h and pgtypes_date.h
> (they also declare functions that return new strings).

The reason I added pgtypes.h only in pgtypes_numeric.h and pgtypes_interval.h 
is that pgtypes_date.h includes pgtypes_timestamp.h, and 
pgtypes_timestamp.h in turn includes pgtypes_interval.h.  So additional 
inclusion of pgtypes.h was not necessary.  But I'm OK with your patch for 
consistency.



> 3.  It seemed unnecessary to declare the new function in extern.h
> *and* in pgtypes.h.  I added #include "pgtypes.h" to common.c instead, and
> a comment to introduce the section of that file that defines functions from
> pgtypes.h.

Agreed, thanks.



> 4.  I found a place in src/interfaces/ecpg/test/sql/sqlda.pgc where you
> missed a free() call.
> 
> Are these changes OK?
> 
> Why is it OK that we do "free(outp_sqlda)" having got that pointer from
> a statement like "exec sql fetch 1 from mycur1 into descriptor outp_sqlda"?
> Isn't that memory allocated by libecpg.dll?

My colleague is now preparing a patch for that, which adds a function 
ECPGFreeSQLDA() in libecpg.so.  That thread is here:

https://www.postgresql.org/message-id/25C1C6B2E7BE044889E4FE8643A58BA963A42097@G01JPEXMBKW03




> One question I have is whether it is against project policy to backport
> a new file and a new user-facing function.  It doesn't seem like a great
> idea, because it means that client code would need to depend on a specific
> patch release.  Even if we found an existing header to declare this function
> in, you'd still need to do conditional magic before using it.  So... how
> inconvenient do you think it would be if we did this for 11+ only?  Does
> anyone see a better way to do an API evolution here?  It's not beautiful
> but I suppose one work-around that end-user applications could use while
> they are stuck on older releases might be something like this, in their
> own tree, conditional on major version:
> 
> #define PGTYPESchar_free(x) PGTYPESdate_free((date *)(x))

I want some remedy for older releases.  Our customer worked around this problem 
by getting a libpq connection in their ECPG application and calling 
PQfreemem().  That's an ugly kludge, and I don't want other users to follow it.

I don't see a problem with back-patching as-is, because existing users who just 
call free() or don't call free() won't be affected.  I think that most serious 
applications somehow state their supported minor releases like "this 
application supports (or is certified against) PostgreSQL 10.5 or later", just 
like other applications support "RHEL 6.2 or later" or "Windows XP SP2 or 
later."
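
For reference, assuming the new function keeps the PGTYPESchar_free()
name used in the patch, application code would end up looking roughly
like this (sketch):

    numeric *num = PGTYPESnumeric_new();
    char    *str;

    /* ... fill num, then convert it to a string ... */
    str = PGTYPESnumeric_to_asc(num, 2);

    /*
     * Free with the library's own function rather than plain free(),
     * which can crash on Windows when libecpg and the application use
     * different C runtime heaps.
     */
    PGTYPESchar_free(str);
    PGTYPESnumeric_free(num);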


Regards
Takayuki Tsunakawa




Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

2018-06-11 Thread Andres Freund
Hi,

On 2018-05-28 12:52:06 -0700, Andres Freund wrote:
> Hi,
> 
> On 2018-05-27 13:00:06 -0700, Andres Freund wrote:
> > I've a patch that seems to work, that mostly needs some comment
> > polishing.
> 
> Attached is what I currently have. Still needs some more work, but I
> think it's more than good enough to review the approach.  Basically the
> approach consists of two changes:
> 
> 1) Send init file removals for shared nailed relations as well.
> 
>    This fixes the problem that the shared init file contains arbitrarily
>    outdated information for relfrozenxid etc., leading to e.g. the
>    pg_authid errors we've seen in some recent threads.  Only applies to
>    new connections.
> 
> 2) Reread RelationData->rd_rel for nailed relations when invalidated.
> 
>    This ensures that already-built relcache entries for nailed relations
>    are updated.  Currently they never are.  This doesn't cause an issue
>    *that* frequently for !shared entries, because for those the init file
>    gets zapped regularly, and autovacuum workers usually don't live that
>    long.  But it's still a significant correctness issue for both shared
>    and non-shared relations.

Here's a more polished v2 version of the patch.  I have, locally, done
the work to backpatch the change.  Besides trivialities there are two
nontrivial changes:

- In earlier versions there are no global invalidations. I've inquired and
  complained about what exactly they're for in
  
http://archives.postgresql.org/message-id/20180611231634.w2rgtlzxaw4loefk%40alap3.anarazel.de

  In earlier branches we just don't do the global thing. That seems
  unproblematic to me.

- The bigger issue is that in 9.3, and just there
  
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=8de3e410faa06ab20ec1aa6d0abb0a2c040261ba
  does not yet exist.

  That means back then we performed reloads even outside a
  transaction. I don't feel confident about invoking additional catalog
  reloads in the new situations, so I kept the IsTransactionState()
  checks in RelationReloadNailed().  That seems less risky, but possibly
  somebody wants to argue the other way round?

There's some minor other conflicts, but they're all pretty obvious.

I plan to go over the change again tomorrow, and then push. Unless
somebody has comments before then, obviously.


> FWIW, I wonder if this isn't critical enough to make us consider having
> a point release earlier..

Still think this is something we should seriously consider.

- Andres
>From d4fc41f841951e74e298811ad5ad9b872d14d607 Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Mon, 28 May 2018 12:38:22 -0700
Subject: [PATCH v2] Fix bugs in vacuum of shared rels, by keeping their
 relcache entries current.

When vacuum processes a relation it uses the corresponding relcache
entry's relfrozenxid / relminmxid as a cutoff for when to remove
tuples etc. Unfortunately for nailed relations (i.e. critical system
catalogs) bugs could frequently lead to the corresponding relcache
entry being stale.

This set of bugs could cause actual data corruption as vacuum would
potentially not remove the correct row versions, potentially reviving
them at a later point.  After 699bf7d05c some corruptions in this vein
were prevented, but the additional error checks could also trigger
spuriously. Examples of such errors are:
  ERROR: found xmin ... from before relfrozenxid ...
and
  ERROR: found multixact ... from before relminmxid ...
For these errors to be caused by this bug, they have to occur on system
catalog tables.

The two bugs are:

1) Invalidations for nailed relations were ignored, based on the
   theory that the relcache entry for such tables doesn't
   change. Which is largely true, except for fields like relfrozenxid
   etc.  This means that changes to relations vacuumed in other
   sessions weren't picked up by already existing sessions.  Luckily
   autovacuum doesn't have particularly long-running sessions.

2) For shared *and* nailed relations, the shared relcache init file
   was never invalidated while running.  That means that for such
   tables (e.g. pg_authid, pg_database) it's not just already existing
   sessions that are affected, but even new connections are as well.
   That explains why the reports usually were about pg_authid et. al.

To fix 1), revalidate the rd_rel portion of a relcache entry when
invalid. This implies a bit of extra complexity to deal with
bootstrapping, but it's not too bad.  The fix for 2) is simpler,
simply always remove both the shared and local init files.

Author: Andres Freund
Reviewed-By: Alvaro Herrera
Discussion:
https://postgr.es/m/20180525203736.crkbg36muzxrj...@alap3.anarazel.de
https://postgr.es/m/CAMa1XUhKSJd98JW4o9StWPrfS=11bpgg+_gdmxe25tvuy4s...@mail.gmail.com
https://postgr.es/m/cakmfjucqbuodrfxpdx39wha3vjyxwerg_zdvxzncr6+5wog...@mail.gmail.com
https://postgr.es/m/cagewt-ujgpmlq09gxcufmzazsgjc98vxhefbf-tppb0fb13...@mail.gmail.com
Backpatch: 9.3-
---
 src/backend/utils/cache/inval.c

Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

2018-06-11 Thread Andres Freund
Hi,

On 2018-05-29 19:14:51 -0400, Alvaro Herrera wrote:
> I added an Assert(DatabasePath != NULL) to
> RelationCacheInitFilePreInvalidate() and didn't see a single crash when
> running the tests.  I thought that adding a "VACUUM FREEZE pg_class"
> in the recovery tests (where there is a standby) ought to do it, but it
> doesn't.  What's the deal there?
> 
> Here are some proposed changes.  Some of these comment edits are WIP :-)

I'm a bit confused by these changes - there seem to be some that look
like a borked diff?  And a number of others look pretty unrelated?

> -- 
> Álvaro Herrera        https://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

> diff --git a/src/backend/utils/cache/inval.c b/src/backend/utils/cache/inval.c
> index 0d6100fb08..8a8620943f 100644
> --- a/src/backend/utils/cache/inval.c
> +++ b/src/backend/utils/cache/inval.c
> @@ -872,6 +872,8 @@ 
> ProcessCommittedInvalidationMessages(SharedInvalidationMessage *msgs,
>int 
> nmsgs, bool RelcacheInitFileInval,
>Oid 
> dbid, Oid tsid)
>  {
> + Assert(InRecovery);
> +

Idk, seems unrelated.

>   if (nmsgs <= 0)
>   return;
>  
> @@ -884,12 +886,13 @@ 
> ProcessCommittedInvalidationMessages(SharedInvalidationMessage *msgs,
>dbid);
>  
>   /*
> -  * RelationCacheInitFilePreInvalidate, when the invalidation 
> message
> -  * is for a specific database, requires DatabasePath to be set, 
> but we
> -  * should not use SetDatabasePath during recovery, since it is
> -  * intended to be used only once by normal backends.  Hence, a 
> quick
> -  * hack: set DatabasePath directly then unset after use.
> +  * When the invalidation message is for a specific database,
> +  * RelationCacheInitFilePreInvalidate requires DatabasePath to 
> be set,
> +  * but we're not allowed to use SetDatabasePath during recovery 
> (since
> +  * it is intended to be used by normal backends).  Hence, a 
> quick hack:
> +  * set DatabasePath directly then unset after use.
>*/
> + Assert(!DatabasePath);  /* don't clobber an existing value */
>   if (OidIsValid(dbid))
>   DatabasePath = GetDatabasePath(dbid, tsid);

Same.


> diff --git a/src/backend/utils/cache/relcache.c 
> b/src/backend/utils/cache/relcache.c
> index aa4427724d..71b2212cbd 100644
> --- a/src/backend/utils/cache/relcache.c
> +++ b/src/backend/utils/cache/relcache.c
> @@ -1934,6 +1934,11 @@ RelationIdGetRelation(Oid relationId)
>   RelationClearRelation(rd, true);
>  
>   /*
> +  * Normal entries are valid by now -- except nailed 
> ones when
> +  * loaded before relcache initialization.  There isn't 
> enough
> +  * infrastructure yet to do pg_class lookups, but we 
> need their
> +  * rd_rel entries to be updated, so we let these 
> through.
> +  */
>* Normally entries need to be valid here, but before 
> the relcache
>* has been initialized, not enough infrastructure 
> exists to
>* perform pg_class lookups. The structure of such 
> entries doesn't
> @@ -2346,8 +2351,7 @@ RelationClearRelation(Relation relation, bool rebuild)
>   RelationCloseSmgr(relation);

This sure looks like it's a syntax error? So I guess you might not have
staged the removal parts of the diff?


>   /*
> -  * Treat nailed-in system relations separately, they always need to be
> -  * accessible, so we can't blow them away.
> +  * We cannot blow away nailed-in relations, so treat them especially.
>*/

The former seems just as accurate, and is basically just the already
existing comment?


>   if (relation->rd_isnailed)
>   {
> @@ -5942,7 +5946,8 @@ write_relcache_init_file(bool shared)
>* wrote out was up-to-date.)
>*
>* This mustn't run concurrently with the code that unlinks an init file
> -  * and sends SI messages, so grab a serialization lock for the duration.
> +  * and sends SI messages (see RelationCacheInitFilePreInvalidate), so 
> grab
> +  * a serialization lock for the duration.
>*/
>   LWLockAcquire(RelCacheInitLock, LW_EXCLUSIVE);

Unrelated?


> @@ -6061,6 +6066,10 @@ RelationHasUnloggedIndex(Relation rel)
>   * changed one or more of the relation cache entries that are kept in the
>   * local init file.
>   *
> + * When DatabasePath is set, both the init file for that database and the
> + * shared (global) init files are to be removed; otherwise only the latter 
> is.
> + * This is useful 

Re: [COMMITTERS] pgsql: Logical replication

2018-06-11 Thread Andres Freund
Hi,

On 2017-01-20 14:06:18 +, Peter Eisentraut wrote:
> Logical replication
> 
> - Add PUBLICATION catalogs and DDL
> - Add SUBSCRIPTION catalog and DDL
> - Define logical replication protocol and output plugin
> - Add logical replication workers

I know I mentioned this before, but I just hit it again. Please split
up your commits in more logical chunks. This has a *LOT* of largely
unrelated pieces. Lots of them without so much as a comment about why
they have been changed.  The above certainly doesn't explain why the
*generic* cache invalidation code was affected:

@@ -509,8 +514,10 @@ RegisterRelcacheInvalidation(Oid dbId, Oid relId)
/*
 * If the relation being invalidated is one of those cached in the local
 * relcache init file, mark that we need to zap that file at commit.
+* Same is true when we are invalidating whole relcache.
 */
-   if (OidIsValid(dbId) && RelationIdIsInInitFile(relId))
+   if (OidIsValid(dbId) &&
+   (RelationIdIsInInitFile(relId) || relId == InvalidOid))
transInvalInfo->RelcacheInitFileInval = true;
 }

I mean, like, seriously?  What is this stuff about?

I kinda guess the motivating factor is:

/*
 * CacheInvalidateRelcacheAll
 *  Register invalidation of the whole relcache at the end of 
command.
 *
 * This is used by alter publication as changes in publications may affect
 * large number of tables.
 */
void
CacheInvalidateRelcacheAll(void)
{
PrepareInvalidationState();

RegisterRelcacheInvalidation(InvalidOid, InvalidOid);
}

but that doesn't explain why it's not tied to a database?  Nor why the
relcache init file is now invalidated when there's a publication change?

Greetings,

Andres Freund



Re: why partition pruning doesn't work?

2018-06-11 Thread Tom Lane
Andrew Dunstan  writes:
> On 06/11/2018 06:24 PM, Tom Lane wrote:
>> If we had any buildfarm critters running valgrind on
>> RELCACHE_FORCE_RELEASE or CLOBBER_CACHE_ALWAYS builds, they'd have
>> detected use of uninitialized memory here ... but I don't think we have
>> any.  (The combination of valgrind and CCA would probably be too slow to
>> be practical :-(, though maybe somebody with a fast machine could do
>> the other thing.)

> I don't have a fast machine, but I do have a slow machine already 
> running valgrind and not doing much else :-) Let's see how lousyjack 
> does with -DRELCACHE_FORCE_RELEASE

I just tried the case here, and it doesn't even get as far as any
of the partitioning tests: it bombs out in inherit.sql :-(

==00:00:06:55.816 26107== Invalid read of size 4
==00:00:06:55.816 26107==at 0x5F3978: ATExecDropInherit (tablecmds.c:11928)
==00:00:06:55.816 26107==by 0x60212A: ATExecCmd (tablecmds.c:4241)
==00:00:06:55.816 26107==by 0x602CC4: ATController (tablecmds.c:3976)
==00:00:06:55.816 26107==by 0x7910EA: ProcessUtilitySlow (utility.c:1117)
==00:00:06:55.816 26107==by 0x79180F: standard_ProcessUtility 
(utility.c:920)
==00:00:06:55.816 26107==by 0x78D748: PortalRunUtility (pquery.c:1178)
==00:00:06:55.816 26107==by 0x78E6D0: PortalRunMulti (pquery.c:1331)
==00:00:06:55.816 26107==by 0x78EF8F: PortalRun (pquery.c:799)
==00:00:06:55.816 26107==by 0x78B35C: exec_simple_query (postgres.c:1122)
==00:00:06:55.816 26107==by 0x78C8B3: PostgresMain (postgres.c:4153)
==00:00:06:55.816 26107==by 0x70FBD5: PostmasterMain (postmaster.c:4361)
==00:00:06:55.816 26107==by 0x67AE4F: main (main.c:228)
==00:00:06:55.816 26107==  Address 0xe823e90 is 179,504 bytes inside a recently 
re-allocated block of size 524,288 alloc'd
==00:00:06:55.816 26107==at 0x4A06A2E: malloc (vg_replace_malloc.c:270)
==00:00:06:55.816 26107==by 0x8BBB35: AllocSetAlloc (aset.c:923)
==00:00:06:55.816 26107==by 0x8C4473: palloc (mcxt.c:938)
==00:00:06:55.816 26107==by 0x488DEF: CreateTemplateTupleDesc (tupdesc.c:66)
==00:00:06:55.816 26107==by 0x88D2C0: RelationBuildDesc (relcache.c:416)
==00:00:06:55.816 26107==by 0x8904B5: RelationIdGetRelation 
(relcache.c:1943)
==00:00:06:55.816 26107==by 0x4C93BF: relation_open (heapam.c:1135)
==00:00:06:55.816 26107==by 0x4D8305: index_open (indexam.c:154)
==00:00:06:55.816 26107==by 0x62D6EB: ExecOpenIndices (execIndexing.c:197)
==00:00:06:55.816 26107==by 0x53B607: CatalogOpenIndexes (indexing.c:49)
==00:00:06:55.816 26107==by 0x556467: recordMultipleDependencies 
(pg_depend.c:112)
==00:00:06:55.816 26107==by 0x560D44: create_toast_table (toasting.c:385)


That one's pretty obvious when you look at the code:

/* keep our lock on the parent relation until commit */
heap_close(parent_rel, NoLock);

ObjectAddressSet(address, RelationRelationId,
 RelationGetRelid(parent_rel));

It looks like this might be a fruitful source of creepie-crawlies.
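
Presumably the fix is just to compute the object address before letting
go of the relation, along these lines (sketch):

    ObjectAddressSet(address, RelationRelationId,
                     RelationGetRelid(parent_rel));

    /* keep our lock on the parent relation until commit */
    heap_close(parent_rel, NoLock);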

regards, tom lane



Re: [PATCH v18] GSSAPI encryption support

2018-06-11 Thread Nico Williams
On Mon, Jun 11, 2018 at 04:11:10PM -0400, Robbie Harwood wrote:
> Nico was kind enough to provide me with some code review.  This should
> address those concerns (clarifying short-read behavior and fixing error
> checking on GSS functions).

Besides the bug you fixed and which I told you about off-list (on IRC,
specifically), I only have some commentary that does not need any
action:

 - support for non-Kerberos/default GSS mechanisms

   This might require new values for gssmode: prefer-
   and require-.  One could always use SPNEGO if there
   are multiple mechanisms to choose from.  And indeed, you could just
   use SPNEGO if the user has credentials for multiple mechanism.

   (Because GSS has no standard mechanism _names_, this means making
   some up.  This is one obnoxious shortcoming of the GSS-API...)


 - when the SCRAM channel binding work is done, it might be good to add
   an option for TLS + GSS w/ channel binding to TLS and no gss wrap
   tokens


Nico
-- 



Re: why partition pruning doesn't work?

2018-06-11 Thread Andrew Dunstan




On 06/11/2018 06:24 PM, Tom Lane wrote:


If we had any buildfarm critters running valgrind on
RELCACHE_FORCE_RELEASE or CLOBBER_CACHE_ALWAYS builds, they'd have
detected use of uninitialized memory here ... but I don't think we have
any.  (The combination of valgrind and CCA would probably be too slow to
be practical :-(, though maybe somebody with a fast machine could do
the other thing.)





I don't have a fast machine, but I do have a slow machine already 
running valgrind and not doing much else :-) Let's see how lousyjack 
does with -DRELCACHE_FORCE_RELEASE


cheers

andrew

--
Andrew Dunstan        https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: automating perl compile time checking

2018-06-11 Thread Andrew Dunstan




On 06/11/2018 03:38 PM, Andrew Dunstan wrote:



On 06/11/2018 02:33 PM, Alvaro Herrera wrote:

On 2018-Jun-05, Daniel Gustafsson wrote:

On 5 Jun 2018, at 16:31, Andrew Dunstan 
 wrote:
The patch contains a simple script to run the checks. The code that 
finds perl files is put in a function in a single file that is 
sourced by the three locations that need it.

+1 on centralizing the find-files function.

+1 on that.  Why do we need to make the new find_perl_files file
executable, given it's always sourced?  (I would have given a .sh
extension because it's a lib not an executable, but I suppose that's
just matter of taste; we certainly don't have a policy about it).

Looks fine to me either way.




I've committed this, but I'm fine if people want to tweak the names. 
It probably doesn't need to be executable.






People might be interested to see the perlcritic and "perl -cw" checks 
in operation:


https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=crake=2018-06-11%2021%3A47%3A18=perl-check

The module isn't actually using the scripts in src/tools/perlcheck, 
because they are designed to be quiet and the module is designed to be 
more verbose, but apart from that it's doing exactly the same thing.


cheers

andrew

--
Andrew Dunstan        https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: why partition pruning doesn't work?

2018-06-11 Thread Tom Lane
I wrote:
> * I'm fairly suspicious of the fmgr_info and fmgr_info_copy calls in
> perform_pruning_base_step.  Those seem likely to leak memory, and
> for sure they destroy any opportunity for the called comparison
> function to cache info in fn_extra --- something that's critical
> for performance for some comparison functions such as array_eq.
> Why is it necessary to suppose that those functions could change
> from one execution to the next?

After looking closer, that code isn't just inefficient, it's flat
out broken.  The reason is that ExecSetupPartitionPruneState thinks
it can store some pointers into the target relation's relcache entry
in the PartitionPruneContext, and then continue to use those pointers
after closing the relcache entry.  Nope, you can't do that.

The only reason this code has appeared to pass buildfarm testing is
that when we do

    if (cmpfn != context->partsupfunc[keyno].fn_oid)
        fmgr_info(cmpfn, &partsupfunc[keyno]);
    else ...

if the relcache entry that context->partsupfunc is pointing into
has been freed (and overwritten thanks to CLOBBER_FREED_MEMORY), then the
function OID comparison generally fails so that we do a fresh fmgr_info
call.  In the field it's quite likely that we'd accept and attempt to
use a partially-clobbered FmgrInfo; but it would only happen if a relcache
flush had caused the data to get released, so it could be awhile before
anybody saw it happen, let alone reproduced it enough to figure it out.

It's easy to demonstrate that there's a problem if you instrument this
code to log when the OID comparison fails, and then run the regression
tests with -DRELCACHE_FORCE_RELEASE: you get lots of reports like

2018-06-11 18:01:28.686 EDT [16734] LOG:  replace partsupfunc 2139062143 with 351
2018-06-11 18:01:28.686 EDT [16734] LOG:  replace partsupfunc 2139062143 with 351
2018-06-11 18:01:28.686 EDT [16734] LOG:  replace partsupfunc 2139062143 with 351
2018-06-11 18:01:28.686 EDT [16734] LOG:  replace partsupfunc 2139062143 with 351
2018-06-11 18:01:28.686 EDT [16734] LOG:  replace partsupfunc 2139062143 with 351
2018-06-11 18:01:28.707 EDT [16734] LOG:  replace partsupfunc 2139062143 with 351
2018-06-11 18:01:28.707 EDT [16734] LOG:  replace partsupfunc 2139062143 with 351

showing that context->partsupfunc has been overwritten by
CLOBBER_FREED_MEMORY.

If we had any buildfarm critters running valgrind on
RELCACHE_FORCE_RELEASE or CLOBBER_CACHE_ALWAYS builds, they'd have
detected use of uninitialized memory here ... but I don't think we have
any.  (The combination of valgrind and CCA would probably be too slow to
be practical :-(, though maybe somebody with a fast machine could do
the other thing.)

Not sure about a good fix for this.  It seems annoying to copy the
rel's whole partkey data structure into query-local storage, but
I'm not sure we have any choice.  On the bright side, there might
be an opportunity to get rid of repeated runtime fmgr_info lookups
in cross-type comparison situations.
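
One plausible shape for that (a sketch only, reusing the existing field
names; whether this is the right memory context needs more thought)
would be to copy the support-function data out of the relcache when the
pruning state is set up, so the context owns its own FmgrInfos for the
life of the query:

    int     i;

    /* instead of pointing context->partsupfunc into the relcache entry */
    context->partsupfunc = (FmgrInfo *)
        palloc(sizeof(FmgrInfo) * partnatts);
    for (i = 0; i < partnatts; i++)
        fmgr_info_copy(&context->partsupfunc[i],
                       &partkey->partsupfunc[i],
                       CurrentMemoryContext);

That would also let perform_pruning_base_step keep reusing the same
FmgrInfos across executions instead of redoing fmgr_info lookups for
cross-type comparisons.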

regards, tom lane



Re: Fix some error handling for read() and errno

2018-06-11 Thread Andres Freund
On 2018-06-11 18:11:05 -0400, Alvaro Herrera wrote:
> As for the messages, I propose to make them more regular, i.e. always
> use the wording "could not read from file "%s": read %d, expected %zu", 
> avoiding variations such as
>   not enough data in file \"%s\": %d read, %d expected"
>   could not read compressed file \"%s\": read %d out of %zu
>   could not read any data from log segment %s, offset %u, length %lu
> and others that appear in other places.  (In the last case, I even go as
> far as proposing "read %d, expected %zu" where the first %d is
> constant zero.)  Extra points if the sprintf format specifiers are always
> the same (say %zu) instead of, say, %d in odd places.

> I would go as far as suggesting to remove qualifiers that indicate what
> the file is for (such as "relation mapping file"); relying on the path
> as indicator of what's going wrong should be sufficient, since it's an
> error that affects internals anyway, not anything that users can do much
> about.  Keep variations to a minimum, to ease translator's work;
> sometimes it's hard enough to come up with good translations for things
> like "relation mapping file" in the first place, and they don't help the
> end user.

If we go there, why not wrap the read/write/etc calls into a wrapper,
including the error handling?

I'm not quite convinced that "relation mapping file" doesn't provide any
information. It's likely to be easier to search for than a specific
filename, particularly if there's oids or such in the name...

Greetings,

Andres Freund



Re: Fix some error handling for read() and errno

2018-06-11 Thread Alvaro Herrera
On 2018-Jun-11, Robbie Harwood wrote:

> It might be less confusing to just set errno if it's not set already
> (e.g., to EIO, or something).  Up to you though - this is a bit of a
> niche case.

I think that would be very confusing, if we receive a report that really
is a short read but looks like EIO.  I'm for keeping the code as
proposed.

As for the messages, I propose to make them more regular, i.e. always
use the wording "could not read from file "%s": read %d, expected %zu", 
avoiding variations such as
not enough data in file \"%s\": %d read, %d expected"
could not read compressed file \"%s\": read %d out of %zu
could not read any data from log segment %s, offset %u, length %lu
and others that appear in other places.  (In the last case, I even go as
far as proposing "read %d, expected %zu" where the first %d is
constant zero.)  Extra points if the sprintf format specifiers are always
the same (say %zu) instead of, say, %d in odd places.

I would go as far as suggesting to remove qualifiers that indicate what
the file is for (such as "relation mapping file"); relying on the path
as indicator of what's going wrong should be sufficient, since it's an
error that affects internals anyway, not anything that users can do much
about.  Keep variations to a minimum, to ease translator's work;
sometimes it's hard enough to come up with good translations for things
like "relation mapping file" in the first place, and they don't help the
end user.

-- 
Álvaro Herrera        https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: pgsql: Fix and document lock handling for in-memory replication slot da

2018-06-11 Thread Michael Paquier
On Mon, Jun 11, 2018 at 09:49:52AM -0700, Andres Freund wrote:
> Same is true for the codepaths calling GetRedoRecPtr().

You are right.  I'll fix that in a minute.  The stress of a first commit
makes things really hard to get right...

> I don't object to the general idea of adding locking - although the
> benefits are nearly guaranteed to be cosmetic - but this has the
> potential to make things worse.

Thanks.
--
Michael




Re: Fix some error handling for read() and errno

2018-06-11 Thread Robbie Harwood
Michael Paquier  writes:

> diff --git a/src/backend/access/transam/slru.c 
> b/src/backend/access/transam/slru.c
> index 87942b4cca..d487347cc6 100644
> --- a/src/backend/access/transam/slru.c
> +++ b/src/backend/access/transam/slru.c
> @@ -683,6 +683,11 @@ SlruPhysicalReadPage(SlruCtl ctl, int pageno, int slotno)
>   pgstat_report_wait_start(WAIT_EVENT_SLRU_READ);
>   if (read(fd, shared->page_buffer[slotno], BLCKSZ) != BLCKSZ)
>   {
> + /*
> +  * XXX: Note that this may actually report success if the number
> +  * of bytes read is positive but lacking data, in which case errno
> +  * is not set.
> +  */
>   pgstat_report_wait_end();
>   slru_errcause = SLRU_READ_FAILED;
>   slru_errno = errno;

It might be less confusing to just set errno if it's not set already
(e.g., to EIO, or something).  Up to you though - this is a bit of a
niche case.

The rest of the patch looks good to me.

Thanks,
--Robbie





Re: Needless additional partition check in INSERT?

2018-06-11 Thread Alvaro Herrera
Hearing no complaints I pushed it with the proposed shape.

Thanks everyone,

-- 
Álvaro Herrera        https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [PATCH v16] GSSAPI encryption support

2018-06-11 Thread Nico Williams
On Mon, Jun 11, 2018 at 01:31:12PM -0400, Andrew Dunstan wrote:
> On 06/11/2018 01:13 PM, Nico Williams wrote:
> >Well, all the free CIs like Travis and Appveyor do it this way.  You
> >don't have to *use* it just because the .yml files are in the source
> >tree.  But you have to have the .yml files in the source tree in order
> >to use these CIs.  It'd be nice to be able to point somewhere else for
> >them, but whatever, that's not something we get much choice in at this
> >time.
> 
> That's not true, at least for Appveyor (can't speak about travis - I have no
> first hand experience). For appveyor, you can supply a custom appveyor.yml
> file, which can be a complete URL. In fact, if you use a plain git source as
> opposed to one of the managed git services it supports, you have to do it
> that way - it ignores an appveyor.yml in your repo. I found this out the
> very hard way over the last few days, and they very kindly don't warn you at
> all about this.

OK, that's.. nice, maybe, I guess, but I'd still want version control
for these yml files -- why not have them in-tree?  I'd rather have them
in-tree unless there's a good reason not to have them there.

In other projects I definitely find it better to have these files
in-tree.

Nico
-- 



Re: adding tab completions

2018-06-11 Thread Arthur Zakirov
On Sat, Jun 09, 2018 at 06:42:12PM -0500, Justin Pryzby wrote:
> > Moreover there is no such completion for example for the command (it shows
> > only first column):
> > 
> > CREATE INDEX ON test (
> 
> Noted (I misunderstood at first: you just mean there's precedent that column
> names aren't completed, except maybe the first).

Yes, exactly. It was about the precedent.

> I can revise the patch to not complete attributes in ANALYZE; but, I think
> that means that this will have to complete to table names:
> 
>   postgres=# ANALYZE tt (a , 
>   alu_enodeb_201601 information_schema.
>   alu_enodeb_201602 jrn_pg_buffercache
>   ...
> 
> .. since, without checking for matching parens, it has no idea whether to
> complete with rels or atts.  WDYT?

IMHO, I'd leave the code as simple as possible. It is up to you of
course. But it is easy to add completion for the first attribute, by
adding this condition (and leaving other attributes without completion):

else if (HeadMatches1("VACUUM") && TailMatches1("("))
COMPLETE_WITH_ATTR(prev2_wd, "");

> > > - "SERVER", "INDEX", "LANGUAGE", "POLICY", "PUBLICATION", 
> > > "RULE",
> > > + "SERVER", "INDEX", "LANGUAGE", "POLICY", "PUBLICATION",
> > 
> > Is this a typo? I think there is still a possibility to comment on rules.
> 
> Not in PG11(b1) (note, that's a custom table)
>   postgres=# COMMENT ON RULE pg_settings_u IS 'asdf';
>   ERROR:  syntax error at or near "IS"
> ...
> Remove deprecated COMMENT ON RULE syntax
> e8d016d81940e75c126aa52971b7903b7301002e

Oh, I understand what it is here. That commit removed the syntax:

COMMENT ON RULE rule_name

But there is still the syntax:

COMMENT ON RULE rule_name ON table_name

I can run the command:

COMMENT ON RULE rtest ON test IS 'rtest';

> > The last point I've noticed, there is no VERBOSE entry after VACUUM FULL
> > ANALYZE command anymore.
> 
> See commit 921059bd6, above (it's not 100% clear to me that's intended to
> reject VACUUM ANALYZE VERBOSE and not just reject VACUUM VERBOSE ANALYZE
> VERBOSE, but I tentatively assume it's intentional).

Right. Understood.

> > I'm not sure how this patch should be commited. Can it be commited
> > outside the commitfest? Otherwise add it to the next commitfest please
> > in order not to forget it.
> 
> I've done https://commitfest.postgresql.org/18/1661/

Thank you!

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company



Re: automating perl compile time checking

2018-06-11 Thread Tom Lane
Andrew Dunstan  writes:
> On 06/11/2018 04:01 PM, Peter Eisentraut wrote:
>> Why is this being committed after feature freeze?

> This affects pretty much nothing. In fact some of the other changes I've 
> recently committed were arguably more dangerous. Do you want me to 
> revert the whole lot?

AFAICS this is adding testing, not "features", so it seems fine from here.

regards, tom lane



Re: [PATCH v18] GSSAPI encryption support

2018-06-11 Thread Robbie Harwood
Hi,

Nico was kind enough to provide me with some code review.  This should
address those concerns (clarifying short-read behavior and fixing error
checking on GSS functions).

Thanks,
--Robbie


>From 20b9f7d5cd9a35e7210ccc309bada789411b6cdb Mon Sep 17 00:00:00 2001
From: Robbie Harwood 
Date: Thu, 10 May 2018 16:12:03 -0400
Subject: [PATCH] libpq GSSAPI encryption support

On both the frontend and backend, prepare for GSSAPI encryption
support by moving common code for error handling into a separate file.
Fix a TODO for handling multiple status messages in the process.
Eliminate the OIDs, which have not been needed for some time.

Add frontend and backend encryption support functions.  Keep the
context initiation for authentication-only separate on both the
frontend and backend in order to avoid concerns about changing the
requested flags to include encryption support.

In postmaster, pull GSSAPI authorization checking into a shared
function.  Also share the initiator name between the encryption and
non-encryption codepaths.

Modify pqsecure_write() to take a non-const pointer.  The pointer will
not be modified, but this change (or a const-discarding cast, or a
malloc()+memcpy()) is necessary for GSSAPI due to const/struct
interactions in C.

For HBA, add "hostgss" and "hostnogss" entries that behave similarly
to their SSL counterparts.  "hostgss" requires either "gss", "trust",
or "reject" for its authentication.

Similarly, add a "gssmode" parameter to libpq.  Supported values are
"disable", "require", and "prefer".  Notably, negotiation will only be
attempted if credentials can be acquired.  Move credential acquisition
into its own function to support this behavior.

Finally, add documentation for everything new, and update existing
documentation on connection security.

Thanks to Michael Paquier for the Windows fixes.
---
 doc/src/sgml/client-auth.sgml   |  75 --
 doc/src/sgml/libpq.sgml |  57 +++-
 doc/src/sgml/runtime.sgml   |  77 +-
 src/backend/libpq/Makefile  |   4 +
 src/backend/libpq/auth.c| 103 +++
 src/backend/libpq/be-gssapi-common.c|  64 +
 src/backend/libpq/be-gssapi-common.h|  26 ++
 src/backend/libpq/be-secure-gssapi.c| 321 ++
 src/backend/libpq/be-secure.c   |  16 ++
 src/backend/libpq/hba.c |  51 +++-
 src/backend/postmaster/pgstat.c |   3 +
 src/backend/postmaster/postmaster.c |  64 -
 src/include/libpq/hba.h |   4 +-
 src/include/libpq/libpq-be.h|  11 +-
 src/include/libpq/libpq.h   |   3 +
 src/include/libpq/pqcomm.h  |   5 +-
 src/include/pgstat.h|   3 +-
 src/interfaces/libpq/Makefile   |   4 +
 src/interfaces/libpq/fe-auth.c  |  90 +--
 src/interfaces/libpq/fe-connect.c   | 232 +++-
 src/interfaces/libpq/fe-gssapi-common.c | 128 +
 src/interfaces/libpq/fe-gssapi-common.h |  23 ++
 src/interfaces/libpq/fe-secure-gssapi.c | 345 
 src/interfaces/libpq/fe-secure.c|  16 +-
 src/interfaces/libpq/libpq-fe.h |   3 +-
 src/interfaces/libpq/libpq-int.h|  30 ++-
 src/tools/msvc/Mkvcbuild.pm |  22 +-
 27 files changed, 1578 insertions(+), 202 deletions(-)
 create mode 100644 src/backend/libpq/be-gssapi-common.c
 create mode 100644 src/backend/libpq/be-gssapi-common.h
 create mode 100644 src/backend/libpq/be-secure-gssapi.c
 create mode 100644 src/interfaces/libpq/fe-gssapi-common.c
 create mode 100644 src/interfaces/libpq/fe-gssapi-common.h
 create mode 100644 src/interfaces/libpq/fe-secure-gssapi.c

diff --git a/doc/src/sgml/client-auth.sgml b/doc/src/sgml/client-auth.sgml
index 656d5f9417..38cf32e3be 100644
--- a/doc/src/sgml/client-auth.sgml
+++ b/doc/src/sgml/client-auth.sgml
@@ -108,6 +108,8 @@ hostnossl  database  user
 host   database  user  IP-address  IP-mask  auth-method  auth-options
 hostssldatabase  user  IP-address  IP-mask  auth-method  auth-options
 hostnossl  database  user  IP-address  IP-mask  auth-method  auth-options
+hostgssdatabase  user  IP-address  IP-mask  auth-method  auth-options
+hostnogss  database  user  IP-address  IP-mask  auth-method  auth-options
 
The meaning of the fields is as follows:
 
@@ -128,9 +130,10 @@ hostnossl  database  user
  
   
This record matches connection attempts made using TCP/IP.
-   host records match either
+   host records match
SSL or non-SSL connection
-   attempts.
+   attempts as well as GSSAPI or
+   non-GSSAPI connection attempts.
   
  
   
@@ -176,6 +179,42 @@ hostnossl  database  user
  
 
 
+
+ hostgss
+ 
+  
+   This record matches connection attempts made using TCP/IP,
+   but only when the connection is made with GSSAPI
+   encryption.
+  
+
+  
+   

Re: automating perl compile time checking

2018-06-11 Thread Andrew Dunstan




On 06/11/2018 04:01 PM, Peter Eisentraut wrote:

On 6/11/18 15:38, Andrew Dunstan wrote:

On 5 Jun 2018, at 16:31, Andrew Dunstan  wrote:
The patch contains a simple script to run the checks. The code that finds perl 
files is put in a function in a single file that is sourced by the three 
locations that need it.

+1 on centralizing the find-files function.

+1 on that.  Why do we need to make the new find_perl_files file
executable, given it's always sourced?  (I would have given a .sh
extension because it's a lib not an executable, but I suppose that's
just matter of taste; we certainly don't have a policy about it).

Looks fine to me either way.



I've committed this, but I'm fine if people want to tweak the names. It
probably doesn't need to be executable.

Why is this being committed after feature freeze?




This affects pretty much nothing. In fact some of the other changes I've 
recently committed were arguably more dangerous. Do you want me to 
revert the whole lot?


cheers

andrew

--
Andrew Dunstan        https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: automating perl compile time checking

2018-06-11 Thread Peter Eisentraut
On 6/11/18 15:38, Andrew Dunstan wrote:
 On 5 Jun 2018, at 16:31, Andrew Dunstan  
 wrote:
 The patch contains a simple script to run the checks. The code that finds 
 perl files is put in a function in a single file that is sourced by the 
 three locations that need it.
>>> +1 on centralizing the find-files function.
>> +1 on that.  Why do we need to make the new find_perl_files file
>> executable, given it's always sourced?  (I would have given a .sh
>> extension because it's a lib not an executable, but I suppose that's
>> just a matter of taste; we certainly don't have a policy about it)).
>>
>> Looks fine to me either way.
>>
> 
> 
> I've committed this, but I'm fine if people want to tweak the names. It 
> probably doesn't need to be executable.

Why is this being committed after feature freeze?

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: automating perl compile time checking

2018-06-11 Thread Andrew Dunstan




On 06/11/2018 02:33 PM, Alvaro Herrera wrote:

On 2018-Jun-05, Daniel Gustafsson wrote:


On 5 Jun 2018, at 16:31, Andrew Dunstan  wrote:
The patch contains a simple script to run the checks. The code that finds perl 
files is put in a function in a single file that is sourced by the three 
locations that need it.

+1 on centralizing the find-files function.

+1 on that.  Why do we need to make the new find_perl_files file
executable, given it's always sourced?  (I would have given a .sh
extension because it's a lib not an executable, but I suppose that's
just a matter of taste; we certainly don't have a policy about it).

Looks fine to me either way.




I've committed this, but I'm fine if people want to tweak the names. It 
probably doesn't need to be executable.


cheers

andrew

--
Andrew Dunstan    https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: PostgreSQL vs SQL Standard

2018-06-11 Thread Tom Lane
Alvaro Herrera  writes:
> On 2018-Jun-10, Tom Lane wrote:
>> (wanders away wondering exactly what parsing technology the SQL committee
>> thinks implementations use...)

> Umm, doesn't this come from our decision to make the AS optional there?

No, it was THEIR decision to make AS optional.  I'd like to think that
we'd never have done anything so stupid if we were designing the syntax.

regards, tom lane



Re: Spilling hashed SetOps and aggregates to disk

2018-06-11 Thread Tomas Vondra
On 06/11/2018 08:13 PM, Jeff Davis wrote:
> On Mon, 2018-06-11 at 19:33 +0200, Tomas Vondra wrote:
>> For example we hit the work_mem limit after processing 10% tuples,
>>  switching to sort would mean spill+sort of 900GB of data. Or we 
>> might say - hmm, we're 10% through, so we expect hitting the limit
>> 10x, so let's spill the hash table and then do sort on that,
>> writing and sorting only 10GB of data. (Or merging it in some
>> hash-based way, per Robert's earlier message.)
> 
> Your example depends on large groups and a high degree of group
> clustering. That's fine, but it's a special case,
> 

True, it's a special case and it won't work for other cases. It was
merely an example for Andres.

OTOH it's not entirely unrealistic, I think. Consider something like

  SELECT
extract(year from ts) AS y,
extract(month from ts) AS m,
extract(day from ts) AS d,
    string_agg(x, ','),
    array_agg(y)
  FROM fact_table
  GROUP BY 1, 2, 3;

which is likely very correlated (assuming new data is appended to the
table), and the string_agg/array_agg are likely to produce fairly large
groups (about proportional to the number of tuples in the group).

Another example might be about HLL aggregate, although in that case the
transition state does not grow, so it may not be that bad (and the
default estimate of 1kB would work pretty nicely). But there certainly
are other aggregates with large transition state, where this might not
be the case, and we currently have no way to communicate that to the
planner - except for setting work_mem much lower :-/

However, I now realize I've ignored the fact that we typically don't
sort the whole table but only a very few columns, so the example was not
entirely fair - we would not sort the whole remaining 900GB but likely
much less.

> and complexity does have a cost, too.

Sure.


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Proposal: Partitioning Advisor for PostgreSQL

2018-06-11 Thread Julien Rouhaud
Hi Ashutosh,

Thanks for answering!  And I'm very sorry for the time I needed to reply

On Wed, May 30, 2018 at 5:44 PM, Ashutosh Bapat
 wrote:
> On Thu, May 24, 2018 at 4:16 PM, Yuzuko Hosoya
>>
>> However, PostgreSQL isn't designed to have hypothetical tables,
>
> I agree. But there are light-weight tables like foreign tables, views
> and partitioned tables themselves. These kinds of tables do not have
> any storage associated with them. We could implement semi-hypothetical
> partitioned table using these three. The reason I say it's
> semi-hypothetical is that we will need to create some real objects, but
> which do not take actual storage. The idea is to create partitioned
> table with foreign table partitions which point to views simulating
> partitions. The steps are
> 1. Create views one per partition which select data from the
> unpartitioned table that would fall in a partition simulated by that
> view. So something like SELECT * FROM unpartitioned_table WHERE
> partition constraint for that partition.
> 2. Create partitioned table
> 3. Create foreign table partitions that point to the views created in
> the first step.
> 4. ANALYZE the foreign tables and the partitioned table
>
> Now if we EXPLAIN the query on unpartitioned table by redirecting it
> to the partitioned table, we would get the EXPLAIN plans as if the
> query is running on the partitioned table. We will need to zero out
> the FDW costs, so that the cost of accessing foreign table comes out
> to be same as accessing a local table. That's mostly setting the right
> FDW GUCs.
>
> Since we are creating and dropping some real objects, may be we want
> to create temporary objects (we don't have support to create temporary
> foreign tables for now, but may be that's desirable feature) or create
> them in a different database to reduce catalog bloat. Similarly we
> won't be able to create indexes on the foreign table, but may be we
> could simulate those using hypothetical indexes feature.
>
> This method doesn't need any core changes which are useful only for
> this extension. Supporting temporary foreign table and declarative
> indexes on foreign tables may be seen as separate features and
> acceptable in the community.

I both like and dislike this idea.  The good thing is that it's way
less hacky than what we did in our prototype, and it's also working
out of the box.  However, the problem I have with this approach is
that the generated plans will be quite different from real
partitioning.  The main features such as partition pruning or
partition-wise join will probably work, but you'll always have a
ForeignScan as the primary path and I think that it'll drastically
limit the planner and the usability.

I'm also not a fan of doing core changes for a single extension
purpose only, but I think that many of the blockers could be solved
with only slight changes in the core code (for instance, don't use a
Relation as a function parameter just to get the underlying
PartitionKey, but directly pass the PartitionKey on top level).  For
the rest, I'm not sure yet of what exactly would need to be changed
(the partitioning code moved quite a lot lately, and it's hard to stay
up to date), and if such changes could also be useful for other
purpose.



Re: automating perl compile time checking

2018-06-11 Thread Alvaro Herrera
On 2018-Jun-05, Daniel Gustafsson wrote:

> > On 5 Jun 2018, at 16:31, Andrew Dunstan  
> > wrote:
> 
> > The patch contains a simple script to run the checks. The code that finds 
> > perl files is put in a function in a single file that is sourced by the 
> > three locations that need it.
> 
> +1 on centralizing the find-files function.

+1 on that.  Why do we need to make the new find_perl_files file
executable, given it's always sourced?  (I would have given a .sh
extension because it's a lib not an executable, but I suppose that's
just a matter of taste; we certainly don't have a policy about it).

Looks fine to me either way.

-- 
Álvaro Herrera    https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Spilling hashed SetOps and aggregates to disk

2018-06-11 Thread Tom Lane
Jeff Davis  writes:
> Also, I am not sure we should really be designing around data types
> where it makes sense to group and then don't supply a btree opclass.
> Seems like they are likely to hit a problem soon anyway.

It's not that unreasonable to have a hash opclass and no btree opclass;
the datatype might not have a natural linear ordering.

But in any case, I think Robert's point was that he'd prefer to avoid
having a poorly-tested special-case code path for that situation, which
seems like a good idea independently of performance considerations.

regards, tom lane



Re: Spilling hashed SetOps and aggregates to disk

2018-06-11 Thread Jeff Davis
On Mon, 2018-06-11 at 11:55 -0400, Robert Haas wrote:
> performance degradation is not necessarily much better than OOM.  I
> suspect that part of the reason why both Andres and David proposed
> algorithms that combined hashing and sorting is out of a desire to
> cater somewhat to both few-tuples-per-group and many-tuples-per-
> group.
> 

I think average group size is the wrong way to look at it, because
averages are only useful for a normal distribution. The two most
interesting cases are:

1. Fairly uniform group size (e.g. normal dist)
2. Skew values, power law distribution, etc., where some groups are
very large and most are tiny by comparison. I am calling the very large
groups "skewed groups".

For #1, hashing and sorting are both reasonable, and it depends on a
lot of factors (number of groups, clustering, available memory).

For #2, hashing is really good and sorting is really bad. That's
because (at present) we sort all of the tuples before doing any
aggregation, so it expends a lot of effort on the skewed groups.
Hashing can keep skewed groups in memory and avoid spilling a large
fraction of the input tuples at all.

If we get the skewed groups into the hash table, and avoid OOM, I think
that is a success. My patch did that, except it didn't account for two
cases:

  (a) clustered input, where skewed groups may not appear until the
hash table is already full; and
  (b) variable-sized transition values that grow with the group size

> One big advantage to just partitioning the input tuples by hash code
> and then proceeding batch by batch is that it works for any aggregate
> that can support hash aggregation in the first place.  You do not

Agreed, but I think we should evaluate Andres's idea of feeding spilled
tuples to a Sort, because the overall complexity might be lower even
accounting for the special cases you're worried about.

Also, I am not sure we should really be designing around data types
where it makes sense to group and then don't supply a btree opclass.
Seems like they are likely to hit a problem soon anyway.

Regards,
Jeff Davis





Re: commitfest 2018-07

2018-06-11 Thread Andres Freund
Hi,

On 2018-06-11 11:50:55 +0900, Michael Paquier wrote:
> On Wed, Jun 06, 2018 at 12:38:56AM +0200, Magnus Hagander wrote:
> > Thus, if we don't want to have to risk doing surgery on the system (or
> > guarantee we won't bounce any patches from the 07 CF, but that seems like
> > the wrong thing to do), we should rename the existing 09 CF to 07, so all
> > patches automatically end up there, and stick to only "bouncing patches in
> > the forward direction".
> 
> So, we have confirmation from the RTM that there will be a 2018-07.  And
> there is visibly consensus that renaming 2018-09 to 2018-07 makes the
most sense.  Any objections to moving forward with this renaming at the
> 5-CF plan for v12?

FWIW, I still think it'd be a better plan to explicitly move things,
rather than just rename.  But consensus isn't unanimity, so ...

Greetings,

Andres Freund



Re: JIT versus standalone-headers checks

2018-06-11 Thread Andres Freund
Hi,

On 2018-06-10 15:59:04 -0400, Tom Lane wrote:
> I find that the JIT stuff has broken cpluspluscheck for me, along
> with a related script that I use to verify that each header builds
> cleanly standalone (ie with no prerequisites except postgres.h).
> There are two problems:
> 
> (1) Doesn't work on a platform without the llvm header files:
> 
> ./src/include/jit/llvmjit.h:18:26: error: llvm-c/Types.h: No such file or 
> directory
> followed by a lot of complaints like
> ./src/include/jit/llvmjit.h:60: error: 'LLVMTypeRef' does not name a type
> 
> It seems like a reasonable fix for that is to wrap the contents of these
> headers in "#ifdef USE_LLVM" ... do you see a reason not to?

> (2) This seems to need re-thought:
> 
> ./src/include/jit/llvmjit.h:15:2: error: #error "llvmjit.h should only be 
> included by code dealing with llvm"
> 
> I don't especially see the value of this #error, especially if we are
> wrapping this whole header in "#ifdef USE_LLVM", and propose to just
> remove it.

The reason 1) and 2) happen in this conjunction is that during
development I repeatedly made the mistake of including llvmjit.h, and
then used the declarations it contains, in files that shouldn't rely on it.
This was to help spot future mistakes of the same kind.

But arguably that kind of has been obsoleted by the more stringent
"plugin" model that's now in place, where llvmjit.h really shouldn't
ever be included outside backend/jit/llvmjit/, at least in core. Out of
core somebody could reasonably try to layer something on top of it to
build something more complicated.

So I guess we could just go for an #ifdef USE_LLVM as you suggest. Let
me think about it for a day, and then I'll make it so.

Greetings,

Andres Freund



Re: Spilling hashed SetOps and aggregates to disk

2018-06-11 Thread Tomas Vondra




On 06/11/2018 07:14 PM, Andres Freund wrote:

Hi,

On 2018-06-11 17:29:52 +0200, Tomas Vondra wrote:

It would be great to get something that performs better than just falling
back to sort (and I was advocating for that), but I'm worried we might be
moving the goalposts way too far.


I'm unclear on why that'd have that bad performance in relevant
cases. You're not going to hit the path unless the number of groups is
pretty large (or work_mem is ridiculously small, in which case we don't
care). With a large number of groups the sorting path isn't particularly
inefficient, because repeatedly storing the input values isn't such a
large fraction in comparison to the number of groups (and their
transition values).  Which scenarios are you concerned about?



Say you have a 1TB table, and keeping the groups in memory would require 
 work_mem=2GB. After hitting the work_mem limit, there may be a pretty 
large amount of tuples you'd have to spill to disk and sort.


For example we hit the work_mem limit after processing 10% tuples, 
switching to sort would mean spill+sort of 900GB of data. Or we might 
say - hmm, we're 10% through, so we expect hitting the limit 10x, so 
let's spill the hash table and then do sort on that, writing and sorting 
only 10GB of data. (Or merging it in some hash-based way, per Robert's 
earlier message.)


I don't quite understand the argument that the number of groups needs to 
be pretty large for us to hit this. So what if the groups are 2x or 10x 
more than work_mem? It can still be cheaper than switching to sort-based 
approach, no?



regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [PATCH v16] GSSAPI encryption support

2018-06-11 Thread Andrew Dunstan




On 06/11/2018 01:13 PM, Nico Williams wrote:


Well, all the free CIs like Travis and Appveyor do it this way.  You
don't have to *use* it just because the .yml files are in the source
tree.  But you have to have the .yml files in the source tree in order
to use these CIs.  It'd be nice to be able to point somewhere else for
them, but whatever, that's not something we get much choice in at this
time.




That's not true, at least for Appveyor (can't speak about travis - I 
have no first hand experience). For appveyor, you can supply a custom 
appveyor.yml file, which can be a complete URL. In fact, if you use a 
plain git source as opposed to one of the managed git services it 
supports, you have to do it that way - it ignores an appveyor.yml in 
your repo. I found this out the very hard way over the last few days, 
and they very kindly don't warn you at all about this.


cheers

andrew

--
Andrew Dunstan    https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Spilling hashed SetOps and aggregates to disk

2018-06-11 Thread Andres Freund
Hi,

On 2018-06-11 17:29:52 +0200, Tomas Vondra wrote:
> It would be great to get something that performs better than just falling
> back to sort (and I was advocating for that), but I'm worried we might be
> moving the goalposts way too far.

I'm unclear on why that'd have that bad performance in relevant
cases. You're not going to hit the path unless the number of groups is
pretty large (or work_mem is ridiculously small, in which case we don't
care). With a large number of groups the sorting path isn't particularly
inefficient, because repeatedly storing the input values isn't such a
large fraction in comparison to the number of groups (and their
transition values).  Which scenarios are you concerned about?

Greetings,

Andres Freund



Re: [PATCH v16] GSSAPI encryption support

2018-06-11 Thread Nico Williams
On Mon, Jun 11, 2018 at 09:27:17AM -0400, Robert Haas wrote:
> On Thu, Jun 7, 2018 at 6:11 PM, Thomas Munro
>  wrote:
> >> Cool!  Is there any reason that your patch for Travis and AppVeyor
> >> integration is not just committed to master?
> >
> > I think that's a good idea and I know that some others are in favour.
> 
> One problem that was discussed at PGCon is that it commits us to one
> particular build configuration i.e. one set of --with-whatever options
> to configure.  It's not bad to provide a reasonable set of defaults,
> but it means that patches which are best tested with some other set of
> values will have to modify the file (I guess?). Patches that need to
> be tested with multiple sets of flags are ... maybe just out of luck?

Hmm, that's not really true.  You can have a build and test matrix with
more than one row in it.

For example, look at: https://travis-ci.org/heimdal/heimdal

You'll see that Heimdal's Travis-CI integration has four builds:

 - Linux w/ GCC
 - Linux w/ Clang
 - OS X w/ Clang
 - Linux code coverage w/ GCC

We could easily add more options for Heimdal, if we felt we needed to
build and test more with Travis-CI.

Appveyor also has matrix support (though I'm not using it in Heimdal's
Appveyor-CI integration).

Now, of course if we had a very large set of configurations to test,
things might get slow, and the CIs might find it abusive.  It would be
best to use a maximal build configuration and go from there.  So, for
example, two configurations, one with and w/o JIT, but with all the
optional libraries (GSS, LDAP, ICU, Perl, Python, ...), and with two
different compilers (GCC and Clang, with Clang only for the JIT), plus
one OS X build (with JIT), and so on:

 - Linux w/ GCC
 - Linux w/ Clang (   JIT)
 - Linux w/ Clang (no JIT)
 - Linux code coverage
 - OS X  w/ Clang (   JIT)

and similarly for Windows on Appveyor.

> I really don't understand the notion of putting the build script
> inside the source tree.  It's all fine if there's One Way To Do It but
> often TMTOWTDI.  If the build configurations are described outside
> the source tree then you can have as many of them as you need.

Well, all the free CIs like Travis and Appveyor do it this way.  You
don't have to *use* it just because the .yml files are in the source
tree.  But you have to have the .yml files in the source tree in order
to use these CIs.  It'd be nice to be able to point somewhere else for
them, but whatever, that's not something we get much choice in at this
time.

The .yml files are unobtrusive anyways, and it's handy to have them
in-tree anyways.  It also makes it easier to do things like:

 - get build passing/failing buttons on wiki / build status pages 

 - make sure that the .yml files stay up to date as the source tree gets
   changed

It also makes it somewhat easier to get hooked on github and such, but
a bit of discipline will make that a non-issue.

Nico
-- 



Re: CF bug fix items

2018-06-11 Thread Andres Freund
On 2018-06-11 13:30:27 +0900, Michael Paquier wrote:
> > "Fix the optimization to skip WAL-logging on table created in same
> > transaction" has been in 10 (!) commitfests. It's seen no substantive action
> > since November. It has a bunch of authors and reviewers listed, Surely
> > somebody can move it forward?
> 
> I think that this is a complicated topic, which results in a rather
> large and invasive patch introducing new logic concepts in order to fix
> a rather narrow use-case.  So I am wondering if it is really something
> we ought to fix here..

I think we absolutely definitely need to fix it, or remove
wal_level=minimal. It's a failure to provide the fundamental guarantees
a database should provide. It seems not unreasonable to commit something
to v11 and then backpatch a bit later, to manage risk, however.

Greetings,

Andres Freund



Re: pgsql: Fix and document lock handling for in-memory replication slot da

2018-06-11 Thread Andres Freund
Hi,

On 2018-06-10 10:45:04 +, Michael Paquier wrote:
> Fix and document lock handling for in-memory replication slot data
>
> While debugging issues on HEAD for the new slot forwarding feature of
> Postgres 11, some monitoring of the code surrounding in-memory slot data
> has proved that the lock handling may cause inconsistent data to be read
> by read-only callers of slot functions, particularly
> pg_get_replication_slots() which fetches data for the system view
> pg_replication_slots, or modules looking directly at slot information.
>
> The code paths involved in those problems concern logical decoding
> initialization (down to 9.4) and WAL reservation for slots (new as of
> 10).
>
> A set of comments documenting all the lock handling, particularly the
> dependency with LW locks for slots and the in_use flag as well as the
> internal mutex lock is added, based on a suggestion by Simon Riggs.
>
> Some of the fixed code exists down to 9.4 where WAL decoding has been
> introduced, but as those race conditions are really unlikely to
> happen as those concern code paths for slot and decoding creation, just
> fix the problem on HEAD.

You can't do things like:

/* start at current insert position */
+   SpinLockAcquire(&slot->mutex);
slot->data.restart_lsn = GetXLogInsertRecPtr();
+   SpinLockRelease(&slot->mutex);

a) we don't call external functions with a spinlock held. As a
   rule. It's too hard to see what happens in that other function / too
   likely to change independently.

b) we most certainly don't do it if the other function also acquires a
   spinlock:
XLogRecPtr
GetXLogInsertRecPtr(void)
{
XLogCtlInsert *Insert = &XLogCtl->Insert;
uint64  current_bytepos;

SpinLockAcquire(&Insert->insertpos_lck);
current_bytepos = Insert->CurrBytePos;
SpinLockRelease(&Insert->insertpos_lck);

return XLogBytePosToRecPtr(current_bytepos);
}

   Nesting spinlocks means that you'd need to be very careful about
   whether all lockers use the same order. And be ok with the system
   stalled until the PANIC if it failed.

Same is true for the codepaths calling GetRedoRecPtr().
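
For illustration, one way to avoid the nested acquisition is to read the
insert position before taking the slot's spinlock, along these lines (a
sketch of the pattern only, not necessarily the fix that should be
committed):

XLogRecPtr	restart_lsn;

/* read the insert position with no slot lock held */
restart_lsn = GetXLogInsertRecPtr();

/* then publish it under the slot's spinlock */
SpinLockAcquire(&slot->mutex);
slot->data.restart_lsn = restart_lsn;
SpinLockRelease(&slot->mutex);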


I don't object to the general idea of adding locking - although the
benefit are nearly guaranteed to be cosmetic - but this has the
potential to make things worse.

- Andres



Re: [PATCH] Clear up perlcritic 'missing return' warning

2018-06-11 Thread Alvaro Herrera
On 2018-May-26, Andrew Dunstan wrote:

> Not quite trivial but it's done - see 
> .
> 
> crake is now set up to run this - see 
> 
> 
> So, are there any other objections?
> 
> The patch Mike supplied doesn't give us a clean run (at least on the machine
> I tested on), since it turns down the severity level to 4 but leaves some
> items unfixed. I propose to enable this policy at level 5 for now, and then
> remove that when we can go down to level 4 cleanly, and use its default
> setting at that stage.

Just to be clear -- this is done, right?  And we plan no further
enhancements in perlcritic area for pg11?

-- 
Álvaro Herrera    https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: cursors with prepared statements

2018-06-11 Thread Peter Eisentraut
On 6/11/18 09:57, Amit Kapila wrote:
> Sounds like a reasonable approach.  Have you not considered using a
> special OPEN syntax because there are some other forms of problems
> with it?

There is no OPEN command in direct SQL.  Do you mean whether I have
considered introducing an OPEN command?  Yes, but it seems to me that
that would create weird inconsistencies and doesn't seem very useful in
practice.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Spilling hashed SetOps and aggregates to disk

2018-06-11 Thread Robert Haas
On Mon, Jun 11, 2018 at 11:29 AM, Tomas Vondra
 wrote:
> I think the underlying question is whether we want to treat this as an
> emergency safety against OOM (which should be a rare thing in most cases) or
> something allowing us to pick hash aggregate more often.
>
> It would be great to get something that performs better than just falling
> back to sort (and I was advocating for that), but I'm worried we might be
> moving the goalposts way too far.

Well, the scope of the first patch is always negotiable, but I think
we should try to think beyond the first patch in design discussions,
so that we don't paint ourselves into a corner.  I also think it's
worth noting that an emergency safety which also causes a 10x
performance degradation is not necessarily much better than OOM.  I
suspect that part of the reason why both Andres and David proposed
algorithms that combined hashing and sorting is out of a desire to
cater somewhat to both few-tuples-per-group and many-tuples-per-group.
AFAICS, Andres's algorithm will be better for few-tuples-per-group
(maybe with a few large groups mixed in) and David's algorithm will be
better for many-tuples-per-group (maybe with some small groups mixed
in), but in each case the inclusion of a sorting component seems like
it will ease the pain in the opposite use case.  However, I wonder
whether it will be better still to just acknowledge that those two
cases really need separate algorithms and then think about the best
approach for each one individually.

One big advantage to just partitioning the input tuples by hash code
and then proceeding batch by batch is that it works for any aggregate
that can support hash aggregation in the first place.  You do not need
a btree opclass.  You do not need support for serialize/deserialize or
combine.  If you have serialize/deserialize, it's better, because now
you can spill growing transition values out of the hash table on the
fly.  But the basic contours of the algorithm don't change even if
such support is lacking.  That's a significant advantage in my view,
because with some of these more complex strategies, we'll end up with
code paths for data types lacking the necessary support which are
almost never exercised in practice.  That's likely to lead to bugs.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: CF bug fix items

2018-06-11 Thread Alvaro Herrera
On 2018-Jun-11, Etsuro Fujita wrote:

> (2018/06/11 20:34), Andrew Dunstan wrote:
> > On 06/11/2018 07:08 AM, Etsuro Fujita wrote:
> > > > In one of those cases, "ConvertRowtypeExpr reference errors from
> > > > partition-wise join", the patch has been marked Ready for Committer and
> > > > then Etsuro seems to have changed his mind. If it's not ready it should
> > > > be set back to "needs review" or "waiting for author".
> > > 
> > > Yeah, I don't think the proposed patch is the right way to go, so I'm
> > > proposing another solution for that, which I think makes code much
> > > simple, but I'd like to hear the opinion from Robert, who is the owner
> > > of PWJ.  (I marked this as Ready for Committer partly because I wanted
> > > to hear the opinion.)
> > 
> > I don't think that's the way we should use "Ready for Committer". I
> > suggest you move it back to "Needs Review".
> 
> Done.

Actually, for something that's an open item, there needn't be an entry
in the commitfest at all, ISTM.  Open items are must-fix for release
(unlike older bugs), unless as a community we decide that something is
not a bug or that it can go unfixed.  So the commitfest entry is
unnecessary.

-- 
Álvaro Herrera    https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Spilling hashed SetOps and aggregates to disk

2018-06-11 Thread Tomas Vondra

On 06/11/2018 04:25 PM, Robert Haas wrote:



> ...


Maybe that's not exactly what Tomas (or you) meant by eviction
strategy, though.  Not sure.  But it does seem to me that we need to
pick the algorithm we're going to use, more or less, in order to
decide what infrastructure we need, and at least to some extent, we
ought to let our choice of algorithm be informed by the desire not to
need too much infrastructure.



I was using eviction strategy in a somewhat narrower sense - pretty much 
just "Which groups to evict from the hash table?" but you're right the 
question is more general and depends on which scheme we end up using 
(just hashagg, hash+sort, something else ...)


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Spilling hashed SetOps and aggregates to disk

2018-06-11 Thread Tomas Vondra




On 06/11/2018 05:16 PM, Robert Haas wrote:

On Wed, Jun 6, 2018 at 8:16 PM, Tomas Vondra
 wrote:

... and this is pretty much what Jeff Davis suggested, I think. The
trouble is, each of those cases behaves nicely/terribly in different
corner cases.


That's a really good point.  If the number of groups is pretty small
compared to the number of input tuples, then you really only ever want
to dump out transition values.  By doing so, you minimize the amount
of data you have to write.  But if the number of groups is, say, half
the number of input tuples, then computing transition values before
you have all the values that belong to that group is probably a waste
of time.  I wonder if we could decide what to do by comparing the
number of input tuples consumed to the number of groups created at the
time we run out of memory.  If we've got less than, I dunno, five
tuples per group, then assume most groups are small.  Pick a strategy
that (mainly) spools input tuples.  Otherwise, pick a strategy that
spools transition tuples.

In either case, it seems like we can pick a pure hashing strategy or
switch to using both hashing and sorting.  For example, IIUC, Andres's
proposal involves spooling mostly input tuples, but can also spool
transition tuples if the transition values consume more memory as they
absorb more tuples.  However, you could do a similar kind of thing
without needing sort support.  When you see a value that doesn't
fit in your in-memory hash table, use the hash code to route it to 1
of N batch files.  Have a second set of batch files for transition
tuples in case you need to kick things out of the in-memory hash
table.  Once you finish reading the input, emit all the values that
remain in the in-memory hash table and then process each batch file
separately.

Similarly, David's strategy involves spooling only transition tuples
and then sorting on the group key, but it's again possible to do
something similar without relying on sorting.  Instead of flushing the
in-memory hash table to a tuple store, split the transition tuples it
contains among N batch files based on the hash code.  Once you've read
all of the input, go back and reprocess each batch file, combining
transition values whenever the same group keys appear in more than one
transition tuple.



Yeah, essentially something like the batching in hash joins.


To me, the pure-hashing strategies look a little simpler, but maybe
there's enough performance benefit from combining hashing and sorting
that it's worth the complexity, or maybe we should just accept
whichever variant somebody's willing to code.  But I think we almost
have to have separate handling for many-row-per-group and
few-rows-per-group, because those seem fundamentally different.



I think the underlying question is whether we want to treat this as an 
emergency safety against OOM (which should be a rare thing in most 
cases) or something allowing us to pick hash aggregate more often.


It would be great to get something that performs better than just 
falling back to sort (and I was advocating for that), but I'm worried we 
might be moving the goalposts way too far.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: config.{guess,sub} updates

2018-06-11 Thread Peter Eisentraut
On 6/7/18 23:54, Tom Lane wrote:
> Peter Eisentraut  writes:
>> We didn't update those for beta1, as we usually do.  Should we do it for
>> beta2?
> 
> Yes, if there are updates to be made.  Better late than never.

done

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Spilling hashed SetOps and aggregates to disk

2018-06-11 Thread Robert Haas
On Wed, Jun 6, 2018 at 8:16 PM, Tomas Vondra
 wrote:
> ... and this is pretty much what Jeff Davis suggested, I think. The
> trouble is, each of those cases behaves nicely/terribly in different
> corner cases.

That's a really good point.  If the number of groups is pretty small
compared to the number of input tuples, then you really only ever want
to dump out transition values.  By doing so, you minimize the amount
of data you have to write.  But if the number of groups is, say, half
the number of input tuples, then computing transition values before
you have all the values that belong to that group is probably a waste
of time.  I wonder if we could decide what to do by comparing the
number of input tuples consumed to the number of groups created at the
time we run out of memory.  If we've got less than, I dunno, five
tuples per group, then assume most groups are small.  Pick a strategy
that (mainly) spools input tuples.  Otherwise, pick a strategy that
spools transition tuples.

In either case, it seems like we can pick a pure hashing strategy or
switch to using both hashing and sorting.  For example, IIUC, Andres's
proposal involves spooling mostly input tuples, but can also spool
transition tuples if the transition values consume more memory as they
absorb more tuples.  However, you could do a similar kind of thing
without needing sort support.  When you see a value that doesn't
fit in your in-memory hash table, use the hash code to route it to 1
of N batch files.  Have a second set of batch files for transition
tuples in case you need to kick things out of the in-memory hash
table.  Once you finish reading the input, emit all the values that
remain in the in-memory hash table and then process each batch file
separately.

Similarly, David's strategy involves spooling only transition tuples
and then sorting on the group key, but it's again possible to do
something similar without relying on sorting.  Instead of flushing the
in-memory hash table to a tuple store, split the transition tuples it
contains among N batch files based on the hash code.  Once you've read
all of the input, go back and reprocess each batch file, combining
transition values whenever the same group keys appear in more than one
transition tuple.
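
To make the batching idea concrete, here is a standalone toy program (plain
C, not PostgreSQL code; the batch "files" are just arrays, and a 4-entry
table stands in for work_mem).  Tuples whose group is not in the full
in-memory table are routed by hash code to one of N batches and aggregated
in a second pass:

#include <stdio.h>

#define NBATCH		4
#define MAXGROUP	4			/* pretend only 4 groups fit in memory */
#define MAXINPUT	64

typedef struct
{
	int			key;
	long		count;
} Group;

/* Knuth multiplicative hash; any reasonable hash function works here. */
static unsigned
hash_key(int key)
{
	return (unsigned) key * 2654435761u;
}

/* Find key in tab[0..*ngroups), or add it if there is room; NULL if full. */
static Group *
lookup(Group *tab, int *ngroups, int maxgroups, int key)
{
	int			i;

	for (i = 0; i < *ngroups; i++)
		if (tab[i].key == key)
			return &tab[i];
	if (*ngroups >= maxgroups)
		return NULL;
	tab[*ngroups].key = key;
	tab[*ngroups].count = 0;
	return &tab[(*ngroups)++];
}

static void
emit(const char *label, Group *tab, int ngroups)
{
	int			i;

	for (i = 0; i < ngroups; i++)
		printf("%s: key %d -> count %ld\n", label, tab[i].key, tab[i].count);
}

int
main(void)
{
	int			input[] = {1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 5, 9, 9, 9, 3, 7};
	int			ninput = (int) (sizeof(input) / sizeof(input[0]));
	Group		tab[MAXGROUP];
	int			ngroups = 0;
	int			batch[NBATCH][MAXINPUT];
	int			nbatch[NBATCH] = {0};
	int			i, b;

	/* Pass 1: aggregate what fits, spool the rest by hash code. */
	for (i = 0; i < ninput; i++)
	{
		Group	   *g = lookup(tab, &ngroups, MAXGROUP, input[i]);

		if (g)
			g->count++;
		else
		{
			b = hash_key(input[i]) % NBATCH;
			batch[b][nbatch[b]++] = input[i];
		}
	}
	emit("pass 1", tab, ngroups);

	/* Pass 2: aggregate each batch with a fresh table. */
	for (b = 0; b < NBATCH; b++)
	{
		Group		btab[MAXINPUT];
		int			bngroups = 0;

		for (i = 0; i < nbatch[b]; i++)
		{
			/* cannot overflow here; a real implementation would re-batch */
			Group	   *g = lookup(btab, &bngroups, MAXINPUT, batch[b][i]);

			g->count++;
		}
		emit("pass 2", btab, bngroups);
	}
	return 0;
}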

To me, the pure-hashing strategies look a little simpler, but maybe
there's enough performance benefit from combining hashing and sorting
that it's worth the complexity, or maybe we should just accept
whichever variant somebody's willing to code.  But I think we almost
have to have separate handling for many-row-per-group and
few-rows-per-group, because those seem fundamentally different.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: why partition pruning doesn't work?

2018-06-11 Thread Tom Lane
David Rowley  writes:
> On 12 June 2018 at 02:26, Tom Lane  wrote:
>>> ... I'd previously tried having NULL subnodes in
>>> the Append and I thought it required too much hacking work in
>>> explain.c,

>> No, that was pretty much exactly what I was envisioning.

> What you're proposing exchanges logic that fits neatly into one
> function for special logic that will be scattered all over explain.c.
> I really don't think that's the better of the two evils.

As far as I can see, it'd involve about three or four lines' worth of
rewrite in one place-you-already-made-quite-ugly in explain.c, and an
added if-test in planstate_walk_members, and that'd be it.  That seems
like a pretty cheap price for being able to vastly simplify the invariants
for the pruning functions.  In fact, I doubt you'd even need two of them
anymore; just one with a bool flag for can-use-exec-params.

> I'd rather just see my last patch applied which simplifies the re-map
> code by removing the 2nd loop.  Actually, even updating the
> present_parts to remove the unneeded subpartitioned table indexes is
optional. We'd just need to give up on the elog(ERROR, "partition
> missing from subplans"); error and assume that case is fine.

The fact that you added that loop and only later decided it was
unnecessary seems to me to support my point pretty strongly: that
code is overcomplicated.

regards, tom lane



Re: SCRAM with channel binding downgrade attack

2018-06-11 Thread Magnus Hagander
On Mon, Jun 11, 2018 at 4:49 PM, Peter Eisentraut <
peter.eisentr...@2ndquadrant.com> wrote:

> On 6/6/18 18:04, Michael Paquier wrote:
> > On Wed, Jun 06, 2018 at 11:53:06PM +0300, Heikki Linnakangas wrote:
> >> That would certainly be good. We've always had that problem, even with
> md5
> >> -> plaintext password downgrade, and it would be nice to fix it. It's
> quite
> >> late in the release cycle already, do you think we should address that
> now?
> >> I could go either way..
> >
> > I would be inclined to treat that as new development as this is no new
> > problem.
>
> I agree.
>
>
Agreed as well.

I'm wondering if that means we should then also not do it specifically for
scram in this version. Otherwise we're likely to end up with a parameter
that only has a "lifetime" of one version, and that seems like a bad idea.
If nothing else we should clearly think out what the path is to make sure
that doesn't happen. (e.g. we don't want a
scram_channel_binding_mode=require in this version, if the next one is
going to replace it with something like heikkis suggested
allowed_authentication_methods=SCRAM-SHA-256-PLUS or whatever we end up
coming up with there).

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: SCRAM with channel binding downgrade attack

2018-06-11 Thread Peter Eisentraut
On 6/6/18 18:04, Michael Paquier wrote:
> On Wed, Jun 06, 2018 at 11:53:06PM +0300, Heikki Linnakangas wrote:
>> That would certainly be good. We've always had that problem, even with md5
>> -> plaintext password downgrade, and it would be nice to fix it. It's quite
>> late in the release cycle already, do you think we should address that now?
>> I could go either way..
> 
> I would be inclined to treat that as new development as this is no new
> problem.

I agree.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Supporting tls-server-end-point as SCRAM channel binding for OpenSSL 1.0.0 and 1.0.1

2018-06-11 Thread Peter Eisentraut
On 6/6/18 16:16, Steven Fackler wrote:
> TLS 1.3, (which is currently in a draft state, but is theoretically
> being finalized soon) does not support the TLS channel binding
> algorithms [1]. From talking with one of the people working on the TLS
> 1.3 standard, tls-unique is seen as particularly problematic. There's
> some discussion on the IETF mailing lists from a couple of years ago [2].

I think we'll just have to wait for an updated RFC on channel bindings
for TLS 1.3.

Perhaps we should change PostgreSQL 11 to not advertise channel binding
when TLS 1.3 is used?

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



PostgreSQL 11 Beta 2 Release: 2018-06-28

2018-06-11 Thread Jonathan S. Katz
Hi,

The Release Management Team is pleased to announce that
the release date for PostgreSQL 11 Beta 2 is set to be 2018-06-28,
which is roughly a month after Beta 1 was released.

We appreciate everyone’s diligent effort fixing open issues[1] and we
hope to get even more closed before the Beta 2 release. Early feedback
both from this mailing list and a small survey[2] indicates people
are testing and helping us make the upcoming major release as stable
as possible :-)

Thank you again for your hard work and please let us know if you
have any questions.

Thanks,

Jonathan

[1] https://wiki.postgresql.org/wiki/PostgreSQL_11_Open_Items 

[2] 
https://www.postgresql.org/community/survey/95-are-you-planning-to-test-postgresql-11-while-it-is-in-beta/
 


Re: why partition pruning doesn't work?

2018-06-11 Thread Tom Lane
David Rowley  writes:
> On 11 June 2018 at 10:49, Tom Lane  wrote:
>> * The business with ExecFindInitialMatchingSubPlans remapping the
>> subplan indexes seems very dubious to me.  Surely, that is adding
>> way more complexity and fragility than it's worth.

> I think having the ability to prune during executor initialisation is
> enormously important. I think without it, this patch is less than half
> as useful.

Sure.

> However, if you didn't mean removing the executor
> initialise pruning, and just not remapping the arrays, then I'm not
> sure how we'd do that.  I'd previously tried having NULL subnodes in
> the Append and I thought it required too much hacking work in
> explain.c, so I decided to just collapse the array instead and do what
> was required to make it work after having removed the unneeded
> subplans. Did you have another idea on how to do this?

No, that was pretty much exactly what I was envisioning.  I have
not looked at the consequences for explain.c but it seemed like
it couldn't be all that bad; and to my mind the remapping business
in the partition code is pretty bad.  "It's all in one file" is not
a suitable justification for unintelligible, overcomplex code.

regards, tom lane



Re: Spilling hashed SetOps and aggregates to disk

2018-06-11 Thread Robert Haas
On Tue, Jun 5, 2018 at 1:46 AM, Jeff Davis  wrote:
> On Tue, 2018-06-05 at 07:04 +0200, Tomas Vondra wrote:
>> I expect the eviction strategy to be the primary design challenge of
>> this patch. The other bits will be mostly determined by this one
>> piece.
>
> Not sure I agree that this is the primary challenge.
>
> The cases that benefit from eviction are probably a minority. I see two
> categories that would benefit:
>
>   * Natural clustering in the heap. This sounds fairly common, but a
> lot of the cases that come to mind are too low-cardinality to be
> compelling; e.g. timestamps grouped by hour/day/month. If someone has
> run into a high-cardinality natural grouping case, let me know.
>   * ARRAY_AGG (or similar): individual state values can be large enough
> that we need to evict to avoid exceeding work_mem even if not adding
> any new groups.
>
> In either case, it seems like a fairly simple eviction strategy would
> work. For instance, we could just evict the entire hash table if
> work_mem is exceeded or if the hit rate on the hash table falls below a
> certain threshold. If there was really something important that should
> have stayed in the hash table, it will go back in soon anyway.
>
> So why should eviction be a major driver for the entire design? I agree
> it should be an area of improvement for the future, so let me know if
> you see a major problem, but I haven't been as focused on eviction.

Hmm, I think the eviction strategy is pretty important.  For example,
as we discussed in the hallway at PGCon, if we're flushing the whole
hash table (or one of several hash tables), we can account for memory
usage chunk-by-chunk instead of palloc-by-palloc, which as you pointed
out is more accurate and substantially less expensive.  If we are
evicting individual tuples, though, it's quite possible that evicting
even a large number of tuples might not free up any chunks, so we'd
probably have to do accounting palloc-by-palloc.  Working harder to
get a less accurate answer isn't real appealing even if the absolute
overhead is low.
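
As a toy illustration of chunk-level accounting (not PostgreSQL's memory
context code; names and sizes are invented), small allocations are carved
out of larger chunks and only whole chunks are charged, so the counter is
touched once per chunk rather than once per allocation:

#include <stdio.h>
#include <stdlib.h>

#define CHUNK_DATA	8192		/* invented chunk payload size */

typedef struct Chunk
{
	struct Chunk *next;
	size_t		used;
	char		data[CHUNK_DATA];
} Chunk;

typedef struct
{
	Chunk	   *head;
	size_t		chunk_bytes;	/* charged chunk-by-chunk */
	size_t		alloc_bytes;	/* what per-allocation accounting would track */
} Pool;

/* Carve an allocation out of the current chunk; grab a new chunk if full. */
static void *
pool_alloc(Pool *pool, size_t size)
{
	void	   *p;

	size = (size + 7) & ~(size_t) 7;	/* 8-byte alignment; assumes size <= CHUNK_DATA */
	if (pool->head == NULL || pool->head->used + size > CHUNK_DATA)
	{
		Chunk	   *c = malloc(sizeof(Chunk));

		if (c == NULL)
			exit(1);
		c->next = pool->head;
		c->used = 0;
		pool->head = c;
		pool->chunk_bytes += sizeof(Chunk); /* one update per chunk... */
	}
	pool->alloc_bytes += size;			/* ...versus one per allocation */
	p = pool->head->data + pool->head->used;
	pool->head->used += size;
	return p;
}

int
main(void)
{
	Pool		pool = {NULL, 0, 0};
	int			i;

	for (i = 0; i < 10000; i++)
		(void) pool_alloc(&pool, 40);	/* e.g. small transition values */

	printf("chunk-level accounting:    %zu bytes\n", pool.chunk_bytes);
	printf("per-allocation accounting: %zu bytes\n", pool.alloc_bytes);
	return 0;
}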

As Andres mentioned, it also matters what kind of thing we evict.  If
we push raw input tuples to a spool file in lieu of creating new
groups and deal with those in a second pass with a new hash table, we
don't need anything extra at all.  If the second pass uses sort +
group, we need the aggregates to be both hashable and sortable.  If we
evict transition tuples rather than input tuples, we need
serialization + deserialization functions.  If we do that in a way
that might create multiple transition values for the same combination
of grouping values, then we also need combine functions.

Maybe that's not exactly what Tomas (or you) meant by eviction
strategy, though.  Not sure.  But it does seem to me that we need to
pick the algorithm we're going to use, more or less, in order to
decide what infrastructure we need, and at least to some extent, we
ought to let our choice of algorithm be informed by the desire not to
need too much infrastructure.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [bug fix] ECPG: freeing memory for pgtypes crashes on Windows

2018-06-11 Thread Tom Lane
Thomas Munro  writes:
> One question I have is whether it is against project policy to
> backport a new file and a new user-facing function.

Given that this has been broken since forever, and there've been
no complaints before now, I do not think the case for back-patching
is strong enough to justify the problems it would cause.  Just
put it in v11 and be done.

Also, this bit in the proposed documentation seems quite inappropriate:

(This is a change from earlier releases of
PostgreSQL ...

We don't normally put in such comments at all, and if we do, we
specify which version we're talking about.  Two years from now
this'd be totally confusing.  I'd just make it read

(This is important only on Windows, where ...

regards, tom lane



Re: cursors with prepared statements

2018-06-11 Thread Amit Kapila
On Fri, Jun 8, 2018 at 1:12 AM, Peter Eisentraut
 wrote:
> I have developed a patch that allows declaring cursors over prepared
> statements:
>
> DECLARE cursor_name CURSOR FOR prepared_statement_name
>[ USING param, param, ... ]
>
> This is an SQL standard feature.  ECPG already supports it (with
> different internals).
>
> Internally, this just connects existing functionality in different ways,
> so it doesn't really introduce anything new.
>
> One point worth pondering is how to pass the parameters of the prepared
> statements.  The actual SQL standard syntax would be
>
> DECLARE cursor_name CURSOR FOR prepared_statement_name;
> OPEN cursor_name USING param, param;
>
> But since we don't have the OPEN statement in direct SQL, it made sense
> to me to attach the USING clause directly to the DECLARE statement.
>
> Curiously, the direct EXECUTE statement uses the non-standard syntax
>
> EXECUTE prep_stmt (param, param);
>
> instead of the standard
>
> EXECUTE prep_stmt USING param, param;
>
> I tried to consolidate this.  But using
>
> DECLARE c CURSOR FOR p (foo, bar)
>
> leads to parsing conflicts (and looks confusing?), and instead allowing
> EXECUTE + USING leads to a mess in the ECPG parser that exhausted me.
> So I'm leaving it as is for now and might give supporting EXECUTE +
> USING another try later on.
>

Sounds like a reasonable approach.  Have you not considered using a
special OPEN syntax because there are some other forms of problems
with it?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Performance regression with PostgreSQL 11 and partitioning

2018-06-11 Thread Robert Haas
On Fri, Jun 8, 2018 at 3:08 PM, Tom Lane  wrote:
> Robert Haas  writes:
>> That being said, I don't mind a bit if you want to look for further
>> speedups here, but if you do, keep in mind that a lot of queries won't
>> even use partition-wise join, so all of the arrays will be of length
>> 1.  Even when partition-wise join is used, it is quite likely not to
>> be used for every table in the query, in which case it will still be
>> of length 1 in some cases.  So pessimizing nappinfos = 1 even slightly
>> is probably a regression overall.
>
> TBH, I am way more concerned about the pessimization introduced for
> every pre-existing usage of these functions by putting search loops
> into them at all.  I'd like very much to revert that.  If we can
> replace those with something along the line of root->index_array[varno]
> we'll be better off across the board.

I think David's analysis is correct -- this doesn't quite work.  We're
trying to identify whether a given varno is one of the ones to be
translated, and it's hard to come up with a faster solution than
iterating over a (very short) array of those values.  One thing we
could do is have two versions of each function, or else an optimized
path for the very common nappinfos=1 case.  I'm just not sure it would
be worthwhile.  Traversing a short array isn't free, but it's pretty
cheap.
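
For context, the loop being defended has roughly this shape (a paraphrase,
not an exact excerpt from adjust_appendrel_attrs_mutator): for each
translated Var we scan the handful of AppendRelInfos for the one whose
parent_relid matches the Var's varno, which in the common nappinfos == 1
case is a single comparison.

AppendRelInfo *appinfo = NULL;
int			cnt;

for (cnt = 0; cnt < nappinfos; cnt++)
{
	if (var->varno == appinfos[cnt]->parent_relid)
	{
		appinfo = appinfos[cnt];
		break;
	}
}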

An early version of the patch that made these changes used a List here
rather than a C array, and I asked for that to be changed on
efficiency grounds, and also because constructing 1-element lists
would have a cost of its own.  I think in general we have way too much
code that uses Lists for convenience even though C arrays would be
faster.  For the most part, the performance consequences of any
individual place where we do this are probably beneath the noise
floor, but in the aggregate I think it has nasty consequences for both
performance and memory utilization.  I think if we are going to look
at optimizing we are likely to buy more by worrying about cases where
we traverse lists, especially ones that may be long, rather than
worrying about looping over short C arrays.  Of course I'm open to
being proved wrong...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [PATCH v16] GSSAPI encryption support

2018-06-11 Thread Robert Haas
On Thu, Jun 7, 2018 at 6:11 PM, Thomas Munro
 wrote:
>> Cool!  Is there any reason that your patch for Travis and AppVeyor
>> integration is not just committed to master?
>
> I think that's a good idea and I know that some others are in favour.

One problem that was discussed at PGCon is that it commits us to one
particular build configuration i.e. one set of --with-whatever options
to configure.  It's not bad to provide a reasonable set of defaults,
but it means that patches which are best tested with some other set of
values will have to modify the file (I guess?). Patches that need to
be tested with multiple sets of flags are ... maybe just out of luck?

I really don't understand the notion of putting the build script
inside the source tree.  It's all fine if there's One Way To Do It but
often TMTOWTDI.  If the build configurations are described outside
the source tree then you can have as many of them as you need.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Performance regression with PostgreSQL 11 and partitioning

2018-06-11 Thread David Rowley
On 9 June 2018 at 07:08, Tom Lane  wrote:
> If we can
> replace those with something along the line of root->index_array[varno]
> we'll be better off across the board.

Are you talking about the loop over the appinfos in functions such as
adjust_appendrel_attrs_mutator?

If so, please note that it's looking for the appinfo with the parent
relid matching the appinfo, not a child relid. There may be many
appinfos in this "index_array" with that parent, so what you mention
above, if I understand you correctly, is not possible. We'd still need
a list of childrelids and we'd need to still loop over each
AppendRelInfo which belongs to that child and check if the parent
matches the Var.

I started coding a patch which instead of the direct lookup you've
mentioned just does a bms_next_member() loop using the given
childrelids. It only part way through writing the patch when I hit a
snag around the various places that call adjust_appendrel_attrs() with
a pre-determined AppendRelInfo, for example, the one in
inheritance_planner() at line 1312.  I'd thought that maybe I could
just pass the childrelids in with
bms_make_singleton(appinfo->child_relid), but the manufactured
"parent_root" inside the inheritance_planner does not have the
append_rel_array set. We could work around this by having another
version of adjust_appendrel_attrs() which accepts an AppendRelInfo
instead of a Relids, but that looks like it means making another
adjust_appendrel_attrs_mutator like function and duplicating quite a
bit of code.  Another option is perhaps to add a callback function to
adjust_appendrel_attrs_context to have that search for the
AppendRelInfo, or just return the known one, but unsure if that's
going to look very good.  I've now stopped as I'm starting to think
that this is not improving the code.

I've attached the half-done, broken, regression-test-failing patch just
so you can see what I'm talking about.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


get_rid_of_find_appinfos_by_relids_incomplete_mess.patch
Description: Binary data


Re: CF bug fix items

2018-06-11 Thread Andrew Dunstan




On 06/11/2018 07:08 AM, Etsuro Fujita wrote:




In one of those cases, "ConvertRowtypeExpr reference errors from
partition-wise join", the patch has been marked Ready for Committer and
then Etsuro seems to have changed his mind. If it's not ready it should
be set back to "needs review" or "waiting for author".


Yeah, I don't think the proposed patch is the right way to go, so I'm 
proposing another solution for that, which I think makes code much 
simple, but I'd like to hear the opinion from Robert, who is the owner 
of PWJ.  (I marked this as Ready for Committer partly because I wanted 
to hear the opinion.)


I don't think that's the way we should use "Ready for Committer". I 
suggest you move it back to "Needs Review".




Anyway, I think this is an open item for PG11, so I'll add this to 
that list for PG11.



Ok, good.

cheers

andrew


--
Andrew Dunstan    https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: CF bug fix items

2018-06-11 Thread Etsuro Fujita

(2018/06/10 23:41), Andrew Dunstan wrote:

I've been looking over the older items in the CF, specifically those in
the "bug fix" category.

Six of them are marked "ready for committer" and all of those have a
committer named as either author or reviewer. It would be good to get
those committed as soon as possible. So Heikki, Michael, Alexander,
Simon, Etsuro and Thomas, that means you :-) Please claim and commit
those if possible, or at least move them forward.


Thanks for the summary!


In one of those cases, "ConvertRowtypeExpr reference errors from
partition-wise join", the patch has been marked Ready for Committer and
then Etsuro seems to have changed his mind. If it's not ready it should
be set back to "needs review" or "waiting for author".


Yeah, I don't think the proposed patch is the right way to go, so I'm 
proposing another solution for that, which I think makes code much 
simple, but I'd like to hear the opinion from Robert, who is the owner 
of PWJ.  (I marked this as Ready for Committer partly because I wanted 
to hear the opinion.)


Anyway, I think this is an open item for PG11, so I'll add this to that 
list for PG11.


Best regards,
Etsuro Fujita



Re: Expression errors with "FOR UPDATE" and postgres_fdw with partition wise join enabled.

2018-06-11 Thread Etsuro Fujita

(2018/06/07 19:42), Ashutosh Bapat wrote:

On Wed, Jun 6, 2018 at 5:00 PM, Etsuro Fujita
  wrote:

Since I'm not 100% sure that that is the right way to go, I've been
rethinking how to fix this issue.  Yet another idea I came up with to fix
this is to redesign the handling of the tlists for children in the
partitioning case.  Currently, we build the reltarget for a child by
applying adjust_appendrel_attrs to the reltarget for its parent in
set_append_rel_size, which maps a wholerow Var referring to the parent rel
to a ConvertRowtypeExpr that translates a wholerow of the child rel into a
wholerow of the parent rel's rowtype.  This works well for the
non-partitioned inheritance case, but complicates the code for handling
the partitioning case, especially when planning partitionwise joins.
And I think this is the root cause of this issue.


Although true, this is not only about the targetlist. Whole-row
expressions in the conditions, equivalence classes and other
planner/optimizer data structures are also translated to
ConvertRowtypeExpr.


Yeah, but I mean we would modify set_append_rel_size so that we only map a 
parent wholerow Var in the parent's tlist to a child wholerow Var in the 
child's tlist; parent wholerow Vars in the parent's other expressions, 
such as conditions, would still be transformed into CREs as they are now.



  I don't think the
tlists for the children need to have their wholerows transformed into the
corresponding ConvertRowtypeExprs *at this point*, so what I'd like to
propose is to 1) map a parent wholerow Var simply to a child wholerow Var,
instead (otherwise, the same as adjust_appendrel_attrs), when building the
reltarget for a child in set_append_rel_size, 2) build the tlists for child
joins using those children's wholerow Vars at path creation time, and 3)
adjust those wholerow Vars in the tlists for subpaths in the chosen
AppendPath so that those are transformed into the corresponding
ConvertRowtypeExprs, at plan creation time (ie, in
create_append_plan/create_merge_append_plan).  IIUC, this would not require
any changes to pull_var_clause as proposed by the patch.  This would not
require any changes to postgres_fdw as proposed by the patch, either.  In
addition, this would make unnecessary the code added to setrefs.c by the
partitionwise-join patch.  Maybe I'm missing something though.


Not translating whole-row expressions to ConvertRowtypeExpr before
creating paths can lead to a number of anomalies. For example,

1. If there's an index on the whole-row expression of a child,
translating the parent's whole-row expression to the child's whole-row
expression as-is will lead to using that index, when in reality it
cannot be used, since the condition/ORDER BY clause (which originally
referred to the parent's whole-row expression) requires the child's
whole-row reassembled as the parent's whole-row.
2. Constraints on a child's whole-row expression will be used to prune
a child when they cannot be used, since the original condition was on
the parent's whole-row expression and not that of the child.
3. Equivalence classes will think that a child whole-row expression
(without ConvertRowtypeExpr) is equivalent to an expression which is
part of the same equivalence class as the parent's whole-row
expression.


Is that still possible when we only map a parent wholerow Var in the 
parent's tlist to a child wholerow Var in the child's tlist?
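
To be concrete, a rough sketch of what I mean by 1) -- the names and
includes here are only illustrative, not the actual change -- is to emit
the child's own wholerow Var and leave the ConvertRowtypeExpr until plan
creation time:

    #include "postgres.h"
    #include "nodes/makefuncs.h"
    #include "nodes/relation.h"

    /*
     * Illustrative only: translate a parent wholerow Var into a plain child
     * wholerow Var, without wrapping it in a ConvertRowtypeExpr.
     */
    static Node *
    map_parent_wholerow_var(Var *parentvar, AppendRelInfo *appinfo,
                            Oid child_rowtype)
    {
        Assert(parentvar->varattno == 0);       /* wholerow reference */

        return (Node *) makeVar(appinfo->child_relid,
                                0,              /* wholerow Var */
                                child_rowtype,
                                -1,
                                InvalidOid,
                                parentvar->varlevelsup);
    }

The conversion to the parent rel's rowtype would then be injected at plan
creation time in create_append_plan/create_merge_append_plan, as described
above.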


Best regards,
Etsuro Fujita



Re: Locking B-tree leafs immediately in exclusive mode

2018-06-11 Thread Simon Riggs
On 5 June 2018 at 14:45, Alexander Korotkov  wrote:

> Currently _bt_search() always locks leaf buffer in shared mode (aka BT_READ),
> while caller can relock it later.  However, I don't see what prevents
> _bt_search()
> from locking leaf immediately in exclusive mode (aka BT_WRITE) when required.
> When traversing downlink from non-leaf page of level 1 (lowest level of 
> non-leaf
> pages just prior to leaf pages), we know that next page is going to be leaf.  
> In
> this case, we can immediately lock next page in exclusive mode if required.
> For sure, it might happen that we didn't manage to exclusively lock leaf in 
> this
> way when _bt_getroot() points us to leaf page.  But this case could be handled
> in _bt_search() by relock.  Please, find implementation of this approach in 
> the
> attached patch.

It's a good idea. How does it perform with many duplicate entries?

-- 
Simon Riggs  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Concurrency bug in UPDATE of partition-key

2018-06-11 Thread Amit Kapila
On Mon, Jun 11, 2018 at 3:02 PM, Amit Kapila  wrote:
> On Thu, Jun 7, 2018 at 1:53 PM, Amit Khandekar  wrote:
>> On 7 June 2018 at 11:44, Amit Kapila  wrote:
>>> On Tue, Jun 5, 2018 at 8:03 PM, Amit Khandekar 
>>> wrote:
>>>
>>> I think this will allow before row delete triggers to be fired more than
>>> once.  Normally, if the EvalPlanQual testing generates a new tuple, we don't
>>> fire the triggers again.
>>
>> If there are BR delete triggers, the tuple will be locked using
>> GetTupleForTrigger(). So the subsequent EvalPlanQual testing won't be
>> run, since the tuple is already locked due to triggers having run.
>>
>> But that leads me to think : The same concurrency issue can occur in
>> GetTupleForTrigger() also. Say, a concurrent session has already
>> locked the tuple, and GetTupleForTrigger() would wait and then return
>> the updated tuple in its last parameter newSlot. In that case, we need
>> to pass this slot back through ExecBRDeleteTriggers(), and further
>> through epqslot parameter of ExecDelete(). But yes, in this case, we
>> should avoid calling this trigger function the second time.
>>
>> If you agree on the above, I will send an updated patch.
>>
>
> Sounds reasonable to me.
>

Please try to add a test case which covers the BR trigger code path in
the updated patch you are planning to send.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Concurrency bug in UPDATE of partition-key

2018-06-11 Thread Amit Kapila
On Thu, Jun 7, 2018 at 1:53 PM, Amit Khandekar  wrote:
> On 7 June 2018 at 11:44, Amit Kapila  wrote:
>> On Tue, Jun 5, 2018 at 8:03 PM, Amit Khandekar 
>> wrote:
>>
>> I think this will allow before row delete triggers to be fired more than
>> once.  Normally, if the EvalPlanQual testing generates a new tuple, we don't
>> fire the triggers again.
>
> If there are BR delete triggers, the tuple will be locked using
> GetTupleForTrigger(). So the subsequent EvalPlanQual testing won't be
> run, since the tuple is already locked due to triggers having run.
>
> But that leads me to think : The same concurrency issue can occur in
> GetTupleForTrigger() also. Say, a concurrent session has already
> locked the tuple, and GetTupleForTrigger() would wait and then return
> the updated tuple in its last parameter newSlot. In that case, we need
> to pass this slot back through ExecBRDeleteTriggers(), and further
> through epqslot parameter of ExecDelete(). But yes, in this case, we
> should avoid calling this trigger function the second time.
>
> If you agree on the above, I will send an updated patch.
>

Sounds reasonable to me.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2018-06-11 Thread Masahiko Sawada
On Fri, May 25, 2018 at 8:41 PM, Moon, Insung
 wrote:
> Hello Hackers,
>
> This proposes a way to develop "Table-level" Transparent Data Encryption (TDE) 
> and Key Management Service (KMS) support in
> PostgreSQL.
>
>
> Issues on data encryption of PostgreSQL
> ==
> Currently, in PostgreSQL, data can be encrypted using the pgcrypto tool.
> However, it is inconvenient to use pgcrypto to encrypt data in some cases.
>
> There are two significant inconveniences.
>
> First, if we use pgcrypto to encrypt/decrypt data, we must call pgcrypto 
> functions everywhere we encrypt/decrypt.
> Second, we must modify a lot of application code if we want to migrate 
> to PostgreSQL from other databases that use TDE.
>
> To resolve these inconveniences, many users want TDE support.
> There have also been a few proposals, comments, and questions to support TDE 
> in the PostgreSQL community.
>
> However, PostgreSQL currently does not support TDE, so there are 
> discussions in the development community about whether it is necessary 
> to support TDE or not.
>
> In these discussions, the following requirements for supporting TDE in 
> PostgreSQL were identified.
>
> 1) The performance overhead of encrypting and decrypting database data must 
> be minimized.
> 2) Need to support WAL encryption.
> 3) Need to support Key Management Service.
>
> Therefore, I'd like to propose a new design of TDE that deals with the 
> above requirements.
> Since this feature will become very large, I'd like to hear opinions from 
> the community before starting to make the patch.
>
> First, my proposal is table-level TDE, which means the user can specify 
> which tables are encrypted.
> Indexes, the TOAST table and WAL associated with a table that enables TDE 
> are also encrypted.
>
> Moreover, I want to support encryption for large objects as well.
> But I haven't found a good way to do that so far, so I'd like to leave it 
> as a future TODO.
>
> My proposal has five characteristic features of "table-level TDE".
>
> 1) Buffer-level data encryption and decryption
> 2) Per-table encryption
> 3) 2-tier encryption key management
> 4) Working with external key management services(KMS)
> 5) WAL encryption
>
> Here are more details for each items.
>
>
> 1. Buffer-level data encryption and decryption
> ==
> Transparent data encryption and decryption accompany storage operations.
> With an ordinary approach like using pgcrypto, the biggest problem with 
> encrypted data is the performance overhead of decrypting the data each 
> time a query runs.
>
> My proposal is to encrypt and decrypt data when performing disk I/O 
> operations to minimize the performance overhead.
> Therefore, the data in the shared memory layer is unencrypted so that the 
> performance overhead is minimized.
>
> With this design, data encryption/decryption can be implemented by 
> modifying the code of the storage and buffer manager modules, which are 
> responsible for performing disk I/O operations.
>
>
> 2. Per-table encryption
> ==
> Users can enable TDE per table as they want.
> I introduce a new storage parameter "encryption_enabled" which enables TDE 
> at the table level.
>
> // Create an encrypted table
>CREATE TABLE foo WITH ( ENCRYPTION_ENABLED = ON );
>
> // Change to a non-encrypted table
>ALTER TABLE foo SET ( ENCRYPTION_ENABLED = OFF );
>
> This approach minimizes the overhead for tables that do not require 
> encryption.
> For tables that enable TDE, the corresponding table key will be generated 
> from random values, and it is stored in a new system catalog after being 
> encrypted by the master key.
>
> BTW, I want to support CBC mode encryption[3]. However, I'm not sure how to 
> use the IV in CBC mode for this proposal.
> I'd like to hear opinions from security engineers.
>
>
> 3. 2-tier encryption key management
> ==
> When it comes time to change cryptographic keys, there is a performance 
> overhead to decrypting and re-encrypting all data.
>
> To solve this problem we employ 2-tier encryption.
> With 2-tier encryption, all table keys are stored in the database cluster 
> after being encrypted by the master key, and master keys must be stored 
> externally to PostgreSQL.
>
> Therefore, without the master key, it is impossible to decrypt the table 
> keys, and thus impossible to decrypt the database data.
>
> When changing the master key, it is not necessary to re-encrypt all data.
> We use the new master key only to decrypt and re-encrypt the table keys, 
> which minimizes the performance overhead.
>
> All TDE-enabled tables have different table keys, and all databases have 
> different master keys. Table keys are encrypted by the master key of their 
> own database.
> For WAL encryption, we have another cryptographic key. The WAL key is also 
> encrypted by a master key, but it is shared across the database cluster.
>
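> As a rough illustration of the rotation cost (the struct and the
> encrypt_key()/decrypt_key() helpers below are placeholders, not a concrete
> design), only the small wrapped table keys are re-encrypted:
>
>     #include "postgres.h"
>
>     /* Placeholder crypto primitives; a real cipher/KMS API would go here. */
>     extern void decrypt_key(const uint8 *wrapped, uint8 *plain,
>                             const uint8 *master);
>     extern void encrypt_key(const uint8 *plain, uint8 *wrapped,
>                             const uint8 *master);
>
>     typedef struct TableKeyEntry
>     {
>         Oid     relid;              /* table whose data this key protects */
>         uint8   wrapped_key[32];    /* table key encrypted by the master key */
>     } TableKeyEntry;
>
>     /* Re-wrap every table key under the new master key; the table data and
>      * WAL themselves are not touched. */
>     static void
>     rotate_master_key(TableKeyEntry *keys, int nkeys,
>                       const uint8 *old_master, const uint8 *new_master)
>     {
>         int     i;
>
>         for (i = 0; i < nkeys; i++)
>         {
>             uint8   plain[32];
>
>             decrypt_key(keys[i].wrapped_key, plain, old_master);
>             encrypt_key(plain, keys[i].wrapped_key, new_master);
>         }
>     }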
>
> 4. Working with 

Re: hot_standby_feedback vs excludeVacuum and snapshots

2018-06-11 Thread Simon Riggs
On 9 June 2018 at 15:41, Amit Kapila  wrote:
> On Fri, Jun 8, 2018 at 5:27 PM, Simon Riggs  wrote:
>> On 8 June 2018 at 11:33, Amit Kapila  wrote:
>>> On Fri, Jun 8, 2018 at 1:53 PM, Simon Riggs  wrote:

 So the attached patch fixes both the bug in the recent commit and the
 one I just found by observation of 49bff5300d527, since they are
 related.

 StandbyReleaseOldLocks() can sweep in the same way as
 ExpireOldKnownAssignedTransactionIds().

>>>
>>
>>> How will this avoid the bug introduced by your recent commit
>>> (32ac7a11)?  After that commit, we no longer consider vacuum's xid in
>>> RunningTransactions->oldestRunningXid calculation, so it is possible
>>> that we still remove/release its lock on standby earlier than
>>> required.
>>
>> In the past, the presence of an xid was required to prevent removal by
>> StandbyReleaseOldLocks().
>>
>> The new patch removes that requirement and removes the lock when the xid
>> is older than oldestxid.
>>
>
> The case I have in mind is: "Say vacuum got xid 600 while truncating,
> and then some other random transactions 601,602  starts modifying the
> database.  At this point, I think the computed value of
> RunningTransactions->oldestRunningXid will be 601.  Now, on standby
> when StandbyReleaseOldLocks sees the lock from transaction 600, it
> will remove it which doesn't appear correct to me.".

OK, thanks. Patch attached.

>> The normal path of removal is at commit or abort,
>> StandbyReleaseOldLocks is used almost never (as originally intended).
>>
> Okay, but that doesn't prevent us from ensuring that whenever it is used,
> it does the right thing.

What do you mean? Has anyone argued in favour of doing the wrong thing?


I'm playing around with finding a test case to prove this area works,
rather than just manual testing. Suggestions welcome.

-- 
Simon Riggs  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


execution_ordering_in_running_xact.v1.patch
Description: Binary data


Re: Internal error XX000 with enable_partition_pruning=on, pg 11 beta1 on Debian

2018-06-11 Thread David Rowley
On 11 June 2018 at 12:19, Tom Lane  wrote:
> David Rowley  writes:
>> On 10 June 2018 at 04:48, Tom Lane  wrote:
>>> So, IIUC, the issue is that for partitioning cases Append expects *all*
>>> its children to be partitions of the *same* partitioned table?  That
>>> is, you could also break it with
>>>
>>> select * from partitioned_table_a
>>> union all
>>> select * from partitioned_table_b
>>>
>>> ?
>
>> Not quite. I think what I sent above is the most simple way to break
>> it. Your case won't because there are no quals to prune with, so
>> run-time pruning is never attempted.
>
> Well, I hadn't bothered to put in the extra code needed to have a qual
> to prune with, but my point remains that it doesn't seem like the current
> Append code is prepared to cope with an Append that contains partitions
> of more than one top-level partitioned table.

Besides the partition pruning code, were there other aspects of Append
that you saw as incompatible with mixed hierarchies?

> I just had a thought that might lead to a nice solution to that, or
> might be totally crazy.  What if we inverted the sense of the bitmaps
> that track partition pruning state, so that instead of a bitmap of
> valid partitions that need to be scanned, we had a bitmap of pruned
> partitions that we know we don't need to scan?  (The indexes of this
> bitmap would be subplan indexes not partition indexes.)  With this
> representation, it doesn't matter if some of the Append's children
> are not supposed to participate in pruning; they just don't ever get
> added to the bitmap of what to skip.  It's also fairly clear, I think,
> how to handle independent pruning rules for different top-level tables
> that are being unioned together: just OR the what-to-skip bitmaps.
> But there may be some reason why this isn't workable.

I think it would be less efficient. A common case and one that I very
much would like to make as fast as possible is when all but a single
partition is pruned. Doing the opposite sounds like more effort would
need to be expended to get the subplans that we do need to scan.

I don't really see the way it works now as a huge problem to overcome
in pruning. We'd just need a list of the subplans that don't belong to
the hierarchy and tag them on to the matches found in
ExecFindInitialMatchingSubPlans and ExecFindMatchingSubPlans, as sketched
below. The bigger issue to overcome is the mixed flattened list of
partition RT indexes in partitioned_rels.  Perhaps having a list of
Lists for partitioned_rels could be used to resolve that. The question
is more how to solve this for PG11. Do we need that?
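
For illustration, with the existing representation that extra step could
be as simple as something like the following (just a sketch, not what the
attached patch does):

    #include "postgres.h"
    #include "nodes/bitmapset.h"
    #include "nodes/pg_list.h"

    /*
     * Sketch only: add the subplan indexes that never take part in pruning
     * back into the set of valid subplans computed for the hierarchy.
     */
    static Bitmapset *
    add_unprunable_subplans(Bitmapset *valid_subplans, List *other_subplans)
    {
        ListCell   *lc;

        foreach(lc, other_subplans)
            valid_subplans = bms_add_member(valid_subplans, lfirst_int(lc));

        return valid_subplans;
    }

With the inverted what-to-skip representation, the equivalent step would
just be to never add those subplans to the skip bitmap.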

I think we'll very soon be wanting to have ordered partition scans
where something like:

create table listp(a int) partition by list(a);
create index on listp(a);
create table listp1 partition of listp for values in (1);
create table listp2 partition of listp for values in (2);

and

select * from listp order by a;

would be possible with an Append and Index Scan, rather than having a
MergeAppend or Sort. In that case we won't want mixed partition
hierarchies in the Append subplans. Although, perhaps that would just
mean we wouldn't pull up AppendPaths which have PathKeys.

I have written and attached a patch to stop flattening partitioned
tables into the UNION ALL parent's paths, meaning we can now get nested
Append and MergeAppend paths.

I've added Robert too as I know he was the committer of partitioning
and parallel Append. Maybe he has a view on what should be done about
this? Is not flattening the paths a problem?

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


dont_flatten_append_paths_for_partitions.patch
Description: Binary data


Re: [bug fix] ECPG: freeing memory for pgtypes crashes on Windows

2018-06-11 Thread Thomas Munro
On Mon, Mar 26, 2018 at 6:07 PM, Kyotaro HORIGUCHI
 wrote:
> At Sun, 25 Mar 2018 22:15:52 +0900, "MauMau"  wrote in 
> 
>> And thank you for your review.  All modifications are done.
>
> Thank you for the new version. I marked this as "Ready for
> Committer" with one change.

Hi Tsunakawa-san and Horiguchi-san,

I would like to get this committed soon, but I'm not sure about
backpatching -- see below.  Here's a new version with a couple of
minor changes:

1.  Small changes to the documentation.

2.  I see that you added #include  to pgtypes_numeric.h and
pgtypes_interval.h.  They have functions returning char *.  I
experimented with removing those and including  directly in
the .pgc test files, but then I saw why you did that: it changes all
the line numbers in the expected output files making the patch much
larger.  No strong objection there.  But I figured we should at least
be consistent, so I added #include  to pgtypes_timestamp.h
and pgtypes_date.h (they also declare functions that return new
strings).

3.  It seemed unnecessary to declare the new function in extern.h
*and* in pgtypes.h.  I added #include "pgtypes.h" to common.c instead,
and a comment to introduce the section of that file that defines
functions from pgtypes.h.

4.  I found a place in src/interfaces/ecpg/test/sql/sqlda.pgc where
you missed a free() call.

Are these changes OK?

Why is it OK that we do "free(outp_sqlda)" having got that pointer
from a statement like "exec sql fetch 1 from mycur1 into descriptor
outp_sqlda"?  Isn't that memory allocated by libecpg.dll?

The files in this area all seem to lack our standard boilerplate,
copyright message, blaming everything on UC Berkeley etc.  Your new
header fits the existing pattern, so I can't really complain about
that.

The examples in the documentation call a bunch of _to_asc() functions
and then don't free the result, which is a leak, but that isn't your
patch's fault.  (Example: printf("numeric = %s\n",
PGTYPESnumeric_to_asc(num, 0))).

One question I have is whether it is against project policy to
backport a new file and a new user-facing function.  It doesn't seem
like a great idea, because it means that client code would need to
depend on a specific patch release.  Even if we found an existing
header to declare this function in, you'd still need to do conditional
magic before using it.  So... how inconvenient do you think it would
be if we did this for 11+ only?  Does anyone see a better way to do an
API evolution here?  It's not beautiful but I suppose one work-around
that end-user applications could use while they are stuck on older
releases might be something like this, in their own tree, conditional
on major version:

#define PGTYPESchar_free(x) PGTYPESdate_free((date *)(x))
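
For example, client code could carry something along these lines, where
HAVE_PGTYPESCHAR_FREE is a made-up flag the application defines itself
when building against a release that ships PGTYPESchar_free() (and
whatever header the final patch declares it in):

    #include <stdio.h>
    #include <pgtypes_numeric.h>
    #include <pgtypes_date.h>

    /* Fall back to the cast hack on releases without PGTYPESchar_free(). */
    #ifndef HAVE_PGTYPESCHAR_FREE
    #define PGTYPESchar_free(x) PGTYPESdate_free((date *) (x))
    #endif

    static void
    print_numeric(numeric *num)
    {
        char       *str = PGTYPESnumeric_to_asc(num, 0);

        printf("numeric = %s\n", str);
        PGTYPESchar_free(str);  /* freed by the library, not our own CRT */
    }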

-- 
Thomas Munro
http://www.enterprisedb.com


pgtypes_freemem_v5.patch
Description: Binary data


Re: Fix slot's xmin advancement and subxact's lost snapshots in decoding.

2018-06-11 Thread Antonin Houska
Arseny Sher  wrote:
> Please see detailed description of the issues, tests which reproduce
> them and fixes in the attached patch.

I've played with your tests and gdb for a while, both w/o and with your
patch. I think I understand both problems, and I could not come up with a
simpler way to fix them.

One comment about the coding: since you introduce a new transaction list and
it's sorted by LSN, I think you should make the function AssertTXNLsnOrder()
more generic and use it to check the by_base_snapshot_lsn list too. For
example call it after the list insertion and deletion in
ReorderBufferPassBaseSnapshot().
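
Something like the following is what I have in mind -- only a sketch, and
the dlist member name and the accessor are assumptions on my side, not
taken from your patch:

    #include "postgres.h"
    #include "lib/ilist.h"
    #include "replication/reorderbuffer.h"

    typedef XLogRecPtr (*txn_lsn_getter) (ReorderBufferTXN *txn);

    /*
     * Sketch of a generalized ordering check: walk any LSN-sorted list of
     * transactions and assert that the chosen LSN never decreases.
     */
    static void
    AssertTXNListLsnOrder(dlist_head *list, txn_lsn_getter get_lsn)
    {
    #ifdef USE_ASSERT_CHECKING
        XLogRecPtr  prev_lsn = InvalidXLogRecPtr;
        dlist_iter  iter;

        dlist_foreach(iter, list)
        {
            /* "node" is assumed to be the dlist_node member linking the list */
            ReorderBufferTXN *txn = dlist_container(ReorderBufferTXN, node,
                                                    iter.cur);

            Assert(prev_lsn <= get_lsn(txn));
            prev_lsn = get_lsn(txn);
        }
    #endif
    }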

Also, I think it would do no harm to split the diff in two, although both
fixes use the by_base_snapshot_lsn field.

-- 
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26, A-2700 Wiener Neustadt
Web: https://www.cybertec-postgresql.com



Re: Performance regression with PostgreSQL 11 and partitioning

2018-06-11 Thread Ashutosh Bapat
On Sat, Jun 9, 2018 at 12:22 AM, Robert Haas  wrote:
> On Fri, Jun 8, 2018 at 2:10 PM, Tom Lane  wrote:
>>> I don't understand this complaint.  Before, the code took one
>>> AppendRelInfo, and according to you, it was clear what was supposed to
>>> happen.  Now it takes an array of AppendRelInfos and, according to
>>> you, it's completely unclear.  Yet that seems, to me at least, to be a
>>> straightforward generalization.  If 1 AppendRelInfo is an adequate
>>> specification of one translations, why are N AppendRelInfos not an
>>> adequate specification of N translations?
>>
>> Because the relationships between the transforms are unclear.  Are we
>> supposed to apply those N transformations to the expression in sequence?
>> It doesn't look to me like that's what the code does.  I think --- I might
>> be wrong --- that the code is relying on the transformations to be
>> non-overlapping, that is a change made by any one of them cannot be
>> further affected by another one.  This is, however, undocumented.
>
> OK, I see.  I guess that possible confusion didn't occur to me both
> because I was looking at the code, which I knew wouldn't handle
> overlapping transformations, and also because the intended use was for
> partition-wise join, where the problem can't come up because we're
> translating from the toplevel RTIs for the join to the set of RTIs
> appropriate for one child-join.  But, it's certainly fine to add a
> comment.

That didn't occur to me either.

Here's a patch with comments and assertions added to check for
non-overlapping AppendRelInfos. The assertions check that the same parent
or child doesn't appear more than once across the AppendRelInfos,
neither as a parent nor as a child.
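
Roughly the kind of check I mean, as a sketch (the real patch's wording of
the assertions may differ):

    #include "postgres.h"
    #include "nodes/bitmapset.h"
    #include "nodes/relation.h"

    /*
     * Sketch: assert that no relid appears more than once across the given
     * AppendRelInfos, neither as a parent nor as a child.
     */
    static void
    check_appinfos_dont_overlap(AppendRelInfo **appinfos, int nappinfos)
    {
    #ifdef USE_ASSERT_CHECKING
        Bitmapset  *seen = NULL;
        int         i;

        for (i = 0; i < nappinfos; i++)
        {
            Assert(!bms_is_member(appinfos[i]->parent_relid, seen));
            seen = bms_add_member(seen, appinfos[i]->parent_relid);

            Assert(!bms_is_member(appinfos[i]->child_relid, seen));
            seen = bms_add_member(seen, appinfos[i]->child_relid);
        }

        bms_free(seen);
    #endif
    }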

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


pg_arl_v2.patch
Description: Binary data


Secured and customizable PLPython and PLR on Postgresql

2018-06-11 Thread Hubert Zhang
Hi all,

As you know, PLPython and PLR are untrusted languages, which means only a DBA
can create these UDFs (very inconvenient). Moreover, it's also hard to
supply a user-specific Python/R environment for different data scientists.

We are working on an open source project to make PLPython and PLR trusted
and customizable. The project is called PLContainer, which uses a Docker
container as the sandbox to prevent a malicious user from breaking the
PostgreSQL database or even the whole host machine.

There is now a basic version which supports PLContainer on PG9.6
(thanks to krait007 and markwort for the contribution). We still have a lot
of issues to resolve before it is production ready and can be shared with
more people. [Github
umbrella project](https://github.com/greenplum-db/plcontainer/projects/1)

If you are interested, feel free to try it. Your suggestions and
contributions will be appreciated.
-- 
Thanks

Hubert Zhang


Re: SHOW ALL does not honor pg_read_all_settings membership

2018-06-11 Thread Laurenz Albe
Alvaro Herrera wrote:
> On 2018-Mar-01, Laurenz Albe wrote:
> 
> > I noticed that commit 25fff40798fc4ac11a241bfd9ab0c45c085e2212 forgot
> > to teach SHOW ALL to show all GUCs when the user belongs to 
> > pg_read_all_settings.
> > 
> > Patch attached; I think this should be backpatched.
> 
> Done, with the further fixes downthread.  Thanks!

Thank you!

Laurenz Albe