Re: Should we warn against using too many partitions?

2019-05-23 Thread Amit Langote
On 2019/05/24 13:37, David Rowley wrote:
> I've attached 3 patches of what I think should go into master, pg11, and pg10.

Thanks for the updated patches.

In pg11 and pg10 patches, I see this text:

+  Whether using table inheritance or native partitioning, hierarchies

Maybe it would be better to use the word "declarative" instead of "native",
if only to be consistent; the neighboring paragraphs use "declarative".

Thanks,
Amit





Re: Excessive memory usage in multi-statement queries w/ partitioning

2019-05-23 Thread Amit Langote
Hi,

On 2019/05/23 4:15, Andreas Seltenreich wrote:
> …but when doing it on the parent relation, even 100 statements are
> enough to exceed the limit:
> 
> ,
> | $ psql -c "$(yes update t set c=c where c=6 \; | head -n 100)"
> | FEHLER:  Speicher aufgebraucht
> | DETAIL:  Failed on request of size 200 in memory context "MessageContext".
> `
> 
> The memory context dump shows plausible values except for the MessageContext:
> 
> TopMemoryContext: 124336 total in 8 blocks; 18456 free (11 chunks); 105880 used
>   [...]
>   MessageContext: 264241152 total in 42 blocks; 264 free (0 chunks); 264240888 used
>   [...]

As David Rowley said, planning that query hundreds of times under a single
MessageContext is not something that will end well on 11.3, because even a
single planning run takes up tons of memory that is only released when
MessageContext is reset.

> Maybe some tactically placed pfrees or avoiding putting redundant stuff
> into MessageContext can relax the situation?

I too have had similar thoughts on the matter.  If the planner built
all its subsidiary data structures in its own private context (or tree
of contexts) which is reset once a plan for a given query is built and
passed on, then there wouldn't be an issue of all of that subsidiary
memory leaking into MessageContext.  However, the problem may really be
that we're subjecting the planner to use cases that it perhaps wasn't
designed to perform equally well under -- running it many times while
handling the same message.  It is worsened by the fact that the query in
question is something that ought to have been documented as not well
supported by the planner; David has posted a documentation patch for
that [1].  PG 12 has alleviated the situation to a large degree, so you
won't see the OOM occurring there for this particular query, though
unfortunately that isn't true for all queries.
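
Just to illustrate the kind of arrangement I mean -- this is only a rough
sketch, not existing code; the functions are real (PG 11-era names), but the
exact signatures may differ across versions:

    /* "query" is the rewritten Query for the current statement */
    MemoryContext plancxt = AllocSetContextCreate(CurrentMemoryContext,
                                                  "per-statement planner context",
                                                  ALLOCSET_DEFAULT_SIZES);
    MemoryContext oldcxt = MemoryContextSwitchTo(plancxt);
    PlannedStmt *plan = pg_plan_query(query, CURSOR_OPT_PARALLEL_OK, NULL);

    /* keep only the finished plan, copied back into the caller's context */
    MemoryContextSwitchTo(oldcxt);
    plan = copyObject(plan);

    /* everything else the planner allocated goes away here */
    MemoryContextDelete(plancxt);

Whether copying the finished plan back out is actually cheaper than just
letting MessageContext grow is of course the part that would need measuring.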

With that said, we may want to look into the planner sometimes hoarding
memory, especially when planning complex queries involving partitions.
AFAIK, one of the reasons partition-wise join and aggregate are turned
off by default is that their planning consumes a lot of CPU and memory,
partly because the planner doesn't actively release the memory of its
subsidiary structures, and maybe partly because of inferior ways in
which partitions and partitioning properties are represented in the
planner.  If there's evidence that it's the latter, maybe we should fix
that before pondering any sophisticated planner memory management.

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/CAKJS1f-2rx%2BE9mG3xrCVHupefMjAp1%2BtpczQa9SEOZWyU7fjEA%40mail.gmail.com





Re: Should we warn against using too many partitions?

2019-05-23 Thread David Rowley
On Fri, 24 May 2019 at 14:04, Amit Langote
 wrote:
> The latest patch on the thread linked from this CF entry (a modified
> version of your patch sent by Justin Pryzby) looks good to me.  Why not
> post it on this thread and link this one to the CF entry?

I'm not much of a fan of that patch:

+ 
+  When using table inheritance, partition hierarchies with more than a few
+  hundred partitions are not recommended.  Larger partition hierarchies may
+  incur long planning time, and, in the case of UPDATE
+  and DELETE, excessive memory usage.  When inheritance
+  is used, see also the limitations described in
+  .
+ 

I'm a bit confused about this paragraph.  It introduces itself as
talking about table inheritance, then uses the word "partition" in
various places. I think that can be dropped.  The final sentence
throws me off as it tries to reduce the scope to only inheritance, but
as far as I understand that was already the scope of the paragraph,
unless of course "table inheritance" is not the same as "inheritance".
Without any insider knowledge on it, I've no idea if this
UPDATE/DELETE issue affects native partitioning too.

+ 
+  When using declarative partitioning, the overhead of query planning
+  is directly related to the number of unpruned partitions.  Planning is
+  generally fast with small numbers of unpruned partitions, even in
+  partition hierarchies containing many thousands of partitions.  However,
+  long planning time will be incurred by large partition hierarchies if
+  partition pruning is not possible during the planning phase.
+ 

This should really mention the excessive memory usage when many
partitions survive pruning.

I've attached 3 patches of what I think should go into master, pg11, and pg10.

> Or maybe, make
> this an open item, because we should update documentation back to v11?

I'll add this to the open items list since it includes master, and
shift the CF entry to point to this thread.

Authors are Robert Haas and Justin Pryzby, who I've included in the email.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


docs_partitioning_warning_master_v2.patch
Description: Binary data


docs_partitioning_warning_pg11_v2.patch
Description: Binary data


docs_partitioning_warning_pg10_v2.patch
Description: Binary data


Re: Remove page-read callback from XLogReaderState.

2019-05-23 Thread Kyotaro HORIGUCHI
Thank you for looking at this, Antonin.

At Wed, 22 May 2019 13:53:23 +0200, Antonin Houska  wrote in 
<25494.1558526...@spoje.net>
> Kyotaro HORIGUCHI  wrote:
> 
> > Hello. Thank you for looking this.
> > ...
> > Yeah, I'll register this, maybe the week after next week.
> 
> I've checked the new version. One more thing I noticed now is that XLR_STATE.j
> is initialized to zero, either by XLogReaderAllocate() which zeroes the whole
> reader state, or later by XLREAD_RESET. This special value then needs to be
> handled here:
> 
> #define XLR_SWITCH()  \
>   do {\
>   if ((XLR_STATE).j)  \
>   goto *((void *) (XLR_STATE).j); \
>   XLR_CASE(XLR_INIT_STATE);   \
>   } while (0)
> 
> I think it's better to set the label always to (&_INIT_STATE) so that
> XLR_SWITCH can perform the jump unconditionally.

I thought the same, but did not do that since labels are
function-scoped, so they cannot be referred to outside the function
that defines them.

I moved the state variable from XLogReaderState into a function-scoped
static variable.  That's not a problem since the functions are
non-reentrant in the first place.
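
To illustrate why, here is a minimal stand-alone sketch of the pattern (it
relies on GCC's computed-goto extension, like the XLR_* macros do, but it is
not the actual xlogreader code):

    static int
    resumable_read(void)
    {
        static void *resume = NULL;     /* function-scoped, like the labels */

        if (resume)
            goto *resume;

        /* first call: equivalent of XLR_INIT_STATE */
        resume = &&wait_for_data;
        return 0;                       /* yield back to the caller */

    wait_for_data:
        /* later call: the caller has supplied more data by now */
        resume = NULL;                  /* next call starts from the top */
        return 1;
    }

Because the label's address is only meaningful inside this function, the
resume point cannot be preset from XLogReaderAllocate(); keeping it in a
function-local static seemed like the smallest change.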

> Attached is also an (unrelated) comment fix proposal.

Sounds reasonable.  I found another typo, "acutually", there.

-    int32       readLen;       /* bytes acutually read, must be larger than
+    int32       readLen;       /* bytes acutually read, must be at least

I fixed it with other typos found.

v3-0001: Changed macros as suggested.

v3-0002, 0004: Fixed comments.  Adjusted for the changes to the
 macros.  Renamed state symbols.

v3-0003, 0005-0010: No substantial change from v2.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 036b1a50ad79ebfde990e178bead9ba03c753d02 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Wed, 17 Apr 2019 14:15:16 +0900
Subject: [PATCH 02/10] Make ReadPageInternal a state machine

This patch set aims to remove the read_page callback from
XLogReaderState.  This is done in two steps, and this patch does the
first one: it turns ReadPageInternal, which currently calls the
read_page callback, into a state machine, and moves its call sites into
XLogReadRecord and the other direct callers of the callback.
---
 src/backend/access/transam/xlog.c  |  15 ++-
 src/backend/access/transam/xlogreader.c| 180 +++--
 src/backend/access/transam/xlogutils.c |   8 +-
 src/backend/replication/logical/logicalfuncs.c |   6 +-
 src/backend/replication/walsender.c|  10 +-
 src/bin/pg_rewind/parsexlog.c  |  17 ++-
 src/bin/pg_waldump/pg_waldump.c|   8 +-
 src/include/access/xlogreader.h|  39 --
 src/include/access/xlogutils.h |   2 +-
 src/include/replication/logicalfuncs.h |   2 +-
 10 files changed, 177 insertions(+), 110 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 1c7dd51b9f..d1a29fe5cd 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -885,7 +885,7 @@ static bool InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
 static int	XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
 		 int source, bool notfoundOk);
 static int	XLogFileReadAnyTLI(XLogSegNo segno, int emode, int source);
-static int	XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr,
+static void	XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr,
 		 int reqLen, XLogRecPtr targetRecPtr, char *readBuf,
 		 TimeLineID *readTLI);
 static bool WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
@@ -11507,7 +11507,7 @@ CancelBackup(void)
  * XLogPageRead() to try fetching the record from another source, or to
  * sleep and retry.
  */
-static int
+static void
 XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
 			 XLogRecPtr targetRecPtr, char *readBuf, TimeLineID *readTLI)
 {
@@ -11566,7 +11566,8 @@ retry:
 			readLen = 0;
 			readSource = 0;
 
-			return -1;
+			xlogreader->readLen = -1;
+			return;
 		}
 	}
 
@@ -11661,7 +11662,8 @@ retry:
 		goto next_record_is_invalid;
 	}
 
-	return readLen;
+	xlogreader->readLen = readLen;
+	return;
 
 next_record_is_invalid:
 	lastSourceFailed = true;
@@ -11675,8 +11677,9 @@ next_record_is_invalid:
 	/* In standby-mode, keep trying */
 	if (StandbyMode)
 		goto retry;
-	else
-		return -1;
+
+	xlogreader->readLen = -1;
+	return;
 }
 
 /*
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index baf04b9f1f..f7499aff5e 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -118,8 +118,8 @@ static bool ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
   XLogRecPtr PrevRecPtr, XLogRecord *record, bool randAccess);
 static bool 

Re: Top-N sorts in EXPLAIN, row count estimates, and parallelism

2019-05-23 Thread David Rowley
On Fri, 24 May 2019 at 10:44, Peter Geoghegan  wrote:
>
> On Thu, May 23, 2019 at 3:31 PM Tom Lane  wrote:
> > Given the way that's implemented, I doubt that we can report it
> > reliably in EXPLAIN.
>
> Does it have to be totally reliable?
>
> cost_sort() costs sorts as top-N heapsorts. While we cannot make an
> iron-clad guarantee that it will work out that way from within
> tuplesort.c, that doesn't seem like it closes off the possibility of
> more informative EXPLAIN output. For example, can't we at least report that
> the tuplesort will be "bounded" within EXPLAIN, indicating that we
> intend to attempt to sort using a top-N heap sort (i.e. we'll
> definitely do it that way if there is sufficient work_mem)?

I think this really needs more of a concrete proposal.  Remember that
LIMIT/OFFSET don't need to be constants; they could be a Param or some
return value from a subquery, so the bound might not be known until
after executor startup, and EXPLAIN is never going to know about that.

Perhaps something to be tagged onto the Sort path in grouping_planner
if preprocess_limit() managed to come up with a value.  double does
not seem like the perfect choice for a bound to show in EXPLAIN and
int64 could wrap for very high LIMIT + OFFSET values.  Showing an
approximate value in EXPLAIN seems like it might be a source of future
bug reports.  Perhaps if we did it, we could just set it to -1
(unknown) if LIMIT + OFFSET happened to wrap an int64.  Implementing
that would seem to require adding a new field for that in SortPath,
Sort and SortState, plus all the additional code for passing the value
over.  We'd have to then hope nobody used the field for anything
important in the future.
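
For what it's worth, the clamping itself would be trivial; something like
this (purely illustrative -- the function name and semantics are made up
here, nothing of the sort exists):

    /* hypothetical helper: -1 means "bound unknown", per the above */
    static int64
    sort_bound_for_explain(int64 limit_rows, int64 offset_rows)
    {
        if (limit_rows < 0 || offset_rows < 0)
            return -1;                          /* not known until run time */
        if (limit_rows > PG_INT64_MAX - offset_rows)
            return -1;                          /* LIMIT + OFFSET would wrap */
        return limit_rows + offset_rows;
    }

The work is all in plumbing the value through SortPath, Sort and SortState,
not in computing it.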

After that, what would we do with it in EXPLAIN?  Always show "Bound:
", if it's not -1?

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: Zedstore - compressed in-core columnar storage

2019-05-23 Thread Ajin Cherian
Hi Ashwin,

- how to pass the "column projection list" to table AM? (as stated in
  initial email, currently we have modified table am API to pass the
  projection to AM)

We were working on a similar columnar storage using pluggable APIs; one
idea that we thought of was to modify the scan slot based on the targetlist
so that it has only the relevant columns in the scan descriptor.  This way
the table AMs are passed a slot with only the relevant columns in the
descriptor.  Today we do something similar for the result slot using
ExecInitResultTypeTL(); the idea is to do the same for the scan tuple slot.
So somewhere after creating the scan slot using ExecInitScanTupleSlot(),
call a table AM handler API to modify the scan tuple slot based on the
targetlist; a probable name for the new table AM handler would be
exec_init_scan_slot_tl(PlanState *planstate, TupleTableSlot *slot).

This way the scan AM handlers like getnextslot are passed a slot that only
has the relevant columns in the scan descriptor.  One issue, though, is
that beginscan is not passed the slot, so if some memory allocation needs
to be done based on the column list, it cannot be done in beginscan.
Let me know what you think; a rough sketch of the callback is below.
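
To be concrete, the callback we were imagining would look roughly like the
following.  This is entirely hypothetical; neither the member nor the call
site exists today:

    /* hypothetical extra member in TableAmRoutine */
    void        (*exec_init_scan_slot_tl) (PlanState *planstate,
                                           TupleTableSlot *slot);

    /* possible call site, just after ExecInitScanTupleSlot() in a scan node */
    if (relation->rd_tableam->exec_init_scan_slot_tl)
        relation->rd_tableam->exec_init_scan_slot_tl(&scanstate->ss.ps,
                                                     scanstate->ss.ss_ScanTupleSlot);

An AM that doesn't care about projection would simply leave the callback
NULL and keep the full scan descriptor.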


regards,
Ajin Cherian
Fujitsu Australia

On Thu, May 23, 2019 at 3:56 PM Ashwin Agrawal  wrote:

>
> We (Heikki, me and Melanie) are continuing to build Zedstore. Wish to
> share the recent additions and modifications. Attaching a patch
> with the latest code. Link to github branch [1] to follow
> along. The approach we have been leaning towards is to build required
> functionality, get passing the test and then continue to iterate to
> optimize the same. It's still work-in-progress.
>
> Sharing the details now, as have reached our next milestone for
> Zedstore. All table AM API's are implemented for Zedstore (except
> compute_xid_horizon_for_tuples, seems need test for it first).
>
> Current State:
>
> - A new type of item added to Zedstore "Array item", to boost
>   compression and performance. Based on Konstantin's performance
>   experiments [2] and inputs from Tomas Vodra [3], this is
>   added. Array item holds multiple datums, with consecutive TIDs and
>   the same visibility information. An array item saves space compared
>   to multiple single items, by leaving out repetitive UNDO and TID
>   fields. An array item cannot mix NULLs and non-NULLs. So, those
>   experiments should result in improved performance now. Inserting
>   data via COPY creates array items currently. Code for insert has not
>   been modified from last time. Making singleton inserts or insert
>   into select, performant is still on the todo list.
>
> - Now we have a separate and dedicated meta-column btree alongside
>   rest of the data column btrees. This special or first btree for
>   meta-column is used to assign TIDs for tuples, track the UNDO
>   location which provides visibility information. Also, this special
>   btree, which always exists, helps to support zero-column tables
>   (which can be a result of ADD COLUMN DROP COLUMN actions as
>   well). Plus, having meta-data stored separately from data, helps to
>   get better compression ratios. And also helps to further simplify
>   the overall design/implementation as for deletes just need to edit
>   the meta-column and avoid touching the actual data btrees. Index
>   scans can just perform visibility checks based on this meta-column
>   and fetch required datums only for visible tuples. For tuple locks
>   also just need to access this meta-column only. Previously, every
>   column btree used to carry the same undo pointer. Thus visibility
>   check could be potentially performed, with the past layout, using
>   any column. But considering overall simplification new layout
>   provides it's fine to give up on that aspect. Having dedicated
>   meta-column highly simplified handling for add columns with default
>   and null values, as this column deterministically provides all the
>   TIDs present in the table, which can't be said for any other data
>   columns due to default or null values during add column.
>
> - Free Page Map implemented. The Free Page Map keeps track of unused
>   pages in the relation. The FPM is also a b-tree, indexed by physical
>   block number. To be more compact, it stores "extents", i.e. block
>   ranges, rather than just blocks, when possible. An interesting paper [4]
> on
>   how modern filesystems manage space acted as a good source for ideas.
>
> - Tuple locks implemented
>
> - Serializable isolation handled
>
> - With "default_table_access_method=zedstore"
>   - 31 out of 194 failing regress tests
>   - 10 out of 86 failing isolation tests
> Many of the current failing tests are due to plan differences, like
> Index scans selected for zedstore over IndexOnly scans, as zedstore
> doesn't yet have visibility map. I am yet to give a thought on
> index-only scans. Or plan diffs due to table size differences between
> heap and zedstore.
>
> Next few milestones we wish to hit 

Re: initdb recommendations

2019-05-23 Thread Stephen Frost
Greetings,

* Tom Lane (t...@sss.pgh.pa.us) wrote:
> "Jonathan S. Katz"  writes:
> > For now I have left in the password based method to be scram-sha-256 as
> > I am optimistic about the support across client drivers[1] (and FWIW I
> > have an implementation for crystal-pg ~60% done).
> 
> > However, this probably means we would need to set the default password
> > encryption guc to "scram-sha-256" which we're not ready to do yet, so it
> > may be moot to leave it in.
> 
> > So, thinking out loud about that, we should probably use "md5" and once
> > we decide to make the encryption method "scram-sha-256" by default, then
> > we update the recommendation?
> 
> Meh.  If we're going to break things, let's break them.  Set it to
> scram by default and let people who need to cope with old clients
> change the default.  I'm tired of explaining that MD5 isn't actually
> insecure in our usage ...

+many.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: initdb recommendations

2019-05-23 Thread Tom Lane
"Jonathan S. Katz"  writes:
> For now I have left in the password based method to be scram-sha-256 as
> I am optimistic about the support across client drivers[1] (and FWIW I
> have an implementation for crystal-pg ~60% done).

> However, this probably means we would need to set the default password
> encryption guc to "scram-sha-256" which we're not ready to do yet, so it
> may be moot to leave it in.

> So, thinking out loud about that, we should probably use "md5" and once
> we decide to make the encryption method "scram-sha-256" by default, then
> we update the recommendation?

Meh.  If we're going to break things, let's break them.  Set it to
scram by default and let people who need to cope with old clients
change the default.  I'm tired of explaining that MD5 isn't actually
insecure in our usage ...

regards, tom lane




Re: Minor typos and copyright year slippage

2019-05-23 Thread Tom Lane
Thomas Munro  writes:
> Thanks, pushed.  There are also a few 2018 copyright messages in .po
> files but I understand that those are managed with a different
> workflow.

Right.  I'm not sure what the copyright-maintenance process is for the
.po files, but in any case the .po files in our gitmaster repo are
downstream from where that would need to happen.  There's no point
in editing them here.

regards, tom lane




Re: Should we warn against using too many partitions?

2019-05-23 Thread Amit Langote
Hi David,

On 2019/05/23 18:02, David Rowley wrote:
> Over on [1] I raised a concern about the lack of any warning in our
> documents to inform users that they might not want to use thousands of
> partitions.  More recently there's [2], also suffering from OOM using
> 100 partitions.  Perhaps there's more to this, but the planner using
> a lot of memory planning updates and deletes to partitioned tables
> does seem to be a surprise to many people.
> 
> I had hoped we could get something in the documents sooner rather than
> later about this. Probably the v12 patch will need to be adjusted now
> that the memory consumption will be reduced when many partitions are
> pruned, but I still think v12 needs to have some sort of warning in
> there.
> 
> https://commitfest.postgresql.org/23/2065/

The latest patch on the thread linked from this CF entry (a modified
version of your patch sent by Justin Pryzby) looks good to me.  Why not
post it on this thread and link this one to the CF entry?  Or maybe, make
this an open item, because we should update documentation back to v11?

Thanks,
Amit





Re: nitpick about poor style in MergeAttributes

2019-05-23 Thread Mark Dilger
On Thu, May 23, 2019 at 5:24 PM Michael Paquier  wrote:
>
> On Thu, May 23, 2019 at 08:27:19AM -0700, Mark Dilger wrote:
> > On Thu, May 23, 2019 at 7:54 AM Tom Lane  wrote:
> >> Are we sure that's not just a newly-introduced bug, ie it has not
> >> been tested in cases where the tlist could become empty?  My first
> >> thought would be to assign the list pointer value back as per usual
> >> coding convention, not to double down on the assumption that this
> >> was well-considered code.
> >
> > I don't think that is disputed.  I was debating between assigning
> > it back and also asserting that it is not NIL vs. assigning it back
> > and elog/ereporting if it is NIL.  Of course, this is assuming the
> > code was designed with the expectation that the list can never
> > become empty.  If you think it might become empty, and that the
> > surrounding code can handle that sensibly, then perhaps we
> > need neither the assertion nor the elog/ereport, though we still
> > need the assignment.
>
> Looking closer, this code is not new as of v12.  We have that since
> e7b3349 which has introduced CREATE TABLE OF.  Anyway, I think that
> assigning the result of list_delete_cell and adding an assertion like
> in the attached are saner things to do.  This code scans each entry in
> the list and removes columns with duplicate names, so we should never
> finish with an empty list as we will in the first case always merge
> down to at least one column.  That's rather a nit, but I guess that
> this is better than the previous code which assumed that silently?

I like it better because it makes static analysis of the code easier,
and because, if anybody ever changed list_delete_cell to return a
different list object in more cases than just when the list becomes
completely empty, this call site would otherwise be silently wrong.

Thanks for the patch!

mark




Re: Read-only access to temp tables for 2PC transactions

2019-05-23 Thread Michael Paquier
On Thu, May 23, 2019 at 08:54:59AM -0700, Andres Freund wrote:
> On 2019-05-23 12:36:09 +0100, Simon Riggs wrote:
>> The ONLY case where this matters is if someone does a PREPARE and then
>> starts doing other work on the session. Which makes no sense in the normal
>> workflow of a session. I'm sure there are tests that do that, but those
>> tests are unrepresentative of sensible usage.
> 
> That's extremely common.
> 
> There's no way we can forbid using session after 2PC unconditionally,
> it'd break most users of 2PC.

This does not break Postgres-XC or XL, as internally a COMMIT
involving multiple write nodes issues a set of PREPARE TRANSACTION
commands followed by an immediate COMMIT PREPARED, all transparent to
the user, so Simon's point looks sensible from this angle.  However, I
very much agree with Andres that it is very common to have PREPARE and
COMMIT PREPARED issued from different sessions.  I am not much into the
details of XA-compliant drivers, but I think that having us lose this
property would be the source of many complaints.
--
Michael


signature.asc
Description: PGP signature


Re: nitpick about poor style in MergeAttributes

2019-05-23 Thread Michael Paquier
On Thu, May 23, 2019 at 08:27:19AM -0700, Mark Dilger wrote:
> On Thu, May 23, 2019 at 7:54 AM Tom Lane  wrote:
>> Are we sure that's not just a newly-introduced bug, ie it has not
>> been tested in cases where the tlist could become empty?  My first
>> thought would be to assign the list pointer value back as per usual
>> coding convention, not to double down on the assumption that this
>> was well-considered code.
> 
> I don't think that is disputed.  I was debating between assigning
> it back and also asserting that it is not NIL vs. assigning it back
> and elog/ereporting if it is NIL.  Of course, this is assuming the
> code was designed with the expectation that the list can never
> become empty.  If you think it might become empty, and that the
> surrounding code can handle that sensibly, then perhaps we
> need neither the assertion nor the elog/ereport, though we still
> need the assignment.

Looking closer, this code is not new as of v12.  We have that since
e7b3349 which has introduced CREATE TABLE OF.  Anyway, I think that
assigning the result of list_delete_cell and adding an assertion like
in the attached are saner things to do.  This code scans each entry in
the list and removes columns with duplicate names, so we should never
finish with an empty list as we will in the first case always merge
down to at least one column.  That's rather a nit, but I guess that
this is better than the previous code which assumed that silently?
--
Michael
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index c9b8857d30..67ef86b9cc 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -2088,7 +2088,13 @@ MergeAttributes(List *schema, List *supers, char relpersistence,
 	coldef->cooked_default = restdef->cooked_default;
 	coldef->constraints = restdef->constraints;
 	coldef->is_from_type = false;
-	list_delete_cell(schema, rest, prev);
+	schema = list_delete_cell(schema, rest, prev);
+
+	/*
+	 * As two elements are merged and one is removed, we
+	 * should never finish with an empty list.
+	 */
+	Assert(schema != NIL);
 }
 else
 	ereport(ERROR,


signature.asc
Description: PGP signature


Re: initdb recommendations

2019-05-23 Thread Jonathan S. Katz
On 5/23/19 6:47 PM, Jonathan S. Katz wrote:
> On 5/23/19 12:54 PM, Peter Eisentraut wrote:
>> On 2019-04-06 20:08, Noah Misch wrote:
>>>>> I think we should just change the defaults.  There is a risk of warning
>>>>> fatigue.  initdb does warn about this, so anyone who cared could have
>>>>> gotten the information.
>>>>>
>>>>
>>>> I've been suggesting that for years, so definite strong +1 for doing that.
>>>
>>> +1
>>
>> To recap, the idea here was to change the default authentication methods
>> that initdb sets up, in place of "trust".
>>
>> I think the ideal scenario would be to use "peer" for local and some
>> appropriate password method (being discussed elsewhere) for host.
> 
> +1.
> 
>> Looking through the buildfarm, I gather that the only platforms that
>> don't support peer are Windows, AIX, and HP-UX.  I think we can probably
>> figure out some fallback or alternative default for the latter two
>> platforms without anyone noticing.  But what should the defaults be on
>> Windows?  It doesn't have local sockets, so the lack of peer wouldn't
>> matter.  But is it OK to default to a password method, or would that
>> upset people particularly?
> 
> +1 for password method. Definitely better than trust :)

Attached is v2 of the patch.

For now I have left in the password based method to be scram-sha-256 as
I am optimistic about the support across client drivers[1] (and FWIW I
have an implementation for crystal-pg ~60% done).

However, this probably means we would need to set the default password
encryption guc to "scram-sha-256" which we're not ready to do yet, so it
may be moot to leave it in.

So, thinking out loud about that, we should probably use "md5" and once
we decide to make the encryption method "scram-sha-256" by default, then
we update the recommendation?

Thanks,

Jonathan
From d610f4e575b6ee634b94dc6cb9125c4dbaedc305 Mon Sep 17 00:00:00 2001
From: "Jonathan S. Katz" 
Date: Fri, 5 Apr 2019 12:02:40 -0400
Subject: [PATCH] Add a warning about the client authentication defaults that
 initdb provides.

This also provides advice on how to securely set up the initial client connection
configuration, and removes the section below the directory setup that explains
similar steps. This information should live where it is explained how initdb is
first called, anyway.
---
 doc/src/sgml/runtime.sgml | 46 +-
 1 file changed, 25 insertions(+), 21 deletions(-)

diff --git a/doc/src/sgml/runtime.sgml b/doc/src/sgml/runtime.sgml
index e053e2ee34..040aacf87f 100644
--- a/doc/src/sgml/runtime.sgml
+++ b/doc/src/sgml/runtime.sgml
@@ -85,6 +85,31 @@
described in the previous section.
   
 
+  
+
+  By default initdb sets up trust
+  client authentication for connecting to the database. This is not
+  recommended on multi-user systems where you do not trust all users, nor if
+  the database server will be made accessible to remote systems.
+
+
+  We recommend using the -W, --pwprompt,
+  or --pwfile flags to assign a password to the database
+  superuser, and to override the pg_hba.conf default
+  generation using -auth-local peer for local connections,
+  (except on Windows, use -auth-local scram-sha-256 as
+  peer authentication is not supported) and
+  -auth-host scram-sha-256 for remote connections. See
+   for more information on client
+  authentication methods.
+
+
+  If installing PostgreSQL from a distribution, we recommend you validate
+  your initially generated pg_hba.conf file to ensure
+  it meets your operational requirements.
+
+  
+
   

 As an alternative to the -D option, you can set
@@ -155,27 +180,6 @@ postgres$ initdb -D /usr/local/pgsql/data
for directories and 0640 for files.
   
 
-  
-   However, while the directory contents are secure, the default
-   client authentication setup allows any local user to connect to the
-   database and even become the database superuser. If you do not
-   trust other local users, we recommend you use one of
-   initdb's -W, --pwprompt
-   or --pwfile options to assign a password to the
-   database superuser.
- password
- of the superuser
-   
-   Also, specify -A md5 or
-   -A password so that the default trust authentication
-   mode is not used; or modify the generated pg_hba.conf
-   file after running initdb, but
-   before you start the server for the first time. (Other
-   reasonable approaches include using peer authentication
-   or file system permissions to restrict connections. See  for more information.)
-  
-
   
initdb also initializes the default
   localelocale for the database cluster.
-- 
2.14.3 (Apple Git-98)



signature.asc
Description: OpenPGP digital signature


Re: Minor typos and copyright year slippage

2019-05-23 Thread Thomas Munro
On Thu, May 23, 2019 at 1:55 PM Michael Paquier  wrote:
> On Thu, May 23, 2019 at 01:28:45PM +1200, Thomas Munro wrote:
> > Here are some tiny things I noticed in passing.
>
> Good catches.  And you have spotted all the blank spots for the
> copyright notices.

Thanks, pushed.  There are also a few 2018 copyright messages in .po
files but I understand that those are managed with a different
workflow.

-- 
Thomas Munro
https://enterprisedb.com




Re: pg_dump throwing "column number -1 is out of range 0..36" on HEAD

2019-05-23 Thread Andres Freund
Hi,

On 2019-05-24 08:32:29 +0900, Michael Paquier wrote:
> On Thu, May 23, 2019 at 03:31:30PM -0700, Andres Freund wrote:
> > Well, I think the approach of duplicating code all over is a bad idea,
> > and the fix is many times too big. But it's better than not fixing.
> 
> Well, I can see why the current solution is not perfect, but we have
> been doing that for some time now, and redesigning that part has a
> much larger impact than a single column.  I have committed the initial
> fix now.

That argument would hold some sway if there weren't a number of cases
doing it differently in the tree already:

    if (i_checkoption == -1 || PQgetisnull(res, i, i_checkoption))
        tblinfo[i].checkoption = NULL;
    else
        tblinfo[i].checkoption = pg_strdup(PQgetvalue(res, i, i_checkoption));

    if (PQfnumber(res, "protrftypes") != -1)
        protrftypes = PQgetvalue(res, 0, PQfnumber(res, "protrftypes"));
    else
        protrftypes = NULL;

    if (PQfnumber(res, "proparallel") != -1)
        proparallel = PQgetvalue(res, 0, PQfnumber(res, "proparallel"));
    else
        proparallel = NULL;

    if (i_proparallel != -1)
        proparallel = PQgetvalue(res, 0, PQfnumber(res, "proparallel"));
    else
        proparallel = NULL;

And no, I don't buy the consistency argument. Continuing to do redundant
work just because we always have isn't better than having one useful
and one redundant approach. And yes, a full-blown redesign would be
better, but that doesn't get harder by having the -1 checks.

- Andres




Re: pg_dump throwing "column number -1 is out of range 0..36" on HEAD

2019-05-23 Thread Michael Paquier
On Thu, May 23, 2019 at 03:31:30PM -0700, Andres Freund wrote:
> Well, I think the approach of duplicating code all over is a bad idea,
> and the fix is many times too big. But it's better than not fixing.

Well, I can see why the current solution is not perfect, but we have
been doing that for some time now, and redesigning that part has a
much larger impact than a single column.  I have committed the initial
fix now.  We can always break the wheel later on in 13~.
--
Michael


signature.asc
Description: PGP signature


Re: Inconsistency between table am callback and table function names

2019-05-23 Thread Andres Freund
Hi,

On 2019-05-14 12:11:46 -0700, Ashwin Agrawal wrote:
> Thank you. Please find the patch to rename the agreed functions. It would
> be good to make all consistent instead of applying exception to three
> functions but seems no consensus on it. Given table_ api are new, we could
> modify them leaving systable_ ones as is, but as objections left that as is.

I've pushed a slightly modified version (rebase, some additional
newlines due to the longer function names) now. Thanks for the patch!

Greetings,

Andres Freund




No mention of no CIC support for partitioned index in docs

2019-05-23 Thread David Rowley
It appears there is no mention of the lack of support for CREATE INDEX
CONCURRENTLY on partitioned indexes in the documents.

Added in the attached.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


docs_no_cic_for_part_indexes.patch
Description: Binary data


Re: initdb recommendations

2019-05-23 Thread Jonathan S. Katz
On 5/23/19 12:54 PM, Peter Eisentraut wrote:
> On 2019-04-06 20:08, Noah Misch wrote:
>>>> I think we should just change the defaults.  There is a risk of warning
>>>> fatigue.  initdb does warn about this, so anyone who cared could have
>>>> gotten the information.

>>>
>>> I've been suggesting that for years, so definite strong +1 for doing that.
>>
>> +1
> 
> To recap, the idea here was to change the default authentication methods
> that initdb sets up, in place of "trust".
> 
> I think the ideal scenario would be to use "peer" for local and some
> appropriate password method (being discussed elsewhere) for host.

+1.

> Looking through the buildfarm, I gather that the only platforms that
> don't support peer are Windows, AIX, and HP-UX.  I think we can probably
> figure out some fallback or alternative default for the latter two
> platforms without anyone noticing.  But what should the defaults be on
> Windows?  It doesn't have local sockets, so the lack of peer wouldn't
> matter.  But is it OK to default to a password method, or would that
> upset people particularly?

+1 for password method. Definitely better than trust :)

Jonathan



signature.asc
Description: OpenPGP digital signature


Re: Question about BarrierAttach spinlock

2019-05-23 Thread Mark Dilger
On Thu, May 23, 2019 at 3:44 PM Thomas Munro  wrote:
>
> On Fri, May 24, 2019 at 4:10 AM Mark Dilger  wrote:
> > In src/backend/storage/ipc/barrier.c, BarrierAttach
> > goes to the bother of storing the phase before
> > releasing the spinlock, and then returns the phase.
> >
> > In nodeHash.c, ExecHashTableCreate ignores the
> > phase returned by BarrierAttach, and then immediately
> > calls BarrierPhase to get the phase that it just ignored.
> > I don't know that there is anything wrong with this, but
> > if the phase can be retrieved after the spinlock is
> > released, why hold the spinlock extra long in
> > BarrierAttach?
> >
> > Just asking
>
> Well spotted.  I think you're right, and we could release the spinlock
> a nanosecond earlier.  It must be safe to move that assignment, for
> the reason explained in the comment of BarrierPhase(): after we
> release the spinlock, we are attached, and the phase cannot advance
> without us.  I will contemplate moving that for v13 on principle.
>
> As for why ExecHashTableCreate() calls BarrierAttach(build_barrier)
> and then immediately calls BarrierPhase(build_barrier), I suppose I
> could remove the BarrierAttach() line and change the BarrierPhase()
> call to BarrierAttach(), though I think that'd be slightly harder to
> follow.  I suppose I could introduce a variable phase.

Thanks for the explanation!

mark




Re: ClosePipeStream failure ignored in pg_import_system_collations

2019-05-23 Thread Tom Lane
Mark Dilger  writes:
> On Thu, May 23, 2019 at 3:23 PM Tom Lane  wrote:
>> The concrete case where that's an issue, I think, is that "locale -a"
>> fails, possibly after outputting a few locale names.  The only report
>> we get about that is a failure indication from ClosePipeStream.
>> As things stand we just silently push on, creating no or a few collations.
>> With a check, we'd error out ... causing initdb to fail altogether.
>> Maybe that's an overreaction; I'm not sure.  Perhaps the right
>> thing is just to issue a warning?  But ignoring it completely
>> seems bad.

> Another option is to retry the "locale -a" call, perhaps after sleeping
> a short while, but I have no idea how likely a second (or third...) call
> to "locale -a" is to succeed if the prior call failed, mostly because I
> don't have a clear idea why it would fail the first time.

I doubt that retrying would be of any value; in a resource-exhaustion
situation you might as well just redo the whole initdb.  The main case
I can think of where you'd get a hard failure is "/usr/bin/locale not
installed".  I have no idea whether there are any platforms where that
would be a likely situation.  On my Linux machines it seems to be part of
glibc-common, so there's basically 0 chance ... but I can imagine that
other platforms with a more stripped-down mentality might allow it to
not be present.

regards, tom lane




Re: Question about BarrierAttach spinlock

2019-05-23 Thread Thomas Munro
On Fri, May 24, 2019 at 4:10 AM Mark Dilger  wrote:
> In src/backend/storage/ipc/barrier.c, BarrierAttach
> goes to the bother of storing the phase before
> releasing the spinlock, and then returns the phase.
>
> In nodeHash.c, ExecHashTableCreate ignores the
> phase returned by BarrierAttach, and then immediately
> calls BarrierPhase to get the phase that it just ignored.
> I don't know that there is anything wrong with this, but
> if the phase can be retrieved after the spinlock is
> released, why hold the spinlock extra long in
> BarrierAttach?
>
> Just asking

Well spotted.  I think you're right, and we could release the spinlock
a nanosecond earlier.  It must be safe to move that assignment, for
the reason explained in the comment of BarrierPhase(): after we
release the spinlock, we are attached, and the phase cannot advance
without us.  I will contemplate moving that for v13 on principle.
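
For the record, the move would amount to something like this (an untested
sketch against barrier.c, shown only to make the ordering concrete):

    int
    BarrierAttach(Barrier *barrier)
    {
        int         phase;

        SpinLockAcquire(&barrier->mutex);
        ++barrier->participants;
        SpinLockRelease(&barrier->mutex);

        /*
         * Reading the phase outside the spinlock is OK for the same reason
         * BarrierPhase() is OK: once attached, the phase cannot advance
         * without us.
         */
        phase = barrier->phase;

        return phase;
    }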

As for why ExecHashTableCreate() calls BarrierAttach(build_barrier)
and then immediately calls BarrierPhase(build_barrier), I suppose I
could remove the BarrierAttach() line and change the BarrierPhase()
call to BarrierAttach(), though I think that'd be slightly harder to
follow.  I suppose I could introduce a variable phase.

-- 
Thomas Munro
https://enterprisedb.com




Re: Top-N sorts in EXPLAIN, row count estimates, and parallelism

2019-05-23 Thread Peter Geoghegan
On Thu, May 23, 2019 at 3:31 PM Tom Lane  wrote:
> Given the way that's implemented, I doubt that we can report it
> reliably in EXPLAIN.

Does it have to be totally reliable?

cost_sort() costs sorts as top-N heapsorts. While we cannot make an
iron-clad guarantee that it will work out that way from within
tuplesort.c, that doesn't seem like it closes off the possibility of
more informative EXPLAIN output. For example, can't we at least report that
the tuplesort will be "bounded" within EXPLAIN, indicating that we
intend to attempt to sort using a top-N heap sort (i.e. we'll
definitely do it that way if there is sufficient work_mem)?

-- 
Peter Geoghegan




Re: ClosePipeStream failure ignored in pg_import_system_collations

2019-05-23 Thread Mark Dilger
On Thu, May 23, 2019 at 3:23 PM Tom Lane  wrote:
>
> Mark Dilger  writes:
> > I only see three invocations of ClosePipeStream in the sources.
> > In two of them, the return value is checked and an error is raised
> > if it failed.  In the third, the error (if any) is squashed.
>
> > I don't know if a pipe stream over "locale -a" could ever fail to
> > close, but it seems sensible to log an error if it does.
>
> The concrete case where that's an issue, I think, is that "locale -a"
> fails, possibly after outputting a few locale names.  The only report
> we get about that is a failure indication from ClosePipeStream.
> As things stand we just silently push on, creating no or a few collations.
> With a check, we'd error out ... causing initdb to fail altogether.
>
> Maybe that's an overreaction; I'm not sure.  Perhaps the right
> thing is just to issue a warning?  But ignoring it completely
> seems bad.

Another option is to retry the "locale -a" call, perhaps after sleeping
a short while, but I have no idea how likely a second (or third...) call
to "locale -a" is to succeed if the prior call failed, mostly because I
don't have a clear idea why it would fail the first time.

I would prefer initdb to fail, and fail loudly, rather than warning and
moving on, but I can imagine production systems which are set up
in a way where that would be painful.  Perhaps somebody with such
a setup will respond?

mark




Re: Top-N sorts in EXPLAIN, row count estimates, and parallelism

2019-05-23 Thread Andres Freund
Hi,

On 2019-05-23 18:31:43 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > It's also noticable that we preposterously assume that the sort actually
> > will return exactly the number of rows in the table, despite being a
> > top-n style sort.
> 
> In general, we report nodes below LIMIT with their execute-to-completion
> cost and rowcount estimates.  Doing differently for a top-N sort would
> be quite confusing, I should think.

I'm not quite sure that's true. I mean, a top-N sort wouldn't actually
necessarily return all the input rows, even if run to completion. Isn't
that a somewhat fundamental difference?

Greetings,

Andres Freund




Re: Why could GEQO produce plans with lower costs than the standard_join_search?

2019-05-23 Thread Tom Lane
Donald Dong  writes:
> Perhaps the cheapest-total-cost should not be the best/only choice
> for fitness?

Well, really the GEQO code should be thrown out and rewritten from
the ground up ... but that hasn't quite gotten done yet.

regards, tom lane




Re: pg_dump throwing "column number -1 is out of range 0..36" on HEAD

2019-05-23 Thread Andres Freund
Hi,

On 2019-05-23 09:11:33 +0900, Michael Paquier wrote:
> On Wed, May 22, 2019 at 02:39:54PM -0400, Tom Lane wrote:
> > Andres Freund  writes:
> >> Well, if we explicitly have to check for -1, it's not really an error of
> >> omission for everything. Yes, we could forget returning the amname in a
> >> newer version of the query, but given that we usually just forward copy
> >> the query that's not that likely.
> > 
> > No, the concerns I have are about (1) failure to insert the extra return
> > column into all branches where it's needed; (2) some unexpected run-time
> > problem causing the PGresult to not have the expected column.
> 
> Using a -1 check is not something I find much helpful, because this
> masks the real problem that some queries may not have the output they
> expect.

I don't buy this, at all. The likelihood of introducing failures by
having to modify a lot of queries nobody runs is much higher than what
we gain by the additional "checks". If this were something on the type
system level, where the compiler would detect the error, even without
running the query: Yea, ok. But it's not.


> Yes.  I don't think that this is completely wrong.  So, are there any
> objections if I just apply the patch at the top of this thread and fix
> the issue?

Well, I think the approach of duplicating code all over is a bad idea,
and the fix is many times too big. But it's better than not fixing.

Greetings,

Andres Freund




Re: Top-N sorts in EXPLAIN, row count estimates, and parallelism

2019-05-23 Thread Tom Lane
Andres Freund  writes:
> Right now we don't indicate that a top-n sort is going to be used in
> EXPLAIN, just EXPLAIN ANALYZE.

Given the way that's implemented, I doubt that we can report it
reliably in EXPLAIN.

> It's also noticable that we preposterously assume that the sort actually
> will return exactly the number of rows in the table, despite being a
> top-n style sort.

In general, we report nodes below LIMIT with their execute-to-completion
cost and rowcount estimates.  Doing differently for a top-N sort would
be quite confusing, I should think.

> That seems bad for costing of the parallel query,
> because it think we'll assume that costs tups * parallel_tuple_cost?

If the parallel query stuff doesn't understand about LIMIT, that's
a bug independently of top-N sorts.

regards, tom lane




Re: FullTransactionId changes are causing portability issues

2019-05-23 Thread Tom Lane
Andres Freund  writes:
> On 2019-05-23 14:05:19 -0400, Tom Lane wrote:
>> I think you're vastly overstating the case for refusing support for this.
>> Adding "#ifndef FRONTEND" to relevant headers isn't a huge amount of work
>> --- it's certainly far less of a problem than the Microsoft-droppings
>> we've had to put in in so many places.  The only real issue in my mind
>> is the lack of buildfarm support for detecting that we need to do so.

> Well, doing it for every single inline function is pretty annoying, just
> from a bulkiness perspective.

Oh, I certainly wasn't suggesting we do that.

> And figuring out exactly which inline
> function needs this isn't easy without something that actually shows the
> problem.

... which is why we need a buildfarm animal that shows the problem.
We had some, up until the C99 move.

If the only practical way to detect the issue were to run some ancient
platform or other, I'd tend to agree with you that it's not worth the
trouble.  But if we can spot it just by using -fkeep-inline-functions
on an animal or two, I think it's a reasonable thing to keep supporting
the case for a few years more.

regards, tom lane




Re: ClosePipeStream failure ignored in pg_import_system_collations

2019-05-23 Thread Tom Lane
Mark Dilger  writes:
> I only see three invocations of ClosePipeStream in the sources.
> In two of them, the return value is checked and an error is raised
> if it failed.  In the third, the error (if any) is squashed.

> I don't know if a pipe stream over "locale -a" could ever fail to
> close, but it seems sensible to log an error if it does.

The concrete case where that's an issue, I think, is that "locale -a"
fails, possibly after outputting a few locale names.  The only report
we get about that is a failure indication from ClosePipeStream.
As things stand we just silently push on, creating no or a few collations.
With a check, we'd error out ... causing initdb to fail altogether.

Maybe that's an overreaction; I'm not sure.  Perhaps the right
thing is just to issue a warning?  But ignoring it completely
seems bad.
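
The warning variant would be about a two-line change in
pg_import_system_collations, roughly like this (untested, just to show the
shape of it; locale_a_handle is the existing pipe handle in that function):

    /* currently the result of ClosePipeStream() is simply discarded */
    if (ClosePipeStream(locale_a_handle) != 0)
        ereport(WARNING,
                (errmsg("\"%s\" did not exit cleanly; some collations may not have been imported",
                        "locale -a")));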

regards, tom lane




Re: Remove useless associativity/precedence from parsers

2019-05-23 Thread Andres Freund
Hi,

On 2019-05-22 17:25:31 -0400, Tom Lane wrote:
> The easiest method is to fire up some client code that repeatedly
> does whatever you want to test, and then look at perf or oprofile
> or local equivalent to see where the time is going in the backend
> process.
> 
> For the particular case of stressing the parser, probably the
> best thing to look at is test cases that do a lot of low-overhead
> DDL, such as creating views.  You could do worse than just repeatedly
> sourcing our standard view files, like
>   src/backend/catalog/system_views.sql
>   src/backend/catalog/information_schema.sql
> (In either case, I'd suggest adapting the file to create all
> its objects in some transient schema that you can just drop.
> Repointing information_schema.sql to some other schema is
> trivial, just change a couple of commands at the top; and
> you could tweak system_views.sql similarly.  Also consider
> wrapping the whole thing in BEGIN; ... ROLLBACK; instead of
> spending time on an explicit DROP.)
> 
> Somebody else might know of a better test case but I'd try
> that first.

> There would still be a fair amount of I/O and catalog lookup
> overhead in a test run that way, but it would be an honest
> approximation of useful real-world cases.  If you're willing to
> put some blinders on and just micro-optimize the flex/bison
> code, you could set up a custom function that just calls that
> stuff.  I actually did that not too long ago; C code attached
> for amusement's sake.

FWIW, this is why I'd suggested the hack of EXPLAIN (PARSE_ANALYZE OFF,
OPTIMIZE OFF) a few years back. Right now it's hard to measure the
parser in isolation.

Greetings,

Andres Freund




Re: SQL-spec incompatibilities in similar_escape() and related stuff

2019-05-23 Thread Tom Lane
I wrote:
> I propose therefore that we leave similar_escape in place with its
> current behavior, as a compatibility measure for cases like this.
> Instead, invent two new strict functions, say
>   similar_to_escape(pattern)
>   similar_to_escape(pattern, escape)
> and change the parser and the implementation of SUBSTRING() to
> rely on these going forward.

> The net effect will be to make explicit "ESCAPE NULL" spec-compliant,
> and to get rid of the performance problem from inlining failure for
> substring().  All else is just doc clarifications.

Here's a proposed patch for that.  I think it's a bit too late to be
messing with this kind of thing for v12, so I'll add this to the
upcoming CF.

regards, tom lane

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index a79e7c0..7d4472a 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -4146,6 +4146,14 @@ cast(-44 as bit(12))   11010100

 

+According to the SQL standard, omitting ESCAPE
+means there is no escape character (rather than defaulting to a
+backslash), and a zero-length ESCAPE value is
+disallowed.  PostgreSQL's behavior in
+this regard is therefore slightly nonstandard.
+   
+
+   
 The key word ILIKE can be used instead of
 LIKE to make the match case-insensitive according
 to the active locale.  This is not in the SQL standard but is a
@@ -4280,9 +4288,27 @@ cast(-44 as bit(12))   11010100

 

-As with LIKE, a backslash disables the special meaning
-of any of these metacharacters; or a different escape character can
-be specified with ESCAPE.
+As with LIKE, a backslash disables the special
+meaning of any of these metacharacters.  A different escape character
+can be specified with ESCAPE, or the escape
+capability can be disabled by writing ESCAPE ''.
+   
+
+   
+According to the SQL standard, omitting ESCAPE
+means there is no escape character (rather than defaulting to a
+backslash), and a zero-length ESCAPE value is
+disallowed.  PostgreSQL's behavior in
+this regard is therefore slightly nonstandard.
+   
+
+   
+Another nonstandard extension is that following the escape character
+with a letter or digit provides access to the same escape sequences
+defined for POSIX regular expressions, below (see
+,
+, and
+).

 

diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 8311b1d..6462e13 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -13073,15 +13073,15 @@ a_expr:		c_expr	{ $$ = $1; }
 
 			| a_expr SIMILAR TO a_expr			%prec SIMILAR
 {
-	FuncCall *n = makeFuncCall(SystemFuncName("similar_escape"),
-			   list_make2($4, makeNullAConst(-1)),
+	FuncCall *n = makeFuncCall(SystemFuncName("similar_to_escape"),
+			   list_make1($4),
 			   @2);
 	$$ = (Node *) makeSimpleA_Expr(AEXPR_SIMILAR, "~",
    $1, (Node *) n, @2);
 }
 			| a_expr SIMILAR TO a_expr ESCAPE a_expr			%prec SIMILAR
 {
-	FuncCall *n = makeFuncCall(SystemFuncName("similar_escape"),
+	FuncCall *n = makeFuncCall(SystemFuncName("similar_to_escape"),
 			   list_make2($4, $6),
 			   @2);
 	$$ = (Node *) makeSimpleA_Expr(AEXPR_SIMILAR, "~",
@@ -13089,15 +13089,15 @@ a_expr:		c_expr	{ $$ = $1; }
 }
 			| a_expr NOT_LA SIMILAR TO a_expr	%prec NOT_LA
 {
-	FuncCall *n = makeFuncCall(SystemFuncName("similar_escape"),
-			   list_make2($5, makeNullAConst(-1)),
+	FuncCall *n = makeFuncCall(SystemFuncName("similar_to_escape"),
+			   list_make1($5),
 			   @2);
 	$$ = (Node *) makeSimpleA_Expr(AEXPR_SIMILAR, "!~",
    $1, (Node *) n, @2);
 }
 			| a_expr NOT_LA SIMILAR TO a_expr ESCAPE a_expr		%prec NOT_LA
 {
-	FuncCall *n = makeFuncCall(SystemFuncName("similar_escape"),
+	FuncCall *n = makeFuncCall(SystemFuncName("similar_to_escape"),
 			   list_make2($5, $7),
 			   @2);
 	$$ = (Node *) makeSimpleA_Expr(AEXPR_SIMILAR, "!~",
@@ -14323,9 +14323,9 @@ subquery_Op:
 			| NOT_LA ILIKE
 	{ $$ = list_make1(makeString("!~~*")); }
 /* cannot put SIMILAR TO here, because SIMILAR TO is a hack.
- * the regular expression is preprocessed by a function (similar_escape),
+ * the regular expression is preprocessed by a function (similar_to_escape),
  * and the ~ operator for posix regular expressions is used.
- *x SIMILAR TO y ->x ~ similar_escape(y)
+ *x SIMILAR TO y ->x ~ similar_to_escape(y)
  * this transformation is made on the fly by the parser upwards.
  * however the SubLink structure which handles any/some/all stuff
  * is not ready for such a thing.
diff --git a/src/backend/utils/adt/regexp.c b/src/backend/utils/adt/regexp.c
index 90a9197..3d38aef 100644
--- a/src/backend/utils/adt/regexp.c
+++ 

Re: Why could GEQO produce plans with lower costs than the standard_join_search?

2019-05-23 Thread Donald Dong
On May 23, 2019, at 10:43 AM, Tom Lane  wrote:
> Donald Dong  writes:
>> On May 23, 2019, at 9:02 AM, Tom Lane  wrote:
>>> (2) the paths you show do not correspond to the finally selected
>>> plans --- they aren't even the same shape.  (The Gathers are in
>>> different places, to start with.)  I'm not sure where you were
>>> capturing the path data, but it looks like you missed top-level
>>> parallel-aggregation planning, and that managed to find some
>>> plans that were marginally cheaper than the ones you captured.
>>> Keep in mind that GEQO only considers join planning, not
>>> grouping/aggregation.
> 
>> By looking at the captured costs, I thought GEQO found a better join
>> order than the standard_join_search. However, the final plan using
>> the join order produced by GEQO turns out to be more expansive. Would
>> that imply if GEQO sees a join order which is identical to the one
>> produced by standard_join_search, it will discard it since the
>> cheapest_total_path has a higher cost, though the final plan may be
>> cheaper?
> 
> I suspect what's really going on is that you're looking at the wrong
> paths.  The planner remembers more paths for each rel than just the
> cheapest-total-cost one, the reason being that total cost is not the
> only figure of merit.  The plan that is winning in the end, it looks
> like, is parallelized aggregation on top of a non-parallel join plan,
> but the cheapest_total_path uses up the opportunity for a Gather on
> a parallelized scan/join.  If we were just doing a scan/join and
> no aggregation, that path would have been the basis for the final
> plan, but it's evidently not being chosen here; the planner is going
> to some other scan/join path that is not parallelized.

It seems the paths in the final rel (path list, cheapest parameterized
paths, cheapest startup path, and cheapest total path) are all the same
path for this particular query (JOB/1a.sql). Am I missing anything?

Since the total cost of the cheapest-total-path is what GEQO is
currently using to evaluate the fitness (minimizing), I'm expecting
the cheapest total cost to measure how good a join order is. So it's
pretty surprising to me that a join order from standard_join_search,
with a higher cheapest-total-cost, ends up being better.

Perhaps the cheapest-total-cost should not be the best/only choice
for fitness?

Regards,
Donald Dong




Top-N sorts in EXPLAIN, row count estimates, and parallelism

2019-05-23 Thread Andres Freund
Hi,

Right now we don't indicate that a top-n sort is going to be used in
EXPLAIN, just EXPLAIN ANALYZE. That's imo suboptimal, because one quite
legitimately might want to know that before actually executing (as it
will make a huge amount of difference in the actual resource intensity
of the query).

postgres[28165][1]=# EXPLAIN (VERBOSE) SELECT * FROM hashes ORDER BY count DESC LIMIT 10;
                                        QUERY PLAN
------------------------------------------------------------------------------------------
 Limit  (cost=12419057.53..12419058.70 rows=10 width=45)
   Output: hash, count
   ->  Gather Merge  (cost=12419057.53..66041805.65 rows=459591466 width=45)
         Output: hash, count
         Workers Planned: 2
         ->  Sort  (cost=12418057.51..12992546.84 rows=229795733 width=45)
               Output: hash, count
               Sort Key: hashes.count DESC
               ->  Parallel Seq Scan on public.hashes  (cost=0.00..7452254.33 rows=229795733 width=45)
                     Output: hash, count
(10 rows)

postgres[28165][1]=# EXPLAIN (VERBOSE, ANALYZE) SELECT * FROM hashes ORDER BY count DESC LIMIT 10;
                                        QUERY PLAN
------------------------------------------------------------------------------------------
 Limit  (cost=12419057.53..12419058.70 rows=10 width=45) (actual time=115204.278..115205.024 rows=10 loops=1)
   Output: hash, count
   ->  Gather Merge  (cost=12419057.53..66041805.65 rows=459591466 width=45) (actual time=115204.276..115205.020 rows=10 loops=1)
         Output: hash, count
         Workers Planned: 2
         Workers Launched: 2
         ->  Sort  (cost=12418057.51..12992546.84 rows=229795733 width=45) (actual time=115192.189..115192.189 rows=7 loops=3)
               Output: hash, count
               Sort Key: hashes.count DESC
               Sort Method: top-N heapsort  Memory: 25kB
               Worker 0:  Sort Method: top-N heapsort  Memory: 25kB
               Worker 1:  Sort Method: top-N heapsort  Memory: 25kB
               Worker 0: actual time=115186.558..115186.559 rows=10 loops=1
               Worker 1: actual time=115186.540..115186.540 rows=10 loops=1
               ->  Parallel Seq Scan on public.hashes  (cost=0.00..7452254.33 rows=229795733 width=45) (actual time=0.080..90442.215 rows=183836589 loops=3)
                     Output: hash, count

Re: FullTransactionId changes are causing portability issues

2019-05-23 Thread Andres Freund
Hi,

On 2019-05-23 14:05:19 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > On 2019-05-23 13:46:15 -0400, Tom Lane wrote:
> >> I'm not really excited about adopting a position that PG will only
> >> build on GCC and clones thereof.
> 
> > That's not what I said though? Not supporting one compiler, on an OS
> > that's effectively not being developed anymore, with a pretty
> > indefensible behaviour, requiring not insignificant work by everyone,
> > isn't the same as standardizing on gcc. I mean, we obviously are going
> > to continue at the absolute very least gcc, llvm/clang and msvc.
> 
> I think you're vastly overstating the case for refusing support for this.
> Adding "#ifndef FRONTEND" to relevant headers isn't a huge amount of work
> --- it's certainly far less of a problem than the Microsoft-droppings
> we've had to put in in so many places.  The only real issue in my mind
> is the lack of buildfarm support for detecting that we need to do so.

Well, doing it for every single inline function is pretty annoying, just
from a bulkiness perspective. And figuring out exactly which inline
function needs this isn't easy without something that actually shows the
problem.


> Also relevant here is that you have no evidence for the assumption that
> these old Solaris compilers are the only live platform with the problem.
> Yeah, we wish our buildfarm covered everything of interest, but it does
> not.  Maybe, if we get to beta2 without any additional reports of build
> failures on beta1, that would be a bit of evidence that nobody else cares
> --- but we have no such evidence right now.  We certainly can't assume
> that any pre-v12 release provides evidence of that, because up till
> I retired pademelon, it was forcing us to keep this case supported.

I don't think I'm advocating for not fixing the issue we had for
Solaris, for 12. I just don't think this is a reasonable approach going
forward.

Greetings,

Andres Freund




ClosePipeStream failure ignored in pg_import_system_collations

2019-05-23 Thread Mark Dilger
Hackers,

I only see three invocations of ClosePipeStream in the sources.
In two of them, the return value is checked and an error is raised
if it failed.  In the third, the error (if any) is squashed.

I don't know if a pipe stream over "locale -a" could ever fail to
close, but it seems sensible to log an error if it does.
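
To be concrete, I'm imagining something like the following in
pg_import_system_collations (the handle variable name and message
wording here are just placeholders, not a worked-out patch):

	if (ClosePipeStream(locale_a_handle) != 0)
		ereport(ERROR,
				(errcode_for_file_access(),
				 errmsg("could not close pipe to external command: %m")));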

Thoughts?

mark




Re: fsync failure in durable_unlink ignored in xlog.c?

2019-05-23 Thread Andres Freund
Hi,

On 2019-05-23 14:06:57 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > On 2019-05-23 10:46:02 -0700, Mark Dilger wrote:
> >> Is this code safe against fsync failures?  If so, can I get an explanation
> >> that I might put into a code comment patch?
> 
> > What's the danger you're thinking of here? The issue with ignoring fsync
> > failures is that it could be the one signal about data corruption we get
> > for a write()/fsync() that failed - i.e. that durability cannot be
> > guaranteed. But we don't care about the file contents of those files.
> 
> Hmm ... if we don't care, why are we issuing an fsync at all?

Fair point. I think we do care in most of those cases, but we don't need
to trigger a PANIC. We'd be in trouble if e.g. an older tablespace map
file were to "revive" later. Looking at the cases:

- durable_unlink(TABLESPACE_MAP, DEBUG1) - we definitely care about a
  failure to unlink/remove, but *not* about ENOENT, because that's expected.

- /* Force installation: get rid of any pre-existing segment file */
  durable_unlink(path, DEBUG1);

  same.

- RemoveXlogFile():
  rc = durable_unlink(path, LOG);

  It's probably *tolerable* to fail here. Not sure why this is a
  durable_unlink(LOG) - doesn't make a ton of sense to me.

- durable_unlink(BACKUP_LABEL_FILE, ERROR);

  This is a "whaa, bad shit is happening" kind of situation. But
  crashing probably would make it even worse, because we'd restart
  assuming we're restoring from a backup.

ISTM that durable_unlink() for the first two cases really needs a
separate 'missing_ok' type argument. And that the reason for using
DEBUG1 here is solely an outcome of that not existing.
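
Roughly something like this is what I have in mind -- just a sketch,
with the exact signature and error wording made up here, and all
existing callers adjusted accordingly:

int
durable_unlink(const char *fname, int elevel, bool missing_ok)
{
	if (unlink(fname) < 0)
	{
		if (missing_ok && errno == ENOENT)
			return 0;
		ereport(elevel,
				(errcode_for_file_access(),
				 errmsg("could not remove file \"%s\": %m", fname)));
		return -1;
	}

	/* make the removal itself durable by fsyncing the parent directory */
	return fsync_parent_path(fname, elevel);
}

The first two call sites above would then pass missing_ok = true with a
more appropriate elevel than DEBUG1.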

Greetings,

Andres Freund




Re: fsync failure in durable_unlink ignored in xlog.c?

2019-05-23 Thread Mark Dilger
On Thu, May 23, 2019 at 11:07 AM Tom Lane  wrote:
>
> Andres Freund  writes:
> > On 2019-05-23 10:46:02 -0700, Mark Dilger wrote:
> >> Is this code safe against fsync failures?  If so, can I get an explanation
> >> that I might put into a code comment patch?
>
> > What's the danger you're thinking of here? The issue with ignoring fsync
> > failures is that it could be the one signal about data corruption we get
> > for a write()/fsync() that failed - i.e. that durability cannot be
> > guaranteed. But we don't care about the file contents of those files.
>
> Hmm ... if we don't care, why are we issuing an fsync at all?

Tom's question is about as far as my logic went.  It seemed the fsync
must be important, or the author of this code wouldn't have put such
an expensive operation in that spot; and if so, how can it be safe
to ignore whether the fsync returned an error?  Beyond that, I do not
have a specific danger in mind.

mark




Re: fsync failure in durable_unlink ignored in xlog.c?

2019-05-23 Thread Tom Lane
Andres Freund  writes:
> On 2019-05-23 10:46:02 -0700, Mark Dilger wrote:
>> Is this code safe against fsync failures?  If so, can I get an explanation
>> that I might put into a code comment patch?

> What's the danger you're thinking of here? The issue with ignoring fsync
> failures is that it could be the one signal about data corruption we get
> for a write()/fsync() that failed - i.e. that durability cannot be
> guaranteed. But we don't care about the file contents of those files.

Hmm ... if we don't care, why are we issuing an fsync at all?

regards, tom lane




Re: FullTransactionId changes are causing portability issues

2019-05-23 Thread Tom Lane
Andres Freund  writes:
> On 2019-05-23 13:46:15 -0400, Tom Lane wrote:
>> I'm not really excited about adopting a position that PG will only
>> build on GCC and clones thereof.

> That's not what I said though? Not supporting one compiler, on an OS
> that's effectively not being developed anymore, with a pretty
> indefensible behaviour, requiring not insignificant work by everyone,
> isn't the same as standardizing on gcc. I mean, we obviously are going
> to continue at the absolute very least gcc, llvm/clang and msvc.

I think you're vastly overstating the case for refusing support for this.
Adding "#ifndef FRONTEND" to relevant headers isn't a huge amount of work
--- it's certainly far less of a problem than the Microsoft-droppings
we've had to put in in so many places.  The only real issue in my mind
is the lack of buildfarm support for detecting that we need to do so.

Also relevant here is that you have no evidence for the assumption that
these old Solaris compilers are the only live platform with the problem.
Yeah, we wish our buildfarm covered everything of interest, but it does
not.  Maybe, if we get to beta2 without any additional reports of build
failures on beta1, that would be a bit of evidence that nobody else cares
--- but we have no such evidence right now.  We certainly can't assume
that any pre-v12 release provides evidence of that, because up till
I retired pademelon, it was forcing us to keep this case supported.

regards, tom lane




Re: fsync failure in durable_unlink ignored in xlog.c?

2019-05-23 Thread Andres Freund
Hi,

On 2019-05-23 10:46:02 -0700, Mark Dilger wrote:
> I have seen other lengthy discussions about fsync semantics, and if this
> question is being addressed there, this question might not be relevant.
> 
> Two calls to durable_unlink at log level DEBUG1 are ignoring the
> return value.  Other calls at ERROR and FATAL are likewise ignoring
> the return value, though those make perfect sense to me.  There may
> be a reason why logging a debug message inside durable_unlink and
> continuing along is safe, but it is not clear from the structure of the
> code.
> 
> In InstallXLogFileSegment, durable_unlink(path, DEBUG1) is called
> without the return value being checked followed by a call to
> durable_link_or_rename, and perhaps that second call works
> whether the durable_unlink succeeded or failed, but the logic of that is
> not at all clear.
> 
> In do_pg_stop_backup, durable_unlink(TABLESPACE_MAP, DEBUG1)
> is similarly called without the return value being checked.
> 
> This code appears to have been changed in
> 1b02be21f271db6bd3cd43abb23fa596fcb6bac3.
> 
> Is this code safe against fsync failures?  If so, can I get an explanation
> that I might put into a code comment patch?

What's the danger you're thinking of here? The issue with ignoring fsync
failures is that it could be the one signal about data corruption we get
for a write()/fsync() that failed - i.e. that durability cannot be
guaranteed. But we don't care about the file contents of those files.

Greetings,

Andres Freund




Re: FullTransactionId changes are causing portability issues

2019-05-23 Thread Andres Freund
Hi,

On 2019-05-23 13:46:15 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > On 2019-05-22 16:13:02 -0400, Tom Lane wrote:
> >> It doesn't sound like "use a newer compiler" is going to be a helpful
> >> answer there.
> 
> > Well, GCC is available on solaris, and IIRC not that hard to install
> > (isn't it just a 'pkg install gcc' or such?). Don't think we need to
> > invest a lot of energy fixing a compiler / OS combo that effectively
> > isn't developed further.
> 
> I'm not really excited about adopting a position that PG will only
> build on GCC and clones thereof.

That's not what I said though? Not supporting one compiler, on an OS
that's effectively not being developed anymore, with a pretty
indefensible behaviour, requiring not insignificant work by everyone,
isn't the same as standardizing on gcc. I mean, we obviously are going
to continue at the absolute very least gcc, llvm/clang and msvc.

Greetings,

Andres Freund




Re: Minimal logical decoding on standbys

2019-05-23 Thread Andres Freund
Hi,

On 2019-05-23 23:08:55 +0530, Amit Khandekar wrote:
> On Thu, 23 May 2019 at 21:29, Andres Freund  wrote:
> > On 2019-05-23 17:39:21 +0530, Amit Khandekar wrote:
> > > On Tue, 21 May 2019 at 21:49, Andres Freund  wrote:
> > > Yeah, I agree we should add such checks to minimize the possibility of
> > > reading logical records from a master that has insufficient wal_level.
> > > So to summarize :
> > > a. CheckLogicalDecodingRequirements() : Add Controlfile wal_level checks
> > > b. Call this function call in CreateInitDecodingContext() as well.
> > > c. While decoding XLOG_PARAMETER_CHANGE record, emit recovery conflict
> > > error if there is an existing logical slot.
> > >
> > > This made me think more of the race conditions. For instance, in
> > > pg_create_logical_replication_slot(), just after
> > > CheckLogicalDecodingRequirements and before actually creating the
> > > slot, suppose concurrently Controlfile->wal_level is changed from
> > > logical to replica.  So suppose a new slot does get created. Later the
> > > slot is read, so in pg_logical_slot_get_changes_guts(),
> > > CheckLogicalDecodingRequirements() is called where it checks
> > > ControlFile->wal_level value. But just before it does that,
> > > ControlFile->wal_level concurrently changes back to logical, because
> > > of replay of another param-change record. So this logical reader will
> > > think that the wal_level is sufficient, and will proceed to read the
> > > records, but those records are *before* the wal_level change, so these
> > > records don't have logical data.
> >
> > I don't think that's an actual problem, because there's no decoding
> > before the slot exists and CreateInitDecodingContext() has determined
> > the start LSN. And by that point the slot exists, slo
> > XLOG_PARAMETER_CHANGE replay can error out.
> 
> So between the start lsn and the lsn for
> parameter-change(logical=>replica) record, there can be some records ,
> and these don't have logical data. So the slot created will read from
> the start lsn, and proceed to read these records, before reading the
> parameter-change record.

I don't think that's possible. By the time CreateInitDecodingContext()
is called, the slot *already* exists (but in a state that'll cause it to
be throw away on error). But the restart point has not yet been
determined. Thus, if there is a XLOG_PARAMETER_CHANGE with a wal_level
change it can error out. And to handle the race of wal_level changing
between CheckLogicalDecodingRequirements() and the slot creation, we
recheck in CreateInitDecodingContext().

Think we might need to change ReplicationSlotReserveWal() to use the
replay, rather than the redo pointer for logical slots though.


> Can you re-write the below phrase please ? I suspect there is some
> letters missing there :
> "And by that point the slot exists, slo XLOG_PARAMETER_CHANGE replay
> can error out"

I think it's just one additional letter, namely s/slo/so/


> Are you saying we want to error out when the postgres replays the
> param change record and there is existing logical slot ? I thought you
> were suggesting earlier that it's the decoder.c code which should
> error out when reading the param-change record.

Yes, that's what I'm saying. See this portion of my previous email on
the topic:

On 2019-05-21 09:19:37 -0700, Andres Freund wrote:
> On 2019-04-16 12:27:46 +0530, Amit Khandekar wrote:
> > What I am thinking is :
> > In CheckLogicalDecodingRequirements(), besides checking wal_level,
> > also check ControlFile->wal_level when InHotStandby. I mean, when we
> > are InHotStandby, both wal_level and ControlFile->wal_level should be
> > >= WAL_LEVEL_LOGICAL. This will allow us to error out when using logical
> > slot when master has incompatible wal_level.
> 
> That still allows the primary to change wal_level after logical decoding
> has started, so we need the additional checks.
> 
> I'm not yet sure how to best deal with the fact that wal_level might be
> changed by the primary at basically all times. We would eventually get
> an error when logical decoding reaches the XLOG_PARAMETER_CHANGE. But
> that's not necessarily sufficient - if a primary changes its wal_level
> to lower, it could remove information logical decoding needs *before*
> logical decoding reaches the XLOG_PARAMETER_CHANGE record.
> 
> So I suspect we need conflict handling in xlog_redo's
> XLOG_PARAMETER_CHANGE case. If we there check against existing logical
> slots, we ought to be safe.
> 
> Therefore I think the check in CheckLogicalDecodingRequirements() needs
> to be something like:
> 
> if (RecoveryInProgress())
> {
> if (!InHotStandby)
> ereport(ERROR, "logical decoding on a standby required hot_standby to 
> be enabled");
> /*
>  * This check is racy, but whenever XLOG_PARAMETER_CHANGE indicates that
>  * wal_level has changed, we verify that there are no existin glogical
>  * replication slots. And to avoid races around creating a new slot,
>  * 

Re: FullTransactionId changes are causing portability issues

2019-05-23 Thread Tom Lane
Andres Freund  writes:
> On 2019-05-22 16:13:02 -0400, Tom Lane wrote:
>> It doesn't sound like "use a newer compiler" is going to be a helpful
>> answer there.

> Well, GCC is available on solaris, and IIRC not that hard to install
> (isn't it just a 'pkg install gcc' or such?). Don't think we need to
> invest a lot of energy fixing a compiler / OS combo that effectively
> isn't developed further.

I'm not really excited about adopting a position that PG will only
build on GCC and clones thereof.

regards, tom lane




fsync failure in durable_unlink ignored in xlog.c?

2019-05-23 Thread Mark Dilger
Hackers,

I have seen other lengthy discussions about fsync semantics, and if this
question is being addressed there, this question might not be relevant.

Two calls to durable_unlink at log level DEBUG1 are ignoring the
return value.  Other calls at ERROR and FATAL are likewise ignoring
the return value, though those make perfect sense to me.  There may
be a reason why logging a debug message inside durable_unlink and
continuing along is safe, but it is not clear from the structure of the
code.

In InstallXLogFileSegment, durable_unlink(path, DEBUG1) is called
without the return value being checked followed by a call to
durable_link_or_rename, and perhaps that second call works
whether the durable_unlink succeeded or failed, but the logic of that is
not at all clear.

In do_pg_stop_backup, durable_unlink(TABLESPACE_MAP, DEBUG1)
is similarly called without the return value being checked.

This code appears to have been changed in
1b02be21f271db6bd3cd43abb23fa596fcb6bac3.

Is this code safe against fsync failures?  If so, can I get an explanation
that I might put into a code comment patch?

mark




Re: Why could GEQO produce plans with lower costs than the standard_join_search?

2019-05-23 Thread Tom Lane
Donald Dong  writes:
> On May 23, 2019, at 9:02 AM, Tom Lane  wrote:
>> (2) the paths you show do not correspond to the finally selected
>> plans --- they aren't even the same shape.  (The Gathers are in
>> different places, to start with.)  I'm not sure where you were
>> capturing the path data, but it looks like you missed top-level
>> parallel-aggregation planning, and that managed to find some
>> plans that were marginally cheaper than the ones you captured.
>> Keep in mind that GEQO only considers join planning, not
>> grouping/aggregation.

> By looking at the captured costs, I thought GEQO found a better join
> order than the standard_join_search. However, the final plan using
> the join order produced by GEQO turns out to be more expansive. Would
> that imply if GEQO sees a join order which is identical to the one
> produced by standard_join_search, it will discard it since the
> cheapest_total_path has a higher cost, though the final plan may be
> cheaper?

I suspect what's really going on is that you're looking at the wrong
paths.  The planner remembers more paths for each rel than just the
cheapest-total-cost one, the reason being that total cost is not the
only figure of merit.  The plan that is winning in the end, it looks
like, is parallelized aggregation on top of a non-parallel join plan,
but the cheapest_total_path uses up the opportunity for a Gather on
a parallelized scan/join.  If we were just doing a scan/join and
no aggregation, that path would have been the basis for the final
plan, but it's evidently not being chosen here; the planner is going
to some other scan/join path that is not parallelized.

I haven't looked closely at whether the parallel-query hacking has
paid any attention to GEQO.  It's entirely likely that GEQO is still
choosing its join order on the basis of cheapest-total scan/join cost
without regard to parallelizability, which would lead to an apparently
better cost for the cheapest_total_path even though the path that
will end up being used is some other one.

regards, tom lane




Re: Minimal logical decoding on standbys

2019-05-23 Thread Amit Khandekar
On Thu, 23 May 2019 at 21:29, Andres Freund  wrote:
>
> Hi,
>
> On 2019-05-23 17:39:21 +0530, Amit Khandekar wrote:
> > On Tue, 21 May 2019 at 21:49, Andres Freund  wrote:
> > Yeah, I agree we should add such checks to minimize the possibility of
> > reading logical records from a master that has insufficient wal_level.
> > So to summarize :
> > a. CheckLogicalDecodingRequirements() : Add Controlfile wal_level checks
> > b. Call this function call in CreateInitDecodingContext() as well.
> > c. While decoding XLOG_PARAMETER_CHANGE record, emit recovery conflict
> > error if there is an existing logical slot.
> >
> > This made me think more of the race conditions. For instance, in
> > pg_create_logical_replication_slot(), just after
> > CheckLogicalDecodingRequirements and before actually creating the
> > slot, suppose concurrently Controlfile->wal_level is changed from
> > logical to replica.  So suppose a new slot does get created. Later the
> > slot is read, so in pg_logical_slot_get_changes_guts(),
> > CheckLogicalDecodingRequirements() is called where it checks
> > ControlFile->wal_level value. But just before it does that,
> > ControlFile->wal_level concurrently changes back to logical, because
> > of replay of another param-change record. So this logical reader will
> > think that the wal_level is sufficient, and will proceed to read the
> > records, but those records are *before* the wal_level change, so these
> > records don't have logical data.
>
> I don't think that's an actual problem, because there's no decoding
> before the slot exists and CreateInitDecodingContext() has determined
> the start LSN. And by that point the slot exists, slo
> XLOG_PARAMETER_CHANGE replay can error out.

So between the start lsn and the lsn for the
parameter-change (logical=>replica) record, there can be some records,
and these don't have logical data. So the slot created will read from
the start lsn, and proceed to read these records, before reading the
parameter-change record.

Can you re-write the below phrase please ? I suspect there is some
letters missing there :
"And by that point the slot exists, slo XLOG_PARAMETER_CHANGE replay
can error out"

Are you saying we want to error out when the postgres replays the
param change record and there is existing logical slot ? I thought you
were suggesting earlier that it's the decoder.c code which should
error out when reading the param-change record.



-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Fix link for v12

2019-05-23 Thread Euler Taveira
Hi,

I noticed that v12 release notes is referencing the wrong GUC. It
should be recovery_target_timeline instead of recovery_target_time.
Patch attached.


-- 
   Euler Taveira   Timbira -
http://www.timbira.com.br/
   PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento
From b9f7ad43f7b368678939e230fe86e0487d9e5a96 Mon Sep 17 00:00:00 2001
From: Euler Taveira 
Date: Thu, 23 May 2019 14:10:33 -0300
Subject: [PATCH] Fix link

The link should refer to recovery_target_timeline.
---
 doc/src/sgml/release-12.sgml | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/doc/src/sgml/release-12.sgml b/doc/src/sgml/release-12.sgml
index 90999410db..90aa013b75 100644
--- a/doc/src/sgml/release-12.sgml
+++ b/doc/src/sgml/release-12.sgml
@@ -157,7 +157,7 @@ Author: Peter Eisentraut 
  
 
  
-      Specifically, <xref linkend="guc-recovery-target-time"/> now
+      Specifically, <xref linkend="guc-recovery-target-timeline"/> now
   defaults to latest.  Previously, it defaulted
   to current.
  
@@ -1788,7 +1788,7 @@ Author: Peter Eisentraut 
 
   
    Add an explicit value of <literal>current</literal> for <xref
-   linkend="guc-recovery-target-time"/> (Peter Eisentraut)
+   linkend="guc-recovery-target-timeline"/> (Peter Eisentraut)
   
  
 
-- 
2.11.0



Re: FullTransactionId changes are causing portability issues

2019-05-23 Thread Andres Freund
Hi,

On 2019-05-22 16:13:02 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > On 2019-05-22 15:55:50 -0400, Tom Lane wrote:
> Per Bjorn's report:
> >> The compiler used in Sun Studio 12u1, very old and and I can try to
> >> upgrade and see if that helps.
> > I tried Sun Studio 12u2 and then a more drastic upgrade to Developer
> > Studio 12.5 but both had the same effect.
> 
> It doesn't sound like "use a newer compiler" is going to be a helpful
> answer there.

Well, GCC is available on solaris, and IIRC not that hard to install
(isn't it just a 'pkg install gcc' or such?). Don't think we need to
invest a lot of energy fixing a compiler / OS combo that effectively
isn't developed further.

Greetings,

Andres Freund




Re: Why could GEQO produce plans with lower costs than the standard_join_search?

2019-05-23 Thread Donald Dong
On May 23, 2019, at 9:02 AM, Tom Lane  wrote:
> 
> Donald Dong  writes:
>> On May 22, 2019, at 11:42 AM, Tom Lane  wrote:
>>> You're still asking us to answer hypothetical questions unsupported
>>> by evidence.  In what case does that really happen?
> 
>> I attached the query plan and debug_print_rel output for GEQO and
>> standard_join_search.
> 
>>       planstate->total_cost   cheapest_total_path
>> GEQO   54190.13                54239.03
>> STD    54179.02                54273.73
> 
>> Here I observe GEQO  produces a lower
>> cheapest_total_path->total_cost, but its planstate->total_cost is higher
>> than what standard_join_search produces.
> 
> Well,
> 
> (1) the plan selected by GEQO is in fact more expensive than
> the one found by the standard search.  Not by much --- as Andrew
> observes, this difference is less than what the planner considers
> "fuzzily the same" --- but nonetheless 54190.13 > 54179.02.
> 
> (2) the paths you show do not correspond to the finally selected
> plans --- they aren't even the same shape.  (The Gathers are in
> different places, to start with.)  I'm not sure where you were
> capturing the path data, but it looks like you missed top-level
> parallel-aggregation planning, and that managed to find some
> plans that were marginally cheaper than the ones you captured.
> Keep in mind that GEQO only considers join planning, not
> grouping/aggregation.
> 
> Andrew's point about fuzzy cost comparison is also a good one,
> though we needn't invoke it to explain these particular numbers.

Oh, that's very good to know! I captured the path at the end of the
join_search_hook. If I understood correctly, top-level
parallel-aggregation will be applied later, so GEQO is not taking it
into consideration during the join searching?

By looking at the captured costs, I thought GEQO found a better join
order than the standard_join_search. However, the final plan using
the join order produced by GEQO turns out to be more expensive. Would
that imply that if GEQO sees a join order which is identical to the one
produced by standard_join_search, it will discard it since the
cheapest_total_path has a higher cost, even though the final plan may
be cheaper?

Here is another query (JOB/27a.sql) which has more significant cost
differences:

        planstate->total_cost   cheapest_total_path
GEQO    343016.77               343016.75
STD     342179.13               344137.33

Regards,
Donald Dong




Re: initdb recommendations

2019-05-23 Thread Magnus Hagander
On Thu, May 23, 2019, 18:54 Peter Eisentraut <
peter.eisentr...@2ndquadrant.com> wrote:

> On 2019-04-06 20:08, Noah Misch wrote:
> >>> I think we should just change the defaults.  There is a risk of warning
> >>> fatigue.  initdb does warn about this, so anyone who cared could have
> >>> gotten the information.
> >>>
> >>
> >> I've been suggesting that for years, so definite strong +1 for doing
> that.
> >
> > +1
>
> To recap, the idea here was to change the default authentication methods
> that initdb sets up, in place of "trust".
>
> I think the ideal scenario would be to use "peer" for local and some
> appropriate password method (being discussed elsewhere) for host.
>
> Looking through the buildfarm, I gather that the only platforms that
> don't support peer are Windows, AIX, and HP-UX.  I think we can probably
> figure out some fallback or alternative default for the latter two
> platforms without anyone noticing.  But what should the defaults be on
> Windows?  It doesn't have local sockets, so the lack of peer wouldn't
> matter.  But is it OK to default to a password method, or would that
> upset people particularly?
>


I'm sure password would be fine there. It's what "everybody else" does
(well, SQL Server also does integrated security, but people are used to it).

/Magnus


Re: initdb recommendations

2019-05-23 Thread Peter Eisentraut
On 2019-04-06 20:08, Noah Misch wrote:
>>> I think we should just change the defaults.  There is a risk of warning
>>> fatigue.  initdb does warn about this, so anyone who cared could have
>>> gotten the information.
>>>
>>
>> I've been suggesting that for years, so definite strong +1 for doing that.
> 
> +1

To recap, the idea here was to change the default authentication methods
that initdb sets up, in place of "trust".

I think the ideal scenario would be to use "peer" for local and some
appropriate password method (being discussed elsewhere) for host.
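
Concretely, the pg_hba.conf that initdb generates would then look
something like this (with scram-sha-256 standing in for whatever
password method the other discussion settles on):

	# "local" is for Unix domain socket connections only
	local   all             all                                     peer
	# IPv4 local connections:
	host    all             all             127.0.0.1/32            scram-sha-256
	# IPv6 local connections:
	host    all             all             ::1/128                 scram-sha-256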

Looking through the buildfarm, I gather that the only platforms that
don't support peer are Windows, AIX, and HP-UX.  I think we can probably
figure out some fallback or alternative default for the latter two
platforms without anyone noticing.  But what should the defaults be on
Windows?  It doesn't have local sockets, so the lack of peer wouldn't
matter.  But is it OK to default to a password method, or would that
upset people particularly?

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: nitpick about useless floating point division in gimme_edge_table

2019-05-23 Thread Tom Lane
Mark Dilger  writes:
> Hackers,
> The return value of gimme_edge_table is not used anywhere in the
> core code, so far as I can see.  But the value is computed as

>   /* return average number of edges per index */
>   return ((float) (edge_total * 2) / (float) num_gene);

> which involves some floating point math.  I'm not sure that this matters
> much, but (1) it deceives a reader of this code into thinking that this
> calculation is meaningful, which it is not, and (2) gimme_edge_table is
> called inside a loop, so this is happening repeatedly, though admittedly
> that loop is perhaps not terribly large.

Hmm, probably there was use for that once upon a time, but I agree it's
dead code now.  Want to send a patch to change it to returns-void?

regards, tom lane




Re: Is it safe to ignore the return value of SPI_finish and SPI_execute?

2019-05-23 Thread Tom Lane
I wrote:
> It's not entirely clear to me whether we ought to change the return
> convention to be "returns void" or make it "always returns SPI_OK"
> for those functions where the return code becomes trivial.  The
> latter would avoid churn for external modules, but it seems not to
> have much other attractiveness.

Further thought about that --- it's clearly better in the long run
if we switch to "returns void" where possible.  We don't want to
encourage people to waste code space on dead error checks.  However,
doing that could be quite a PITA for people who are trying to maintain
extension code that works across multiple backend versions.

We could address that by providing compatibility macros, say per
this sketch:

extern void SPI_finish(void);
...

#ifdef BACKWARDS_COMPATIBLE_SPI_CALLS

#define SPI_finish() (SPI_finish(), SPI_OK_FINISH)
...

#endif

(This relies on non-recursive macro expansion, but that's been
standard since C89.)

The #ifdef stanza could be ripped out someday when all older branches
are out of support, but there wouldn't be any hurry.
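
To illustrate, extension code written like this would then compile and
behave sanely against both old and new branches (the macro name is just
a straw-man, of course):

	#define BACKWARDS_COMPATIBLE_SPI_CALLS
	#include "executor/spi.h"

	...

	if (SPI_finish() != SPI_OK_FINISH)
		elog(ERROR, "SPI_finish failed");

On new branches the macro makes the comparison trivially true, while on
old branches the call keeps its current meaning.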

regards, tom lane




Question about BarrierAttach spinlock

2019-05-23 Thread Mark Dilger
Hackers,

In src/backend/storage/ipc/barrier.c, BarrierAttach
goes to the bother of storing the phase before
releasing the spinlock, and then returns the phase.
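
From memory, the body is roughly this (paraphrased, so details may be
off):

	int
	BarrierAttach(Barrier *barrier)
	{
		int		phase;

		SpinLockAcquire(&barrier->mutex);
		++barrier->participants;
		phase = barrier->phase;
		SpinLockRelease(&barrier->mutex);

		return phase;
	}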

In nodeHash.c, ExecHashTableCreate ignores the
phase returned by BarrierAttach, and then immediately
calls BarrierPhase to get the phase that it just ignored.
I don't know that there is anything wrong with this, but
if the phase can be retrieved after the spinlock is
released, why hold the spinlock extra long in
BarrierAttach?

Just asking

mark




Re: Memory bug in dsnowball_lexize

2019-05-23 Thread Tom Lane
Mark Dilger  writes:
> On Thu, May 23, 2019 at 8:46 AM Tom Lane  wrote:
>> Mark Dilger  writes:
>>> In src/backend/snowball/dict_snowball.c, 'dsnowball_lexize'
>>> calls 'SN_set_current' and ignores the return value, thereby
>>> failing to notice the error, if any.

>> Hm.  This seems like possibly a bug, in that even if we cover the
>> malloc issue, there's no API guarantee that OOM is the only possible
>> reason for reporting failure.

> Ok, that sounds fair.  Since the memory is being palloc'd, I suppose
> it would be safe to just ereport when the return value is -1?

Yeah ... I'd just make it an elog really, since whatever it is
would presumably not be a user-facing error.

>> Fair complaint --- do you want to propose some new wording that
>> references what header.h does?

> Perhaps something along these lines?

Seems reasonable, please include in patch covering the other thing.

regards, tom lane




Re: Memory bug in dsnowball_lexize

2019-05-23 Thread Mark Dilger
On Thu, May 23, 2019 at 8:46 AM Tom Lane  wrote:
>
> Mark Dilger  writes:
> > In src/backend/snowball/libstemmer/utilities.c, 'create_s' uses
> > malloc (not palloc) to allocate memory, and on memory exhaustion
> > returns NULL rather than throwing an exception.
>
> Actually not, see macros in src/include/snowball/header.h.

You are correct.  Thanks for the pointer.

> > In src/backend/snowball/dict_snowball.c, 'dsnowball_lexize'
> > calls 'SN_set_current' and ignores the return value, thereby
> > failing to notice the error, if any.
>
> Hm.  This seems like possibly a bug, in that even if we cover the
> malloc issue, there's no API guarantee that OOM is the only possible
> reason for reporting failure.

Ok, that sounds fair.  Since the memory is being palloc'd, I suppose
it would be safe to just ereport when the return value is -1?
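
Something like this, perhaps (purely illustrative; the variable names
here don't necessarily match dict_snowball.c):

	if (SN_set_current(z, strlen(txt), (symbol *) txt) < 0)
		ereport(ERROR,
				(errcode(ERRCODE_OUT_OF_MEMORY),
				 errmsg("could not set snowball stemmer input text")));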

> > There is a comment higher up in dict_snowball.c that seems to
> > use some handwaving about all this, or perhaps it is documenting
> > something else entirely.  In any event, I find the documentation
> > about dictCtx insufficient to explain why this memory handling
> > is correct.
>
> Fair complaint --- do you want to propose some new wording that
> references what header.h does?

Perhaps something along these lines?

 /*
-* snowball saves alloced memory between calls, so we should run it in our
-* private memory context. Note, init function is executed in long lived
-* context, so we just remember CurrentMemoryContext
+* snowball saves alloced memory between calls, which we force to be
+* allocated using palloc and friends via preprocessing macros (see
+* snowball/header.h), so we should run snowball in our private memory
+* context.  Note, init function is executed in long lived context, so we
+* just remember CurrentMemoryContext.
  */




Re: Why could GEQO produce plans with lower costs than the standard_join_search?

2019-05-23 Thread Tom Lane
Donald Dong  writes:
> On May 22, 2019, at 11:42 AM, Tom Lane  wrote:
>> You're still asking us to answer hypothetical questions unsupported
>> by evidence.  In what case does that really happen?

> I attached the query plan and debug_print_rel output for GEQO and
> standard_join_search.

>        planstate->total_cost   cheapest_total_path
> GEQO    54190.13               54239.03
> STD     54179.02               54273.73

> Here I observe GEQO  produces a lower
> cheapest_total_path->total_cost, but its planstate->total_cost is higher
> than what standard_join_search produces.

Well,

(1) the plan selected by GEQO is in fact more expensive than
the one found by the standard search.  Not by much --- as Andrew
observes, this difference is less than what the planner considers
"fuzzily the same" --- but nonetheless 54190.13 > 54179.02.

(2) the paths you show do not correspond to the finally selected
plans --- they aren't even the same shape.  (The Gathers are in
different places, to start with.)  I'm not sure where you were
capturing the path data, but it looks like you missed top-level
parallel-aggregation planning, and that managed to find some
plans that were marginally cheaper than the ones you captured.
Keep in mind that GEQO only considers join planning, not
grouping/aggregation.

Andrew's point about fuzzy cost comparison is also a good one,
though we needn't invoke it to explain these particular numbers.

regards, tom lane




Re: Minimal logical decoding on standbys

2019-05-23 Thread Andres Freund
Hi,

On 2019-05-23 09:37:50 -0400, Robert Haas wrote:
> On Thu, May 23, 2019 at 9:30 AM Sergei Kornilov  wrote:
> > > wal_level is PGC_POSTMASTER.
> >
> > But primary can be restarted without restart on standby. We require 
> > wal_level replica or highter (currently only logical) on standby. So online 
> > change from logical to replica wal_level is possible on standby's 
> > controlfile.
> 
> That's true, but Amit's scenario involved a change in wal_level during
> the execution of pg_create_logical_replication_slot(), which I think
> can't happen.

I don't see why not - we're talking about the wal_level in the WAL
stream, not the setting on the standby. And that can change during the
execution of pg_create_logical_replication_slot(), if a PARAMETER_CHANGE
record is replayed. I don't think it's actually a problem, as I
outlined in my response to Amit, though.

Greetings,

Andres Freund




Re: Minimal logical decoding on standbys

2019-05-23 Thread Andres Freund
Hi,

On 2019-05-23 17:39:21 +0530, Amit Khandekar wrote:
> On Tue, 21 May 2019 at 21:49, Andres Freund  wrote:
> Yeah, I agree we should add such checks to minimize the possibility of
> reading logical records from a master that has insufficient wal_level.
> So to summarize :
> a. CheckLogicalDecodingRequirements() : Add Controlfile wal_level checks
> b. Call this function call in CreateInitDecodingContext() as well.
> c. While decoding XLOG_PARAMETER_CHANGE record, emit recovery conflict
> error if there is an existing logical slot.
> 
> This made me think more of the race conditions. For instance, in
> pg_create_logical_replication_slot(), just after
> CheckLogicalDecodingRequirements and before actually creating the
> slot, suppose concurrently Controlfile->wal_level is changed from
> logical to replica.  So suppose a new slot does get created. Later the
> slot is read, so in pg_logical_slot_get_changes_guts(),
> CheckLogicalDecodingRequirements() is called where it checks
> ControlFile->wal_level value. But just before it does that,
> ControlFile->wal_level concurrently changes back to logical, because
> of replay of another param-change record. So this logical reader will
> think that the wal_level is sufficient, and will proceed to read the
> records, but those records are *before* the wal_level change, so these
> records don't have logical data.

I don't think that's an actual problem, because there's no decoding
before the slot exists and CreateInitDecodingContext() has determined
the start LSN. And by that point the slot exists, slo
XLOG_PARAMETER_CHANGE replay can error out.

Greetings,

Andres Freund




Re: with oids option not removed in pg_dumpall

2019-05-23 Thread Andres Freund
Hi,

On 2019-05-23 16:26:38 +0300, Surafel Temesgen wrote:
> Thank you for applying
> 
> regards
> Surafel
> 
> On Thu, May 23, 2019 at 3:43 AM Michael Paquier  wrote:
> 
> > On Tue, May 21, 2019 at 05:24:57PM +0900, Michael Paquier wrote:
> > > Good catch.  Your cleanup looks correct to me.  Andres, perhaps you
> > > would prefer doing the cleanup yourself?
> >
> > As I am cleaning up the area for another issue, applied.

Thanks for finding and applying.

Greetings,

Andres Freund




Re: crash testing suggestions for 12 beta 1

2019-05-23 Thread Peter Geoghegan
On Thu, May 23, 2019 at 8:24 AM Jeff Janes  wrote:
> Now that beta is out, I wanted to do some crash-recovery testing where I 
> inject PANIC-inducing faults and see if it recovers correctly.

Thank you for doing this. It's important work.

> Making the ctid be tie-breakers in btree index is also tested inherently 
> (plus I think Peter tested that pretty thoroughly himself with similar 
> methods).

As you may know, the B-Tree code has a tendency to soldier on when an
index is corrupt. "Moving right" tends to conceal problems beyond
concurrent page splits. I didn't do very much fault injection type
testing with the B-Tree enhancements, but I did lean on amcheck
heavily during development. Note that a new, extremely thorough option
called "rootdescend" verification was added following the v12 work:

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=c1afd175b5b2e5c44f6da34988342e00ecdfb518

It probably wouldn't add noticeable overhead to use this during your
testing, and maybe to combine it with the "heapallindexed" option,
while using the bt_index_parent_check() variant -- that will detect
almost any imaginable index corruption. Admittedly, amcheck didn't
find any bugs in my code after the first couple of versions of the
patch series, so this approach seems unlikely to find any problems
now. Even still, it wouldn't be very difficult to do this extra step.
It seems worthwhile to be thorough here, given that we depend on the
B-Tree code so heavily.
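
For example, something along these lines against each B-Tree index
ought to exercise it about as hard as amcheck can (the index name is a
placeholder, and the parameter names are as I remember them, so
double-check against the docs):

	SELECT bt_index_parent_check('my_btree_index'::regclass,
	                             heapallindexed => true,
	                             rootdescend => true);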

-- 
Peter Geoghegan




Re: Read-only access to temp tables for 2PC transactions

2019-05-23 Thread Andres Freund
Hi,

On 2019-05-23 12:36:09 +0100, Simon Riggs wrote:
> The ONLY case where this matters is if someone does a PREPARE and then
> starts doing other work on the session. Which makes no sense in the normal
> workflow of a session. I'm sure there are tests that do that, but those
> tests are unrepresentative of sensible usage.

That's extremely common.

There's no way we can forbid using session after 2PC unconditionally,
it'd break most users of 2PC.

Greetings,

Andres Freund




Re: PostgreSQL 12 Beta 1 press release draft

2019-05-23 Thread Pantelis Theodosiou
On Thu, May 23, 2019 at 4:36 PM Pantelis Theodosiou 
wrote:

>
> On Thu, May 23, 2019 at 1:01 PM Jonathan S. Katz 
> wrote:
>
>> On 5/23/19 1:45 AM, David Rowley wrote:
>> > On Thu, 23 May 2019 at 15:31, Jonathan S. Katz 
>> wrote:
>> >> Attached is
>> >> v3 of the patch, along with a diff.
>> >
>> > Minor details, but this query is not valid:
>> >
>> >> WITH c AS MATERIALIZED (
>> >>   SELECT * FROM a WHERE a.x % 4
>> >> )
>> >> SELECT * FROM c JOIN d ON d.y = a.x;
>> >
>> > a.x % 4 is not a boolean clause, and "a" is not in the main query, so
>> > a.x can't be referenced there.
>>
>> ...that's the only gotcha I'm actually embarrassed about. Fixed.
>>
>>
> The   ON d.y = a.x  still needs to be changed to ON d.y = c.x
>
> Pantelis
>

Another minor point in the sentence  "... which is currently is ...":

> In PostgreSQL 12, the storage interface that is used by default is the
heap access method, which is currently is the only built-in method.

But I forgot the most important thing: thank you for the new version and all
the work that has gone into it!


Re: Memory bug in dsnowball_lexize

2019-05-23 Thread Tom Lane
Mark Dilger  writes:
> In src/backend/snowball/libstemmer/utilities.c, 'create_s' uses
> malloc (not palloc) to allocate memory, and on memory exhaustion
> returns NULL rather than throwing an exception.

Actually not, see macros in src/include/snowball/header.h.

> In src/backend/snowball/dict_snowball.c, 'dsnowball_lexize'
> calls 'SN_set_current' and ignores the return value, thereby
> failing to notice the error, if any.

Hm.  This seems like possibly a bug, in that even if we cover the
malloc issue, there's no API guarantee that OOM is the only possible
reason for reporting failure.

> There is a comment higher up in dict_snowball.c that seems to
> use some handwaving about all this, or perhaps it is documenting
> something else entirely.  In any event, I find the documentation
> about dictCtx insufficient to explain why this memory handling
> is correct.

Fair complaint --- do you want to propose some new wording that
references what header.h does?

regards, tom lane




Re: PostgreSQL 12 Beta 1 press release draft

2019-05-23 Thread Pantelis Theodosiou
On Thu, May 23, 2019 at 1:01 PM Jonathan S. Katz 
wrote:

> On 5/23/19 1:45 AM, David Rowley wrote:
> > On Thu, 23 May 2019 at 15:31, Jonathan S. Katz 
> wrote:
> >> Attached is
> >> v3 of the patch, along with a diff.
> >
> > Minor details, but this query is not valid:
> >
> >> WITH c AS MATERIALIZED (
> >>   SELECT * FROM a WHERE a.x % 4
> >> )
> >> SELECT * FROM c JOIN d ON d.y = a.x;
> >
> > a.x % 4 is not a boolean clause, and "a" is not in the main query, so
> > a.x can't be referenced there.
>
> ...that's the only gotcha I'm actually embarrassed about. Fixed.
>
>
The   ON d.y = a.x  still needs to be changed to ON d.y = c.x

Pantelis


Re: nitpick about poor style in MergeAttributes

2019-05-23 Thread Mark Dilger
On Thu, May 23, 2019 at 7:54 AM Tom Lane  wrote:
>
> Michael Paquier  writes:
> > On Wed, May 22, 2019 at 06:20:01PM -0700, Mark Dilger wrote:
> >> What to do about this is harder to say.  In the following
> >> patch, I'm just doing what I think is standard for callers
> >> of list_delete_cell, and assigning the return value back
> >> to the list (similar to how a call to repalloc should do).
> >> But since there is an implicit assumption that the list
> >> is never emptied by this operation, perhaps checking
> >> against NIL and elog'ing makes more sense?
>
> > Yes, I agree that this is a bit fuzzy, and this code is new as of
> > 705d433.  As you say, I agree that making sure that the return value
> > of list_delete_cell is not NIL is a sensible choice.
>
> Are we sure that's not just a newly-introduced bug, ie it has not
> been tested in cases where the tlist could become empty?  My first
> thought would be to assign the list pointer value back as per usual
> coding convention, not to double down on the assumption that this
> was well-considered code.

I don't think that is disputed.  I was debating between assigning
it back and also asserting that it is not NIL vs. assigning it back
and elog/ereporting if it is NIL.  Of course, this is assuming the
code was designed with the expectation that the list can never
become empty.  If you think it might become empty, and that the
surrounding code can handle that sensibly, then perhaps we
need neither the assertion nor the elog/ereport, though we still
need the assignment.
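
In other words, something like this (sketch only, with a made-up list
variable name):

	cols = list_delete_cell(cols, cell, prev);
	if (cols == NIL)
		elog(ERROR, "column list unexpectedly became empty");

versus the same assignment with an Assert(cols != NIL) instead of the
elog.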

mark




crash testing suggestions for 12 beta 1

2019-05-23 Thread Jeff Janes
Now that beta is out, I wanted to do some crash-recovery testing where I
inject PANIC-inducing faults and see if it recovers correctly. A long-lived
Perl process keeps track of what it should find after the crash, and
verifies that it finds it.  You will probably be familiar with the general
theme from examples like the threads below.  Would anyone like to nominate
some areas to focus on?  I think the pluggable storage refactoring work
will get inherently tested, so I'm not planning on designing tests
specifically for that (unless there is a non-core plugin I should test
with). Making the ctid a tie-breaker in btree indexes is also tested
inherently (plus I think Peter tested that pretty thoroughly himself with
similar methods).  I've already tested declarative partitioning where the
tuples do a lot of migrating, and tested prepared transactions.  Any other
suggestions for changes that might be risky and should be specifically
targeted for testing?



https://www.postgresql.org/message-id/CAMkU=1xeuubphdwdmb1wjn4+td4kpnenifatbxnk1xzhcw8...@mail.gmail.com


https://www.postgresql.org/message-id/CAMkU=1xBP8cqdS5eK8APHL=x6rhmmm2vg5g+qamduutsycw...@mail.gmail.com


Cheers,

Jeff


Re: refactoring - share str2*int64 functions

2019-05-23 Thread Fabien COELHO


Although I agree it is not worth a lot of trouble, and even if I don't do
Windows, I think it is valuable that the behavior is the same on all platforms.
The attached patch shares pg_str2*int64 functions between frontend and
backend by moving them to "common/", which avoids some code duplication.


This is more refactoring, and it fixes the behavior change on 32 bit 
architectures.


V2 is a rebase.

--
Fabien.diff --git a/src/backend/executor/spi.c b/src/backend/executor/spi.c
index 8eedb613a1..e8744f054c 100644
--- a/src/backend/executor/spi.c
+++ b/src/backend/executor/spi.c
@@ -21,6 +21,7 @@
 #include "catalog/heap.h"
 #include "catalog/pg_type.h"
 #include "commands/trigger.h"
+#include "common/string.h"
 #include "executor/executor.h"
 #include "executor/spi_priv.h"
 #include "miscadmin.h"
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 6c2626ee62..3735268e3a 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -34,6 +34,7 @@
 
 #include "fmgr.h"
 #include "miscadmin.h"
+#include "common/string.h"
 #include "nodes/extensible.h"
 #include "nodes/parsenodes.h"
 #include "nodes/plannodes.h"
diff --git a/src/backend/parser/parse_node.c b/src/backend/parser/parse_node.c
index 3a7a858e3a..8a6d031d7c 100644
--- a/src/backend/parser/parse_node.c
+++ b/src/backend/parser/parse_node.c
@@ -25,10 +25,10 @@
 #include "parser/parse_expr.h"
 #include "parser/parse_relation.h"
 #include "utils/builtins.h"
-#include "utils/int8.h"
 #include "utils/lsyscache.h"
 #include "utils/syscache.h"
 #include "utils/varbit.h"
+#include "common/string.h"
 
 
 static void pcb_error_callback(void *arg);
@@ -498,7 +498,7 @@ make_const(ParseState *pstate, Value *value, int location)
 
 		case T_Float:
 			/* could be an oversize integer as well as a float ... */
-			if (scanint8(strVal(value), true, ))
+			if (pg_strtoint64(strVal(value), true, ))
 			{
 /*
  * It might actually fit in int32. Probably only INT_MIN can
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index d317fd7006..70681588a1 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -12,6 +12,7 @@
  */
 #include "postgres.h"
 
+#include "common/string.h"
 #include "catalog/pg_publication.h"
 
 #include "replication/logical.h"
@@ -20,7 +21,6 @@
 #include "replication/pgoutput.h"
 
 #include "utils/inval.h"
-#include "utils/int8.h"
 #include "utils/memutils.h"
 #include "utils/syscache.h"
 #include "utils/varlena.h"
@@ -111,7 +111,7 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 		 errmsg("conflicting or redundant options")));
 			protocol_version_given = true;
 
-			if (!scanint8(strVal(defel->arg), true, ))
+			if (!pg_strtoint64(strVal(defel->arg), true, ))
 ereport(ERROR,
 		(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 		 errmsg("invalid proto_version")));
diff --git a/src/backend/utils/adt/cash.c b/src/backend/utils/adt/cash.c
index c92e9d5046..618660f372 100644
--- a/src/backend/utils/adt/cash.c
+++ b/src/backend/utils/adt/cash.c
@@ -26,7 +26,6 @@
 #include "libpq/pqformat.h"
 #include "utils/builtins.h"
 #include "utils/cash.h"
-#include "utils/int8.h"
 #include "utils/numeric.h"
 #include "utils/pg_locale.h"
 
diff --git a/src/backend/utils/adt/formatting.c b/src/backend/utils/adt/formatting.c
index 206576d4bd..1f17987e85 100644
--- a/src/backend/utils/adt/formatting.c
+++ b/src/backend/utils/adt/formatting.c
@@ -92,7 +92,6 @@
 #include "utils/datetime.h"
 #include "utils/float.h"
 #include "utils/formatting.h"
-#include "utils/int8.h"
 #include "utils/memutils.h"
 #include "utils/numeric.h"
 #include "utils/pg_locale.h"
diff --git a/src/backend/utils/adt/int8.c b/src/backend/utils/adt/int8.c
index 0ff9394a2f..98e7b52ce9 100644
--- a/src/backend/utils/adt/int8.c
+++ b/src/backend/utils/adt/int8.c
@@ -18,12 +18,12 @@
 #include 
 
 #include "common/int.h"
+#include "common/string.h"
 #include "funcapi.h"
 #include "libpq/pqformat.h"
 #include "nodes/nodeFuncs.h"
 #include "nodes/supportnodes.h"
 #include "optimizer/optimizer.h"
-#include "utils/int8.h"
 #include "utils/builtins.h"
 
 
@@ -47,89 +47,6 @@ typedef struct
  * Formatting and conversion routines.
  *-*/
 
-/*
- * scanint8 --- try to parse a string into an int8.
- *
- * If errorOK is false, ereport a useful error message if the string is bad.
- * If errorOK is true, just return "false" for bad input.
- */
-bool
-scanint8(const char *str, bool errorOK, int64 *result)
-{
-	const char *ptr = str;
-	int64		tmp = 0;
-	bool		neg = false;
-
-	/*
-	 * Do our own scan, rather than relying on sscanf which might be broken
-	 * for long long.
-	 *
-	 * As INT64_MIN can't be stored as a positive 64 bit integer, accumulate
-	 * value as a negative number.
-	 */
-
-	/* skip leading spaces */
-	while (*ptr && isspace((unsigned char) *ptr))
-		ptr++;

Memory bug in dsnowball_lexize

2019-05-23 Thread Mark Dilger
Hackers,

In src/backend/snowball/libstemmer/utilities.c, 'create_s' uses
malloc (not palloc) to allocate memory, and on memory exhaustion
returns NULL rather than throwing an exception.  In this same
file, 'replace_s' calls 'create_s' and if it gets back NULL, returns
the error code -1.  Otherwise, it sets z->p to the allocated
memory.

In src/backend/snowball/libstemmer/api.c, 'SN_set_current' calls
'replace_s' and returns whatever 'replace_s' returned, which in
the case of memory exhaustion will be -1.

In src/backend/snowball/dict_snowball.c, 'dsnowball_lexize'
calls 'SN_set_current' and ignores the return value, thereby
failing to notice the error, if any.

I checked one of the stemmers, stem_ISO_8859_1_english.c,
and it treats z->p as an array without checking whether it is
NULL.  This will crash the backend in the above error case.

There is something else weird here, though.  The call to
'SN_set_current' is wrapped in a memory context switch, along
with a call to the stemmer, as if the caller expects any allocated
memory to be palloc'd, which it is not, given the underlying code's
use of malloc and calloc.

There is a comment higher up in dict_snowball.c that seems to
use some handwaving about all this, or perhaps it is documenting
something else entirely.  In any event, I find the documentation
about dictCtx insufficient to explain why this memory handling
is correct.

mark




RE: psql - add SHOW_ALL_RESULTS option

2019-05-23 Thread Fabien COELHO



Here is a v3 which fixes \errverbose, hopefully.


V5 is a rebase with slightly improved documentation.

--
Fabien.diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml
index b86764003d..4a1b562f24 100644
--- a/doc/src/sgml/ref/psql-ref.sgml
+++ b/doc/src/sgml/ref/psql-ref.sgml
@@ -3341,7 +3341,7 @@ select 1\; select 2\; select 3;
 psql prints only the last query result
 it receives for each request; in this example, although all
 three SELECTs are indeed executed, psql
-only prints the 3.
+only prints the 3, unless SHOW_ALL_RESULTS is set.
 
 
   
@@ -3918,6 +3918,17 @@ bar
 
   
 
+  
+SHOW_ALL_RESULTS
+
+
+When this variable is set to on, all results of a combined
+(\;) query are shown instead of just the last one.
+Default is off.
+
+
+  
+
   
 SHOW_CONTEXT
 
diff --git a/src/bin/psql/common.c b/src/bin/psql/common.c
index 44a782478d..dc575911ce 100644
--- a/src/bin/psql/common.c
+++ b/src/bin/psql/common.c
@@ -472,6 +472,16 @@ ResetCancelConn(void)
 #endif
 }
 
+static void
+ShowErrorMessage(const PGresult *result)
+{
+	const char *error = PQerrorMessage(pset.db);
+
+	if (strlen(error))
+		pg_log_info("%s", error);
+
+	CheckConnection();
+}
 
 /*
  * AcceptResult
@@ -482,7 +492,7 @@ ResetCancelConn(void)
  * Returns true for valid result, false for error state.
  */
 static bool
-AcceptResult(const PGresult *result)
+AcceptResult(const PGresult *result, bool show_error)
 {
 	bool		OK;
 
@@ -513,15 +523,8 @@ AcceptResult(const PGresult *result)
 break;
 		}
 
-	if (!OK)
-	{
-		const char *error = PQerrorMessage(pset.db);
-
-		if (strlen(error))
-			pg_log_info("%s", error);
-
-		CheckConnection();
-	}
+	if (!OK && show_error)
+		ShowErrorMessage(result);
 
 	return OK;
 }
@@ -701,7 +704,7 @@ PSQLexec(const char *query)
 
 	ResetCancelConn();
 
-	if (!AcceptResult(res))
+	if (!AcceptResult(res, true))
 	{
 		ClearOrSaveResult(res);
 		res = NULL;
@@ -743,7 +746,7 @@ PSQLexecWatch(const char *query, const printQueryOpt *opt)
 
 	ResetCancelConn();
 
-	if (!AcceptResult(res))
+	if (!AcceptResult(res, true))
 	{
 		ClearOrSaveResult(res);
 		return 0;
@@ -999,199 +1002,113 @@ loop_exit:
 	return success;
 }
 
-
 /*
- * ProcessResult: utility function for use by SendQuery() only
- *
- * When our command string contained a COPY FROM STDIN or COPY TO STDOUT,
- * PQexec() has stopped at the PGresult associated with the first such
- * command.  In that event, we'll marshal data for the COPY and then cycle
- * through any subsequent PGresult objects.
- *
- * When the command string contained no such COPY command, this function
- * degenerates to an AcceptResult() call.
- *
- * Changes its argument to point to the last PGresult of the command string,
- * or NULL if that result was for a COPY TO STDOUT.  (Returning NULL prevents
- * the command status from being printed, which we want in that case so that
- * the status line doesn't get taken as part of the COPY data.)
- *
- * Returns true on complete success, false otherwise.  Possible failure modes
- * include purely client-side problems; check the transaction status for the
- * server-side opinion.
+ * Marshal the COPY data.  Either subroutine will get the
+ * connection out of its COPY state, then call PQresultStatus()
+ * once and report any error.
+ *
+ * For COPY OUT, direct the output to pset.copyStream if it's set,
+ * otherwise to pset.gfname if it's set, otherwise to queryFout.
+ * For COPY IN, use pset.copyStream as data source if it's set,
+ * otherwise cur_cmd_source.
+ *
+ * Update result if further processing is necessary.  (Returning NULL
+ * prevents the command status from being printed, which we want in that
+ * case so that the status line doesn't get taken as part of the COPY data.)
  */
 static bool
-ProcessResult(PGresult **results)
+HandleCopyResult(PGresult **result)
 {
 	bool		success = true;
-	bool		first_cycle = true;
+	FILE	   *copystream;
+	PGresult   *copy_result;
+	ExecStatusType result_status = PQresultStatus(*result);
 
-	for (;;)
+	Assert(result_status == PGRES_COPY_OUT ||
+		   result_status == PGRES_COPY_IN);
+
+	SetCancelConn();
+	if (result_status == PGRES_COPY_OUT)
 	{
-		ExecStatusType result_status;
-		bool		is_copy;
-		PGresult   *next_result;
+		bool		need_close = false;
+		bool		is_pipe = false;
 
-		if (!AcceptResult(*results))
+		if (pset.copyStream)
 		{
-			/*
-			 * Failure at this point is always a server-side failure or a
-			 * failure to submit the command string.  Either way, we're
-			 * finished with this command string.
-			 */
-			success = false;
-			break;
+			/* invoked by \copy */
+			copystream = pset.copyStream;
 		}
-
-		result_status = PQresultStatus(*results);
-		switch (result_status)
+		else if (pset.gfname)
 		{
-			case PGRES_EMPTY_QUERY:
-			case PGRES_COMMAND_OK:
-			case PGRES_TUPLES_OK:
-

Re: nitpick about poor style in MergeAttributes

2019-05-23 Thread Tom Lane
Michael Paquier  writes:
> On Wed, May 22, 2019 at 06:20:01PM -0700, Mark Dilger wrote:
>> What to do about this is harder to say.  In the following
>> patch, I'm just doing what I think is standard for callers
>> of list_delete_cell, and assigning the return value back
>> to the list (similar to how a call to repalloc should do).
>> But since there is an implicit assumption that the list
>> is never emptied by this operation, perhaps checking
>> against NIL and elog'ing makes more sense?

> Yes, I agree that this is a bit fuzzy, and this code is new as of
> 705d433.  As you say, I agree that making sure that the return value
> of list_delete_cell is not NIL is a sensible choice.

Are we sure that's not just a newly-introduced bug, ie it has not
been tested in cases where the tlist could become empty?  My first
thought would be to assign the list pointer value back as per usual
coding convention, not to double down on the assumption that this
was well-considered code.

regards, tom lane




Re: ACL dump ordering broken as well for tablespaces

2019-05-23 Thread Stephen Frost
Greetings,

* Michael Paquier (mich...@paquier.xyz) wrote:
> On Wed, May 22, 2019 at 06:35:31PM +, Bossart, Nathan wrote:
> > A bit of digging led me to the commit that removed databases and
> > tablespaces from pg_init_privs [0] and to a related thread [1].  IIUC
> > the problem is that using pg_init_privs for databases is complicated
> > by the ability to drop and recreate the template databases.
> 
> I don't quite get this argument.  If a user is willing to drop
> template1, then it is logic to also drop its initial privilege entries
> and recreate new ones from scratch.  I think that this deserves a
> closer lookup.  For tablespaces, we are limited by the ability of not
> sharing pg_init_privs?

Why do you feel that's the case?  The point of pg_init_privs is to
provide a way to go from initdb-time state to
whatever-the-current-privs-are.  Dropping initdb-time privs and
reverting back to a no-privs default wouldn't be right in any case
where the initdb-time privs are different from no-privs.

As an example- if someone dropped the template1 database and then
re-created it, it won't have the same privileges as the initdb-time
template1 and instead wouldn't have any privileges- but an actual new
installation would bring back template1 with the initdb-time privileges.

Basically, the pg_dump output in that case *should* include the
equivilant of "REVOKE initdb-time privs" if we want it to be a minimal
change from what is there at initdb-time, but the only way to have that
happen is to know what the initdb-time privs were, and that's not what
will be in pg_init_privs if you drop the entries from pg_init_priv when
template1 is dropped.

I'm not sure where/why tablespaces came into this discussion at all
since we don't have any initdb-time tablespaces that can be dropped or
recreated, and the ones we do have exist with no-privs at initdb-time,
so I don't think there's any reason we'd need to have entries in
pg_init_privs for those, and it seems like they should be able to just
use the buildACLQueries magic, though it looks like you might have to
add a flag that will basically be the same as the binary_upgrade flag to
say "this object doesn't have any init privs, so we aren't joining
against pg_init_privs, and don't include that in the query".

Unfortunately, that isn't going to work for databases though.  I haven't
got any particularly great solution there.  We could possibly have a
catalog which defined the initdb-time objects using their name instead
of the initdb-time OID but that seems awful grotty..  Or we could try to
do the same thing the user did and just DROP template1 and then recreate
it, in which case it'd be created with no-privs and then do the same as
tablespaces above, if it has a too-high OID, but again, that seems
pretty ugly.

Open to other thoughts, but I really don't think we can just stick
things into pg_init_privs for global objects and then remove them if
that object is dropped, and it isn't a shared catalog either, so there's
also that...  Having a shared catalog just in case someone wants to
drop/recreate template1 strikes me as pretty massive overkill.

Thanks,

Stephen




Re: pgbench - add pseudo-random permutation function

2019-05-23 Thread Fabien COELHO


Hello Hironobu-san,

Here is a v15 which is a rebase, plus a large simplification of the modmul 
function if an int128 type is available, which is probably always…


Could you have a look and possibly revalidate?


Sorry for the late reply. I reviewed this patch.

Function nbits(), which was previously discussed, has been simplified by 
using the function pg_popcount64().


The added mathematical explanation makes it easier to understand how this
function operates.


I believe that these improvements will have a positive impact on maintenance.

The patch could be applied successfully and the tests passed without 
problems.


So, I think the latest patch is fine.


Best regards,




On 3/3/19 12:55 PM, Fabien COELHO wrote:



Indeed, the patch needs a rebase & conflict resolution. I'll do it. Later.


Here is an update:

  - take advantage of pg_bitutils (although I noted that the "slow"
    popcount there could be sped up and shortened with a bitwise operator
    implementation that I just removed from pgbench).

  - add comments about the bijective transformations in the code.

As already stated, this function makes sense for people who want to test
performance with pgbench using non-uniform random values. If you do not want
to do that, you will probably find the function pretty useless. I can't help it.


Non-uniform random values are also a way to test pg lock contention behavior.


You have signed up as a reviewer for this patch.  Do you know when you'll 
have time to do the review?


Regards,





--
Fabien Coelho - CRI, MINES ParisTech

diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index ef22a484e7..24a2c147cc 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -938,7 +938,7 @@ pgbench  options  d
 
   
 default_seed 
-   seed used in hash functions by default
+   seed used in hash and pseudo-random permutation functions by default
   
 
   
@@ -1434,6 +1434,13 @@ SELECT 2 AS two, 3 AS three \gset p_
pow(2.0, 10), power(2.0, 10)
1024.0
   
+  
+   pr_perm(i, size [, seed ] )
+   integer
+   pseudo-random permutation in [0,size)
+   pr_perm(0, 4)
+   0, 1, 2 or 3
+  
   
random(lb, ub)
integer
@@ -1599,6 +1606,24 @@ f(x) = PHI(2.0 * parameter * (x - mu) / (max - min + 1)) /
 
   
 
+  
+Function pr_perm implements a pseudo-random permutation.
+It allows to mix the output of non uniform random functions so that
+values drawn more often are not trivially correlated.
+It permutes integers in [0, size) using a seed by applying rounds of
+simple invertible functions, similarly to an encryption function,
+although beware that it is not at all cryptographically secure.
+Compared to hash functions discussed above, the function
+ensures that a perfect permutation is applied: there are no collisions
+nor holes in the output values.
+Values outside the interval are interpreted modulo the size.
+The function errors if size is not positive.
+If no seed is provided, :default_seed is used.
+For a given size and seed, the function is fully deterministic: if two
+permutations on the same size must not be correlated, use distinct seeds
+as outlined in the previous example about hash functions.
+  
+
   
As an example, the full definition of the built-in TPC-B-like
transaction is:
diff --git a/src/bin/pgbench/exprparse.y b/src/bin/pgbench/exprparse.y
index 17c9ec32c5..13410787d4 100644
--- a/src/bin/pgbench/exprparse.y
+++ b/src/bin/pgbench/exprparse.y
@@ -19,6 +19,7 @@
 #define PGBENCH_NARGS_VARIABLE	(-1)
 #define PGBENCH_NARGS_CASE		(-2)
 #define PGBENCH_NARGS_HASH		(-3)
+#define PGBENCH_NARGS_PRPERM	(-4)
 
 PgBenchExpr *expr_parse_result;
 
@@ -370,6 +371,9 @@ static const struct
 	{
 		"hash_fnv1a", PGBENCH_NARGS_HASH, PGBENCH_HASH_FNV1A
 	},
+	{
+		"pr_perm", PGBENCH_NARGS_PRPERM, PGBENCH_PRPERM
+	},
 	/* keep as last array element */
 	{
 		NULL, 0, 0
@@ -482,6 +486,19 @@ make_func(yyscan_t yyscanner, int fnumber, PgBenchExprList *args)
 			}
 			break;
 
+		/* pseudo-random permutation function with optional seed argument */
+		case PGBENCH_NARGS_PRPERM:
+			if (len < 2 || len > 3)
+expr_yyerror_more(yyscanner, "unexpected number of arguments",
+  PGBENCH_FUNCTIONS[fnumber].fname);
+
+			if (len == 2)
+			{
+PgBenchExpr *var = make_variable("default_seed");
+args = make_elist(var, args);
+			}
+			break;
+
 		/* common case: positive arguments number */
 		default:
 			Assert(PGBENCH_FUNCTIONS[fnumber].nargs >= 0);
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 8b84658ccd..c532978398 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -38,6 +38,7 @@
 #include "getopt_long.h"
 #include "libpq-fe.h"
 #include "portability/instr_time.h"
+#include "port/pg_bitutils.h"
 
 #include 
 #include 
@@ -1039,6 +1040,307 @@ 

Re: "long" type is not appropriate for counting tuples

2019-05-23 Thread Tom Lane
Peter Eisentraut  writes:
> On 2019-05-23 15:52, Robert Haas wrote:
>> On Thu, May 23, 2019 at 5:31 AM Peter Eisentraut
>>  wrote:
>>> Another option is that in cases where it doesn't affect storage layouts,
>>> like the counting tuples case that started this thread, code could just
>>> use long long int directly instead of int64.  Then if someone wants to
>>> make it 128 bits or 96 bits or whatever it would not be a problem.

>> I think that sort of thing tends not to work out well, because at some
>> point it's likely to be sent out via the wire protocol; at that point
>> we'll need a value of a certain width.  Better to use that width right
>> from the beginning.

> Hmm, by that argument, we shouldn't ever use any integer type other than
> int16, int32, and int64.
> I'm thinking for example that pgbench makes a lot of use of int64 and
> printing that out makes quite messy code.  Replacing that by long long
> int would make this much nicer and should be pretty harmless relative to
> your concern.

It does seem attractive to use long long in cases where we're not too
fussed about the exact width.  OTOH, that reasoning was exactly why we
used "long" in a lot of places back in the day, and sure enough it came
back to bite us.

On the whole I think I could live with a policy that says "tuple counts
shall be 'long long' when being passed around in code, but for persistent
storage or wire-protocol transmission, use 'int64'".

An alternative and much narrower policy is to say it's okay to do this
with an int64 value:

printf("processed %lld tuples", (long long) count);

In such code, all we're assuming is long long >= 64 bits, which
is completely safe per C99, and we dodge the need for a
platform-varying format string.

regards, tom lane




Re: pglz performance

2019-05-23 Thread Binguo Bao
Hi hackers!
I am a student participating in GSoC 2019. I am looking forward to working
with you all and learning from you.
My project would aim to provide the ability to de-TOAST a fully TOAST'd and
compressed field using an iterator.
For more details, please take a look at my proposal[0]. Any suggestions or
comments about my immature ideas would be much appreciated:)

I've implemented the first step of the project: segmented pglz
compression, which provides the ability to get a subset of the raw data
without decompressing the entire field.
I've also done some tests[1] on the compressor. The test results are as
follows:
NOTICE:  Test summary:
NOTICE:  Payload 00010001
NOTICE:   Decompressor name          | Compression time (ns/bit) | Decompression time (ns/bit) | ratio
NOTICE:   pglz_decompress_hacked     | 23.747444                 | 0.578344                    | 0.159809
NOTICE:   pglz_decompress_hacked8    | 23.764193                 | 0.677800                    | 0.159809
NOTICE:   pglz_decompress_hacked16   | 23.740351                 | 0.704730                    | 0.159809
NOTICE:   pglz_decompress_vanilla    | 23.797917                 | 1.227868                    | 0.159809
NOTICE:   pglz_decompress_hacked_seg | 12.261808                 | 0.625634                    | 0.184952

Comment: Compression speed increased by nearly 100% with compression rate dropped by 15%

NOTICE:  Payload 00010001 sliced by 2Kb
NOTICE:   pglz_decompress_hacked     | 12.616956                 | 0.621223                    | 0.156953
NOTICE:   pglz_decompress_hacked8    | 12.583685                 | 0.756741                    | 0.156953
NOTICE:   pglz_decompress_hacked16   | 12.512636                 | 0.774980                    | 0.156953
NOTICE:   pglz_decompress_vanilla    | 12.493062                 | 1.262820                    | 0.156953
NOTICE:   pglz_decompress_hacked_seg | 11.986554                 | 0.622654                    | 0.159590
NOTICE:  Payload 00010001 sliced by 4Kb
NOTICE:   pglz_decompress_hacked     | 15.514469                 | 0.565565                    | 0.154213
NOTICE:   pglz_decompress_hacked8    | 15.529144                 | 0.699675                    | 0.154213
NOTICE:   pglz_decompress_hacked16   | 15.514040                 | 0.721145                    | 0.154213
NOTICE:   pglz_decompress_vanilla    | 15.558958                 | 1.237237                    | 0.154213
NOTICE:   pglz_decompress_hacked_seg | 14.650309                 | 0.563228                    | 0.153652
NOTICE:  Payload 00010006
NOTICE:   Decompressor name          | Compression time (ns/bit) | Decompression time (ns/bit) | ratio
NOTICE:   pglz_decompress_hacked     | 8.610177                  | 0.153577                    | 0.052294
NOTICE:   pglz_decompress_hacked8    | 8.566785                  | 0.168002                    | 0.052294
NOTICE:   pglz_decompress_hacked16   | 8.643126                  | 0.167537                    | 0.052294
NOTICE:   pglz_decompress_vanilla    | 8.574498                  | 0.930738                    | 0.052294
NOTICE:   pglz_decompress_hacked_seg | 7.394731                  | 0.171924                    | 0.056081
NOTICE:  Payload 00010006 sliced by 2Kb
NOTICE:   pglz_decompress_hacked     | 6.724060                  | 0.295043                    | 0.065541
NOTICE:   pglz_decompress_hacked8    | 6.623018                  | 0.318527                    | 0.065541
NOTICE:   pglz_decompress_hacked16   | 6.898034                  | 0.318360                    | 0.065541
NOTICE:   pglz_decompress_vanilla    | 6.712711                  | 1.045430                    | 0.065541
NOTICE:   pglz_decompress_hacked_seg | 6.630743                  | 0.302589                    | 0.068471
NOTICE:  Payload 00010006 sliced by 4Kb
NOTICE:   pglz_decompress_hacked     | 6.624067                  | 0.220942                    | 0.058865
NOTICE:   pglz_decompress_hacked8    | 6.659424                  | 0.240183                    | 0.058865
NOTICE:   pglz_decompress_hacked16   | 6.763864                  | 0.240564                    | 0.058865
NOTICE:   pglz_decompress_vanilla    | 6.743574                  | 0.985348                    | 0.058865
NOTICE:   pglz_decompress_hacked_seg | 6.613123                  | 0.227582                    | 0.060330
NOTICE:  Payload 00010008
NOTICE:   Decompressor name          | Compression time (ns/bit) | Decompression time (ns/bit) | ratio
NOTICE:   pglz_decompress_hacked     | 52.425957                 | 1.050544                    | 0.498941
NOTICE:   pglz_decompress_hacked8    | 52.204561                 | 1.261592                    | 0.498941
NOTICE:   pglz_decompress_hacked16   | 52.328491                 | 1.466751                    | 0.498941
NOTICE:   pglz_decompress_vanilla    | 52.465308                 | 1.341271                    | 0.498941
NOTICE:   pglz_decompress_hacked_seg |

Re: "long" type is not appropriate for counting tuples

2019-05-23 Thread Peter Eisentraut
On 2019-05-23 15:52, Robert Haas wrote:
> On Thu, May 23, 2019 at 5:31 AM Peter Eisentraut
>  wrote:
>> Another option is that in cases where it doesn't affect storage layouts,
>> like the counting tuples case that started this thread, code could just
>> use long long int directly instead of int64.  Then if someone wants to
>> make it 128 bits or 96 bits or whatever it would not be a problem.
> 
> I think that sort of thing tends not to work out well, because at some
> point it's likely to be sent out via the wire protocol; at that point
> we'll need a value of a certain width.  Better to use that width right
> from the beginning.

Hmm, by that argument, we shouldn't ever use any integer type other than
int16, int32, and int64.

I'm thinking for example that pgbench makes a lot of use of int64 and
printing that out makes quite messy code.  Replacing that by long long
int would make this much nicer and should be pretty harmless relative to
your concern.
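
Just to illustrate the two styles side by side (this is only an
illustration, not a patch):

    /* int64 needs the platform-varying INT64_FORMAT macro: */
    int64 count = 42;
    printf("processed " INT64_FORMAT " tuples\n", count);

    /* long long int can use the same format string everywhere: */
    long long int count2 = 42;
    printf("processed %lld tuples\n", count2);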

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Re: [COMMITTERS] pgsql: Remove pgbench "progress" test pending solution of its timing is (fwd)

2019-05-23 Thread Fabien COELHO


V11 is just a rebase after the reindentation patches.


Indeed, yet again, I forgot the attachment:-(

I stared at the new test case for a while, and I must say it looks very 
cryptic. It's not exactly this patch's fault - all the pgbench tests are 
cryptic -


Perl is cryptic. Regexprs are cryptic.

but I think we need to do something about that before adding any more 
tests. I'm not sure what exactly, but I'd like them to be more like 
pg_regress tests, where you have an expected output and you compare it 
with the actual output. I realize that's not easy, because there are a lot 
of varying numbers in the output, but we've got to do something.


As a good first step, I wish the pgbench() function used named arguments. 
[...]


You would have something like this:

my $elapsed = pgbench(
 test_name => 'pgbench progress',
 opts => '-T 2 -P 1 -l --aggregate-interval=1'


I do not like them much in perl because it changes the code significantly, 
but why not. That would be another patch anyway.


A lighter but efficient option would be to add a few comments on the larger 
calls, see attached.


Please really find the attachment this time, and do not hesitate to spare a few
grey cells so that I will not forget about it in the future:-)





--
Fabien.

diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 8b84658ccd..d65b213471 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -5897,6 +5897,7 @@ threadRun(void *arg)
 	int			nstate = thread->nstate;
 	int			remains = nstate;	/* number of remaining clients */
 	socket_set *sockets = alloc_socket_set(nstate);
+	int			aborted = 0;
 	int			i;
 
 	/* for reporting progress: */
@@ -6126,6 +6127,10 @@ threadRun(void *arg)
 			 */
 			if (st->state == CSTATE_FINISHED || st->state == CSTATE_ABORTED)
 remains--;
+
+			/* count aborted clients */
+			if (st->state == CSTATE_ABORTED)
+aborted++;
 		}
 
 		/* progress report is made by thread 0 for all threads */
@@ -6149,7 +6154,10 @@ threadRun(void *arg)
 
 /*
  * Ensure that the next report is in the future, in case
- * pgbench/postgres got stuck somewhere.
+ * pgbench/postgres got stuck somewhere. This may skip
+ * some progress reports if the thread does not get enough
+ * cpu time. In such case, probably the whole bench should
+ * be ignored.
  */
 do
 {
@@ -6160,6 +6168,52 @@ threadRun(void *arg)
 	}
 
 done:
+	/*
+	 * Under -R, comply with -T and -P even if there is nothing to do,
+	 * (unless all clients aborted) and ensure that one report is printed.
+	 * This special behavior allows tap tests to check that a run lasts
+	 * as expected and that some progress is shown, even on very slow hosts.
+	 */
+	if (duration && throttle_delay && aborted < nstate && thread->tid == 0)
+	{
+		int64		thread_end = thread_start + (int64) duration * 100;
+		instr_time	now_time;
+		int64		now;
+
+		INSTR_TIME_SET_CURRENT(now_time);
+		now = INSTR_TIME_GET_MICROSEC(now_time);
+
+		while (now < thread_end)
+		{
+			if (progress && next_report <= thread_end)
+			{
+pg_usleep(next_report - now);
+INSTR_TIME_SET_CURRENT(now_time);
+now = INSTR_TIME_GET_MICROSEC(now_time);
+printProgressReport(thread, thread_start, now, , _report);
+
+/* schedule next report */
+do
+{
+	next_report += (int64) progress * 100;
+} while (now >= next_report);
+			}
+			else
+			{
+pg_usleep(thread_end - now);
+INSTR_TIME_SET_CURRENT(now_time);
+now = INSTR_TIME_GET_MICROSEC(now_time);
+			}
+		}
+
+		/*
+		 * Print a closing progress report if none were printed
+		 * and at least one was expected.
+		 */
+		if (progress && progress <= duration && last_report == thread_start)
+			printProgressReport(thread, thread_start, now, , _report);
+	}
+
 	INSTR_TIME_SET_CURRENT(start);
 	disconnect_all(state, nstate);
 	INSTR_TIME_SET_CURRENT(end);
diff --git a/src/bin/pgbench/t/001_pgbench_with_server.pl b/src/bin/pgbench/t/001_pgbench_with_server.pl
index dc2c72fa92..ed49c3bf2a 100644
--- a/src/bin/pgbench/t/001_pgbench_with_server.pl
+++ b/src/bin/pgbench/t/001_pgbench_with_server.pl
@@ -4,13 +4,14 @@ use warnings;
 use PostgresNode;
 use TestLib;
 use Test::More;
+use Time::HiRes qw{time};
 
 # start a pgbench specific server
 my $node = get_new_node('main');
 $node->init;
 $node->start;
 
-# invoke pgbench, with parameters:
+# invoke pgbench, return elapsed time, with parameters:
 #   $opts: options as a string to be split on spaces
 #   $stat: expected exit status
 #   $out: reference to a regexp list that must match stdout
@@ -49,13 +50,14 @@ sub pgbench
 	}
 
 	push @cmd, @args;
+	my $t0 = time();
 
 	$node->command_checks_all(\@cmd, $stat, $out, $err, $name);
 
 	# cleanup?
 	#unlink @filenames or die "cannot unlink files (@filenames): $!";
 
-	return;
+	return time() - $t0;
 }
 
 # Test concurrent insertion into table with serial column.  This
@@ -894,13 +896,67 @@ sub check_pgbench_logs
 
 my $bdir = 

Re: pgbench - add \aset to store results of a combined query

2019-05-23 Thread Fabien COELHO


V2 is a rebase.

A long time ago I submitted a pgbench \into command to store the results of
queries into variables independently of the query being processed. That got
turned into \gset (;) and \cset (\;), which were committed, but \cset was then
removed because it was not "up to standard": it could not work with an empty
query (the underlying issue is that pg silently skips empty queries, so that
"\; SELECT 1 \; \; SELECT 3," returns 2 results instead of 4, a misplaced
optimisation from my point of view).


Now there is a pgbench \gset which allows extracting the results of the last
query into variables, but as it both sets variables and ends a query at the
same time, there is no way to set variables from any query of a combined (\;)
statement other than the last, which is the kind of non-orthogonal behavior
that I dislike.


This annoys me because the performance of combined queries cannot be tested
if the script needs to extract variables.


To make the feature somehow accessible to combined queries, the attached 
patch adds the "\aset" (all set) command to store all results of queries 
which return just one row into variables, i.e.:


 SELECT 1 AS one \;
 SELECT 2 AS two UNION SELECT 2 \;
 SELECT 3 AS three \aset

will set both "one" and "three", while "two" is not set because there were 
two rows. It is a kind of more permissive \gset.


Because it does this for all queries, there is no need to synchronize with
the underlying queries, which is what made the code for \cset both awkward
and limited. Hopefully this version might be "up to standard".

I'll see. I'm in no hurry:-)




--
Fabien.

diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index ef22a484e7..200f6b0010 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -966,18 +966,27 @@ pgbench  options  d

 
  \gset [prefix]
+ \aset [prefix]
 
 
 
  
-  This command may be used to end SQL queries, taking the place of the
+  These commands may be used to end SQL queries, taking the place of the
   terminating semicolon (;).
  
 
  
-  When this command is used, the preceding SQL query is expected to
-  return one row, the columns of which are stored into variables named after
-  column names, and prefixed with prefix if provided.
+  When the \gset command is used, the preceding SQL query is
+  expected to return one row, the columns of which are stored into variables
+  named after column names, and prefixed with prefix
+  if provided.
+ 
+
+ 
+  When the \aset command is used, all combined SQL queries
+  (separated by \;) which return just one row see their
+  columns stored into variables named after column names, and prefixed with
+  prefix if provided. Other queries are ignored.
  
 
  
@@ -986,6 +995,8 @@ pgbench  options  d
   p_two and p_three
   with integers from the third query.
   The result of the second query is discarded.
+  The result of the two last combined queries are stored in variables
+  four and five.
 
 UPDATE pgbench_accounts
   SET abalance = abalance + :delta
@@ -994,6 +1005,7 @@ UPDATE pgbench_accounts
 -- compound of two queries
 SELECT 1 \;
 SELECT 2 AS two, 3 AS three \gset p_
+SELECT 4 AS four \; SELECT 5 AS five \aset
 
  
 
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 8b84658ccd..6895187455 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -461,6 +461,7 @@ typedef enum MetaCommand
 	META_SHELL,	/* \shell */
 	META_SLEEP,	/* \sleep */
 	META_GSET,	/* \gset */
+	META_ASET,	/* \aset */
 	META_IF,	/* \if */
 	META_ELIF,	/* \elif */
 	META_ELSE,	/* \else */
@@ -493,6 +494,7 @@ static const char *QUERYMODE[] = {"simple", "extended", "prepared"};
  * varprefix 	SQL commands terminated with \gset have this set
  *to a non NULL value.  If nonempty, it's used to prefix the
  *variable name that receives the value.
+ * aset			do gset on all possible queries of a combined query (\;).
  * expr			Parsed expression, if needed.
  * stats		Time spent in this command.
  */
@@ -505,6 +507,7 @@ typedef struct Command
 	int			argc;
 	char	   *argv[MAX_ARGS];
 	char	   *varprefix;
+	bool		aset;
 	PgBenchExpr *expr;
 	SimpleStats stats;
 } Command;
@@ -2479,6 +2482,8 @@ getMetaCommand(const char *cmd)
 		mc = META_ENDIF;
 	else if (pg_strcasecmp(cmd, "gset") == 0)
 		mc = META_GSET;
+	else if (pg_strcasecmp(cmd, "aset") == 0)
+		mc = META_ASET;
 	else
 		mc = META_NONE;
 	return mc;
@@ -2715,7 +2720,7 @@ sendCommand(CState *st, Command *command)
  * Returns true if everything is A-OK, false if any error occurs.
  */
 static bool
-readCommandResponse(CState *st, char *varprefix)
+readCommandResponse(CState *st, char *varprefix, bool is_aset)
 {
 	PGresult   *res;
 	PGresult   *next_res;
@@ -2735,7 +2740,7 @@ readCommandResponse(CState *st, char *varprefix)
 		{
 			case 

Re: "long" type is not appropriate for counting tuples

2019-05-23 Thread Robert Haas
On Thu, May 23, 2019 at 5:31 AM Peter Eisentraut
 wrote:
> Another option is that in cases where it doesn't affect storage layouts,
> like the counting tuples case that started this thread, code could just
> use long long int directly instead of int64.  Then if someone wants to
> make it 128 bits or 96 bits or whatever it would not be a problem.

I think that sort of thing tends not to work out well, because at some
point it's likely to be sent out via the wire protocol; at that point
we'll need a value of a certain width.  Better to use that width right
from the beginning.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Minimal logical decoding on standbys

2019-05-23 Thread Robert Haas
On Thu, May 23, 2019 at 9:30 AM Sergei Kornilov  wrote:
> > wal_level is PGC_POSTMASTER.
>
> But primary can be restarted without restart on standby. We require wal_level 
> replica or highter (currently only logical) on standby. So online change from 
> logical to replica wal_level is possible on standby's controlfile.

That's true, but Amit's scenario involved a change in wal_level during
the execution of pg_create_logical_replication_slot(), which I think
can't happen.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Minimal logical decoding on standbys

2019-05-23 Thread Sergei Kornilov
Hello

> wal_level is PGC_POSTMASTER.

But the primary can be restarted without restarting the standby. We require
wal_level replica or higher (currently only logical) on the standby. So an
online change from logical to replica wal_level is possible in the standby's
controlfile.

regards, Sergei




Re: with oids option not removed in pg_dumpall

2019-05-23 Thread Surafel Temesgen
Thank you for applying

regards
Surafel

On Thu, May 23, 2019 at 3:43 AM Michael Paquier  wrote:

> On Tue, May 21, 2019 at 05:24:57PM +0900, Michael Paquier wrote:
> > Good catch.  Your cleanup looks correct to me.  Andres, perhaps you
> > would prefer doing the cleanup yourself?
>
> As I am cleaning up the area for another issue, applied.
> --
> Michael
>


Re: nitpick about poor style in MergeAttributes

2019-05-23 Thread Mark Dilger
On Wed, May 22, 2019 at 10:21 PM Michael Paquier  wrote:
>
> On Wed, May 22, 2019 at 06:20:01PM -0700, Mark Dilger wrote:
> > What to do about this is harder to say.  In the following
> > patch, I'm just doing what I think is standard for callers
> > of list_delete_cell, and assigning the return value back
> > to the list (similar to how a call to repalloc should do).
> > But since there is an implicit assumption that the list
> > is never emptied by this operation, perhaps checking
> > against NIL and elog'ing makes more sense?
>
> Yes, I agree that this is a bit fuzzy, and this code is new as of
> 705d433.  As you say, I agree that making sure that the return value
> of list_delete_cell is not NIL is a sensible choice.
>
> I don't think that an elog() is in place here though as this does not
> rely directly on catalog contents, what about just an assertion?

I think assigning the return value (as I did in my small patch) and
then asserting that 'schema' is not NIL would be good.

> Here is an idea of message for the elog(ERROR) if we go that way:
> "no remaining columns after merging column \"%s\"".

Perhaps.  I like your idea of adding an assertion better.
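
Concretely, something along these lines (the names are illustrative,
not the exact code in MergeAttributes):

    /* assign the result back, as callers of list_delete_cell() usually do */
    schema = list_delete_cell(schema, cell, prev);
    /* ... and assert that merging a column never empties the list */
    Assert(schema != NIL);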

mark




Re: Minimal logical decoding on standbys

2019-05-23 Thread Robert Haas
On Thu, May 23, 2019 at 8:08 AM Amit Khandekar  wrote:
> This made me think more of the race conditions. For instance, in
> pg_create_logical_replication_slot(), just after
> CheckLogicalDecodingRequirements and before actually creating the
> slot, suppose concurrently Controlfile->wal_level is changed from
> logical to replica.  So suppose a new slot does get created. Later the
> slot is read, so in pg_logical_slot_get_changes_guts(),
> CheckLogicalDecodingRequirements() is called where it checks
> ControlFile->wal_level value. But just before it does that,
> ControlFile->wal_level concurrently changes back to logical, because
> of replay of another param-change record. So this logical reader will
> think that the wal_level is sufficient, and will proceed to read the
> records, but those records are *before* the wal_level change, so these
> records don't have logical data.
>
> Do you think this is possible, or I am missing something?

wal_level is PGC_POSTMASTER.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Unlogged tables cleanup

2019-05-23 Thread Robert Haas
On Thu, May 23, 2019 at 2:43 AM Michael Paquier  wrote:
> On Tue, May 21, 2019 at 08:39:18AM -0400, Robert Haas wrote:
> > Yes.  I thought I had described it.  You create an unlogged table,
> > with an index of a type that does not smgrimmedsync(), your
> > transaction commits, and then the system crashes, losing the _init
> > fork for the index.
>
> The init forks won't magically go away, except in one case for empty
> routines not going through shared buffers.

No magic is required.  If you haven't called fsync(), the file might
not be there after a system crash.

Going through shared_buffers guarantees that the file will be
fsync()'d before the next checkpoint, but I'm talking about a scenario
where you crash before the next checkpoint.

> Then, empty routines going through shared buffers fill in one or more
> buffers, mark it/them as empty, dirty it/them, log the page(s) and then
> unlock the buffer(s).  If a crash happens after the transaction
> commits, so we would still have the init page in WAL, and at the end
> of recovery we would know about it.

Yeah, but the problem is that the current system requires us to know
about it at the *beginning* of recovery.  See my earlier remarks:

Suppose we create an unlogged table and then crash. The main fork
makes it to disk, and the init fork does not.  Before WAL replay, we
remove any main forks that have init forks, but because the init fork
was lost, that does not happen.  Recovery recreates the init fork.
After WAL replay, we try to copy_file() each _init fork to the
corresponding main fork. That fails, because copy_file() expects to be
able to create the target file, and here it can't do that because it
already exists.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Fuzzy thinking in is_publishable_class

2019-05-23 Thread Peter Eisentraut
On 2019-05-09 15:41, Tom Lane wrote:
>> I think we can get rid of the ability to reload the information_schema
>> after initdb.  That was interesting in the early phase of its
>> development, but now it just creates complications.
> We've relied on that more than once to allow minor-release updates of
> information_schema views, so I think losing the ability to do it is
> a bad idea.

In those cases we used CREATE OR REPLACE VIEW, which preserves OIDs.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Remove useless associativity/precedence from parsers

2019-05-23 Thread Robert Haas
On Thu, May 23, 2019 at 12:00 AM Tom Lane  wrote:
> > One possible
> > idea is a way to mark a rule %weak, meaning that it should only be
> > used if no non-%weak rule could apply.  I'm not sure if that would
> > really be the best way, but it's one idea.  A more general version
> > would be some kind of ability to give rules different strengths; in
> > the case of a grammar conflict, the "stronger" rule would win.
>
> Hmmm ... not apparent to me that that's really going to help.
> Maybe it will, but it sounds like more likely it's just another
> mechanism with not-as-obvious-as-you-thought side effects.

That's possible; I'm open to other ideas.  If you wanted to be really
explicit about it, you could have a way to stick an optional name on a
grammar rule, and a way to say that the current rule should lose to a
list of named other rules.

It seems pretty clear, though, that our use of %prec proves that we
can't just write a grammar that is intrinsically conflict-free; we
sometimes need to have conflicts and then tell the parser generator
which option to prefer.  And I think it's also pretty clear that %prec
is, for anything other than operator precedence, a horrible way of
doing that.  A method that was merely mediocre could still be a big
improvement over what we have available today.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Minimal logical decoding on standbys

2019-05-23 Thread Amit Khandekar
On Tue, 21 May 2019 at 21:49, Andres Freund  wrote:
>
> Hi,
>
> Sorry for the late response.
>
> On 2019-04-16 12:27:46 +0530, Amit Khandekar wrote:
> > On Sat, 13 Apr 2019 at 00:57, Andres Freund  wrote:
> > > > Not sure why this is happening. On slave, wal_level is logical, so
> > > > logical records should have tuple data. Not sure what does that have
> > > > to do with wal_level of master. Everything should be there on slave
> > > > after it replays the inserts; and also slave wal_level is logical.
> > >
> > > The standby doesn't write its own WAL, only primaries do. I thought we
> > > forbade running with wal_level=logical on a standby, when the primary is
> > > only set to replica.  But that's not what we do, see
> > > CheckRequiredParameterValues().
> > >
> > > I've not yet thought this through, but I think we'll have to somehow
> > > error out in this case.  I guess we could just check at the start of
> > > decoding what ControlFile->wal_level is set to,
> >
> > By "start of decoding", I didn't get where exactly. Do you mean
> > CheckLogicalDecodingRequirements() ?
>
> Right.
>
>
> > > and then raise an error
> > > in decode.c when we pass an XLOG_PARAMETER_CHANGE record that sets
> > > wal_level to something lower?
> >
> > Didn't get where exactly we should error out. We don't do
> > XLOG_PARAMETER_CHANGE handling in decode.c , so obviously you meant
> > something else, which I didn't understand.
>
> I was indeed thinking of checking XLOG_PARAMETER_CHANGE in
> decode.c. Adding handling for that, and just checking wal_level, ought
> to be fairly doable? But, see below:
>
>
> > What I am thinking is :
> > In CheckLogicalDecodingRequirements(), besides checking wal_level,
> > also check ControlFile->wal_level when InHotStandby. I mean, when we
> > are InHotStandby, both wal_level and ControlFile->wal_level should be
> > >= WAL_LEVEL_LOGICAL. This will allow us to error out when using logical
> > slot when master has incompatible wal_level.
>
> That still allows the primary to change wal_level after logical decoding
> has started, so we need the additional checks.
>
> I'm not yet sure how to best deal with the fact that wal_level might be
> changed by the primary at basically all times. We would eventually get
> an error when logical decoding reaches the XLOG_PARAMETER_CHANGE. But
> that's not necessarily sufficient - if a primary changes its wal_level
> to lower, it could remove information logical decoding needs *before*
> logical decoding reaches the XLOG_PARAMETER_CHANGE record.
>
> So I suspect we need conflict handling in xlog_redo's
> XLOG_PARAMETER_CHANGE case. If we there check against existing logical
> slots, we ought to be safe.
>
> Therefore I think the check in CheckLogicalDecodingRequirements() needs
> to be something like:
>
> if (RecoveryInProgress())
> {
> if (!InHotStandby)
> ereport(ERROR, "logical decoding on a standby required hot_standby to 
> be enabled");
> /*
>  * This check is racy, but whenever XLOG_PARAMETER_CHANGE indicates that
>  * wal_level has changed, we verify that there are no existin glogical
>  * replication slots. And to avoid races around creating a new slot,
>  * CheckLogicalDecodingRequirements() is called once before creating the 
> slot,
>  * andd once when logical decoding is initially starting up.
>  */
> if (ControlFile->wal_level != LOGICAL)
> ereport(ERROR, "...");
> }
>
> And then add a second CheckLogicalDecodingRequirements() call into
> CreateInitDecodingContext().
>
> What do you think?

Yeah, I agree we should add such checks to minimize the possibility of
reading logical records from a master that has insufficient wal_level.
So to summarize :
a. CheckLogicalDecodingRequirements() : Add Controlfile wal_level checks
b. Call this function call in CreateInitDecodingContext() as well.
c. While decoding XLOG_PARAMETER_CHANGE record, emit recovery conflict
error if there is an existing logical slot.

This made me think more of the race conditions. For instance, in
pg_create_logical_replication_slot(), just after
CheckLogicalDecodingRequirements and before actually creating the
slot, suppose concurrently Controlfile->wal_level is changed from
logical to replica.  So suppose a new slot does get created. Later the
slot is read, so in pg_logical_slot_get_changes_guts(),
CheckLogicalDecodingRequirements() is called where it checks
ControlFile->wal_level value. But just before it does that,
ControlFile->wal_level concurrently changes back to logical, because
of replay of another param-change record. So this logical reader will
think that the wal_level is sufficient, and will proceed to read the
records, but those records are *before* the wal_level change, so these
records don't have logical data.

Do you think this is possible, or I am missing something? If that's
possible, I was considering some other mechanisms. Like, while reading
each wal_level-change record by a logical reader, save the value in
the 

Re: PostgreSQL 12 Beta 1 press release draft

2019-05-23 Thread Jonathan S. Katz
On 5/23/19 1:45 AM, David Rowley wrote:
> On Thu, 23 May 2019 at 15:31, Jonathan S. Katz  wrote:
>> Attached is
>> v3 of the patch, along with a diff.
> 
> Minor details, but this query is not valid:
> 
>> WITH c AS MATERIALIZED (
>>   SELECT * FROM a WHERE a.x % 4
>> )
>> SELECT * FROM c JOIN d ON d.y = a.x;
> 
> a.x % 4 is not a boolean clause, and "a" is not in the main query, so
> a.x can't be referenced there.

...that's the only gotcha I'm actually embarrassed about. Fixed.

Thanks,

Jonathan





Re: Read-only access to temp tables for 2PC transactions

2019-05-23 Thread Simon Riggs
On Tue, 14 May 2019 at 10:53, Stas Kelvich  wrote:


> One of the problems regarding the use of temporary tables in prepared
> transactions
> is that such transaction will hold locks for a temporary table after being
> prepared.
> That locks will prevent the backend from exiting since it will fail to
> acquire lock
> needed to delete temp table during exit. Also, re-acquiring such lock
> after server
> restart seems like an ill-defined operation.
>
...

> Any thoughts?
>

It occurs to me that there is no problem to solve here.

When we PREPARE, it is because we expect to COMMIT or ABORT soon afterwards.

If we are using an external transaction manager, the session is idle while
we wait for the manager to decide whether to commit or abort. Or the
session is disconnected or server is crashes. Either way, nothing happens
between PREPARE and resolution. So there is no need at all for locking of
temporary tables after the prepare.

The ONLY case where this matters is if someone does a PREPARE and then
starts doing other work on the session. Which makes no sense in the normal
workflow of a session. I'm sure there are tests that do that, but those
tests are unrepresentative of sensible usage.

If you were using session temporary tables while using a transaction mode
pool then you're already going to have problems, so that aspect is a
non-issue.

So I think we should ban this by definition. Say that we expect that you
won't do any work on the session until COMMIT/ABORT. That means we can then
drop locks on session temporary tables and drop on-commit temp tables when
we hit the prepare, not try and hold them for later.

A patch is needed to implement the above, but I think we can forget the
current patch as not needed.

-- 
Simon Riggshttp://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Excessive memory usage in multi-statement queries w/ partitioning

2019-05-23 Thread Julian Schauder
Hey David,


> "multiple statements in a single query", did you mean to write
> session
> or maybe transaction there?

Maybe the wording isn't perfect. It is required that the queries are
sent as a single batch. Try the exact bash script Andreas used for
updating the parent.

> Which version?

Tested on versions up to and including 11.2; initially found on 11.1.
Memory consumption scales somewhat linearly with the number of existing
partitions and the number of ';'-delimited queries per batch.


regards
-- 
Julian Schauder





Re: Patch to fix write after end of array in hashed agg initialization

2019-05-23 Thread Andrew Gierth
> "Andrew" == Andrew Gierth  writes:

 Andrew> My inclination is to fix this in the planner rather than the
 Andrew> executor; there seems no good reason to actually hash a
 Andrew> duplicate column more than once.

I take this back; I don't believe it's possible to eliminate duplicates
in all cases. Consider (a,b) IN (select c,c from...), where a,b,c are
different types; I don't think we can assume that (a=c) and (b=c)
cross-type comparisons will necessarily induce the same hash function on
c, and so we might legitimately need to keep it duplicated.

So I'm going with a simpler method of ensuring the array is adequately
sized at execution time and not touching the planner at all. Draft patch
is attached, will commit it later.

-- 
Andrew (irc:RhodiumToad)

diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 5b4a602952..6b8ef40599 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -1312,9 +1312,14 @@ build_hash_table(AggState *aggstate)
  * by themselves, and secondly ctids for row-marks.
  *
  * To eliminate duplicates, we build a bitmapset of the needed columns, and
- * then build an array of the columns included in the hashtable.  Note that
- * the array is preserved over ExecReScanAgg, so we allocate it in the
- * per-query context (unlike the hash table itself).
+ * then build an array of the columns included in the hashtable. We might
+ * still have duplicates if the passed-in grpColIdx has them, which can happen
+ * in edge cases from semijoins/distinct; these can't always be removed,
+ * because it's not certain that the duplicate cols will be using the same
+ * hash function.
+ *
+ * Note that the array is preserved over ExecReScanAgg, so we allocate it in
+ * the per-query context (unlike the hash table itself).
  */
 static void
 find_hash_columns(AggState *aggstate)
@@ -1335,6 +1340,7 @@ find_hash_columns(AggState *aggstate)
 		AttrNumber *grpColIdx = perhash->aggnode->grpColIdx;
 		List	   *hashTlist = NIL;
 		TupleDesc	hashDesc;
+		int			maxCols;
 		int			i;
 
 		perhash->largestGrpColIdx = 0;
@@ -1359,15 +1365,24 @@ find_hash_columns(AggState *aggstate)
 	colnos = bms_del_member(colnos, attnum);
 			}
 		}
-		/* Add in all the grouping columns */
-		for (i = 0; i < perhash->numCols; i++)
-			colnos = bms_add_member(colnos, grpColIdx[i]);
+
+		/*
+		 * Compute maximum number of input columns accounting for possible
+		 * duplications in the grpColIdx array, which can happen in some edge
+		 * cases where HashAggregate was generated as part of a semijoin or a
+		 * DISTINCT.
+		 */
+		maxCols = bms_num_members(colnos) + perhash->numCols;
 
 		perhash->hashGrpColIdxInput =
-			palloc(bms_num_members(colnos) * sizeof(AttrNumber));
+			palloc(maxCols * sizeof(AttrNumber));
 		perhash->hashGrpColIdxHash =
 			palloc(perhash->numCols * sizeof(AttrNumber));
 
+		/* Add all the grouping columns to colnos */
+		for (i = 0; i < perhash->numCols; i++)
+			colnos = bms_add_member(colnos, grpColIdx[i]);
+
 		/*
 		 * First build mapping for columns directly hashed. These are the
 		 * first, because they'll be accessed when computing hash values and
diff --git a/src/test/regress/expected/aggregates.out b/src/test/regress/expected/aggregates.out
index fed69dc9e1..2e5ce8cc32 100644
--- a/src/test/regress/expected/aggregates.out
+++ b/src/test/regress/expected/aggregates.out
@@ -2270,3 +2270,21 @@ select v||'a', case when v||'a' = 'aa' then 1 else 0 end, count(*)
  ba   |0 | 1
 (2 rows)
 
+-- Make sure that generation of HashAggregate for uniqification purposes
+-- does not lead to array overflow due to unexpected duplicate hash keys
+-- see cafeejokku0u+a_a9r9316djw-yw3-+gtgvy3ju655qrhr3j...@mail.gmail.com
+explain (costs off)
+  select 1 from tenk1
+   where (hundred, thousand) in (select twothousand, twothousand from onek);
+ QUERY PLAN  
+-
+ Hash Join
+   Hash Cond: (tenk1.hundred = onek.twothousand)
+   ->  Seq Scan on tenk1
+ Filter: (hundred = thousand)
+   ->  Hash
+ ->  HashAggregate
+   Group Key: onek.twothousand, onek.twothousand
+   ->  Seq Scan on onek
+(8 rows)
+
diff --git a/src/test/regress/sql/aggregates.sql b/src/test/regress/sql/aggregates.sql
index 230f44666a..ca0d5e66b2 100644
--- a/src/test/regress/sql/aggregates.sql
+++ b/src/test/regress/sql/aggregates.sql
@@ -988,3 +988,10 @@ select v||'a', case v||'a' when 'aa' then 1 else 0 end, count(*)
 select v||'a', case when v||'a' = 'aa' then 1 else 0 end, count(*)
   from unnest(array['a','b']) u(v)
  group by v||'a' order by 1;
+
+-- Make sure that generation of HashAggregate for uniqification purposes
+-- does not lead to array overflow due to unexpected duplicate hash keys
+-- see cafeejokku0u+a_a9r9316djw-yw3-+gtgvy3ju655qrhr3j...@mail.gmail.com
+explain (costs off)
+  select 1 from tenk1
+   where 

Re: Create function using quote_literal issues

2019-05-23 Thread Mohamed DIA
I found the solution by defining r as record and using
 FOR r in EXECUTE v_select

Thanks

On Thu, May 23, 2019 at 9:49 AM Mohamed DIA  wrote:

> Hi,
> I am trying to use a create function in order to update some values in a
> table (see below code).
> However, when I run the function, it never enters into the following loop
> *FOR r IN SELECT * FROM immatriculationemployeursucctemp2 where
> succursale = quote_literal(s.succursale) order by row_number*
>
> However, if I remove the condition *where  succursale =
> quote_literal(s.succursale)*   then it works
>
> I need to filter on every value of succursale
> Is there a way to achieve it without removing ?
> Any help will be appreciated. I'm struggling with it for a while now
>
> CREATE OR REPLACE FUNCTION create_new_emp_succ_numbers() RETURNS SETOF
> list_succursale AS
> $BODY$
> DECLARE
> r immatriculationemployeursucctemp2%rowtype;
> s list_succursale%rowtype;
>seq_priv INTEGER := 1;
>
> BEGIN
>
> FOR s IN SELECT * FROM list_succursale where  succursale
> in('010100062D1','010102492S1')
>
> LOOP
>
>
> FOR r IN SELECT * FROM immatriculationemployeursucctemp2 where
> succursale = quote_literal(s.succursale) order by row_number
>
>
> LOOP
>
> update immatriculationemployeursucctemp set no_employeur= '10' ||
> lpad(seq_priv::text,6,'0') || '0' || r.row_number-1 where employer_type=10
> and id=r.id;
>
>
>
> END LOOP;
> seq_priv := seq_priv + 1;
>  RETURN NEXT s;
> END LOOP;
>
> RETURN;
> END
> $BODY$
> LANGUAGE 'plpgsql' ;
>
> SELECT * FROM create_new_emp_succ_numbers();
>


Re: Excessive memory usage in multi-statement queries w/ partitioning

2019-05-23 Thread David Rowley
On Thu, 23 May 2019 at 21:19, Julian Schauder
 wrote:
> > "multiple statements in a single query", did you mean to write
> > session
> > or maybe transaction there?
>
> Maybe the wording isn't perfect. It is required that the querys are
> sent as a single batch. Try the exact bash-script Andreas used for
> updating the parent.

Thanks for explaining.

> > Which version?
>
> Tested including 11.2. Initially found on 11.1. Memory-consumption
> Scales somewhat linearly with existing partitions and ';' delimited
> Querys per single Batch.

Yeah, unfortunately, if the batch contains 100 of those statements
then the planner is going to eat 100 times the memory since it stores
all 100 plans at once.

Since you're pruning all but 1 partition, the situation should be
much better for you when you can upgrade to v12. Unfortunately, that's
still about 5 months away.

The best thing you can do for now is going to be either reduce the
number of partitions or reduce the number of statements in the
batch... or install more memory.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Create function using quote_literal issues

2019-05-23 Thread Mohamed DIA
Hi,
I am trying to use a create function in order to update some values in a
table (see below code).
However, when I run the function, it never enters into the following loop
*FOR r IN SELECT * FROM immatriculationemployeursucctemp2 where  succursale
= quote_literal(s.succursale) order by row_number*

However, if I remove the condition *where  succursale =
quote_literal(s.succursale)*   then it works

I need to filter on every value of succursale.
Is there a way to achieve this without removing the condition?
Any help will be appreciated. I've been struggling with this for a while now.

CREATE OR REPLACE FUNCTION create_new_emp_succ_numbers() RETURNS SETOF
list_succursale AS
$BODY$
DECLARE
r immatriculationemployeursucctemp2%rowtype;
s list_succursale%rowtype;
   seq_priv INTEGER := 1;

BEGIN

FOR s IN SELECT * FROM list_succursale where  succursale
in('010100062D1','010102492S1')

LOOP


FOR r IN SELECT * FROM immatriculationemployeursucctemp2 where
succursale = quote_literal(s.succursale) order by row_number


LOOP

update immatriculationemployeursucctemp set no_employeur= '10' ||
lpad(seq_priv::text,6,'0') || '0' || r.row_number-1 where employer_type=10
and id=r.id;



END LOOP;
seq_priv := seq_priv + 1;
 RETURN NEXT s;
END LOOP;

RETURN;
END
$BODY$
LANGUAGE 'plpgsql' ;

SELECT * FROM create_new_emp_succ_numbers();


Re: "long" type is not appropriate for counting tuples

2019-05-23 Thread Peter Eisentraut
On 2019-05-22 21:21, Tom Lane wrote:
> Peter Eisentraut  writes:
>> Could we define int64 to be long long int on all platforms and just
>> always use %lld?
> 
> Hmmm ... maybe.  Once upon a time we had to cope with compilers
> that didn't have "long long", but perhaps that time is past.

It's required by C99, and the configure test for C99 checks it.

> Another conceivable hazard is somebody deciding that the world
> needs a platform where "long long" is 128 bits.  I don't know
> how likely that is to happen.

Another option is that in cases where it doesn't affect storage layouts,
like the counting tuples case that started this thread, code could just
use long long int directly instead of int64.  Then if someone wants to
make it 128 bits or 96 bits or whatever it would not be a problem.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Should we warn against using too many partitions?

2019-05-23 Thread David Rowley
Over on [1] I raised a concern about the lack of any warning in our
documents to inform users that they might not want to use thousands of
partitions.  More recently there's [2], also suffering from OOM using
100 partitions.  Perhaps there's more to this, but the planner using
a lot of memory planning updates and deletes to partitioned tables
does seem to be a surprise to many people.

I had hoped we could get something in the documents sooner rather than
later about this. Probably the v12 patch will need to be adjusted now
that the memory consumption will be reduced when many partitions are
pruned, but I still think v12 needs to have some sort of warning in
there.

https://commitfest.postgresql.org/23/2065/

I'm moving this to a new thread with a better title, rather than
tagging onto that old thread that's become rather long.

[1] 
https://www.postgresql.org/message-id/cakjs1f8rw-mhq8aewd5dv0+8a1wh5thhdymgw9y5sxqne0x...@mail.gmail.com
[2] https://www.postgresql.org/message-id/87ftp6l2qr@credativ.de

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: Why could GEQO produce plans with lower costs than the standard_join_search?

2019-05-23 Thread Andrew Gierth
> "Finnerty" == Finnerty, Jim  writes:

 Finnerty>          planstate->total_cost   cheapest_total_path
 Finnerty> GEQO     54190.13                54239.03
 Finnerty> STD      54179.02                54273.73

These differences aren't significant - the standard join search has a
"fuzz factor" built into it, such that paths have to be more than 1%
better in cost in order to actually be considered as being better than
an existing path.
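
For illustration, the effect is roughly this (a simplified sketch of the
fuzzy comparison; the real logic lives in add_path(), and the names below
are made up):

    /* A new path only displaces an existing one if it is clearly cheaper. */
    #define FUZZ_FACTOR 1.01

    static bool
    clearly_cheaper(double new_cost, double old_cost)
    {
        /* must beat the old cost by more than about 1% */
        return new_cost * FUZZ_FACTOR < old_cost;
    }

A GEQO result and a standard-search result whose total costs differ by a
fraction of a percent, as above, are therefore effectively ties.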

-- 
Andrew (irc:RhodiumToad)



