Re: 2019-03 CF Summary / Review - Tranche #2

2019-02-15 Thread Tom Lane
Andres Freund  writes:
> - Inline Common Table Expressions

>   NR: I think it's likely that Tom will commit this soon, we're mostly
>   debating syntax and similar things at this point (and man, I'm looking
>   forward to this).

Unless there are a bunch of votes real soon against the [NOT] MATERIALIZED
syntax, I'm going to commit it that way.  "Real soon" means "probably
tomorrow".

regards, tom lane



Re: 2019-03 CF Summary / Review - Tranche #2

2019-02-15 Thread Peter Geoghegan
On Fri, Feb 15, 2019 at 9:45 PM Andres Freund  wrote:
> - Make nbtree keys unique by appending heap TID, suffix truncation
>
>   NR: Important, seemingly carefully worked on feature. But it's
>   *complicated* stuff, and there's only been a bit of design review by
>   Heikki. The author's a committer.  I don't feel qualified to judge
>   this.

I think that's fair. The best way of understanding the risks as a
non-expert is to think of the patch as having two distinct components:

1. Make nbtree entries unique + add suffix truncation -- the basic mechanism.

2. Make more intelligent decisions about where to split pages, to work
with suffix truncation, while still mostly caring about the balance of
free space between the two halves of the split, and about a number of
other new concerns besides.

You're right that some parts of the patch are very complicated, but
those are all contained in the second component. That has been the
main focus of Heikki's review, by far. This second component is
concerned with picking a split point that is already known to be legal
based on the *existing* criteria. If there were bugs here, they could
not result in data corruption. The worst outcome I can conceive of is
that an index would be bloated in a new and novel way. It would be
possible to correct that in a point release without breaking on-disk
compatibility. That would be painful, certainly, but it's far from the
worst outcome.

Granted, there are also one or two subtle things in the first, more
critical component, but these are also the things that were
established earliest and have received the most testing. And, amcheck
is now capable of doing point lookups using the same code paths as
index scans (calling _bt_search()) to relocate each and every tuple on
the leaf level, starting from the root. The first component does not
change anything about how crash recovery or VACUUM works, either. It's
all about how keys compare, and how new pivot tuples are generated --
it's mostly about the key space, while changing very little about the
physical on-disk representation. (It builds on the on-disk
representation changes added in Postgres 11, for INCLUDE indexes.)
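
To make that first component a bit more concrete, here is a minimal,
hypothetical sketch of the attribute-level idea behind suffix truncation
(the function and the integer-key representation are invented purely for
illustration; the real code operates on IndexTuples): a new pivot tuple
only needs as many leading attributes as it takes to distinguish the last
tuple on the left half from the first tuple on the right half.

#include <stdio.h>

/*
 * Hypothetical illustration only -- not code from the patch.  Keep as many
 * leading attributes of the would-be pivot (firstright) as are needed to
 * distinguish it from the last tuple on the left half (lastleft); trailing
 * attributes carry no information needed to guide index descents.
 */
static int
pivot_natts_needed(const int *lastleft, const int *firstright, int natts)
{
	int			keep;

	for (keep = 1; keep <= natts; keep++)
	{
		/* the first attribute that differs is enough to separate the pages */
		if (lastleft[keep - 1] != firstright[keep - 1])
			return keep;
	}
	/* all attributes equal: keep them all (heap TID then acts as tiebreaker) */
	return natts;
}

int
main(void)
{
	int			lastleft[] = {5, 3, 7};
	int			firstright[] = {5, 4, 1};

	/* prints 2: the truncated pivot (5, 4) is enough to separate the pages */
	printf("%d\n", pivot_natts_needed(lastleft, firstright, 3));
	return 0;
}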

> - SortSupport implementation on inet/cdir
>
>   WOA: This is a substantial amount of code, submitted for the first
>   time in the last CF.
>
>   Andres: punt to v13

I was kind of involved here. I think that it's fair to punt, based on
the rule about submitting a big patch to the last CF.

> - amcheck verification for GiST
>
>   WOA: Some locking changes are apparently needed, possibly a bit too
>   much to fix up for v12?

I had hoped that Andrey Borodin would get back to me on this soon. It
does still seem unsettled.

> - insensitive/non-deterministic collations
>
>   NR: Peter has stated that he's targeting v12, but I'm not sure this
>   had enough review? But it's not *that* complicated...

I could help out here.

--
Peter Geoghegan



Re: 2019-03 CF Summary / Review - Tranche #2

2019-02-15 Thread Pavel Stehule
Hi



> - block level PRAGMA
>
>   NR: My reading of this thread is that the proposal is closer to being
>   rejected than merged.
>
>   Andres: reject or punt?
>
>
This patch is very simple and makes a lot of sense for users of
plpgsql_check.  For the moment, only plpgsql_check's users can get some
benefit from it. But if we implement autonomous transactions, and I hope
this feature will be implemented, then this code can be used for an
Oracle PL/SQL compatible syntax. There is no disadvantage - it is clean,
and compatible with Ada and PL/SQL. I implemented just block-level
PRAGMA, and there was no disagreement about that.

Regards

Pavel


Re: Too rigorous assert in reorderbuffer.c

2019-02-15 Thread Arseny Sher

Alvaro Herrera  writes:

> Thanks for checking!  I also run it on all branches, everything passes.
> Pushed now.

I'm sorry to bother you with this again, but thanks to the new test our
internal buildfarm revealed that the adjacent assert on cmin is also
wrong. You see, we can't assume cmin is stable, because the same key
(relnode, tid) might refer to completely different tuples if a tuple was
inserted by an aborted subxact, immediately reclaimed, and the space then
occupied by another one. Fix is attached.

Technically this might mean a user-facing bug, because we only pick the
first cmin, which means we might get visibility wrong, allowing some
version to be seen too early (i.e. the real cmin of the tuple is y, but
decoding thinks it is x, and x < y). However, I couldn't quickly come up
with an example where this would actually lead to bad consequences. I
tried to create such an extra visible row in pg_attribute, but that's ok
because the loop in RelationBuildTupleDesc spins exactly natts times and
ignores what is left unscanned. It is also ok with pg_class, because
apparently ScanPgRelation also fishes out the (right) first tuple and
doesn't check for duplicates appearing later in the scan. Maybe I just
haven't tried hard enough though.

Attached 'aborted_subxact_test.patch' is an illustration of such wrong
cmin visibility on pg_attribute. It triggers the assertion failure, but
is otherwise ok (no user-facing issues), as I said earlier, so I am
disinclined to include it in the fix.

--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 2b486b5e9f..505c7ff134 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1318,29 +1318,36 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		(void *) &key,
 		HASH_ENTER | HASH_FIND,
 		&found);
-		if (!found)
-		{
-			ent->cmin = change->data.tuplecid.cmin;
-			ent->cmax = change->data.tuplecid.cmax;
-			ent->combocid = change->data.tuplecid.combocid;
-		}
-		else
+		if (found)
 		{
 			/*
-			 * Maybe we already saw this tuple before in this transaction,
-			 * but if so it must have the same cmin.
+			 * Tuple with such key (relnode, tid) might be inserted in aborted
+			 * subxact, space reclaimed and then filled with another tuple
+			 * from our xact or another; then we might update it again. It
+			 * means cmin can become valid/invalid at any moment, but valid
+			 * values are always monotonic.
 			 */
-			Assert(ent->cmin == change->data.tuplecid.cmin);
+			Assert((ent->cmin == InvalidCommandId) ||
+   (change->data.tuplecid.cmin == InvalidCommandId) ||
+   (change->data.tuplecid.cmin >= ent->cmin));
 
 			/*
-			 * cmax may be initially invalid, but once set it can only grow,
-			 * and never become invalid again.
+			 * Practically the same for cmax; to see invalid cmax after
+			 * valid, we need to insert new tuple in place of one reclaimed
+			 * from our aborted subxact.
 			 */
 			Assert((ent->cmax == InvalidCommandId) ||
-   ((change->data.tuplecid.cmax != InvalidCommandId) &&
-	(change->data.tuplecid.cmax > ent->cmax)));
-			ent->cmax = change->data.tuplecid.cmax;
+   (change->data.tuplecid.cmax == InvalidCommandId) ||
+   (change->data.tuplecid.cmax > ent->cmax));
 		}
+
+		/*
+		 * The 'right' (potentially belonging to committed xact) cmin and cmax
+		 * are always the latest.
+		 */
+		ent->cmin = change->data.tuplecid.cmin;
+		ent->cmax = change->data.tuplecid.cmax;
+		ent->combocid = change->data.tuplecid.combocid;
 	}
 }
 
diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 4afb1d963e..32c2650d1a 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -3,11 +3,11 @@
 MODULES = test_decoding
 PGFILEDESC = "test_decoding - example of a logical decoding output plugin"
 
-REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
+# REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 	decoding_into_rel binary prepared replorigin time messages \
 	spill slot truncate
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
-	oldest_xmin snapshot_transfer
+	oldest_xmin snapshot_transfer aborted_subxact
 
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
diff --git a/contrib/test_decoding/specs/aborted_subxact.spec b/contrib/test_decoding/specs/aborted_subxact.spec
new file mode 100644
index 00..cb066fd5d2
--- /dev/null
+++ b/contrib/test_decoding/specs/aborted_subxact.spec
@@ -0,0 +1,35 @@
+# Test DDL in aborted subxact
+
+setup
+{
+SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding'); -- must be first write in xact
+CREATE TABLE harvest(apples int);
+}
+
+teardown
+{
+DROP TABLE 

Re: 2019-03 CF Summary / Review - Tranche #2

2019-02-15 Thread Andres Freund
Hi,

On 2019-02-14 12:37:52 -0800, Andres Freund wrote:
> - pg_stat_statements should notice FOR UPDATE clauses
>
>   NR: This seems likely to get in, given how simple it is. Some quibbles
>   about the appropriate approach aside.
>
>
> Ok, my flight's about to land. So that's it for this round.


- Protect syscache from bloating with negative cache entries

  WOA: I think unless the feature is drastically scaled down in scope,
  there's no chance to get anything into v12.

  Andres: punt to v13, unless a smaller chunk can be split off


- SERIALIZABLE with parallel query

  NR: This seems like it's pretty much committable, and the author is a
  committer these days.


- Removing [Merge]Append nodes which contain a single subpath

  NR: I can't quite tell the state of this patch just by reading the
  thread. It's longstanding, and the code doesn't look terrible, but Tom
  appears to still be a bit unhappy.

  Andres: ???


- verify ALTER TABLE SET NOT NULL by valid constraints

  NR: Seems like a pretty useful feature. The code isn't very
  invasive. The patch has been lingering around for a while. We should
  merge this.


- Reduce amount of WAL generated by CREATE INDEX for gist, gin and
  sp-gist

  NR: This unfortunately has barely gotten any review so far, and has
  had a number of issues the authors discovered themselves.  It's a bit
  sad that a useful patch has gotten this little review, but I think
  it's probably a stretch to get it into v12 unless some senior
  reviewers show up.


- GiST VACUUM

  NR: Has gotten a fair bit of review from Heikki, but there still seem
  to be a number of unresolved issues. Not sure there are cycles to get
  this done unless Heikki has time.


- libpq compression

  NR: While a very useful feature, the patch seems pretty raw, there are
  disagreements about code structure, and some cryptographic risks need
  to be discussed or at least better documented.

  Andres: punt to v13


- Evaluate immutable functions during planning (in FROM clause)

  NR: As far as I can tell this CF entry should have been either WO or
  even rejected for the last two CFs.  Even if the review feedback had
  been addressed, it seems there are very substantial architectural
  concerns that haven't been responded to.

  Andres: no chance for v12, CF entry should probably be closed.


- Global shared meta cache

  NR: This is extremely invasive, and is more PoC work than anything
  else.

  Andres: punt to v13


- Remove self join on a unique column

  NR: This still seems pretty raw, and there's not been a ton of
  detailed review (although plenty of more general discussion). I don't
  quite see how we can get this into shape for v12, although some review
  would be good to allow the feature to progress.

  Andres: punt to v13


- Inline Common Table Expressions

  NR: I think it's likely that Tom will commit this soon, we're mostly
  debating syntax and similar things at this point (and man, I'm looking
  forward to this).


- Autoprepare: implicitly replace literals with parameters and store
  generalized plan

  NR: I think there's no chance to get this into v12, given the state of
  the patch. There's not even agreement that we want this feature
  (although I think we can't avoid it for much longer), not to speak of
  agreement on the architecture.

  I think this needs a lot more attention to ever get anywhere.

  Andres: punt to v13


- Tid scan improvements (ordering and range scan)

  NR: The patch has been through recent significant architectural
  changes, albeit to an architecture more similar to an earlier
  approach. There's not been meaningful review since. On the other hand,
  the patch isn't actually all that complex. Seems like a stretch to get
  into v12, but possible if e.g. Tom wants to pick it up.

  Andres: +0.5 for punting to v13


- Block level parallel vacuum

  NR: Cool feature, but I don't think this has gotten even remotely
  enough review to be mergeable into v12.

  Andres: punt to v13


- Speed up planning with partitions

  NR: Important feature, but based on a skim of the discussion this
  doesn't seem ready for v12.

  Andres: punt to v13


- Make nbtree keys unique by appending heap TID, suffix truncation

  NR: Important, seemingly carefully worked on feature. But it's
  *complicated* stuff, and there's only been a bit of design review by
  Heikki. The author's a committer.  I don't feel qualified to judge
  this.


- KNN for B-tree

  NR: The patch still seems to lack a bit of review. Possible but a
  stretch for v12. (While the thread is quite old, there've been
  substantial gaps where it wasn't worked on, so I don't think it's one
  of the really bad cases of not getting review.)

  Andres: +0.5 for punting to v13


- New vacuum option to do only freezing

  NR: Seems simple enough. We probably can merge this.


- Speed up transaction completion faster after many relations are
  accessed in a transaction

  NR: This has only been submitted 2019-02-12. While 

Re: Conflict handling for COPY FROM

2019-02-15 Thread Andres Freund
Hi,

On 2018-08-23 17:11:04 +0300, Surafel Temesgen wrote:
> COPY ... WITH ON CONFLICT LOG maximum_error, LOG FILE NAME '…';

This doesn't seem to address Robert's point that a log file would have to
be superuser-only, which seems to restrict the feature more than
necessary?

- Andres



Re: chained transactions

2019-02-15 Thread Andres Freund
Hi,

On 2019-01-02 16:02:21 +0100, Peter Eisentraut wrote:
> +++ b/src/backend/access/transam/xact.c
> @@ -189,6 +189,7 @@ typedef struct TransactionStateData
>   boolstartedInRecovery;  /* did we start in recovery? */
>   booldidLogXid;  /* has xid been included in WAL 
> record? */
>   int parallelModeLevel;  /* 
> Enter/ExitParallelMode counter */
> + boolchain;  /* start a new block after this 
> one */
>   struct TransactionStateData *parent;/* back link to parent */
>  } TransactionStateData;
>  
> @@ -2760,6 +2761,36 @@ StartTransactionCommand(void)
>   MemoryContextSwitchTo(CurTransactionContext);
>  }
>  
> +
> +/*
> + * Simple system for saving and restoring transaction characteristics
> + * (isolation level, read only, deferrable).  We need this for transaction
> + * chaining, so that we can set the characteristics of the new transaction to
> + * be the same as the previous one.  (We need something like this because the
> + * GUC system resets the characteristics at transaction end, so for example
> + * just skipping the reset in StartTransaction() won't work.)
> + */
> +static int   save_XactIsoLevel;
> +static bool  save_XactReadOnly;
> +static bool  save_XactDeferrable;

We normally don't define variables in the middle of a file?  Also, why
do these need to be global vars rather than defined where we do
chaining? I'm imagining a SavedTransactionState struct declared on the
stack that's then passed to Save/Restore?

Not crucial, but I do wonder if we can come up with a prettier approach
for this.
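
For illustration, a minimal sketch of that idea (hypothetical code, not
the submitted patch; XactIsoLevel, XactReadOnly and XactDeferrable are
the existing GUC-backed globals) could look like:

/*
 * Hypothetical sketch: the saved transaction characteristics live in a
 * caller-provided struct rather than in file-level globals, so the
 * save/restore pair can be used wherever the chaining is actually done.
 */
typedef struct SavedTransactionCharacteristics
{
	int			save_XactIsoLevel;
	bool		save_XactReadOnly;
	bool		save_XactDeferrable;
} SavedTransactionCharacteristics;

static void
SaveTransactionCharacteristics(SavedTransactionCharacteristics *s)
{
	s->save_XactIsoLevel = XactIsoLevel;
	s->save_XactReadOnly = XactReadOnly;
	s->save_XactDeferrable = XactDeferrable;
}

static void
RestoreTransactionCharacteristics(const SavedTransactionCharacteristics *s)
{
	XactIsoLevel = s->save_XactIsoLevel;
	XactReadOnly = s->save_XactReadOnly;
	XactDeferrable = s->save_XactDeferrable;
}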

Greetings,

Andres Freund



Re: SQL/JSON: functions

2019-02-15 Thread Andres Freund
Hi,

On 2018-12-05 02:01:19 +0300, Nikita Glukhov wrote:

> + if (!PG_ARGISNULL(1) &&
> + strncmp("any", VARDATA(type), VARSIZE_ANY_EXHDR(type)))
> + {
> + JsonLexContext *lex;
> + JsonTokenType tok;
> +
> + lex = makeJsonLexContext(json, false);
> +
> + /* Lex exactly one token from the input and check its type. */
> + PG_TRY();
> + {
> + json_lex(lex);
> + }
> + PG_CATCH();
> + {
> + if (ERRCODE_TO_CATEGORY(geterrcode()) == 
> ERRCODE_DATA_EXCEPTION)
> + {
> + FlushErrorState();
> + MemoryContextSwitchTo(mcxt);
> + PG_RETURN_BOOL(false);  /* invalid json */
> + }
> + PG_RE_THROW();
> + }
> + PG_END_TRY();


It baffles me that a year after I raised this as a serious issue, in
this thread, this patch still contains code like this.


Greetings,

Andres Freund



Re: Control your disk usage in PG: Introduction to Disk Quota Extension

2019-02-15 Thread Andres Freund
Hi,


On 2019-01-30 10:26:52 +0800, Hubert Zhang wrote:
> Hi Michael, Robert
> For your question about the hook position, I want to explain more about the
> background why we want to introduce these hooks.
> We wrote a diskquota extension 
> [ ...]
> On Tue, Jan 22, 2019 at 12:08 PM Hubert Zhang  wrote:
> 
> > > For this particular purpose, I don't immediately see why you need a
> >> > hook in both places.  If ReadBuffer is called with P_NEW, aren't we
> >> > guaranteed to end up in smgrextend()?
> >> Yes, that's a bit awkward.

Please note that on PG lists the customary and desired style is to quote
inline in messages rather than top-quote.

Greetings,

Andres Freund



Re: [PATCH v20] GSSAPI encryption support

2019-02-15 Thread Andres Freund
Hi,

On 2018-12-18 14:12:46 -0500, Robbie Harwood wrote:
> From 6915ae2507bf7910c5eecfbd0b84805531c16a07 Mon Sep 17 00:00:00 2001
> From: Robbie Harwood 
> Date: Thu, 10 May 2018 16:12:03 -0400
> Subject: [PATCH] libpq GSSAPI encryption support
> 
> On both the frontend and backend, prepare for GSSAPI encryption
> support by moving common code for error handling into a separate file.
> Fix a TODO for handling multiple status messages in the process.
> Eliminate the OIDs, which have not been needed for some time.
> Add frontend and backend encryption support functions.  Keep the
> context initiation for authentication-only separate on both the
> frontend and backend in order to avoid concerns about changing the
> requested flags to include encryption support.
> 
> In postmaster, pull GSSAPI authorization checking into a shared
> function.  Also share the initiator name between the encryption and
> non-encryption codepaths.
> 
> Modify pqsecure_write() to take a non-const pointer.  The pointer will
> not be modified, but this change (or a const-discarding cast, or a
> malloc()+memcpy()) is necessary for GSSAPI due to const/struct
> interactions in C.
> 
> For HBA, add "hostgss" and "hostnogss" entries that behave similarly
> to their SSL counterparts.  "hostgss" requires either "gss", "trust",
> or "reject" for its authentication.
> 
> Similarly, add a "gssmode" parameter to libpq.  Supported values are
> "disable", "require", and "prefer".  Notably, negotiation will only be
> attempted if credentials can be acquired.  Move credential acquisition
> into its own function to support this behavior.
> 
> Finally, add documentation for everything new, and update existing
> documentation on connection security.
> 
> Thanks to Michael Paquier for the Windows fixes.

Could some of these be split into separate patches that could be more
eagerly merged? This is a somewhat large patch...


> diff --git a/src/interfaces/libpq/fe-secure.c 
> b/src/interfaces/libpq/fe-secure.c
> index a06fc7dc82..f4f196e3b4 100644
> --- a/src/interfaces/libpq/fe-secure.c
> +++ b/src/interfaces/libpq/fe-secure.c
> @@ -220,6 +220,13 @@ pqsecure_read(PGconn *conn, void *ptr, size_t len)
>   n = pgtls_read(conn, ptr, len);
>   }
>   else
> +#endif
> +#ifdef ENABLE_GSS
> + if (conn->gssenc)
> + {
> + n = pg_GSS_read(conn, ptr, len);
> + }
> + else
>  #endif
>   {
>   n = pqsecure_raw_read(conn, ptr, len);
> @@ -287,7 +294,7 @@ pqsecure_raw_read(PGconn *conn, void *ptr, size_t len)
>   * to determine whether to continue/retry after error.
>   */
>  ssize_t
> -pqsecure_write(PGconn *conn, const void *ptr, size_t len)
> +pqsecure_write(PGconn *conn, void *ptr, size_t len)
>  {
>   ssize_t n;
>  
> @@ -297,6 +304,13 @@ pqsecure_write(PGconn *conn, const void *ptr, size_t len)
>   n = pgtls_write(conn, ptr, len);
>   }
>   else
> +#endif
> +#ifdef ENABLE_GSS
> + if (conn->gssenc)
> + {
> + n = pg_GSS_write(conn, ptr, len);
> + }
> + else
>  #endif
>   {
>   n = pqsecure_raw_write(conn, ptr, len);

Not a fan of this. Seems like we'll grow more and more such branches
over time? Wouldn't the right thing be to have callbacks in PGconn (and
similarly in the backend)?  Seems like if that's done reasonably it'd
also make integration of compression easier, because that could just
layer itself between encryption and wire?
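
For illustration, here's a minimal, self-contained sketch of the callback
idea (all names invented; this is not libpq's actual API): each transport
-- raw, TLS, GSSAPI, or a future compression layer -- would install its
own pair of callbacks at negotiation time, and the generic code would
just dispatch through them:

#include <stddef.h>
#include <sys/types.h>

typedef struct ExampleConn ExampleConn;

/* transport-specific I/O routines, installed once during negotiation */
typedef ssize_t (*secure_read_fn) (ExampleConn *conn, void *ptr, size_t len);
typedef ssize_t (*secure_write_fn) (ExampleConn *conn, void *ptr, size_t len);

struct ExampleConn
{
	int				sock;
	secure_read_fn	read_cb;	/* raw, TLS, or GSSAPI reader */
	secure_write_fn	write_cb;	/* raw, TLS, or GSSAPI writer */
};

/* generic code dispatches through the callbacks: no #ifdef ladder needed */
static ssize_t
example_secure_read(ExampleConn *conn, void *ptr, size_t len)
{
	return conn->read_cb(conn, ptr, len);
}

static ssize_t
example_secure_write(ExampleConn *conn, void *ptr, size_t len)
{
	return conn->write_cb(conn, ptr, len);
}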


Greetings,

Andres Freund



Re: REL_11_STABLE: dsm.c - cannot unpin a segment that is not pinned

2019-02-15 Thread Thomas Munro
On Sat, Feb 16, 2019 at 3:38 PM Justin Pryzby  wrote:
> I saw this error once last week while stress testing to reproduce earlier 
> bugs,
> but tentatively thought it was a downstream symptom of those bugs (since
> fixed), and now wanted to check that #15585 and others were no longer
> reproducible.  Unfortunately I got this error while running same test case [2]
> as for previous bug ('could not attach').
>
> 2019-02-14 23:40:41.611 MST [32287] ERROR:  cannot unpin a segment that is 
> not pinned
>
> On commit faf132449c0cafd31fe9f14bbf29ca0318a89058 (REL_11_STABLE including
> both of last week's post-11.2 DSA patches), I reproduced twice, once within
> ~2.5 hours, once within 30min.
>
> I'm not able to reproduce on master running overnight and now 16+hours.

Oh, I think I know why: dsm_unpin_segment() contains another variant
of the race fixed by 6c0fb941 (that was for dsm_attach() being
confused by segments with the same handle that are concurrently going
away, but dsm_unpin_segment() does a handle lookup too, so it can be
confused by the same phenomenon).  Untested, but the fix is probably:

diff --git a/src/backend/storage/ipc/dsm.c b/src/backend/storage/ipc/dsm.c
index cfbebeb31d..23ccc59f13 100644
--- a/src/backend/storage/ipc/dsm.c
+++ b/src/backend/storage/ipc/dsm.c
@@ -844,8 +844,8 @@ dsm_unpin_segment(dsm_handle handle)
LWLockAcquire(DynamicSharedMemoryControlLock, LW_EXCLUSIVE);
for (i = 0; i < dsm_control->nitems; ++i)
{
-   /* Skip unused slots. */
-   if (dsm_control->item[i].refcnt == 0)
+   /* Skip unused slots and segments that are concurrently going away. */
+   if (dsm_control->item[i].refcnt <= 1)
continue;

/* If we've found our handle, we can stop searching. */

-- 
Thomas Munro
http://www.enterprisedb.com



Re: [PATCH] pg_hba.conf : new auth option : clientcert=verify-full

2019-02-15 Thread Andres Freund
Hi,

On 2019-02-06 15:19:56 +, Timmer, Marius wrote:
> On Mon, Jan 04, 2019 at 03:06, Michael Paquier wrote:
> > I have moved the patch to next CF, waiting on author as the latest
> > patch does not apply.  Could it be rebased?

> The patch is rebased and applies now.

I see you've marked the patch as needs review - but as the patch
previously was marked as ready-for-committer, and I assume nothing
substantial has changed, I think RFC might still be the accurate state?

- Andres



Re: Remove Deprecated Exclusive Backup Mode

2019-02-15 Thread Andres Freund
On 2019-01-05 13:19:20 -0500, Stephen Frost wrote:
> Greetings,
> 
> * Peter Eisentraut (peter.eisentr...@2ndquadrant.com) wrote:
> > On 12/12/2018 05:31, Robert Haas wrote:
> > > Most of the features I've been involved in removing have been
> > > deprecated for 5+ years.  The first release where this one was
> > > deprecated was only 2 years ago.  So it feels dramatically faster to
> > > me than what I think we have typically done.
> > 
> > I was just looking this up as well, and I find it too fast.  The
> > nonexclusive backups were introduced in 9.6.  So I'd say that we could
> > remove the exclusive ones when 9.5 goes EOL.  (That would mean this
> > patch could be submitted for PostgreSQL 13, since 9.5 will go out of
> > support around the time PG13 would be released.)
> 
> I don't agree with either the notion that we have to wait 5 years in
> this case or that we've only had a good alternative to the exclusive
> backup mode since 9.5 as we've had pg_basebackup since 9.1.

I don't agree with a general 5 year deprecation window either. But it
seems pretty clear that there's no majority for removing exclusive
backups in v12.   I think it'd be good to make the warning about its
impending death more explicit, but otherwise mark this CF entry either
as rejected or returned with feedback.

Greetings,

Andres Freund



Re: allow online change primary_conninfo

2019-02-15 Thread Andres Freund
Hi,

On 2019-02-03 15:33:38 +0300, Sergei Kornilov wrote:
> +/*
> + * Actual processing SIGHUP signal
> + */
> +static void
> +ProcessWalRcvSigHup(void)
> +{
> + ProcessConfigFile(PGC_SIGHUP);
> +
> + /*
> +  * If primary_conninfo has been changed while walreceiver is running,
> +  * shut down walreceiver so that a new walreceiver is started and
> +  * initiates replication with the new connection information.
> +  */
> + if (strcmp(current_conninfo, PrimaryConnInfo) != 0)
> + ereport(FATAL,
> + (errcode(ERRCODE_ADMIN_SHUTDOWN),
> +  errmsg("terminating walreceiver process due to 
> change of primary_conninfo"),
> +  errdetail("In a moment starts streaming WAL 
> with new configuration.")));
> +
> + /*
> +  * And the same for primary_slot_name.
> +  */
> + if (strcmp(current_slotname, PrimarySlotName) != 0)
> + ereport(FATAL,
> + (errcode(ERRCODE_ADMIN_SHUTDOWN),
> +  errmsg("terminating walreceiver process due to 
> change of primary_slot_name"),
> +  errdetail("In a moment starts streaming WAL 
> with new configuration.")));
> +
> + XLogWalRcvSendHSFeedback(true);
> +}

I don't quite think this is the right design. IMO the startup process
should signal the walreceiver to shut down, and the walreceiver should
never look at the config. Otherwise there's a chance for knowledge of
pg.conf to differ between the processes.
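
For illustration, a rough sketch of that alternative (hypothetical code;
ShutdownWalRcv() and the PrimaryConnInfo/PrimarySlotName globals exist,
but the surrounding function is invented here) would be for the startup
process to do something like this when it re-reads the config:

/*
 * Hypothetical sketch: run in the startup process on SIGHUP.  The
 * walreceiver itself never reads the config; it is simply told to
 * restart when the relevant settings change.
 */
static void
StartupRereadConfigSketch(void)
{
	char	   *old_conninfo = pstrdup(PrimaryConnInfo);
	char	   *old_slotname = pstrdup(PrimarySlotName);
	bool		changed;

	ProcessConfigFile(PGC_SIGHUP);

	changed = strcmp(old_conninfo, PrimaryConnInfo) != 0 ||
		strcmp(old_slotname, PrimarySlotName) != 0;

	pfree(old_conninfo);
	pfree(old_slotname);

	/* walreceiver is restarted by the startup process with the new values */
	if (changed)
		ShutdownWalRcv();
}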

Greetings,

Andres Freund



Re: [Patch] pg_rewind: options to use restore_command from recovery.conf or command line

2019-02-15 Thread Andres Freund
Hi,

On 2019-02-08 18:30:18 +0300, Alexey Kondratov wrote:
> From 99c6d94f37a797400d41545a271ff111b92e9361 Mon Sep 17 00:00:00 2001
> From: Alexey Kondratov 
> Date: Fri, 21 Dec 2018 14:00:30 +0300
> Subject: [PATCH] pg_rewind: options to use restore_command from
>  postgresql.conf or command line.
> 
> ---
>  doc/src/sgml/ref/pg_rewind.sgml   |  30 +-
>  src/backend/Makefile  |   4 +-
>  src/backend/commands/extension.c  |   1 +
>  src/backend/utils/misc/.gitignore |   1 -
>  src/backend/utils/misc/Makefile   |   8 -
>  src/backend/utils/misc/guc.c  | 434 +--
>  src/bin/pg_rewind/Makefile|   2 +-
>  src/bin/pg_rewind/parsexlog.c | 166 +-
>  src/bin/pg_rewind/pg_rewind.c | 100 +++-
>  src/bin/pg_rewind/pg_rewind.h |  10 +-
>  src/bin/pg_rewind/t/001_basic.pl  |   4 +-
>  src/bin/pg_rewind/t/002_databases.pl  |   4 +-
>  src/bin/pg_rewind/t/003_extrafiles.pl |   4 +-
>  src/bin/pg_rewind/t/RewindTest.pm |  93 +++-
>  src/common/.gitignore |   1 +
>  src/common/Makefile   |   9 +-
>  src/{backend/utils/misc => common}/guc-file.l | 518 --
>  src/include/common/guc-file.h |  50 ++
>  src/include/utils/guc.h   |  39 +-
>  src/tools/msvc/Mkvcbuild.pm   |   7 +-
>  src/tools/msvc/clean.bat  |   2 +-
>  21 files changed, 973 insertions(+), 514 deletions(-)
>  delete mode 100644 src/backend/utils/misc/.gitignore
>  rename src/{backend/utils/misc => common}/guc-file.l (60%)
>  create mode 100644 src/include/common/guc-file.h

As noted in a message from a few minutes ago, I'm very doubtful that the
approach of using guc-file.l like that is a good idea. But if we go for
that, that part of the patch *NEEDS* to be split into a separate
commit/patch. It's too hard to see functional changes otherwise.


> @@ -291,9 +299,46 @@ SimpleXLogPageRead(XLogReaderState *xlogreader, 
> XLogRecPtr targetPagePtr,
>  
>   if (xlogreadfd < 0)
>   {
> - printf(_("could not open file \"%s\": %s\n"), xlogfpath,
> -strerror(errno));
> - return -1;
> + bool  restore_ok;
> +
> + /*
> +  * If we have no restore_command to execute, then exit.
> +  */
> + if (private->restoreCommand == NULL)
> + {
> + printf(_("could not open file \"%s\": %s\n"), 
> xlogfpath,
> +strerror(errno));
> + return -1;
> + }
> +
> + /*
> +  * Since we have restore_command to execute, then try 
> to retrieve
> +  * missing WAL file from the archive.
> +  */
> + restore_ok = RestoreArchivedWAL(private->datadir,
> + 
> xlogfname,
> + 
> WalSegSz,
> + 
> private->restoreCommand);
> +
> + if (restore_ok)
> + {
> + xlogreadfd = open(xlogfpath, O_RDONLY | 
> PG_BINARY, 0);
> +
> + if (xlogreadfd < 0)
> + {
> + printf(_("could not open restored from 
> archive file \"%s\": %s\n"), xlogfpath,
> + strerror(errno));
> + return -1;
> + }
> + else
> + pg_log(PG_DEBUG, "using restored from 
> archive version of file \"%s\"\n", xlogfpath);
> + }
> + else
> + {
> + printf(_("could not restore file \"%s\" from 
> archive: %s\n"), xlogfname,
> +strerror(errno));
> + return -1;
> + }
>   }
>   }

I suggest moving this to a separate function.




> + if (restore_command != NULL)
> + {
> + if (restore_wals)
> + {
> + fprintf(stderr, _("%s: conflicting options: both -r and 
> -R are specified\n"),
> + progname);
> + fprintf(stderr, _("You must run %s with either 
> -r/--use-postgresql-conf or -R/--restore-command.\n"),
> + 

Re: [Patch] pg_rewind: options to use restore_command from recovery.conf or command line

2019-02-15 Thread Andres Freund
Hi,

On 2018-11-07 12:58:11 +0300, Alexey Kondratov wrote:
> On 30.10.2018 06:01, Michael Paquier wrote:
> > - Reusing the GUC parser is something I would avoid as well.  Not worth
> > the complexity.
> 
> Yes, I don't like it either. I will try to make guc-file.l frontend safe.

It sounds like a seriously bad idea to use a different parser for
pg_rewind.  Why don't you just use postgres for it? As in
/path/to/postgres -D /path/to/datadir/ -C shared_buffers
?



Re: Copy function for logical replication slots

2019-02-15 Thread Andres Freund
Hi,

On 2019-01-15 10:56:04 +0900, Masahiko Sawada wrote:

> + pg_copy_physical_replication_slot
> +
> +
> pg_copy_physical_replication_slot(src_slot_name
>  name, dst_slot_name , 
> temporary 
> bool)
> +   
> +   
> +(slot_name name, 
> lsn pg_lsn)
> +   
> +   
> +Copies an existing physical replication slot name 
> src_slot_name
> +to a physical replication slot named 
> dst_slot_name.
> +The copied physical slot starts to reserve WAL from the same 
> LSN as the
> +source slot.
> +temporary is optional. If 
> temporary
> +is omitted, the same value as the source slot is used.
> +   
> +  
> +
> +  
> +   
> +
> + pg_copy_logical_replication_slot
> +
> +
> pg_copy_logical_replication_slot(src_slot_name
>  name, dst_slot_name , 
> plugin name , 
> temporary 
> boolean)
> +   
> +   
> +(slot_name name, 
> lsn pg_lsn)
> +   
> +   
> +Copies an existing logical replication slot name 
> src_slot_name
> +to a logical replication slot named 
> dst_slot_name
> +while changing the output plugin and persistence. The copied logical 
> slot starts
> +from the same LSN as the source logical slot. 
> Both plugin and
> +temporary are optional. If 
> plugin
> +or temporary are omitted, the same values as
> +the source logical slot are used.
> +   
> +  

Would it make sense to move the differing options to the end of the
argument list? Right now we have a few common params, then a different
one, and then another common one?


> @@ -271,7 +272,7 @@ CreateInitDecodingContext(char *plugin,
>   StrNCpy(NameStr(slot->data.plugin), plugin, NAMEDATALEN);
>   SpinLockRelease(&slot->mutex);
>
> - ReplicationSlotReserveWal();
> + ReplicationSlotReserveWal(restart_lsn);

Why do we even need to call this? It ought to be guaranteed that there's
sufficient WAL, right?  And somehow I find it harder to understand that
the reserve routine gets an LSN.


>  /*
>   * Reserve WAL for the currently active slot.
>   *
> - * Compute and set restart_lsn in a manner that's appropriate for the type of
> - * the slot and concurrency safe.
> + * If an lsn to reserve is not requested, compute and set restart_lsn
> + * in a manner that's appropriate for the type of the slot and concurrency 
> safe.
> + * If the reserved WAL is requested, set restart_lsn and check if the 
> corresponding
> + * wal segment is available.
>   */
>  void
> -ReplicationSlotReserveWal(void)
> +ReplicationSlotReserveWal(XLogRecPtr requested_lsn)
>  {
>   ReplicationSlot *slot = MyReplicationSlot;
>
> @@ -1005,47 +1007,57 @@ ReplicationSlotReserveWal(void)
>* The replication slot mechanism is used to prevent removal of required
>* WAL. As there is no interlock between this routine and checkpoints, 
> WAL
>* segments could concurrently be removed when a now stale return value 
> of
> -  * ReplicationSlotsComputeRequiredLSN() is used. In the unlikely case 
> that
> -  * this happens we'll just retry.
> +  * ReplicationSlotsComputeRequiredLSN() is used. If the lsn to reserve 
> is
> +  * not requested, in the unlikely case that this happens we'll just 
> retry.
>*/
>   while (true)
>   {
>   XLogSegNo   segno;
>   XLogRecPtr  restart_lsn;
>
> - /*
> -  * For logical slots log a standby snapshot and start logical 
> decoding
> -  * at exactly that position. That allows the slot to start up 
> more
> -  * quickly.
> -  *
> -  * That's not needed (or indeed helpful) for physical slots as 
> they'll
> -  * start replay at the last logged checkpoint anyway. Instead 
> return
> -  * the location of the last redo LSN. While that slightly 
> increases
> -  * the chance that we have to retry, it's where a base backup 
> has to
> -  * start replay at.
> -  */
> - if (!RecoveryInProgress() && SlotIsLogical(slot))
> + if (!XLogRecPtrIsInvalid(requested_lsn))
>   {
> - XLogRecPtr  flushptr;
> -
> - /* start at current insert position */
> - restart_lsn = GetXLogInsertRecPtr();
> + /* Set the requested lsn */
>   SpinLockAcquire(&slot->mutex);
> - slot->data.restart_lsn = restart_lsn;
> + slot->data.restart_lsn = requested_lsn;
>   SpinLockRelease(&slot->mutex);
> -
> - /* make sure we have enough information to start */
> - flushptr = LogStandbySnapshot();
> -
> - /* and make sure it's fsynced to disk */
> - XLogFlush(flushptr);
>   }
>   

Re: Channel binding

2019-02-15 Thread Bruce Momjian
On Sat, Feb 16, 2019 at 10:12:19AM +0900, Michael Paquier wrote:
> On Fri, Feb 15, 2019 at 04:17:07PM -0500, Bruce Momjian wrote:
> > We removed channel binding from PG 11 in August of 2018 because we were
> > concerned about downgrade attacks.  Are there any plans to enable it for
> > PG 12?
> 
> The original implementation of channel binding for SCRAM has included
> support for two channel binding types: tls-unique and
> tls-server-end-point.  The original implementation also had a
> connection parameter called scram_channel_binding to control the
> channel binding type to use or to disable it.
> 
> What has been removed via 7729113 are tls-unique and the libpq
> parameter, and we still have basic channel binding support.  The
> reasons behind that is that tls-unique future is uncertain as of TLS
> 1.3, and that tls-server-end-point will still be supported.  This also
> simplified the protocol as it is not necessary to let the client
> decide which channel binding to use.

Well, my point was that this feature was considered to be very
important for PG 11, but for some reason there has been no advancement
of this feature for PG 12.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +



Re: [HACKERS] Restricting maximum keep segments by repslots

2019-02-15 Thread Andres Freund
Hi,

On 2019-01-30 10:42:04 +0900, Kyotaro HORIGUCHI wrote:
> From 270aff9b08ced425b4c4e23b53193285eb2359a6 Mon Sep 17 00:00:00 2001
> From: Kyotaro Horiguchi 
> Date: Thu, 21 Dec 2017 21:20:20 +0900
> Subject: [PATCH 1/6] Add WAL relief vent for replication slots
> 
> Adds a capability to limit the number of segments kept by replication
> slots by a GUC variable.

Maybe I'm missing something, but how does this prevent issues with
active slots that are currently accessing the WAL this patch now
suddenly allows to be removed? Especially for logical slots that seems
not unproblematic?

Besides that, this patch needs substantial spelling / language / comment
polishing. Horiguchi-san, it'd probably be good if you could make a
careful pass, and then maybe a native speaker could go over it?

Greetings,

Andres Freund



Re: Reviving the "Stopping logical replication protocol" patch from Vladimir Gordichuk

2019-02-15 Thread Andres Freund
Hi,

On 2018-12-03 06:38:43 -0500, Dave Cramer wrote:
> From 4d023cfc1fed0b5852b4da1aad6a32549b03ce26 Mon Sep 17 00:00:00 2001
> From: Dave Cramer 
> Date: Fri, 30 Nov 2018 18:23:49 -0500
> Subject: [PATCH 1/5] Respect client initiated CopyDone in walsender
> 
> ---
>  src/backend/replication/walsender.c | 36 ++--
>  1 file changed, 30 insertions(+), 6 deletions(-)
> 
> diff --git a/src/backend/replication/walsender.c 
> b/src/backend/replication/walsender.c
> index 46edb52..93f2648 100644
> --- a/src/backend/replication/walsender.c
> +++ b/src/backend/replication/walsender.c
> @@ -770,6 +770,14 @@ logical_read_xlog_page(XLogReaderState *state, 
> XLogRecPtr targetPagePtr, int req
>   sendTimeLineValidUpto = state->currTLIValidUntil;
>   sendTimeLineNextTLI = state->nextTLI;
>  
> + /*
> + * If the client sent CopyDone while we were waiting,
> + * bail out so we can wind up the decoding session.
> + */
> + if (streamingDoneSending)
> + return -1;
> +
> +  /* more than one block available */
>   /* make sure we have enough WAL available */
>   flushptr = WalSndWaitForWal(targetPagePtr + reqLen);
>  
> @@ -1341,8 +1349,12 @@ WalSndWaitForWal(XLogRecPtr loc)
>* It's important to do this check after the recomputation of
>* RecentFlushPtr, so we can send all remaining data before 
> shutting
>* down.
> -  */
> - if (got_STOPPING)
> +  *
> +  * We'll also exit here if the client sent CopyDone because it 
> wants
> +  * to return to command mode.
> + */
> +
> + if (got_STOPPING || streamingDoneReceiving)
>   break;
>  
>   /*
> @@ -2095,7 +2107,14 @@ WalSndCheckTimeOut(void)
>   }
>  }
>  
> -/* Main loop of walsender process that streams the WAL over Copy messages. */
> +/*
> + * Main loop of walsender process that streams the WAL over Copy messages.
> + *
> + * The send_data callback must enqueue complete CopyData messages to libpq
> + * using pq_putmessage_noblock or similar, since the walsender loop may send
> + * CopyDone then exit and return to command mode in response to a client
> + * CopyDone between calls to send_data.
> + */

Wait, how is it ok to send CopyDone before all the pending data has been
sent out?



> diff --git a/src/backend/replication/logical/reorderbuffer.c 
> b/src/backend/replication/logical/reorderbuffer.c
> index 23466ba..66b6e90 100644
> --- a/src/backend/replication/logical/reorderbuffer.c
> +++ b/src/backend/replication/logical/reorderbuffer.c
> @@ -1497,7 +1497,9 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId 
> xid,
>   rb->begin(rb, txn);
>  
>   iterstate = ReorderBufferIterTXNInit(rb, txn);
> - while ((change = ReorderBufferIterTXNNext(rb, iterstate)) != 
> NULL)
> + while ((change = ReorderBufferIterTXNNext(rb, iterstate)) != 
> NULL &&
> +(rb->continue_decoding_cb == NULL ||
> + rb->continue_decoding_cb()))
>   {
>   Relationrelation = NULL;
>   Oid reloid;

> @@ -1774,8 +1776,11 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId 
> xid,
>   ReorderBufferIterTXNFinish(rb, iterstate);
>   iterstate = NULL;
>  
> - /* call commit callback */
> - rb->commit(rb, txn, commit_lsn);
> + if (rb->continue_decoding_cb == NULL || 
> rb->continue_decoding_cb())
> + {
> + /* call commit callback */
> + rb->commit(rb, txn, commit_lsn);
> + }


I'm doubtful it's ok to simply stop in the middle of a transaction.



> @@ -1194,17 +1212,6 @@ WalSndWriteData(LogicalDecodingContext *ctx, 
> XLogRecPtr lsn, TransactionId xid,
>  
>   CHECK_FOR_INTERRUPTS();
>  
> - /* Try to flush pending output to the client */
> - if (pq_flush_if_writable() != 0)
> - WalSndShutdown();
> -
> - /* Try taking fast path unless we get too close to walsender timeout. */
> - if (now < TimestampTzPlusMilliseconds(last_reply_timestamp,
> - 
>   wal_sender_timeout / 2) &&
> - !pq_is_send_pending())
> - {
> - return;
> - }

As somebody else commented on the thread, I'm also doubtful this is
ok. This'll introduce significant additional blocking unless I'm missing
something?



>   /* If we have pending write here, go to slow path */
>   for (;;)
> @@ -1358,7 +1365,14 @@ WalSndWaitForWal(XLogRecPtr loc)
>   break;
>  
>   /*
> -  * We only send regular messages to the client for full decoded
> +  * If we have received CopyDone from the client, sent CopyDone
> +   

Re: Refactoring the checkpointer's fsync request queue

2019-02-15 Thread Andres Freund
Hi,

On 2019-02-13 18:40:05 +1300, Thomas Munro wrote:
> Thanks!  And sorry for not replying sooner -- I got distracted by
> FOSDEM (and the associated 20 thousand miles of travel).  On that trip
> I had a chance to discuss this patch with Andres Freund in person, and
> he opined that it might be better for the fsync request queue to work
> in terms of pathnames.  Instead of the approach in this patch, where a
> backend sends an fsync request for { relfilenode, segno } inside
> mdwrite(), and then the checkpointer processes the request by calling
> smgrdimmedsyncrel(), he speculated that it'd be better to have
> mdwrite() send an fsync request for a pathname, and then the
> checkpointer would just open that file by name and fsync() it.  That
> is, the checkpointer wouldn't call back into smgr.
> 
> One of the advantages of that approach is that there are probably
> other files that need to be fsync'd for each checkpoint that could
> benefit from being offloaded to the checkpointer.  Another is that you
> break the strange cycle mentioned above.

The other issue is that I think your approach moves the segmentation
logic basically out of md into smgr. I think that's wrong. We shouldn't
presume that every type of storage is going to have segmentation that's
representable in a uniform way imo.


> Another consideration if we do that is that the existing scheme has a
> kind of hierarchy that allows fsync requests to be cancelled in bulk
> when you drop relations and databases.  That is, the checkpointer
> knows about the internal hierarchy of tablespace, db, rel, seg.  If we
> get rid of that and have just paths, it seems like a bad idea to teach
> the checkpointer about the internal structure of the paths (even
> though we know they contain the same elements encoded somehow).  You'd
> have to send an explicit cancel for every key; that is, if you're
> dropping a relation, you need to generate a cancel message for every
> segment, and if you're dropping a database, you need to generate a
> cancel message for every segment of every relation.

I can't see that being a problem - compared to the overhead of dropping
a relation, that doesn't seem to be a meaningfully large cost?
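
For concreteness, the pathname-based scheme described in the quoted text
could look roughly like this hypothetical sketch (names and sizes
invented; not the submitted patch):

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical sketch: a backend enqueues only a file path, and the
 * checkpointer later opens that file by name and fsyncs it, without
 * calling back into smgr/md.
 */
#define SKETCH_MAXPATH 1024

typedef struct PathFsyncRequest
{
	char		path[SKETCH_MAXPATH];
} PathFsyncRequest;

/* checkpointer side: absorb and process one queued request */
static int
process_path_fsync_request(const PathFsyncRequest *req)
{
	int			fd = open(req->path, O_RDWR);

	if (fd < 0)
		return -1;				/* real code must decide how to treat ENOENT */

	if (fsync(fd) != 0)
	{
		close(fd);
		return -1;				/* data-loss territory: PostgreSQL would PANIC */
	}

	return close(fd);
}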

Greetings,

Andres Freund



REL_11_STABLE: dsm.c - cannot unpin a segment that is not pinned

2019-02-15 Thread Justin Pryzby
I saw this error once last week while stress testing to reproduce earlier bugs,
but tentatively thought it was a downstream symptom of those bugs (since
fixed), and now wanted to check that #15585 and others were no longer
reproducible.  Unfortunately I got this error while running same test case [2]
as for previous bug ('could not attach').

2019-02-14 23:40:41.611 MST [32287] ERROR:  cannot unpin a segment that is not 
pinned

On commit faf132449c0cafd31fe9f14bbf29ca0318a89058 (REL_11_STABLE including
both of last week's post-11.2 DSA patches), I reproduced twice, once within
~2.5 hours, once within 30min.

I'm not able to reproduce on master running overnight and now 16+hours.

See also:

pg11.1: dsa_area could not attach to segment - resolved by commit 
6c0fb941892519ad6b8873e99c4001404fb9a128
  [1] 
https://www.postgresql.org/message-id/20181231221734.GB25379%40telsasoft.com
pg11.1: dsa_area could not attach to segment:
  [2] 
https://www.postgresql.org/message-id/20190211040132.GV31721%40telsasoft.com
BUG #15585: infinite DynamicSharedMemoryControlLock waiting in parallel query
  [3] 
https://www.postgresql.org/message-id/flat/15585-324ff6a93a18da46%40postgresql.org
dsa_allocate() faliure - resolved by commit 
7215efdc005e694ec93678a6203dbfc714d12809 (also doesn't affect master)
  [4] 
https://www.postgresql.org/message-id/flat/CAMAYy4%2Bw3NTBM5JLWFi8twhWK4%3Dk_5L4nV5%2BbYDSPu8r4b97Zg%40mail.gmail.com

#0  0x7f96a589e277 in raise () from /lib64/libc.so.6
#1  0x7f96a589f968 in abort () from /lib64/libc.so.6
#2  0x0088b6f7 in errfinish (dummy=dummy@entry=0) at elog.c:555
#3  0x0088eee8 in elog_finish (elevel=elevel@entry=22, 
fmt=fmt@entry=0xa52cb0 "cannot unpin a segment that is not pinned") at 
elog.c:1376
#4  0x007578b2 in dsm_unpin_segment (handle=1780672242) at dsm.c:914
#5  0x008aee15 in dsa_release_in_place (place=0x7f96a6991640) at 
dsa.c:618
#6  0x007571a9 in dsm_detach (seg=0x2470f78) at dsm.c:731
#7  0x00509233 in DestroyParallelContext (pcxt=0x24c18c0) at 
parallel.c:900
#8  0x0062db65 in ExecParallelCleanup (pei=0x25aacf8) at 
execParallel.c:1154
#9  0x00640588 in ExecShutdownGather (node=node@entry=0x2549b60) at 
nodeGather.c:406
#10 0x00630208 in ExecShutdownNode (node=0x2549b60) at 
execProcnode.c:767
#11 0x0067724f in planstate_tree_walker 
(planstate=planstate@entry=0x2549a48, walker=walker@entry=0x6301c0 
, context=context@entry=0x0) at nodeFuncs.c:3739
#12 0x006301dd in ExecShutdownNode (node=0x2549a48) at 
execProcnode.c:749
#13 0x0067724f in planstate_tree_walker 
(planstate=planstate@entry=0x25495d0, walker=walker@entry=0x6301c0 
, context=context@entry=0x0) at nodeFuncs.c:3739
#14 0x006301dd in ExecShutdownNode (node=0x25495d0) at 
execProcnode.c:749
#15 0x0067724f in planstate_tree_walker 
(planstate=planstate@entry=0x25494b8, walker=walker@entry=0x6301c0 
, context=context@entry=0x0) at nodeFuncs.c:3739
#16 0x006301dd in ExecShutdownNode (node=0x25494b8) at 
execProcnode.c:749
#17 0x0067724f in planstate_tree_walker 
(planstate=planstate@entry=0x25492a8, walker=walker@entry=0x6301c0 
, context=context@entry=0x0) at nodeFuncs.c:3739
#18 0x006301dd in ExecShutdownNode (node=node@entry=0x25492a8) at 
execProcnode.c:749
#19 0x00628fbd in ExecutePlan (execute_once=, 
dest=0xd96e60 , direction=, numberTuples=0, 
sendTuples=true, operation=CMD_SELECT, use_parallel_mode=, 
planstate=0x25492a8, estate=0x2549038) at execMain.c:1787
#20 standard_ExecutorRun (queryDesc=0x2576990, direction=, 
count=0, execute_once=) at execMain.c:364
#21 0x005c635f in ExplainOnePlan 
(plannedstmt=plannedstmt@entry=0x2579820, into=into@entry=0x0, 
es=es@entry=0x2529170, 
queryString=queryString@entry=0x243f348 "explain analyze SELECT 
colcld.child c, parent p, array_agg(colpar.attname::text ORDER BY 
colpar.attnum) cols, array_agg(format_type(colpar.atttypid, colpar.atttypmod) 
ORDER BY colpar.attnum) AS types "..., params=params@entry=0x0, 
queryEnv=queryEnv@entry=0x0, planduration=planduration@entry=0x7fff1048afa0) at 
explain.c:535
#22 0x005c665f in ExplainOneQuery (query=, 
cursorOptions=, into=0x0, es=0x2529170, 
queryString=0x243f348 "explain analyze SELECT colcld.child c, parent p, 
array_agg(colpar.attname::text ORDER BY colpar.attnum) cols, 
array_agg(format_type(colpar.atttypid, colpar.atttypmod) ORDER BY 
colpar.attnum) AS types "..., params=0x0, queryEnv=0x0) at explain.c:371
#23 0x005c6bbe in ExplainQuery (pstate=pstate@entry=0x2461bd8, 
stmt=stmt@entry=0x24f87a8, 
queryString=queryString@entry=0x243f348 "explain analyze SELECT 
colcld.child c, parent p, array_agg(colpar.attname::text ORDER BY 
colpar.attnum) cols, array_agg(format_type(colpar.atttypid, colpar.atttypmod) 
ORDER BY colpar.attnum) AS types "..., params=params@entry=0x0, 
queryEnv=queryEnv@entry=0x0, dest=dest@entry=0x2461b40) at 

Re: 2019-03 CF Summary / Review - Tranche #1

2019-02-15 Thread Amit Kapila
On Fri, Feb 15, 2019 at 2:08 AM Andres Freund  wrote:
>
> Hi,
>
> - Avoid creation of the free space map for small tables
>
>   NR: the majority of this patch has been committed,
>

This is a correct assessment.

> I assume the
>   remaining pieces will too.
>

Yes, I am busy these days with something else, but I will surely get
back to reviewing/committing the remaining stuff for PG12.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Compressed TOAST Slicing

2019-02-15 Thread Andres Freund
Hi Stephen,


On 2018-12-06 12:54:18 -0800, Paul Ramsey wrote:
> On Sun, Dec 2, 2018 at 7:03 AM Rafia Sabih 
> wrote:
> >
> > The idea looks good and believing your performance evaluation it seems
> > like a practical one too.
> 
> Thank you kindly for the review!
> 
> > A comment explaining how this check differs for is_slice case would be
> helpful.
> > Looks like PG indentation is not followed here for n.
> 
> I have attached updated patches that add the comment and adhere to the Pg
> variable declaration indentation styles,
> ATB!
> P

You were mentioning committing this at the Brussels meeting... :)

Greetings,

Andres Freund



Re: Delay locking partitions during INSERT and UPDATE

2019-02-15 Thread Andres Freund
Hi,

On 2019-01-31 13:46:33 -0500, Robert Haas wrote:
> I have reviewed this patch and I am in favor of it.  I think it likely
> needs minor rebasing because of the heap_open -> table_open renaming.
> I also agree that it's worth taking some deadlock risk for the rather
> massive performance gain, although I suspect it's likely that a few
> users are going to complain about deadlocks and I wonder whether we'll
> have to put some energy into that problem at some point.  However, I
> think what we probably want to do there is reduce the probability of
> deadlocks through some trickery or maybe invent some new locking
> mechanisms that work around the problem.  The alternative of trying to
> block this patch seems unpalatable.

Are you saying that such workarounds would have to be merged at the same
time as this patch? Or that we'd address such complaints that way at a
later time?

Greetings,

Andres Freund



Re: Patch for SortSupport implementation on inet/cdir

2019-02-15 Thread Andres Freund
Hi Brandur,


On 2019-02-09 20:12:53 +1300, Edmund Horner wrote:
> I had a look at this.  Your V2 patch applies cleanly, and the code was
> straightforward and well commented.  I appreciate the big comment at the
> top of network_abbrev_convert explaining how you encode the data.

I've marked the CF entry as waiting-on-author, due to this review.

Brandur, unfortunately this patch has only been submitted to the last
commitfest for v12. Our policy is that all nontrivial patches should be
submitted to at least one earlier commitfest. Therefore I unfortunately
think that v12 is not a realistic target, sorry :(

Greetings,

Andres Freund



Re: [Patch][WiP] Tweaked LRU for shared buffers

2019-02-15 Thread Tomas Vondra
On 2/16/19 1:48 AM, Benjamin Manes wrote:
> Hi,
> 
> I was not involved with Andrey and his team's work, which looks like a
> very promising first pass. I can try to clarify a few minor details.
> 
> What is CAR? Did you mean ARC, perhaps?
> 
> 
> CAR is the Clock variants of ARC: CAR: Clock with Adaptive Replacement
> 
> 

Thanks, will check.

> I believe the main interest in ARC is its claim of adaptability with
> high hit rates. Unfortunately the reality is less impressive as it fails
> to handle many frequency workloads, e.g. poor scan resistance, and the
> adaptivity is poor due to the accesses being quite noisy. For W-TinyLFU,
> we have recent improvements which show near perfect adaptivity in
> our stress case that results in double the hit rate of ARC and is less
> than 1% from optimal.
> 

Interesting.

> Can you point us to the thread/email discussing those ideas? I have
> tried searching through archives, but I haven't found anything :-(
> 
> 
> This thread doesn't explain Andrey's work, but includes my minor
> contribution. The longer thread discusses the interest in CAR, et al.
> 

Thanks.

> Are you suggesting to get rid of the buffer rings we use for
> sequential scans, for example? IMHO that's going to be tricky, e.g.
> because of synchronized sequential scans.
> 
> 
> If you mean "synchronized" in terms of multi-threading and locks, then
> this is a solved problem in terms of caching.

No, I "synchronized scans" are an optimization to reduce I/O when
multiple processes do sequential scan on the same table. Say one process
is half-way through the table, when another process starts another scan.
Instead of scanning it from block #0 we start at the position of the
first process (at which point they "synchronize") and then wrap around
to read the beginning.
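
For reference, a much-simplified, hypothetical sketch of that mechanism
(the real implementation is in src/backend/access/common/syncscan.c and
keeps the hints in shared memory, keyed by RelFileNode; all names here
are invented):

#include <stdbool.h>

typedef unsigned int SketchBlockNumber;

typedef struct SketchScanLocation
{
	unsigned int relid;			/* which relation the hint is for */
	SketchBlockNumber location;	/* block most recently read by some scan */
	bool		valid;
} SketchScanLocation;

static SketchScanLocation scan_hint;	/* shared memory in the real thing */

/* called by a running scan every few pages to publish its position */
static void
sketch_report_location(unsigned int relid, SketchBlockNumber blkno)
{
	scan_hint.relid = relid;
	scan_hint.location = blkno;
	scan_hint.valid = true;
}

/* called when a new scan starts, to pick its starting block */
static SketchBlockNumber
sketch_get_start_block(unsigned int relid)
{
	if (scan_hint.valid && scan_hint.relid == relid)
		return scan_hint.location;	/* join the in-progress scan here */
	return 0;						/* otherwise start at block 0 */
}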

I was under the impression that this somehow depends on the small
circular buffers, but I've now checked the code and I see that's bogus.


> My limited understanding is that the buffer rings are
> used to protect the cache from being polluted by scans which flush the
> LRU-like algorithms. This allows those policies to capture more frequent
> items. It also avoids lock contention on the cache due to writes caused
> by misses, where Clock allows lock-free reads but uses a global lock on
> writes. A smarter cache eviction policy and concurrency model can handle
> this without needing buffer rings to compensate.
> 

Right - that's the purpose of circular buffers.

> Somebody should write a patch to make buffer eviction completely
> random, without aiming to get it committed. That isn't as bad of a
> strategy as it sounds, and it would help with assessing improvements
> in this area.
> 
> 
> A related and helpful patch would be to capture the access log and
> provide anonymized traces. We have a simulator with dozens of
> policies to quickly provide a breakdown. That would let you know the hit
> rates before deciding on the policy to adopt.
> 

Interesting. I assume the trace is essentially a log of which blocks
were requested? Is there some trace format specification somewhere?

cheers

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Optimze usage of immutable functions as relation

2019-02-15 Thread Andres Freund
On 2018-11-08 15:08:03 +0100, Antonin Houska wrote:
> Aleksandr Parfenov  wrote:
> 
> > I fixed a typo and some comments. Please find new version attached.
> 
> I've checked this verstion too.
> 
> * is_simple_stable_function()
> 
> instead of fetching HeapTuple from the syscache manually, you might want to
> consider using functions from lsyscache.c (get_func_rettype, get_func_retset,
> etc.), or adding a function that returns (subset of) the fields you need in a
> single call.
> 
> * pull_up_simple_function():
> 
> As you assume that ret->functions is a single-item list
> 
>   Assert(list_length(rte->functions) == 1);
> 
> the following iteration is not necessary:
> 
>   foreach(lc, functions_list)
> 
> Also, there seems to be a lot of copy & paste from pull_up_simple_values(), so
> some refactoring would make sense.

Given this I think the appropriate state of the CF entry would have been
waiting-for-author, not needs review. Or alternatively
returned-with-feedback or rejected.  I'm a bit confused as to why the
patch was moved to the next CF twice?

- Andres



Re: Channel binding

2019-02-15 Thread Michael Paquier
On Fri, Feb 15, 2019 at 04:17:07PM -0500, Bruce Momjian wrote:
> We removed channel binding from PG 11 in August of 2018 because we were
> concerned about downgrade attacks.  Are there any plans to enable it for
> PG 12?

The original implementation of channel binding for SCRAM has included
support for two channel binding types: tls-unique and
tls-server-end-point.  The original implementation also had a
connection parameter called scram_channel_binding to control the
channel binding type to use or to disable it.

What has been removed via 7729113 are tls-unique and the libpq
parameter, and we still have basic channel binding support.  The
reasons behind that are that the future of tls-unique is uncertain as of
TLS 1.3, and that tls-server-end-point will still be supported.  This also
simplified the protocol, as it is no longer necessary to let the client
decide which channel binding type to use.

Downgrade attacks at the protocol level are something different though, as
it is possible to trick libpq into lowering the initial level of
authentication wanted (say from SCRAM to MD5, or even worse from MD5
to trust).  What we need here is an additional frontend facility to allow
a client to authorize only a subset of authentication protocols.  With
what is in v11 and HEAD, any driver speaking the Postgres protocol can
implement that.
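
To illustrate what such a facility could look like on the driver side,
here is a rough sketch (the AUTH_REQ_* numbers are the
AuthenticationRequest codes from the wire protocol; the allowlist itself
is hypothetical and not an existing libpq option):

#include <stdbool.h>
#include <stdio.h>

#define AUTH_REQ_OK        0    /* server says authentication is complete */
#define AUTH_REQ_PASSWORD  3    /* cleartext password */
#define AUTH_REQ_MD5       5    /* MD5-hashed password */
#define AUTH_REQ_SASL      10   /* SASL, i.e. SCRAM-SHA-256 */
#define AUTH_REQ_MAX       16   /* array size for this sketch, not a real constant */

/*
 * Return false if the server asked for an authentication method the user
 * did not allow; the caller would then drop the connection instead of
 * answering.  "sasl_done" guards against the "trust" downgrade: an
 * AuthenticationOk without a prior SASL exchange is rejected unless the
 * user explicitly allowed it.
 */
static bool
auth_request_allowed(int auth_req, const bool allowed[AUTH_REQ_MAX],
                     bool sasl_done)
{
    if (auth_req == AUTH_REQ_OK)
        return sasl_done || allowed[AUTH_REQ_OK];

    if (auth_req < 0 || auth_req >= AUTH_REQ_MAX || !allowed[auth_req])
    {
        fprintf(stderr, "server requested disallowed auth method %d\n",
                auth_req);
        return false;
    }
    return true;
}
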
--
Michael




Re: [Patch][WiP] Tweaked LRU for shared buffers

2019-02-15 Thread Tomas Vondra


On 2/16/19 12:51 AM, Peter Geoghegan wrote:
> On Fri, Feb 15, 2019 at 3:30 PM Tomas Vondra
>  wrote:
>> That TPS chart looks a bit ... wild. How come the master jumps so much
>> up and down? That's a bit suspicious, IMHO.
> 
> Somebody should write a patch to make buffer eviction completely 
> random, without aiming to get it committed. That isn't as bad of a 
> strategy as it sounds, and it would help with assessing improvements 
> in this area.
> 
> We know that the cache replacement algorithm behaves randomly when 
> there is extreme contention, while also slowing everything down due
> to maintaining the clock.

Possibly, although I still find it strange that the throughput first
grows, then at shared_buffers 1GB it drops, and then at 3GB it starts
growing again. Considering this is on 200GB data set, I doubt the
pressure/contention is much different with 1GB and 3GB, but maybe it is.

> An unambiguously better caching algorithm would at a minimum be able
> to beat our "cheap random replacement" prototype as well as the
> existing clocksweep algorithm in most or all cases. That seems like a
> reasonably good starting point, at least.
> 

Yes, comparison to cheap random replacement would be an interesting
experiment.


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [Patch][WiP] Tweaked LRU for shared buffers

2019-02-15 Thread Peter Geoghegan
On Fri, Feb 15, 2019 at 3:30 PM Tomas Vondra
 wrote:
> That TPS chart looks a bit ... wild. How come the master jumps so much
> up and down? That's a bit suspicious, IMHO.

Somebody should write a patch to make buffer eviction completely
random, without aiming to get it committed. That isn't as bad of a
strategy as it sounds, and it would help with assessing improvements
in this area.

We know that the cache replacement algorithm behaves randomly when
there is extreme contention, while also slowing everything down due to
maintaining the clock. An unambiguously better caching algorithm would
at a minimum be able to beat our "cheap random replacement" prototype
as well as the existing clocksweep algorithm in most or all cases.
That seems like a reasonably good starting point, at least.
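
Something along these lines would do (a sketch only; the structures and
the pin check are stand-ins for the real buffer descriptors):

#include <stdint.h>
#include <stdlib.h>

#define NBUFFERS 16384

typedef struct BufferDescSketch
{
    uint32_t    refcount;       /* > 0 means the buffer is pinned */
} BufferDescSketch;

static BufferDescSketch buffers[NBUFFERS];

/* Pick buffers at random until we hit one that is not pinned. */
static int
random_victim(void)
{
    for (;;)
    {
        int candidate = rand() % NBUFFERS;  /* no clock hand, no usage_count */

        if (buffers[candidate].refcount == 0)
            return candidate;               /* evict this one */
        /* pinned, just try another random slot */
    }
}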

-- 
Peter Geoghegan



Re: [Patch][WiP] Tweaked LRU for shared buffers

2019-02-15 Thread Tomas Vondra


Hi,

On 2/13/19 3:37 PM, Andrey Borodin wrote:
> Hi, hackers!
> 
> We have held education project at Sirius edu center (Sochi, Russia)
> with mentors from Yandex. The group of 5 students was working on
> improving the shared buffers eviction algorithm: Andrey Chausov, Yuriy
> Skakovskiy, Ivan Edigaryev, Arslan Gumerov, Daria Filipetskaya. I’ve
> been a mentor for the group. For two weeks we have been looking into
> known caching algorithms and tried to adapt some of them for PostgreSQL
> codebase.
>
> While a lot of algorithms appeared to be too complex to be hacked in 
> 2 weeks, we decided to implement and test the working version of
> tweaked LRU eviction algorithm.
> 
> ===How it works===
> Most of the buffers are organized into a linked list. Newly admitted
> pages enter at the 5/8th point of the queue. The last ⅛ of the queue is
> governed by a clock-sweep algorithm to improve concurrency.
> 

Interesting. Where do these numbers (5/8 and 1/8) come from?
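
Just to make sure I understand the layout, here is a toy model of what the
description says (my own sketch, not the patch's data structures):

#include <string.h>

#define N 64                            /* queue length, i.e. number of buffers */

static int queue[N];                    /* queue[0] = hottest, queue[N-1] = coldest */

/* Newly admitted buffers enter at the 5/8 point, not at the hot end. */
static void
admit_buffer(int buf_id)
{
    int insert_at = (5 * N) / 8;

    /* shift the colder tail down by one slot, dropping the last entry */
    memmove(&queue[insert_at + 1], &queue[insert_at],
            (N - insert_at - 1) * sizeof(int));
    queue[insert_at] = buf_id;
}

/* The last 1/8 of the queue is the region the clock-sweep style eviction
 * works on. */
static int
eviction_region_start(void)
{
    return N - N / 8;
}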

> ===So how we tested the patch===
> We used sampling on 4 Yandex.Cloud compute instances with 16 vCPU
> cores, 8GB of RAM, 200GB database in 30-minute YCSB-like runs with Zipf
> distribution. We found that on read-only workload our algorithm is
> showing consistent improvement over the current master branch. On
> read-write workloads we haven’t found performance improvements yet,
> there was too much noise from checkpoints and bgwriter (more on it in
> TODO section).
> Charts are here: [0,1]

That TPS chart looks a bit ... wild. How come the master jumps so much
up and down? That's a bit suspicious, IMHO.

How do I reproduce this benchmark? I'm aware of pg_ycsb, but maybe
you've used some other tools?

Also, have you tried some other benchmarks (like, regular TPC-B as
implemented by pgbench, or read-only pgbench)? We need such benchmarks
with a range of access patterns to check for regressions.

BTW what do you mean by "sampling"?

> We used this config: [2]
> 

That's only half the information - it doesn't say how many clients were
running the benchmark etc.

> ===TODO===
> We have taken some ideas expressed by Ben Manes in the pgsql-hackers
> list. But we could not implement all of them during the time of the
> program. For example, we tried to make LRU bumps less write-contentious
> by storing them in a circular buffer. But this feature was not stable
> enough.

Can you point us to the thread/email discussing those ideas? I have
tried searching through archives, but I haven't found anything :-(

This message does not really explain the algorithms, and combined with
the absolute lack of comments in the linked commit, it's somewhat
difficult to form an opinion.

> The patch in its current form also requires improvements. In
> particular, we should reduce the overall number of locks (we have tried
> buffering for this, but not only that). The "clock sweep" algorithm at
> the last ⅛ part of the logical sequence should be improved too (see
> ClockSweepTickAtTheAnd() and the places where it is used).
OK

> Unfortunately, we didn’t have more time to bring CAR and W-TinyLFU
> to testing-ready state.

What is CAR? Did you mean ARC, perhaps?

> We have a working implementation of a frequency sketch [3] to make the
> transition between the admission cycle and the LRU more concise with a
> TinyLFU filter. Most probably, work in this direction will be continued.

OK
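
(For anyone else reading along: a "frequency sketch" in this context is
essentially a count-min sketch with small saturating counters, roughly
like the illustration below. This is not the code from [3], and the hash
mixing is made up.)

#include <stdint.h>

#define SKETCH_WIDTH  (1 << 16)         /* counters per row, power of two */
#define SKETCH_DEPTH  4                 /* independent hash rows */

static uint8_t sketch[SKETCH_DEPTH][SKETCH_WIDTH];

static uint32_t
hash_row(uint32_t key, uint32_t row)
{
    uint32_t    h = key * 2654435761u + row * 40503u;  /* cheap mixing */

    h ^= h >> 15;
    return h & (SKETCH_WIDTH - 1);
}

/* Bump one small counter per row, saturating at 255. */
static void
sketch_record_access(uint32_t blockno)
{
    int         r;

    for (r = 0; r < SKETCH_DEPTH; r++)
    {
        uint8_t    *c = &sketch[r][hash_row(blockno, r)];

        if (*c < 255)
            (*c)++;
    }
}

/* Estimated access frequency: the minimum over all rows (an overestimate). */
static uint8_t
sketch_estimate(uint32_t blockno)
{
    uint8_t     min = 255;
    int         r;

    for (r = 0; r < SKETCH_DEPTH; r++)
    {
        uint8_t     c = sketch[r][hash_row(blockno, r)];

        if (c < min)
            min = c;
    }
    return min;
}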

> Also, the current patch does not change bgwriter behavior: with a
> piece of knowledge from LRU, we can predict that some pages will not be
> changed in the nearest future. This information should be used to
> schedule the background writes better.

Sounds interesting.

> We also think that with a proper eviction algorithm, shared buffers
> should be used instead of the specialized buffer rings.
> 

Are you suggesting to get rid of the buffer rings we use for sequential
scans, for example? IMHO that's going to be tricky, e.g. because of
synchronized sequential scans.

> We will be happy to hear your feedback on our work! Thank you :)
> 
> [0] LRU TPS https://yadi.sk/i/wNNmNw3id22nMQ
> [1] LRU hitrate https://yadi.sk/i/l1o710C3IX6BiA
> [2] Benchmark config https://yadi.sk/d/L-PIVq--ujw6NA
> [3] Frequency sketch code 
> https://github.com/heyfaraday/postgres/commit/a684a139a72cd50dd0f9d031a8aa77f998607cf1
> 
> With best regards, almost serious cache group.
> 

cheers

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Early WIP/PoC for inlining CTEs

2019-02-15 Thread Tomas Vondra
On 2/14/19 8:22 PM, Merlin Moncure wrote:
> On Thu, Feb 14, 2019 at 10:02 AM Alvaro Herrera
>  wrote:
>>
>> On 2019-Feb-14, Peter Eisentraut wrote:
>>
>>> On 14/02/2019 16:11, Tom Lane wrote:
 ... so, have we beaten this topic to death yet?  Can we make a decision?

 Personally, I'd be happy with either of the last two patch versions
 I posted (that is, either AS [[NOT] MATERIALIZED] or
 AS [MATERIALIZE [ON|OFF]] syntax).  But we gotta pick something.
>>>
>>> If we're not really planning to add any more options, I'd register a
>>> light vote for MATERIALIZED.  It reads easier, seems more grammatically
>>> correct, and uses an existing word.
>>
>> +1 for MATERIALIZED, as I proposed in
>> https://postgr.es/m/20170503173305.fetj4tz7kd56tjlr@alvherre.pgsql
> 
> Seconded!
> 

+1 to MATERIALIZED too

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Channel binding

2019-02-15 Thread Bruce Momjian
We removed channel binding from PG 11 in August of 2018 because we were
concerned about downgrade attacks.  Are there any plans to enable it for
PG 12?

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +



Re: propagating replica identity to partitions

2019-02-15 Thread Alvaro Herrera
On 2019-Feb-07, Dmitry Dolgov wrote:

> Could there be a race condition somewhere? The idea and the code look
> reasonable, but when I'm testing ALTER TABLE ... REPLICA IDENTITY with this
> patch and concurrent partition creation, I've got the following error on 
> ATTACH
> PARTITION:
> 
> ERROR:  42P17: replica index does not exist in partition "test373"
> LOCATION:  MatchReplicaIdentity, tablecmds.c:15018
> 
> This error seems unstable; another time I got a deadlock. And I don't observe
> this behaviour on the master.

Can you share your reproducer?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: WAL insert delay settings

2019-02-15 Thread Tomas Vondra


On 2/15/19 7:41 PM, Andres Freund wrote:
> Hi,
> 
> On 2019-02-15 08:50:03 -0500, Stephen Frost wrote:
>> * Andres Freund (and...@anarazel.de) wrote:
>>> On 2019-02-14 11:02:24 -0500, Stephen Frost wrote:
 On Thu, Feb 14, 2019 at 10:15 Peter Eisentraut <
 peter.eisentr...@2ndquadrant.com> wrote:
> On 14/02/2019 11:03, Tomas Vondra wrote:
>> But if you add extra sleep() calls somewhere (say because there's also
>> limit on WAL throughput), it will affect how fast VACUUM works in
>> general. Yet it'll continue with the cost-based throttling, but it will
>> never reach the limits. Say you do another 20ms sleep somewhere.
>> Suddenly it means it only does 25 rounds/second, and the actual write
>> limit drops to 4 MB/s.
>
> I think at a first approximation, you probably don't want to add WAL
> delays to vacuum jobs, since they are already slowed down, so the rate
> of WAL they produce might not be your first problem.  The problem is
> more things like CREATE INDEX CONCURRENTLY that run at full speed.
>
> That leads to an alternative idea of expanding the existing cost-based
> vacuum delay system to other commands.
>
> We could even enhance the cost system by taking WAL into account as an
> additional factor.

 This is really what I was thinking- let’s not have multiple independent
 ways of slowing down maintenance and similar jobs to reduce their impact on
 I/o to the heap and to WAL.
>>>
>>> I think that's a bad idea. Both because the current vacuum code is
>>> *terrible* if you desire higher rates, because both CPU and IO time
>>> aren't taken into account. And it's extremely hard to control.  And it
>>> seems entirely valuable to be able to limit the amount of WAL generated
>>> for replication, but still try to get the rest of the work done as
>>> quickly as reasonably possible wrt local IO.
>>
>> I'm all for making improvements to the vacuum code and making it easier
>> to control.
>>
>> I don't buy off on the argument that there is some way to segregate the
>> local I/O question from the WAL when we're talking about these kinds of
>> operations (VACUUM, CREATE INDEX, CLUSTER, etc) on logged relations, nor
>> do I think we do our users a service by giving them independent knobs
>> for both that will undoubtably end up making it more difficult to
>> understand and control what's going on overall.
>>
>> Even here, it seems, you're arguing that the existing approach for
>> VACUUM is hard to control; wouldn't adding another set of knobs for
>> controlling the amount of WAL generated by VACUUM make that worse?  I
>> have a hard time seeing how it wouldn't.
> 
> I think it's because I see them as, often, having two largely 
> independent use cases. If your goal is to avoid swamping replication 
> with WAL, you don't necessarily care about also throttling VACUUM
> (or REINDEX, or CLUSTER, or ...)'s local IO.  By forcing the two to be
> combined you just make the whole feature less usable.
> 

I agree with that.

> I think it'd not be insane to add two things:
> - WAL write rate limiting, independent of the vacuum stuff. It'd also be
>   used by lots of other bulk commands (CREATE INDEX, ALTER TABLE
>   rewrites, ...)
> - Account for WAL writes in the current vacuum costing logic, by
>   accounting for it using a new cost parameter
> 
> Then VACUUM would be throttled by the *minimum* of the two, which seems
> to make plenty sense to me, given the usecases.
> 

Is it really the minimum? If you add another cost parameter to the vacuum
model, then there's almost no chance of actually reaching the WAL limit
because the budget (cost_limit) is shared with other stuff (local I/O).
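
To illustrate what I mean, here is a toy model of the cost-based delay
(not the actual vacuum code; the names and numbers are just stand-ins for
the real GUCs, and the WAL cost is hypothetical):

#include <unistd.h>

static int  cost_balance = 0;

static const int cost_limit = 200;      /* think vacuum_cost_limit */
static const int cost_delay_ms = 20;    /* think vacuum_cost_delay */

static const int page_hit_cost = 1;     /* think vacuum_cost_page_hit */
static const int page_miss_cost = 10;   /* think vacuum_cost_page_miss */
static const int wal_kb_cost = 2;       /* hypothetical cost per kB of WAL */

/*
 * Every cost source draws from the same budget, so the WAL component never
 * gets a limit of its own - it just makes the shared budget run out sooner.
 */
static void
charge(int cost)
{
    cost_balance += cost;
    if (cost_balance >= cost_limit)
    {
        usleep(cost_delay_ms * 1000);   /* pg_usleep() in the real code */
        cost_balance = 0;
    }
}

static void
process_page(int buffer_hit, int wal_kb_written)
{
    charge(buffer_hit ? page_hit_cost : page_miss_cost);
    if (wal_kb_written > 0)
        charge(wal_kb_cost * wal_kb_written);
}

int
main(void)
{
    int         i;

    for (i = 0; i < 1000; i++)
        process_page(i % 10 != 0, 1);   /* 90% buffer hits, 1kB of WAL per page */
    return 0;
}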

FWIW I do think the ability to throttle WAL is a useful feature, I just
don't want to shoot myself in the foot by making other things worse.

As you note, the existing VACUUM throttling is already hard to control,
this seems to make it even harder.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: shared-memory based stats collector

2019-02-15 Thread Alvaro Herrera
On 2019-Feb-15, Andres Freund wrote:

> It's fine to do such renames, just do them as separate patches. It's
> hard enough to review changes this big...

Talk about moving the whole file to another subdir ...

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: WAL insert delay settings

2019-02-15 Thread Andres Freund
Hi,

On 2019-02-15 08:50:03 -0500, Stephen Frost wrote:
> * Andres Freund (and...@anarazel.de) wrote:
> > On 2019-02-14 11:02:24 -0500, Stephen Frost wrote:
> > > On Thu, Feb 14, 2019 at 10:15 Peter Eisentraut <
> > > peter.eisentr...@2ndquadrant.com> wrote:
> > > > On 14/02/2019 11:03, Tomas Vondra wrote:
> > > > > But if you add extra sleep() calls somewhere (say because there's also
> > > > > limit on WAL throughput), it will affect how fast VACUUM works in
> > > > > general. Yet it'll continue with the cost-based throttling, but it 
> > > > > will
> > > > > never reach the limits. Say you do another 20ms sleep somewhere.
> > > > > Suddenly it means it only does 25 rounds/second, and the actual write
> > > > > limit drops to 4 MB/s.
> > > >
> > > > I think at a first approximation, you probably don't want to add WAL
> > > > delays to vacuum jobs, since they are already slowed down, so the rate
> > > > of WAL they produce might not be your first problem.  The problem is
> > > > more things like CREATE INDEX CONCURRENTLY that run at full speed.
> > > >
> > > > That leads to an alternative idea of expanding the existing cost-based
> > > > vacuum delay system to other commands.
> > > >
> > > > We could even enhance the cost system by taking WAL into account as an
> > > > additional factor.
> > > 
> > > This is really what I was thinking- let’s not have multiple independent
> > > ways of slowing down maintenance and similar jobs to reduce their impact 
> > > on
> > > I/o to the heap and to WAL.
> > 
> > I think that's a bad idea. Both because the current vacuum code is
> > *terrible* if you desire higher rates, because both CPU and IO time
> > aren't taken into account. And it's extremely hard to control.  And it
> > seems entirely valuable to be able to limit the amount of WAL generated
> > for replication, but still try to get the rest of the work done as
> > quickly as reasonably possible wrt local IO.
> 
> I'm all for making improvements to the vacuum code and making it easier
> to control.
> 
> I don't buy off on the argument that there is some way to segregate the
> local I/O question from the WAL when we're talking about these kinds of
> operations (VACUUM, CREATE INDEX, CLUSTER, etc) on logged relations, nor
> do I think we do our users a service by giving them independent knobs
> for both that will undoubtably end up making it more difficult to
> understand and control what's going on overall.
> 
> Even here, it seems, you're arguing that the existing approach for
> VACUUM is hard to control; wouldn't adding another set of knobs for
> controlling the amount of WAL generated by VACUUM make that worse?  I
> have a hard time seeing how it wouldn't.

I think it's because I see them as, often, having two largely
independent use cases. If your goal is to avoid swamping replication
with WAL, you don't necessarily care about also throttling VACUUM (or
REINDEX, or CLUSTER, or ...)'s local IO.  By forcing the two to be combined
you just make the whole feature less usable.

I think it'd not be insane to add two things:
- WAL write rate limiting, independent of the vacuum stuff. It'd also be
  used by lots of other bulk commands (CREATE INDEX, ALTER TABLE
  rewrites, ...)
- Account for WAL writes in the current vacuum costing logic, by
  accounting for it using a new cost parameter

Then VACUUM would be throttled by the *minimum* of the two, which seems
to make plenty sense to me, given the usecases.

Greetings,

Andres Freund



Re: Copy function for logical replication slots

2019-02-15 Thread Petr Jelinek
Hi,

On 15/01/2019 02:56, Masahiko Sawada wrote:
> On Tue, Nov 27, 2018 at 3:46 AM Petr Jelinek
>  wrote:
>>> +
>>> +   /*
>>> +* The requested wal lsn is no longer available. We don't 
>>> want to retry
>>> +* it, so raise an error.
>>> +*/
>>> +   if (!XLogRecPtrIsInvalid(requested_lsn))
>>> +   {
>>> +   char filename[MAXFNAMELEN];
>>> +
>>> +   XLogFileName(filename, ThisTimeLineID, segno, 
>>> wal_segment_size);
>>> +   ereport(ERROR,
>>> +   (errmsg("could not reserve WAL 
>>> segment %s", filename)));
>>> +   }
>>
>> I would reword the comment to something like "The caller has requested a
>> specific wal lsn which we failed to reserve. We can't retry here as the
>> requested wal is no longer available." (It took me a while to understand
>> this part).
>>
>> Also the ereport should have errcode as it's going to be thrown to user
>> sessions and it might be better if the error itself used same wording as
>> CheckXLogRemoved() and XLogRead() for consistency. What do you think?
>>
> 
> I agreed with both of your comments. I've changed the above comment and
> ereport. Attached is the updated version of the patch.
> 

I went through this again and I am pretty much happy with the current
version. So I am going to mark it as RFC.

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services



Re: shared-memory based stats collector

2019-02-15 Thread Andres Freund
Hi,

On 2019-02-15 17:29:00 +0900, Kyotaro HORIGUCHI wrote:
> At Thu, 7 Feb 2019 13:10:08 -0800, Andres Freund  wrote 
> in <20190207211008.nc3axviivmcoa...@alap3.anarazel.de>
> > Hi,
> > 
> > On 2018-11-12 20:10:42 +0900, Kyotaro HORIGUCHI wrote:
> > > diff --git a/src/backend/access/transam/xlog.c 
> > > b/src/backend/access/transam/xlog.c
> > > index 7eed5866d2..e52ae54821 100644
> > > --- a/src/backend/access/transam/xlog.c
> > > +++ b/src/backend/access/transam/xlog.c
> > > @@ -8587,9 +8587,9 @@ LogCheckpointEnd(bool restartpoint)
> > >   _secs, _usecs);
> > >  
> > >   /* Accumulate checkpoint timing summary data, in milliseconds. */
> > > - BgWriterStats.m_checkpoint_write_time +=
> > > + BgWriterStats.checkpoint_write_time +=
> > >   write_secs * 1000 + write_usecs / 1000;
> > > - BgWriterStats.m_checkpoint_sync_time +=
> > > + BgWriterStats.checkpoint_sync_time +=
> > >   sync_secs * 1000 + sync_usecs / 1000;
> > 
> > Why does this patch do renames like this in the same entry as actual
> > functional changes?
> 
> Just because it is no longer "messages". I'm ok to preserve them
as historical names. Reverted.

It's fine to do such renames, just do them as separate patches. It's
hard enough to review changes this big...


> > >  /*
> > >   * Structures in which backends store per-table info that's waiting to be
> > > @@ -189,18 +189,14 @@ typedef struct TabStatHashEntry
> > >   * Hash table for O(1) t_id -> tsa_entry lookup
> > >   */
> > >  static HTAB *pgStatTabHash = NULL;
> > > +static HTAB *pgStatPendingTabHash = NULL;
> > >  
> > >  /*
> > >   * Backends store per-function info that's waiting to be sent to the 
> > > collector
> > >   * in this hash table (indexed by function OID).
> > >   */
> > >  static HTAB *pgStatFunctions = NULL;
> > > -
> > > -/*
> > > - * Indicates if backend has some function stats that it hasn't yet
> > > - * sent to the collector.
> > > - */
> > > -static bool have_function_stats = false;
> > > +static HTAB *pgStatPendingFunctions = NULL;
> > 
> > So this patch leaves us with a pgStatFunctions that has a comment
> > explaining it's about "waiting to be sent" stats, and then additionally
> > a pgStatPendingFunctions?
> 
> Mmm. Thanks. I changed the comment, separated the pgStatPending* stuff
> from there, merged it with pgstat_pending_*, and unified the naming.

I think my point is larger than that - I don't see why the pending
hashtables are needed at all. They seem purely superflous.


> > 
> > > + if (cxt->tabhash)
> > > + dshash_detach(cxt->tabhash);
> > 
> > Huh, why do we detach here?
> 
> To release the lock on cxt->dbentry. It may be destroyed.

Uh, how?



> - Separte shared database stats from db_stats hash.
> 
> - Consider relaxing dbentry locking.
> 
> - Try removing pgStatPendingFunctions
> 
> - ispell on it.

Additionally:
- consider getting rid of all the pending stuff, not just for functions,
  - as far as I can tell it's unnecessary

Thanks,

Andres



Re: Using POPCNT and other advanced bit manipulation instructions

2019-02-15 Thread Andres Freund
Hi,

On 2019-02-14 16:45:38 -0500, Tom Lane wrote:
> Andres Freund  writes:
> > On 2019-02-14 15:47:13 -0300, Alvaro Herrera wrote:
> >> Hah, I just realized you have to add -mlzcnt in order for these builtins
> >> to use the lzcnt instructions.  It goes from something like
> >> 
> >> bsrq   %rax, %rax
> >> xorq   $63, %rax
> 
> > I'm confused how this is a general count leading zero operation? Did you
> > use constants or something that allowed ot infer a range in the test? If
> > so the compiler probably did some optimizations allowing it to do the
> > above.
> 
> No.  If you compile
> 
> int myclz(unsigned long long x)
> {
>   return __builtin_clzll(x);
> }
> 
> at -O2, on just about any x86_64 gcc, you will get
> 
> myclz:
> .LFB1:
> .cfi_startproc
> bsrq    %rdi, %rax
> xorq    $63, %rax
> ret
> .cfi_endproc

Yea, sorry for the noise. I misremembered the bsrq mnemonic.

bsr has a latency of three cycles, xor of one; lzcnt has a latency of
three. So it's mildly faster to use lzcnt (it uses fewer ports, and the
overall sequence has a shorter latency). But I doubt we have code where
that's noticeable.

Greetings,

Andres Freund



Re: Using POPCNT and other advanced bit manipulation instructions

2019-02-15 Thread Tom Lane
Alvaro Herrera  writes:
> Here's a final version that I intend to push shortly, to have time
> before EOB today to handle any fallout.

I think this is likely to result in a lot of complaints about
rightmost_one_pos[] being unreferenced, in non-HAVE__BUILTIN_CTZ
builds.  Probably that has to be an extern rather than static
in the header.  leftmost_one_pos[] likewise.

I might have a go at improving the configure tests later ---
I still don't like that they're compile-time-optimizable.
But that can wait.

regards, tom lane



Re: House style for DocBook documentation?

2019-02-15 Thread Peter Eisentraut
On 2019-01-26 09:31, Peter Eisentraut wrote:
> On 25/01/2019 15:37, Chapman Flack wrote:
>>> External links already create footnotes in the PDF output.  Is that
>>> different from what you are saying?
>> That might, only, indicate that I was just thinking aloud in email and
>> had not gone and checked in PDF output to see how the links were handled.
>>
>> Yes, it could very possibly indicate that.
>>
>> If they are already processed that way, does that mean the
>>
>> o  Do not use text with <ulink> so the URL appears in printed output
>>
>> in README.links should be considered obsolete, and removed even, and
>> doc authors should feel free to put link text in <ulink> without
>> hesitation?
> 
> I think it's obsolete, yes.

Committed that change.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: explain plans with information about (modified) gucs

2019-02-15 Thread Andres Freund
On 2019-02-15 17:10:28 +0100, Tomas Vondra wrote:
> True. So you agree printing the values as text (as the patch currently
> does) seems enough? I guess I'm fine with that.

I do agree, yes.



Re: Reporting script runtimes in pg_regress

2019-02-15 Thread Tom Lane
John Naylor  writes:
> On 2/15/19, Peter Eisentraut  wrote:
>> We should also strive to align "FAILED" properly.  This is currently
>> quite unreadable:
>> 
>> int4 ... ok (128 ms)
>> int8 ... FAILED (153 ms)
>> oid  ... ok (163 ms)
>> float4   ... ok (231 ms)

> If I may play devil's advocate, who cares how long it takes a test to
> fail? If it's not difficult, leaving the time out for failures would
> make them stand out more.

Actually, I'd supposed that that might be useful info, sometimes.
For example it might help you guess whether a timeout had elapsed.

regards, tom lane



Re: Commit Fest 2019-01 is now closed

2019-02-15 Thread Peter Eisentraut
On 2019-02-10 20:30, Magnus Hagander wrote:
> On Fri, Feb 8, 2019 at 2:12 PM Peter Eisentraut wrote:
> 
> On 08/02/2019 12:27, Magnus Hagander wrote:
> I'd be more than happy for somebody with more knowledge of such
> matters
> > than me to think up a better color scheme. The only reason it has
> those
> > colors is that they're the default ones in the Bootstrap CSS
> framework.
> 
> Can we have that column just normal black-and-white?
> 
> 
> Sure! Do you mean like the update pushed now, or to remove the label
> completely?

What you pushed is white-on-black, not black-on-white that I meant.
Sorry, my "black-and-white" was obviously ambiguous.

If we're going to have the white-on-something scheme, then it would be
useful to have different colors for different releases.  Otherwise there
is no value in it and it should just be plain black-on-white text.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Using POPCNT and other advanced bit manipulation instructions

2019-02-15 Thread Alvaro Herrera
Here's a final version that I intend to push shortly, to have time
before EOB today to handle any fallout.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 085650a174ff080f578bb289d3707173aaf4f07b Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Fri, 15 Feb 2019 13:07:02 -0300
Subject: [PATCH v2] Fix compiler builtin usage in new pg_bitutils.c

Split out these new functions in three parts: one in a new file that
uses the compiler builtin and gets compiled with the -mpopcnt compiler
option if it exists; another one that uses the compiler builtin but not
the compiler option; and finally the fallback with open-coded
algorithms.

Split out the configure logic: in the original commit, it was selecting
to use the -mpopcnt compiler switch together with deciding whether to
use the compiler builtin, but those two things are really separate.
Split them out.  Also, expose whether the builtin exists to
Makefile.global, so that src/port's Makefile can decide whether to
compile the hw-optimized file.

Remove CPUID test for CTZ/CLZ.  Make pg_{right,left}most_ones use either
the compiler intrinsic or open-coded algo; trying to use the
HW-optimized version is a waste of time.  Make them static inline
functions.

Discussion: https://postgr.es/m/20190213221719.GA15976@alvherre.pgsql
---
 config/c-compiler.m4|  22 +-
 configure   |  66 --
 configure.in|   6 +-
 src/Makefile.global.in  |   3 +
 src/include/port/pg_bitutils.h  | 176 ++-
 src/port/Makefile   |  16 +-
 src/port/pg_bitutils.c  | 378 
 src/port/pg_bitutils_hwpopcnt.c |  36 +++
 8 files changed, 327 insertions(+), 376 deletions(-)
 create mode 100644 src/port/pg_bitutils_hwpopcnt.c

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 05fa82518f8..7c0d52b515f 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -381,22 +381,16 @@ fi])# PGAC_C_BUILTIN_OP_OVERFLOW
 # PGAC_C_BUILTIN_POPCOUNT
 # -
 AC_DEFUN([PGAC_C_BUILTIN_POPCOUNT],
-[define([Ac_cachevar], [AS_TR_SH([pgac_cv_popcount])])dnl
-AC_CACHE_CHECK([for __builtin_popcount], [Ac_cachevar],
-[pgac_save_CFLAGS=$CFLAGS
-CFLAGS="$pgac_save_CFLAGS -mpopcnt"
-AC_COMPILE_IFELSE([AC_LANG_SOURCE(
-[static int x = __builtin_popcount(255);])],
-[Ac_cachevar=yes],
-[Ac_cachevar=no])
-CFLAGS="$pgac_save_CFLAGS"])
-if test x"$Ac_cachevar" = x"yes"; then
-  CFLAGS_POPCNT="-mpopcnt"
+[AC_CACHE_CHECK([for __builtin_popcount], pgac_cv__builtin_popcount,
+[AC_COMPILE_IFELSE([AC_LANG_SOURCE(
+[static int x = __builtin_popcount(255);]
+)],
+[pgac_cv__builtin_popcount=yes],
+[pgac_cv__builtin_popcount=no])])
+if test x"$pgac_cv__builtin_popcount" = x"yes"; then
 AC_DEFINE(HAVE__BUILTIN_POPCOUNT, 1,
   [Define to 1 if your compiler understands __builtin_popcount.])
-fi
-undefine([Ac_cachevar])dnl
-])# PGAC_C_BUILTIN_POPCOUNT
+fi])# PGAC_C_BUILTIN_POPCOUNT
 
 
 
diff --git a/configure b/configure
index 73e9c235b69..2e3cc372a6e 100755
--- a/configure
+++ b/configure
@@ -651,7 +651,7 @@ CFLAGS_ARMV8_CRC32C
 CFLAGS_SSE42
 have_win32_dbghelp
 LIBOBJS
-CFLAGS_POPCNT
+have__builtin_popcount
 UUID_LIBS
 LDAP_LIBS_BE
 LDAP_LIBS_FE
@@ -733,6 +733,7 @@ CPP
 BITCODE_CXXFLAGS
 BITCODE_CFLAGS
 CFLAGS_VECTOR
+CFLAGS_POPCNT
 PERMIT_DECLARATION_AFTER_STATEMENT
 LLVM_BINPATH
 LLVM_CXXFLAGS
@@ -6581,6 +6582,48 @@ fi
 
 fi
 
+# Optimization flags and options for bit-twiddling
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking whether ${CC} supports -mpopcnt, for CFLAGS_POPCNT" >&5
+$as_echo_n "checking whether ${CC} supports -mpopcnt, for CFLAGS_POPCNT... " >&6; }
+if ${pgac_cv_prog_CC_cflags__mpopcnt+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  pgac_save_CFLAGS=$CFLAGS
+pgac_save_CC=$CC
+CC=${CC}
+CFLAGS="${CFLAGS_POPCNT} -mpopcnt"
+ac_save_c_werror_flag=$ac_c_werror_flag
+ac_c_werror_flag=yes
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+
+int
+main ()
+{
+
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_compile "$LINENO"; then :
+  pgac_cv_prog_CC_cflags__mpopcnt=yes
+else
+  pgac_cv_prog_CC_cflags__mpopcnt=no
+fi
+rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
+ac_c_werror_flag=$ac_save_c_werror_flag
+CFLAGS="$pgac_save_CFLAGS"
+CC="$pgac_save_CC"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_prog_CC_cflags__mpopcnt" >&5
+$as_echo "$pgac_cv_prog_CC_cflags__mpopcnt" >&6; }
+if test x"$pgac_cv_prog_CC_cflags__mpopcnt" = x"yes"; then
+  CFLAGS_POPCNT="${CFLAGS_POPCNT} -mpopcnt"
+fi
+
+
+
+
 CFLAGS_VECTOR=$CFLAGS_VECTOR
 
 
@@ -14111,32 +14154,28 @@ $as_echo "#define HAVE__BUILTIN_CTZ 1" >>confdefs.h
 fi
 { $as_echo "$as_me:${as_lineno-$LINENO}: checking for __builtin_popcount" >&5
 $as_echo_n "checking for __builtin_popcount... " >&6; }
-if ${pgac_cv_popcount+:} false; then :
+if ${pgac_cv__builtin_popcount+:} false; 

Re: Reporting script runtimes in pg_regress

2019-02-15 Thread Tom Lane
Peter Eisentraut  writes:
> If we're going to keep this, should we enable the prove --timer option as 
> well?

As far as that goes: I've found that in some of my older Perl
installations, prove doesn't recognize the --timer switch.
So turning that on would require a configuration probe of some
sort, which seems like more trouble than it's worth.

regards, tom lane



Re: explain plans with information about (modified) gucs

2019-02-15 Thread Tomas Vondra



On 2/14/19 8:55 PM, Andres Freund wrote:
> Hi,
> 
> On 2019-01-15 02:39:49 +0100, Tomas Vondra wrote:
>>
>>
>> On 1/14/19 11:13 PM, Alvaro Herrera wrote:
>>> On 2019-Jan-14, Tomas Vondra wrote:
>>>
 The one slightly annoying issue is that currently all the options are
 formatted as text, including e.g. cpu_tuple_cost. That's because the GUC
 options may have a show hook, so I can't access the value directly (not
 sure if there's a way around it).
>>>
>>> I think the problem is that you'd have to know how to print the value,
>>> which can be in one of several different C types.  You'd have to grow
>>> some more infrastructure in the GUC tables, I think, and that doesn't
>>> seem worth the trouble.  Printing as text seems enough.
>>>
>>
>> I don't think the number of formats is such a big issue - the range of
>> formats is quite limited: PGC_BOOL, PGC_INT, PGC_REAL, PGC_STRING and
>> PGC_ENUM. But the show hook simply returns string, and I'm not sure it's
>> guaranteed it matches the raw value (afaik the assign/show hooks can do
>> all kinds of funny stuff).
> 
> Yea, but the underlying values are quite useless for
> humans. E.g. counting shared_buffers, wal_buffers etc in weird units is
> not helpful. So you'd need to support the different units for each.
> 

True. So you agree printing the values as text (as the patch currently
does) seems enough? I guess I'm fine with that.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Protect syscache from bloating with negative cache entries

2019-02-15 Thread Tomas Vondra



On 2/14/19 4:49 PM, 'Bruce Momjian' wrote:
> On Thu, Feb 14, 2019 at 01:31:49AM +, Tsunakawa, Takayuki wrote:
>> From: Bruce Momjian [mailto:br...@momjian.us]
 That being said, having a "minimal size" threshold before starting
 with the time-based eviction may be a good idea.
>>>
>>> Agreed.  I see the minimal size as a way to keep the system tables
>>> in cache, which we know we will need for the next query.
>>
>> Isn't it the maximum size, not the minimal size?  A maximum size allows
>> keeping the desired amount of system tables in memory as well as
>> controlling memory consumption to avoid out-of-memory errors (OS crash!).
>> I'm wondering why people want to take a different approach to the
>> catcache, unlike other PostgreSQL memory, e.g. shared_buffers,
>> temp_buffers, SLRU buffers, work_mem, and other DBMSs.
> 
> Well, that is an _excellent_ question, and one I had to think about.
> 

I think we're talking about two different concepts here:

1) minimal size - We don't do any extra eviction at all until we reach
this cache size, so a system that does not have cache-bloat issues gets
no extra overhead from it.

2) maximal size - We ensure the cache size is below this threshold. If
there's more data, we evict enough entries to get below it.

My proposal is essentially to do just (1), so the cache can grow very
large if needed but then it shrinks again after a while.
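
In code form, what I have in mind is roughly this (a sketch only; the
entry layout, the list representation and the names are made up, not
taken from the patch):

#include <stdlib.h>
#include <time.h>

typedef struct CacheEntrySketch
{
    struct CacheEntrySketch *next;
    time_t      last_access;
    /* ... key and payload ... */
} CacheEntrySketch;

/*
 * Do nothing below the size threshold; above it, drop entries that have
 * not been touched for prune_min_age_sec seconds.  Entries are assumed to
 * be malloc'ed for the purposes of this sketch.
 */
static void
maybe_prune_cache(CacheEntrySketch **head, long nentries,
                  long min_entries, int prune_min_age_sec)
{
    time_t      now = time(NULL);
    CacheEntrySketch **p;

    if (nentries <= min_entries)
        return;                 /* below the threshold: zero extra work */

    for (p = head; *p != NULL;)
    {
        if (now - (*p)->last_access > prune_min_age_sec)
        {
            CacheEntrySketch *victim = *p;

            *p = victim->next;  /* unlink */
            free(victim);
        }
        else
            p = &(*p)->next;
    }
}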

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Using POPCNT and other advanced bit manipulation instructions

2019-02-15 Thread Alvaro Herrera
On 2019-Feb-15, Tom Lane wrote:

> Alvaro Herrera  writes:
> > Ah, I understand it now:
> > https://stackoverflow.com/questions/25683690/confusion-about-bsr-and-lzcnt/43443701#43443701
> > if you call LZCNT/TZCNT on a CPU that doesn't support it, it won't raise
> > SIGILL or anything ... it'll just silently compute the wrong result.
> > That's certainly not what I call a fallback!
> 
> Yeah, that's pretty nasty; it means there's no backstop for whether
> your choose function gets it right :-(

Hopefully other tests will fail in some visible way, though.  My fear is
whether we have such systems in the buildfarm.

> Is POPCNT any better in this respect?

I couldn't find how POPCNT is encoded.
https://stackoverflow.com/a/28803917/242383

I did find these articles:
http://danluu.com/assembly-intrinsics/
https://stackoverflow.com/questions/25078285/replacing-a-32-bit-loop-counter-with-64-bit-introduces-crazy-performance-deviati

This suggests that this is all a largely pointless exercise, at least on
Intel and GCC/Clang.  It may be better on AMD ... but to get really
better performance we'd need to be coding the popcnt calls in assembly
rather than using the compiler intrinsics, even with -mpopcnt, because
the intrinsics suck.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Reporting script runtimes in pg_regress

2019-02-15 Thread John Naylor
On 2/15/19, Peter Eisentraut  wrote:
> We should also strive to align "FAILED" properly.  This is currently
> quite unreadable:
>
>  int4 ... ok (128 ms)
>  int8 ... FAILED (153 ms)
>  oid  ... ok (163 ms)
>  float4   ... ok (231 ms)

If I may play devil's advocate, who cares how long it takes a test to
fail? If it's not difficult, leaving the time out for failures would
make them stand out more.

-- 
John Naylor                   https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Protect syscache from bloating with negative cache entries

2019-02-15 Thread Tomas Vondra



On 2/14/19 3:46 PM, Bruce Momjian wrote:
> On Thu, Feb 14, 2019 at 12:40:10AM -0800, Andres Freund wrote:
>> Hi,
>>
>> On 2019-02-13 15:31:14 +0900, Kyotaro HORIGUCHI wrote:
>>> Instead, I added an accounting(?) interface function.
>>>
>>> | MemoryContextGettConsumption(MemoryContext cxt);
>>>
>>> The API returns the current consumption in this memory
>>> context. This allows "real" memory accounting almost without
>>> overhead.
>>
>> That's definitely *NOT* almost without overhead. This adds additional
>> instructions to one postgres' hottest set of codepaths.
>>
>> I think you're not working incrementally enough here. I strongly suggest
>> solving the negative cache entry problem, and then incrementally go from
>> there after that's committed. The likelihood of this patch ever getting
>> merged otherwise seems extremely small.
> 
> Agreed --- the patch is going in the wrong direction.
> 

I recall endless discussions about memory accounting in the
"memory-bounded hash-aggregate" patch a couple of years ago, and the
overhead was one of the main issues there. So yeah, trying to solve that
problem here is likely to kill this patch (or at least significantly
delay it).

ISTM there's a couple of ways to deal with that:

1) Ignore the memory amounts entirely, and do just time-based eviction.

2) If we want some size thresholds (e.g. to disable eviction for
backends with small caches etc.) use the number of entries instead. I
don't think that's particularly worse than specifying the size in MB.


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Using POPCNT and other advanced bit manipulation instructions

2019-02-15 Thread Tom Lane
Alvaro Herrera  writes:
> I noticed in Compiler Explorer that some (ancient?) Power cpus
> implement instruction "popcntb", and GCC support for those uses
> -mpopcntb switch enabling __builtin_popcount() to use it.  I added the
> switch to configure.in but I'm not sure how well that will work ...  I
> don't know if this is represented in buildfarm.

I experimented a bit with this on an old Apple laptop.  Apple's
compiler rejects -mpopcntb altogether.  FreeBSD's compiler
(gcc 4.2.1) recognizes the switch, but I could not get it to
emit the instruction, even when specifying -mcpu=power5,
which ought to enable it according to the gcc docs:

 ... The `-mpopcntb' option allows GCC to generate the
 popcount and double precision FP reciprocal estimate instruction
 implemented on the POWER5 processor and other processors that
 support the PowerPC V2.02 architecture.

A more recent gcc info file also mentions

 The `-mpopcntd' option
 allows GCC to generate the popcount instruction implemented on the
 POWER7 processor and other processors that support the PowerPC
 V2.06 architecture.

but the gcc version I have on this laptop doesn't know that switch.
In any case, I'm pretty sure Apple never shipped a CPU that could
run either instruction.

I suspect that probing for either option may not be worth the
configure cycles it'd consume :-( ... there are just way too
few of those specific POWER variants out there anymore, even
granting that you have a compiler that will play along.

Moreover, you can't turn on -mpopcntb without having some POWER
equivalent to the CPUID test.

However, if you want to leave the option for this open in
future, it really makes the file name pg_bitutils_sse42.c
quite inappropriate.  How about pg_bitutils_hwpopcnt.c
or something like that?

regards, tom lane



Re: Protect syscache from bloating with negative cache entries

2019-02-15 Thread Tomas Vondra




On 2/13/19 1:23 AM, Tsunakawa, Takayuki wrote:
> From: Kyotaro HORIGUCHI [mailto:horiguchi.kyot...@lab.ntt.co.jp]
>> I'm at a loss as to what to call the syscache for users. I think it is
>> "catalog cache". The most basic component is called catcache, which is
>> covered by the syscache layer; neither of them is revealed to
>> users, and it is shown to the user as "catalog cache".
>>
>> Do "catalog_cache_prune_min_age", "catalog_cache_memory_target", (if it
>> exists) "catalog_cache_entry_limit" and
>> "catalog_cache_prune_ratio" make sense?
> 
> PostgreSQL documentation uses "system catalog" in its table of contents, so 
> syscat_cache_xxx would be a bit more familiar?  I'm for either catalog_ or 
> syscat_, but what name shall we use for the relation cache?  catcache and 
> relcache have different element sizes and possibly different usage patterns, 
> so they may as well have different parameters just like MySQL does.  If we 
> follow that idea, then the name would be relation_cache_xxx.  However, from 
> the user's viewpoint, the relation cache is also created from the system 
> catalog like pg_class and pg_attribute...
> 

I think "catalog_cache_..." is fine. If we end up with a similar
patch for relcache, we can probably call it "relation_cache_".

I'd be OK even with "system_catalog_cache_..." - I don't think it's
overly long (better to have a longer but descriptive name), and "syscat"
just seems like unnecessary abbreviation.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: libpq compression

2019-02-15 Thread Konstantin Knizhnik




On 15.02.2019 18:26, Tomas Vondra wrote:
>
> On 2/15/19 3:03 PM, Konstantin Knizhnik wrote:
>>
>> On 15.02.2019 15:42, Peter Eisentraut wrote:
>>> On 2018-06-19 09:54, Konstantin Knizhnik wrote:
>>>> The main drawback of streaming compression is that you can not
>>>> decompress some particular message without decompression of all previous
>>>> messages.
>>>
>>> It seems this would have an adverse effect on protocol-aware connection
>>> proxies: They would have to uncompress everything coming in and
>>> recompress everything going out.
>>>
>>> The alternative of compressing each packet individually would work much
>>> better: A connection proxy could peek into the packet header and only
>>> uncompress the (few, small) packets that it needs for state and routing.
>>>
>> Individual compression of each message would defeat the whole purpose of
>> libpq compression. Messages are too small to compress each of them
>> efficiently on their own, so using a streaming compression algorithm is
>> absolutely necessary here.
>>
> Hmmm, I see Peter was talking about "packets" while you're talking about
> "messages". Are you talking about the same thing?

Sorry, but there are no "packets" in the libpq protocol, so I assumed
that packet = message.

In any case, a protocol-aware proxy has to process each message.

> Anyway, I was going to write about the same thing - that per-message
> compression would likely eliminate most of the benefits - but I'm
> wondering if it's actually true. That is, how much will the compression
> ratio drop if we compress individual messages?

Compression of small messages without a shared dictionary will give
awful results.

Assume that the average record, and hence message, size is 100 bytes.
Just perform a very simple experiment: create a file with 100 identical
characters and try to compress it. With zlib the result will be 173
bytes, so after "compression" the file size increases 1.7 times.
This is why there is no other way to efficiently compress libpq traffic
than to use streaming compression (where the dictionary is shared and
updated across all messages).

> Obviously, if there are just tiny messages, it might easily eliminate
> any benefits (and in fact it would add overhead). But I'd say we're way
> more interested in transferring large data sets (result sets, data for
> copy, etc.) and presumably those messages are much larger. So maybe we
> could compress just those, somehow?

Please notice that a COPY stream consists of individual messages for
each record.


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: libpq compression

2019-02-15 Thread Tomas Vondra




On 2/15/19 3:03 PM, Konstantin Knizhnik wrote:
> 
> 
> On 15.02.2019 15:42, Peter Eisentraut wrote:
>> On 2018-06-19 09:54, Konstantin Knizhnik wrote:
>>> The main drawback of streaming compression is that you can not
>>> decompress some particular message without decompression of all previous
>>> messages.
>> It seems this would have an adverse effect on protocol-aware connection
>> proxies: They would have to uncompress everything coming in and
>> recompress everything going out.
>>
>> The alternative of compressing each packet individually would work much
>> better: A connection proxy could peek into the packet header and only
>> uncompress the (few, small) packets that it needs for state and routing.
>>
> Individual compression of each message would defeat the whole purpose of
> libpq compression. Messages are too small to compress each of them
> efficiently on their own, so using a streaming compression algorithm is
> absolutely necessary here.
> 

Hmmm, I see Peter was talking about "packets" while you're talking about
"messages". Are you talking about the same thing?

Anyway, I was going to write about the same thing - that per-message
compression would likely eliminate most of the benefits - but I'm
wondering if it's actually true. That is, how much will the compression
ratio drop if we compress individual messages?

Obviously, if there are just tiny messages, it might easily eliminate
any benefits (and in fact it would add overhead). But I'd say we're way
more interested in transferring large data sets (result sets, data for
copy, etc.) and presumably those messages are much larger. So maybe we
could compress just those, somehow?
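
FWIW the difference is easy to measure with zlib directly. A standalone
sketch (not the patch; build with -lz, the message contents are made up):

#include <stdio.h>
#include <string.h>
#include <zlib.h>

/* Compress one message on the given stream, return the compressed size. */
static size_t
compress_msg(z_stream *zs, const char *msg, unsigned char *out, size_t outsz)
{
    zs->next_in = (Bytef *) msg;
    zs->avail_in = strlen(msg);
    zs->next_out = out;
    zs->avail_out = outsz;
    deflate(zs, Z_SYNC_FLUSH);          /* flush so the peer can decode it now */
    return outsz - zs->avail_out;
}

int
main(void)
{
    const char *msgs[] = {
        "DataRow: 42|alice|alice@example.com|2019-02-15",
        "DataRow: 43|bob|bob@example.com|2019-02-15",
        "DataRow: 44|carol|carol@example.com|2019-02-15",
    };
    unsigned char out[512];
    size_t      per_msg = 0, streamed = 0;
    z_stream    zs;
    int         i;

    /* per-message: a fresh stream (and dictionary) for every message */
    for (i = 0; i < 3; i++)
    {
        memset(&zs, 0, sizeof(zs));
        deflateInit(&zs, Z_DEFAULT_COMPRESSION);
        per_msg += compress_msg(&zs, msgs[i], out, sizeof(out));
        deflateEnd(&zs);
    }

    /* streaming: one stream for the whole connection, window is shared */
    memset(&zs, 0, sizeof(zs));
    deflateInit(&zs, Z_DEFAULT_COMPRESSION);
    for (i = 0; i < 3; i++)
        streamed += compress_msg(&zs, msgs[i], out, sizeof(out));
    deflateEnd(&zs);

    printf("per-message: %zu bytes, streamed: %zu bytes\n", per_msg, streamed);
    return 0;
}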

> Concerning the possible problem with proxies, I do not think that it is
> really a problem.
> A proxy is very rarely located somewhere in the "middle" between client
> and database server.
> It is usually launched either in the same network as the DBMS client (for
> example, if the client is an application server) or in the same network
> as the database server.
> In both cases there is not much sense in passing compressed traffic
> through the proxy.
> If the proxy and the DBMS server are located in the same network, then
> the proxy should perform decompression and send decompressed messages to
> the database server.
> 

I don't think that's entirely true. It makes perfect sense to pass
compressed traffic in various situations - even in a local network the
network bandwidth matters quite a lot these days, because "local
network" may mean "same availability zone" or "same DC" etc.

That being said, I'm not sure it's a big deal / issue when the proxy has
to deal with compression.

Either it's fine to forward decompressed data, so the proxy performs
just decompression, which requires much less CPU. (It probably needs to
compress data in the opposite direction, but it's usually quite
asymmetric - much more data is sent in one direction). Or the data has
to be recompressed, because it saves enough network bandwidth. It's
essentially a trade-off between using CPU and network bandwidth.

IMHO it'd be nonsense to adopt the per-message compression based merely
on the fact that it might be easier to handle on proxies. We need to
know if we can get reasonable compression ratio with that approach,
because if not then it's useless that it's more proxy-friendly.

Do the proxies actually need to recompress the data? Can't they just
decompress it to determine which messages are in the data, and then
forward the original compressed stream? That would be much cheaper,
because decompression requires much less CPU. Although, some proxies
(like connection pools) probably have to compress the connections
independently ...

> Thank you very much for noticing this problem with the compatibility of
> compression and protocol-aware connection proxies.
> I have written that the current compression implementation (zpq_stream.c)
> can be used not only for the libpq backend/frontend, but also for
> compressing any other streaming data. But I could not imagine what other
> data sources might require compression.
> And a proxy is exactly such a case: it also needs to compress/decompress
> messages.
> It is one more argument for making the interface of zpq_stream as simple
> as possible and encapsulating all inflating/deflating logic in this code.
> That can be achieved by passing arbitrary rx/tx functions to the
> zpq_create function.
> 

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Using POPCNT and other advanced bit manipulation instructions

2019-02-15 Thread Tom Lane
Alvaro Herrera  writes:
> Ah, I understand it now:
> https://stackoverflow.com/questions/25683690/confusion-about-bsr-and-lzcnt/43443701#43443701
> if you call LZCNT/TZCNT on a CPU that doesn't support it, it won't raise
> SIGILL or anything ... it'll just silently compute the wrong result.
> That's certainly not what I call a fallback!

Yeah, that's pretty nasty; it means there's no backstop for whether
your choose function gets it right :-(

Is POPCNT any better in this respect?

regards, tom lane



Re: [HACKERS] Time to change pg_regress diffs to unified by default?

2019-02-15 Thread Christoph Berg
Re: Andrew Gierth 2019-02-15 <874l95m8w7@news-spur.riddles.org.uk>
> Also while we're tweaking regression test output, would it be possible
> to have some indicator of whether a test passed because a variant file
> in the resultmap was ignored in favor of the standard result?
> 
> The current system of silently ignoring the resultmap means that nobody
> ever notices when resultmap entries become obsolete

By the same argument, it should always print which variant file was
used so determining which _N.out files are still in use is possible.

Christoph



Re: query logging of prepared statements

2019-02-15 Thread Justin Pryzby
Sigh, resending to -hackers for real.

On Fri, Feb 08, 2019 at 07:29:53AM -0600, Justin Pryzby wrote:
> A couple months ago, I implemented prepared statements for PyGreSQL.  While
> updating our application in advance of their release with that feature, I
> noticed that our query logs were several times larger.

Previously sent to -general (and quoted fully below), resending to -hackers
with patch.
https://www.postgresql.org/message-id/20190208132953.GF29720%40telsasoft.com
https://www.postgresql.org/docs/current/runtime-config-logging.html

I propose that the prepared statement associated with an EXECUTE should be
included in log "DETAIL" only when log_error_verbosity=VERBOSE, for both SQL
EXECUTE and PQexecPrepared (if at all).  I'd like to be able to continue
logging DETAIL (including bind parameters), so I want to avoid setting
"TERSE" just to avoid redundant messages.  (It occurs to me that the GUC should
probably stick to its existing documented behavior rather than be extended,
which suggests that the duplicative portions of the logs should simply be
removed, rather than conditionalized.  Let me know what you think).

With attached patch, I'm not sure if !*portal_name && !portal->prepStmtName
would catch cases other than PQexecParams (?)

Compare unpatched server to patched server to patched server with
log_error_verbosity=verbose.

|$ psql postgres -xtc "SET log_error_verbosity=default;SET log_statement='all'; 
SET client_min_messages=log" -c "PREPARE q AS SELECT 2" -c "EXECUTE q"
|SET
|LOG:  statement: PREPARE q AS SELECT 2
|PREPARE
|LOG:  statement: EXECUTE q
|DETAIL:  prepare: PREPARE q AS SELECT 2
|?column? | 2

|$ PGHOST=/tmp PGPORT=5678 psql postgres -xtc "SET 
log_error_verbosity=default;SET log_statement='all'; SET 
client_min_messages=log" -c "PREPARE q AS SELECT 2" -c "EXECUTE q"
|SET
|LOG:  statement: PREPARE q AS SELECT 2
|PREPARE
|LOG:  statement: EXECUTE q
|?column? | 2

|$ PGHOST=/tmp PGPORT=5678 psql postgres -xtc "SET 
log_error_verbosity=verbose;SET log_statement='all'; SET 
client_min_messages=log" -c "PREPARE q AS SELECT 2" -c "EXECUTE q"
|SET
|LOG:  statement: PREPARE q AS SELECT 2
|PREPARE
|LOG:  statement: EXECUTE q
|DETAIL:  prepare: PREPARE q AS SELECT 2
|?column? | 2

For PQexecPrepared library call:

|$ xPGPORT=5678 xPGHOST=/tmp PYTHONPATH=../PyGreSQL/build/lib.linux-x86_64-2.7/ 
python2.7 -c "import pg; d=pg.DB('postgres'); d.query('SET 
client_min_messages=log; SET log_error_verbosity=default; SET 
log_statement=\"all\"'); d.query('SELECT 1; PREPARE q AS SELECT \$1'); 
d.query_prepared('q',[1]); d.query_formatted('SELECT %s', [2])"
|LOG:  statement: SELECT 1; PREPARE q AS SELECT $1
|LOG:  execute q: SELECT 1; PREPARE q AS SELECT $1
|DETAIL:  parameters: $1 = '1'
|LOG:  execute : SELECT $1
|DETAIL:  parameters: $1 = '2'

|$ PGPORT=5678 PGHOST=/tmp PYTHONPATH=../PyGreSQL/build/lib.linux-x86_64-2.7/ 
python2.7 -c "import pg; d=pg.DB('postgres'); d.query('SET 
client_min_messages=log; SET log_error_verbosity=default; SET 
log_statement=\"all\"'); d.query('SELECT 1; PREPARE q AS SELECT \$1'); 
d.query_prepared('q',[1]); d.query_formatted('SELECT %s', [2])"
|LOG:  statement: SELECT 1; PREPARE q AS SELECT $1
|LOG:  execute q
|DETAIL:  parameters: $1 = '1'
|LOG:  execute : SELECT $1
|DETAIL:  parameters: $1 = '2'

|$ PGPORT=5678 PGHOST=/tmp PYTHONPATH=../PyGreSQL/build/lib.linux-x86_64-2.7/ 
python2.7 -c "import pg; d=pg.DB('postgres'); d.query('SET 
client_min_messages=log; SET log_error_verbosity=verbose; SET 
log_statement=\"all\"'); d.query('SELECT 1; PREPARE q AS SELECT \$1'); 
d.query_prepared('q',[1]); d.query_formatted('SELECT %s', [2])"
|LOG:  statement: SELECT 1; PREPARE q AS SELECT $1
|LOG:  execute q: SELECT 1; PREPARE q AS SELECT $1
|DETAIL:  parameters: $1 = '1'
|LOG:  execute : SELECT $1
|DETAIL:  parameters: $1 = '2'

Also, I noticed that if the statement was prepared using SQL PREPARE (rather
than PQprepare), and if it used simple query with multiple commands, they're
all included in the log, like this when executed with PQexecPrepared:
|LOG:  execute q: SET log_error_verbosity=verbose; SET client_min_messages=log; 
PREPARE q AS SELECT $1

And looks like this for SQL EXECUTE:
|[pryzbyj@telsasoft-db postgresql]$ psql postgres -txc "SET 
client_min_messages=log;SELECT 1;PREPARE q AS SELECT 2" -c "EXECUTE q"
|PREPARE
|LOG:  statement: EXECUTE q
|DETAIL:  prepare: SET client_min_messages=log;SELECT 1;PREPARE q AS SELECT 2
|?column? | 2

On Fri, Feb 08, 2019 at 07:29:53AM -0600, Justin Pryzby wrote:
> A couple months ago, I implemented prepared statements for PyGreSQL.  While
> updating our application in advance of their release with that feature, I
> noticed that our query logs were several times larger.
> 
> With non-prepared statements, we logged to CSV:
> |message| SELECT 1
> 
> With SQL EXECUTE, we log:
> |message| statement: EXECUTE sqlex(2);
> |detail | prepare: prepare sqlex AS SELECT $1;
> 
> With 

Re: Reporting script runtimes in pg_regress

2019-02-15 Thread Tom Lane
Peter Eisentraut  writes:
> On 2019-02-15 14:32, Peter Eisentraut wrote:
>> test event_trigger... ok  128 ms
>> test fast_default ... ok  173 ms
>> test stats... ok  637 ms

That looks reasonable, although on machines where test runtimes run
into the tens of seconds, there's not going to be nearly as much
whitespace as this example suggests.

> We should also strive to align "FAILED" properly.

Hmm.  The reasonable ways to accomplish that look to be either
(a) pad "ok" to the width of "FAILED", or (b) rely on emitting a tab.
I don't much like either, especially from the localization angle.
One should also note that FAILED often comes along with additional
verbiage, such as "(ignored)" or a note about process exit status;
so I think making such cases line up totally neatly is a lost cause
anyway.

How do you feel about letting it do this:

  int4 ... ok  128 ms
  int8 ... FAILED  153 ms
  oid  ... ok  163 ms
  float4   ... ok  231 ms

regards, tom lane



Re: Using POPCNT and other advanced bit manipulation instructions

2019-02-15 Thread Alvaro Herrera
On 2019-Feb-14, Tom Lane wrote:

> Then the correct test to see if we want to build pg_popcount.c (BTW,
> please pick a less generic name for that) and the choose function
> is whether we have *both* HAVE__BUILTIN_POPCOUNT and nonempty
> CFLAGS_POPCNT.

I used pg_bitutils_sse42.c to host the specially-compiled functions.
The name is not entirely correct, but seems clear enough.

I noticed in Compiler Explorer that some (ancient?) Power CPUs implement
a "popcntb" instruction, and GCC supports it for those CPUs via the
-mpopcntb switch, which lets __builtin_popcount() use it.  I added the
switch to configure.in, but I'm not sure how well that will work ... I
don't know whether any buildfarm machine covers this.
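
As a standalone illustration of the dispatch pattern being split out here
(the names below are made up for the example and do not match pg_bitutils.c):
the builtin-using variant is compiled separately with the special flag, and a
function pointer selects it at run time only if the CPU has the instruction,
falling back to a portable implementation otherwise.

/*
 * popcount_demo.c -- illustrative only, not the patch's code.
 * Build e.g. with: gcc -O2 popcount_demo.c
 */
#include <stdint.h>
#include <stdio.h>

/* Portable fallback: count bits one at a time. */
static int
popcount32_slow(uint32_t word)
{
	int			n = 0;

	while (word)
	{
		n += word & 1;
		word >>= 1;
	}
	return n;
}

/*
 * "Fast" variant.  In the scheme discussed here, this function would live
 * in its own file compiled with -mpopcnt, so the compiler is allowed to
 * emit the POPCNT instruction for the builtin.
 */
static int
popcount32_fast(uint32_t word)
{
#ifdef __GNUC__
	return __builtin_popcount(word);
#else
	return popcount32_slow(word);
#endif
}

/* Function pointer used by callers; resolved once at startup. */
static int	(*popcount32) (uint32_t) = NULL;

static void
choose_popcount(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
	__builtin_cpu_init();
	/* Only take the fast path if the CPU actually has POPCNT. */
	if (__builtin_cpu_supports("popcnt"))
	{
		popcount32 = popcount32_fast;
		return;
	}
#endif
	popcount32 = popcount32_slow;
}

int
main(void)
{
	choose_popcount();
	printf("popcount(255) = %d\n", popcount32(255));
	return 0;
}

The point of the file split is simply that the translation unit containing
the fast variant can be built with -mpopcnt (or -mpopcntb on Power) without
forcing that flag on the rest of the tree.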

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From a3c654f9446ed0f8ead57d4f7202554311135dbf Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Fri, 15 Feb 2019 11:14:04 -0300
Subject: [PATCH] Fix compiler builtin usage in new pg_bitutils.c

Split out these new functions in three parts: one in a new file that
uses the compiler builtin and gets compiled with the -mpopcnt compiler
option if it exists; another one that uses the compiler builtin but not
the compiler option; and finally the fallback with open-coded
algorithms.

Split out the configure logic: in the original commit, it was selecting
to use the -mpopcnt compiler switch together with deciding to use the
compiler builtin.  However, some compilers implement the builtin even
though they don't have the compiler switch, so split both things.  Also,
expose whether the builtin exists to Makefile.global, so that src/port's
Makefile can decide whether to compile the special file.

Remove CPUID test for CTZ/CLZ.
---
 config/c-compiler.m4   |  22 +-
 configure  | 105 +++--
 configure.in   |   7 +-
 src/Makefile.global.in |   3 +
 src/include/port/pg_bitutils.h | 176 +++-
 src/port/Makefile  |  16 +-
 src/port/pg_bitutils.c | 374 -
 src/port/pg_bitutils_sse42.c   |  36 
 8 files changed, 365 insertions(+), 374 deletions(-)
 create mode 100644 src/port/pg_bitutils_sse42.c

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 05fa82518f8..7c0d52b515f 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -381,22 +381,16 @@ fi])# PGAC_C_BUILTIN_OP_OVERFLOW
 # PGAC_C_BUILTIN_POPCOUNT
 # -
 AC_DEFUN([PGAC_C_BUILTIN_POPCOUNT],
-[define([Ac_cachevar], [AS_TR_SH([pgac_cv_popcount])])dnl
-AC_CACHE_CHECK([for __builtin_popcount], [Ac_cachevar],
-[pgac_save_CFLAGS=$CFLAGS
-CFLAGS="$pgac_save_CFLAGS -mpopcnt"
-AC_COMPILE_IFELSE([AC_LANG_SOURCE(
-[static int x = __builtin_popcount(255);])],
-[Ac_cachevar=yes],
-[Ac_cachevar=no])
-CFLAGS="$pgac_save_CFLAGS"])
-if test x"$Ac_cachevar" = x"yes"; then
-  CFLAGS_POPCNT="-mpopcnt"
+[AC_CACHE_CHECK([for __builtin_popcount], pgac_cv__builtin_popcount,
+[AC_COMPILE_IFELSE([AC_LANG_SOURCE(
+[static int x = __builtin_popcount(255);]
+)],
+[pgac_cv__builtin_popcount=yes],
+[pgac_cv__builtin_popcount=no])])
+if test x"$pgac_cv__builtin_popcount" = x"yes"; then
 AC_DEFINE(HAVE__BUILTIN_POPCOUNT, 1,
   [Define to 1 if your compiler understands __builtin_popcount.])
-fi
-undefine([Ac_cachevar])dnl
-])# PGAC_C_BUILTIN_POPCOUNT
+fi])# PGAC_C_BUILTIN_POPCOUNT
 
 
 
diff --git a/configure b/configure
index 73e9c235b69..fa0f1216a0a 100755
--- a/configure
+++ b/configure
@@ -651,7 +651,7 @@ CFLAGS_ARMV8_CRC32C
 CFLAGS_SSE42
 have_win32_dbghelp
 LIBOBJS
-CFLAGS_POPCNT
+have__builtin_popcount
 UUID_LIBS
 LDAP_LIBS_BE
 LDAP_LIBS_FE
@@ -733,6 +733,7 @@ CPP
 BITCODE_CXXFLAGS
 BITCODE_CFLAGS
 CFLAGS_VECTOR
+CFLAGS_POPCNT
 PERMIT_DECLARATION_AFTER_STATEMENT
 LLVM_BINPATH
 LLVM_CXXFLAGS
@@ -6581,6 +6582,87 @@ fi
 
 fi
 
+# Optimization flags and options for bit-twiddling
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking whether ${CC} supports -mpopcnt, for CFLAGS_POPCNT" >&5
+$as_echo_n "checking whether ${CC} supports -mpopcnt, for CFLAGS_POPCNT... " >&6; }
+if ${pgac_cv_prog_CC_cflags__mpopcnt+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  pgac_save_CFLAGS=$CFLAGS
+pgac_save_CC=$CC
+CC=${CC}
+CFLAGS="${CFLAGS_POPCNT} -mpopcnt"
+ac_save_c_werror_flag=$ac_c_werror_flag
+ac_c_werror_flag=yes
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+
+int
+main ()
+{
+
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_compile "$LINENO"; then :
+  pgac_cv_prog_CC_cflags__mpopcnt=yes
+else
+  pgac_cv_prog_CC_cflags__mpopcnt=no
+fi
+rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
+ac_c_werror_flag=$ac_save_c_werror_flag
+CFLAGS="$pgac_save_CFLAGS"
+CC="$pgac_save_CC"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_prog_CC_cflags__mpopcnt" >&5
+$as_echo "$pgac_cv_prog_CC_cflags__mpopcnt" >&6; }
+if test x"$pgac_cv_prog_CC_cflags__mpopcnt" = x"yes"; then
+  

Re: [HACKERS] Time to change pg_regress diffs to unified by default?

2019-02-15 Thread Andrew Gierth
> "Peter" == Peter Eisentraut  writes:

 Peter> (This might be one of those rare times where one hopes for a
 Peter> buildfarm failure for verification. :-) )

Also while we're tweaking regression test output, would it be possible
to have some indicator of whether a test passed because a variant file
in the resultmap was ignored in favor of the standard result?

The current system of silently ignoring the resultmap means that nobody
ever notices when resultmap entries become obsolete.

-- 
Andrew (irc:RhodiumToad)



Re: Log a sample of transactions

2019-02-15 Thread Adrien NAYRAT

On 2/14/19 9:14 PM, Andres Freund wrote:

Hi,



Hello Andres,


On 2019-01-26 11:44:58 +0100, Adrien NAYRAT wrote:

+ 
+  log_transaction_sample_rate (real)
+  
+   log_transaction_sample_rate configuration 
parameter
+  
+  
+   
+
+ Set the fraction of transactions whose statements are logged. It 
applies
+ to each new transaction regardless of their duration. The default is
+ 0, meaning do not log statements from this 
transaction.
+ Setting this to 1 logs all statements for all 
transactions.
+ Logging can be disabled inside a transaction by setting this to 0.
+ log_transaction_sample_rate is helpful to track a
+ sample of transaction.
+
+   
+  
+
   


I wonder if this doesn't need a warning, explaining that using this when
there are large transactions could lead to slowdowns.


Yes, I will add some wording





  
diff --git a/src/backend/access/transam/xact.c 
b/src/backend/access/transam/xact.c
index 7c3a9c1e89..1a52c10c39 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1821,6 +1821,9 @@ StartTransaction(void)
s->state = TRANS_START;
s->transactionId = InvalidTransactionId; /* until assigned */
  
+	/* Determine if we log statements in this transaction */

+   xact_is_sampled = random() <= log_xact_sample_rate * MAX_RANDOM_VALUE;
+
/*
 * initialize current transaction state fields
 *
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index e773f20d9f..a11667834e 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -100,7 +100,8 @@ int max_stack_depth = 100;
  /* wait N seconds to allow attach from a debugger */
  int   PostAuthDelay = 0;
  
-

+/* flag for logging statements in a transaction */
+bool   xact_is_sampled = false;


It seems pretty weird to have this in postgres.c, given you declared it
in xact.h?


Yes, I have to revise my C. I will move it to 
src/backend/access/transam/xact.c






@@ -2218,7 +2221,8 @@ check_log_statement(List *stmt_list)
  int
  check_log_duration(char *msec_str, bool was_logged)
  {
-   if (log_duration || log_min_duration_statement >= 0)
+   if (log_duration || log_min_duration_statement >= 0 ||
+   (xact_is_sampled && log_xact_sample_rate > 0))


Why are both of these checked? ISTM once xact_is_sampled is set, we
ought not to respect a different value of log_xact_sample_rate for the
rest of that transaction.


I added these checks to allow disabling logging during a sampled
transaction, per a suggestion from Masahiko Sawada:

https://www.postgresql.org/message-id/CAD21AoD4t%2BhTV6XfK5Yz%3DEocB8oMyJSYFfjAryCDYtqfib2GrA%40mail.gmail.com




As far as I can tell xact_is_sampled is not properly reset in case of
errors?



Yes, I wonder where we should reset it? Is it in AbortTransaction() or 
CleanupTransaction()?


Thanks!



Re: Using POPCNT and other advanced bit manipulation instructions

2019-02-15 Thread Alvaro Herrera
On 2019-Feb-15, Alvaro Herrera wrote:

> On 2019-Feb-15, Kyotaro HORIGUCHI wrote:

> > And even worse lzcntx is accidentially "fallback"s to bsrx on
> > unsupported CPUs, which leads to bogus results.
> > __builtin_clzll(8) = 3, which should be 60.
> 
> I'm not sure I understand this point.  Are you saying that if we use
> -mlzcnt to compile, then the compiler builtin is broken in platforms
> that don't support the lzcnt instruction?  That's horrible.  Let's stay
> away from -mlzcnt then.

Ah, I understand it now:
https://stackoverflow.com/questions/25683690/confusion-about-bsr-and-lzcnt/43443701#43443701
if you call LZCNT/TZCNT on a CPU that doesn't support it, it won't raise
SIGILL or anything ... it'll just silently compute the wrong result.
That's certainly not what I call a fallback!

I think David's code was correct because it was testing CPUID for
instruction support before using the specialized code (well, except that
he forgot to add the right compiler option to *enable* the LZCNT/TZCNT
instructions in the first place); but given subsequent discussion that
the instruction is not worth it anyway, we might as well ignore it.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Reporting script runtimes in pg_regress

2019-02-15 Thread Peter Eisentraut
On 2019-02-15 14:32, Peter Eisentraut wrote:
>  identity ... ok  238 ms
>  partition_join   ... ok  429 ms
>  partition_prune  ... ok  786 ms
>  reloptions   ... ok   94 ms
>  hash_part... ok   78 ms
>  indexing ... ok 1298 ms
>  partition_aggregate  ... ok  727 ms
>  partition_info   ... ok  110 ms
> test event_trigger... ok  128 ms
> test fast_default ... ok  173 ms
> test stats... ok  637 ms

We should also strive to align "FAILED" properly.  This is currently
quite unreadable:

 int4 ... ok (128 ms)
 int8 ... FAILED (153 ms)
 oid  ... ok (163 ms)
 float4   ... ok (231 ms)

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Using POPCNT and other advanced bit manipulation instructions

2019-02-15 Thread Alvaro Herrera
On 2019-Feb-15, Kyotaro HORIGUCHI wrote:

> I understand that the behavior of __builtin_c[tl]z(0) is
> undefined from the reason, they convert to bs[rf]. So if we use
> these builtins, additional check is required.
> 
> https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html

Okay -- the functions check for a 0 argument:

+static inline int
+pg_rightmost_one32(uint32 word)
+{
+   int result = 0;
+
+   Assert(word != 0);
+
+#ifdef HAVE__BUILTIN_CTZ
+   result = __builtin_ctz(word);
+#else
+   while ((word & 255) == 0)
+   {
+   word >>= 8;
+   result += 8;
+   }
+   result += rightmost_one_pos[word & 255];
+#endif /* HAVE__BUILTIN_CTZ */
+
+   return result;
+}

so we're fine.

> And even worse lzcntx is accidentially "fallback"s to bsrx on
> unsupported CPUs, which leads to bogus results.
> __builtin_clzll(8) = 3, which should be 60.

I'm not sure I understand this point.  Are you saying that if we use
-mlzcnt to compile, then the compiler builtin is broken in platforms
that don't support the lzcnt instruction?  That's horrible.  Let's stay
away from -mlzcnt then.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Regarding participating in GSOC 2019 with PostgreSQL

2019-02-15 Thread Stephen Frost
Greetings,

* Abhishek Agrawal (aagrawal...@gmail.com) wrote:
> I am interested in doing GSOC with PostgreSQL if you guys are applying for it 
> this year. If anyone could direct me to proper links or some channel to 
> prepare for the same then it will be really helpful. I have some questions 
> like: What will be the projects this year, What are the requirement for the 
> respective projects, who will be the mentors, etc.

We have applied but there is no guarantee that we'll be accepted.
Google has said they're announcing on Feb 26 the projects which have
been accepted.  Also, it's up to Google to decide how many slots we are
given to have GSoC students, and we also won't know that until they tell
us.

I'd suggest you wait until the announcement is made by Google, at which
time they'll also tell us how many slots we have, and will provide
direction to prospective students regarding the next steps.

Thanks!

Stephen




Re: [PATCH][PROPOSAL] Add enum releation option type

2019-02-15 Thread Darafei Praliaskouski
The following review has been posted through the commitfest application:
make installcheck-world:  not tested
Implements feature:   not tested
Spec compliant:   not tested
Documentation:not tested

I believe this patch simplifies the infrastructure. As a non-native speaker,
I'm okay with the presented version of the comma.

Re: [HACKERS] Time to change pg_regress diffs to unified by default?

2019-02-15 Thread Peter Eisentraut
On 2019-01-03 11:20, Daniel Gustafsson wrote:
>> On 3 Jan 2019, at 10:39, Christoph Berg  wrote:
>>
>> Re: Peter Eisentraut 2019-01-02 
>> <70440c81-37bb-76dd-e48b-b5a9550d5...@2ndquadrant.com>
> 
>>> While we're considering the pg_regress output, what do you think about
>>> replacing the ==... separator with a standard diff separator like
>>> "diff %s %s %s\n".  This would make the file behave more like a proper
>>> diff file, for use with other tools.  And it shows the diff options
>>> used, for clarity.  See attached patch.
>>
>> It will especially say which _alternate.out file was used, which seems
>> like a big win. So +1.
> 
> That has bitten more times than I’d like to admit, so definately +1 on being
> explicit about that.

committed

(This might be one of those rare times where one hopes for a buildfarm
failure for verification. :-) )

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: libpq compression

2019-02-15 Thread Konstantin Knizhnik



On 15.02.2019 15:42, Peter Eisentraut wrote:

On 2018-06-19 09:54, Konstantin Knizhnik wrote:

The main drawback of streaming compression is that you can not
decompress some particular message without decompression of all previous
messages.

It seems this would have an adverse effect on protocol-aware connection
proxies: They would have to uncompress everything coming in and
recompress everything going out.

The alternative of compressing each packet individually would work much
better: A connection proxy could peek into the packet header and only
uncompress the (few, small) packets that it needs for state and routing.

Compressing each message individually would defeat the whole idea of libpq
compression.

Messages are too small to compress efficiently on their own, so using a
streaming compression algorithm is absolutely necessary here.

Concerning the possible problem with proxies, I do not think it is
really a problem.
A proxy is very rarely located somewhere in the "middle" between the client
and the database server.
It is usually launched either in the same network as the DBMS client (for
example, if the client is an application server) or in the same network
as the database server.
In both cases there is not much sense in passing compressed traffic
through the proxy.
If the proxy and the DBMS server are located in the same network, then the
proxy should perform decompression and send decompressed messages to the
database server.

Thank you very much for noticing this compatibility problem between
compression and protocol-aware connection proxies.
I have written that the current compression implementation (zpq_stream.c) can
be used not only for the libpq backend/frontend, but also for compressing any
other streaming data. But I could not imagine what other data sources might
require compression.
And a proxy is exactly such a case: it also needs to compress/decompress
messages.
It is one more argument for making the interface of zpq_stream as simple as
possible and encapsulating all inflating/deflating logic in this code.
That can be achieved by passing arbitrary rx/tx functions to the zpq_create
function.
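
As a rough sketch of what such a callback-driven interface could look like
(hypothetical declarations for illustration; they do not necessarily match
the actual zpq_stream.c API):

/*
 * Sketch of a stream-compression API parameterized by caller-supplied
 * transport callbacks (names made up for this example).
 */
#include <stddef.h>
#include <sys/types.h>

typedef ssize_t (*zpq_rx_func) (void *arg, void *buf, size_t size);
typedef ssize_t (*zpq_tx_func) (void *arg, void const *buf, size_t size);

typedef struct ZpqStream ZpqStream;	/* opaque: holds deflate/inflate state */

/*
 * Create a compressed stream on top of arbitrary transport functions.
 * "arg" is handed back to the callbacks, so the same code can sit on a
 * backend socket, a libpq connection, or a proxy's upstream/downstream.
 */
extern ZpqStream *zpq_create(zpq_tx_func tx_func, zpq_rx_func rx_func, void *arg);

/* Read/write uncompressed data; the callbacks only ever see compressed bytes. */
extern ssize_t zpq_read(ZpqStream *zs, void *buf, size_t size);
extern ssize_t zpq_write(ZpqStream *zs, void const *buf, size_t size);

extern void zpq_free(ZpqStream *zs);

With the transport abstracted like that, a protocol-aware proxy could wrap
each of its two sockets in such a stream and reuse the same deflate/inflate
code as the backend and libpq.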


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: log bind parameter values on error

2019-02-15 Thread Alexey Bashtanov

Hello Anders and Peter,

Thanks for your messages.
Please see the new patch version attached.

> Have you analyzed how invasive it'd be to delay that till we actually
> can safely start a [sub]transaction to do that logging? Probably too
> expensive, but it seems like something that ought to be analyzed.

The fundamental problem here is that the output functions for the types
of the values to be logged may be present only in the transaction
that has just been aborted.

Also I don't like the idea of doing complex operations in the error handler,
as it increases the chances of cascading errors.

I thought of pre-calculating types' output functions instead of their results,
but that would work for certain built-in types only.
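
For illustration, a minimal sketch of that eager-caching approach (simplified;
the structure is made up, only getTypeOutputInfo() and OidOutputFunctionCall()
are the real helpers, and error handling is omitted):

#include "postgres.h"
#include "fmgr.h"
#include "nodes/params.h"
#include "utils/lsyscache.h"

/*
 * Sketch only: capture the textual form of each bind parameter at bind time,
 * while the transaction is still healthy, so an error handler can later log
 * the strings without calling any output function.  The returned array is an
 * illustrative stand-in for whatever the patch actually stores.
 */
static char **
save_param_texts(ParamListInfo params, MemoryContext portal_context)
{
	MemoryContext oldcxt = MemoryContextSwitchTo(portal_context);
	char	  **texts = palloc0(params->numParams * sizeof(char *));
	int			i;

	for (i = 0; i < params->numParams; i++)
	{
		ParamExternData *prm = &params->params[i];
		Oid			typoutput;
		bool		typisvarlena;

		if (prm->isnull || !OidIsValid(prm->ptype))
			continue;			/* will be logged as NULL */

		getTypeOutputInfo(prm->ptype, &typoutput, &typisvarlena);
		texts[i] = OidOutputFunctionCall(typoutput, prm->value);
	}

	MemoryContextSwitchTo(oldcxt);
	return texts;
}

The cost is that the conversion happens for every bind, whether or not an
error ever occurs, which is exactly the overhead the proposed GUC
documentation warns about.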


+ 
+  log_parameters_on_error (boolean)
+  
+   log_parameters_on_error configuration 
parameter
+  
+  
+  
+   
+Controls whether the statement is logged with bind parameter values.
+It adds some overhead, as postgres will cache textual
+representations of parameter values in memory for all statements,
+even if they eventually do not get logged. The default is
+off. Only superusers can change this setting.
+   
+  
+ 

This needs a bit of language polishing.


I would appreciate it if you had any suggestions, as my English isn't great.


@@ -31,6 +31,8 @@
   * set of parameter values.  If dynamic parameter hooks are present, we
   * intentionally do not copy them into the result.  Rather, we forcibly
   * instantiate all available parameter values and copy the datum values.
+ *
+ * We don't copy textual representations here.
   */

That probably needs to be expanded on. Including a comment about what
guarantees that there are no memory lifetime issues.


What kind of memory lifetime issues do you mean?
We're not copying textual representations, so the worst that can happen
is that they don't get logged when appropriate. Luckily, we never use this
function when copying to a portal we use for logging; I added this to
the comment. Do you think it's any better?



So the parameters we log here haven't actually gone through the output
function? Isn't that an issue? I mean that'll cause the contents to
differ from the non-error case, no? And differs from the binary protocol
case?

I don't think it's much of a problem. Text input and output functions are
meant to match, but the CREATE TYPE spec isn't too specific about what 
it means.
Of course it does not mean that typoutput(typinput(foo)) is always 
exactly foo.

However, I really hope that at least typinput(typoutput(foo)) = foo,
where "=" is the correspondent operator registered in postgres.

If a cheeky client passes '007' as a numeric value, I don't mind it being
sometimes logged as '007' and sometimes as '7', depending on the settings.
Either way it denotes the same number, and we'll know what to pass to
reproduce the problem.
For the binary protocol it'll be '7' as well, as it'll undergo the typrecv
and then typoutput procedures.



else
{
+   /*
+* We do it for non-xact commands only, as an xact command
+* 1) shouldn't have any parameters to log;
+* 2) may have the portal dropped early.
+*/
+   Assert(current_top_portal == NULL);
+   current_top_portal = portal;
+   portalParams = NULL;
+

"it"? ISTM the comment doesn't really stand on its own?

Thanks, I fixed the comment and some code around it.


+/*
+ * get_portal_bind_parameters
+ * Get the string containing parameters data, is used for logging.
+ *
+ * Can return NULL if there are no parameters in the portal
+ * or the portal is not valid, or the text representations of the parameters 
are
+ * not available. If returning a non-NULL value, it allocates memory
+ * for the returned string in the current context, and it's the caller's
+ * responsibility to pfree() it if needed.
+ */
+char *
+get_portal_bind_parameters(ParamListInfo params)
+{
+   StringInfoData param_str;
+
+   /* No parameters to format */
+   if (!params || params->numParams == 0)
+   return NULL;
+
+   elog(WARNING, "params->hasTextValues=%d, 
IsAbortedTransactionBlockState()=%d",
+  params->hasTextValues && 
IsAbortedTransactionBlockState());

Err, huh?

This was some debugging, I threw it away now.

Best,
 Alex

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 8bd57f3..59638f8 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -6312,6 +6312,23 @@ log_line_prefix = '%m [%p] %q%u@%d/%a '
   
  
 
+ 
+  log_parameters_on_error (boolean)
+  
+   log_parameters_on_error configuration parameter
+  
+  
+  
+   
+Controls whether the statement is logged with bind parameter values. 
+It adds some overhead, as postgres will cache 

Re: WAL insert delay settings

2019-02-15 Thread Stephen Frost
Greetings,

* Andres Freund (and...@anarazel.de) wrote:
> On 2019-02-14 11:02:24 -0500, Stephen Frost wrote:
> > On Thu, Feb 14, 2019 at 10:15 Peter Eisentraut <
> > peter.eisentr...@2ndquadrant.com> wrote:
> > > On 14/02/2019 11:03, Tomas Vondra wrote:
> > > > But if you add extra sleep() calls somewhere (say because there's also
> > > > limit on WAL throughput), it will affect how fast VACUUM works in
> > > > general. Yet it'll continue with the cost-based throttling, but it will
> > > > never reach the limits. Say you do another 20ms sleep somewhere.
> > > > Suddenly it means it only does 25 rounds/second, and the actual write
> > > > limit drops to 4 MB/s.
> > >
> > > I think at a first approximation, you probably don't want to add WAL
> > > delays to vacuum jobs, since they are already slowed down, so the rate
> > > of WAL they produce might not be your first problem.  The problem is
> > > more things like CREATE INDEX CONCURRENTLY that run at full speed.
> > >
> > > That leads to an alternative idea of expanding the existing cost-based
> > > vacuum delay system to other commands.
> > >
> > > We could even enhance the cost system by taking WAL into account as an
> > > additional factor.
> > 
> > This is really what I was thinking- let’s not have multiple independent
> > ways of slowing down maintenance and similar jobs to reduce their impact on
> > I/o to the heap and to WAL.
> 
> I think that's a bad idea. Both because the current vacuum code is
> *terrible* if you desire higher rates because both CPU and IO time
> aren't taken into account. And it's extremely hard to control.  And it
> seems entirely valuable to be able to limit the amount of WAL generated
> for replication, but still try go get the rest of the work done as
> quickly as reasonably possible wrt local IO.

I'm all for making improvements to the vacuum code and making it easier
to control.

I don't buy off on the argument that there is some way to segregate the
local I/O question from the WAL when we're talking about these kinds of
operations (VACUUM, CREATE INDEX, CLUSTER, etc) on logged relations, nor
do I think we do our users a service by giving them independent knobs
for both that will undoubtedly end up making it more difficult to
understand and control what's going on overall.

Even here, it seems, you're arguing that the existing approach for
VACUUM is hard to control; wouldn't adding another set of knobs for
controlling the amount of WAL generated by VACUUM make that worse?  I
have a hard time seeing how it wouldn't.

Thanks!

Stephen




Re: [Bug Fix] ECPG: could not use some CREATE TABLE AS syntax

2019-02-15 Thread Michael Meskes
Higuchi-san,

> I found some "CREATE TABLE ... AS ... " syntaxes could not be used in
> ECPG. 
> ...
> [Investigation]
> In my investigation, parse.pl ignore type CreateAsStmt of gram.y and
> CreateAsStmt is defined in ecpg.trailer. ECPG use ecpg.trailer's
> CreateAsStmt. However, ecpg.trailer lacks some syntaxes. 

Correct, the syntax of create as statement was changed and those
changes have not been added to ecpg.

> I feel ignoring type CreateAsStmt of gram.y is strange. Seeing
> ecpg.trailer, it seems that ECPG wanted to output the message "CREATE
> TABLE AS cannot specify INTO" but is this needed now? In view of the
> maintenance, ECPG should use not ecpg.trailer's definition but
> gram.y's one. 

I beg to disagree, or I don't understand. Why would ecpg's changes to
the statement be wrong nowadays?

> I attached the patch for this and I will register this for next CF. 

I think the patch does not work correctly. The statement
CREATE TABLE a AS SELECT * INTO test FROM a;
is accepted with your patch, but it is not accepted in current ecpg nor
is it accepted by the backend when you execute it through ecpg. The
whole change of this rule has been made to make sure this syntax is not
accepted.

Michael

-- 
Michael Meskes
Michael at Fam-Meskes dot De, Michael at Meskes dot (De|Com|Net|Org)
Meskes at (Debian|Postgresql) dot Org
Jabber: michael at xmpp dot meskes dot org
VfL Borussia! Força Barça! SF 49ers! Use Debian GNU/Linux, PostgreSQL




Re: Reporting script runtimes in pg_regress

2019-02-15 Thread Peter Eisentraut
On 2019-02-11 15:30, Tom Lane wrote:
> Peter Eisentraut  writes:
>> Now that I see this in action, it makes the actual test results harder
>> to identify flying by.  I understand the desire to collect this timing
>> data, but that is a special use case and not relevant to the normal use
>> of the test suite, which is to see whether the test passes.  Can we make
>> this optional please?
> 
> Well, I want the buildfarm to produce this info, so it's hard to see
> how to get that without the timings being included by default.  I take
> your point that it makes the display look a bit cluttered, though.

Can we enable it through the buildfarm client?

Or we could write it into a separate log file.

> Would it help to put more whitespace between the status and the timing?

prove --timer produces this:

[14:21:32] t/001_basic.pl  ok 9154 ms ( 0.00 usr  0.01 sys +  
2.28 cusr  3.40 csys =  5.69 CPU)
[14:21:41] t/002_databases.pl  ok11294 ms ( 0.00 usr  0.00 sys +  
2.16 cusr  3.84 csys =  6.00 CPU)
[14:21:52] t/003_extrafiles.pl ... ok 7736 ms ( 0.00 usr  0.00 sys +  
1.89 cusr  2.91 csys =  4.80 CPU)
[14:22:00] t/004_pg_xlog_symlink.pl .. ok 9035 ms ( 0.00 usr  0.00 sys +  
2.03 cusr  3.02 csys =  5.05 CPU)
[14:22:09] t/005_same_timeline.pl  ok 8048 ms ( 0.00 usr  0.00 sys +  
0.92 cusr  1.29 csys =  2.21 CPU)

which seems quite readable.  So maybe something like this:

 identity ... ok  238 ms
 partition_join   ... ok  429 ms
 partition_prune  ... ok  786 ms
 reloptions   ... ok   94 ms
 hash_part... ok   78 ms
 indexing ... ok 1298 ms
 partition_aggregate  ... ok  727 ms
 partition_info   ... ok  110 ms
test event_trigger... ok  128 ms
test fast_default ... ok  173 ms
test stats... ok  637 ms

which would be

-   status(_(" (%.0f ms)"), INSTR_TIME_GET_MILLISEC(stoptimes[i]));
+   status(_(" %8.0f ms"), INSTR_TIME_GET_MILLISEC(stoptimes[i]));

(times two).
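
As a quick standalone check of what that field width does (plain printf,
nothing pg_regress-specific), something like this reproduces the proposed
layout:

#include <stdio.h>

int
main(void)
{
	const char *names[] = {"int4", "int8", "oid", "float4"};
	const char *status[] = {"ok", "FAILED", "ok", "ok"};
	double		ms[] = {128, 153, 163, 231};
	int			i;

	for (i = 0; i < 4; i++)
		printf("test %-20s ... %s %8.0f ms\n", names[i], status[i], ms[i]);
	return 0;
}

The %8.0f right-aligns the millisecond figures within eight columns; as noted
upthread, "ok" and "FAILED" themselves still won't line up.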

If we're going to keep this, should we enable the prove --timer option as well?

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: restrict pg_stat_ssl to superuser?

2019-02-15 Thread Peter Eisentraut
On 2019-02-12 07:40, Michael Paquier wrote:
> On Thu, Feb 07, 2019 at 09:30:38AM +0100, Peter Eisentraut wrote:
>> If so, is there anything in that view that should be made available to
>> non-superusers?  If not, then we could perhaps do this via a simple
>> permission change instead of going the route of blanking out individual
>> columns.
> 
> Hm.  It looks sensible to move to a per-permission approach for that
> view.  Now, pg_stat_get_activity() is not really actually restricted,
> and would still return the information on direct calls, so the idea
> would be to split the SSL-related data into its own function?

We could remove default privileges from pg_stat_get_activity().  Would
that be a problem?

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Problems with plan estimates in postgres_fdw

2019-02-15 Thread Antonin Houska
Etsuro Fujita  wrote:

> (2019/02/12 18:03), Antonin Houska wrote:
> > Etsuro Fujita  wrote:
> >> (2019/02/08 2:04), Antonin Houska wrote:
> >>>
> >>> And maybe related problem: why should FDW care about the effect of
> >>> apply_scanjoin_target_to_paths() like you claim to do here?
> >>>
> >>>   /*
> >>>* If this is UPPERREL_ORDERED and/or UPPERREL_FINAL steps on the
> >>>* final scan/join relation, the costs obtained from the cache
> >>>* wouldn't yet contain the eval cost for the final scan/join target,
> >>>* which would have been updated by apply_scanjoin_target_to_paths();
> >>>* add the eval cost now.
> >>>*/
> >>>   if (fpextra&&   !IS_UPPER_REL(foreignrel))
> >>>   {
> >>>   /* The costs should have been obtained from the cache. */
> >>>   Assert(fpinfo->rel_startup_cost>= 0&&
> >>>   fpinfo->rel_total_cost>= 0);
> >>>
> >>>   startup_cost += foreignrel->reltarget->cost.startup;
> >>>   run_cost += foreignrel->reltarget->cost.per_tuple * rows;
> >>>   }
> >>>
> >>> I think it should not, whether "foreignrel" means "input_rel" or 
> >>> "output_rel"
> >>> from the perspective of postgresGetForeignUpperPaths().
> >>
> >> I added this before costing the sort operation below, because 1) the cost 
> >> of
> >> evaluating the scan/join target might not be zero (consider the case where
> >> sort columns are not simple Vars, for example) and 2) the cost of sorting
> >> takes into account the underlying path's startup/total costs.  Maybe I'm
> >> missing something, though.
> >
> > My understanding is that the "input_rel" argument of
> > postgresGetForeignUpperPaths() should already have been processed by
> > postgresGetForeignPaths() or postgresGetForeignJoinPaths() (or actually by
> > postgresGetForeignUpperPaths() called with different "stage" too). Therefore
> > it should already have the "fdw_private" field initialized. The estimates in
> > the corresponding PgFdwRelationInfo should already include the reltarget
> > costs, as set previously by estimate_path_cost_size().
> 
> Right, but what I'm saying here is: the costs of evaluating the final
> scan/join target updated by apply_scanjoin_target_to_paths() wouldn't be yet
> included in the costs we calculated using local statistics when called from
> postgresGetForeignPaths() or postgresGetForeignJoinPaths(). Let me explain
> using an example:
> 
> SELECT a+b FROM foreign_table LIMIT 10;
> 
> For this query, the reltarget for the foreign_table would be {a, b} when
> called from postgresGetForeignPaths() (see build_base_rel_tlists()). The costs
> of evaluating simple Vars are zeroes, so we wouldn't charge any costs for
> tlist evaluation when estimating the costs of a basic foreign table scan in
> that callback routine, but the final scan target, with which the reltarget
> will be replaced later by apply_scanjoin_target_to_paths(), would be {a+b}, so
> we need to adjust the costs of the basic foreign table scan, cached in the
> PgFdwRelationInfo for the input_rel (ie, the foreign_table), so that eval
> costs for the replaced tlist are included when called from
> postgresGetForeignUpperPaths() with the UPPERREL_FINAL stage, as the costs of
> evaluating the expression 'a+b' wouldn't be zeroes.

ok, I understand now. I assume that the patch

https://www.postgresql.org/message-id/5C66A056.60007%40lab.ntt.co.jp

obsoletes the code snippet we discussed above.

-- 
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26, A-2700 Wiener Neustadt
Web: https://www.cybertec-postgresql.com



Re: libpq compression

2019-02-15 Thread Peter Eisentraut
On 2018-06-19 09:54, Konstantin Knizhnik wrote:
> The main drawback of streaming compression is that you can not
> decompress some particular message without decompression of all previous
> messages.

It seems this would have an adverse effect on protocol-aware connection
proxies: They would have to uncompress everything coming in and
recompress everything going out.

The alternative of compressing each packet individually would work much
better: A connection proxy could peek into the packet header and only
uncompress the (few, small) packets that it needs for state and routing.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



RE: Protect syscache from bloating with negative cache entries

2019-02-15 Thread Ideriha, Takeshi
>From: Ideriha, Takeshi [mailto:ideriha.take...@jp.fujitsu.com]

>>About the new global-size based eviction (2), cache entry creation
>>becomes slow after the total size has reached the limit, since every
>>new entry evicts one or more old (= not-recently-used) entries.
>>Because of not needing knobs for each cache, it becomes far more
>>realistic.  So I added documentation of
>"catalog_cache_max_size" in 0005.
>
>Now I'm also trying to benchmark, which will be posted in another email.

According to recent comments by Andres and Bruce, maybe we should address
negative cache bloat step by step, for example by reviewing Tom's patch.

But at the same time, I did some benchmarking with only the hard-limit option
enabled and the time-related option disabled, because figures for this case
have not been provided in this thread yet.
So let me share them.

I did two experiments. The first shows that negative cache bloat is suppressed.
This thread originated from the issue that the negative cache for pg_statistics
bloats as temp tables are repeatedly created and dropped.
https://www.postgresql.org/message-id/20161219.201505.11562604.horiguchi.kyotaro%40lab.ntt.co.jp

Using the script attached to the first email in this thread, I repeated
creating and dropping a temp table 10000 times.
(Each run was repeated 5 times, with catalog_cache_max_size = 500kB, comparing
the master branch against the patch with the hard memory limit.)

Here are TPS and the CacheMemoryContext 'used' memory (total - freespace),
calculated with MemoryContextPrintStats(), after 100, 1000, and 10000
create-and-drop transactions. The result shows that cache bloating is
suppressed once the limit is exceeded (at 10000), but TPS declines regardless
of the limit.

number of tx (create and drop)    | 100    | 1000    | 10000
-------------------------------------------------------------
used CacheMemoryContext (master)  | 610296 | 2029256 | 15909024
used CacheMemoryContext (patch)   | 755176 | 880552  | 880592
-------------------------------------------------------------
TPS (master)                      | 414    | 407     | 399
TPS (patch)                       | 242    | 225     | 220


The other experiment uses Tomas's script posted a while ago.
The scenario is to run 'select 1' against one of many tables chosen at random
(uniform distribution).
(Each run was repeated 5 times, with catalog_cache_max_size = 10MB, comparing
the master branch against the patch with only the hard memory limit enabled.)

Before doing the benchmark, I used a debug option to check that pruning
happens only in the 10000-table case. The result shows degradation both
before and after pruning kicks in.
I personally still need a hard size limitation, but I'm surprised that the
difference is so significant.

number of tables | 100   | 1000  | 10000
-----------------------------------------
TPS (master)     | 10966 | 10654 | 9099
TPS (patch)      | 4491  | 2099  | 378
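
(As an aside for readers who have not followed the patch: the hard-limit
behaviour exercised above, evicting not-recently-used entries whenever an
insertion would push the cache past catalog_cache_max_size, can be pictured
with a generic standalone toy like the one below. It is only an illustration,
not the patch's code.)

#include <stdio.h>
#include <stdlib.h>

typedef struct Entry
{
	struct Entry *prev;			/* towards the hot (recently inserted) end */
	struct Entry *next;			/* towards the cold end */
	size_t		size;
	char		key[32];
} Entry;

static Entry *hot;				/* most recently inserted */
static Entry *cold;				/* least recently inserted */
static size_t total;
static size_t budget = 4096;	/* stand-in for catalog_cache_max_size */

static void
evict(Entry *e)
{
	if (e->prev)
		e->prev->next = e->next;
	else
		hot = e->next;
	if (e->next)
		e->next->prev = e->prev;
	else
		cold = e->prev;
	total -= e->size;
	free(e);
}

static void
insert_entry(const char *key, size_t size)
{
	Entry	   *e;

	/* Drop entries from the cold end until the new one fits the budget. */
	while (cold != NULL && total + size > budget)
		evict(cold);

	e = calloc(1, sizeof(Entry));
	snprintf(e->key, sizeof(e->key), "%s", key);
	e->size = size;
	e->next = hot;
	if (hot)
		hot->prev = e;
	else
		cold = e;
	hot = e;
	total += size;
}

int
main(void)
{
	char		name[32];
	int			i;

	for (i = 0; i < 100; i++)
	{
		snprintf(name, sizeof(name), "entry%d", i);
		insert_entry(name, 128);
	}
	printf("kept %zu bytes of a possible %d (budget %zu)\n",
		   total, 100 * 128, budget);
	return 0;
}

The TPS numbers above show the price of that bookkeeping and eviction work.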

Regards,
Takeshi Ideriha




Re: Problems with plan estimates in postgres_fdw

2019-02-15 Thread Etsuro Fujita

(2019/02/14 18:44), Etsuro Fujita wrote:

(2019/02/12 20:43), Antonin Houska wrote:

Etsuro Fujita wrote:

Here are my review comments:

root->upper_targets[UPPERREL_FINAL] = final_target;
+ root->upper_targets[UPPERREL_ORDERED] = final_target;
root->upper_targets[UPPERREL_WINDOW] = sort_input_target;
root->upper_targets[UPPERREL_GROUP_AGG] = grouping_target;

I think that's a good idea. I think we'll soon need the PathTarget for
UPPERREL_DISTINCT, so how about saving that as well like this?

root->upper_targets[UPPERREL_FINAL] = final_target;
+ root->upper_targets[UPPERREL_ORDERED] = final_target;
+ root->upper_targets[UPPERREL_DISTINCT] = sort_input_target;
root->upper_targets[UPPERREL_WINDOW] = sort_input_target;
root->upper_targets[UPPERREL_GROUP_AGG] = grouping_target;


I see nothing wrong about this.


Cool!


Another is about this:

/*
+ * Set reltarget so that it's consistent with the paths. Also it's more
+ * convenient for FDW to find the target here.
+ */
+ ordered_rel->reltarget = target;

I think this is correct if there are no SRFs in the targetlist, but
otherwise
I'm not sure this is really correct because the relation's reltarget
would not
match the target of paths added to the relation after
adjust_paths_for_srfs().


I wouldn't expect problem here. create_grouping_paths() also sets the
reltarget although adjust_paths_for_srfs() can be alled on grouped_rel
afterwards.


Yeah, but as I said in a previous email, I think the root->upper_targets
change would be enough at least currently for FDWs to consider
performing post-UPPERREL_GROUP_AGG steps remotely, as shown in my
patches, so I'm inclined to leave the retarget change for future work.


So, here is an updated patch.  If there are no objections from you or 
anyone else, I'll commit the patch as a preliminary one for what's 
proposed in this thread.


Best regards,
Etsuro Fujita
From 5ade52b576cb8403d3a079fecf4c938225cc3bd7 Mon Sep 17 00:00:00 2001
From: Etsuro Fujita 
Date: Fri, 15 Feb 2019 21:13:36 +0900
Subject: [PATCH] Save PathTargets for distinct/ordered relations in
 root->upper_targets[]

Previously, we only saved the PathTargets for grouping, window and final
relations in root->upper_targets[] in grouping_planner.  To improve the
convenience of extensions, save the PathTargets for distinct and ordered
relations as well.

Author: Antonin Houska, with an additional change by me
Discussion: https://postgr.es/m/10994.1549559088@localhost
---
 src/backend/optimizer/plan/planner.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index ddb86bd0c3..a4ec4456e5 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -2035,6 +2035,8 @@ grouping_planner(PlannerInfo *root, bool inheritance_update,
 		 * of the corresponding upperrels might not be needed for this query.
 		 */
 		root->upper_targets[UPPERREL_FINAL] = final_target;
+		root->upper_targets[UPPERREL_ORDERED] = final_target;
+		root->upper_targets[UPPERREL_DISTINCT] = sort_input_target;
 		root->upper_targets[UPPERREL_WINDOW] = sort_input_target;
 		root->upper_targets[UPPERREL_GROUP_AGG] = grouping_target;
 
-- 
2.19.2



Re: [PROPOSAL]a new data type 'bytea' for ECPG

2019-02-15 Thread Michael Meskes
Matsumura-san,

> I agree that the special route is ugly, but I cannot remove them
> completely.
> I try to implement Idea-2. In same time, I try to move if(bytea)
> blocks to
> new function for readability.
> 
> e.g. move the following to new function set_data_attr().

I don't think this will help much, don't bother.

Michael
-- 
Michael Meskes
Michael at Fam-Meskes dot De, Michael at Meskes dot (De|Com|Net|Org)
Meskes at (Debian|Postgresql) dot Org
Jabber: michael at xmpp dot meskes dot org
VfL Borussia! Força Barça! SF 49ers! Use Debian GNU/Linux, PostgreSQL




Re: Incorrect visibility test function assigned to snapshot

2019-02-15 Thread Antonin Houska
Michael Paquier  wrote:

> On Fri, Feb 08, 2019 at 11:59:05AM +0100, Antonin Houska wrote:
> > Sorry, I forgot. Patch is below and I'm going to add an entry to the
> > next CF.
> 
> > @@ -615,6 +615,8 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
> >  
> > TransactionIdAdvance(xid);
> > }
> > +   /* And of course, adjust snapshot type accordingly. */
> > +   snap->snapshot_type = SNAPSHOT_MVCC;
> 
> Wouldn't it be cleaner to have an additional argument to
> SnapBuildBuildSnapshot() for the snapshot type?  It looks confusing to
> me to overwrite the snapshot type after initializing it once.

I'm not sure. SnapBuildBuildSnapshot() only creates snapshots for logical
replication purposes, and the "snapshot_type" argument would make the function
look like it can handle all snapshot types. If adding an argument, I'd prefer
a boolean variable (e.g. "historic") telling whether user wants
SNAPSHOT_HISTORIC_MVCC or SNAPSHOT_MVCC. But that may deserve a separate
patch.

As for the bug fix, I think the additional assignment does not make things
worse because SnapBuildInitialSnapshot() already does overwrite some fields:
"xip" and "xnt".

> > @@ -1502,6 +1502,13 @@ ImportSnapshot(const char *idstr)
> >  */
> > memset(, 0, sizeof(snapshot));
> >  
> > +   /*
> > +* Do not rely on the fact that SNAPSHOT_MVCC is zero. (The core code
> > +* currently does not use this field of imported snapshot, but let's 
> > keep
> > +* things consistent.)
> > +*/
> > +   snapshot.snapshot_type = SNAPSHOT_MVCC;
> 
> Okay for this one, however the comment does not add much value.

o.k. I don't insist on using this comment.

-- 
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26, A-2700 Wiener Neustadt
Web: https://www.cybertec-postgresql.com



postgres_fdw: another oddity in costing aggregate pushdown paths

2019-02-15 Thread Etsuro Fujita
As mentioned in the nearby thread, I think there is another oversight in
the cost estimation for aggregate pushdown paths in postgres_fdw, IIUC.
When costing an aggregate pushdown path using local statistics, we
re-use the estimated costs of implementing the underlying scan/join
relation, cached in the relation's PgFdwRelationInfo (ie,
rel_startup_cost and rel_total_cost).  These costs wouldn't yet
contain the cost of evaluating the final scan/join target, since tlist
replacement by apply_scanjoin_target_to_paths() is performed afterwards.
So I think we need to adjust these costs so that the tlist eval costs
are included, but ISTM that estimate_path_cost_size() forgot to do so.
Attached is a patch fixing this issue.

Best regards,
Etsuro Fujita
*** a/contrib/postgres_fdw/postgres_fdw.c
--- b/contrib/postgres_fdw/postgres_fdw.c
***
*** 2705,2716  estimate_path_cost_size(PlannerInfo *root,
--- 2705,2719 
  		}
  		else if (IS_UPPER_REL(foreignrel))
  		{
+ 			RelOptInfo *outerrel = fpinfo->outerrel;
  			PgFdwRelationInfo *ofpinfo;
  			AggClauseCosts aggcosts;
  			double		input_rows;
  			int			numGroupCols;
  			double		numGroups = 1;
  
+ 			Assert(outerrel);
+ 
  			/*
  			 * This cost model is mixture of costing done for sorted and
  			 * hashed aggregates in cost_agg().  We are not sure which
***
*** 2719,2725  estimate_path_cost_size(PlannerInfo *root,
  			 * and all finalization and run cost are added in total_cost.
  			 */
  
! 			ofpinfo = (PgFdwRelationInfo *) fpinfo->outerrel->fdw_private;
  
  			/* Get rows and width from input rel */
  			input_rows = ofpinfo->rows;
--- 2722,2728 
  			 * and all finalization and run cost are added in total_cost.
  			 */
  
! 			ofpinfo = (PgFdwRelationInfo *) outerrel->fdw_private;
  
  			/* Get rows and width from input rel */
  			input_rows = ofpinfo->rows;
***
*** 2772,2782  estimate_path_cost_size(PlannerInfo *root,
  
  			/*-
  			 * Startup cost includes:
! 			 *	  1. Startup cost for underneath input relation
  			 *	  2. Cost of performing aggregation, per cost_agg()
  			 *-
  			 */
  			startup_cost = ofpinfo->rel_startup_cost;
  			startup_cost += aggcosts.transCost.startup;
  			startup_cost += aggcosts.transCost.per_tuple * input_rows;
  			startup_cost += aggcosts.finalCost.startup;
--- 2775,2787 
  
  			/*-
  			 * Startup cost includes:
! 			 *	  1. Startup cost for underneath input relation, adjusted for
! 			 *	 tlist replacement by apply_scanjoin_target_to_paths()
  			 *	  2. Cost of performing aggregation, per cost_agg()
  			 *-
  			 */
  			startup_cost = ofpinfo->rel_startup_cost;
+ 			startup_cost += outerrel->reltarget->cost.startup;
  			startup_cost += aggcosts.transCost.startup;
  			startup_cost += aggcosts.transCost.per_tuple * input_rows;
  			startup_cost += aggcosts.finalCost.startup;
***
*** 2784,2794  estimate_path_cost_size(PlannerInfo *root,
  
  			/*-
  			 * Run time cost includes:
! 			 *	  1. Run time cost of underneath input relation
  			 *	  2. Run time cost of performing aggregation, per cost_agg()
  			 *-
  			 */
  			run_cost = ofpinfo->rel_total_cost - ofpinfo->rel_startup_cost;
  			run_cost += aggcosts.finalCost.per_tuple * numGroups;
  			run_cost += cpu_tuple_cost * numGroups;
  
--- 2789,2801 
  
  			/*-
  			 * Run time cost includes:
! 			 *	  1. Run time cost of underneath input relation, adjusted for
! 			 *	 tlist replacement by apply_scanjoin_target_to_paths()
  			 *	  2. Run time cost of performing aggregation, per cost_agg()
  			 *-
  			 */
  			run_cost = ofpinfo->rel_total_cost - ofpinfo->rel_startup_cost;
+ 			run_cost += outerrel->reltarget->cost.per_tuple * input_rows;
  			run_cost += aggcosts.finalCost.per_tuple * numGroups;
  			run_cost += cpu_tuple_cost * numGroups;
  


Re: Problems with plan estimates in postgres_fdw

2019-02-15 Thread Etsuro Fujita

(2019/02/12 18:03), Antonin Houska wrote:

Etsuro Fujita  wrote:

(2019/02/08 2:04), Antonin Houska wrote:

Some comments on coding:

0001-postgres_fdw-Perform-UPPERREL_ORDERED-step-remotely-v3.patch
-

* I'm not sure if UPPERREL_ORDERD should to deal with LIMIT at all. Note that
grouping_planner() does not call create_limit_path() until it's done with
create_ordered_paths(), and create_ordered_paths() is where the FDW is
requested to add its paths to UPPERREL_ORDERED relation.


Maybe I'm missing something, but the 0001 patch doesn't deal with LIMIT at
all.  I added the parameter limit_tuples to PgFdwPathExtraData so that we can
pass limit_tuples=-1, which means no LIMIT, to cost_sort(), though.


Yes, the limit_tuples field made me confused. But if cost_sort() is the only
reason, why not simply pass -1 in the 0001 patch, and use the limit_tuples
argument only in 0002? I see several other calls of cost_sort() in the core
where -1 is hard-coded.


Yeah, that would make it easy to review the patch; will do.


Both patches:


One thing that makes me confused when I try to understand details is that the
new functions add_foreign_ordered_paths() and add_foreign_final_paths() both
pass "input_rel" for the "foreignrel" argument of estimate_path_cost_size():

estimate_path_cost_size(root, input_rel, ...)

whereas the existing add_foreign_grouping_paths() passes "grouped_rel":

estimate_path_cost_size(root, grouped_rel, ...)

Can you please make this consistent? If you do, you'll probably need to
reconsider your changes to estimate_path_cost_size().


This is intended and I think I should add more comments on that.  Let me
explain.  These patches modified estimate_path_cost_size() so that we can
estimate the costs of a foreign path for the UPPERREL_ORDERED or
UPPERREL_FINAL rel, by adding the costs of the upper-relation processing (ie,
ORDER BY and/or LIMIT/OFFSET, specified by PgFdwPathExtraData) to the costs of
implementing the *underlying* relation for the upper relation (ie, scan, join
or grouping rel, specified by the input_rel). So those functions call
estimate_path_cost_size() with the input_rel, not the ordered_rel/final_rel,
along with PgFdwPathExtraData.


I think the same already happens for the UPPERREL_GROUP_AGG relation:
estimate_path_cost_size()

...
else if (IS_UPPER_REL(foreignrel))
{
...
ofpinfo = (PgFdwRelationInfo *) fpinfo->outerrel->fdw_private;

Here "outerrel" actually means input relation from the grouping perspective.
The input relation (i.e. result of scan / join) estimates are retrieved from
"ofpinfo" and the costs of aggregation are added. Why should this be different
for other kinds of upper relations?


IIUC, I think there is an oversight in that case: the cost estimation 
doesn't account for evaluating the final scan/join target updated by 
apply_scanjoin_target_to_paths(), but I think that is another issue, so 
I'll start a new thread.



And maybe related problem: why should FDW care about the effect of
apply_scanjoin_target_to_paths() like you claim to do here?

/*
 * If this is UPPERREL_ORDERED and/or UPPERREL_FINAL steps on the
 * final scan/join relation, the costs obtained from the cache
 * wouldn't yet contain the eval cost for the final scan/join target,
 * which would have been updated by apply_scanjoin_target_to_paths();
 * add the eval cost now.
 */
if (fpextra&&   !IS_UPPER_REL(foreignrel))
{
/* The costs should have been obtained from the cache. */
Assert(fpinfo->rel_startup_cost>= 0&&
fpinfo->rel_total_cost>= 0);

startup_cost += foreignrel->reltarget->cost.startup;
run_cost += foreignrel->reltarget->cost.per_tuple * rows;
}

I think it should not, whether "foreignrel" means "input_rel" or "output_rel"
from the perspective of postgresGetForeignUpperPaths().


I added this before costing the sort operation below, because 1) the cost of
evaluating the scan/join target might not be zero (consider the case where
sort columns are not simple Vars, for example) and 2) the cost of sorting
takes into account the underlying path's startup/total costs.  Maybe I'm
missing something, though.


My understanding is that the "input_rel" argument of
postgresGetForeignUpperPaths() should already have been processed by
postgresGetForeignPaths() or postgresGetForeignJoinPaths() (or actually by
postgresGetForeignUpperPaths() called with different "stage" too). Therefore
it should already have the "fdw_private" field initialized. The estimates in
the corresponding PgFdwRelationInfo should already include the reltarget
costs, as set previously by estimate_path_cost_size().


Right, but what I'm saying here is: the costs of evaluating the final 
scan/join target updated by 

Re: 2019-03 CF Summary / Review - Tranche #1

2019-02-15 Thread Michael Paquier
On Thu, Feb 14, 2019 at 12:37:52PM -0800, Andres Freund wrote:
> - Adding a TAP test checking data consistency on standby with
>   minRecoveryPoint
> 
>   NR: Seems like it ought to be committable, Andrew Gierth did provide
>   some feedback.

I am rather happy with the shape of the test, and was just waiting on
Andrew for a final confirmation.  If there are no objections, I could
commit it as well.
--
Michael




Re: INSTALL file

2019-02-15 Thread Peter Eisentraut
On 14/02/2019 20:13, Andres Freund wrote:
> On 2019-02-04 11:02:44 +0900, Michael Paquier wrote:
>> +1.  I have just looked at the patch, and my take is that listing all
>> the ways to build Postgres directly in the README is just a
>> duplication of what we already have in the documentation.  So I would
>> just rip out all that and keep a simple link to the documentation.
> 
> Well, the documentation cannot be built without the dependencies, and
> not everyone has convenient internet access.  I'm not sure what the
> solution to that is, but somehow consolidating that information in the,
> by now pretty standardized, location of INSTALL seems pretty reasonable
> to me.

(I suppose you meant README here.  The information is already in INSTALL.)

But the proposed additions don't actually mention the required
dependencies to build the documentation, so it wouldn't even achieve
that goal.

And if you don't have internet access, how did you get the git checkout?

The use case here seems pretty narrow.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services