Re: Disable WAL logging to speed up data loading

2020-10-01 Thread Kyotaro Horiguchi
At Fri, 02 Oct 2020 13:51:35 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> Sorry for the slippery brain...  ^2

> At Fri, 02 Oct 2020 13:38:22 +0900 (JST), Kyotaro Horiguchi 
>  wrote in 
> > At Fri, 2 Oct 2020 10:56:21 +0900, Fujii Masao 
> >  wrote in 
> > > 
> > > On 2020/10/02 10:06, Kyotaro Horiguchi wrote:
> > > > 1. Flip BM_PERMANENT of active buffers
> > > > 2. adding/removing init fork
> > > > 3. sync files,
> > > > 4. Flip pg_class.relpersistence.
> > > > It always skips table copy in the SET UNLOGGED case,
> > > 
> > > Even in wal_level != minimal?
> > > What happens in the standby side when SET UNLOGGED is executed without
> > > the table rewrite in the primary? The table data should be truncated
> > > in the standby?
> > 
> > A table turned into unlogged on the primary is also turned into
> > unlogged on the standby and it is inaccessible on the standby.

> Maybe the storage dropped on unpatched

Maybe the old storage is replaced with an empty stroage on unpatched.

> and left alone on patched.

> > After the table is again turned into logged, the content is
> > transferred via WAL records generated from the insertions into the new
> > storage and it rebuilds the same storage on the standby on both
> > patched and unpatched.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




RE: Disable WAL logging to speed up data loading

2020-10-01 Thread tsunakawa.ta...@fujitsu.com
From: Fujii Masao 
> > The speedup has already been achieved with higher durability by
> > wal_level=minimal in that case.
> 
> I was thinking the same, i.e., wal_level=minimal + wal_skip_threshold would
> speed up that initial data loading.

First of all, thank you Horiguchi-san for trying to improve ALTER TABLE SET 
UNLOGGED/LOGGED.  That should also be appealing.

At the same time, as I said before, both features have good points.  TBH, as a 
user, I'm kind of attracted by MySQL's approach because of its simplicity for 
users (although DBMS developers may be worried about this and that.)  What 
tempts me is that I can just switch on the feature with a single configuration 
parameter, and continue to use existing SQL scripts and other data integration 
software without knowing what tables those load data into.  In the same 
context, I don't have to add or delete ALTER TABLE statements when I have to 
change the set of tables to be loaded.  For the same reason, I'm also 
interested in Oracle's another feature ALTER TABLESPACE LOGGING/NOLOGGING.

BTW, does ALTER TABLE LOGGED/UNLOGGED on a partitioned table get the change to 
its all partitions?  It would be a bit tedious to add/delete ALTER TABLE 
LOGGED/UNLOGGED when I add/drop a partition.

Regarding data migration, data movement is not limited only to major upgrades.  
It will be convenient to speed up the migration of the entire database cluster 
into a new instance for testing and new deployment.  (I'm not sure about recent 
pg_upgrade, but pg_upgrade sometimes cannot upgrade too older versions.)

To conclude, I hope both features will be realized, and wish we won't fall in a 
situation where the words fly such as "Mine is enough. Yours is risky and not 
necessary."

With that said, I think we may as well separate the thread some time later for 
CF entry.  Otherwise, we will have trouble in finding the latest patch from the 
CF entry.


> No, I have no strong objection against your trial. But I was thinking
> that it's not so easy to design and implement wal_level=none.
> For example, there are some functions and commands depending on
> the existence of WAL, like pg_switch_wal(), PREPARE TRANSACTION
> and COMMIT PREPARED. Probably you need to define how they should
> work in wal_level=none, e.g., emit an error.

Yeah, we thought pg_switch_wal() may need some treatment.  We'll check 
PREPARE and COMMIT PREPARED as well.  I'd appreciate it if you share what you 
notice at any time.  It is possible that we should emit WAL records of some 
resource managers, like the bootstrap mode emits WAL only for RM_XLOG_ID.


Regards
Takayuki Tsunakawa







Re: Disable WAL logging to speed up data loading

2020-10-01 Thread Kyotaro Horiguchi
Sorry for the slippery brain...

At Fri, 02 Oct 2020 13:38:22 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> At Fri, 2 Oct 2020 10:56:21 +0900, Fujii Masao  
> wrote in 
> > 
> > On 2020/10/02 10:06, Kyotaro Horiguchi wrote:
> > > 1. Flip BM_PERMANENT of active buffers
> > > 2. adding/removing init fork
> > > 3. sync files,
> > > 4. Flip pg_class.relpersistence.
> > > It always skips table copy in the SET UNLOGGED case,
> > 
> > Even in wal_level != minimal?
> > What happens in the standby side when SET UNLOGGED is executed without
> > the table rewrite in the primary? The table data should be truncated
> > in the standby?
> 
> A table turned into unlogged on the primary is also turned into
> unlogged on the standby and it is inaccessible on the standby.

> Maybe the storage is dropped on both patched and unpatched versoins.

Maybe the storage dropped on unpatched and left alone on patched.

> After the table is again turned into logged, the content is
> transferred via WAL records generated from the insertions into the new
> storage and it rebuilds the same storage on the standby on both
> patched and unpatched.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [Patch] Optimize dropping of relation buffers using dlist

2020-10-01 Thread Kyotaro Horiguchi
At Fri, 02 Oct 2020 11:44:46 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> At Thu, 1 Oct 2020 12:55:34 +, "k.jami...@fujitsu.com" 
>  wrote in 
> - *   XXX currently it sequentially searches the buffer pool, should 
> be
> - *   changed to more clever ways of searching.  However, this routine
> - *   is used only in code paths that aren't very 
> performance-critical,
> - *   and we shouldn't slow down the hot paths to make it faster ...
> + *   XXX The relation might have extended before this, so this path 
> is
> + *   only optimized during recovery when we can get a reliable cached
> + *   value of blocks for specified relation.  In addition, it is 
> safe to
> + *   do this since there are no other processes but the startup 
> process
> + *   that changes the relation size during recovery.  Otherwise, or 
> if
> + *   not in recovery, proceed to usual invalidation process, where it
> + *   sequentially searches the buffer pool.
> 
> This should no longer be a XXX comment.  It seems to me somewhat
> describing too-detailed at this function's level. How about something
> like the follwoing? (excpet its syntax, or phrasing:p)
> 
> ===
> If the expected maximum number of buffers to drop is small enough
> compared to NBuffers, individual buffers are located by
> BufTableLookup. Otherwise we scan through all buffers. Snnce we
> mustn't leave a buffer behind, we take the latter way unless the
> number is not reliably identified.  See smgrcachednblocks() for
> details.
> ===

The second to last phrase is inversed, and some typos are found. FWIW
this is the revised version.


If we are expected to drop buffers less enough, we locate individual
buffers using BufTableLookup. Otherwise we scan through all
buffers. Since we mustn't leave a buffer behind, we take the latter
way unless the sizes of all the involved forks are known to be
accurte. See smgrcachednblocks() for details.


regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Disable WAL logging to speed up data loading

2020-10-01 Thread Kyotaro Horiguchi
At Fri, 2 Oct 2020 10:56:21 +0900, Fujii Masao  
wrote in 
> 
> On 2020/10/02 10:06, Kyotaro Horiguchi wrote:
> > At Thu, 1 Oct 2020 08:14:42 +, "osumi.takami...@fujitsu.com"
> >  wrote in
> >> When I compare the 2 ideas,
> >> one of the benefits of this ALTER TABLE 's improvement
> >> is that we can't avoid the downtime
> >> while that of wal_level='none' provides an easy and faster
> >> major version up via output file of pg_dumpall.
> > The speedup has already been achieved with higher durability by
> > wal_level=minimal in that case.
> 
> I was thinking the same, i.e., wal_level=minimal + wal_skip_threshold
> would
> speed up that initial data loading.

Yeah.

> >  Or maybe you should consider using
> > pg_upgrade instead.  Even inducing the time to take a backup copy of
> > the whole cluster, running pg_upgrade would be far faster than
> > pg_dumpall then loading.
> > 
> >> Both ideas have good points.
> >> However, actually to modify ALTER TABLE's copy
> >> looks far more difficult than wal_level='none' and
> >> beyond my current ability.
> >> So, I'd like to go forward with the direction of wal_level='none'.
> >> Did you have strong objections for this direction ?
> 
> No, I have no strong objection against your trial. But I was thinking
> that it's not so easy to design and implement wal_level=none.
> For example, there are some functions and commands depending on
> the existence of WAL, like pg_switch_wal(), PREPARE TRANSACTION
> and COMMIT PREPARED. Probably you need to define how they should
> work in wal_level=none, e.g., emit an error.
> 
> 
> > For fuel(?) of the discussion, I tried a very-quick PoC for in-place
> > ALTER TABLE SET LOGGED/UNLOGGED and resulted as attached. After some
> > trials of several ways, I drifted to the following way after poking
> > several ways.
> 
> Nice!
> 
> 
> > 1. Flip BM_PERMANENT of active buffers
> > 2. adding/removing init fork
> > 3. sync files,
> > 4. Flip pg_class.relpersistence.
> > It always skips table copy in the SET UNLOGGED case,
> 
> Even in wal_level != minimal?
> What happens in the standby side when SET UNLOGGED is executed without
> the table rewrite in the primary? The table data should be truncated
> in the standby?

A table turned into unlogged on the primary is also turned into
unlogged on the standby and it is inaccessible on the standby. Maybe
the storage is dropped on both patched and unpatched versoins.

After the table is again turned into logged, the content is
transferred via WAL records generated from the insertions into the new
storage and it rebuilds the same storage on the standby on both
patched and unpatched.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: New statistics for tuning WAL buffer size

2020-10-01 Thread Masahiro Ikeda

On 2020-10-02 10:21, Fujii Masao wrote:

On 2020/10/01 13:35, Fujii Masao wrote:



On 2020/10/01 12:56, Masahiro Ikeda wrote:

On 2020-10-01 11:33, Amit Kapila wrote:

On Thu, Oct 1, 2020 at 6:53 AM Kyotaro Horiguchi
 wrote:


At Thu, 1 Oct 2020 09:05:19 +0900, Fujii Masao 
 wrote in

>
>
> On 2020/09/30 20:21, Amit Kapila wrote:
> > On Tue, Sep 29, 2020 at 9:23 PM Fujii Masao
> >  wrote:
> >>
> >> On 2020/09/29 11:51, Masahiro Ikeda wrote:
> >>> On 2020-09-29 11:43, Amit Kapila wrote:
>  On Tue, Sep 29, 2020 at 7:39 AM Masahiro Ikeda
>   wrote:
> >>> Thanks for your suggestion.
> >>> I understood that the point is that WAL-related stats have just one
> >>> counter now.
> >>>
> >>> Since we may add some WAL-related stats like pgWalUsage.(bytes,
> >>> records, fpi),
> >>> I think that the current approach is good.
> >>
> >> +1
> >>
> > Okay, it makes sense to keep it in the current form if we have a plan
> > to extend this view with additional stats. However, why don't we
> > expose it with a function similar to pg_stat_get_archiver() instead of
> > providing individual functions like pg_stat_get_wal_buffers_full() and
> > pg_stat_get_wal_stat_reset_time?
>
> We can adopt either of those approaches for pg_stat_wal. I think that
> the former is a bit more flexible because we can collect only one of
> WAL information even when pg_stat_wal will contain many information
> in the future, by using the function. But you thought there are some
> reasons that the latter is better for pg_stat_wal?

FWIW I prefer to expose it by one SRF function rather than by
subdivided functions.  One of the reasons is the less oid 
consumption

and/or reduction of definitions for intrinsic functions.

Another reason is at least for me subdivided functions are not 
useful
so much for on-the-fly examination on psql console.  I'm often 
annoyed

by realizing I can't recall the exact name of a function, say,
pg_last_wal_receive_lsn or such but function names cannot be
auto-completed on psql console. "select proname from pg_proc where
proname like.. " is one of my friends:p On the other hand "select *
from pg_stat_wal" requires no detailed memory.

However subdivided functions might be useful if I wanted use just 
one
number of wal-stats in a function, I think it is not a major usage 
and

we can use a SQL query on the view instead.

Another reason that I mildly want to object to subdivided functions 
is

I was annoyed that a stats view makes many individual calls to
functions that internally share the same statistics entry.  That
behavior required me to provide an entry-caching feature to my
shared-memory statistics patch.



All these are good reasons to expose it via one function and I think


Understood. +1 to expose it as one function.


that is why most of our existing views also use one function 
approach.


Thanks for your comments.
I didn't notice there are the above disadvantages to provide 
individual functions.


I changed the latest patch to expose it via one function.


Thanks for updating the patch! LGTM.
Barring any other objection, I will commit it.


I updated typedefs.list and pushed the patch. Thanks!


Thanks to all reviewers!

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION




Re: Assertion failure with barriers in parallel hash join

2020-10-01 Thread Thomas Munro
On Tue, Sep 29, 2020 at 9:12 PM Thomas Munro  wrote:
> On Tue, Sep 29, 2020 at 7:11 PM Michael Paquier  wrote:
> > #2  0x009027d2 in ExceptionalCondition
> > (conditionName=conditionName@entry=0xa80846 "!barrier->static_party",
>
> > #4  0x00682ebf in ExecParallelHashJoinNewBatch
>
> Thanks.  Ohhh.  I think I see how that condition was reached and what
> to do about it, but I'll need to look more closely.  I'm away on
> vacation right now, and will update in a couple of days when I'm back
> at a real computer.

Here's a throw-away patch to add some sleeps that trigger the problem,
and a first draft fix.  I'll do some more testing of this next week
and see if I can simplify it.
From 65f70e0be36ec61e1993907162cfd4edac46e063 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Fri, 2 Oct 2020 15:24:23 +1300
Subject: [PATCH 1/2] Inject fault timing

---
 src/backend/executor/nodeHash.c | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index ea69eeb2a1..244805e69b 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -25,6 +25,7 @@
 
 #include 
 #include 
+#include 
 
 #include "access/htup_details.h"
 #include "access/parallel.h"
@@ -585,6 +586,13 @@ ExecHashTableCreate(HashState *state, List *hashOperators, List *hashCollations,
 		ParallelHashJoinState *pstate = hashtable->parallel_state;
 		Barrier*build_barrier;
 
+		if (ParallelWorkerNumber >= 1)
+		{
+			elog(LOG, "a short nap before attaching to build_barrier...");
+			sleep(2);
+			elog(LOG, "nap finished");
+		}
+
 		/*
 		 * Attach to the build barrier.  The corresponding detach operation is
 		 * in ExecHashTableDetach.  Note that we won't attach to the
@@ -3198,6 +3206,9 @@ ExecHashTableDetach(HashJoinTable hashtable)
 			if (DsaPointerIsValid(pstate->batches))
 			{
 dsa_free(hashtable->area, pstate->batches);
+elog(LOG, "batch array freed, taking a long nap...");
+sleep(5);
+elog(LOG, "finished nap, clearing pointer");
 pstate->batches = InvalidDsaPointer;
 			}
 		}
-- 
2.20.1

From 21f745905dcaff738326434943e70c4292aae4a4 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Fri, 2 Oct 2020 15:53:44 +1300
Subject: [PATCH 2/2] Fix race condition in parallel hash join batch cleanup.

With unlucky timing and parallel_leader_participation off, PHJ could
attempt to access per-batch state just as it was being freed.  There was
code intended to prevent that by checking for a cleared pointer, but it
was racy.  Fix, by introducing an extra barrier phase.  The new phase
PHJ_BUILD_RUNNING means that it's safe to access the per-batch state to
find a batch to help with, and PHJ_BUILD_DONE means that it is too late.
The last to detach will free the array of per-batch state as before, but
now it will also atomically advance the phase at the same time, so that
late attachers can avoid the hazard.  This mirrors the way per-batch
hash tables are freed (see phases PHJ_BATCH_PROBING and PHJ_BATCH_DONE).

Revealed by a build farm failure, where BarrierAttach() failed a sanity
check assertion, because the memory had been clobbered by dsa_free().

Discussion: https://postgr.es/m/20200929061142.GA29096%40paquier.xyz
---
 src/backend/executor/nodeHash.c | 47 -
 src/backend/executor/nodeHashjoin.c | 38 +--
 src/include/executor/hashjoin.h |  3 +-
 3 files changed, 56 insertions(+), 32 deletions(-)

diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 244805e69b..b1013b452b 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -334,14 +334,21 @@ MultiExecParallelHash(HashState *node)
 	hashtable->nbuckets = pstate->nbuckets;
 	hashtable->log2_nbuckets = my_log2(hashtable->nbuckets);
 	hashtable->totalTuples = pstate->total_tuples;
-	ExecParallelHashEnsureBatchAccessors(hashtable);
+
+	/*
+	 * Unless we're completely done and the batch state has been freed, make
+	 * sure we have accessors.
+	 */
+	if (BarrierPhase(build_barrier) < PHJ_BUILD_DONE)
+		ExecParallelHashEnsureBatchAccessors(hashtable);
 
 	/*
 	 * The next synchronization point is in ExecHashJoin's HJ_BUILD_HASHTABLE
-	 * case, which will bring the build phase to PHJ_BUILD_DONE (if it isn't
+	 * case, which will bring the build phase to PHJ_BUILD_RUNNING (if it isn't
 	 * there already).
 	 */
 	Assert(BarrierPhase(build_barrier) == PHJ_BUILD_HASHING_OUTER ||
+		   BarrierPhase(build_barrier) == PHJ_BUILD_RUNNING ||
 		   BarrierPhase(build_barrier) == PHJ_BUILD_DONE);
 }
 
@@ -632,7 +639,7 @@ ExecHashTableCreate(HashState *state, List *hashOperators, List *hashCollations,
 		/*
 		 * The next Parallel Hash synchronization point is in
 		 * MultiExecParallelHash(), which will progress it all the way to
-		 * PHJ_BUILD_DONE.  The caller must not return control from this
+		 * PHJ_BUILD_RUNNING.  The caller must not return control from this
 		 * 

Re: Why does PostgresNode.pm set such a low value of max_wal_senders?

2020-10-01 Thread Kyotaro Horiguchi


At Wed, 30 Sep 2020 22:38:59 -0700, Noah Misch  wrote in 
noah> Perhaps wal_level=minimal should stop its pedantic call for 
max_wal_senders=0.
noah> As long as the relevant error messages are clear, it would be fine for
noah> wal_level=minimal to ignore max_wal_senders and size resources as though
noah> max_wal_senders=0.  That could be one less snag for end users.  (It's not
noah> worth changing solely to save a line in PostgresNode, though.)

At Thu, 01 Oct 2020 09:42:52 -0400, Tom Lane  wrote in 
tgl> On the other point, I think that we should continue to complain
tgl> about max_wal_senders > 0 with wal_level = minimal.  If we reduce
tgl> that to a LOG message, which'd be the net effect of trying to be
tgl> laxer, people wouldn't see it and would then wonder why they can't
tgl> start replication.

FWIW, I'm on the noah's side.

One reason of that is that if we implement the in-place setting
relation persistence feature for bulk-data loading, wal_level would
get flipped-then-back between minimal and replica or logical.  The
restriction about max_wal_senders is the pain n the ass in that case..

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [Patch] Optimize dropping of relation buffers using dlist

2020-10-01 Thread Kyotaro Horiguchi
At Thu, 1 Oct 2020 12:55:34 +, "k.jami...@fujitsu.com" 
 wrote in 
> On Thursday, October 1, 2020 4:52 PM, Tsunakawa-san wrote:
>  
> > From: Kyotaro Horiguchi 
> > > I thought that the advantage of this optimization is that we don't
> > > need to visit all buffers?  If we need to run a full-scan for any
> > > reason, there's no point in looking-up already-visited buffers again.
> > > That's just wastefull cycles.  Am I missing somethig?
> > >
> > > I don't understand. If we chose to the optimized dropping, the reason
> > > is the number of buffer lookup is fewer than a certain threashold. Why
> > > do you think that the fork kind a buffer belongs to is relevant to the
> > > criteria?
> > 
> > I rethought about this, and you certainly have a point, but...  OK, I think 
> > I
> > understood.  I should have thought in a complicated way.  In other words,
> > you're suggesting "Let's simply treat all forks as one relation to determine
> > whether to optimize," right?  That is, the code simple becomes:

Exactly.  The concept of the threshold is that if we are expected to
repeat buffer look-up than that, we consider just one-time full-scan
more efficient.  Since we know we are going to drop buffers of all (or
the specified) forks of the relation at once, the number of looking-up
is naturally the sum of the expected number of the buffers of all
forks.

> > whether to optimize," right?  That is, the code simple becomes:
> > 
> > Sums up the number of buffers to invalidate in all forks;
> > if (the cached sizes
> > of all forks are valid && # of buffers to invalidate < THRESHOLD) {
> > do the optimized way;
> > return;
> > }
> > do the traditional way;
> > 
> > This will be simple, and I'm +1.

Thanks!

> This is actually close to the v18 I posted trying Horiguchi-san's approach, 
> but that
> patch had bug. So attached is an updated version (v20) trying this approach 
> again.
> I hope it's bug-free this time.

Thaks for the new version.

- * XXX currently it sequentially searches the buffer pool, should 
be
- * changed to more clever ways of searching.  However, this routine
- * is used only in code paths that aren't very 
performance-critical,
- * and we shouldn't slow down the hot paths to make it faster ...
+ * XXX The relation might have extended before this, so this path 
is

The following description is found in the comment for FlushRelationBuffers.

> * XXX currently it sequentially searches the buffer pool, should 
> be
> * changed to more clever ways of searching.  This routine is not
> * used in any performance-critical code paths, so it's not worth
> * adding additional overhead to normal paths to make it go faster;
> * but see also DropRelFileNodeBuffers.

This looks like to me "We won't do that kind of optimization for
FlushRelationBuffers, but DropRelFileNodeBuffers would need it".  If
so, don't we need to revise the comment together?

- * XXX currently it sequentially searches the buffer pool, should 
be
- * changed to more clever ways of searching.  However, this routine
- * is used only in code paths that aren't very 
performance-critical,
- * and we shouldn't slow down the hot paths to make it faster ...
+ * XXX The relation might have extended before this, so this path 
is
+ * only optimized during recovery when we can get a reliable cached
+ * value of blocks for specified relation.  In addition, it is 
safe to
+ * do this since there are no other processes but the startup 
process
+ * that changes the relation size during recovery.  Otherwise, or 
if
+ * not in recovery, proceed to usual invalidation process, where it
+ * sequentially searches the buffer pool.

This should no longer be a XXX comment.  It seems to me somewhat
describing too-detailed at this function's level. How about something
like the follwoing? (excpet its syntax, or phrasing:p)

===
If the expected maximum number of buffers to drop is small enough
compared to NBuffers, individual buffers are located by
BufTableLookup. Otherwise we scan through all buffers. Snnce we
mustn't leave a buffer behind, we take the latter way unless the
number is not reliably identified.  See smgrcachednblocks() for
details.
===

(I'm still mildly opposed to the function name, which seems exposing
 detail too much.)

+* Get the total number of cached blocks and to-be-invalidated blocks
+* of the relation.  If a fork's nblocks is not valid, break the loop.

The number of file blocks is not usually equal to the number of
existing buffers for the file. We might need to explain that
limitation here.


+   for (j = 0; j < nforks; j++)

Though I understand that j is considered to be in a connection with
fork number, I'm a bit uncomfortable that j is used for the outmost

Re: Improving connection scalability: GetSnapshotData()

2020-10-01 Thread Ian Barwick

On 2020/10/02 3:26, Andres Freund wrote:

Hi Ian, Andrew, All,

On 2020-09-30 15:43:17 -0700, Andres Freund wrote:

Attached is an updated version of the test (better utility function,
stricter regexes, bailing out instead of failing just the current when
psql times out).  I'm leaving it in this test for now, but it's fairly
easy to use this way, in my opinion, so it may be worth moving to
PostgresNode at some point.


I pushed this yesterday. Ian, thanks again for finding this and helping
with fixing & testing.


Thanks! Apologies for not getting back to your earlier responses,
have been distracted by Various Other Things.

The tests I run which originally triggered the issue now run just fine.


Regards

Ian Barwick


--
Ian Barwick   https://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: Disable WAL logging to speed up data loading

2020-10-01 Thread Fujii Masao




On 2020/10/02 10:06, Kyotaro Horiguchi wrote:

At Thu, 1 Oct 2020 08:14:42 +, "osumi.takami...@fujitsu.com" 
 wrote in

Hi, Horiguchi-San and Fujii-San.


Thank you so much both of you.

the table needs to be rewriitten. One idea for that is to improve that
command so that it skips the table rewrite if wal_level=minimal.
Of course, also you can change wal_level after marking the table as
unlogged.


tablecmd.c:

The idea is really interesting.
I didn't come up with getting rid of the whole copy of
the ALTER TABLE UNLOGGED/LOGGED commands
only when wal_level='minimal'.


* There are two reasons for requiring a rewrite when changing
* persistence: on one hand, we need to ensure that the buffers
* belonging to each of the two relations are marked with or without
* BM_PERMANENT properly.  On the other hand, since rewriting creates
* and assigns a new relfilenode, we automatically create or drop an
* init fork for the relation as appropriate.

Thanks for sharing concrete comments in the source code.


According to this comment, perhaps we can do that at least for
wal_level=minimal.

When I compare the 2 ideas,
one of the benefits of this ALTER TABLE 's improvement
is that we can't avoid the downtime
while that of wal_level='none' provides an easy and faster
major version up via output file of pg_dumpall.


The speedup has already been achieved with higher durability by
wal_level=minimal in that case.


I was thinking the same, i.e., wal_level=minimal + wal_skip_threshold would
speed up that initial data loading.



 Or maybe you should consider using
pg_upgrade instead.  Even inducing the time to take a backup copy of
the whole cluster, running pg_upgrade would be far faster than
pg_dumpall then loading.


Both ideas have good points.
However, actually to modify ALTER TABLE's copy
looks far more difficult than wal_level='none' and
beyond my current ability.
So, I'd like to go forward with the direction of wal_level='none'.
Did you have strong objections for this direction ?


No, I have no strong objection against your trial. But I was thinking
that it's not so easy to design and implement wal_level=none.
For example, there are some functions and commands depending on
the existence of WAL, like pg_switch_wal(), PREPARE TRANSACTION
and COMMIT PREPARED. Probably you need to define how they should
work in wal_level=none, e.g., emit an error.



For fuel(?) of the discussion, I tried a very-quick PoC for in-place
ALTER TABLE SET LOGGED/UNLOGGED and resulted as attached. After some
trials of several ways, I drifted to the following way after poking
several ways.


Nice!



1. Flip BM_PERMANENT of active buffers
2. adding/removing init fork
3. sync files,
4. Flip pg_class.relpersistence.

It always skips table copy in the SET UNLOGGED case,


Even in wal_level != minimal?
What happens in the standby side when SET UNLOGGED is executed without
the table rewrite in the primary? The table data should be truncated
in the standby?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: New statistics for tuning WAL buffer size

2020-10-01 Thread Fujii Masao




On 2020/10/01 13:35, Fujii Masao wrote:



On 2020/10/01 12:56, Masahiro Ikeda wrote:

On 2020-10-01 11:33, Amit Kapila wrote:

On Thu, Oct 1, 2020 at 6:53 AM Kyotaro Horiguchi
 wrote:


At Thu, 1 Oct 2020 09:05:19 +0900, Fujii Masao  
wrote in
>
>
> On 2020/09/30 20:21, Amit Kapila wrote:
> > On Tue, Sep 29, 2020 at 9:23 PM Fujii Masao
> >  wrote:
> >>
> >> On 2020/09/29 11:51, Masahiro Ikeda wrote:
> >>> On 2020-09-29 11:43, Amit Kapila wrote:
>  On Tue, Sep 29, 2020 at 7:39 AM Masahiro Ikeda
>   wrote:
> >>> Thanks for your suggestion.
> >>> I understood that the point is that WAL-related stats have just one
> >>> counter now.
> >>>
> >>> Since we may add some WAL-related stats like pgWalUsage.(bytes,
> >>> records, fpi),
> >>> I think that the current approach is good.
> >>
> >> +1
> >>
> > Okay, it makes sense to keep it in the current form if we have a plan
> > to extend this view with additional stats. However, why don't we
> > expose it with a function similar to pg_stat_get_archiver() instead of
> > providing individual functions like pg_stat_get_wal_buffers_full() and
> > pg_stat_get_wal_stat_reset_time?
>
> We can adopt either of those approaches for pg_stat_wal. I think that
> the former is a bit more flexible because we can collect only one of
> WAL information even when pg_stat_wal will contain many information
> in the future, by using the function. But you thought there are some
> reasons that the latter is better for pg_stat_wal?

FWIW I prefer to expose it by one SRF function rather than by
subdivided functions.  One of the reasons is the less oid consumption
and/or reduction of definitions for intrinsic functions.

Another reason is at least for me subdivided functions are not useful
so much for on-the-fly examination on psql console.  I'm often annoyed
by realizing I can't recall the exact name of a function, say,
pg_last_wal_receive_lsn or such but function names cannot be
auto-completed on psql console. "select proname from pg_proc where
proname like.. " is one of my friends:p On the other hand "select *
from pg_stat_wal" requires no detailed memory.

However subdivided functions might be useful if I wanted use just one
number of wal-stats in a function, I think it is not a major usage and
we can use a SQL query on the view instead.

Another reason that I mildly want to object to subdivided functions is
I was annoyed that a stats view makes many individual calls to
functions that internally share the same statistics entry.  That
behavior required me to provide an entry-caching feature to my
shared-memory statistics patch.



All these are good reasons to expose it via one function and I think


Understood. +1 to expose it as one function.



that is why most of our existing views also use one function approach.


Thanks for your comments.
I didn't notice there are the above disadvantages to provide individual 
functions.

I changed the latest patch to expose it via one function.


Thanks for updating the patch! LGTM.
Barring any other objection, I will commit it.


I updated typedefs.list and pushed the patch. Thanks!

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: buildfarm animal shoveler failing with "Illegal instruction"

2020-10-01 Thread Tom Lane
Mark Wong  writes:
> I'm getting Tom set up with access too, in case he has time before me to
> get a stack trace to see what's happening...

tl;dr: it's hard to conclude that this is anything but a compiler bug.

I was able to reproduce this on shoveler's host, but only when using
the compiler shoveler uses (clang-3.9), not the 6.3 gcc that's also
on there and is of similar vintage.  Observations:

* You don't need any complicated test case; "pg_dump template1"
is enough.

* Reverting 1ed6b8956's addition of a "postfix operators are not supported
anymore" warning to dumpOpr() makes it go away.  This despite the fact
that that code is never reached when dumping template1.  (We do enter
dumpOpr, but the oprinfo->dobj.dump test always fails.)

* Reducing the optimization level to -O1 or -O0 makes it go away.

* Inserting a debugging fprintf in dumpOpr makes it go away.

Since clang 3.9 is several years old, maybe we could move shoveler
up to a newer version?  Or dial it down to -O1 optimization?

regards, tom lane




Re: enable_incremental_sort changes query behavior

2020-10-01 Thread James Coleman
On Thu, Oct 1, 2020 at 6:08 PM Tomas Vondra
 wrote:
>
> On Thu, Oct 01, 2020 at 09:02:57AM -0400, James Coleman wrote:
> >On Thu, Oct 1, 2020 at 3:09 AM Jaime Casanova
> > wrote:
> >>
> >> On Wed, 30 Sep 2020 at 21:21, James Coleman  wrote:
> >> >
> >> > On Sat, Sep 26, 2020 at 2:49 PM Jaime Casanova
> >> >  wrote:
> >> > >
> >> > > Hi,
> >> > >
> >> > > With sqlsmith I found a query that gives this error:
> >> > > ERROR:  ORDER/GROUP BY expression not found in targetlist
> >> > >
> >> [...]
> >> > >
> >> > > But if I set enable_incremental_sort to off the query gets executed
> >> > > without problems (attached the explain produced for that case)
> >> >
> >> > Thanks for the report.
> >> >
> >>
> >> Hi,
> >>
> >> by experiment I reduced the query to this
> >>
> >> --- 0 ---
> >> select distinct
> >> subq_0.c1 as c0,
> >> case when (true = pg_catalog.pg_rotate_logfile_old()) then
> >> ref_0.t else ref_0.t
> >> end
> >>  as c4
> >> from
> >>   public.ref_0,
> >>   lateral (select
> >>
> >> ref_0.i as c1
> >>   from
> >> generate_series(1, 100) as ref_1) as subq_0
> >> --- 0 ---
> >>
> >> the only custom table already needed can be created with this commands:
> >>
> >> --- 0 ---
> >> create table ref_0 as select repeat('abcde', (random() * 10)::integer)
> >> t, random() * 1000 i from generate_series(1, 50);
> >> create index on ref_0 (i);
> >> analyze ref_0 ;
> >> --- 0 ---
> >>
> >>
> >> > Is there by an chance an index on ref_0.radi_text_temp?
> >> >
> >>
> >> there is an index involved but not on that field, commands above
> >> create the index in the right column... after that, ANALYZE the table
> >>
> >> > And if you set enable_hashagg = off what plan do you get (or error)?
> >> >
> >>
> >> same error
> >
> >I was able to reproduce the error without incremental sort enabled
> >(i.e., it happens with a full sort also). The function call in the
> >SELECT doesn't have to be in a case expression; for example I was able
> >to reproduce changing that to `random()::text || ref_0.t`.
> >
> >It looks like the issue happens when:
> >1. The sort happens within a parallel node.
> >2. One of the sort keys is an expression containing a volatile
> >function call and a column from the lateral join.
> >
> >Here are the settings I used with your above repro case to show it
> >with regular sort:
> >
> > enable_hashagg=off
> > enable_incremental_sort=off
> > enable_seqscan=off
> > parallel_setup_cost=10
> > parallel_tuple_cost=0
> >
> >The plan (obtained by replacing the volatile function with a stable one):
> >
> > Unique
> >   ->  Nested Loop
> > ->  Gather Merge
> >   Workers Planned: 2
> >   ->  Sort
> > Sort Key: ref_0.i, (md5(ref_0.t))
> > ->  Parallel Index Scan using ref_0_i_idx on ref_0
> > ->  Function Scan on generate_series ref_1
> >
> >Changing `md5(ref_0.t)` to `random()::text || ref_0.t` causes the error.
> >
> >I haven't been able to dig further than that yet, but my intuition is
> >to poke around in the parallel query machinery?
> >
>
> Nope. Bisect says this was introduced by this commit:
>
>  ba3e76cc57 Consider Incremental Sort paths at additional places
>
> Looking at the diff, I suspect generate_useful_gather_paths (or maybe
> get_useful_pathkeys_for_relation) fails to do something with the
> pathkeys.
>
> Of course, that'd explain why it only happens for parallel plans, as
> this builds gather nodes.

I should have known I'd regret making a wild guess. I was in the
middle of bisecting too, but had to set it aside for a bit and hadn't
finished.

I'll try to look into that commit (and I agree those functions are the
likely places to look) as I get a chance here.

James




Re: Disable WAL logging to speed up data loading

2020-10-01 Thread Kyotaro Horiguchi
At Thu, 1 Oct 2020 08:14:42 +, "osumi.takami...@fujitsu.com" 
 wrote in 
> Hi, Horiguchi-San and Fujii-San.
> 
> 
> Thank you so much both of you.
> > > the table needs to be rewriitten. One idea for that is to improve that
> > > command so that it skips the table rewrite if wal_level=minimal.
> > > Of course, also you can change wal_level after marking the table as
> > > unlogged.
> > 
> > tablecmd.c:
> The idea is really interesting.
> I didn't come up with getting rid of the whole copy of
> the ALTER TABLE UNLOGGED/LOGGED commands
> only when wal_level='minimal'.
> 
> > > * There are two reasons for requiring a rewrite when changing
> > > * persistence: on one hand, we need to ensure that the buffers
> > > * belonging to each of the two relations are marked with or without
> > > * BM_PERMANENT properly.  On the other hand, since rewriting creates
> > > * and assigns a new relfilenode, we automatically create or drop an
> > > * init fork for the relation as appropriate.
> Thanks for sharing concrete comments in the source code.
> 
> > According to this comment, perhaps we can do that at least for
> > wal_level=minimal.
> When I compare the 2 ideas,
> one of the benefits of this ALTER TABLE 's improvement 
> is that we can't avoid the downtime
> while that of wal_level='none' provides an easy and faster
> major version up via output file of pg_dumpall.

The speedup has already been achieved with higher durability by
wal_level=minimal in that case.  Or maybe you should consider using
pg_upgrade instead.  Even inducing the time to take a backup copy of
the whole cluster, running pg_upgrade would be far faster than
pg_dumpall then loading.

> Both ideas have good points.
> However, actually to modify ALTER TABLE's copy
> looks far more difficult than wal_level='none' and
> beyond my current ability.
> So, I'd like to go forward with the direction of wal_level='none'.
> Did you have strong objections for this direction ?

For fuel(?) of the discussion, I tried a very-quick PoC for in-place
ALTER TABLE SET LOGGED/UNLOGGED and resulted as attached. After some
trials of several ways, I drifted to the following way after poking
several ways.

1. Flip BM_PERMANENT of active buffers
2. adding/removing init fork
3. sync files,
4. Flip pg_class.relpersistence.

It always skips table copy in the SET UNLOGGED case, and only when
wal_level=minimal in the SET LOGGED case.  Crash recovery seems
working by some brief testing by hand.

Of course, I haven't performed intensive test on it.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index dcaea7135f..d8a1c2a7e7 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -613,6 +613,26 @@ heapam_relation_set_new_filenode(Relation rel,
 	smgrclose(srel);
 }
 
+static void
+heapam_relation_set_persistence(Relation rel, char persistence)
+{
+	Assert(rel->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT ||
+		   rel->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED);
+
+	Assert (rel->rd_rel->relpersistence != persistence);
+
+	if (persistence == RELPERSISTENCE_UNLOGGED)
+	{
+		Assert(rel->rd_rel->relkind == RELKIND_RELATION ||
+			   rel->rd_rel->relkind == RELKIND_MATVIEW ||
+			   rel->rd_rel->relkind == RELKIND_TOASTVALUE);
+		RelationCreateInitFork(rel->rd_node);
+	}
+	else
+		RelationDropInitFork(rel->rd_node);
+}
+
+
 static void
 heapam_relation_nontransactional_truncate(Relation rel)
 {
@@ -2540,6 +2560,7 @@ static const TableAmRoutine heapam_methods = {
 	.compute_xid_horizon_for_tuples = heap_compute_xid_horizon_for_tuples,
 
 	.relation_set_new_filenode = heapam_relation_set_new_filenode,
+	.relation_set_persistence = heapam_relation_set_persistence,
 	.relation_nontransactional_truncate = heapam_relation_nontransactional_truncate,
 	.relation_copy_data = heapam_relation_copy_data,
 	.relation_copy_for_cluster = heapam_relation_copy_for_cluster,
diff --git a/src/backend/access/rmgrdesc/smgrdesc.c b/src/backend/access/rmgrdesc/smgrdesc.c
index a7c0cb1bc3..8397002613 100644
--- a/src/backend/access/rmgrdesc/smgrdesc.c
+++ b/src/backend/access/rmgrdesc/smgrdesc.c
@@ -40,6 +40,14 @@ smgr_desc(StringInfo buf, XLogReaderState *record)
 		 xlrec->blkno, xlrec->flags);
 		pfree(path);
 	}
+	else if (info == XLOG_SMGR_UNLINK)
+	{
+		xl_smgr_unlink *xlrec = (xl_smgr_unlink *) rec;
+		char	   *path = relpathperm(xlrec->rnode, xlrec->forkNum);
+
+		appendStringInfoString(buf, path);
+		pfree(path);
+	}
 }
 
 const char *
@@ -55,6 +63,9 @@ smgr_identify(uint8 info)
 		case XLOG_SMGR_TRUNCATE:
 			id = "TRUNCATE";
 			break;
+		case XLOG_SMGR_UNLINK:
+			id = "UNLINK";
+			break;
 	}
 
 	return id;
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index dbbd3aa31f..7f695e6d8b 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -60,6 +60,8 @@ int			

RE: New statistics for tuning WAL buffer size

2020-10-01 Thread tsunakawa.ta...@fujitsu.com
From: Fujii Masao 
> I think that we can improve that, for example, by storing backend id
> into WalSndCtl and making pg_stat_get_wal_senders() directly
> get the walsender's LocalPgBackendStatus with the backend id,
> rather than joining pg_stat_get_activity() and pg_stat_get_wal_senders().

Yeah, I had something like that in mind.  I think I'll take note of this as my 
private homework.  (Of course, anyone can do it.)


Regards
Takayuki Tsunakawa







RE: Should walsernder check correctness of WAL records?

2020-10-01 Thread tsunakawa.ta...@fujitsu.com
From: Michael Paquier 
> CRC calculation would unlikely be the bottleneck here, no?  I would assume
> that the extra lseek() calls needed to look after the record data to be more
> harmful.

Maybe, although I'm not sure lseek() is necessary.  I simply thought walsender 
was designed to just read and send WAL without caring about other things for 
maximal speed.


> Yep.  However, I would worry much more about the case of cold archives.  In
> my experience, there are higher risks to get a WAL segment corrupted because
> it was on disk and that this disk got corrupted.  Transmission is a one-time
> short operation. Cold archives could stay on disk for weeks before getting
> reused in WAL replay.

Yes, I think cold archives should be checked regularly.  pg_verifybackup and 
pg_waldump can be used for it, can't they?


Regards
Takayuki Tsunakawa







Re: Add information to rm_redo_error_callback()

2020-10-01 Thread Michael Paquier
On Thu, Oct 01, 2020 at 11:18:30AM +0200, Drouvot, Bertrand wrote:
> Had a look at it and did a few tests: looks all good to me.
> 
> No objections at all, thanks!

Thanks for double-checking.  Applied, then.
--
Michael


signature.asc
Description: PGP signature


Re: New statistics for tuning WAL buffer size

2020-10-01 Thread Fujii Masao




On 2020/10/01 10:50, tsunakawa.ta...@fujitsu.com wrote:

From: Kyotaro Horiguchi 

Another reason that I mildly want to object to subdivided functions is
I was annoyed that a stats view makes many individual calls to
functions that internally share the same statistics entry.  That
behavior required me to provide an entry-caching feature to my
shared-memory statistics patch.


+1
The views for troubleshooting performance problems should be as light as 
possible.  IIRC, we saw  frequently searching pg_stat_replication consume 
unexpectedly high CPU power, because it calls pg_stat_get_activity(null) to get 
all sessions and join them with the walsenders.  At that time, we had hundreds 
of client sessions.  We expected pg_stat_replication to be very lightweight 
because it provides information about a few walsenders.


I think that we can improve that, for example, by storing backend id
into WalSndCtl and making pg_stat_get_wal_senders() directly
get the walsender's LocalPgBackendStatus with the backend id,
rather than joining pg_stat_get_activity() and pg_stat_get_wal_senders().

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Should walsernder check correctness of WAL records?

2020-10-01 Thread Michael Paquier
On Fri, Oct 02, 2020 at 12:16:25AM +, tsunakawa.ta...@fujitsu.com wrote:
> IIUC, walsender tries hard to send WAL as fast as possible to reduce
> replication lag and transaction response time, so it doesn't try to
> peek each WAL record.  I think it's good.

CRC calculation would unlikely be the bottleneck here, no?  I would
assume that the extra lseek() calls needed to look after the record
data to be more harmful.

> In any case, the WAL can get corrupt during transmission, and
> writing and reading on the standby.  So, the standby needs to check
> the WAL record CRC.

Yep.  However, I would worry much more about the case of cold
archives.  In my experience, there are higher risks to get a WAL
segment corrupted because it was on disk and that this disk got
corrupted.  Transmission is a one-time short operation. Cold archives
could stay on disk for weeks before getting reused in WAL replay.
--
Michael


signature.asc
Description: PGP signature


RE: Should walsernder check correctness of WAL records?

2020-10-01 Thread tsunakawa.ta...@fujitsu.com
From: Konstantin Knizhnik 
> Investigating one of customer's support cases I found out that walsender
> is not calculating WAL records CRC and send them to replicas without any
> checks.
> As a result damaged WAL record causes errors on all replicas:

IIUC, walsender tries hard to send WAL as fast as possible to reduce 
replication lag and transaction response time, so it doesn't try to peek each 
WAL record.  I think it's good.

In any case, the WAL can get corrupt during transmission, and writing and 
reading on the standby.  So, the standby needs to check the WAL record CRC.


Regards
Takayuki Tsunakawa





Re: Asynchronous Append on postgres_fdw nodes.

2020-10-01 Thread Etsuro Fujita
On Tue, Sep 29, 2020 at 4:45 AM Etsuro Fujita  wrote:
> BTW: I noticed that you changed the ExecProcNode() API so that an
> Append calling FDWs can know wether they return tuples immediately or
> not:

> That is, 1) in postgresIterateForeignScan() postgres_fdw sets the new
> PlanState’s flag asyncstate to AS_AVAILABLE/AS_WAITING depending on
> whether it returns a tuple immediately or not, and then 2) the Append
> knows that from the new flag when the callback routine returns.  I’m
> not sure this is a good idea, because it seems likely that the
> ExecProcNode() change would affect many other places in the executor,
> making maintenance and/or future development difficult.  I think the
> FDW callback routines proposed in the original patch by Robert would
> provide a cleaner way to do asynchronous execution of FDWs without
> changing the ExecProcNode() API, IIUC:
>
> +On the other hand, nodes that wish to produce tuples asynchronously
> +generally need to implement three methods:
> +
> +1. When an asynchronous request is made, the node's ExecAsyncRequest callback
> +will be invoked; it should use ExecAsyncSetRequiredEvents to indicate the
> +number of file descriptor events for which it wishes to wait and whether it
> +wishes to receive a callback when the process latch is set. Alternatively,
> +it can instead use ExecAsyncRequestDone if a result is available immediately.
> +
> +2. When the event loop wishes to wait or poll for file descriptor events and
> +the process latch, the ExecAsyncConfigureWait callback is invoked to 
> configure
> +the file descriptor wait events for which the node wishes to wait.  This
> +callback isn't needed if the node only cares about the process latch.
> +
> +3. When file descriptors or the process latch become ready, the node's
> +ExecAsyncNotify callback is invoked.
>
> What is the reason for not doing like this in your patch?

I think we should avoid changing the ExecProcNode() API.

Thomas’ patch also provides a clean FDW API that doesn’t change the
ExecProcNode() API, but I think the FDW API provided in Robert’ patch
would be better designed, because I think it would support more
different types of asynchronous interaction between the core and FDWs.
Consider this bit from Thomas’ patch, which produces a tuple when a
file descriptor becomes ready:

+   if (event.events & WL_SOCKET_READABLE)
+   {
+   /* Linear search for the node that told us to wait for this fd. */
+   for (i = 0; i < node->nasyncplans; ++i)
+   {
+   if (event.fd == node->asyncfds[i])
+   {
+   TupleTableSlot *result;
+
+   /*
+ -->   * We assume that because the fd is ready, it can produce
+ -->   * a tuple now, which is not perfect.  An improvement
+ -->   * would be if it could say 'not yet, I'm still not
+ -->   * ready', so eg postgres_fdw could PQconsumeInput and
+ -->   * then say 'I need more input'.
+*/
+   result = ExecProcNode(node->asyncplans[i]);
+   if (!TupIsNull(result))
+   {
+   /*
+* Remember this plan so that append_next_async will
+* keep trying this subplan first until it stops
+* feeding us buffered tuples.
+*/
+   node->lastreadyplan = i;
+   /* We can stop waiting for this fd. */
+   node->asyncfds[i] = 0;
+   return result;
+   }
+   else
+   {
+   /*
+* This subplan has reached EOF.  We'll go back and
+* wait for another one.
+*/
+   forget_async_subplan(node, i);
+   break;
+   }
+   }
+   }
+   }

As commented above, his patch doesn’t allow an FDW to do another data
fetch from the remote side before returning a tuple when the file
descriptor becomes available, but Robert’s patch would, using his FDW
API ForeignAsyncNotify(), which is called when the file descriptor
becomes available, IIUC.

I might be missing something, but I feel inclined to vote for Robert’s
patch (more precisely, Robert’s patch as a base patch with (1) some
planner/executor changes from Horiguchi-san’s patch and (2)
postgres_fdw changes from Thomas’ patch adjusted to match Robert’s FDW
API).

Best regards,
Etsuro Fujita




Re: NOTIFY docs fixup - emit and deliver consistency

2020-10-01 Thread David G. Johnston
On Tue, Sep 29, 2020 at 8:38 PM David G. Johnston <
david.g.johns...@gmail.com> wrote:

> On Tue, Sep 29, 2020 at 7:58 PM Tom Lane  wrote:
>
>> Maybe we could use terminology along the lines of "added to the
>> queue" and "removed from the queue"?
>>
>
> Quickly looking it over with this in mind there are a few spots that can
> be cleaned up and linked together by explicitly talking about a FIFO queue
> as the mechanism instead of the less precise send/process/deliver.  A bit
> more invasive but I think it will be done more clearly with this approach.
>

As attached.


v2-doc-notify-queue-focus.patch
Description: Binary data


Re: Rejecting redundant options in Create Collation

2020-10-01 Thread Michael Paquier
On Thu, Oct 01, 2020 at 02:58:53PM -0400, Tom Lane wrote:
> Hmm ... I think that that is pretty standard behavior for a lot of
> our utility commands.  Trying something at random,

The behavior handling is a bit inconsistent.  For example EXPLAIN and
VACUUM don't do that, because their parenthesized grammar got
introduced after the flavor that handles options as separate items in
the query, so redundant options was not something possible with only
the original grammar.

> I can see the argument for complaining about repeated options rather
> than letting this pass.  But shouldn't we try to make it uniform
> instead of letting different commands have different policies?

Consistency would be a good thing.  There is a very recent discussion
in this area for COPY actually, where the option redundancy is handled
incorrectly:
https://www.postgresql.org/message-id/20200929072433.ga15...@paquier.xyz

The argument goes in both ways.  If we handle redundant options, I
would also suspect that users complain back about the lack of error 
checks with stupid input values, but I'd like to think that it can
also be beneficial in some cases.  Now, if we don't handle redundant
options, that's less code, less risk to introduce semi-broken
behaviors like what COPY has done for years and a bit less work for
new translators, but the last point is not really an issue as we have
a generic error message here.  From the point of view of a code
maintainer, not handling redundant options at all is more appealing
IMO in the long-term.  From the user perspective, there are benefits
and disadvantages with both.  So, at the end, it seems to me that not
doing it anymore for any commands would have more benefits in the
long-term than doing it.
--
Michael


signature.asc
Description: PGP signature


Re: Improving connection scalability: GetSnapshotData()

2020-10-01 Thread Andrew Dunstan


On 10/1/20 4:22 PM, Andres Freund wrote:
> Hi,
>
> On 2020-10-01 16:00:20 -0400, Andrew Dunstan wrote:
>> My strong suspicion is that we're getting unwanted CRs. Note the
>> presence of numerous instances of this in PostgresNode.pm:
>
>> $stdout =~ s/\r\n/\n/g if $Config{osname} eq 'msys';
>>
>> So you probably want something along those lines at the top of the loop
>> in send_query_and_wait:
>>
>> $$psql{stdout} =~ s/\r\n/\n/g if $Config{osname} eq 'msys';
> Yikes, that's ugly :(.
>
>
> I assume it's not, as the comments says
>   # Note: on Windows, IPC::Run seems to convert \r\n to \n in program 
> output
>   # if we're using native Perl, but not if we're using MSys Perl.  So do 
> it
>   # by hand in the latter case, here and elsewhere.
> that IPC::Run converts things, but that native windows perl uses
> https://perldoc.perl.org/perlrun#PERLIO
> a PERLIO that includes :crlf, whereas msys probably doesn't?
>
> Any chance you could run something like
> perl -mPerlIO -e 'print(PerlIO::get_layers(STDIN), "\n");'
> on both native and msys perl?
>
>
>> possibly also for stderr, just to make it more futureproof, and at the
>> top of the file:
>>
>> use Config;
>>
>>
>> Do you want me to test that first?
> That'd be awesome.
>
>
>

The change I suggested makes jacana happy.


cheers


andrew


-- 
Andrew Dunstanhttps://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: buildfarm animal shoveler failing with "Illegal instruction"

2020-10-01 Thread Mark Wong
On Thu, Oct 01, 2020 at 12:12:44PM -0700, Andres Freund wrote:
> Hi Mark,
> 
> shoveler has been failing for a while with an odd error. E.g.
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=shoveler=2020-09-18%2014%3A01%3A48
> 
> Illegal instruction
> pg_dumpall: error: pg_dump failed on database "template1", exiting
> waiting for server to shut down done
> 
> None of the changes in that time frame look like they're likely causing
> illegal instructions to be emitted that weren't before. So I am
> wondering if anything changed on that machine around  2020-09-18
> 14:01:48 ?

It looks like the last package update was 2020-06-10 06:59:26, according
to the apt logs.

I'm getting Tom set up with access too, in case he has time before me to
get a stack trace to see what's happening...

Regards,
Mark




Re: WIP: BRIN multi-range indexes

2020-10-01 Thread Alvaro Herrera
On 2020-Oct-01, Tomas Vondra wrote:

> OK, so this seems like a data corruption bug in BRIN, actually.

Oh crap.  You're right -- the data needs to be detoasted before being
put in the index.

I'll have a look at how this can be fixed.




Re: enable_incremental_sort changes query behavior

2020-10-01 Thread Tomas Vondra

On Thu, Oct 01, 2020 at 09:02:57AM -0400, James Coleman wrote:

On Thu, Oct 1, 2020 at 3:09 AM Jaime Casanova
 wrote:


On Wed, 30 Sep 2020 at 21:21, James Coleman  wrote:
>
> On Sat, Sep 26, 2020 at 2:49 PM Jaime Casanova
>  wrote:
> >
> > Hi,
> >
> > With sqlsmith I found a query that gives this error:
> > ERROR:  ORDER/GROUP BY expression not found in targetlist
> >
[...]
> >
> > But if I set enable_incremental_sort to off the query gets executed
> > without problems (attached the explain produced for that case)
>
> Thanks for the report.
>

Hi,

by experiment I reduced the query to this

--- 0 ---
select distinct
subq_0.c1 as c0,
case when (true = pg_catalog.pg_rotate_logfile_old()) then
ref_0.t else ref_0.t
end
 as c4
from
  public.ref_0,
  lateral (select

ref_0.i as c1
  from
generate_series(1, 100) as ref_1) as subq_0
--- 0 ---

the only custom table already needed can be created with this commands:

--- 0 ---
create table ref_0 as select repeat('abcde', (random() * 10)::integer)
t, random() * 1000 i from generate_series(1, 50);
create index on ref_0 (i);
analyze ref_0 ;
--- 0 ---


> Is there by an chance an index on ref_0.radi_text_temp?
>

there is an index involved but not on that field, commands above
create the index in the right column... after that, ANALYZE the table

> And if you set enable_hashagg = off what plan do you get (or error)?
>

same error


I was able to reproduce the error without incremental sort enabled
(i.e., it happens with a full sort also). The function call in the
SELECT doesn't have to be in a case expression; for example I was able
to reproduce changing that to `random()::text || ref_0.t`.

It looks like the issue happens when:
1. The sort happens within a parallel node.
2. One of the sort keys is an expression containing a volatile
function call and a column from the lateral join.

Here are the settings I used with your above repro case to show it
with regular sort:

enable_hashagg=off
enable_incremental_sort=off
enable_seqscan=off
parallel_setup_cost=10
parallel_tuple_cost=0

The plan (obtained by replacing the volatile function with a stable one):

Unique
  ->  Nested Loop
->  Gather Merge
  Workers Planned: 2
  ->  Sort
Sort Key: ref_0.i, (md5(ref_0.t))
->  Parallel Index Scan using ref_0_i_idx on ref_0
->  Function Scan on generate_series ref_1

Changing `md5(ref_0.t)` to `random()::text || ref_0.t` causes the error.

I haven't been able to dig further than that yet, but my intuition is
to poke around in the parallel query machinery?



Nope. Bisect says this was introduced by this commit:

ba3e76cc57 Consider Incremental Sort paths at additional places

Looking at the diff, I suspect generate_useful_gather_paths (or maybe
get_useful_pathkeys_for_relation) fails to do something with the
pathkeys.

Of course, that'd explain why it only happens for parallel plans, as
this builds gather nodes.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Improving connection scalability: GetSnapshotData()

2020-10-01 Thread Andres Freund
Hi,

On 2020-10-01 16:44:03 -0400, Andrew Dunstan wrote:
> > I assume it's not, as the comments says
> > # Note: on Windows, IPC::Run seems to convert \r\n to \n in program 
> > output
> > # if we're using native Perl, but not if we're using MSys Perl.  So do 
> > it
> > # by hand in the latter case, here and elsewhere.
> > that IPC::Run converts things, but that native windows perl uses
> > https://perldoc.perl.org/perlrun#PERLIO
> > a PERLIO that includes :crlf, whereas msys probably doesn't?
> >
> > Any chance you could run something like
> > perl -mPerlIO -e 'print(PerlIO::get_layers(STDIN), "\n");'
> > on both native and msys perl?
> 
> sys (jacana): stdio
> 
> native: unixcrlf

Interesting. That suggest we could get around needing the if msys
branches in several places by setting PERLIO to unixcrlf somewhere
centrally when using msys.

Greetings,

Andres Freund




Re: Improving connection scalability: GetSnapshotData()

2020-10-01 Thread Andrew Dunstan


On 10/1/20 4:22 PM, Andres Freund wrote:
> Hi,
>
> On 2020-10-01 16:00:20 -0400, Andrew Dunstan wrote:
>> My strong suspicion is that we're getting unwanted CRs. Note the
>> presence of numerous instances of this in PostgresNode.pm:
>
>> $stdout =~ s/\r\n/\n/g if $Config{osname} eq 'msys';
>>
>> So you probably want something along those lines at the top of the loop
>> in send_query_and_wait:
>>
>> $$psql{stdout} =~ s/\r\n/\n/g if $Config{osname} eq 'msys';
> Yikes, that's ugly :(.
>
>
> I assume it's not, as the comments says
>   # Note: on Windows, IPC::Run seems to convert \r\n to \n in program 
> output
>   # if we're using native Perl, but not if we're using MSys Perl.  So do 
> it
>   # by hand in the latter case, here and elsewhere.
> that IPC::Run converts things, but that native windows perl uses
> https://perldoc.perl.org/perlrun#PERLIO
> a PERLIO that includes :crlf, whereas msys probably doesn't?
>
> Any chance you could run something like
> perl -mPerlIO -e 'print(PerlIO::get_layers(STDIN), "\n");'
> on both native and msys perl?
>
>

sys (jacana): stdio

native: unixcrlf


cheers


andrew


-- 
Andrew Dunstanhttps://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: Improving connection scalability: GetSnapshotData()

2020-10-01 Thread Andres Freund
Hi,

On 2020-10-01 16:00:20 -0400, Andrew Dunstan wrote:
> My strong suspicion is that we're getting unwanted CRs. Note the
> presence of numerous instances of this in PostgresNode.pm:


> $stdout =~ s/\r\n/\n/g if $Config{osname} eq 'msys';
> 
> So you probably want something along those lines at the top of the loop
> in send_query_and_wait:
> 
> $$psql{stdout} =~ s/\r\n/\n/g if $Config{osname} eq 'msys';

Yikes, that's ugly :(.


I assume it's not, as the comments says
# Note: on Windows, IPC::Run seems to convert \r\n to \n in program 
output
# if we're using native Perl, but not if we're using MSys Perl.  So do 
it
# by hand in the latter case, here and elsewhere.
that IPC::Run converts things, but that native windows perl uses
https://perldoc.perl.org/perlrun#PERLIO
a PERLIO that includes :crlf, whereas msys probably doesn't?

Any chance you could run something like
perl -mPerlIO -e 'print(PerlIO::get_layers(STDIN), "\n");'
on both native and msys perl?


> possibly also for stderr, just to make it more futureproof, and at the
> top of the file:
> 
> use Config;
> 
> 

> Do you want me to test that first?

That'd be awesome.


> The difference between the canonical way perl states the regex is due to
> perl version differences. It shouldn't matter.

Thanks!

Greetings,

Andres Freund




Re: Improving connection scalability: GetSnapshotData()

2020-10-01 Thread Andrew Dunstan


On 10/1/20 2:26 PM, Andres Freund wrote:
> Hi Ian, Andrew, All,
>
> On 2020-09-30 15:43:17 -0700, Andres Freund wrote:
>> Attached is an updated version of the test (better utility function,
>> stricter regexes, bailing out instead of failing just the current when
>> psql times out).  I'm leaving it in this test for now, but it's fairly
>> easy to use this way, in my opinion, so it may be worth moving to
>> PostgresNode at some point.
> I pushed this yesterday. Ian, thanks again for finding this and helping
> with fixing & testing.
>
>
> Unfortunately currently some buildfarm animals don't like the test for
> reasons I don't quite understand. Looks like it's all windows + msys
> animals that run the tap tests.  On jacana and fairywren the new test
> fails, but with a somewhat confusing problem:
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jacana=2020-10-01%2015%3A32%3A34
> Bail out!  aborting wait: program timed out
> # stream contents: >>data
> # (0 rows)
> # <<
> # pattern searched for: (?m-xis:^\\(0 rows\\)$)
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren=2020-10-01%2014%3A12%3A13
> Bail out!  aborting wait: program timed out
> stream contents: >>data
> (0 rows)
> <<
> pattern searched for: (?^m:^\\(0 rows\\)$)
>
> I don't know with the -xis indicates on jacana, and why it's not present
> on fairywren. Nor do I know why the pattern doesn't match in the first
> place, sure looks like it should?
>
> Andrew, do you have an insight into how mingw's regex match differs
> from native windows and proper unixoid systems? I guess it's somewhere
> around line endings or such?
>
> Jacana successfully deals with 013_crash_restart.pl, which does use the
> same mechanis as the new 021_row_visibility.pl - I think the only real
> difference is that I used ^ and $ in the regexes in the latter...


My strong suspicion is that we're getting unwanted CRs. Note the
presence of numerous instances of this in PostgresNode.pm:

$stdout =~ s/\r\n/\n/g if $Config{osname} eq 'msys';

So you probably want something along those lines at the top of the loop
in send_query_and_wait:

$$psql{stdout} =~ s/\r\n/\n/g if $Config{osname} eq 'msys';

possibly also for stderr, just to make it more futureproof, and at the
top of the file:

use Config;


Do you want me to test that first?


The difference between the canonical way perl states the regex is due to
perl version differences. It shouldn't matter.


cheers


andrew


-- 
Andrew Dunstanhttps://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services






buildfarm animal shoveler failing with "Illegal instruction"

2020-10-01 Thread Andres Freund
Hi Mark,

shoveler has been failing for a while with an odd error. E.g.
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=shoveler=2020-09-18%2014%3A01%3A48

Illegal instruction
pg_dumpall: error: pg_dump failed on database "template1", exiting
waiting for server to shut down done

None of the changes in that time frame look like they're likely causing
illegal instructions to be emitted that weren't before. So I am
wondering if anything changed on that machine around  2020-09-18
14:01:48 ?

Greetings,

Andres Freund




Re: Rejecting redundant options in Create Collation

2020-10-01 Thread Tom Lane
"Daniel Verite"  writes:
> Currently, it's not an error for CREATE COLLATION to be invoked
> with options repeated several times. The last (rightmost) value is kept
> and the others are lost.
> ...
> I suggest the attached simple patch to raise an error when any of
> these options is specified multiple times.

Hmm ... I think that that is pretty standard behavior for a lot of
our utility commands.  Trying something at random,

regression=# create operator <<< (leftarg = int, leftarg = int,
regression(# rightarg = int99, procedure = int4pl, rightarg = int);
CREATE OPERATOR

Note the lack of any comment on the flat-out-wrong first value
for rightarg :-(

I can see the argument for complaining about repeated options rather
than letting this pass.  But shouldn't we try to make it uniform
instead of letting different commands have different policies?

regards, tom lane




Rejecting redundant options in Create Collation

2020-10-01 Thread Daniel Verite
  Hi,

Currently, it's not an error for CREATE COLLATION to be invoked
with options repeated several times. The last (rightmost) value is kept
and the others are lost.
For instance CREATE COLLATION x (lc_ctype='en_US.UTF8',
lc_collate='en_US.UTF8', lc_ctype='C')
silently ignores lc_ctype='en_US.UTF8'. But that kind of invocation
isn't likely to be legit. It's more plausible that it's the result of
some mistake or confusion.
The same goes for the other options:

CREATE COLLATION [ IF NOT EXISTS ] name (
[ LOCALE = locale, ]
[ LC_COLLATE = lc_collate, ]
[ LC_CTYPE = lc_ctype, ]
[ PROVIDER = provider, ]
[ DETERMINISTIC = boolean, ]
[ VERSION = version ]
)

I suggest the attached simple patch to raise an error when any of
these options is specified multiple times.


Best regards,
-- 
Daniel Vérité
PostgreSQL-powered mailer: https://www.manitou-mail.org
Twitter: @DanielVerite
diff --git a/src/backend/commands/collationcmds.c 
b/src/backend/commands/collationcmds.c
index 9f6582c530..7dfd544396 100644
--- a/src/backend/commands/collationcmds.c
+++ b/src/backend/commands/collationcmds.c
@@ -108,6 +108,19 @@ DefineCollation(ParseState *pstate, List *names, List 
*parameters, bool if_not_e
break;
}
 
+   if (*defelp != NULL)
+   {
+   /*
+* If the option was previously set, it means that it 
occurs
+* several times in the list, which is not allowed.
+*/
+   ereport(ERROR,
+   (errcode(ERRCODE_SYNTAX_ERROR),
+errmsg("conflicting or redundant 
options",
+   defel->defname),
+parser_errposition(pstate, 
defel->location)));
+   break;
+   }
*defelp = defel;
}
 


Re: WIP: BRIN multi-range indexes

2020-10-01 Thread Tomas Vondra

On Wed, Sep 30, 2020 at 07:57:19AM -0400, John Naylor wrote:

On Mon, Sep 28, 2020 at 10:12 PM Tomas Vondra
 wrote:


Is it actually all that different from the existing BRIN indexes?
Consider this example:

create table x (a text, b text, c text);

create index on x using brin (a,b,c);

create or replace function random_str(p_len int) returns text as $$
select string_agg(x, '') from (select chr(1 + (254 * random())::int ) as x from 
generate_series(1,$1)) foo;
$$ language sql;

test=# insert into x select random_str(1000), random_str(1000), 
random_str(1000);
ERROR:  index row size 9056 exceeds maximum 8152 for index "x_a_b_c_idx"


Hmm, okay. As for which comes first, insert or index creation, I'm
baffled, too. I also would expect the example above would take up a
bit over 6000 bytes, but not 9000.



OK, so this seems like a data corruption bug in BRIN, actually.

The ~9000 bytes is actually about right, because the strings are in
UTF-8 so roughly 1.5B per character seems about right. And we have 6
values to store (3 columns, min/max for each), so 6 * 1500 = 9000.

The real question is how come INSERT + CREATE INDEX actually manages to
create an index tuple. And the answer is pretty simple - brin_form_tuple
kinda ignores toasting, happily building index tuples where some values
are toasted.

Consider this:

create table x (a text, b text, c text);
insert into x select random_str(1000), random_str(1000), random_str(1000);

create index on x using brin (a,b,c);
delete from x;
vacuum x;

set enable_seqscan=off;

insert into x select random_str(10), random_str(10), random_str(10);
ERROR:  missing chunk number 0 for toast value 16530 in pg_toast_16525

explain analyze select * from x where a = 'xxx';
ERROR:  missing chunk number 0 for toast value 16530 in pg_toast_16525

select * from brin_page_items(get_raw_page('x_a_b_c_idx', 2), 
'x_a_b_c_idx'::regclass);
ERROR:  missing chunk number 0 for toast value 16547 in pg_toast_16541


Interestingly enough, running the select before the insert seems to be
working - not sure why.

Anyway, it behaves like this since 9.5 :-(


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Improving connection scalability: GetSnapshotData()

2020-10-01 Thread Andres Freund
Hi Ian, Andrew, All,

On 2020-09-30 15:43:17 -0700, Andres Freund wrote:
> Attached is an updated version of the test (better utility function,
> stricter regexes, bailing out instead of failing just the current when
> psql times out).  I'm leaving it in this test for now, but it's fairly
> easy to use this way, in my opinion, so it may be worth moving to
> PostgresNode at some point.

I pushed this yesterday. Ian, thanks again for finding this and helping
with fixing & testing.


Unfortunately currently some buildfarm animals don't like the test for
reasons I don't quite understand. Looks like it's all windows + msys
animals that run the tap tests.  On jacana and fairywren the new test
fails, but with a somewhat confusing problem:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jacana=2020-10-01%2015%3A32%3A34
Bail out!  aborting wait: program timed out
# stream contents: >>data
# (0 rows)
# <<
# pattern searched for: (?m-xis:^\\(0 rows\\)$)

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren=2020-10-01%2014%3A12%3A13
Bail out!  aborting wait: program timed out
stream contents: >>data
(0 rows)
<<
pattern searched for: (?^m:^\\(0 rows\\)$)

I don't know with the -xis indicates on jacana, and why it's not present
on fairywren. Nor do I know why the pattern doesn't match in the first
place, sure looks like it should?

Andrew, do you have an insight into how mingw's regex match differs
from native windows and proper unixoid systems? I guess it's somewhere
around line endings or such?

Jacana successfully deals with 013_crash_restart.pl, which does use the
same mechanis as the new 021_row_visibility.pl - I think the only real
difference is that I used ^ and $ in the regexes in the latter...

Greetings,

Andres Freund




Re: small cleanup: unify scanstr() functions

2020-10-01 Thread John Naylor
On Thu, Oct 1, 2020 at 11:19 AM Tom Lane  wrote:
> I'm unsure this topic is worth messing with.  But if we do so, I'd kind
> of like to move scanstr() out of parser/scansup.c.  Since it's not used
> by the core scanner, only bootstrap, it seems out of place there; and
> it's confusing because somebody coming across it for the first time
> would reasonably assume that it does have something to do with how
> we lex normal SQL strings.
>
> Of course, that just begs the question of where to put it instead.
> But since guc-file.l lives in utils/misc/, in or beside that file
> does not seem that awful.

A ringing endorsement if I ever heard one. ;-) Seems fine if it
remains in guc-file.l. Cosmetic considerations are

- The declaration in guc.h should fit in with code that's already
there. I put it next to the *Parse* functions.
- There seem to be a few unused headers included into bootscanner.l.
To keep "#include guc.h" from looking as strange as those, I added a
comment. It might be worth removing those extra includes, but that's
for another day.

> We might consider renaming it to something a shade less generic, too.
> I have no immediate suggestions on that score.

I've named it DeescapeQuotedString to see how that looks.

> In short: maybe instead of what you have here, leave GUC_scanstr()
> alone except for a possible rename; make bootscanner.l use that;
> and drop the now-unused scanstr().

v2 done that way.

-- 
John Naylorhttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


v2-0001-Remove-duplicate-scanstr-function.patch
Description: Binary data


Improve choose_custom_plan for initial partition prune case

2020-10-01 Thread Andy Fan
Given the plan example:

CREATE TABLE measurement (
city_id int not null,
logdate date not null,
peaktempint,
unitsales   int
) PARTITION BY RANGE (logdate);

CREATE TABLE measurement_y2006m02 PARTITION OF measurement
FOR VALUES FROM ('2006-02-01') TO ('2006-03-01');

CREATE TABLE measurement_y2006m03 PARTITION OF measurement
FOR VALUES FROM ('2006-03-01') TO ('2006-04-01');

prepare s as select * from measurement where logdate = $1;
execute s('2006-02-01').

The generic plan will probably not be chosen because it doesn't reduce the
cost
which can be reduced at initial_prune while the custom plan reduces such
cost
at  planning time. which makes the cost comparison not fair.  I'm thinking
if we can
get an estimated cost reduction of initial_prunne for generic plan based on
the
partition pruned at plan time from custom plan and then reducing
such costs from the generic plan.  I just went through the related code but
didn't write anything now.  I'd like to see if this is a correct direction
to go.

-- 
Best Regards
Andy Fan


Re: [PATCH] Automatic HASH and LIST partition creation

2020-10-01 Thread Anastasia Lubennikova

On 30.09.2020 22:58, Rahila Syed wrote:

Hi Anastasia,

I tested the syntax with some basic commands and it works fine, 
regression tests also pass.



Thank you for your review.

Couple of comments:
1. The syntax used omits the { IMMEDIATE | DEFERRED} keywords 
suggested in
the earlier discussions. I think it is intuitive to include IMMEDIATE 
with the current implementation
so that the syntax can be extended with a  DEFERRED clause in future 
for dynamic partitions.


  CREATE TABLE tbl_lst (i int) PARTITION BY LIST (i)
 CONFIGURATION (values in (1, 2), (3, 4) DEFAULT PARTITION
tbl_default);

After some consideration, I decided that we don't actually need to 
introduce IMMEDIATE | DEFERRED keyword. For hash and list partitions it 
will always be immediate, as the number of partitions cannot change 
after we initially set it. For range partitions, on the contrary, it 
doesn't make much sense to make partitions immediately, because in many 
use-cases one bound will be open.


2. One suggestion for generation of partition names is to append a 
unique id to

avoid conflicts.


Can you please give an example of such a conflict? I agree that current 
naming scheme is far from perfect, but I think that 'tablename'_partnum 
provides unique name for each partition.




3. Probably, here you mean to write list and hash instead of range and 
list as

per the current state.

     
     Range and list partitioning also support automatic creation
of partitions
      with an optional CONFIGURATION clause.
    

4. Typo in default_part_name

+VALUES IN ( partition_bound_expr [, ...] ), [(
partition_bound_expr
[, ...] )] [, ...] [DEFAULT PARTITION defailt_part_name]
+MODULUS numeric_literal



Yes, you're right. I will fix these typos in next version of the patch.


Thank you,
Rahila Syed



--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: Retry Cached Remote Connections for postgres_fdw in case remote backend gets killed/goes away

2020-10-01 Thread Bharath Rupireddy
On Thu, Oct 1, 2020 at 8:10 PM Fujii Masao  wrote:
>
> pg_stat_clear_snapshot() can be used to reset the entry.
>

Thanks. I wasn't knowing it.

>
> +   EXIT WHEN proccnt = 0;
> +END LOOP;
>
> Isn't it better to sleep here, to avoid th busy loop?
>

+1.

>
> So what I thought was something like
>
> CREATE OR REPLACE PROCEDURE wait_for_backend_termination()
> LANGUAGE plpgsql
> AS $$
> BEGIN
>  LOOP
>  PERFORM * FROM pg_stat_activity WHERE application_name = 
> 'fdw_retry_check';
>  EXIT WHEN NOT FOUND;
>  PERFORM pg_sleep(1), pg_stat_clear_snapshot();
>  END LOOP;
> END;
> $$;
>

Changed.

Attaching v8 patch, please review it.. Both make check and make
check-world passes on v8.

I have another question not related to this patch: though we have
wait_pid() function, we are not able to use it like
pg_terminate_backend() in other modules, wouldn't be nice if we can
make it generic under the name pg_wait_pid() and usable across all pg
modules?


With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com


v8-Retry-Cached-Remote-Connections-For-postgres_fdw.patch
Description: Binary data


Should walsernder check correctness of WAL records?

2020-10-01 Thread Konstantin Knizhnik

Hi hackers,

Investigating one of customer's support cases I found out that walsender 
is not calculating WAL records CRC and send them to replicas without any 
checks.

As a result damaged WAL record causes errors on all replicas:

    LOG: incorrect resource manager data checksum in record at 
5FB9/D199F7D8

    FATAL: terminating walreceiver process due to administrator command

I wonder if it will be better to detect this problem earlier at master?
We can try to recover damaged WAL record (it is not always possible, but...)
Or at least do not advance replication slots and make it possible for 
DBA to restore corrupted WAL segment from archive and resume replication.


And right now the only choice is to restore replicas using basebackup 
which may take significant amount of time (for larger database).

And during this time master will not be protected from failures.

Or extra overhead of computing CRC in WAL sender is assumed to be to high?

Sorry, if this question was already discussed - I failed to find it in 
the archive.


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: small cleanup: unify scanstr() functions

2020-10-01 Thread Tom Lane
John Naylor  writes:
> In the attached, the GUC_scanstr() function body is moved to scansup.c
> to replace scanstr(), but with a different order of if-statements to
> make the diff smaller. Since we have control over what goes in the BKI
> file, we can use single-quoted escape strings there, allowing removal
> of special case code for double quotes.

I'm unsure this topic is worth messing with.  But if we do so, I'd kind
of like to move scanstr() out of parser/scansup.c.  Since it's not used
by the core scanner, only bootstrap, it seems out of place there; and
it's confusing because somebody coming across it for the first time
would reasonably assume that it does have something to do with how
we lex normal SQL strings.

Of course, that just begs the question of where to put it instead.
But since guc-file.l lives in utils/misc/, in or beside that file
does not seem that awful.

We might consider renaming it to something a shade less generic, too.
I have no immediate suggestions on that score.

In short: maybe instead of what you have here, leave GUC_scanstr()
alone except for a possible rename; make bootscanner.l use that;
and drop the now-unused scanstr().

regards, tom lane




Re: Retry Cached Remote Connections for postgres_fdw in case remote backend gets killed/goes away

2020-10-01 Thread Fujii Masao




On 2020/10/01 21:14, Bharath Rupireddy wrote:

On Wed, Sep 30, 2020 at 11:32 PM Fujii Masao
 wrote:



And another way, if we don't want to use wait_pid() is to have a
plpgsql stored procedure, that in a loop keeps on checking for the
backed pid from pg_stat_activity, exit when pid is 0. and then proceed
to issue the next foreign table query. Thoughts?


+1 for this approach! We can use plpgsql or DO command.



Used plpgsql procedure as we have to use the procedure 2 times.





mypid = -1;
while (mypid != 0)
SELECT pid INTO mypid FROM pg_stat_activity WHERE backend_type =
'client backend' AND application_name = 'fdw_retry_check';


Or we can just wait for the number of processes with
appname='fdw_retry_check' to be zero rather than checking the pid.



Done.

Attaching v7 patch with above changes. Please review it.


Thanks for updating the patch!

+-- committed the txn. The entry of the terminated backend from pg_stat_activity
+-- would be removed only after the txn commit.

pg_stat_clear_snapshot() can be used to reset the entry.

+   EXIT WHEN proccnt = 0;
+END LOOP;

Isn't it better to sleep here, to avoid th busy loop?

So what I thought was something like

CREATE OR REPLACE PROCEDURE wait_for_backend_termination()
LANGUAGE plpgsql
AS $$
BEGIN
LOOP
PERFORM * FROM pg_stat_activity WHERE application_name = 
'fdw_retry_check';
EXIT WHEN NOT FOUND;
PERFORM pg_sleep(1), pg_stat_clear_snapshot();
END LOOP;
END;
$$;

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Logical replication CPU-bound with TRUNCATE/DROP/CREATE many tables

2020-10-01 Thread Dilip Kumar
On Thu, Oct 1, 2020 at 4:44 PM Amit Kapila  wrote:
>
> On Thu, Oct 1, 2020 at 10:12 AM Dilip Kumar  wrote:
> >
> > On Wed, Sep 30, 2020 at 3:00 PM Amit Kapila  wrote:
> > >
> > > On Wed, Sep 30, 2020 at 12:16 PM Dilip Kumar  
> > > wrote:
> >
> > > And you might need to update the below comment as well:
> > > typedef struct ReorderBufferTXN
> > > {
> > > ..
> > > /*
> > > * List of ReorderBufferChange structs, including new Snapshots and new
> > > * CommandIds
> > > */
> > > dlist_head changes;
> > >
>
> You forgot to address the above comment.

Fixed

> Few other comments:
> ==
> 1.
> void
>  ReorderBufferAddInvalidations(ReorderBuffer *rb, TransactionId xid,
> @@ -2813,12 +2830,27 @@ ReorderBufferAddInvalidations(ReorderBuffer
> *rb, TransactionId xid,
> SharedInvalidationMessage *msgs)
>  {
>   ReorderBufferTXN *txn;
> + MemoryContext oldcontext;
> + ReorderBufferChange *change;
>
>   txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
>
> + oldcontext = MemoryContextSwitchTo(rb->context);
> +
> + change = ReorderBufferGetChange(rb);
> + change->action = REORDER_BUFFER_CHANGE_INVALIDATION;
> + change->data.inval.ninvalidations = nmsgs;
> + change->data.inval.invalidations = (SharedInvalidationMessage *)
> + MemoryContextAlloc(rb->context,
> +sizeof(SharedInvalidationMessage) * nmsgs);
> + memcpy(change->data.inval.invalidations, msgs,
> +sizeof(SharedInvalidationMessage) * nmsgs);
> +
> + ReorderBufferQueueChange(rb, xid, lsn, change, false);
> +
>   /*
> - * We collect all the invalidations under the top transaction so that we
> - * can execute them all together.
> + * Additionally, collect all the invalidations under the top transaction so
> + * that we can execute them all together.  See comment atop this function
>   */
>   if (txn->toptxn)
>   txn = txn->toptxn;
> @@ -2830,8 +2862,7 @@ ReorderBufferAddInvalidations(ReorderBuffer *rb,
> TransactionId xid,
>   {
>   txn->ninvalidations = nmsgs;
>   txn->invalidations = (SharedInvalidationMessage *)
>
> Here what is the need for using MemoryContextAlloc, can't we directly
> use palloc? Also, isn't it better to store the invalidation in txn
> before queuing it for change because doing so can lead to the
> processing of this and other changes accumulated till that point
> before recording the same in txn. As such, there is no problem with it
> but still, I think if any error happens while processing those changes
> we would be able to clear the cache w.r.t this particular
> invalidation.

Fixed

> 2.
> XXX Do we need to care about relcacheInitFileInval and
> + * the other fields added to ReorderBufferChange, or just
> + * about the message itself?
>
> I don't think we need this comment in the patch.
>
> 3.
> - * This needs to be called for each XLOG_XACT_INVALIDATIONS message and
> - * accumulates all the invalidation messages in the toplevel transaction.
> - * This is required because in some cases where we skip processing the
> - * transaction (see ReorderBufferForget), we need to execute all the
> - * invalidations together.
> + * This needs to be called for each XLOG_XACT_INVALIDATIONS message.  The
> + * invalidation messages will be added in the reorder buffer as a change as
> + * well as all the invalidations will be accumulated under the toplevel
> + * transaction.  We collect them as a change so that during decoding, we can
> + * execute only those invalidations which are specific to the command instead
> + * of executing all the invalidations, OTOH after decoding is complete or on
> + * transaction abort (see ReorderBufferForget) we need to execute all the
> + * invalidations to avoid any cache pollution so it is better to keep them
> + * together
>
> Can we rewrite the comment as below?
> This needs to be called for each XLOG_XACT_INVALIDATIONS message and
> accumulates all the invalidation messages in the toplevel transaction
> as well as in the form of change in reorder buffer. We require to
> record it in form of change so that we can execute only required
> invalidations instead of executing all the invalidations on each
> CommandId increment. We also need to accumulate these in top-level txn
> because in some cases where we skip processing the transaction (see
> ReorderBufferForget), we need to execute all the invalidations
> together.

Done

> 4.
> +void ReorderBufferAddInvalidation(ReorderBuffer *, TransactionId,
> XLogRecPtr lsn,
> +   int nmsgs, SharedInvalidationMessage *msgs);
> As pointed by Keisuke-San as well, this is not required.

Fixed

> 5. Can you please once repeat the performance test done by Keisuke-San
> to see if you have similar observations? Additionally, see if you are
> also seeing the inconsistency related to the Truncate message reported
> by him and if so why?
>

Okay, I will set up and do this test early next week.  Keisuke-San,
can you send me your complete test script?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Manager for commit fest 2020-09

2020-10-01 Thread Tom Lane
Michael Paquier  writes:
> And one month later, here we are with the following results and a CF
> now closed:
> Committed: 48
> Moved to next CF: 147
> Withdrawn: 4
> Rejected: 3
> Returned with Feedback: 33
> Total: 235

Once again, thanks for doing all that tedious work!

regards, tom lane




Re: Transactions involving multiple postgres foreign servers, take 2

2020-10-01 Thread Masahiko Sawada
On Wed, 30 Sep 2020 at 16:02, tsunakawa.ta...@fujitsu.com
 wrote:
>
> From: Masahiko Sawada 
> > To avoid misunderstanding, I didn't mean to disregard the performance.
> > I mean especially for the transaction management feature it's
> > essential to work fine even in failure cases. So I hope we have a
> > safe, robust, and probably simple design for the first version that
> > might be low performance yet though but have a potential for
> > performance improvement and we will be able to try to improve
> > performance later.
>
> Yes, correctness (safety?) is a basic premise.  I understand that given the 
> time left for PG 14, we haven't yet given up a sound design that offers 
> practical or normally expected performance.  I don't think the design has not 
> well thought yet to see if it's simple or complex.  At least, I don't believe 
> doing "send commit request, perform commit on a remote server, and wait for 
> reply" sequence one transaction at a time in turn is what this community (and 
> other DBMSs) tolerate.  A kid's tricycle is safe, but it's not safe to ride a 
> tricycle on the road.  Let's not rush to commit and do our best!

Okay. I'd like to resolve my concern that I repeatedly mentioned and
we don't find a good solution yet. That is, how we handle errors
raised by FDW transaction callbacks during committing/rolling back
prepared foreign transactions. Actually, this has already been
discussed before[1] and we concluded at that time that using a
background worker to commit/rolling back foreign prepared transactions
is the best way.

Anyway, let me summarize the discussion on this issue so far. With
your idea, after the local commit, the backend process directly call
transaction FDW API to commit the foreign prepared transactions.
However, it's likely to happen an error (i.g. ereport(ERROR)) during
that due to various reasons. It could be an OOM by memory allocation,
connection error whatever. In case an error happens during committing
prepared foreign transactions, the user will get the error but it's
too late. The local transaction and possibly other foreign prepared
transaction have already been committed. You proposed the first idea
to avoid such a situation that FDW implementor can write the code
while trying to reduce the possibility of errors happening as much as
possible, for example by usingpalloc_extended(MCXT_ALLOC_NO_OOM) and
hash_search(HASH_ENTER_NULL) but I think it's not a comprehensive
solution. They might miss, not know it, or use other functions
provided by the core that could lead an error. Another idea is to use
PG_TRY() and PG_CATCH(). IIUC with this idea, FDW implementor catches
an error but ignores it rather than rethrowing by PG_RE_THROW() in
order to return the control to the core after an error. I’m really not
sure it’s a correct usage of those macros. In addition, after
returning to the core, it will retry to resolve the same or other
foreign transactions. That is, after ignoring an error, the core needs
to continue working and possibly call transaction callbacks of other
FDW implementations.

Regards,

[1] 
https://www.postgresql.org/message-id/CA%2BTgmoY%3DVkHrzXD%3Djw5DA%2BPp-ePW_6_v5n%2BTJk40s5Q9VXY-Pw%40mail.gmail.com

-- 
Masahiko Sawadahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Why does PostgresNode.pm set such a low value of max_wal_senders?

2020-10-01 Thread Tom Lane
Michael Paquier  writes:
> On Wed, Sep 30, 2020 at 10:38:59PM -0700, Noah Misch wrote:
>> In favor of minimal values, we've had semaphore-starved buildfarm members in
>> the past.  Perhaps those days are over, seeing that this commit has not yet
>> broken a buildfarm member in that particular way.  Keeping max_wal_senders=10
>> seems fine.

> Indeed, I am not spotting anything suspicious here.

Yeah, so far so good.  Note that PostgresNode.pm does attempt to cater for
semaphore-starved machines, by cutting max_connections as much as it can.
In practice the total semaphore usage of a subscription test is probably
still less than that of one postmaster with default max_connections.

>> No, PostgreSQL commit 54c2ecb changed that.  I recommend an explicit
>> max_wal_senders=10 in PostgresNode, which makes it easy to test
>> wal_level=minimal:

> Ah, thanks, I have missed this piece.  So we really need to have a
> value set in this module after all.

Agreed, I'll go put it back.

On the other point, I think that we should continue to complain
about max_wal_senders > 0 with wal_level = minimal.  If we reduce
that to a LOG message, which'd be the net effect of trying to be
laxer, people wouldn't see it and would then wonder why they can't
start replication.

regards, tom lane




Re: enable_incremental_sort changes query behavior

2020-10-01 Thread James Coleman
On Thu, Oct 1, 2020 at 3:09 AM Jaime Casanova
 wrote:
>
> On Wed, 30 Sep 2020 at 21:21, James Coleman  wrote:
> >
> > On Sat, Sep 26, 2020 at 2:49 PM Jaime Casanova
> >  wrote:
> > >
> > > Hi,
> > >
> > > With sqlsmith I found a query that gives this error:
> > > ERROR:  ORDER/GROUP BY expression not found in targetlist
> > >
> [...]
> > >
> > > But if I set enable_incremental_sort to off the query gets executed
> > > without problems (attached the explain produced for that case)
> >
> > Thanks for the report.
> >
>
> Hi,
>
> by experiment I reduced the query to this
>
> --- 0 ---
> select distinct
> subq_0.c1 as c0,
> case when (true = pg_catalog.pg_rotate_logfile_old()) then
> ref_0.t else ref_0.t
> end
>  as c4
> from
>   public.ref_0,
>   lateral (select
>
> ref_0.i as c1
>   from
> generate_series(1, 100) as ref_1) as subq_0
> --- 0 ---
>
> the only custom table already needed can be created with this commands:
>
> --- 0 ---
> create table ref_0 as select repeat('abcde', (random() * 10)::integer)
> t, random() * 1000 i from generate_series(1, 50);
> create index on ref_0 (i);
> analyze ref_0 ;
> --- 0 ---
>
>
> > Is there by an chance an index on ref_0.radi_text_temp?
> >
>
> there is an index involved but not on that field, commands above
> create the index in the right column... after that, ANALYZE the table
>
> > And if you set enable_hashagg = off what plan do you get (or error)?
> >
>
> same error

I was able to reproduce the error without incremental sort enabled
(i.e., it happens with a full sort also). The function call in the
SELECT doesn't have to be in a case expression; for example I was able
to reproduce changing that to `random()::text || ref_0.t`.

It looks like the issue happens when:
1. The sort happens within a parallel node.
2. One of the sort keys is an expression containing a volatile
function call and a column from the lateral join.

Here are the settings I used with your above repro case to show it
with regular sort:

 enable_hashagg=off
 enable_incremental_sort=off
 enable_seqscan=off
 parallel_setup_cost=10
 parallel_tuple_cost=0

The plan (obtained by replacing the volatile function with a stable one):

 Unique
   ->  Nested Loop
 ->  Gather Merge
   Workers Planned: 2
   ->  Sort
 Sort Key: ref_0.i, (md5(ref_0.t))
 ->  Parallel Index Scan using ref_0_i_idx on ref_0
 ->  Function Scan on generate_series ref_1

Changing `md5(ref_0.t)` to `random()::text || ref_0.t` causes the error.

I haven't been able to dig further than that yet, but my intuition is
to poke around in the parallel query machinery?

James




RE: [Patch] Optimize dropping of relation buffers using dlist

2020-10-01 Thread k.jami...@fujitsu.com
On Thursday, October 1, 2020 4:52 PM, Tsunakawa-san wrote:
 
> From: Kyotaro Horiguchi 
> > I thought that the advantage of this optimization is that we don't
> > need to visit all buffers?  If we need to run a full-scan for any
> > reason, there's no point in looking-up already-visited buffers again.
> > That's just wastefull cycles.  Am I missing somethig?
> >
> > I don't understand. If we chose to the optimized dropping, the reason
> > is the number of buffer lookup is fewer than a certain threashold. Why
> > do you think that the fork kind a buffer belongs to is relevant to the
> > criteria?
> 
> I rethought about this, and you certainly have a point, but...  OK, I think I
> understood.  I should have thought in a complicated way.  In other words,
> you're suggesting "Let's simply treat all forks as one relation to determine
> whether to optimize," right?  That is, the code simple becomes:
> 
> Sums up the number of buffers to invalidate in all forks; if (the cached sizes
> of all forks are valid && # of buffers to invalidate < THRESHOLD) {
>   do the optimized way;
>   return;
> }
> do the traditional way;
> 
> This will be simple, and I'm +1.

This is actually close to the v18 I posted trying Horiguchi-san's approach, but 
that
patch had bug. So attached is an updated version (v20) trying this approach 
again.
I hope it's bug-free this time.

Regards,
Kirk Jamison
 


v20-Optimize-DropRelFileNodeBuffers-during-recovery.patch
Description: v20-Optimize-DropRelFileNodeBuffers-during-recovery.patch


v1-Prevent-invalidating-blocks-in-smgrextend-during-recovery.patch
Description:  v1-Prevent-invalidating-blocks-in-smgrextend-during-recovery.patch


Re: Retry Cached Remote Connections for postgres_fdw in case remote backend gets killed/goes away

2020-10-01 Thread Bharath Rupireddy
On Wed, Sep 30, 2020 at 11:32 PM Fujii Masao
 wrote:
>
> > And another way, if we don't want to use wait_pid() is to have a
> > plpgsql stored procedure, that in a loop keeps on checking for the
> > backed pid from pg_stat_activity, exit when pid is 0. and then proceed
> > to issue the next foreign table query. Thoughts?
>
> +1 for this approach! We can use plpgsql or DO command.
>

Used plpgsql procedure as we have to use the procedure 2 times.

>
> >
> > mypid = -1;
> > while (mypid != 0)
> > SELECT pid INTO mypid FROM pg_stat_activity WHERE backend_type =
> > 'client backend' AND application_name = 'fdw_retry_check';
>
> Or we can just wait for the number of processes with
> appname='fdw_retry_check' to be zero rather than checking the pid.
>

Done.

Attaching v7 patch with above changes. Please review it.



With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
From e2a85116116e79c76a7b0fcfc1a25eff57627ed4 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Thu, 1 Oct 2020 15:16:26 +0530
Subject: [PATCH v7] Retry Cached Remote Connections For postgres_fdw

Remote connections are cached for postgres_fdw in the local
backend. There are high chances that the remote backend may
not be available i.e. it can get killed, the subsequent foreign
queries from local backend/session that use cached connection
will fail as the remote backend would have become unavailable.

This patch proposes a retry mechanism. Local backend/session
uses cached connection and if it receives connection bad error,
it deletes the cached entry and retry to get a new connection.
This retry happens only at the start of remote xact.
---
 contrib/postgres_fdw/connection.c | 55 +---
 .../postgres_fdw/expected/postgres_fdw.out| 85 +++
 contrib/postgres_fdw/sql/postgres_fdw.sql | 64 ++
 3 files changed, 192 insertions(+), 12 deletions(-)

diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index 08daf26fdf..441124963f 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -108,6 +108,7 @@ PGconn *
 GetConnection(UserMapping *user, bool will_prep_stmt)
 {
 	bool		found;
+	volatile bool retry_conn = false;
 	ConnCacheEntry *entry;
 	ConnCacheKey key;
 
@@ -159,23 +160,25 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 	/* Reject further use of connections which failed abort cleanup. */
 	pgfdw_reject_incomplete_xact_state_change(entry);
 
+retry:
 	/*
 	 * If the connection needs to be remade due to invalidation, disconnect as
-	 * soon as we're out of all transactions.
+	 * soon as we're out of all transactions. Also, if previous attempt to
+	 * start new remote transaction failed on the cached connection, disconnect
+	 * it to retry a new connection.
 	 */
-	if (entry->conn != NULL && entry->invalidated && entry->xact_depth == 0)
+	if ((entry->conn != NULL && entry->invalidated &&
+		 entry->xact_depth == 0) || retry_conn)
 	{
-		elog(DEBUG3, "closing connection %p for option changes to take effect",
-			 entry->conn);
+		if (retry_conn)
+			elog(DEBUG3, "closing connection %p to reestablish a new one",
+ entry->conn);
+		else
+			elog(DEBUG3, "closing connection %p for option changes to take effect",
+ entry->conn);
 		disconnect_pg_server(entry);
 	}
 
-	/*
-	 * We don't check the health of cached connection here, because it would
-	 * require some overhead.  Broken connection will be detected when the
-	 * connection is actually used.
-	 */
-
 	/*
 	 * If cache entry doesn't have a connection, we have to establish a new
 	 * connection.  (If connect_pg_server throws an error, the cache entry
@@ -206,9 +209,37 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 	}
 
 	/*
-	 * Start a new transaction or subtransaction if needed.
+	 * We check the health of the cached connection here, while the remote xact
+	 * is going to begin.  Broken connection will be detected and the
+	 * connection is retried. We do this only once during each begin remote
+	 * xact call, if the connection is still broken after the retry attempt,
+	 * then the query will fail.
 	 */
-	begin_remote_xact(entry);
+	PG_TRY();
+	{
+		/* Start a new transaction or subtransaction if needed. */
+		begin_remote_xact(entry);
+		retry_conn = false;
+	}
+	PG_CATCH();
+	{
+		if (!entry->conn ||
+			PQstatus(entry->conn) != CONNECTION_BAD ||
+			entry->xact_depth > 0 ||
+			retry_conn)
+			PG_RE_THROW();
+		retry_conn = true;
+	}
+	PG_END_TRY();
+
+	if (retry_conn)
+	{
+		ereport(DEBUG3,
+(errmsg("could not start remote transaction on connection %p",
+ entry->conn)),
+ errdetail_internal("%s", pchomp(PQerrorMessage(entry->conn;
+		goto retry;
+	}
 
 	/* Remember if caller will prepare statements */
 	entry->have_prep_stmt |= will_prep_stmt;
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 10e23d02ed..09d0578d05 100644
--- 

small cleanup: unify scanstr() functions

2020-10-01 Thread John Naylor
Hi,

Looking at the guc file code, GUC_scanstr() is almost the same as the
exported function scanstr(), except the latter requires finicky extra
coding around double quotes both in its callers and while creating the
input.

In the attached, the GUC_scanstr() function body is moved to scansup.c
to replace scanstr(), but with a different order of if-statements to
make the diff smaller. Since we have control over what goes in the BKI
file, we can use single-quoted escape strings there, allowing removal
of special case code for double quotes.

--
John Naylorhttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 93fce2a68674e1f3ec8999b15c8526535e12fffa Mon Sep 17 00:00:00 2001
From: John Naylor 
Date: Mon, 14 Sep 2020 11:29:43 -0400
Subject: [PATCH] Unify scanstr() implementations

Use single-quoted escaped strings in the BKI file, the same as those
used for GUC parameters. Rewrite scanstr() to match GUC_scanstr() and
remove the latter. Get rid of special case coding around double quotes
both in the callers of scanstr() and while formatting the input.
---
 doc/src/sgml/bki.sgml   |  6 +-
 src/backend/bootstrap/bootscanner.l |  9 +--
 src/backend/catalog/genbki.pl   |  6 +-
 src/backend/parser/scansup.c| 34 +-
 src/backend/utils/misc/guc-file.l   | 97 +
 src/bin/initdb/initdb.c | 34 +++---
 6 files changed, 35 insertions(+), 151 deletions(-)

diff --git a/doc/src/sgml/bki.sgml b/doc/src/sgml/bki.sgml
index 4e696d1d3e..b7f0bccf35 100644
--- a/doc/src/sgml/bki.sgml
+++ b/doc/src/sgml/bki.sgml
@@ -752,7 +752,7 @@ $ perl  rewrite_dat_with_prokind.pl  pg_proc.dat
next token that syntactically cannot belong to the preceding
command starts a new one.  (Usually you would put a new command on
a new line, for clarity.)  Tokens can be certain key words, special
-   characters (parentheses, commas, etc.), numbers, or double-quoted
+   characters (parentheses, commas, etc.), numbers, or single-quoted
strings.  Everything is case sensitive.
   
 
@@ -876,7 +876,7 @@ $ perl  rewrite_dat_with_prokind.pl  pg_proc.dat
  
   NULL values can be specified using the special key word
   _null_.  Values that do not look like
-  identifiers or digit strings must be double quoted.
+  identifiers or digit strings must be single-quoted.
  
 

@@ -1046,7 +1046,7 @@ $ perl  rewrite_dat_with_prokind.pl  pg_proc.dat
 
 create test_table 420 (oid = oid, cola = int4, colb = text)
 open test_table
-insert ( 421 1 "value1" )
+insert ( 421 1 'value 1' )
 insert ( 422 2 _null_ )
 close test_table
 
diff --git a/src/backend/bootstrap/bootscanner.l b/src/backend/bootstrap/bootscanner.l
index 1048e70d05..452dc067d0 100644
--- a/src/backend/bootstrap/bootscanner.l
+++ b/src/backend/bootstrap/bootscanner.l
@@ -66,7 +66,7 @@ static int	yyline = 1;			/* line number for error reporting */
 
 
 id		[-A-Za-z0-9_]+
-sid		\"([^\"])*\"
+sid		\'([^']|\'\')*\'
 
 /*
  * Keyword tokens return the keyword text (as a constant string) in yylval.kw,
@@ -120,14 +120,11 @@ NOT{ yylval.kw = "NOT"; return XNOT; }
 NULL			{ yylval.kw = "NULL"; return XNULL; }
 
 {id}			{
-	yylval.str = scanstr(yytext);
+	yylval.str = pstrdup(yytext);
 	return ID;
 }
 {sid}			{
-	/* leading and trailing quotes are not passed to scanstr */
-	yytext[strlen(yytext) - 1] = '\0';
-	yylval.str = scanstr(yytext+1);
-	yytext[strlen(yytext)] = '"';	/* restore yytext */
+	yylval.str = scanstr(yytext);
 	return ID;
 }
 
diff --git a/src/backend/catalog/genbki.pl b/src/backend/catalog/genbki.pl
index dc5f442397..ef3105af44 100644
--- a/src/backend/catalog/genbki.pl
+++ b/src/backend/catalog/genbki.pl
@@ -845,17 +845,15 @@ sub print_bki_insert
 		# since that represents a NUL char in C code.
 		$bki_value = '' if $bki_value eq '\0';
 
-		# Handle single quotes by doubling them, and double quotes by
-		# converting them to octal escapes, because that's what the
+		# Handle single quotes by doubling them, because that's what the
 		# bootstrap scanner requires.  We do not process backslashes
 		# specially; this allows escape-string-style backslash escapes
 		# to be used in catalog data.
 		$bki_value =~ s/'/''/g;
-		$bki_value =~ s/"/\\042/g;
 
 		# Quote value if needed.  We need not quote values that satisfy
 		# the "id" pattern in bootscanner.l, currently "[-A-Za-z0-9_]+".
-		$bki_value = sprintf(qq'"%s"', $bki_value)
+		$bki_value = sprintf("'%s'", $bki_value)
 		  if length($bki_value) == 0
 		  or $bki_value =~ /[^-A-Za-z0-9_]/;
 
diff --git a/src/backend/parser/scansup.c b/src/backend/parser/scansup.c
index cac70d5df7..f35ee91d64 100644
--- a/src/backend/parser/scansup.c
+++ b/src/backend/parser/scansup.c
@@ -23,8 +23,8 @@
 /* 
  *		scanstr
  *
- * if the string passed in has escaped codes, map the escape codes to actual
- * chars
+ * Strip the 

Re: Error code missing for "wrong length of inner sequence" error

2020-10-01 Thread Daniel Gustafsson
> On 1 Oct 2020, at 12:54, Heikki Linnakangas  wrote:

> Most checks when converting between SQL and Python types use the PLy_elog() 
> function, which uses the genericc ERRCODE_EXTERNAL_ROUTINE_EXCEPTION error 
> code, but I think ERRCODE_ARRAY_SUBSCRIPT_ERROR is better.

On that note, wouldn't the dimension check errors in PLySequence_ToArray be
just as well off using normal ereport()'s?  Only one of them seem to error out
in a way that could propagate an error from Python to postgres.

> Thoughts? If not, I'm going to add errcode(ERRCODE_ARRAY_SUBSCRIPT_ERROR) to 
> that.

+1 on using ERRCODE_ARRAY_SUBSCRIPT_ERROR as it better conveys meaning.

cheers ./daniel



Re: Logical replication CPU-bound with TRUNCATE/DROP/CREATE many tables

2020-10-01 Thread Amit Kapila
On Thu, Oct 1, 2020 at 10:12 AM Dilip Kumar  wrote:
>
> On Wed, Sep 30, 2020 at 3:00 PM Amit Kapila  wrote:
> >
> > On Wed, Sep 30, 2020 at 12:16 PM Dilip Kumar  wrote:
>
> > And you might need to update the below comment as well:
> > typedef struct ReorderBufferTXN
> > {
> > ..
> > /*
> > * List of ReorderBufferChange structs, including new Snapshots and new
> > * CommandIds
> > */
> > dlist_head changes;
> >

You forgot to address the above comment.

Few other comments:
==
1.
void
 ReorderBufferAddInvalidations(ReorderBuffer *rb, TransactionId xid,
@@ -2813,12 +2830,27 @@ ReorderBufferAddInvalidations(ReorderBuffer
*rb, TransactionId xid,
SharedInvalidationMessage *msgs)
 {
  ReorderBufferTXN *txn;
+ MemoryContext oldcontext;
+ ReorderBufferChange *change;

  txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);

+ oldcontext = MemoryContextSwitchTo(rb->context);
+
+ change = ReorderBufferGetChange(rb);
+ change->action = REORDER_BUFFER_CHANGE_INVALIDATION;
+ change->data.inval.ninvalidations = nmsgs;
+ change->data.inval.invalidations = (SharedInvalidationMessage *)
+ MemoryContextAlloc(rb->context,
+sizeof(SharedInvalidationMessage) * nmsgs);
+ memcpy(change->data.inval.invalidations, msgs,
+sizeof(SharedInvalidationMessage) * nmsgs);
+
+ ReorderBufferQueueChange(rb, xid, lsn, change, false);
+
  /*
- * We collect all the invalidations under the top transaction so that we
- * can execute them all together.
+ * Additionally, collect all the invalidations under the top transaction so
+ * that we can execute them all together.  See comment atop this function
  */
  if (txn->toptxn)
  txn = txn->toptxn;
@@ -2830,8 +2862,7 @@ ReorderBufferAddInvalidations(ReorderBuffer *rb,
TransactionId xid,
  {
  txn->ninvalidations = nmsgs;
  txn->invalidations = (SharedInvalidationMessage *)

Here what is the need for using MemoryContextAlloc, can't we directly
use palloc? Also, isn't it better to store the invalidation in txn
before queuing it for change because doing so can lead to the
processing of this and other changes accumulated till that point
before recording the same in txn. As such, there is no problem with it
but still, I think if any error happens while processing those changes
we would be able to clear the cache w.r.t this particular
invalidation.

2.
XXX Do we need to care about relcacheInitFileInval and
+ * the other fields added to ReorderBufferChange, or just
+ * about the message itself?

I don't think we need this comment in the patch.

3.
- * This needs to be called for each XLOG_XACT_INVALIDATIONS message and
- * accumulates all the invalidation messages in the toplevel transaction.
- * This is required because in some cases where we skip processing the
- * transaction (see ReorderBufferForget), we need to execute all the
- * invalidations together.
+ * This needs to be called for each XLOG_XACT_INVALIDATIONS message.  The
+ * invalidation messages will be added in the reorder buffer as a change as
+ * well as all the invalidations will be accumulated under the toplevel
+ * transaction.  We collect them as a change so that during decoding, we can
+ * execute only those invalidations which are specific to the command instead
+ * of executing all the invalidations, OTOH after decoding is complete or on
+ * transaction abort (see ReorderBufferForget) we need to execute all the
+ * invalidations to avoid any cache pollution so it is better to keep them
+ * together

Can we rewrite the comment as below?
This needs to be called for each XLOG_XACT_INVALIDATIONS message and
accumulates all the invalidation messages in the toplevel transaction
as well as in the form of change in reorder buffer. We require to
record it in form of change so that we can execute only required
invalidations instead of executing all the invalidations on each
CommandId increment. We also need to accumulate these in top-level txn
because in some cases where we skip processing the transaction (see
ReorderBufferForget), we need to execute all the invalidations
together.

4.
+void ReorderBufferAddInvalidation(ReorderBuffer *, TransactionId,
XLogRecPtr lsn,
+   int nmsgs, SharedInvalidationMessage *msgs);
As pointed by Keisuke-San as well, this is not required.

5. Can you please once repeat the performance test done by Keisuke-San
to see if you have similar observations? Additionally, see if you are
also seeing the inconsistency related to the Truncate message reported
by him and if so why?

-- 
With Regards,
Amit Kapila.




Error code missing for "wrong length of inner sequence" error

2020-10-01 Thread Heikki Linnakangas

In PLySequence_ToArray_recurse(), there's this check:


if (PySequence_Length(list) != dims[dim])
ereport(ERROR,
(errmsg("wrong length of inner sequence: has length 
%d, but %d was expected",
(int) PySequence_Length(list), 
dims[dim]),
 (errdetail("To construct a multidimensional array, 
the inner sequences must all have the same length.";


It's missing an error code, making it implicitly ERRCODE_INTERNAL_ERROR, 
which is surely not right. That's simple to fix, but what error code 
should we use?


I'm inclined to use ERRCODE_ARRAY_SUBSCRIPT_ERROR, because that's used 
in the similar error message with SQL ARRAY expression, like "ARRAY[[1], 
[2, 3]]". Most checks when converting between SQL and Python types use 
the PLy_elog() function, which uses the genericc 
ERRCODE_EXTERNAL_ROUTINE_EXCEPTION error code, but I think 
ERRCODE_ARRAY_SUBSCRIPT_ERROR is better.


You get this error e.g. if you try to return a python lists of lists 
from a PL/python function as a multi-dimensional array, but the sublists 
have different lengths:


create function return_array() returns int[] as $$
return [[1], [1,2]]
$$ language plpythonu;

postgres=# select return_array();
ERROR:  wrong length of inner sequence: has length 2, but 1 was expected
DETAIL:  To construct a multidimensional array, the inner sequences must 
all have the same length.

CONTEXT:  while creating return value
PL/Python function "return_array"


Thoughts? If not, I'm going to add 
errcode(ERRCODE_ARRAY_SUBSCRIPT_ERROR) to that.


- Heikki




Re: Logical replication CPU-bound with TRUNCATE/DROP/CREATE many tables

2020-10-01 Thread Amit Kapila
On Thu, Oct 1, 2020 at 2:43 PM Keisuke Kuroda
 wrote:
>
> > > I test the patch on the master HEAD(9796f455) and it works fine.
> > > * make installcheck-world: passed
> > > * walsender process continues to use 100% of the CPU 1core when
> > > TRUNCATE 1000 partition: 1s or less
> > > ** before patch : 230s
> >
> > Does this result indicate that it is still CPU bound but it does the
> > actual decoding and completes in 1s instead of spending 230s mainly to
> > execute unnecessary invalidations?
>
> Okay, to check the decodeing and invalidation times,
> I got the time to return the results to pg_recvlogical and perf.
>
> Before the patch, most of the processing was done
> by hash_seq_search in the ReferenodeMapInvalidateCallback.
> After the patch, this problem has been resolved.
>
> # test
>
> (1) start pg_recvlogical
> (2) INSERT to 2000 partition for creating RelfilenodemapHash
> (3) TRUNCATE 1000 partition
>
> (1)
> pg_recvlogical --create-slot --start -f - --if-not-exists \
> --plugin=test_decoding --slot=test1 --dbname=postgres --username=postgres \
> --option=include-timestamp | gawk '{ print strftime("%Y-%m-%d
> %H:%M:%S"), $0; fflush(); }'
>
> (2)
> DO $$
> DECLARE
> i INT;
> j INT;
> schema TEXT;
> tablename TEXT;
> BEGIN
> FOR i IN 1 .. 2 LOOP
> schema := 'nsp_' || to_char(i, 'FM000');
> tablename := 'tbl_' || to_char(i, 'FM000');
> EXECUTE 'INSERT INTO ' || schema || '.' || tablename || '
>  SELECT i,i,i,0,0,0,0,0,0,0,0,0,''a''::bytea,''a''::bytea FROM
> generate_series(0,,1) AS i';
> END LOOP;
> END;
> $$ LANGUAGE plpgsql;
>
> (3)
> TRUNCATE TABLE nsp_001.tbl_001;
>
> # before
>
> (3) TRUNCATE
> decode + invalidation = 222s
>
> 2020-10-01 16:58:29 BEGIN 529
> 2020-10-01 17:02:11 COMMIT 529 (at 2020-10-01 16:58:29.281747+09)
>
> (Before the patch was applied, I was concerned that
> the "table nsp_001.tbl_001 ... TRUNCATE" was not outputting.)
>

Why is it that before the patch these were not displayed and after the
patch, we started to see this?

-- 
With Regards,
Amit Kapila.




Re: Improving connection scalability: GetSnapshotData()

2020-10-01 Thread Craig Ringer
On Tue, 15 Sep 2020 at 07:17, Andres Freund  wrote:
>
> Hi,
>
> On 2020-09-09 17:02:58 +0900, Ian Barwick wrote:
> > Attached, though bear in mind I'm not very familiar with parts of this,
> > particularly 2PC stuff, so consider it educated guesswork.
>
> Thanks for this, and the test case!
>
> Your commit fixes the issues, but not quite correctly. Multixacts
> shouldn't matter, so we don't need to do anything there. And for the
> increases, I think they should be inside the already existing
> ProcArrayLock acquisition, as in the attached.
>
>
> I've also included a quite heavily revised version of your test. Instead
> of using dblink I went for having a long-running psql that I feed over
> stdin. The main reason for not liking the previous version is that it
> seems fragile, with the sleep and everything.  I expanded it to cover
> 2PC is as well.
>
> The test probably needs a bit of cleanup, wrapping some of the
> redundancy around the pump_until calls.
>
> I think the approach of having a long running psql session is really
> useful, and probably would speed up some tests. Does anybody have a good
> idea for how to best, and without undue effort, to integrate this into
> PostgresNode.pm?  I don't really have a great idea, so I think I'd leave
> it with a local helper in the new test?

2ndQ has some infra for that and various other TAP enhancements that
I'd like to try to upstream. I'll ask what I can share and how.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 2ndQuadrant - PostgreSQL Solutions for the Enterprise




Re: Logical replication CPU-bound with TRUNCATE/DROP/CREATE many tables

2020-10-01 Thread Keisuke Kuroda
> > I test the patch on the master HEAD(9796f455) and it works fine.
> > * make installcheck-world: passed
> > * walsender process continues to use 100% of the CPU 1core when
> > TRUNCATE 1000 partition: 1s or less
> > ** before patch : 230s
>
> Does this result indicate that it is still CPU bound but it does the
> actual decoding and completes in 1s instead of spending 230s mainly to
> execute unnecessary invalidations?

Okay, to check the decodeing and invalidation times,
I got the time to return the results to pg_recvlogical and perf.

Before the patch, most of the processing was done
by hash_seq_search in the ReferenodeMapInvalidateCallback.
After the patch, this problem has been resolved.

# test

(1) start pg_recvlogical
(2) INSERT to 2000 partition for creating RelfilenodemapHash
(3) TRUNCATE 1000 partition

(1)
pg_recvlogical --create-slot --start -f - --if-not-exists \
--plugin=test_decoding --slot=test1 --dbname=postgres --username=postgres \
--option=include-timestamp | gawk '{ print strftime("%Y-%m-%d
%H:%M:%S"), $0; fflush(); }'

(2)
DO $$
DECLARE
i INT;
j INT;
schema TEXT;
tablename TEXT;
BEGIN
FOR i IN 1 .. 2 LOOP
schema := 'nsp_' || to_char(i, 'FM000');
tablename := 'tbl_' || to_char(i, 'FM000');
EXECUTE 'INSERT INTO ' || schema || '.' || tablename || '
 SELECT i,i,i,0,0,0,0,0,0,0,0,0,''a''::bytea,''a''::bytea FROM
generate_series(0,,1) AS i';
END LOOP;
END;
$$ LANGUAGE plpgsql;

(3)
TRUNCATE TABLE nsp_001.tbl_001;

# before

(3) TRUNCATE
decode + invalidation = 222s

2020-10-01 16:58:29 BEGIN 529
2020-10-01 17:02:11 COMMIT 529 (at 2020-10-01 16:58:29.281747+09)

(Before the patch was applied, I was concerned that
the "table nsp_001.tbl_001 ... TRUNCATE" was not outputting.)

Samples: 228K of event 'cpu-clock:uhH', Event count (approx.): 5708075
  Overhead  Command   Shared Object   Symbol
-   96.00%  postgres  postgres[.] hash_seq_search
 hash_seq_search
 - RelfilenodeMapInvalidateCallback
  - LocalExecuteInvalidationMessage
ReorderBufferExecuteInvalidations
ReorderBufferProcessTXN
ReorderBufferCommit
DecodeCommit
DecodeXactOp
LogicalDecodingProcessRecord
XLogSendLogical
WalSndLoop
StartLogicalReplication
exec_replication_command
PostgresMain
BackendRun
BackendStartup
ServerLoop
PostmasterMain
main
__libc_start_main
_start
+0.77%  postgres  postgres[.] LocalExecuteInvalidationMessage
 0.47%  postgres  postgres[.] hash_bytes_uint32
 0.42%  postgres  postgres[.] hash_search_with_hash_value
 0.37%  postgres  postgres[.] RelfilenodeMapInvalidateCallback
 0.32%  postgres  postgres[.] CatCacheInvalidate
 0.23%  postgres  postgres[.] PlanCacheRelCallback
 0.23%  postgres  postgres[.] ReorderBufferExecuteInvalidations

# after

(3) TRUNCATE
decode + invalidation = 1s or less

2020-10-01 17:09:43 BEGIN 537
2020-10-01 17:09:43 table nsp_001.tbl_001, nsp_001.part_0001,
nsp_001.part_0002 ... nsp_001.part_0999, nsp_001.part_1000: TRUNCATE:
(no-flags)
2020-10-01 17:09:43 COMMIT 537 (at 2020-10-01 17:09:43.192989+09)

Samples: 337  of event 'cpu-clock:uhH', Event count (approx.): 8425
  Overhead  Command   Shared Object   Symbol
-   47.48%  postgres  postgres[.] hash_seq_search
 hash_seq_search
 RelfilenodeMapInvalidateCallback
 LocalExecuteInvalidationMessage
 ReorderBufferExecuteInvalidations
 ReorderBufferProcessTXN
 ReorderBufferCommit
 DecodeCommit
 DecodeXactOp
 LogicalDecodingProcessRecord
 XLogSendLogical
 WalSndLoop
 StartLogicalReplication
 exec_replication_command
 PostgresMain
 BackendRun
 BackendStartup
 ServerLoop
 PostmasterMain
 main
 __libc_start_main
 _start
+3.26%  postgres  postgres[.] heap_hot_search_buffer
+2.67%  postgres  postgres[.] pg_comp_crc32c_sse42
+2.08%  postgres  libc-2.17.so[.] __memcpy_ssse3_back
+2.08%  postgres  postgres[.] DecodeXLogRecord
+2.08%  postgres  postgres[.] hash_search_with_hash_value
+1.48%  postgres  libpthread-2.17.so  [.] __libc_recv
+1.19%  postgres  libc-2.17.so[.] __memset_sse2
+1.19%  postgres  libpthread-2.17.so  [.] __errno_location
+1.19%  postgres  postgres[.] LocalExecuteInvalidationMessage
+1.19%  postgres  postgres[.] ReadPageInternal
+1.19%  postgres  postgres[.] XLogReadRecord
+1.19%  postgres  postgres[.] socket_is_send_pending
 hash_seq_search

Best Regards,
-- 
Keisuke Kuroda
NTT Software Innovation Center
keisuke.kuroda.3...@gmail.com




RE: Disable WAL logging to speed up data loading

2020-10-01 Thread osumi.takami...@fujitsu.com
Hi, Horiguchi-San and Fujii-San.


Thank you so much both of you.
> > the table needs to be rewriitten. One idea for that is to improve that
> > command so that it skips the table rewrite if wal_level=minimal.
> > Of course, also you can change wal_level after marking the table as
> > unlogged.
> 
> tablecmd.c:
The idea is really interesting.
I didn't come up with getting rid of the whole copy of
the ALTER TABLE UNLOGGED/LOGGED commands
only when wal_level='minimal'.

> > * There are two reasons for requiring a rewrite when changing
> > * persistence: on one hand, we need to ensure that the buffers
> > * belonging to each of the two relations are marked with or without
> > * BM_PERMANENT properly.  On the other hand, since rewriting creates
> > * and assigns a new relfilenode, we automatically create or drop an
> > * init fork for the relation as appropriate.
Thanks for sharing concrete comments in the source code.

> According to this comment, perhaps we can do that at least for
> wal_level=minimal.
When I compare the 2 ideas,
one of the benefits of this ALTER TABLE 's improvement 
is that we can't avoid the downtime
while that of wal_level='none' provides an easy and faster
major version up via output file of pg_dumpall.
Both ideas have good points.
However, actually to modify ALTER TABLE's copy
looks far more difficult than wal_level='none' and
beyond my current ability.
So, I'd like to go forward with the direction of wal_level='none'.
Did you have strong objections for this direction ?


Regards,
Takamichi Osumi




Re: Why does PostgresNode.pm set such a low value of max_wal_senders?

2020-10-01 Thread Michael Paquier
On Wed, Sep 30, 2020 at 10:38:59PM -0700, Noah Misch wrote:
> In favor of minimal values, we've had semaphore-starved buildfarm members in
> the past.  Perhaps those days are over, seeing that this commit has not yet
> broken a buildfarm member in that particular way.  Keeping max_wal_senders=10
> seems fine.

Indeed, I am not spotting anything suspicious here.

> No, PostgreSQL commit 54c2ecb changed that.  I recommend an explicit
> max_wal_senders=10 in PostgresNode, which makes it easy to test
> wal_level=minimal:
> 
>   printf '%s\n%s\n%s\n' 'log_statement = all' 'wal_level = minimal' 
> 'max_wal_senders = 0' >/tmp/minimal.conf
>   make check-world TEMP_CONFIG=/tmp/minimal.conf
> 
> thorntail is doing the equivalent, hence the failures.

Ah, thanks, I have missed this piece.  So we really need to have a
value set in this module after all.
--
Michael


signature.asc
Description: PGP signature


Re: PATCH: Batch/pipelining support for libpq

2020-10-01 Thread Matthieu Garrigues
This patch fixes compilation on windows and compilation of the documentation.

Matthieu Garrigues

On Thu, Oct 1, 2020 at 8:41 AM Matthieu Garrigues
 wrote:
>
> On Thu, Oct 1, 2020 at 6:35 AM Michael Paquier  wrote:
>
> > The documentation is failing to build, and the patch does not build
> > correctly on Windows.  Could you address that?
> > --
> > Michael
>
> Yes I'm on it.
>
> --
> Matthieu
diff --git a/doc/src/sgml/libpq.sgml b/doc/src/sgml/libpq.sgml
index 92556c7ce0..932561d0c5 100644
--- a/doc/src/sgml/libpq.sgml
+++ b/doc/src/sgml/libpq.sgml
@@ -4829,6 +4829,465 @@ int PQflush(PGconn *conn);
 
  
 
+ 
+  Batch mode and query pipelining
+
+  
+   libpq
+   batch mode
+  
+
+  
+   libpq
+   pipelining
+  
+
+  
+   libpq supports queueing up queries into
+   a pipeline to be executed as a batch on the server. Batching queries allows
+   applications to avoid a client/server round-trip after each query to get
+   the results before issuing the next query.
+  
+
+  
+   When to use batching
+
+   
+Much like asynchronous query mode, there is no performance disadvantage to
+using batching and pipelining. It increases client application complexity
+and extra caution is required to prevent client/server deadlocks but
+can sometimes offer considerable performance improvements.
+   
+
+   
+Batching is most useful when the server is distant, i.e. network latency
+(ping time) is high, and when many small operations are being performed in
+rapid sequence. There is usually less benefit in using batches when each
+query takes many multiples of the client/server round-trip time to execute.
+A 100-statement operation run on a server 300ms round-trip-time away would take
+30 seconds in network latency alone without batching; with batching it may spend
+as little as 0.3s waiting for results from the server.
+   
+
+   
+Use batches when your application does lots of small
+INSERT, UPDATE and
+DELETE operations that can't easily be transformed into
+operations on sets or into a
+COPY operation.
+   
+
+   
+Batching is not useful when information from one operation is required by the
+client before it knows enough to send the next operation. The client must
+introduce a synchronisation point and wait for a full client/server
+round-trip to get the results it needs. However, it's often possible to
+adjust the client design to exchange the required information server-side.
+Read-modify-write cycles are especially good candidates; for example:
+
+ BEGIN;
+ SELECT x FROM mytable WHERE id = 42 FOR UPDATE;
+ -- result: x=2
+ -- client adds 1 to x:
+ UPDATE mytable SET x = 3 WHERE id = 42;
+ COMMIT;
+
+could be much more efficiently done with:
+
+ UPDATE mytable SET x = x + 1 WHERE id = 42;
+
+   
+
+   
+
+ The batch API was introduced in PostgreSQL 14.0, but clients using PostgresSQL 14.0 version of libpq can
+ use batches on server versions 7.4 and newer. Batching works on any server
+ that supports the v3 extended query protocol.
+
+   
+
+  
+
+  
+   Using batch mode
+
+   
+To issue batches the application must switch
+a connection into batch mode. Enter batch mode with PQenterBatchMode(conn) or test
+whether batch mode is active with PQbatchStatus(conn). In batch mode only asynchronous operations are permitted, and
+COPY is not recommended as it most likely will trigger failure in batch processing. 
+Using any synchronous command execution functions such as PQfn,
+PQexec or one of its sibling functions are error conditions.
+Functions allowed in batch mode are described in . 
+   
+
+   
+The client uses libpq's asynchronous query functions to dispatch work,
+marking the end of each batch with PQbatchSendQueue.
+And to get results, it uses PQgetResult. It may eventually exit
+batch mode with PQexitBatchMode once all results are
+processed.
+   
+
+   
+
+ It is best to use batch mode with libpq in
+ non-blocking mode. If used in
+ blocking mode it is possible for a client/server deadlock to occur. The
+ client will block trying to send queries to the server, but the server will
+ block trying to send results from queries it has already processed to the
+ client. This only occurs when the client sends enough queries to fill its
+ output buffer and the server's receive buffer before switching to
+ processing input from the server, but it's hard to predict exactly when
+ that'll happen so it's best to always use non-blocking mode.
+ Batch mode consumes more memory when send/recv is not done as required even in non-blocking mode.
+
+   
+
+   
+Issuing queries
+
+
+ After entering batch mode the application dispatches requests
+ using normal asynchronous libpq functions such as 
+ PQsendQueryParams, PQsendPrepare,
+ PQsendQueryPrepared, 

RE: [Patch] Optimize dropping of relation buffers using dlist

2020-10-01 Thread tsunakawa.ta...@fujitsu.com
From: Kyotaro Horiguchi 
> I thought that the advantage of this optimization is that we don't
> need to visit all buffers?  If we need to run a full-scan for any
> reason, there's no point in looking-up already-visited buffers
> again. That's just wastefull cycles.  Am I missing somethig?
> 
> I don't understand. If we chose to the optimized dropping, the reason
> is the number of buffer lookup is fewer than a certain threashold. Why
> do you think that the fork kind a buffer belongs to is relevant to the
> criteria?

I rethought about this, and you certainly have a point, but...  OK, I think I 
understood.  I should have thought in a complicated way.  In other words, 
you're suggesting "Let's simply treat all forks as one relation to determine 
whether to optimize," right?  That is, the code simple becomes:

Sums up the number of buffers to invalidate in all forks;
if (the cached sizes of all forks are valid && # of buffers to invalidate < 
THRESHOLD)
{
do the optimized way;
return;
}
do the traditional way;

This will be simple, and I'm +1.


Regards
Takayuki Tsunakawa







Re: Protect syscache from bloating with negative cache entries

2020-10-01 Thread Kyotaro Horiguchi
At Thu, 1 Oct 2020 13:37:29 +0900, Michael Paquier  wrote 
in 
> On Wed, Jan 22, 2020 at 02:38:19PM +0900, Kyotaro Horiguchi wrote:
> > I changed my mind to attach the benchmark patch as .txt file,
> > expecting the checkers not picking up it as a part of the patchset.
> > 
> > I have in the precise performance measurement mode for a long time,
> > but I think it's settled. I'd like to return to normal mode and
> > explain this patch.
> 
> Looking at the CF bot, this patch set does not compile properly.
> Could you look at that?

It is complaining that TimestampDifference is implicitly used. I'm not
sure the exact cause but maybe some refactoring in header file
inclusion caused that.

This is the rebased version.

Thanks!

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 07ce4cf303a365c3f3efc3f133551a67e4646875 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Fri, 10 Jan 2020 15:02:26 +0900
Subject: [PATCH v2 1/3] base_change

Adds struct members needed by catcache expiration feature and a GUC
variable that controls the behavior of the feature. But no substantial
code is not added yet.

If existence of some variables alone can cause degradation,
benchmarking after this patch shows that.
---
 src/backend/access/transam/xact.c  |  3 +++
 src/backend/utils/cache/catcache.c | 15 +++
 src/backend/utils/misc/guc.c   | 13 +
 src/include/utils/catcache.h   | 20 
 4 files changed, 51 insertions(+)

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index af6afcebb1..1040713955 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1085,6 +1085,9 @@ ForceSyncCommit(void)
 static void
 AtStart_Cache(void)
 {
+	if (xactStartTimestamp != 0)
+		SetCatCacheClock(xactStartTimestamp);
+
 	AcceptInvalidationMessages();
 }
 
diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index 3613ae5f44..c5bad63aac 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -60,9 +60,18 @@
 #define CACHE_elog(...)
 #endif
 
+/*
+ * GUC variable to define the minimum age of entries that will be considered
+ * to be evicted in seconds. -1 to disable the feature.
+ */
+int catalog_cache_prune_min_age = 300;
+
 /* Cache management header --- pointer is NULL until created */
 static CatCacheHeader *CacheHdr = NULL;
 
+/* Clock for the last accessed time of a catcache entry. */
+TimestampTz	catcacheclock = 0;
+
 static inline HeapTuple SearchCatCacheInternal(CatCache *cache,
 			   int nkeys,
 			   Datum v1, Datum v2,
@@ -99,6 +108,12 @@ static void CatCacheFreeKeys(TupleDesc tupdesc, int nkeys, int *attnos,
 static void CatCacheCopyKeys(TupleDesc tupdesc, int nkeys, int *attnos,
 			 Datum *srckeys, Datum *dstkeys);
 
+/* GUC assign function */
+void
+assign_catalog_cache_prune_min_age(int newval, void *extra)
+{
+	catalog_cache_prune_min_age = newval;
+}
 
 /*
  *	internal support functions
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 596bcb7b84..e39c756f80 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -88,6 +88,8 @@
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/bytea.h"
+#include "utils/catcache.h"
+#include "utils/guc_tables.h"
 #include "utils/float.h"
 #include "utils/guc_tables.h"
 #include "utils/memutils.h"
@@ -2358,6 +2360,17 @@ static struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"catalog_cache_prune_min_age", PGC_USERSET, RESOURCES_MEM,
+			gettext_noop("System catalog cache entries that live unused for longer than this seconds are considered for removal."),
+			gettext_noop("The value of -1 turns off pruning."),
+			GUC_UNIT_S
+		},
+		_cache_prune_min_age,
+		300, -1, INT_MAX,
+		NULL, assign_catalog_cache_prune_min_age, NULL
+	},
+
 	/*
 	 * We use the hopefully-safely-small value of 100kB as the compiled-in
 	 * default for max_stack_depth.  InitializeGUCOptions will increase it if
diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h
index f4aa316604..3d3870f05a 100644
--- a/src/include/utils/catcache.h
+++ b/src/include/utils/catcache.h
@@ -22,6 +22,7 @@
 
 #include "access/htup.h"
 #include "access/skey.h"
+#include "datatype/timestamp.h"
 #include "lib/ilist.h"
 #include "utils/relcache.h"
 
@@ -61,6 +62,7 @@ typedef struct catcache
 	slist_node	cc_next;		/* list link */
 	ScanKeyData cc_skey[CATCACHE_MAXKEYS];	/* precomputed key info for heap
 			 * scans */
+	TimestampTz	cc_oldest_ts;	/* timestamp of the oldest tuple in the hash */
 
 	/*
 	 * Keep these at the end, so that compiling catcache.c with CATCACHE_STATS
@@ -119,6 +121,8 @@ typedef struct catctup
 	bool		dead;			/* dead but not yet removed? */
 	bool		negative;		/* negative cache entry? */
 	HeapTupleData tuple;		/* tuple management header */
+	unsigned int naccess;		/* # of access to this 

Re: Add information to rm_redo_error_callback()

2020-10-01 Thread Michael Paquier
On Thu, Sep 24, 2020 at 03:03:46PM +0900, Michael Paquier wrote:
> Hmm.  I still think that knowing at least about a FPW could be an
> interesting piece of information even here.  Anyway, instead of
> copying a logic that exists already in xlog_outrec(), why not moving
> the block information print into a separate routine out of the
> WAL_DEBUG section, and just reuse the same format for the context of
> the redo error callback?  That would also be more consistent with what
> we do in pg_waldump where we don't show the fork name of a block when
> it is on a MAIN_FORKNUM.  And this would avoid a third copy of the
> same logic.  If we add the XID, previous LSN and the record length
> on the stack of what is printed, we could just reuse the existing
> routine, still that's perhaps too much information displayed.

Seeing nothing, I took a swing at that, and finished with the
attached that refactors the logic and prints the block information as
wanted.  Any objections to that?
--
Michael
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 79a77ebbfe..e96075158a 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -940,6 +940,7 @@ static bool CheckForStandbyTrigger(void);
 #ifdef WAL_DEBUG
 static void xlog_outrec(StringInfo buf, XLogReaderState *record);
 #endif
+static void xlog_block_info(StringInfo buf, XLogReaderState *record);
 static void xlog_outdesc(StringInfo buf, XLogReaderState *record);
 static void pg_start_backup_callback(int code, Datum arg);
 static void pg_stop_backup_callback(int code, Datum arg);
@@ -10258,6 +10259,19 @@ xlog_outrec(StringInfo buf, XLogReaderState *record)
 	appendStringInfo(buf, "; len %u",
 	 XLogRecGetDataLen(record));
 
+	xlog_block_info(buf, record);
+}
+#endif			/* WAL_DEBUG */
+
+/*
+ * Returns a string giving information about all the blocks in an
+ * XLogRecord.
+ */
+static void
+xlog_block_info(StringInfo buf, XLogReaderState *record)
+{
+	int block_id;
+
 	/* decode block references */
 	for (block_id = 0; block_id <= record->max_block_id; block_id++)
 	{
@@ -10284,7 +10298,6 @@ xlog_outrec(StringInfo buf, XLogReaderState *record)
 			appendStringInfoString(buf, " FPW");
 	}
 }
-#endif			/* WAL_DEBUG */
 
 /*
  * Returns a string describing an XLogRecord, consisting of its identity
@@ -11765,6 +11778,7 @@ rm_redo_error_callback(void *arg)
 
 	initStringInfo();
 	xlog_outdesc(, record);
+	xlog_block_info(, record);
 
 	/* translator: %s is a WAL record description */
 	errcontext("WAL redo at %X/%X for %s",


signature.asc
Description: PGP signature


Re: enable_incremental_sort changes query behavior

2020-10-01 Thread Jaime Casanova
On Wed, 30 Sep 2020 at 21:21, James Coleman  wrote:
>
> On Sat, Sep 26, 2020 at 2:49 PM Jaime Casanova
>  wrote:
> >
> > Hi,
> >
> > With sqlsmith I found a query that gives this error:
> > ERROR:  ORDER/GROUP BY expression not found in targetlist
> >
[...]
> >
> > But if I set enable_incremental_sort to off the query gets executed
> > without problems (attached the explain produced for that case)
>
> Thanks for the report.
>

Hi,

by experiment I reduced the query to this

--- 0 ---
select distinct
subq_0.c1 as c0,
case when (true = pg_catalog.pg_rotate_logfile_old()) then
ref_0.t else ref_0.t
end
 as c4
from
  public.ref_0,
  lateral (select

ref_0.i as c1
  from
generate_series(1, 100) as ref_1) as subq_0
--- 0 ---

the only custom table already needed can be created with this commands:

--- 0 ---
create table ref_0 as select repeat('abcde', (random() * 10)::integer)
t, random() * 1000 i from generate_series(1, 50);
create index on ref_0 (i);
analyze ref_0 ;
--- 0 ---


> Is there by an chance an index on ref_0.radi_text_temp?
>

there is an index involved but not on that field, commands above
create the index in the right column... after that, ANALYZE the table

> And if you set enable_hashagg = off what plan do you get (or error)?
>

same error

-- 
Jaime Casanova  www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Get memory contexts of an arbitrary backend process

2020-10-01 Thread Kasahara Tatsuhito
Hi,

On Fri, Sep 25, 2020 at 4:28 PM torikoshia  wrote:
> Thanks for all your comments, I updated the patch.
Thanks for updating the patch.
I did a brief test and code review.

> I added a shared hash table consisted of minimal members
> mainly for managing whether the file is dumped or not.
> Some members like 'loc' seem useful in the future, but I
> haven't added them since it's not essential at this point.
Yes, that would be good.

+/*
+ * Since we allow only one session can request to dump
memory context at
+ * the same time, check whether the dump files already exist.
+ */
+while (stat(dumpfile, _tmp) == 0 || stat(tmpfile, _tmp) == 0)
+{
+pg_usleep(100L);
+}

If pg_get_backend_memory_contexts() is executed by two or more
sessions at the same time, it cannot be run exclusively in this way.
Currently it seems to cause a crash when do it so.
This is easy to reproduce and can be done as follows.

[session-1]
BEGIN;
LOCK TABKE t1;
  [Session-2]
  BEGIN;
  LOCK TABLE t1; <- waiting
[Session-3]
select * FROM pg_get_backend_memory_contexts();
  [Session-4]
  select * FROM pg_get_backend_memory_contexts();

If you issue commit or abort at session-1, you will get SEGV.

Instead of checking for the existence of the file, it might be better
to use a hash (mcxtdumpHash) entry with LWLock.

+if (proc == NULL)
+{
+ereport(WARNING,
+(errmsg("PID %d is not a PostgreSQL server
process", dst_pid)));
+return (Datum) 1;
+}

Shouldn't it clear the hash entry before return?

+/* Wait until target process finished dumping file. */
+while (!entry->is_dumped)
+{
+CHECK_FOR_INTERRUPTS();
+pg_usleep(1L);
+}

If the target is killed and exit before dumping the memory
information, you're in an infinite loop here.
So how about making sure that any process that has to stop before
doing a memory dump changes the status of the hash (entry->is_dumped)
before stopping and the caller detects it?
I'm not sure it's best or not, but you might want to use something
like the on_shmem_exit callback.

In the current design, if the caller stops processing before reading
the dumped file, you will have an orphaned file.
It looks like the following.

[session-1]
BEGIN;
LOCK TABKE t1;
  [Session-2]
  BEGIN;
  LOCK TABLE t1; <- waiting
[Session-3]
select * FROM pg_get_backend_memory_contexts();

If you cancel or terminate the session-3, then issue commit or abort
at session-1, you will get orphaned files in pg_memusage.

So if you allow only one session can request to dump file, it could
call pg_memusage_reset() before send the signal in this function.

Best regards,

--
Tatsuhito Kasahara
kasahara.tatsuhito _at_ gmail.com




Re: Parallel copy

2020-10-01 Thread Amit Kapila
On Tue, Sep 29, 2020 at 3:16 PM Greg Nancarrow  wrote:
>
> Hi Vignesh and Bharath,
>
> Seems like the Parallel Copy patch is regarding RI_TRIGGER_PK as
> parallel-unsafe.
> Can you explain why this is?
>

I don't think we need to restrict this case and even if there is some
reason to do so then probably the same should be mentioned in the
comments.

-- 
With Regards,
Amit Kapila.




Re: Resetting spilled txn statistics in pg_stat_replication

2020-10-01 Thread Amit Kapila
On Thu, Oct 1, 2020 at 12:06 PM Amit Kapila  wrote:
>
> Thanks for the review.
>

oops, forgot to attach the updated patches, doing now.

-- 
With Regards,
Amit Kapila.


v7-0001-Track-statistics-for-spilling-of-changes-from-Reo.patch
Description: Binary data


v7-0002-Test-stats.patch
Description: Binary data


Re: Resetting spilled txn statistics in pg_stat_replication

2020-10-01 Thread Amit Kapila
On Wed, Sep 30, 2020 at 4:34 PM Dilip Kumar  wrote:
>
> On Wed, Sep 30, 2020 at 2:40 PM Amit Kapila  wrote:
> >
> > On Wed, Sep 30, 2020 at 1:12 PM Dilip Kumar  wrote:
> > >
> > > On Fri, Sep 25, 2020 at 4:33 PM Amit Kapila  
> > > wrote:
> > > >
> > > > On Thu, Sep 24, 2020 at 5:44 PM Amit Kapila  
> > > > wrote:
> > > >
> > > > I have done some more testing of this patch especially for the case
> > > > where we spill before streaming the transaction and found everything
> > > > is working as expected. Additionally, I have changed a few more
> > > > comments and ran pgindent. I am still not very sure whether we want to
> > > > display physical slots in this view as all the stats are for logical
> > > > slots but anyway we can add stats w.r.t physical slots in the future.
> > > > I am fine either way (don't show physical slots in this view or show
> > > > them but keep stats as 0). Let me know if you have any thoughts on
> > > > these points, other than that I am happy with the current state of the
> > > > patch.
> > >
> > > IMHO, It will make more sense to only show the logical replication
> > > slots in this view.
> > >
> >
> > Okay, Sawada-San, others, do you have any opinion on this matter?
>

I have changed so that view will only show logical replication slots
and adjusted the docs accordingly. I think we can easily extend it to
show physical replication slots if required in the future.

> I have started looking into this patch, I have a few comments.
>
> +Number of times transactions were spilled to disk. Transactions
> +may get spilled repeatedly, and this counter gets incremented on 
> every
> +such invocation.
> +  
> + 
> +
> +
> +  
> +   Tracking of spilled transactions works only for logical replication.  In
>
> The number of spaces used after the full stop is not uniform.
>

I have removed this sentence as it is not required as we only want to
show logical slots in the view and for that I have updated the other
parts of the doc.

> + /* update the statistics iff we have spilled anything */
> + if (spilled)
> + {
> + rb->spillCount += 1;
> + rb->spillBytes += size;
> +
> + /* Don't consider already serialized transactions. */
>
> Single line comments are not uniform,  "update the statistics" is
> starting is small letter and not ending with the full stop
> whereas 'Don't consider' is starting with capital and ending with full stop.
>

Actually, it doesn't matter in this case but I have changed to keep it
consistent.

>
> +
> + /*
> + * We set this flag to indicate if the transaction is ever serialized.
> + * We need this to accurately update the stats.
> + */
> + txn->txn_flags |= RBTXN_IS_SERIALIZED_CLEAR;
>
> I feel we can explain the exact scenario in the comment, i.e. after
> spill if we stream then we still
> need to know that it spilled in past so that we don't count this again
> as a new spilled transaction.
>

Okay, I have expanded the comment a bit more to explain this.

> old slot.  We send the drop
> + * and create messages while holding ReplicationSlotAllocationLock to
> + * reduce that possibility. If the messages reached in reverse, we would
> + * lose one statistics update message.  But
>
> Spacing after the full stop is not uniform.
>

Changed.

>
> + * Statistics about transactions spilled to disk.
> + *
> + * A single transaction may be spilled repeatedly, which is why we keep
> + * two different counters. For spilling, the transaction counter includes
> + * both toplevel transactions and subtransactions.
> + */
> + int64 spillCount; /* spill-to-disk invocation counter */
> + int64 spillTxns; /* number of transactions spilled to disk  */
> + int64 spillBytes; /* amount of data spilled to disk */
>
> Can we keep the order as spillTxns, spillTxns, spillBytes because
> every other place we kept like that
> so that way it will look more uniform.
>

Changed here and at one more place as per this suggestion.

> Other than that I did not see any problem.
>

Thanks for the review.

-- 
With Regards,
Amit Kapila.




RE: [Patch] Optimize dropping of relation buffers using dlist

2020-10-01 Thread tsunakawa.ta...@fujitsu.com
From: Jamison, Kirk/ジャミソン カーク 
> For non-recovery path, did you mean by any chance
> measuring the cache hit rate for varying shared_buffers?

No.  You can test the speed of DropRelFileNodeBuffers() during normal 
operation, i.e. by running TRUNCATE on psql, instead of performing recovery.  
To enable that, you can just remove the checks for recovery, i.e. removing the 
check if InRecovery and if the value is cached or not.



Regards
Takayuki Tsunakawa





Re: making update/delete of inheritance trees scale better

2020-10-01 Thread Amit Langote
Hi,

On Thu, Oct 1, 2020 at 1:32 PM Michael Paquier  wrote:
>
> On Fri, Sep 11, 2020 at 07:20:56PM +0900, Amit Langote wrote:
> > I have been working away at this and have updated the patches for many
> > cosmetic and some functional improvements.
>
> Please note that this patch set fails to apply.  Could you provide a
> rebase please?

Yeah, I'm working on posting an updated patch.

-- 
Amit Langote
EnterpriseDB: http://www.enterprisedb.com




Re: PATCH: Batch/pipelining support for libpq

2020-10-01 Thread Matthieu Garrigues
On Thu, Oct 1, 2020 at 6:35 AM Michael Paquier  wrote:

> The documentation is failing to build, and the patch does not build
> correctly on Windows.  Could you address that?
> --
> Michael

Yes I'm on it.

-- 
Matthieu