Re: calling procedures is slow and consumes much more memory than calling functions

2020-06-16 Thread Amit Khandekar
On Wed, 10 Jun 2020 at 17:12, Pavel Stehule  wrote:
> On Wed, 10 Jun 2020 at 12:26, Amit Khandekar wrote:
>> Could you show an example testcase that tests this recursive scenario,
>> with which your earlier patch fails the test, and this v2 patch passes
>> it ? I am trying to understand the recursive scenario and the re-use
>> of expr->plan.
>
>
> it hangs on the plpgsql tests. So you can apply the first version of
> the patch
>
> and run "make check"

I could not reproduce the make check hang with the v1 patch. But I
could see a crash with the below testcase. So I understand the purpose
of the plan_owner variable that you introduced in v2.

Consider this recursive test:

create or replace procedure p1(in r int) as $$
begin
   RAISE INFO 'r : % ', r;
   if r < 3 then
  call p1(r+1);
   end if;
end
$$ language plpgsql;

do $$
declare r int default 1;
begin
call p1(r);
end;
$$;

In p1() with r=2, when the stmt "call p1(r+1)" is being executed,
consider this code of exec_stmt_call() with your v2 patch applied:
if (expr->plan && !expr->plan->saved)
{
	if (plan_owner)
		SPI_freeplan(expr->plan);
	expr->plan = NULL;
}

Here, plan_owner is false. So SPI_freeplan() will not be called, and
expr->plan is set to NULL. Now I have observed that the stmt pointer
and expr pointer are shared between the p1() execution at the r=2
level and the p1() execution at the r=1 level. So after the above code
is executed at r=2, when the upper level (r=1) exec_stmt_call() lands
on the same code snippet, it gets the same expr pointer, but its
expr->plan has already been set to NULL without being freed. From this
logic, it looks like the plan won't get freed whenever the expr/stmt
pointers are shared across recursion levels, since expr->plan is set
to NULL at the lowermost level. Basically, the handle to the plan is
lost, so no one at the upper recursion level can explicitly free it
using SPI_freeplan(), right? This looks like the same problem as the
main issue, where the plan does not get freed for non-recursive calls.
I haven't had a chance to check whether we can develop a testcase for
this, similar to your testcase where the memory keeps increasing.
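For anyone who wants to probe this, here is a minimal sketch of such a
testcase (mine, not from this thread), assuming the analysis above is
right. Watch the backend's memory from the OS (e.g. top) while it runs;
if the plan handle is really lost on every recursive call, memory
should grow steadily across iterations:

do $$
declare i int;
begin
   for i in 1..100000 loop
      call p1(1);   -- recurses internally up to r=3
   end loop;
end;
$$;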

-Amit




Re: Transactions involving multiple postgres foreign servers, take 2

2020-06-16 Thread Amit Kapila
On Tue, Jun 16, 2020 at 6:43 PM Ashutosh Bapat
 wrote:
>
> On Tue, Jun 16, 2020 at 3:40 PM Amit Kapila  wrote:
> >
> >
> > Is there some mapping between GXID and XIDs allocated for each node or
> > will each node use the GXID as XID to modify the data?   Are we fine
> > with parking the work for global snapshots and atomic visibility to a
> > separate patch and just proceed with the design proposed by this
> > patch?
>
> Distributed transactions involve atomic commit, atomic visibility,
> and global consistency. 2PC is the only practical solution for atomic
> commit. There are some improvements over 2PC, but those are add-ons
> to the basic 2PC, which is what this patch provides. Atomic
> visibility and global consistency have alternative solutions, but all
> of those solutions require 2PC to be supported. Each of those is a
> large piece of work, and trying to get everything in may not work.
> Once we have basic 2PC in place, there will be ground to experiment
> with solutions for global consistency and atomic visibility. If we
> manage to do it right, we could make it pluggable as well.
>

I think that is easier said than done. If you want to make it pluggable,
or want alternative solutions to adopt the 2PC support we provide, we
should have some idea of what those alternative solutions look like.  I
am not saying we have to figure out each and every detail of those
solutions, but without paying any attention to the high-level picture
we might end up doing something for 2PC here which either needs a lot
of modifications or needs a design change, which would be bad.
Basically, if we later decide to use something like a Global Xid to
achieve other features, then what we are doing here might not work.

I think it is a good idea to complete the work in pieces where each
piece is useful on its own, but without clarity on the overall solution
that could be a recipe for disaster.  It is possible that you have some
idea in your mind of how this piece of work fits into the bigger
picture, but it is not very apparent to others, and it doesn't seem to
be documented anywhere.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Transactions involving multiple postgres foreign servers, take 2

2020-06-16 Thread Masahiko Sawada
On Wed, 17 Jun 2020 at 09:01, Masahiro Ikeda  wrote:
>
> > I've attached the new version patch set. 0006 is a separate patch
> > which introduces 'prefer' mode to foreign_twophase_commit.
>
> I hope we can use this feature. Thank you for the patches and the
> discussion.
> I'm currently working through the logic and found some minor points to
> be fixed.
>
> I'm sorry if my understanding is wrong.
>
> * The v22 patches need a rebase, as they don't apply to the current master.
>
> * FdwXactAtomicCommitParticipants, mentioned in
> src/backend/access/fdwxact/README,
>    is not implemented. Is FdwXactParticipants the right name?

Right.

>
> * The following comment says that this code is for "One-phase",
>    but the second argument of FdwXactParticipantEndTransaction() says
>    this code is not "onephase".
>
> AtEOXact_FdwXact() in fdwxact.c
> /* One-phase rollback foreign transaction */
> FdwXactParticipantEndTransaction(fdw_part, false, false);
>
> static void
> FdwXactParticipantEndTransaction(FdwXactParticipant *fdw_part, bool
> onephase,
> bool for_commit)
>
> * "two_phase_commit" option is mentioned in postgres-fdw.sgml,
> but I can't find related code.
>
> * A comment in resolver.c contains a sentence with two blanks
>    ("Emergency  Termination").
>
> * There are some inconsistency with PostgreSQL wiki.
> https://wiki.postgresql.org/wiki/Atomic_Commit_of_Distributed_Transactions
>
>    I understand it's difficult to keep them consistent; I think it's OK
>    to fix this later, when these patches are almost ready to be
>    committed.
>
>    - I can't find the "two_phase_commit" option in the source code.
>      But 2PC works if the remote server's "max_prepared_transactions"
>      is set to a non-zero value. That is the correct behavior, isn't it?

Yes. I removed the two_phase_commit option from postgres_fdw.
Currently, postgres_fdw uses 2PC whenever 2PC is required. Therefore,
max_prepared_transactions needs to be set to a non-zero value, as you
mentioned.
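For illustration, this is roughly the dance the patch automates for each
foreign server (a hand-written sketch, not code from the patch; the
transaction identifier is made up):

-- on each participant, instead of a plain COMMIT:
PREPARE TRANSACTION 'fx_100_n1';   -- needs max_prepared_transactions > 0
-- once every participant has prepared successfully:
COMMIT PREPARED 'fx_100_n1';
-- if any participant failed to prepare, roll the others back instead:
-- ROLLBACK PREPARED 'fx_100_n1';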

>
>    - some parameters are renamed or added in the latest patches:
>      max_prepared_foreign_transaction, max_prepared_transactions, and
>      so on.
>
>- typo: froeign_transaction_resolver_timeout
>

Thank you for your review! I've incorporated your comments on my
local branch. I'll share the latest version of the patch.

Also, I've updated the wiki page, and I'll try to keep it up to date.

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Transactions involving multiple postgres foreign servers, take 2

2020-06-16 Thread Amit Kapila
On Tue, Jun 16, 2020 at 8:06 PM Tatsuo Ishii  wrote:
>
> >> > I think the problem mentioned above can occur with this as well or if
> >> > I am missing something then can you explain in further detail how it
> >> > won't create problem in the scenario I have used above?
> >>
> >> So the problem you mentioned above is like this? (S1/S2 denotes
> >> transactions (sessions), N1/N2 is the postgreSQL servers).  Since S1
> >> already committed on N1, S2 sees the row on N1.  However S2 does not
> >> see the row on N2 since S1 has not committed on N2 yet.
> >>
> >
> > Yeah, something on these lines but S2 can execute the query on N1
> > directly which should fetch the data from both N1 and N2.
>
> The algorithm assumes that any client accesses the database through a
> middleware. Such direct access is prohibited.
>

Okay, so it seems we need a few things that the middleware (Pangea)
expects, if we are to follow the design of the paper.

> > Even if there is a solution using the REPEATABLE READ isolation
> > level, we might not prefer to use that as the only level for
> > distributed transactions (it might be too costly), but let us first
> > see how it solves the problem.
>
> The paper extends Snapshot Isolation (SI, which is the same as our
> REPEATABLE READ isolation level) to "Global Snapshot Isolation" (GSI).
> I think GSI will solve the problem (atomic visibility) we are
> discussing.
>
> Unlike READ COMMITTED, REPEATABLE READ acquires its snapshot at the
> time the first command is executed in a transaction (READ COMMITTED
> acquires a snapshot at each command in a transaction). Pangea controls
> the timing of the snapshot acquisition on each pair of transactions
> (S1/N1,N2 or S2/N1,N2) so that each pair acquires the same
> snapshot. To achieve this, while some transactions are trying to
> acquire snapshots, any commit operation is postponed. Likewise, any
> snapshot acquisition waits until all in-progress commit operations
> are finished (see Algorithms I to III in the paper for more
> details).
>

I haven't read the paper completely, but it sounds quite restrictive
(both commits and snapshot acquisitions need to wait).  Another point:
do we want some middleware involved in the solution?  The main thing I
am looking into at this stage is whether the current implementation
proposed by the patch for 2PC is generic enough that we would later be
able to integrate a solution for atomic visibility.
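To make the waiting concrete, here is a toy model of the exclusion
Pangea imposes (my sketch, not code from the paper or the patch; an
rwlock stands in for the paper's Algorithms I to III, and it
over-serializes commits against each other):

#include <pthread.h>

/* global snapshot acquisition and commit must never interleave */
static pthread_rwlock_t gate = PTHREAD_RWLOCK_INITIALIZER;

static void
acquire_global_snapshot(void)
{
	pthread_rwlock_rdlock(&gate);	/* snapshot takers may overlap */
	/* ... acquire snapshots on N1 and N2 ... */
	pthread_rwlock_unlock(&gate);
}

static void
commit_on_all_nodes(void)
{
	pthread_rwlock_wrlock(&gate);	/* blocks, and is blocked by, snapshots */
	/* ... commit on N1 and N2 ... */
	pthread_rwlock_unlock(&gate);
}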

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Review for GetWALAvailability()

2020-06-16 Thread Kyotaro Horiguchi
At Wed, 17 Jun 2020 10:17:07 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> At Tue, 16 Jun 2020 14:31:43 -0400, Alvaro Herrera  
> wrote in 
> > On 2020-Jun-16, Kyotaro Horiguchi wrote:
> > 
> > > I noticed another issue. If some required WAL segments are removed,
> > > the slot will be "invalidated", that is, restart_lsn is set to an
> > > invalid value. As a result we hardly ever see the "lost" state.
> > > 
> > > It can be "fixed" by remembering the validity of a slot separately
> > > from restart_lsn. Is that worth doing?
> > 
> > We discussed this before.  I agree it would be better to do this
> > in some way, but I fear that if we do it naively, some code might exist
> > that reads the LSN without realizing that it needs to check the validity
> > flag first.
> 
> Yes, that was my main concern about it. That's error-prone. How about
> remembering the LSN where the invalidation happened?  It's safe since
> nothing other than the slot-monitoring functions would look at
> last_invalidated_lsn. It can be reset if active_pid is a valid pid.
> 
> InvalidateObsoleteReplicationSlots:
>  ...
>   SpinLockAcquire(&s->mutex);
> + s->data.last_invalidated_lsn = s->data.restart_lsn;
>   s->data.restart_lsn = InvalidXLogRecPtr;
>   SpinLockRelease(&s->mutex);

The attached does that (PoC).  No documentation fix included.
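For reference, the resulting state can be watched from SQL (the
wal_status labels shown by the view may differ while the renaming
discussed here is still in flux):

SELECT slot_name, active, restart_lsn, wal_status
  FROM pg_replication_slots;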

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index d6fe205eb4..d3240d1e38 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -9485,20 +9485,25 @@ CreateRestartPoint(int flags)
  *		(typically a slot's restart_lsn)
  *
  * Returns one of the following enum values:
- * * WALAVAIL_NORMAL means targetLSN is available because it is in the range
- *   of max_wal_size.
  *
- * * WALAVAIL_PRESERVED means it is still available by preserving extra
+ * * WALAVAIL_RESERVED means targetLSN is available and it is in the range of
+ *   max_wal_size.
+ *
+ * * WALAVAIL_EXTENDED means it is still available by preserving extra
  *   segments beyond max_wal_size. If max_slot_wal_keep_size is smaller
  *   than max_wal_size, this state is not returned.
  *
+ * * WALAVAIL_BEING_REMOVED means it is being lost. The walsender using this
+ *   slot may return to the above.
+ *
  * * WALAVAIL_REMOVED means it is definitely lost. A replication stream on
  *   a slot with this LSN cannot continue.
  *
  * * WALAVAIL_INVALID_LSN means the slot hasn't been set to reserve WAL.
  */
 WALAvailability
-GetWALAvailability(XLogRecPtr targetLSN)
+GetWALAvailability(XLogRecPtr targetLSN, XLogSegNo last_removed_seg,
+   bool slot_is_active)
 {
 	XLogRecPtr	currpos;		/* current write LSN */
 	XLogSegNo	currSeg;		/* segid of currpos */
@@ -9509,7 +9514,11 @@ GetWALAvailability(XLogRecPtr targetLSN)
 	 * slot */
 	uint64		keepSegs;
 
-	/* slot does not reserve WAL. Either deactivated, or has never been active */
+	/*
+	 * slot does not reserve WAL. Either deactivated, or has never been active
+	 * The caller should have passed last_invalidated_lsn as targetLSN if the
+	 * slot has been invalidated.
+	 */
 	if (XLogRecPtrIsInvalid(targetLSN))
 		return WALAVAIL_INVALID_LSN;
 
@@ -9524,7 +9533,7 @@ GetWALAvailability(XLogRecPtr targetLSN)
 	 * the first WAL segment file since startup, which causes the status being
 	 * wrong under certain abnormal conditions but that doesn't actually harm.
 	 */
-	oldestSeg = XLogGetLastRemovedSegno() + 1;
+	oldestSeg = last_removed_seg + 1;
 
 	/* calculate oldest segment by max_wal_size and wal_keep_segments */
 	XLByteToSeg(currpos, currSeg, wal_segment_size);
@@ -9544,20 +9553,21 @@ GetWALAvailability(XLogRecPtr targetLSN)
 	 */
 	if (targetSeg >= oldestSeg)
 	{
-		/*
-		 * show "normal" when targetSeg is within max_wal_size, even if
-		 * max_slot_wal_keep_size is smaller than max_wal_size.
-		 */
-		if ((max_slot_wal_keep_size_mb <= 0 ||
-			 max_slot_wal_keep_size_mb >= max_wal_size_mb) &&
-			oldestSegMaxWalSize <= targetSeg)
-			return WALAVAIL_NORMAL;
-
-		/* being retained by slots */
-		if (oldestSlotSeg <= targetSeg)
+		/* show "reserved" when targetSeg is within max_wal_size */
+		if (oldestSegMaxWalSize <= targetSeg)
 			return WALAVAIL_RESERVED;
+
+		/* being retained by slots exceeding max_wal_size */
+		return WALAVAIL_EXTENDED;
 	}
 
+	/*
+	 * Some segments required by the slot have been lost, but if the
+	 * walsender using it is still active, it may return to one of the
+	 * states above.
+	 */
+	if (slot_is_active)
+		return WALAVAIL_BEING_REMOVED;
+
 	/* Definitely lost */
 	return WALAVAIL_REMOVED;
 }
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 505445f2dc..f141b29d28 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -285,6 +285,7 @@ ReplicationSlotCreate(const char *name, bool db_specific,
 	slot->candidate_xmin_lsn = InvalidXLogRecPtr;
 	

Re: Does TupleQueueReaderNext() really need to copy its result?

2020-06-16 Thread Thomas Munro
On Thu, May 14, 2020 at 10:55 PM Thomas Munro  wrote:
> On Tue, Aug 27, 2019 at 6:35 AM Andres Freund  wrote:
> > Perhaps this'd could be sidestepped by funneling through MinimalTuples
> > instead of HeapTuples. Afaict that should always be sufficient, because
> > all system column accesses ought to happen below (including being
> > projected into a separate column, if needed above). With the added
> > benefit of needing less space, of course.

Right, create_gather[_merge]_plan() does create_plan_recurse(...,
CP_EXACT_TLIST).  Here's a new version that updates the comment there
to note that this is not merely a good idea but a requirement, due to
the MinimalTuple conveyance.  (I think there may be another reason the
system columns are projected away even without that, but saying so
explicitly and documenting it seems useful either way).
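As a rough sketch of the mechanics (paraphrasing what tqueue.c does with
the patch applied, not a verbatim excerpt): the worker side becomes

	/* fetch the slot's contents as a MinimalTuple, ideally without a copy */
	tuple = ExecFetchSlotMinimalTuple(slot, &should_free);
	result = shm_mq_send(tqueue->queue, tuple->t_len, tuple, false);

and the leader can store the queue's buffer directly with
ExecStoreMinimalTuple(tup, fslot, false), avoiding the per-tuple copy
and pfree that HeapTuple required.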

> I tried that out (attached).  That makes various simple tests like
> this go 10%+ faster on my development machine:

I registered this patch as https://commitfest.postgresql.org/28/2560/
in case someone would like to review it.
From 103c74c34f0d0d97d162c38d34cd720774325b25 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Thu, 14 May 2020 19:08:50 +1200
Subject: [PATCH v2] Use MinimalTuple for tuple queues.

This saves 8 bytes per tuple in the queue, and avoids the need to copy
and free tuples on the receiving side.

Gather can emit the returned MinimalTuple directly, but GatherMerge now
needs to make an explicit copy because it buffers multiple tuples at a
time.

Discussion: https://postgr.es/m/CA%2BhUKG%2B8T_ggoUTAE-U%3DA%2BOcPc4%3DB0nPPHcSfffuQhvXXjML6w%40mail.gmail.com
---
 src/backend/executor/nodeGather.c   | 16 +-
 src/backend/executor/nodeGatherMerge.c  | 40 ++---
 src/backend/executor/tqueue.c   | 30 +--
 src/backend/optimizer/plan/createplan.c |  8 +++--
 src/include/executor/tqueue.h   |  4 +--
 5 files changed, 51 insertions(+), 47 deletions(-)

diff --git a/src/backend/executor/nodeGather.c b/src/backend/executor/nodeGather.c
index 6b8ed867d5..a01b46af14 100644
--- a/src/backend/executor/nodeGather.c
+++ b/src/backend/executor/nodeGather.c
@@ -46,7 +46,7 @@
 
 static TupleTableSlot *ExecGather(PlanState *pstate);
 static TupleTableSlot *gather_getnext(GatherState *gatherstate);
-static HeapTuple gather_readnext(GatherState *gatherstate);
+static MinimalTuple gather_readnext(GatherState *gatherstate);
 static void ExecShutdownGatherWorkers(GatherState *node);
 
 
@@ -120,7 +120,7 @@ ExecInitGather(Gather *node, EState *estate, int eflags)
 	 * Initialize funnel slot to same tuple descriptor as outer plan.
 	 */
 	gatherstate->funnel_slot = ExecInitExtraTupleSlot(estate, tupDesc,
-													  &TTSOpsHeapTuple);
+													  &TTSOpsMinimalTuple);
 
 	/*
 	 * Gather doesn't support checking a qual (it's always more efficient to
@@ -266,7 +266,7 @@ gather_getnext(GatherState *gatherstate)
 	PlanState  *outerPlan = outerPlanState(gatherstate);
 	TupleTableSlot *outerTupleSlot;
 	TupleTableSlot *fslot = gatherstate->funnel_slot;
-	HeapTuple	tup;
+	MinimalTuple	tup;
 
 	while (gatherstate->nreaders > 0 || gatherstate->need_to_scan_locally)
 	{
@@ -278,9 +278,9 @@ gather_getnext(GatherState *gatherstate)
 
 			if (HeapTupleIsValid(tup))
 			{
-				ExecStoreHeapTuple(tup, /* tuple to store */
-								   fslot,	/* slot to store the tuple */
-								   true);	/* pfree tuple when done with it */
+				ExecStoreMinimalTuple(tup, /* tuple to store */
+									  fslot,	/* slot to store the tuple */
+									  false);	/* don't pfree tuple */
 return fslot;
 			}
 		}
@@ -308,7 +308,7 @@ gather_getnext(GatherState *gatherstate)
 /*
  * Attempt to read a tuple from one of our parallel workers.
  */
-static HeapTuple
+static MinimalTuple
 gather_readnext(GatherState *gatherstate)
 {
 	int			nvisited = 0;
@@ -316,7 +316,7 @@ gather_readnext(GatherState *gatherstate)
 	for (;;)
 	{
 		TupleQueueReader *reader;
-		HeapTuple	tup;
+		MinimalTuple tup;
 		bool		readerdone;
 
 		/* Check for async events, particularly messages from workers. */
diff --git a/src/backend/executor/nodeGatherMerge.c b/src/backend/executor/nodeGatherMerge.c
index 317ddb4ae2..47129344f3 100644
--- a/src/backend/executor/nodeGatherMerge.c
+++ b/src/backend/executor/nodeGatherMerge.c
@@ -45,7 +45,7 @@
  */
 typedef struct GMReaderTupleBuffer
 {
-	HeapTuple  *tuple;			/* array of length MAX_TUPLE_STORE */
+	MinimalTuple *tuple;		/* array of length MAX_TUPLE_STORE */
 	int			nTuples;		/* number of tuples currently stored */
 	int			readCounter;	/* index of next tuple to extract */
 	bool		done;			/* true if reader is known exhausted */
@@ -54,8 +54,8 @@ typedef struct GMReaderTupleBuffer
 static TupleTableSlot *ExecGatherMerge(PlanState *pstate);
 static int32 heap_compare_slots(Datum a, Datum b, void *arg);
 static TupleTableSlot *gather_merge_getnext(GatherMergeState *gm_state);
-static HeapTuple gm_readnext_tuple(GatherMergeState *gm_state, int nreader,
-

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2020-06-16 Thread Dilip Kumar
On Wed, Jun 17, 2020 at 9:33 AM Amit Kapila  wrote:
>
> On Tue, Jun 16, 2020 at 7:49 PM Dilip Kumar  wrote:
> >
> > On Tue, Jun 9, 2020 at 3:04 PM Amit Kapila  wrote:
> > >
> > > On Mon, Jun 8, 2020 at 11:53 AM Amit Kapila  
> > > wrote:
> > > >
> > > > I think one of the usages we still need is in ReorderBufferForget
> > > > because it can be called when we skip processing the txn.  See the
> > > > comments in DecodeCommit where we call this function.  If I am
> > > > correct, we need to probably collect all invalidations in
> > > > ReorderBufferTxn as we are collecting tuplecids and use them here.  We
> > > > can do the same during processing of XLOG_XACT_INVALIDATIONS.
> > > >
> > >
> > > One more point related to this is that after this patch series, we
> > > need to consider executing all invalidation during transaction abort.
> > > Because it is possible that due to memory overflow, we have processed
> > > some of the messages which also contain a few XACT_INVALIDATION
> > > messages, so to avoid cache pollution, we need to execute all of them
> > > in abort.  We also do the similar thing in Rollback/Rollback To
> > > Savepoint, see AtEOXact_Inval and AtEOSubXact_Inval.
> >
> > I have analyzed this further and I think there is some problem with
> > that. If, instead of keeping the invalidations as individual changes,
> > we try to combine them in ReorderBufferTxn's invalidation list, then
> > what happens if the (sub)transaction is aborted?  Basically, in this
> > case, we will end up executing all those invalidations even though we
> > never polluted the cache, because we never tried to stream the
> > transaction.  So this will affect the normal case, where we haven't
> > streamed the transaction, because every time we would be executing
> > invalidations logged by transactions that were aborted.  One way is
> > to develop the list at the sub-transaction level and, just before
> > sending the transaction (on commit), combine all the
> > (sub)transactions' invalidation lists.  But I think since we already
> > have the invalidations in the commit message, there is no point in
> > adding this complexity.
> > But my main worry is about streaming transactions; the problems are:
> > - Immediately on the arrival of an individual invalidation, we cannot
> > directly add it to the top-level transaction's invalidation list,
> > because if the transaction later aborts before we stream (or we
> > directly stream on commit) then we will get an unnecessarily long
> > list of invalidations performed by aborted subtransactions.
> >
>
> Is there any problem you see with this, or are you concerned about the
> efficiency?  Please note, we already do something similar in
> ReorderBufferForget, and if your concern is efficiency then that
> applies to the existing cases as well.  I think if we want, we can
> improve it later in many ways, one of which you have already suggested;
> at this time the main thing is correctness, and aborts are not frequent
> enough to worry too much about their performance.

As of now, I am not seeing a problem; I was just concerned about
processing more invalidation messages in the abort case compared to the
current code, even when streaming is off or the transaction was never
streamed because the memory limit was not crossed.  But I agree that
this happens only on abort, so I will work on this and later maybe we
can test the performance.
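To make sure we mean the same thing, the abort path I have in mind would
look roughly like this (a sketch with invented names, not code from the
patch set):

	/*
	 * On abort of a (partially) streamed transaction, replay every
	 * invalidation collected in the top-level ReorderBufferTxn so that
	 * no polluted catalog cache entries survive.
	 */
	if (rbtxn_is_streamed(txn))
		ReorderBufferExecuteInvalidations(txn->ninvalidations,
										  txn->invalidations);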

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2020-06-16 Thread Amit Kapila
On Tue, Jun 16, 2020 at 7:49 PM Dilip Kumar  wrote:
>
> On Tue, Jun 9, 2020 at 3:04 PM Amit Kapila  wrote:
> >
> > On Mon, Jun 8, 2020 at 11:53 AM Amit Kapila  wrote:
> > >
> > > I think one of the usages we still need is in ReorderBufferForget
> > > because it can be called when we skip processing the txn.  See the
> > > comments in DecodeCommit where we call this function.  If I am
> > > correct, we need to probably collect all invalidations in
> > > ReorderBufferTxn as we are collecting tuplecids and use them here.  We
> > > can do the same during processing of XLOG_XACT_INVALIDATIONS.
> > >
> >
> > One more point related to this is that after this patch series, we
> > need to consider executing all invalidation during transaction abort.
> > Because it is possible that due to memory overflow, we have processed
> > some of the messages which also contain a few XACT_INVALIDATION
> > messages, so to avoid cache pollution, we need to execute all of them
> > in abort.  We also do the similar thing in Rollback/Rollback To
> > Savepoint, see AtEOXact_Inval and AtEOSubXact_Inval.
>
> I have analyzed this further and I think there is some problem with
> that. If, instead of keeping the invalidations as individual changes,
> we try to combine them in ReorderBufferTxn's invalidation list, then
> what happens if the (sub)transaction is aborted?  Basically, in this
> case, we will end up executing all those invalidations even though we
> never polluted the cache, because we never tried to stream the
> transaction.  So this will affect the normal case, where we haven't
> streamed the transaction, because every time we would be executing
> invalidations logged by transactions that were aborted.  One way is
> to develop the list at the sub-transaction level and, just before
> sending the transaction (on commit), combine all the
> (sub)transactions' invalidation lists.  But I think since we already
> have the invalidations in the commit message, there is no point in
> adding this complexity.
> But my main worry is about streaming transactions; the problems are:
> - Immediately on the arrival of an individual invalidation, we cannot
> directly add it to the top-level transaction's invalidation list,
> because if the transaction later aborts before we stream (or we
> directly stream on commit) then we will get an unnecessarily long
> list of invalidations performed by aborted subtransactions.
>

Is there any problem you see with this, or are you concerned about the
efficiency?  Please note, we already do something similar in
ReorderBufferForget, and if your concern is efficiency then that
applies to the existing cases as well.  I think if we want, we can
improve it later in many ways, one of which you have already suggested;
at this time the main thing is correctness, and aborts are not frequent
enough to worry too much about their performance.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: valgrind versus pg_atomic_init()

2020-06-16 Thread Noah Misch
On Tue, Jun 16, 2020 at 08:35:58PM -0700, Andres Freund wrote:
> On June 16, 2020 8:24:29 PM PDT, Noah Misch  wrote:
> >Suppose the initializing process does:
> >
>  pg_atomic_init_u64(&somestruct->atomic, 123);
> >  somestruct->atomic_ready = true;
> >
>In released versions, any process observing atomic_ready==true will
>observe the results of the pg_atomic_init_u64().  After the commit from
>this thread, that's no longer assured.
> 
Why did that hold true before? There wasn't a barrier on those platforms
before, either (wherever we know that 64-bit reads/writes have
single-copy atomicity).

You are right.  It didn't hold before.




Re: valgrind versus pg_atomic_init()

2020-06-16 Thread Tom Lane
Andres Freund  writes:
> On June 16, 2020 8:24:29 PM PDT, Noah Misch  wrote:
>> Suppose the initializing process does:
>> 
>> pg_atomic_init_u64(&somestruct->atomic, 123);
>> somestruct->atomic_ready = true;
>> 
>> In released versions, any process observing atomic_ready==true will
>> observe the results of the pg_atomic_init_u64().  After the commit
>> from this thread, that's no longer assured.

> Why did that hold true before? There wasn't a barrier on those platforms
> before, either (wherever we know that 64-bit reads/writes have
> single-copy atomicity).

I'm confused as to why this is even an interesting discussion.  If the
timing is so tight that another process could possibly observe partially-
initialized state in shared memory, how could we have confidence that the
other process doesn't look before we've initialized the atomic variable or
spinlock at all?

I think in practice all we need depend on in this area is that fork()
provides a full memory barrier.

regards, tom lane




Re: valgrind versus pg_atomic_init()

2020-06-16 Thread Andres Freund
Hi, 

On June 16, 2020 8:24:29 PM PDT, Noah Misch  wrote:
>On Sun, Jun 14, 2020 at 09:16:20PM -0700, Andres Freund wrote:
>> On 2020-06-14 18:55:27 -0700, Noah Misch wrote:
>> > Does something guarantee the write will be globally-visible by the
>time the
>> > first concurrent accessor shows up?
>> 
>> The function comments say:
>> 
>>  *
>>  * Has to be done before any concurrent usage..
>>  *
>>  * No barrier semantics.
>> 
>> 
>> > (If not, one could (a) do an unlocked ptr->value=0, then the atomic
>> > write, or (b) revert and improve the suppression.)  I don't doubt
>> > it's fine for the ways PostgreSQL uses atomics today, which generally
>> > initialize an atomic before the concurrent-accessor processes even
>> > exist.
>> 
>> I think it's unlikely that there are cases where you could safely
>> initialize the atomic without needing some form of synchronization
>> before it can be used. If a barrier were needed, what'd guarantee the
>> concurrent access happened after the initialization in the first
>> place?
>
>Suppose the initializing process does:
>
>  pg_atomic_init_u64(&somestruct->atomic, 123);
>  somestruct->atomic_ready = true;
>
>In released versions, any process observing atomic_ready==true will
>observe the results of the pg_atomic_init_u64().  After the commit from
>this thread, that's no longer assured.

Why did that hold true before? There wasn't a barrier on those platforms
before, either (wherever we know that 64-bit reads/writes have
single-copy atomicity).

Andres

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.




Remove dead forceSync parameter of XactLogCommitRecord()

2020-06-16 Thread Noah Misch
I think someone planned to have XactLogCommitRecord() use its forceSync
parameter instead of reading the forceSyncCommit global variable, but that
didn't happen.  I'd like to remove the parameter, as attached.  This has no
functional consequences, as detailed in the commit message.
Author: Noah Misch 
Commit: Noah Misch 

Remove dead forceSync parameter of XactLogCommitRecord().

The function has been reading global variable forceSyncCommit, mirroring
the intent of the caller that passed forceSync=forceSyncCommit.  The
other caller, RecordTransactionCommitPrepared(), passed false.  Since
COMMIT PREPARED can't share a transaction with any command, it certainly
doesn't share a transaction with a command that sets forceSyncCommit.

Discussion: https://postgr.es/m/FIXME

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index e190487..571eb7c 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -2217,7 +2217,7 @@ RecordTransactionCommitPrepared(TransactionId xid,
 	recptr = XactLogCommitRecord(committs,
 								 nchildren, children, nrels, rels,
 								 ninvalmsgs, invalmsgs,
-								 initfileinval, false,
+								 initfileinval,
 								 MyXactFlags | XACT_FLAGS_ACQUIREDACCESSEXCLUSIVELOCK,
 								 xid, gid);
 
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index cd30b62..905dc7d 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1048,7 +1048,9 @@ CommandCounterIncrement(void)
  * ForceSyncCommit
  *
  * Interface routine to allow commands to force a synchronous commit of the
- * current top-level transaction
+ * current top-level transaction.  Currently, two-phase commit does not
+ * persist and restore this variable.  So long as all callers use
+ * PreventInTransactionBlock(), that omission has no consequences.
  */
 void
 ForceSyncCommit(void)
@@ -1315,7 +1317,7 @@ RecordTransactionCommit(void)
 		XactLogCommitRecord(xactStopTimestamp,
 							nchildren, children, nrels, rels,
 							nmsgs, invalMessages,
-							RelcacheInitFileInval, forceSyncCommit,
+							RelcacheInitFileInval,
 							MyXactFlags,
 							InvalidTransactionId, NULL /* plain commit */ );
 
@@ -5468,7 +5470,7 @@ XactLogCommitRecord(TimestampTz commit_time,
 					int nsubxacts, TransactionId *subxacts,
 					int nrels, RelFileNode *rels,
 					int nmsgs, SharedInvalidationMessage *msgs,
-					bool relcacheInval, bool forceSync,
+					bool relcacheInval,
 					int xactflags, TransactionId twophase_xid,
 					const char *twophase_gid)
 {
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 88025b1..db19187 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -434,7 +434,7 @@ extern XLogRecPtr XactLogCommitRecord(TimestampTz commit_time,
 									  int nsubxacts, TransactionId *subxacts,
 									  int nrels, RelFileNode *rels,
 									  int nmsgs, SharedInvalidationMessage *msgs,
-									  bool relcacheInval, bool forceSync,
+									  bool relcacheInval,
 									  int xactflags,
 									  TransactionId twophase_xid,
 									  const char *twophase_gid);


Re: valgrind versus pg_atomic_init()

2020-06-16 Thread Noah Misch
On Sun, Jun 14, 2020 at 09:16:20PM -0700, Andres Freund wrote:
> On 2020-06-14 18:55:27 -0700, Noah Misch wrote:
> > Does something guarantee the write will be globally-visible by the time the
> > first concurrent accessor shows up?
> 
> The function comments say:
> 
>  *
>  * Has to be done before any concurrent usage..
>  *
>  * No barrier semantics.
> 
> 
> > (If not, one could (a) do an unlocked ptr->value=0, then the atomic
> > write, or (b) revert and improve the suppression.)  I don't doubt it's
> > fine for the ways PostgreSQL uses atomics today, which generally
> > initialize an atomic before the concurrent-accessor processes even
> > exist.
> 
> I think it's unlikely that there are cases where you could safely
> initialize the atomic without needing some form of synchronization
> before it can be used. If a barrier were needed, what'd guarantee the
> concurrent access happened after the initialization in the first place?

Suppose the initializing process does:

  pg_atomic_init_u64(&somestruct->atomic, 123);
  somestruct->atomic_ready = true;

In released versions, any process observing atomic_ready==true will
observe the results of the pg_atomic_init_u64().  After the commit from
this thread, that's no longer assured.  Having said that, I agree with a
special case of Tom's assertion about spinlocks, namely that this has
the same problem:

  /* somestruct->lock already happens to contain value 0 */
  SpinLockInit(&somestruct->lock);
  somestruct->lock_ready = true;
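If the ordering ever did matter, the publishing side would need an
explicit barrier between the two stores; a sketch (mine, not from any
patch in this thread):

  pg_atomic_init_u64(&somestruct->atomic, 123);
  pg_write_barrier();             /* publish the init before the flag */
  somestruct->atomic_ready = true;

(and the reader would need a matching pg_read_barrier() after seeing
atomic_ready==true).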




Re: Review for GetWALAvailability()

2020-06-16 Thread Kyotaro Horiguchi
At Tue, 16 Jun 2020 22:40:56 -0400, Alvaro Herrera  
wrote in 
> On 2020-Jun-17, Fujii Masao wrote:
> > On 2020/06/17 3:50, Alvaro Herrera wrote:
> 
> > So InvalidateObsoleteReplicationSlots() can terminate normal backends.
> > But do we want to do this? If we do, should we add a note about this
> > case to the docs? Otherwise users would be surprised at the
> > termination of backends by max_slot_wal_keep_size. I guess it
> > basically rarely happens, though.
> 
> Well, if we could distinguish a walsender from a non-walsender process,
> then maybe it would make sense to leave backends alive.  But do we want
> that?  I admit I don't know what would be the reason to have a
> non-walsender process with an active slot, so I don't have a good
> opinion on what to do in this case.

The non-walsender backend is actually doing replication work, so it
should rather be killed?

> > > > +   /*
> > > > +* Signal to terminate the process using the 
> > > > replication slot.
> > > > +*
> > > > +* Try to signal every 100ms until it succeeds.
> > > > +*/
> > > > +			if (!killed && kill(active_pid, SIGTERM) == 0)
> > > > +				killed = true;
> > > > +			ConditionVariableTimedSleep(&slot->active_cv, 100,
> > > > +										WAIT_EVENT_REPLICATION_SLOT_DROP);
> > > > +		} while (ReplicationSlotIsActive(slot, NULL));
> > > 
> > > Note that here you're signalling only once and then sleeping many times
> > > in increments of 100ms -- you're not signalling every 100ms as the
> > > comment claims -- unless the signal fails, but you don't really expect
> > > that.  On the contrary, I'd claim that the logic is reversed: if the
> > > signal fails, *then* you should stop signalling.
> > 
> > You mean: in this code path, signaling fails only when the target
> > process disappears just before signaling. So if it fails,
> > slot->active_pid is expected to become 0 even without further
> > signaling. Right?
> 
> I guess kill() can also fail if the PID now belongs to a process owned
> by a different user.  I think we've disregarded very quick reuse of
> PIDs, so we needn't concern ourselves with it.

The first call to ConditionVariableTimedSleep doesn't actually sleep,
so the loop works as expected.  But we may make an extra call to
kill(2).  Calling ConditionVariablePrepareToSleep before the loop would
make it better.
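Concretely, I imagine the loop reshaped like this (just a sketch):

	ConditionVariablePrepareToSleep(&slot->active_cv);
	while (ReplicationSlotIsActive(slot, NULL))
	{
		/* signal only until the first successful kill() */
		if (!killed && kill(active_pid, SIGTERM) == 0)
			killed = true;
		ConditionVariableTimedSleep(&slot->active_cv, 100,
									WAIT_EVENT_REPLICATION_SLOT_DROP);
	}
	ConditionVariableCancelSleep();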

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Review for GetWALAvailability()

2020-06-16 Thread Alvaro Herrera
On 2020-Jun-17, Fujii Masao wrote:
> On 2020/06/17 3:50, Alvaro Herrera wrote:

> So InvalidateObsoleteReplicationSlots() can terminate normal backends.
> But do we want to do this? If we do, should we add a note about this
> case to the docs? Otherwise users would be surprised at the
> termination of backends by max_slot_wal_keep_size. I guess it
> basically rarely happens, though.

Well, if we could distinguish a walsender from a non-walsender process,
then maybe it would make sense to leave backends alive.  But do we want
that?  I admit I don't know what would be the reason to have a
non-walsender process with an active slot, so I don't have a good
opinion on what to do in this case.

> > > + /*
> > > +  * Signal to terminate the process using the replication slot.
> > > +  *
> > > +  * Try to signal every 100ms until it succeeds.
> > > +  */
> > > +		if (!killed && kill(active_pid, SIGTERM) == 0)
> > > +			killed = true;
> > > +		ConditionVariableTimedSleep(&slot->active_cv, 100,
> > > +									WAIT_EVENT_REPLICATION_SLOT_DROP);
> > > +	} while (ReplicationSlotIsActive(slot, NULL));
> > 
> > Note that here you're signalling only once and then sleeping many times
> > in increments of 100ms -- you're not signalling every 100ms as the
> > comment claims -- unless the signal fails, but you don't really expect
> > that.  On the contrary, I'd claim that the logic is reversed: if the
> > signal fails, *then* you should stop signalling.
> 
> You mean: in this code path, signaling fails only when the target
> process disappears just before signaling. So if it fails,
> slot->active_pid is expected to become 0 even without further
> signaling. Right?

I guess kill() can also fail if the PID now belongs to a process owned
by a different user.  I think we've disregarded very quick reuse of
PIDs, so we needn't concern ourselves with it.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Review for GetWALAvailability()

2020-06-16 Thread Fujii Masao




On 2020/06/17 3:50, Alvaro Herrera wrote:

On 2020-Jun-17, Fujii Masao wrote:


While reading InvalidateObsoleteReplicationSlots() code, I found another issue.

	ereport(LOG,
			(errmsg("terminating walsender %d because replication slot \"%s\" is too far behind",
					wspid, NameStr(slotname))));
	(void) kill(wspid, SIGTERM);

wspid indicates the PID of process using the slot. That process can be
a backend, for example, executing pg_replication_slot_advance().
So "walsender" in the above log message is not always correct.


Good point.


So InvalidateObsoleteReplicationSlots() can terminate normal backends.
But do we want to do this? If we do, should we add a note about this
case to the docs? Otherwise users would be surprised at the termination
of backends by max_slot_wal_keep_size. I guess it basically rarely
happens, though.



	int		wspid = ReplicationSlotAcquire(NameStr(slotname),
										   SAB_Inquire);

Why do we need to call ReplicationSlotAcquire() here and mark the slot as
used by the checkpointer? Isn't it enough to check directly the slot's
active_pid, instead?

Maybe ReplicationSlotAcquire() is necessary because
ReplicationSlotRelease() is called later? If so, why do we need to call
ReplicationSlotRelease()? ISTM that we don't need to do that if the slot's
active_pid is zero. No?


I think the point here was that in order to modify the slot you have to
acquire it -- it's not valid to modify a slot you don't own.


Understood. Thanks!



+   /*
+* Signal to terminate the process using the replication slot.
+*
+* Try to signal every 100ms until it succeeds.
+*/
+		if (!killed && kill(active_pid, SIGTERM) == 0)
+			killed = true;
+		ConditionVariableTimedSleep(&slot->active_cv, 100,
+									WAIT_EVENT_REPLICATION_SLOT_DROP);
+	} while (ReplicationSlotIsActive(slot, NULL));


Note that here you're signalling only once and then sleeping many times
in increments of 100ms -- you're not signalling every 100ms as the
comment claims -- unless the signal fails, but you don't really expect
that.  On the contrary, I'd claim that the logic is reversed: if the
signal fails, *then* you should stop signalling.


You mean: in this code path, signaling fails only when the target
process disappears just before signaling. So if it fails,
slot->active_pid is expected to become 0 even without further
signaling. Right?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: language cleanups in code and docs

2020-06-16 Thread Tom Lane
Alvaro Herrera  writes:
> On 2020-Jun-16, Tom Lane wrote:
>> "master" is the default branch name established by git, is it not?  Not
>> something we picked.

> Git itself is discussing this:
> https://public-inbox.org/git/41438a0f-50e4-4e58-a3a7-3daaecb55...@jramsay.com.au/T/#t
> and it seems that "main" is the winning choice.

Oh, interesting.  If they do change I'd be happy to follow suit.
But let's wait and see what they do, rather than possibly ending
up with our own private convention.

regards, tom lane




Re: Review for GetWALAvailability()

2020-06-16 Thread Kyotaro Horiguchi
At Tue, 16 Jun 2020 14:31:43 -0400, Alvaro Herrera  
wrote in 
> On 2020-Jun-16, Kyotaro Horiguchi wrote:
> 
> > I noticed another issue. If some required WAL segments are removed,
> > the slot will be "invalidated", that is, restart_lsn is set to an
> > invalid value. As a result we hardly ever see the "lost" state.
> > 
> > It can be "fixed" by remembering the validity of a slot separately
> > from restart_lsn. Is that worth doing?
> 
> We discussed this before.  I agree it would be better to do this
> in some way, but I fear that if we do it naively, some code might exist
> that reads the LSN without realizing that it needs to check the validity
> flag first.

Yes, that was my main concern about it. That's error-prone. How about
remembering the LSN where the invalidation happened?  It's safe since
nothing other than the slot-monitoring functions would look at
last_invalidated_lsn. It can be reset if active_pid is a valid pid.

InvalidateObsoleteReplicationSlots:
 ...
	SpinLockAcquire(&s->mutex);
+	s->data.last_invalidated_lsn = s->data.restart_lsn;
	s->data.restart_lsn = InvalidXLogRecPtr;
	SpinLockRelease(&s->mutex);

> On the other hand, maybe this is not a problem in practice, because if
> such a bug occurs, what will happen is that trying to read WAL from such
> a slot will return the error message that the WAL file cannot be found.
> Maybe this is acceptable?

I'm not sure.  For my part, a problem with that is that we would need
to look into the server logs to know what is actually going on.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Improve planner cost estimations for alternative subplans

2020-06-16 Thread Melanie Plageman
On Fri, Jun 5, 2020 at 9:08 AM Alexey Bashtanov  wrote:

>
> In [1] we found a situation where it leads to a suboptimal plan,
> as it bloats the overall cost into large figures; decisions related
> to an outer part of the plan then look negligible to the planner,
> and as a result it doesn't elaborate on choosing the optimal one.
>
>
I've just started looking at this patch today, but I was wondering if
you might include a test case which minimally reproduces the original
problem you had.
The only plan diff I see is in updatable_views.sql, and it doesn't
illustrate the problem as well as a more straightforward SELECT query
with an EXISTS sublink might.

After putting in some logging, I see that there are only a
few non-catalog queries which exercise this code path.
This query from groupingsets.sql is an example of one such query:

select ten, sum(distinct four) from onek a
group by grouping sets((ten,four),(ten))
having exists (select 1 from onek b where sum(distinct a.four) = b.four);

But, the chosen plan for this query stays the same.

It would be helpful to see a query where a different plan is chosen
because of this change that is not from updatable_views.sql.

-- 
Melanie Plageman


Re: Review for GetWALAvailability()

2020-06-16 Thread Kyotaro Horiguchi
At Wed, 17 Jun 2020 00:46:38 +0900, Fujii Masao  
wrote in 
> >> 2. InvalidateObsoleteReplicationSlots() uses the loop to scan
> >> replication
> >>  slots array and uses the "for" loop in each scan. Also it calls
> >>  ReplicationSlotAcquire() for each "for" loop cycle, and
> >>  ReplicationSlotAcquire() uses another loop to scan replication slots
> >>  array. I don't think this is good design.
> >>
> >>  ISTM that we can get rid of ReplicationSlotAcquire()'s loop because
> >>  InvalidateObsoleteReplicationSlots() already know the index of the
> >>  slot
> >>  that we want to find. The attached patch does that. Thought?
> > The inner loop is expected to run at most several times per
> > checkpoint, which won't be a serious problem. However, it is better if
> > we can get rid of that in a reasonable way.
> > The attached patch changes the behavior for SAB_Block. Before the
> > patch, it rescans from the first slot for the same name, but with the
> > patch it just rechecks the same slot.  The only caller of the function
> > with SAB_Block is ReplicationSlotDrop, and I can't come up with a case
> > where another slot with the same name is created at different place
> > before the condition variable fires. But I'm not sure the change is
> > completely safe.
> 
> Yes, that change might not be safe. So I'm thinking of another
> approach to fix the issues.
> 
> > Maybe some assertion is needed?
> > 
> >> 3. There is a corner case where the termination of walsender cleans up
> >>  the temporary replication slot while
> >>  InvalidateObsoleteReplicationSlots()
> >>  is sleeping on ConditionVariableTimedSleep(). In this case,
> >>  ReplicationSlotAcquire() is called in the subsequent cycle of the
> >>  "for"
> >>  loop, cannot find the slot and then emits ERROR message. This leads
> >>  to the failure of checkpoint by the checkpointer.
> > Agreed.
> > 
> >>  To avoid this case, if SAB_Inquire is specified,
> >>  ReplicationSlotAcquire()
> >>  should return the special value instead of emitting ERROR even when
> >>  it cannot find the slot. Also InvalidateObsoleteReplicationSlots()
> >>  should
> >>  handle that special returned value.
> > I thought the same thing hearing that.
> 
> While reading InvalidateObsoleteReplicationSlots() code, I found
> another issue.
> 
> 	ereport(LOG,
> 			(errmsg("terminating walsender %d because replication slot \"%s\" is too far behind",
> 					wspid, NameStr(slotname))));
> 	(void) kill(wspid, SIGTERM);
> 
> wspid indicates the PID of process using the slot. That process can be
> a backend, for example, executing pg_replication_slot_advance().
> So "walsender" in the above log message is not always correct.

Agreed.

> 
> 	int		wspid = ReplicationSlotAcquire(NameStr(slotname),
> 										   SAB_Inquire);
> 
> Why do we need to call ReplicationSlotAcquire() here and mark the slot
> as used by the checkpointer? Isn't it enough to check directly the
> slot's active_pid, instead?
> Maybe ReplicationSlotAcquire() is necessary because
> ReplicationSlotRelease() is called later? If so, why do we need to
> call ReplicationSlotRelease()? ISTM that we don't need to do that if
> the slot's active_pid is zero. No?

My understanding of the reason is that we update a slot value here.
The restriction allows the owner of a slot to assume that the slot
values never change spontaneously.

slot.h:104
| * - Individual fields are protected by mutex where only the backend owning
| * the slot is authorized to update the fields from its own slot.  The
| * backend owning the slot does not need to take this lock when reading its
| * own fields, while concurrent backends not owning this slot should take the
| * lock when reading this slot's data.

> If my understanding is right, I'd like to propose the attached patch.
> It introduces DeactivateReplicationSlot() and replaces the "for" loop
> in InvalidateObsoleteReplicationSlots() with it.
> ReplicationSlotAcquire() and ReplicationSlotRelease() are no longer
> called there.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Add an option to allow run regression test for individual module on Windows build

2020-06-16 Thread David Zhang

Hi Hackers,

When I was working on an extension on the Windows platform, I used the
command 'vcregress contribcheck' to run the regression tests for my
module. However, this command runs the regression tests for all the
modules, and I couldn't find a way to regression-test only my module. I
think such an option would make development more efficient (maybe there
is a solution already, but I couldn't find it).


The attached patch still lets 'vcregress contribcheck' test all
modules, and adds the form 'vcregress contribcheck <module>' to test a
single extension directly, for example 'postgres_fdw'.
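For example, from the Visual Studio build environment in src/tools/msvc,
with the patch applied:

   vcregress contribcheck                  (all contrib modules, as before)
   vcregress contribcheck postgres_fdw     (postgres_fdw only)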


--
David

Software Engineer
Highgo Software Inc. (Canada)
www.highgo.ca
diff --git a/src/tools/msvc/vcregress.pl b/src/tools/msvc/vcregress.pl
index dac60f6264..9db740a847 100644
--- a/src/tools/msvc/vcregress.pl
+++ b/src/tools/msvc/vcregress.pl
@@ -448,23 +448,40 @@ sub subdircheck
return;
 }
 
+sub contribmodulecheck
+{
+   my $module = shift;
+   my $mstat = shift;
+
+   # these configuration-based exclusions must match Install.pm
+   next if ($module eq "uuid-ossp"  && !defined($config->{uuid}));
+   next if ($module eq "sslinfo"&& !defined($config->{openssl}));
+   next if ($module eq "xml2"   && !defined($config->{xml}));
+   next if ($module =~ /_plperl$/   && !defined($config->{perl}));
+   next if ($module =~ /_plpython$/ && !defined($config->{python}));
+   next if ($module eq "sepgsql");
+
+   subdircheck($module);
+   my $status = $? >> 8;
+   $$mstat ||= $status;
+}
+
 sub contribcheck
 {
chdir "../../../contrib";
my $mstat = 0;
-   foreach my $module (glob("*"))
-   {
-   # these configuration-based exclusions must match Install.pm
-   next if ($module eq "uuid-ossp"  && !defined($config->{uuid}));
-   next if ($module eq "sslinfo"&& 
!defined($config->{openssl}));
-   next if ($module eq "xml2"   && !defined($config->{xml}));
-   next if ($module =~ /_plperl$/   && !defined($config->{perl}));
-   next if ($module =~ /_plpython$/ && 
!defined($config->{python}));
-   next if ($module eq "sepgsql");
 
-   subdircheck($module);
-   my $status = $? >> 8;
-   $mstat ||= $status;
+   my $module = shift;
+	if ($module ne "")
+   {
+   contribmodulecheck($module, \$mstat);
+   }
+   else
+   {
+   foreach my $module (glob("*"))
+   {
+   contribmodulecheck($module, \$mstat);
+   }
}
exit $mstat if $mstat;
return;
@@ -760,6 +777,8 @@ sub usage
  "  serial serial mode\n",
  "  parallel   parallel mode\n",
  "\nOption for : for taptest\n",
- "  TEST_DIR   (required) directory where tests reside\n";
+ "  TEST_DIR   (required) directory where tests reside\n",
+ "\nOption for : for contribcheck\n",
+ "  module   (optional) module name to be tested only\n";
exit(1);
 }


Re: language cleanups in code and docs

2020-06-16 Thread Alvaro Herrera
On 2020-Jun-16, Tom Lane wrote:

> Andrew Dunstan  writes:
> > On 6/15/20 2:22 PM, Andres Freund wrote:
> >> 2) 'master' as a reference to the branch. Personally I be in favor of
> >> changing the branch name, but it seems like it'd be better done as a
> >> somewhat separate discussion to me, as it affects development
> >> practices to some degree.
> 
> > I'm OK with this, but I would need plenty of notice to get the buildfarm
> > adjusted.
> 
> "master" is the default branch name established by git, is it not?  Not
> something we picked.  I don't feel like we need to be out front of the
> rest of the world in changing that.  It'd be a cheaper change than some of
> the other proposals in this thread, no doubt, but it would still create
> confusion for new hackers who are used to the standard git convention.

Git itself is discussing this:
https://public-inbox.org/git/41438a0f-50e4-4e58-a3a7-3daaecb55...@jramsay.com.au/T/#t
and it seems that "main" is the winning choice.

(Personally) I would leave master to have root, would leave root to have
default, would leave default to have primary, would leave primary to
have base, would leave base to have main, would leave main to have
mainline.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Add A Glossary

2020-06-16 Thread Justin Pryzby
On Tue, Jun 16, 2020 at 08:09:26PM -0400, Alvaro Herrera wrote:
> diff --git a/doc/src/sgml/glossary.sgml b/doc/src/sgml/glossary.sgml
> index 25b03f3b37..e29b55e5ac 100644
> --- a/doc/src/sgml/glossary.sgml
> +++ b/doc/src/sgml/glossary.sgml
> @@ -395,15 +395,15 @@
>  
>   The base directory on the filesystem of a
>   server that contains all
> - data files and subdirectories associated with an
> - instance (with the
> - exception of tablespaces).
> + data files and subdirectories associated with a
> + database cluster
> + (with the exception of
> + tablespaces).

and (optionally) WAL

> +  
> +   Database cluster
> +   
> +
> + A collection of databases and global SQL objects,
> + and their common static and dynamic meta-data.

metadata

> @@ -1245,12 +1255,17 @@
>   SQL objects,
>   which all reside in the same
>   database.
> - Each SQL object must reside in exactly one schema.
> + Each SQL object must reside in exactly one schema
> + (though certain types of SQL objects exist outside schemas).

(except for global objects which ..)

>  
>   The names of SQL objects of the same type in the same schema are
>   enforced to be unique.
>   There is no restriction on reusing a name in multiple schemas.
> + For local objects that exist outside schemas, their names are enforced
> + unique across the whole database.  For global objects, their names

I would say "unique within the database"

> + are enforced unique across the whole
> + database cluster.

and "unique within the whole db cluster"

>    Most local objects belong to a specific
> -  schema in their containing database.
> +  schema in their containing database, such as
> +  all types of relations,
> +  all types of functions,

Maybe say: >Relations< (all types), and >Functions< (all types)

>   used as the default one for all SQL objects, called
>   pg_default.
>   
>  
"the default" (remove "one")

-- 
Justin




Re: language cleanups in code and docs

2020-06-16 Thread Dave Cramer
On Tue, 16 Jun 2020 at 19:55, Tom Lane  wrote:

> Andrew Dunstan  writes:
> > On 6/15/20 2:22 PM, Andres Freund wrote:
> >> 2) 'master' as a reference to the branch. Personally I be in favor of
> >> changing the branch name, but it seems like it'd be better done as a
> >> somewhat separate discussion to me, as it affects development
> >> practices to some degree.
>
> > I'm OK with this, but I would need plenty of notice to get the buildfarm
> > adjusted.
>
> "master" is the default branch name established by git, is it not?  Not
> something we picked.  I don't feel like we need to be out front of the
> rest of the world in changing that.  It'd be a cheaper change than some of
> the other proposals in this thread, no doubt, but it would still create
> confusion for new hackers who are used to the standard git convention.
>

While it is the default, I expect that will change soon. GitHub is
planning to make main the default.
https://www.zdnet.com/article/github-to-replace-master-with-alternative-term-to-avoid-slavery-references/

Many projects are moving from master to main.

I expect it will be less confusing than you think.

Dave Cramer
www.postgres.rocks


Re: elog(DEBUG2 in SpinLocked section.

2020-06-16 Thread Andres Freund
Hi,

On 2020-06-16 19:46:29 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > I experimented with making the compiler warn about some of these
> > kinds of mistakes without needing full test coverage:
>
> > I was able to get clang to warn about things like using palloc in signal
> > handlers, or using palloc while holding a spinlock. Which would be
> > great, except that it doesn't warn when there's an un-annotated
> > intermediary function. Even when that function is in the same TU.
>
> Hm.  Couldn't we make "calling an un-annotated function" be a violation
> in itself?

I don't see a way to do that with these annotations, unfortunately.

https://clang.llvm.org/docs/ThreadSafetyAnalysis.html
https://clang.llvm.org/docs/AttributeReference.html#acquire-capability-acquire-shared-capability


> Certainly in the case of spinlocks, what we want is pretty
> nearly a total ban on calling anything at all.  I wouldn't cry too hard
> about having a similar policy for signal handlers.

It'd be interesting to try and see how invasive that'd be, if it were
possible to enforce. But...

Greetings,

Andres Freund




Re: Add A Glossary

2020-06-16 Thread Alvaro Herrera
On 2020-Jun-09, Jürgen Purtz wrote:

> Can you agree to the following definitions? If no, we can alternatively
> formulate for each of them: "Under discussion - currently not defined". My
> proposals are inspired by chapter 2.2 Concepts: "Tables are grouped into
> databases, and a collection of databases managed by a single PostgreSQL
> server instance constitutes a database cluster."

After sleeping on it a few more times, I don't oppose the idea of making
"instance" be the running state and "database cluster" the on-disk stuff
that supports the instance.  Here's a patch that does things pretty much
along the lines you suggested.

I made small adjustments to "SQL objects":

* SQL objects in schemas were said to have their names unique in the
schema, but we failed to say anything about names of objects not in
schemas and global objects.  Added that.

* Had example object types for global objects and objects not in
schemas, but no examples for objects in schemas.  Added that.


Some programs whose output we could tweak per this:
pg_ctl
> pg_ctl is a utility to initialize, start, stop, or control a PostgreSQL 
> server.
>  -D, --pgdata=DATADIR   location of the database storage area
to:
> pg_ctl is a utility to initialize or control a PostgreSQL database cluster.
>  -D, --pgdata=DATADIR   location of the database directory

pg_basebackup:
> pg_basebackup takes a base backup of a running PostgreSQL server.
to:
> pg_basebackup takes a base backup of a PostgreSQL instance.


-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/doc/src/sgml/glossary.sgml b/doc/src/sgml/glossary.sgml
index 25b03f3b37..e29b55e5ac 100644
--- a/doc/src/sgml/glossary.sgml
+++ b/doc/src/sgml/glossary.sgml
@@ -395,15 +395,15 @@
 
  The base directory on the filesystem of a
  server that contains all
- data files and subdirectories associated with an
- instance (with the
- exception of tablespaces).
+ data files and subdirectories associated with a
+ database cluster
+ (with the exception of
+ tablespaces).
  The environment variable PGDATA is commonly used to
- refer to the
- data directory.
+ refer to the data directory.
 
 
- An instance's storage
+ A cluster's storage
  space comprises the data directory plus any additional tablespaces.
 
 
@@ -418,7 +418,7 @@

 
  A named collection of
- SQL objects.
+ local SQL objects.
 
 
  For more information, see
@@ -427,6 +427,22 @@

   
 
+  
+   Database cluster
+   
+
+ A collection of databases and global SQL objects,
+ and their common static and dynamic meta-data.
+ Sometimes referred to as a
+ cluster.
+
+
+ In PostgreSQL, the term
+ cluster is also sometimes used to refer to an instance.
+ (Don't confuse this term with the SQL command CLUSTER.)
+
+  
+
   
Database server

@@ -634,7 +650,7 @@
Function

 
- Any defined transformation of data. Many functions are already defined
+ A defined transformation of data.  Many functions are already defined
  within PostgreSQL itself, but user-defined
  ones can also be added.
 
@@ -724,14 +740,12 @@
Instance

 
- A set of databases and accompanying global SQL objects that are stored in
- the same data directory
- in a single server.
- If running, one
+ A group of backend and auxiliary processes that communicate using
+ a common shared memory area.  One 
  postmaster process
- manages a group of backend and auxiliary processes that communicate
- using a common shared memory
- area.  Many instances can run on the same
+ manages the instance; one instance manages exactly one
+ database cluster
+ with all its databases.  Many instances can run on the same
  server
  as long as their TCP ports do not conflict.
 
@@ -739,14 +753,10 @@
  The instance handles all key features of a DBMS:
  read and write access to files and shared memory,
  assurance of the ACID properties,
- connections to client processes,
+ connections to
+ client processes,
  privilege verification, crash recovery, replication, etc.
 
-
- In PostgreSQL, the term
- cluster is also sometimes used to refer to an instance.
- (Don't confuse this term with the SQL command CLUSTER.)
-

   
 
@@ -1245,12 +1255,17 @@
  SQL objects,
  which all reside in the same
  database.
- Each SQL object must reside in exactly one schema.
+ Each SQL object must reside in exactly one schema
+ (though certain types of SQL objects exist outside schemas).
 
 
  The names of SQL objects of the same type in the same schema are enforced
  to be unique.
  There is no restriction on reusing a name in multiple schemas.
+ For local objects that exist outside schemas, their names are enforced
+ unique across the whole database.  For global objects, their names
+ are enforced unique across the whole
+ database cluster.

Re: Transactions involving multiple postgres foreign servers, take 2

2020-06-16 Thread Masahiro Ikeda

I've attached the new version patch set. 0006 is a separate patch
which introduces 'prefer' mode to foreign_twophase_commit.


I hope we can use this feature. Thank you for the patches and the
discussion. I'm currently studying the logic and found some minor points
to be fixed.


I'm sorry if my understanding is wrong.

* The v22 patches need a rebase as they don't apply to the current master.

* FdwXactAtomicCommitParticipants, mentioned in
  src/backend/access/fdwxact/README, is not implemented.
  Is FdwXactParticipants right?

* The following comment says that this code is for "One-phase",
  but the second argument of FdwXactParticipantEndTransaction() indicates
  that this code is not "onephase".

AtEOXact_FdwXact() in fdwxact.c
/* One-phase rollback foreign transaction */
FdwXactParticipantEndTransaction(fdw_part, false, false);

static void
FdwXactParticipantEndTransaction(FdwXactParticipant *fdw_part, bool onephase,
                                 bool for_commit)

* "two_phase_commit" option is mentioned in postgres-fdw.sgml,
   but I can't find related code.

* A comment in resolver.c has a sentence
  containing two blanks ("Emergency  Termination").

* There are some inconsistencies with the PostgreSQL wiki.
https://wiki.postgresql.org/wiki/Atomic_Commit_of_Distributed_Transactions

  I understand it's difficult to keep them consistent; I think it's OK to
  fix this later, when these patches are close to being committed.

  - I can't find the "two_phase_commit" option in the source code.
    But 2PC works if the remote server's "max_prepared_transactions" is set
    to a non-zero value. Is that the correct behavior?

  - Some parameters are renamed or added in the latest patches:
    max_prepared_foreign_transaction, max_prepared_transactions, and so on.


  - typo: froeign_transaction_resolver_timeout

Regards,

--
Masahiro Ikeda
NTT DATA CORPORATION




Re: language cleanups in code and docs

2020-06-16 Thread Tom Lane
Andrew Dunstan  writes:
> On 6/15/20 2:22 PM, Andres Freund wrote:
> >> 2) 'master' as a reference to the branch. Personally I'd be in favor of
>> changing the branch name, but it seems like it'd be better done as a
>> somewhat separate discussion to me, as it affects development
>> practices to some degree.

> I'm OK with this, but I would need plenty of notice to get the buildfarm
> adjusted.

"master" is the default branch name established by git, is it not?  Not
something we picked.  I don't feel like we need to be out front of the
rest of the world in changing that.  It'd be a cheaper change than some of
the other proposals in this thread, no doubt, but it would still create
confusion for new hackers who are used to the standard git convention.

regards, tom lane




Re: Operator class parameters and sgml docs

2020-06-16 Thread Peter Geoghegan
On Tue, Jun 16, 2020 at 4:24 AM Alexander Korotkov
 wrote:
> Thank you for your patience.  The documentation patch is attached.  I think
> it requires review by a native English speaker.

* "...paramaters that controls" should be "...paramaters that control".

* "with set of operator class specific option" should be "with a set
of operator class specific options".

* "The options could be accessible from each support function" should
be "The options can be accessed from other support functions"

(At least I think that that's what you meant)

It's very hard to write documentation like this, even for native
English speakers. I think that it's important to have something in
place, though. The GiST example helps a lot.

-- 
Peter Geoghegan




Re: elog(DEBUG2 in SpinLocked section.

2020-06-16 Thread Tom Lane
Andres Freund  writes:
> I experimented with making the compiler warn about about some of these
> kinds of mistakes without needing full test coverage:

> I was able to get clang to warn about things like using palloc in signal
> handlers, or using palloc while holding a spinlock. Which would be
> great, except that it doesn't warn when there's an un-annotated
> intermediary function. Even when that function is in the same TU.

Hm.  Couldn't we make "calling an un-annotated function" be a violation
in itself?  Certainly in the case of spinlocks, what we want is pretty
nearly a total ban on calling anything at all.  I wouldn't cry too hard
about having a similar policy for signal handlers.  (The postmaster's
handlers would have to be an exception for now.)

regards, tom lane




Re: language cleanups in code and docs

2020-06-16 Thread Andrew Dunstan


On 6/15/20 2:22 PM, Andres Freund wrote:
> 2) 'master' as a reference to the branch. Personally I'd be in favor of
>changing the branch name, but it seems like it'd be better done as a
>somewhat separate discussion to me, as it affects development
>practices to some degree.
>


I'm OK with this, but I would need plenty of notice to get the buildfarm
adjusted.


cheers


andrew


-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: elog(DEBUG2 in SpinLocked section.

2020-06-16 Thread Andres Freund
Hi,

On 2020-06-03 00:36:34 -0400, Tom Lane wrote:
> Should we think about adding automated detection of this type of
> mistake?  I don't like the attached as-is because of the #include
> footprint expansion, but maybe we can find a better way.

I experimented with making the compiler warn about about some of these
kinds of mistakes without needing full test coverage:

I was able to get clang to warn about things like using palloc in signal
handlers, or using palloc while holding a spinlock. Which would be
great, except that it doesn't warn when there's an un-annotated
intermediary function. Even when that function is in the same TU.

Here's my attempt: https://godbolt.org/z/xfa6Es

It does detect things like
spinlock_lock();
example_alloc(17);
spinlock_unlock();

:49:2: warning: cannot call function 'example_alloc' while mutex
'holding_spinlock' is held [-Wthread-safety-analysis]
        example_alloc(17);
        ^

which isn't too bad.
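
For readers who don't want to follow the godbolt link, here is a minimal
self-contained sketch of the kind of annotations involved. The names
(spinlock_cap, holding_spinlock, spinlock_lock, example_alloc) follow the
warning text above and are illustrative only, not PostgreSQL APIs:

	/* compile with clang -Wthread-safety */
	struct __attribute__((capability("spinlock"))) spinlock_cap
	{
		int			value;
	};

	extern struct spinlock_cap holding_spinlock;

	/* acquiring/releasing the capability models S_LOCK/S_UNLOCK */
	extern void spinlock_lock(void)
		__attribute__((acquire_capability(holding_spinlock)));
	extern void spinlock_unlock(void)
		__attribute__((release_capability(holding_spinlock)));

	/* allocation must not be attempted while the spinlock is held */
	extern void *example_alloc(unsigned long size)
		__attribute__((locks_excluded(holding_spinlock)));

With those declarations, clang flags a direct call to example_alloc()
made between spinlock_lock() and spinlock_unlock().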

Does anybody think this would be useful even if it doesn't detect the
more complicated cases?

Greetings,

Andres Freund




Re: language cleanups in code and docs

2020-06-16 Thread David Steele

On 6/16/20 6:27 PM, Andres Freund wrote:

On 2020-06-16 17:14:57 -0400, David Steele wrote:

On 6/15/20 2:22 PM, Andres Freund wrote:



0008: docs: WIP multi-master rephrasing.
I like neither the new nor the old language much. I'd welcome input.


Why not multi-primary?


My understanding of primary is that there really can't be two things
that are primary in relation to each other. 


Well, I think the same is true for multi-master and that's pretty common.


active/active is probably
the most common term in use besides multi-master.


Works for me and can always be updated later if we come up with 
something better. At least active-active will be easier to search for.



One last thing -- are we considering back-patching any/all of this?


I don't think there's a good reason to do so.


I was thinking of back-patching pain but if you don't think that's an 
issue then I'm not worried about it.



Thanks for the look!


You are welcome!

--
-David
da...@pgmasters.net




Re: language cleanups in code and docs

2020-06-16 Thread Andres Freund
Hi,

On 2020-06-16 17:14:57 -0400, David Steele wrote:
> On 6/15/20 2:22 PM, Andres Freund wrote:
> > 
> > We've removed the use of "slave" from most of the repo (one use
> > remained, included here), but we didn't do the same for master. In the
> > attached series I replaced most of the uses.
> > 
> > 0001: tap tests: s/master/primary/
> >Pretty clear cut imo.
> 
> Nothing to argue with here as far as I can see. It's a lot of churn, though,
> so the sooner it goes in the better so people can update for the next CF.

Yea, unless somebody protests I'm planning to push this part soon.


> > 0004: code: s/master/$other/
> >This is most of the remaining uses of master in code. A number of
> >references to 'master' in the context of toast, a few uses of 'master
> >copy'. I guess some of these are a bit less clear cut.
> 
> Not sure I love authoritative, e.g.
> 
> +  * fullPageWrites is the authoritative value used by all backends to
> 
> and
> 
> +  * grabbed a WAL insertion lock to read the authoritative value in
> 
> Possibly "shared"?

I don't think shared is necessarily correct for all of these. E.g. in
the GetRedoRecPtr() there's two shared values at play, but only one is
"authoritative".


> +  * Create the Tcl interpreter subsidiary to pltcl_hold_interp.
> 
> Maybe use "worker" here? Not much we can do about the Tcl function name,
> though. It's pretty localized, though, so may not matter much.

I don't think it matters much what we use here.


> > 0008: docs: WIP multi-master rephrasing.
> >I like neither the new nor the old language much. I'd welcome input.
> 
> Why not multi-primary?

My understanding of primary is that there really can't be two things
that are primary in relation to each other. active/active is probably
the most common term in use besides multi-master.


> One last thing -- are we considering back-patching any/all of this?

I don't think there's a good reason to do so.

Thanks for the look!

Greetings,

Andres Freund




Re: language cleanups in code and docs

2020-06-16 Thread David Steele

Hi Andres,

Thanks for doing this!

On 6/15/20 2:22 PM, Andres Freund wrote:


We've removed the use of "slave" from most of the repo (one use
remained, included here), but we didn't do the same for master. In the
attached series I replaced most of the uses.

0001: tap tests: s/master/primary/
   Pretty clear cut imo.


Nothing to argue with here as far as I can see. It's a lot of churn, 
though, so the sooner it goes in the better so people can update for the 
next CF.



0002: code: s/master/primary/
   This also includes a few minor other changes (s/in master/on the
   primary/, a few 'the's added). Perhaps it'd be better to do those
   separately?


I would only commit the grammar/language separately if the commit as a 
whole does not go in. Some of them really do make the comments much 
clearer beyond just in/on.


I think the user-facing messages, e.g.:

-			 errhint("Make sure the configuration parameter \"%s\" is set on the 
master server.",
+			 errhint("Make sure the configuration parameter \"%s\" is set on the 
primary server.",


should go in no matter what so we are consistent with our documentation. 
Same for the postgresql.conf updates.



0003: code: s/master/leader/
   This feels pretty obvious. We've largely used the leader / worker
   terminology, but there were a few uses of master left.


Yeah, leader already outnumbers master by a lot. Looks good.


0004: code: s/master/$other/
   This is most of the remaining uses of master in code. A number of
   references to 'master' in the context of toast, a few uses of 'master
   copy'. I guess some of these are a bit less clear cut.


Not sure I love authoritative, e.g.

+* fullPageWrites is the authoritative value used by all backends to

and

+* grabbed a WAL insertion lock to read the authoritative value in

Possibly "shared"?

+* Create the Tcl interpreter subsidiary to pltcl_hold_interp.

Maybe use "worker" here? Not much we can do about the Tcl function name, 
though. It's pretty localized, though, so may not matter much.



0005: docs: s/master/primary/
   These seem mostly pretty straightforward to me. The changes in
   high-availability.sgml probably deserve the most attention.


+on primary and the current time on the standby. Delays in transfer

on *the* primary

I saw a few places where readability could be improved, but this patch 
did not make any of them worse, and did make a few better.



0006: docs: s/master/root/
   Here using root seems a lot better than master anyway (master seems
   confusing in regard to inheritance scenarios). But perhaps parent
   would be better? Went with root since it's about the topmost table.


I looked through to see if there was an instance of parent that should 
be changed to root but I didn't see any. Even if there are, it's no 
worse than before.



0007: docs: s/master/supervisor/
   I guess this could be a bit more contentious. Supervisor seems clearer
   to me, but I can see why people would disagree. See also later point
   about changes I have not done at this stage.


Supervisor seems OK to me, but the postmaster has a special place since 
there is only one of them. Main supervisor, root supervisor?



0008: docs: WIP multi-master rephrasing.
   I like neither the new nor the old language much. I'd welcome input.


Why not multi-primary?


After this series there are only two widespread uses of 'master' in the
tree.
1) 'postmaster'. As changing that would be somewhat invasive, the word
is a bit more ambiguous, and it's largely just internal, I've left
this alone for now. I personally would rather see this renamed as
supervisor, which'd imo actually also be a lot more
descriptive. I'm willing to do the work, but only if there's at least
some agreement.


FWIW, I don't consider this to be a very big change from an end-user 
perspective. I don't think the majority of users interact directly with 
the postmaster, rather they are using systemd, pg_ctl, pg_ctlcluster, etc.


As for postmaster.pid and postmaster.opts, we have renamed plenty of 
things in the past so this is just one more. They'd be better and 
clearer as postgresql.pid and postgresql.opts, IMO.


Overall I'm +.5 because I may just be ignorant of the pain this will cause.


2) 'master' as a reference to the branch. Personally I'd be in favor of
changing the branch name, but it seems like it'd be better done as a
somewhat separate discussion to me, as it affects development
practices to some degree.


Happily this has no end-user impact. I think "main" is a good 
alternative but I agree this feels like a separate topic.


One last thing -- are we considering back-patching any/all of this?

Regards,
--
-David
da...@pgmasters.net




Re: global barrier & atomics in signal handlers (Re: Atomic operations within spinlocks)

2020-06-16 Thread Andres Freund
Hi,

On 2020-06-16 14:59:19 -0400, Robert Haas wrote:
> On Mon, Jun 15, 2020 at 9:37 PM Andres Freund  wrote:
> > What do you think about 0002?
> >
> > With regard to the cost of the expensive test in 0003, I'm somewhat
> > inclined to add that to the buildfarm for a few days and see how it
> > actually affects the few bf animals without atomics. We can rip it out
> > after we got some additional coverage (or leave it in if it turns out to
> > be cheap enough in comparison).
>
> I looked over these patches briefly today. I don't have any objection
> to 0001 or 0002.

Cool. I was mainly interested in those for now.


> I think 0003 looks a little strange: it seems to be
> testing things that might be implementation details of other things,
> and I'm not sure that's really correct. In particular:

My main motivation was to have something that runs more often than
the embedded test in s_lock.c that nobody ever runs (they wouldn't even
pass with disabled spinlocks, as S_LOCK_FREE isn't implemented).


> + /* and that "contended" acquisition works */
> + s_lock(_w_lock.lock, "testfile", 17, "testfunc");
> + S_UNLOCK(_w_lock.lock);
>
> I didn't think we had formally promised that s_lock() is actually
> defined or working on all platforms.

Hm? Isn't s_lock the, as its comment says, "platform-independent portion
of waiting for a spinlock."?  I also don't think we need to purely
follow external APIs in internal tests.


> More generally, I don't think it's entirely clear what all of these
> tests are testing. Like, I can see that data_before and data_after are
> intended to test that the lock actually fits in the space allowed for
> it, but at the same time, I think empty implementations of all of
> these functions would pass regression, as would many horribly or
> subtly buggy implementations.

Sure, there's a lot that'd pass. But it's more than we had before. It
did catch a bug much quicker than I'd have otherwise found it, FWIW.

I don't think an empty implementation would pass btw, as long as TAS is
defined.

> So I feel like the intent of these tests isn't entirely clear, and
> should probably be explained better, at a minimum -- and perhaps we
> should think harder about what a good testing framework would look
> like.

Yea, we could use something better. But I don't see that happening
quickly, and having something seems better than nothing.


> I would rather have tests that either pass or fail and report a result
> explicitly, rather than tests that rely on hangs or crashes.

That seems quite hard to achieve. I really just wanted to have something
with which I can do some very basic tests to catch issues quicker.


The atomics tests found numerous issues btw, despite also not testing
concurrency.


I think we generally have way too few of such trivial tests. They can
find plenty "real world" issues, but more importantly make it much
quicker to iterate when working on some piece of code.


> I don't have any real complaints about the functionality of 0004 on a
> quick read-through, but I'm again a bit skeptical of the tests. Not as
> much as with 0003, though.

Without the tests I couldn't even reproduce a deadlock due to the
nesting. So they imo are pretty essential?

Greetings,

Andres Freund




Re: global barrier & atomics in signal handlers (Re: Atomic operations within spinlocks)

2020-06-16 Thread Alvaro Herrera
On 2020-Jun-15, Andres Freund wrote:

> > Also, I wonder if someone would be willing to set up a BF animal for this.
> 
> FWIW, I've requested a buildfarm animal id for this a few days ago, but
> haven't received a response yet...

I did send it out, with name rorqual -- didn't you get that?  Will send
the secret separately.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: global barrier & atomics in signal handlers (Re: Atomic operations within spinlocks)

2020-06-16 Thread Robert Haas
On Mon, Jun 15, 2020 at 9:37 PM Andres Freund  wrote:
> What do you think about 0002?
>
> With regard to the cost of the expensive test in 0003, I'm somewhat
> inclined to add that to the buildfarm for a few days and see how it
> actually affects the few bf animals without atomics. We can rip it out
> after we got some additional coverage (or leave it in if it turns out to
> be cheap enough in comparison).

I looked over these patches briefly today. I don't have any objection
to 0001 or 0002. I think 0003 looks a little strange: it seems to be
testing things that might be implementation details of other things,
and I'm not sure that's really correct. In particular:

+ /* and that "contended" acquisition works */
+ s_lock(_w_lock.lock, "testfile", 17, "testfunc");
+ S_UNLOCK(_w_lock.lock);

I didn't think we had formally promised that s_lock() is actually
defined or working on all platforms.

More generally, I don't think it's entirely clear what all of these
tests are testing. Like, I can see that data_before and data_after are
intended to test that the lock actually fits in the space allowed for
it, but at the same time, I think empty implementations of all of
these functions would pass regression, as would many horribly or
subtly buggy implementations. For example, consider this:

+ /* test basic operations via the SpinLock* API */
+ SpinLockInit(_w_lock.lock);
+ SpinLockAcquire(_w_lock.lock);
+ SpinLockRelease(_w_lock.lock);

What does it look like for this test to fail? I guess one of those
operations has to fail an assert or hang forever, because it's not
like we're checking the return value. So I feel like the intent of
these tests isn't entirely clear, and should probably be explained
better, at a minimum -- and perhaps we should think harder about what
a good testing framework would look like. I would rather have tests
that either pass or fail and report a result explicitly, rather than
tests that rely on hangs or crashes.

Parenthetically, "cyle" != "cycle".

I don't have any real complaints about the functionality of 0004 on a
quick read-through, but I'm again a bit skeptical of the tests. Not as
much as with 0003, though.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Review for GetWALAvailability()

2020-06-16 Thread Alvaro Herrera
On 2020-Jun-17, Fujii Masao wrote:

> While reading InvalidateObsoleteReplicationSlots() code, I found another 
> issue.
> 
>   ereport(LOG,
>   (errmsg("terminating walsender %d 
> because replication slot \"%s\" is too far behind",
>   wspid, 
> NameStr(slotname;
>   (void) kill(wspid, SIGTERM);
> 
> wspid indicates the PID of process using the slot. That process can be
> a backend, for example, executing pg_replication_slot_advance().
> So "walsender" in the above log message is not always correct.

Good point.

>   int wspid = 
> ReplicationSlotAcquire(NameStr(slotname),
>   
>SAB_Inquire);
> 
> Why do we need to call ReplicationSlotAcquire() here and mark the slot as
> used by the checkpointer? Isn't it enough to check directly the slot's
> active_pid, instead?
> 
> Maybe ReplicationSlotAcquire() is necessary because
> ReplicationSlotRelease() is called later? If so, why do we need to call
> ReplicationSlotRelease()? ISTM that we don't need to do that if the slot's
> active_pid is zero. No?

I think the point here was that in order to modify the slot you have to
acquire it -- it's not valid to modify a slot you don't own.


> + /*
> +  * Signal to terminate the process using the replication slot.
> +  *
> +  * Try to signal every 100ms until it succeeds.
> +  */
> + if (!killed && kill(active_pid, SIGTERM) == 0)
> + killed = true;
> + ConditionVariableTimedSleep(&slot->active_cv, 100,
> +                             WAIT_EVENT_REPLICATION_SLOT_DROP);
> + } while (ReplicationSlotIsActive(slot, NULL));

Note that here you're signalling only once and then sleeping many times
in increments of 100ms -- you're not signalling every 100ms as the
comment claims -- unless the signal fails, but you don't really expect
that.  On the contrary, I'd claim that the logic is reversed: if the
signal fails, *then* you should stop signalling.
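
To make that concrete, the loop restructured along those lines might look
like this (a sketch following the names in the quoted patch, not a
committed fix):

	for (;;)
	{
		if (kill(active_pid, SIGTERM) != 0)
			break;				/* signal failed: process is gone, stop */

		ConditionVariableTimedSleep(&slot->active_cv, 100,
									WAIT_EVENT_REPLICATION_SLOT_DROP);

		if (!ReplicationSlotIsActive(slot, NULL))
			break;				/* slot was released, we are done */
	}

That way the signal really is retried every 100ms, and signalling stops
exactly when it fails.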

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Review for GetWALAvailability()

2020-06-16 Thread Alvaro Herrera
On 2020-Jun-16, Kyotaro Horiguchi wrote:

> I noticed another issue. If some required WALs are removed, the
> slot will be "invalidated", that is, restart_lsn is set to an invalid
> value. As a result, we hardly ever see the "lost" state.
> 
> It can be "fixed" by remembering the validity of a slot separately
> from restart_lsn. Is that worth doing?

We discussed this before.  I agree it would be better to do this
in some way, but I fear that if we do it naively, some code might exist
that reads the LSN without realizing that it needs to check the validity
flag first.

On the other hand, maybe this is not a problem in practice, because if
such a bug occurs, what will happen is that trying to read WAL from such
a slot will return the error message that the WAL file cannot be found.
Maybe this is acceptable?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




PostgreSQL 13 Beta 2 Release Date

2020-06-16 Thread Jonathan S. Katz
Hi,

Sticking with precedent for the timing of a Beta 2, the RMT[1] has set
the PostgreSQL 13 Beta 2 release date to be June 25, 2020. As such, if
you have open items[2] that you can finish by the end of this weekend
(June 21 AOE), please do so :)

Thanks,

Jonathan

[1] https://wiki.postgresql.org/wiki/Release_Management_Team
[2] https://wiki.postgresql.org/wiki/PostgreSQL_13_Open_Items





Re: Infinities in type numeric

2020-06-16 Thread Tom Lane
Dean Rasheed  writes:
> On Fri, 12 Jun 2020 at 02:16, Tom Lane  wrote:
>> * I had to invent some semantics for non-standardized functions,
>> particularly numeric_mod, numeric_gcd, numeric_lcm.  This area
>> could use review to be sure that I chose desirable behaviors.

> I think the semantics you've chosen for numeric_mod() are reasonable,
> and I think they're consistent with POSIX fmod().

Ah, I had not thought to look at fmod().  I see that POSIX treats
x-is-infinite the same as y-is-zero: raise EDOM and return NaN.
I think it's okay to deviate to the extent of throwing an error in
one case and returning NaN in the other, but I added a comment
noting the difference.
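
For the record, the resulting behavior can be sketched like this (assuming
numeric.c-internal helpers along the lines of NUMERIC_IS_NAN(),
NUMERIC_IS_INF(), and make_result(); the patch is authoritative):

	if (NUMERIC_IS_NAN(num1) || NUMERIC_IS_NAN(num2))
		PG_RETURN_NUMERIC(make_result(&const_nan));	/* NaN in, NaN out */
	if (NUMERIC_IS_INF(num1))
		PG_RETURN_NUMERIC(make_result(&const_nan));	/* POSIX: EDOM + NaN */
	/* a zero divisor still raises "division by zero", whereas POSIX
	 * raises EDOM and returns NaN for that case too */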

> Similar arguments apply to lcm(), so I'd say that both gcd() and lcm()
> should return 'NaN' if either input is 'Inf' or 'NaN'.

Works for me; that's easier anyway.

The attached v3 patch fixes these things and also takes care of an
oversight in v2: I'd made numeric() apply typmod restrictions to Inf,
but not numeric_in() or numeric_recv().  I believe the patch itself
is in pretty good shape now, though there are still some issues
elsewhere as noted in the first message in this thread.

Thanks for reviewing!

regards, tom lane

diff --git a/contrib/jsonb_plperl/jsonb_plperl.c b/contrib/jsonb_plperl/jsonb_plperl.c
index ed361efbe2..b81ba54b80 100644
--- a/contrib/jsonb_plperl/jsonb_plperl.c
+++ b/contrib/jsonb_plperl/jsonb_plperl.c
@@ -227,10 +227,8 @@ SV_to_JsonbValue(SV *in, JsonbParseState **jsonb_state, bool is_elem)
 /*
  * jsonb doesn't allow infinity or NaN (per JSON
  * specification), but the numeric type that is used for the
- * storage accepts NaN, so we have to prevent it here
- * explicitly.  We don't really have to check for isinf()
- * here, as numeric doesn't allow it and it would be caught
- * later, but it makes for a nicer error message.
+ * storage accepts those, so we have to reject them here
+ * explicitly.
  */
 if (isinf(nval))
 	ereport(ERROR,
diff --git a/contrib/jsonb_plpython/jsonb_plpython.c b/contrib/jsonb_plpython/jsonb_plpython.c
index e09308daf0..836c178770 100644
--- a/contrib/jsonb_plpython/jsonb_plpython.c
+++ b/contrib/jsonb_plpython/jsonb_plpython.c
@@ -387,14 +387,17 @@ PLyNumber_ToJsonbValue(PyObject *obj, JsonbValue *jbvNum)
 	pfree(str);
 
 	/*
-	 * jsonb doesn't allow NaN (per JSON specification), so we have to prevent
-	 * it here explicitly.  (Infinity is also not allowed in jsonb, but
-	 * numeric_in above already catches that.)
+	 * jsonb doesn't allow NaN or infinity (per JSON specification), so we
+	 * have to reject those here explicitly.
 	 */
 	if (numeric_is_nan(num))
 		ereport(ERROR,
 (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
  errmsg("cannot convert NaN to jsonb")));
+	if (numeric_is_inf(num))
+		ereport(ERROR,
+(errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+ errmsg("cannot convert infinity to jsonb")));
 
 	jbvNum->type = jbvNumeric;
 	jbvNum->val.numeric = num;
diff --git a/doc/src/sgml/datatype.sgml b/doc/src/sgml/datatype.sgml
index 3df189ad85..a9ed269e15 100644
--- a/doc/src/sgml/datatype.sgml
+++ b/doc/src/sgml/datatype.sgml
@@ -554,9 +554,9 @@ NUMERIC(precision)
 
 NUMERIC
 
- without any precision or scale creates a column in which numeric
- values of any precision and scale can be stored, up to the
- implementation limit on precision.  A column of this kind will
+ without any precision or scale creates an unconstrained
+ numeric column in which numeric values of any length can be
+ stored, up to the implementation limits.  A column of this kind will
  not coerce input values to any particular scale, whereas
  numeric columns with a declared scale will coerce
  input values to that scale.  (The SQL standard
@@ -568,10 +568,10 @@ NUMERIC
 
 
  
-  The maximum allowed precision when explicitly specified in the
-  type declaration is 1000; NUMERIC without a specified
-  precision is subject to the limits described in .
+  The maximum precision that can be explicitly specified in
+  a NUMERIC type declaration is 1000.  An
+  unconstrained NUMERIC column is subject to the limits
+  described in .
  
 
 
@@ -593,6 +593,11 @@ NUMERIC
  plus three to eight bytes overhead.
 
 
+
+ infinity
+ numeric (data type)
+
+
 
  NaN
  not a number
@@ -604,13 +609,39 @@ NUMERIC
 
 
 
- In addition to ordinary numeric values, the numeric
- type allows the special value NaN, meaning
- not-a-number.  Any operation on NaN
- yields another NaN.  When writing this value
- as a constant in an SQL command, you must put quotes around it,
- for example UPDATE table SET x = 'NaN'.  On input,
- the string NaN is recognized in a case-insensitive manner.
+ In addition to ordinary numeric values, the numeric type
+ has several special values:
+
+Infinity

Re: Parallel Seq Scan vs kernel read ahead

2020-06-16 Thread Robert Haas
On Tue, Jun 16, 2020 at 6:57 AM Amit Kapila  wrote:
> I agree that won't be a common scenario but apart from that also I am
> not sure if we can conclude that the proposed patch won't cause any
> regressions.  See one of the tests [1] done by Soumyadeep where the
> patch has caused regression in one of the cases, now we can either try
> to improve the patch and see we didn't cause any regressions or assume
> that those are some minority cases which we don't care about.  Another point
> is that this thread has started with a theory that this idea can give
> benefits on certain filesystems and AFAICS we have tested it on one
> other type of system, so not sure if that is sufficient.

Yeah, it seems like those cases might need some more investigation,
but they're also not necessarily an argument for a configuration
setting. It's not so much that I dislike the idea of being able to
configure something here; it's really that I don't want a reloption
that feels like magic. For example, we know that work_mem can be
really hard to configure because there may be no value that's high
enough to make your queries run fast during normal periods but low
enough to avoid running out of memory during busy periods. That kind
of thing sucks, and we should avoid creating more such cases.

One problem here is that the best value might depend not only on the
relation but on the individual query. A GUC could be changed
per-query, but different tables in the query might need different
values. Changing a reloption requires locking, and you wouldn't want
to have to keep changing it for each different query. Now if we figure
out that something is hardware-dependent -- like we come up with a
good formula that adjusts the value automatically most of the time,
but say it needs to be higher on SSDs than on spinning disks or the
other way around, well then that's a good candidate for some kind of
setting, maybe a tablespace option. But if it seems to depend on the
query, we need a better idea, not a user-configurable setting.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Transactions involving multiple postgres foreign servers, take 2

2020-06-16 Thread Bruce Momjian
On Tue, Jun 16, 2020 at 06:42:52PM +0530, Ashutosh Bapat wrote:
> > Is there some mapping between GXID and XIDs allocated for each node or
> > will each node use the GXID as XID to modify the data?   Are we fine
> > with parking the work for global snapshots and atomic visibility to a
> > separate patch and just proceed with the design proposed by this
> > patch?
> 
> Distributed transaction involves, atomic commit,  atomic visibility
> and global consistency. 2PC is the only practical solution for atomic
> commit. There are some improvements over 2PC but those are add ons to
> the basic 2PC, which is what this patch provides. Atomic visibility
> and global consistency however have alternative solutions but all of
> those solutions require 2PC to be supported. Each of those are large
> pieces of work and trying to get everything in may not work. Once we
> have basic 2PC in place, there will be a ground to experiment with
> solutions for global consistency and atomic visibility. If we manage
> to do it right, we could make it pluggable as well. So, I think we
> should concentrate on supporting basic 2PC work now.

Very good summary, thank you.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: hashagg slowdown due to spill changes

2020-06-16 Thread Tomas Vondra

On Mon, Jun 15, 2020 at 07:38:45PM -0700, Jeff Davis wrote:

On Mon, 2020-06-15 at 11:19 -0400, Robert Haas wrote:

On Mon, Jun 15, 2020 at 9:34 AM Tomas Vondra
 wrote:
> But just reverting 4cad2534d will make this much worse, I think, as
> illustrated by the benchmarks I did in [1].

I share this concern, although I do not know what we should do about
it.


I attached an updated version of Melanie's patch, combined with the
changes to copy only the necessary attributes to a new slot before
spilling. There are a couple changes:

* I didn't see a reason to descend into a GroupingFunc node, so I
removed that.

* I used a flag in the context rather than two separate callbacks to
the expression walker.
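
A rough sketch of what the flag-in-context approach looks like (the actual
patch's names and details may differ; bms_add_member() and
expression_tree_walker() are the usual nodeFuncs.c machinery):

	typedef struct FindColsContext
	{
		bool		is_aggref;		/* currently below an Aggref? */
		Bitmapset  *aggregated;		/* column refs under an Aggref */
		Bitmapset  *unaggregated;	/* all other column refs */
	} FindColsContext;

	static bool
	find_cols_walker(Node *node, FindColsContext *context)
	{
		if (node == NULL)
			return false;
		if (IsA(node, Var))
		{
			Var		   *var = (Var *) node;

			if (context->is_aggref)
				context->aggregated = bms_add_member(context->aggregated,
													 var->varattno);
			else
				context->unaggregated = bms_add_member(context->unaggregated,
													   var->varattno);
			return false;
		}
		if (IsA(node, Aggref))
		{
			bool		save_is_aggref = context->is_aggref;

			/* the flag replaces a second walker callback */
			context->is_aggref = true;
			expression_tree_walker(node, find_cols_walker, (void *) context);
			context->is_aggref = save_is_aggref;
			return false;
		}
		return expression_tree_walker(node, find_cols_walker, (void *) context);
	}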

This patch gives the space benefits that we see on master, without the
regression for small numbers of tuples. I saw a little bit of noise in
my test results, but I'm pretty sure it's a win all around. It could
use some review/cleanup though.



Looks reasonable. I can't redo my tests at the moment; the machine is
busy with something else. I'll give it a try over the weekend.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: language cleanups in code and docs

2020-06-16 Thread Bruce Momjian
On Mon, Jun 15, 2020 at 11:22:35AM -0700, Andres Freund wrote:
> 0006: docs: s/master/root/
>   Here using root seems a lot better than master anyway (master seems
>   confusing in regard to inheritance scenarios). But perhaps parent
>   would be better? Went with root since it's about the topmost table.

Because we allow multiple levels of inheritance, I have always wanted a
clear term for the top-most parent.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: slow get_actual_variable_range with long running transactions

2020-06-16 Thread Marc Cousin
On 16/06/2020 18:28, Marc Cousin wrote:
> Oh, sorry about that, I forgot to detail this. I tested on both 10.13 (which 
> is the production environment on which we faced this), and on 12.3, with the 
> same problem.
> 
> On 16/06/2020 17:51, Tom Lane wrote:
>> Marc Cousin  writes:
>>> Of course, what happens here is that the histogram says that max(a) is 
>>> 100, and get_actual_variable_range verifies the real upper bound. And 
>>> has to read quite a few dead index records.
>>
>> We've revised that logic several times to reduce the scope of the
>> problem.  Since you didn't say which PG version you're using exactly,
>> it's hard to offer any concrete suggestions.  But I fear there's
>> not a lot you can do about it, other than upgrading if you're on a
>> pre-v11 release.
>>
>>  regards, tom lane
>>
> 
As you suggested, I did some more tests on PG 12.

The first attempt is the same (maybe some hint bits being set or something
like that); the second is much faster. So nothing to see here, sorry for the
noise.

And sorry for the previous top posting :)

Regards





Re: slow get_actual_variable_range with long running transactions

2020-06-16 Thread Marc Cousin
Oh, sorry about that, I forgot to detail this. I tested on both 10.13 (which is 
the production environment on which we faced this), and on 12.3, with the same 
problem.

On 16/06/2020 17:51, Tom Lane wrote:
> Marc Cousin  writes:
>> Of course, what happens here is that the histogram says that max(a) is 
>> 100, and get_actual_variable_range verifies the real upper bound. And 
>> has to read quite a few dead index records.
> 
> We've revised that logic several times to reduce the scope of the
> problem.  Since you didn't say which PG version you're using exactly,
> it's hard to offer any concrete suggestions.  But I fear there's
> not a lot you can do about it, other than upgrading if you're on a
> pre-v11 release.
> 
>   regards, tom lane
> 





Re: slow get_actual_variable_range with long running transactions

2020-06-16 Thread Tom Lane
Marc Cousin  writes:
> Of course, what happens here is that the histogram says that max(a) is 
> 100, and get_actual_variable_range verifies the real upper bound. And has 
> to read quite a few dead index records.

We've revised that logic several times to reduce the scope of the
problem.  Since you didn't say which PG version you're using exactly,
it's hard to offer any concrete suggestions.  But I fear there's
not a lot you can do about it, other than upgrading if you're on a
pre-v11 release.

regards, tom lane




Re: Review for GetWALAvailability()

2020-06-16 Thread Fujii Masao



On 2020/06/16 14:00, Kyotaro Horiguchi wrote:

At Tue, 16 Jun 2020 01:46:21 +0900, Fujii Masao  
wrote in

In short, it is known behavior, but it was judged not worth preventing.
That can happen when the checkpointer removes up to the segment that is
being read by a walsender.  I think that doesn't happen (or happens
within a narrow time window?) for physical replication but does happen
for logical replication.
During development, I once added code to walsender to exit for that
reason, but it was eventually moved to InvalidateObsoleteReplicationSlots
as a somewhat different function.


BTW, I read the code of InvalidateObsoleteReplicationSlots() and
probably
found some issues in it.

1. Each cycle of the "for" loop in InvalidateObsoleteReplicationSlots()
 emits the log message "terminating walsender ...". This means that
 if it takes more than 10ms for the walsender to exit after it's signaled,
 the second and subsequent cycles would happen and output the same
 log message several times. IMO that log message should be output
 only once.


Sounds reasonable.


2. InvalidateObsoleteReplicationSlots() uses a loop to scan the replication
 slots array and uses the "for" loop in each scan. Also it calls
 ReplicationSlotAcquire() for each "for" loop cycle, and
 ReplicationSlotAcquire() uses another loop to scan the replication slots
 array. I don't think this is good design.

 ISTM that we can get rid of ReplicationSlotAcquire()'s loop because
 InvalidateObsoleteReplicationSlots() already knows the index of the slot
 that we want to find. The attached patch does that. Thoughts?


The inner loop is expected to run at most several times per
checkpoint, which won't be a serious problem. However, it is better if
we can get rid of that in a reasonable way.

The attached patch changes the behavior for SAB_Block. Before the
patch, it rescans from the first slot for the same name, but with the
patch it just rechecks the same slot.  The only caller of the function
with SAB_Block is ReplicationSlotDrop and I don't come up with a case
where another slot with the same name is created at different place
before the condition variable fires. But I'm not sure the change is
completely safe.


Yes, that change might not be safe. So I'm thinking another approach to
fix the issues.


Maybe some assertion is needed?


3. There is a corner case where the termination of the walsender cleans up
 the temporary replication slot while InvalidateObsoleteReplicationSlots()
 is sleeping on ConditionVariableTimedSleep(). In this case,
 ReplicationSlotAcquire() is called in the subsequent cycle of the "for"
 loop, cannot find the slot, and then emits an ERROR message. This leads
 to the failure of the checkpoint by the checkpointer.


Agreed.


 To avoid this case, if SAB_Inquire is specified, ReplicationSlotAcquire()
 should return a special value instead of emitting an ERROR even when
 it cannot find the slot. Also, InvalidateObsoleteReplicationSlots()
 should handle that special returned value.


I thought the same thing hearing that.


While reading InvalidateObsoleteReplicationSlots() code, I found another issue.

ereport(LOG,
(errmsg("terminating walsender %d because replication 
slot \"%s\" is too far behind",
wspid, 
NameStr(slotname;
(void) kill(wspid, SIGTERM);

wspid indicates the PID of process using the slot. That process can be
a backend, for example, executing pg_replication_slot_advance().
So "walsender" in the above log message is not always correct.



int wspid = 
ReplicationSlotAcquire(NameStr(slotname),

   SAB_Inquire);

Why do we need to call ReplicationSlotAcquire() here and mark the slot as
used by the checkpointer? Isn't it enough to check directly the slot's
active_pid, instead?

Maybe ReplicationSlotAcquire() is necessary because
ReplicationSlotRelease() is called later? If so, why do we need to call
ReplicationSlotRelease()? ISTM that we don't need to do that if the slot's
active_pid is zero. No?

If my understanding is right, I'd like to propose the attached patch.
It introduces DeactivateReplicationSlot() and replace the "for" loop
in InvalidateObsoleteReplicationSlots() with it. ReplicationSlotAcquire()
and ReplicationSlotRelease() are no longer called there.
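
To spell out the idea, DeactivateReplicationSlot() would look roughly like
this (a sketch only; the attached patch is authoritative):

	static void
	DeactivateReplicationSlot(ReplicationSlot *slot)
	{
		bool		signaled = false;

		ConditionVariablePrepareToSleep(&slot->active_cv);

		for (;;)
		{
			pid_t		active_pid;

			SpinLockAcquire(&slot->mutex);
			active_pid = slot->active_pid;
			SpinLockRelease(&slot->mutex);

			if (active_pid == 0)
				break;			/* no one is using the slot anymore */

			if (!signaled && kill(active_pid, SIGTERM) == 0)
				signaled = true;

			ConditionVariableTimedSleep(&slot->active_cv, 100,
										WAIT_EVENT_REPLICATION_SLOT_DROP);
		}

		ConditionVariableCancelSleep();
	}

Note that the slot is never acquired or released here; we only wait for
its active_pid to go to zero.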

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 505445f2dc..b89b6da768 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -99,6 +99,8 @@ ReplicationSlot 

Re: snowball ASCII stemmer configuration

2020-06-16 Thread Tom Lane
Mark Dilger  writes:
> I am a bit surprised to see that you are right about this, because non-Latin
> languages often have transliteration/romanization schemes for writing the
> language in the Latin alphabet, developed before computers had widespread 
> adoption of non-ASCII character sets, and still in use today for text 
> messaging.  I expected to find stemming rules for transliterated words, but 
> can't find any indication of that, neither in the postgres sources, nor in 
> the snowball sources I pulled from their repo.  Is there some architectural 
> separation of stemming from transliteration such that we'd never need to 
> worry about it?  If snowball ever published stemmers for transliterated text, 
> we might have to revisit this issue, but for now your proposed change sounds 
> fine to me.

Agreed, if the Snowball stemmers worked on romanized texts then the
situation would be different.  But they don't, AFAICS.  Don't know
if that is architectural, or a policy decision, or just lack of
round tuits.

The thing that I actually find a bit shaky in this area is our
architectural decision to route words to different dictionaries
depending on whether they are all-ASCII or not.  AIUI that was
done purely on the basis of the Russian/English case; it would
fail badly if say you wanted to separate Russian from French.
However, I have no great desire to revisit that design right now.

regards, tom lane




Re: snowball ASCII stemmer configuration

2020-06-16 Thread Mark Dilger



> On Jun 16, 2020, at 7:37 AM, Tom Lane  wrote:
> 
> I wrote:
>> Peter Eisentraut  writes:
>>> Moreover, AFAIK, the following other languages do not use Latin-based 
>>> alphabets:
> 
>>> arabic  arabic  \
>>> greek   greek   \
>>> nepali  nepali  \
>>> tamil   tamil   \
> 
>> Hmm.  I think all of those entries are ones that got added by me while
>> absorbing post-2007 Snowball updates, and I confess that I did not think
>> about this point.  Maybe these should be changed.
> 
> After further reflection, I think these are indeed mistakes and we should
> change them all.  The argument for the Russian/English case, AIUI, is
> "if we come across an all-ASCII word, it is most certainly not Russian,
> and the most likely Latin-based language is English".  Given the world
> as it is, I think the same argument works for all non-Latin-alphabet
> languages.  Obviously specific applications might have a different idea
> of the best fallback language, but that's why we let users make their
> own text search configurations.  For general-purpose use, falling back
> to English seems reasonable.  And we can be dead certain that applying
> a Greek stemmer to an ASCII word will do nothing useful, so the
> configuration choice shown above is unhelpful.

I am a bit surprised to see that you are right about this, because non-Latin
languages often have transliteration/romanization schemes for writing the
language in the Latin alphabet, developed before computers had widespread 
adoption of non-ASCII character sets, and still in use today for text 
messaging.  I expected to find stemming rules for transliterated words, but 
can't find any indication of that, neither in the postgres sources, nor in the 
snowball sources I pulled from their repo.  Is there some architectural 
separation of stemming from transliteration such that we'd never need to worry 
about it?  If snowball ever published stemmers for transliterated text, we 
might have to revisit this issue, but for now your proposed change sounds fine 
to me.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: Parallel Seq Scan vs kernel read ahead

2020-06-16 Thread Robert Haas
On Mon, Jun 15, 2020 at 5:09 PM David Rowley  wrote:
> To summarise what's all been proposed so far:
>
> 1. Use a constant, (e.g. 64) as the parallel step size
> 2. Ramp up the step size over time
> 3. Ramp down the step size towards the end of the scan.
> 4. Auto-determine a good stepsize based on the size of the relation.
> 5. Add GUC to allow users to control or override the step size.
> 6. Add reloption to allow users to control or override the step size.
>
> Here are my thoughts on each of those:
>
> #1 is a bad idea as there are legitimate use-cases for using parallel
> query on small tables, e.g. calling some expensive parallel-safe
> function. Small tables are more likely to be cached.

I agree.

> #2 I don't quite understand why this is useful

I was thinking that if the query had a small LIMIT, you'd want to
avoid handing out excessively large chunks, but actually that seems
like it might just be fuzzy thinking on my part. We're not committing
to scanning the entirety of the chunk just because we've assigned it
to a worker.

> #3 I understand this is to try to make it so workers all complete
> around about the same time.
> #4 We really should be doing it this way.
> #5 Having a global knob to control something that is very specific to
> the size of a relation does not make much sense to me.
> #6. I imagine someone will have some weird use-case that works better
> when parallel workers get 1 page at a time. I'm not convinced that
> they're not doing something else wrong.

Agree with all of that.

> So my vote is for 4 with possibly 3, if we can come up with something
> smart enough * that works well in parallel.  I think there's less of a
> need for this if we divided the relation into more chunks, e.g. 8192
> or 16384.

I agree with that too.

> * Perhaps when there are less than 2 full chunks remaining, workers
> can just take half of what is left. Or more specifically
> Max(pg_next_power2(remaining_blocks) / 2, 1), which ideally would work
> out allocating an amount of pages proportional to the amount of beer
> each mathematician receives in the "An infinite number of
> mathematicians walk into a bar" joke, obviously with the exception
> that we stop dividing when we get to 1. However, I'm not quite sure
> how well that can be made to work with multiple bartenders working in
> parallel.
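
For concreteness, the halving rule described above would be something like
this sketch (assuming a pg_nextpower2_64()-style helper as in
src/include/port/pg_bitutils.h; illustrative only):

	/* hand out about half of the remaining blocks, but at least one */
	static int64
	next_chunk_size(int64 remaining_blocks)
	{
		return Max(pg_nextpower2_64(remaining_blocks) / 2, 1);
	}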

That doesn't sound nearly aggressive enough to me. I mean, let's
suppose that we're concerned about the scenario where one chunk takes
50x as long as all the other chunks. Well, if we have 1024 chunks
total, and we hit the problem chunk near the beginning, there will be
no problem. In effect, there are 1073 units of work instead of 1024,
and we accidentally assigned one guy 50 units of work when we thought
we were assigning 1 unit of work. If there's enough work left that we
can assign each other worker 49 units more than what we would have
done had that chunk been the same cost as all the others, then there's
no problem. So for instance if there are 4 workers, we can still even
things out if we hit the problematic chunk more than ~150 chunks from
the end. If we're closer to the end than that, there's no way to avoid
the slow chunk delaying the overall completion time, and the problem
gets worse as the problem chunk gets closer to the end.

How can we improve? Well, if, when we're less than 150 chunks from the
end, we reduce the chunk size by 2x, then instead of having 1 chunk
that is 50x as expensive, hopefully we'll have 2 smaller chunks that
are each 50x as expensive. They'll get assigned to 2 different
workers, and the remaining 2 workers now need enough extra work from
other chunks to even out the work distribution, which should still be
possible. It gets tough though if breaking the one expensive chunk in
half produces 1 regular-price half-chunk and one half-chunk that is
50x as expensive as all the others. Because we have <150 chunks left,
there's no way to keep everybody else busy until the sole expensive
half-chunk completes. In a sufficiently-extreme scenario, assigning
even a single full block to a worker is too much, and you really want
to hand the tuples out individually.

Anyway, if we don't do anything special until we get down to the last
2 chunks, it's only going to make a difference when one of those last
2 chunks happens to be the expensive one. If say the third-to-last
chunk is the expensive one, subdividing the last 2 chunks lets all the
workers who didn't get the expensive chunk fight over the scraps, but
that's not an improvement. If anything it's worse, because there's
more communication overhead and you don't gain anything vs. just
assigning each chunk to a worker straight up.

In a real example we don't know that we have a single expensive chunk
-- each chunk just has its own cost, and they could all be the same,
or some could be much more expensive. When we have a lot of work left,
we can be fairly cavalier in handing out larger chunks of it with the
full confidence that even if some 

slow get_actual_variable_range with long running transactions

2020-06-16 Thread Marc Cousin
Hi,

We're having an issue with planner performance when doing large deletes at the 
same time as we have long running transactions, from what we gathered, because 
of the scan to find the actual minimum and maximum values of the table.

Instead of trying to explain what happens, here is a very simple example:

(Our load is very similar to this scenario)

The "test" table has one column with an index on it.

Session 1:
=# insert into test select generate_series(1,1000);

Session 2: do the long running transaction:
=# begin;
=# do_whatever_to_get_a_long_running_transaction

Session 1:
=# delete from test where a>100;
=# analyze test;
=# explain select * from test where a > 1100;
  QUERY PLAN  
--
 Index Only Scan using idxa on test  (cost=0.42..4.44 rows=1 width=4)
   Index Cond: (a > 1100)
(2 rows)

Time: 2606,068 ms (00:02,606)

Of course, what happens here is that the histogram says that max(a) is 100,
and get_actual_variable_range verifies the real upper bound, and so has to
read quite a few dead index records.

Is there something we can do to avoid the problem (other than getting rid of
the long running transaction, which unfortunately is out of our control)?

Regards





Use TableAm API in pg_table_size

2020-06-16 Thread Georgios
Hi,

I noticed that there are several file layout assumptions in dbsize.c which 
might not hold true for non-heap relations attached via the TableAm API. It 
seems logical that, in order to retrieve the disk size of a relation, the 
existing size callback should be used instead.

A small patch is included to demonstrate what such an implementation can look 
like. Also, the existing method for heap, table_block_relation_size, should be 
able to address the valid cases where a fork number does not exist.
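
As a rough sketch of the direction (my illustration, not the attached
patch; it assumes the relation is already open and ignores locking):

#include "postgres.h"
#include "access/tableam.h"
#include "common/relpath.h"
#include "utils/rel.h"

/*
 * Size a relation by asking its table AM, summing over all forks,
 * instead of making assumptions about its file layout.
 */
static uint64
table_size_via_tableam(Relation rel)
{
    uint64      size = 0;
    ForkNumber  fork;

    for (fork = 0; fork <= MAX_FORKNUM; fork++)
        size += table_relation_size(rel, fork);

    return size;
}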

If this is considered valid, then the same can be applied to indexes too. The 
more generic calculate_relation_size can be adapted to call into the TableAm 
for those kinds of relations where that makes sense. If agreed, a more complete 
patch can be provided.

Cheers,
//Georgios

0001-Use-tableam-for-pg_table_size.patch
Description: Binary data


Re: snowball ASCII stemmer configuration

2020-06-16 Thread Tom Lane
I wrote:
> Peter Eisentraut  writes:
>> Moreover, AFAIK, the following other languages do not use Latin-based 
>> alphabets:

>> arabic  arabic  \
>> greek   greek   \
>> nepali  nepali  \
>> tamil   tamil   \

> Hmm.  I think all of those entries are ones that got added by me while
> absorbing post-2007 Snowball updates, and I confess that I did not think
> about this point.  Maybe these should be changed.

After further reflection, I think these are indeed mistakes and we should
change them all.  The argument for the Russian/English case, AIUI, is
"if we come across an all-ASCII word, it is most certainly not Russian,
and the most likely Latin-based language is English".  Given the world
as it is, I think the same argument works for all non-Latin-alphabet
languages.  Obviously specific applications might have a different idea
of the best fallback language, but that's why we let users make their
own text search configurations.  For general-purpose use, falling back
to English seems reasonable.  And we can be dead certain that applying
a Greek stemmer to an ASCII word will do nothing useful, so the
configuration choice shown above is unhelpful.

regards, tom lane




Re: Transactions involving multiple postgres foreign servers, take 2

2020-06-16 Thread Tatsuo Ishii
>> > I think the problem mentioned above can occur with this as well or if
>> > I am missing something then can you explain in further detail how it
>> > won't create problem in the scenario I have used above?
>>
>> So the problem you mentioned above is like this? (S1/S2 denotes
>> transactions (sessions), N1/N2 is the postgreSQL servers).  Since S1
>> already committed on N1, S2 sees the row on N1.  However S2 does not
>> see the row on N2 since S1 has not committed on N2 yet.
>>
> 
> Yeah, something on these lines but S2 can execute the query on N1
> directly which should fetch the data from both N1 and N2.

The algorithm assumes that any client accesses the database through a
middleware. Such direct access is prohibited.

> Even if
> there is a solution using REPEATABLE READ isolation level we might not
> prefer to use that as the only level for distributed transactions, it
> might be too costly but let us first see how does it solve the
> problem?

The paper extends Snapshot Isolation (SI, which is the same as our
REPEATABLE READ isolation level) to "Global Snapshot Isolation" (GSI).
I think GSI will solve the problem (atomic visibility) we are
discussing.

Unlike READ COMMITTED, REPEATABLE READ acquires its snapshot at the time
when the first command is executed in a transaction (READ COMMITTED
acquires a snapshot at each command in a transaction). Pangea controls
the timing of the snapshot acquisition on each pair of transactions
(S1/N1,N2 or S2/N1,N2) so that each pair acquires the same
snapshot. To achieve this, while some transactions are trying to
acquire a snapshot, any commit operation is postponed. Likewise,
any snapshot acquisition has to wait until any in-progress commit
operations are finished (see Algorithms I to III in the paper for more
details; a sketch of this rule follows the example below). With this
rule, the previous example now looks like this:
you can see the SELECTs on S2/N1 and S2/N2 give the same result.

S1/N1: DROP TABLE t1;
DROP TABLE
S1/N1: CREATE TABLE t1(i int);
CREATE TABLE
S1/N2: DROP TABLE t1;
DROP TABLE
S1/N2: CREATE TABLE t1(i int);
CREATE TABLE
S1/N1: BEGIN;
BEGIN
S1/N2: BEGIN;
BEGIN
S2/N1: BEGIN;
BEGIN
S1/N1: SET transaction_isolation TO 'repeatable read';
SET
S1/N2: SET transaction_isolation TO 'repeatable read';
SET
S2/N1: SET transaction_isolation TO 'repeatable read';
SET
S1/N1: INSERT INTO t1 VALUES (1);
INSERT 0 1
S1/N2: INSERT INTO t1 VALUES (1);
INSERT 0 1
S2/N1: SELECT * FROM t1;
 i 
---
(0 rows)

S2/N2: SELECT * FROM t1;
 i 
---
(0 rows)

S1/N1: PREPARE TRANSACTION 's1n1';
PREPARE TRANSACTION
S1/N2: PREPARE TRANSACTION 's1n2';
PREPARE TRANSACTION
S2/N1: PREPARE TRANSACTION 's2n1';
PREPARE TRANSACTION
S1/N1: COMMIT PREPARED 's1n1';
COMMIT PREPARED
S1/N2: COMMIT PREPARED 's1n2';
COMMIT PREPARED
S2/N1: COMMIT PREPARED 's2n1';
COMMIT PREPARED
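
To make the snapshot/commit exclusion rule concrete, here is a minimal
single-process sketch of it (my own illustration under simplified
assumptions, ignoring fairness; the paper of course coordinates this
across servers):

#include <pthread.h>

/* Snapshot acquisition and commit form two groups that must not
 * overlap in time, but members of the same group may run concurrently. */
static pthread_mutex_t gate = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  gate_cv = PTHREAD_COND_INITIALIZER;
static int  snapshots_in_progress = 0;
static int  commits_in_progress = 0;

void
begin_snapshot(void)
{
    pthread_mutex_lock(&gate);
    while (commits_in_progress > 0)     /* wait out in-progress commits */
        pthread_cond_wait(&gate_cv, &gate);
    snapshots_in_progress++;
    pthread_mutex_unlock(&gate);
}

void
end_snapshot(void)
{
    pthread_mutex_lock(&gate);
    if (--snapshots_in_progress == 0)
        pthread_cond_broadcast(&gate_cv);
    pthread_mutex_unlock(&gate);
}

void
begin_commit(void)
{
    pthread_mutex_lock(&gate);
    while (snapshots_in_progress > 0)   /* postpone the commit */
        pthread_cond_wait(&gate_cv, &gate);
    commits_in_progress++;
    pthread_mutex_unlock(&gate);
}

void
end_commit(void)
{
    pthread_mutex_lock(&gate);
    if (--commits_in_progress == 0)
        pthread_cond_broadcast(&gate_cv);
    pthread_mutex_unlock(&gate);
}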

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp




Re: snowball ASCII stemmer configuration

2020-06-16 Thread Oleg Bartunov
On Tue, Jun 16, 2020 at 4:53 PM Tom Lane  wrote:

> Peter Eisentraut  writes:
> > There are two cases where these two columns are not the same:
>
> >  hindi   english \
> >  russian english \
>
> > The second one is old; the first one I added using the second one as
> > example.  But I wonder what the rationale for this is.  Maybe for hindi
> > one could make some kind of cultural argument, but for russian this
> > seems entirely arbitrary.
>
> Perhaps it is, but we have actual Russians who think it's a good idea.
> I recall questioning that point some years ago, and Oleg replied that
> they'd done that intentionally because (a) technical Russian uses a lot
> of English words, and (b) it's easy to tell which is which thanks to
> the disjoint letter sets.
>
>
Yes, you are right.


> Whether the same is true for Hindi, I have no idea.
>
> > Moreover, AFAIK, the following other languages do not use Latin-based
> > alphabets:
>
> >  arabic  arabic  \
> >  greek   greek   \
> >  nepali  nepali  \
> >  tamil   tamil   \
>
> Hmm.  I think all of those entries are ones that got added by me while
> absorbing post-2007 Snowball updates, and I confess that I did not think
> about this point.  Maybe these should be changed.
>
> regards, tom lane
>
>
>

-- 
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2020-06-16 Thread Dilip Kumar
On Tue, Jun 9, 2020 at 3:04 PM Amit Kapila  wrote:
>
> On Mon, Jun 8, 2020 at 11:53 AM Amit Kapila  wrote:
> >
> > I think one of the usages we still need is in ReorderBufferForget
> > because it can be called when we skip processing the txn.  See the
> > comments in DecodeCommit where we call this function.  If I am
> > correct, we need to probably collect all invalidations in
> > ReorderBufferTxn as we are collecting tuplecids and use them here.  We
> > can do the same during processing of XLOG_XACT_INVALIDATIONS.
> >
>
> One more point related to this is that after this patch series, we
> need to consider executing all invalidation during transaction abort.
> Because it is possible that due to memory overflow, we have processed
> some of the messages which also contain a few XACT_INVALIDATION
> messages, so to avoid cache pollution, we need to execute all of them
> in abort.  We also do the similar thing in Rollback/Rollback To
> Savepoint, see AtEOXact_Inval and AtEOSubXact_Inval.

I have analyzed this further and I think there is some problem with
that. If, instead of keeping the invalidations as individual changes,
we try to combine them in ReorderBufferTxn's invalidation list, then what
happens if the (sub)transaction is aborted?  Basically, in this case,
we will end up executing all those invalidations even though we never
polluted the cache, because we never streamed anything.  So this will
penalize the normal case where we haven't streamed the transaction,
since we would execute the invalidations logged by (sub)transactions
that were aborted.  One way out is to build the list at the
sub-transaction level and, just before sending the transaction (on
commit), combine all the (sub)transactions' invalidation lists.  But
I think, since we already have the invalidations in the commit record,
there is no point in adding this complexity.
But my main worry is about the streaming case; the problems are:
- Immediately on the arrival of an individual invalidation, we cannot
directly add it to the top-level transaction's invalidation list, because
if a subtransaction is later aborted before we stream (or we stream
directly on commit) then we will get an unnecessarily long list of
invalidations issued by aborted subtransactions.
- If we keep collecting in the individual subtransaction's
ReorderBufferTxn->invalidations, then the problem is when to merge
them.  I think it is a good idea to merge them all as soon as we try to
stream, or on commit.  And since this solution of combining the (sub)
transactions' invalidations is required for the streaming case, we can
use it as a common solution whether we stream due to memory
overflow or at commit (see the sketch below).
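
A hypothetical sketch of that merge step (function and field names are
invented for illustration; as of now ReorderBufferTxn does not carry
such an invalidations array):

#include "postgres.h"
#include "replication/reorderbuffer.h"
#include "storage/sinval.h"

/*
 * Fold each subtransaction's collected invalidation messages into the
 * top-level transaction, just before streaming it or at commit.
 */
static void
merge_subxact_invalidations(ReorderBufferTxn *txn)
{
    dlist_iter  iter;

    dlist_foreach(iter, &txn->subtxns)
    {
        ReorderBufferTxn *subtxn;
        Size        nbytes;

        subtxn = dlist_container(ReorderBufferTxn, node, iter.cur);
        if (subtxn->ninvalidations == 0)
            continue;

        nbytes = (txn->ninvalidations + subtxn->ninvalidations) *
            sizeof(SharedInvalidationMessage);
        txn->invalidations = txn->invalidations
            ? repalloc(txn->invalidations, nbytes)
            : palloc(nbytes);
        memcpy(txn->invalidations + txn->ninvalidations,
               subtxn->invalidations,
               subtxn->ninvalidations * sizeof(SharedInvalidationMessage));
        txn->ninvalidations += subtxn->ninvalidations;
    }
}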

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: snowball ASCII stemmer configuration

2020-06-16 Thread Tom Lane
Peter Eisentraut  writes:
> There are two cases where these two columns are not the same:

>  hindi   english \
>  russian english \

> The second one is old; the first one I added using the second one as 
> example.  But I wonder what the rationale for this is.  Maybe for hindi 
> one could make some kind of cultural argument, but for russian this 
> seems entirely arbitrary.

Perhaps it is, but we have actual Russians who think it's a good idea.
I recall questioning that point some years ago, and Oleg replied that
they'd done that intentionally because (a) technical Russian uses a lot
of English words, and (b) it's easy to tell which is which thanks to
the disjoint letter sets.

Whether the same is true for Hindi, I have no idea.

> Moreover, AFAIK, the following other languages do not use Latin-based 
> alphabets:

>  arabic  arabic  \
>  greek   greek   \
>  nepali  nepali  \
>  tamil   tamil   \

Hmm.  I think all of those entries are ones that got added by me while
absorbing post-2007 Snowball updates, and I confess that I did not think
about this point.  Maybe these should be changed.

regards, tom lane




Re: Transactions involving multiple postgres foreign servers, take 2

2020-06-16 Thread Masahiko Sawada
On Fri, 12 Jun 2020 at 19:24, Amit Kapila  wrote:
>

Thank you for your reviews on 0003 patch. I've incorporated your
comments. I'll submit the latest version patch later as the design or
scope might change as a result of the discussion.

>
> Few more comments on v22-0003-Documentation-update
> --
> 1.
> +  When disabled there can be risk of database
> +  consistency among all servers that involved in the distributed
> +  transaction when some foreign server crashes during committing the
> +  distributed transaction.
>
> Will it read better if rephrase above to something like: "When
> disabled there can be a risk of database
> consistency if one or more foreign servers crashes while committing
> the distributed transaction."?

Fixed.

>
> 2.
> +   id="guc-foreign-transaction-resolution-rety-interval"
> xreflabel="foreign_transaction_resolution_retry_interval">
> +   foreign_transaction_resolution_retry_interval
> (integer)
> +
> + foreign_transaction_resolution_interval
> configuration parameter
> +
> +   
> +   
> +
> + Specify how long the foreign transaction resolver should
> wait when the last resolution
> + fails before retrying to resolve foreign transaction. This
> parameter can only be set in the
> + postgresql.conf file or on the server
> command line.
> +
> +
> + The default value is 10 seconds.
> +
> +   
> +  
>
> Typo.   id="guc-foreign-transaction-resolution-rety-interval", spelling of
> retry is wrong.  Do we really need such a guc parameter?  I think we
> can come up with some simple algorithm to retry after a few seconds
> and then increase that interval of retry if we fail again or something
> like that.  I don't know how users can come up with some non-default
> value for this variable.

For example, in an unreliable network environment, setting a lower
value would help to minimize the backend wait time in case of a lost
connection. But I also agree with your point. In terms of
implementation, having backends wait for a fixed time is simpler,
but we could do such an incremental interval by remembering the retry
count for each foreign transaction.
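
Such an incremental interval could be as simple as the following
sketch (illustrative only; the 10 second base matches the patch's
default, while the cap is my assumption, not something in the patch):

/* Double the interval on each failed resolution attempt, up to a cap. */
static int
resolution_retry_interval_ms(int retry_count)
{
    long        interval = 10L * 1000;         /* base: 10 seconds */
    const long  max_interval = 5L * 60 * 1000; /* assumed cap: 5 minutes */

    while (retry_count-- > 0 && interval < max_interval)
        interval *= 2;

    return (int) (interval < max_interval ? interval : max_interval);
}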

An open question regarding retrying foreign transaction resolution is
how we handle the case where an involved foreign server is down for a
very long time. If an online transaction is waiting to be resolved, there
is no way to exit from the wait loop other than the user sending
a cancel request or the crashed server being restored. But if the foreign
server is going to be down for a long time, I think it’s not practical to
send a cancel request, because the client would need something like a
timeout mechanism. So I think it might be better to provide a way to
cancel the waiting without the user sending a cancel; for example,
having a timeout or a limit on the retry count. If an
in-doubt transaction is waiting to be resolved, we keep trying to
resolve the foreign transaction at an interval. But I wonder if the
user might want to disable the automatic in-doubt foreign transaction
resolution in some cases, for example, where the user knows the crashed
server will not be restored for a long time. I’m thinking that we can
provide a way to disable automatic foreign transaction resolution
globally, or to disable it for a particular foreign transaction.

>
> 3
> +   xreflabel="foreign_transaction_resolver_timeout">
> +   foreign_transaction_resolver_timeout
> (integer)
> +
> + foreign_transaction_resolver_timeout
> configuration parameter
> +
> +   
> +   
> +
> + Terminate foreign transaction resolver processes that don't
> have any foreign
> + transactions to resolve longer than the specified number of
> milliseconds.
> + A value of zero disables the timeout mechanism, meaning it
> connects to one
> + database until stopping manually.
>
> Can we mention the function name using which one can stop the resolver 
> process?

Fixed.

>
> 4.
> +   Using the PostgreSQL's atomic commit ensures 
> that
> +   all changes on foreign servers end in either commit or rollback using the
> +   transaction callback routines
>
> Can we slightly rephase this "Using the PostgreSQL's atomic commit
> ensures that all the changes on foreign servers are either committed
> or rolled back using the transaction callback routines"?

Fixed.

>
> 5.
> +   Prepare all transactions on foreign servers.
> +   PostgreSQL distributed transaction manager
> +   prepares all transaction on the foreign servers if two-phase commit is
> +   required. Two-phase commit is required when the transaction modifies
> +   data on two or more servers including the local server itself and
> +is
> +   required.
>
> /PostgreSQL/PostgreSQL's.

Fixed.

>
>  If all preparations on foreign 

Re: Infinities in type numeric

2020-06-16 Thread Tom Lane
Vik Fearing  writes:
> On 6/12/20 7:00 PM, Tom Lane wrote:
>> If we did that, you'd never see Inf in a
>> standard-conforming column, since SQL doesn't allow unconstrained
>> numeric columns IIRC.

> It does.  The precision and scale are both optional.
> If the precision is missing, it's implementation defined; if the scale
> is missing, it's 0.

Ah, right, the way in which we deviate from the spec is that an
unconstrained numeric column doesn't coerce every entry to scale 0.

Still, that *is* a spec deviation, so adding "... and it allows Inf"
doesn't seem like it's making things worse for spec-compliant apps.

regards, tom lane




Re: language cleanups in code and docs

2020-06-16 Thread Tom Lane
Magnus Hagander  writes:
> I'd be more worried about for example postmaster.pid, which would break a
> *lot* of scripts and integrations. postmaster is also exposed in the system
> catalogs.

Oooh, that's an excellent point.  A lot of random stuff knows that file
name.

To be clear, I'm not against removing incidental uses of the word
"master".  But the specific case of "postmaster" seems a little too
far ingrained to be worth changing.

regards, tom lane




Re: language cleanups in code and docs

2020-06-16 Thread Andrew Dunstan


On 6/16/20 9:10 AM, Joe Conway wrote:
> On 6/16/20 3:26 AM, Magnus Hagander wrote:
>> On Tue, Jun 16, 2020 at 2:23 AM Andres Freund wrote:
>> postmaster is just a symlink, which we very well could just leave in
>> place... I was really just thinking of the code level stuff. And I think
>> there's some clarity reasons to rename it as well (see comments by
>> others in the thread).
>>
>> Is the symlink even used? If not we could just get rid of it. 
>
> I am pretty sure that last time I checked Devrim was still using it in his
> systemd unit file bundled with the PGDG rpms, although that was probably a
> couple of years ago.
>


Just checked a recent install and it's there.


Honestly, I think I'm with Tom, and we can just let this one alone.


cheers


andrew


-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: Postgresql13_beta1 (could not rename temporary statistics file) Windows 64bits

2020-06-16 Thread Ranier Vilela
Em ter., 16 de jun. de 2020 às 01:10, Justin Pryzby 
escreveu:

> On Mon, Jun 15, 2020 at 11:49:33PM -0300, Ranier Vilela wrote:
> > I can confirm that the problem is in pgrename (dirmod.c),
> > something is not OK, with MoveFileEx, even with the
> > (MOVEFILE_REPLACE_EXISTING) flag.
> >
> > Replacing MoveFileEx, with
> > unlink (to);
> > rename (from, to);
> >
> > #if defined(WIN32) && !defined(__CYGWIN__)
> > unlink(to);
> > while (rename(from, to) != 0)
> > #else
> > while (rename(from, to) < 0)
> > #endif
> >
> > The log messages have disappeared.
> >
> > I suspect that if the target (to) file exists, MoveFileEx, it is failing
> to
> > rename, even with the flag enabled.
> >
> > Windows have the native rename function (
> >
> https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/rename-wrename?view=vs-2019
> > )
> > However, it fails if the target (to) file exists.
> >
> > Question: is it acceptable to delete the target file, if it exists, to
> > guarantee the rename?
>
> I don't think so - the idea behind writing $f.tmp and renaming it to $f is
> that
> it's *atomic*.
>
"atomic" here means, rename operation only, see.
1. create $f.tmp
2. write
3. close
4. rename to $f


>
> I found an earlier thread addressing the same problem - we probably both
> should've found this earlier.
> https://commitfest.postgresql.org/27/2230/
>
> https://www.postgresql.org/message-id/flat/CAPpHfds9trA6ipezK3BsuuOSQwEmESiqj8pkOxACFJpoLpcoNw%40mail.gmail.com#9b04576b717175e9dbf03cc991977d3f

Very interesting.
That link shed some light on this for me.
https://www.virtualbox.org/ticket/2350
Users report this same situation in production environments with a high
load.
Here I have only one notebook, with 8gb ram, windows 10 64 bits (2004),
msvc 2019 (64 bits).
Without any load, just starting and stopping the server, the messages
appear in the log.


>
>
> That thread goes back to 2017, so I don't think this is a new problem in
> v13.
> I'm not sure why you wouldn't also see the same behavior in v12.
>
Windows Server 2003 (32 bits) with Postgres 9.6 (32 bits): no such log
messages (could rename).
Windows Server 2016 (64 bits) with Postgres 12 (64 bits): has the log
messages (could not rename).

Yes, I can confirm it on v12, too.

File postgresql-Mon.log:
2019-10-14 11:56:26 GMT [4844]: [1-1] user=,db=,app=,client= LOG:  could
not rename temporary statistics file "pg_stat_tmp/global.tmp" to
"pg_stat_tmp/global.stat": Permission denied
File postgresql-Sat.log:
2019-09-28 10:15:52 GMT [4804]: [2-1] user=,db=,app=,client= LOG:  could
not rename temporary statistics file "pg_stat_tmp/global.tmp" to
"pg_stat_tmp/global.stat": Permission denied
2019-10-05 12:01:23 GMT [4792]: [1-1] user=,db=,app=,client= LOG:  could
not rename temporary statistics file "pg_stat_tmp/global.tmp" to
"pg_stat_tmp/global.stat": Permission denied
2019-10-05 23:55:31 GMT [4792]: [2-1] user=,db=,app=,client= LOG:  could
not rename temporary statistics file "pg_stat_tmp/global.tmp" to
"pg_stat_tmp/global.stat": Permission denied
File postgresql-Sun.log:
2019-10-06 07:43:36 GMT [4792]: [3-1] user=,db=,app=,client= LOG:  could
not rename temporary statistics file "pg_stat_tmp/global.tmp" to
"pg_stat_tmp/global.stat": Permission denied
File postgresql-Tue.log:
2019-10-01 07:31:46 GMT [4804]: [3-1] user=,db=,app=,client= LOG:  could
not rename temporary statistics file "pg_stat_tmp/global.tmp" to
"pg_stat_tmp/global.stat": Permission denied
2019-10-22 14:44:25 GMT [3868]: [1-1] user=,db=,app=,client= LOG:  could
not rename temporary statistics file "pg_stat_tmp/global.tmp" to
"pg_stat_tmp/global.stat": Permission denied
File postgresql-Wed.log:
2019-10-02 22:20:52 GMT [4212]: [1-1] user=,db=,app=,client= LOG:  could
not rename temporary statistics file "pg_stat_tmp/global.tmp" to
"pg_stat_tmp/global.stat": Permission denied
2019-10-09 11:05:02 GMT [3712]: [1-1] user=,db=,app=,client= LOG:  could
not rename temporary statistics file "pg_stat_tmp/global.tmp" to
"pg_stat_tmp/global.stat": Permission denied


>
> There's a couple patches in that thread, and the latest patch was rejected.
>
> Maybe you'd want to test them out and provide feedback.
>
I will try.


> BTW, the first patch does:
>
> !   if (filePresent)
> !   {
> !       if (ReplaceFile(to, from, NULL, REPLACEFILE_IGNORE_MERGE_ERRORS, 0, 0))
> !           break;
> !   }
> !   else
> !   {
> !       if (MoveFileEx(from, to, MOVEFILE_REPLACE_EXISTING))
> !           break;
> !   }
>
> Since it's racy to first check if the file exists, I would've thought we
> should
> instead do:
>
> ret = ReplaceFile()
> if ret == OK:
>     break
> else if ret == FileDoesntExist:
>     if MoveFileEx():
>         break
>
> Or, should we just try to create a file first, to allow ReplaceFile() to
> always
> work ?
>
Seemingly, this works too.
+ while 
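
For reference, here is a self-contained sketch of the retry order being
discussed above (my own sketch, untested, with error handling reduced
to the minimum):

#include <windows.h>
#include <stdbool.h>

/*
 * Try the atomic ReplaceFile() first; fall back to MoveFileEx() only
 * when the target does not exist yet, so there is no check-then-act
 * race on the target's existence.
 */
static bool
rename_with_replace(const char *from, const char *to)
{
    if (ReplaceFile(to, from, NULL, REPLACEFILE_IGNORE_MERGE_ERRORS,
                    NULL, NULL))
        return true;

    if (GetLastError() == ERROR_FILE_NOT_FOUND &&
        MoveFileEx(from, to, MOVEFILE_REPLACE_EXISTING))
        return true;

    return false;
}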

Re: Transactions involving multiple postgres foreign servers, take 2

2020-06-16 Thread Ashutosh Bapat
On Tue, Jun 16, 2020 at 3:40 PM Amit Kapila  wrote:
>
> On Mon, Jun 15, 2020 at 7:06 PM Masahiko Sawada
>  wrote:
> >
> > On Mon, 15 Jun 2020 at 15:20, Amit Kapila  wrote:
> > >
> > >
> > > > > Even the local
> > > > > server is not modified, since a resolver process commits prepared
> > > > > foreign transactions one by one another user could see an inconsistent
> > > > > result. Providing globally consistent snapshots to transactions
> > > > > involving foreign servers is one of the solutions.
> > >
> > > How would it be able to do that?  Say, when it decides to take a
> > > snapshot the transaction on the foreign server appears to be committed
> > > but the transaction on the local server won't appear to be committed,
> > > so the consistent data visibility problem as mentioned above could
> > > still arise.
> >
> > There are many solutions. For instance, in Postgres-XC/X2 (and maybe
> > XL), there is a GTM node that is responsible for providing global
> > transaction IDs (GXID) and globally consistent snapshots. All
> > transactions need to access GTM when checking the distributed
> > transaction status as well as starting transactions and ending
> > transactions. IIUC if a global transaction accesses a tuple whose GXID
> > is included in its global snapshot it waits for that transaction to be
> > committed or rolled back.
> >
>
> Is there some mapping between GXID and XIDs allocated for each node or
> will each node use the GXID as XID to modify the data?   Are we fine
> with parking the work for global snapshots and atomic visibility to a
> separate patch and just proceed with the design proposed by this
> patch?

Distributed transaction involves, atomic commit,  atomic visibility
and global consistency. 2PC is the only practical solution for atomic
commit. There are some improvements over 2PC but those are add ons to
the basic 2PC, which is what this patch provides. Atomic visibility
and global consistency however have alternative solutions but all of
those solutions require 2PC to be supported. Each of those are large
pieces of work and trying to get everything in may not work. Once we
have basic 2PC in place, there will be a ground to experiment with
solutions for global consistency and atomic visibility. If we manage
to do it right, we could make it pluggable as well. So, I think we
should concentrate on supporting basic 2PC work now.

> I am asking because I thought there might be some impact on
> the design of this patch based on what we decide for that work.
>

Since 2PC is at the heart of any distributed transaction system, the
impact will be low. Figuring all of that out, without having basic 2PC
in place, will be very hard.

-- 
Best Wishes,
Ashutosh Bapat




Re: language cleanups in code and docs

2020-06-16 Thread Joe Conway
On 6/16/20 3:26 AM, Magnus Hagander wrote:
> On Tue, Jun 16, 2020 at 2:23 AM Andres Freund wrote:
> postmaster is just a symlink, which we very well could just leave in
> place... I was really just thinking of the code level stuff. And I think
> there's some clarity reasons to rename it as well (see comments by
> others in the thread).
> 
> Is the symlink even used? If not we could just get rid of it. 


I am pretty sure that last time I checked Devrim was still using it in his
systemd unit file bundled with the PGDG rpms, although that was probably a
couple of years ago.

Joe
-- 
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development





Re: Getting ERROR with FOR UPDATE/SHARE for partitioned table.

2020-06-16 Thread Ashutosh Bapat
On Tue, 16 Jun 2020 at 11:45, Amit Langote  wrote:

> On Fri, Jun 12, 2020 at 9:22 PM Ashutosh Bapat
>  wrote:
> > On Wed, Jun 3, 2020 at 12:44 PM Amit Langote 
> wrote:
> > > Are you saying that the planner should take into account the state of
> > > the cursor specified in WHERE CURRENT OF to determine which of the
> > > tables to scan for the UPDATE?  Note that neither partition pruning
> > > nor constraint exclusion know that CurrentOfExpr can possibly allow to
> > > exclude children of the UPDATE target.
> >
> > Yes. But it may not be possible to know the value of current of at the
> > time of planning since that need not be a plan time constant. This
> > pruning has to happen at run time.
>
> Good point about not doing anything at planning time.
>
> I wonder if it wouldn't be okay to simply make execCurrentOf() return
> false if it can't find either a row mark or a Scan node in the cursor
> matching the table being updated/deleted from, instead of giving an
> error message?  I mean what do we gain by erroring out here instead of
> simply not doing anything?  Now, it would be nicer if we could do so
> only if the table being updated/deleted from is a child table, but it
> seems pretty inconvenient to tell that from the bottom of a plan tree
> from where execCurrentOf() is called.
>

It acts as a safeguard against a bug where CURRENT OF is set to the wrong
table or something like that. It's quite a rare bug, but if we can fix the
problem itself, removing a safeguard doesn't seem wise.


> The other option would be to have some bespoke "pruning" logic in,
> say, ExecInitModifyTable() that fetches the current active table from
> the cursor and processes only the matching child result relation.


That looks better if it works, and I don't see a reason why it wouldn't.


> Or
> maybe wait until we have run-time pruning for ModifyTable, because the
> result relation code restructuring required for that will also be
> something we'd need for this.
>
>
I don't see much difference in the final plan with either option.

-- 
Best Wishes,
Ashutosh


Re: Include access method in listTables output

2020-06-16 Thread Georgios


‐‐‐ Original Message ‐‐‐
On Monday, June 15, 2020 3:20 AM, vignesh C  wrote:

> On Tue, Jun 9, 2020 at 6:45 PM Georgios gkokola...@protonmail.com wrote:
>
> >
>
> > > Please add it to the commitfest at https://commitfest.postgresql.org/28/
> >
> > Thank you very much for your time. Added to the commitfest as suggested.
>
> Patch applies cleanly, make check & make check-world passes.

Thank you for your time and comments! Please find v2 of the patch
attached.

>
> Few comments:
>
> -   if (pset.sversion >= 12)
>
> -   appendPQExpBufferStr(,
>
> -   "\n LEFT JOIN pg_catalog.pg_am am ON am.oid = c.relam");
>
> Should we include pset.hide_tableam check along with the version check?

I opted against it, since it seems more intuitive to have a single
switch, placed in the display part. A similar pattern can be found
in describeOneTableDetails(). I do not hold a strong opinion and will
gladly amend if insisted upon.

>
> -   if (pset.sversion >= 12 && !pset.hide_tableam)
>
> -   {
>
> -   appendPQExpBuffer(,
>
> -   ",\n am.amname as \"%s\"",
>
> -   gettext_noop("Access Method"));
>
> -   }
>
> Braces can be removed & the second line can be combined with earlier.

Agreed on the braces.

Disagree on the second line, as this style is in line with the rest of
the code. For consistency, buf, the format string, and gettext_noop()
each appear on their own line within this file.

>
> We could include the test in psql file by configuring HIDE_TABLEAM.
>

Agreed.

> Regards,
> Vignesh
> EnterpriseDB: http://www.enterprisedb.com
>



0001-Include-AM-in-listTables-output-v2.patch
Description: Binary data


Re: TAP tests and symlinks on Windows

2020-06-16 Thread Andrew Dunstan


On 6/16/20 8:24 AM, Peter Eisentraut wrote:
> On 2020-06-09 09:33, Juan José Santamaría Flecha wrote:
>> The difference seems to be MSYS2, it also fails for me if I do not
>> include 'Win32::Symlink' with Perl 5.30.2.
>
> MSYS2, which is basically Cygwin, emulates symlinks with junction
> points, so this happens to work for our purpose.  We could therefore
> enable these tests in that environment, if we could come up with a
> reliable way to detect it.


From src/bin/pg_dump/t/010_dump_connstr.pl:


if ($^O eq 'msys' && `uname -or` =~ /^[2-9].*Msys/)


cheers


andrew

-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: TAP tests and symlinks on Windows

2020-06-16 Thread Peter Eisentraut

On 2020-06-09 09:33, Juan José Santamaría Flecha wrote:
The difference seems to be MSYS2, it also fails for me if I do not 
include 'Win32::Symlink' with Perl 5.30.2.


MSYS2, which is basically Cygwin, emulates symlinks with junction 
points, so this happens to work for our purpose.  We could therefore 
enable these tests in that environment, if we could come up with a 
reliable way to detect it.


Also, if we are going to add Win32::Symlink to the mix, we should make 
sure things continue to work correctly under MSYS2.


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: factorial of negative numbers

2020-06-16 Thread Dean Rasheed
On Tue, 16 Jun 2020 at 12:18, Peter Eisentraut
 wrote:
>
> On 2020-06-16 11:49, Dean Rasheed wrote:
> > With [1], we could return 'Infinity', which would be more correct from
> > a mathematical point of view, and might be preferable to erroring-out
> > in some contexts.
>
> But the limit of the gamma function is either negative or positive
> infinity, depending on from what side you come, so we can't just return
> one of those two.
>

Hmm yeah, although that's only really the case if you define it in
terms of continuous real input values.

I think you're probably right though. Raising an out-of-range error
seems like the best option.

Regards,
Dean




Re: TAP tests and symlinks on Windows

2020-06-16 Thread Andrew Dunstan


On 6/15/20 2:23 AM, Michael Paquier wrote:
> On Sat, Jun 13, 2020 at 03:00:54PM +0900, Michael Paquier wrote:
>> My take would be to actually enforce that as a requirement for 14~ if
>> that works reliably, and of course not backpatch that change as that's
>> clearly an improvement and not a bug fix.  It would be good to check
>> the status of each buildfarm member first though.  And I would need to
>> also check my own stuff to begin with..
> So, I have been looking at that.  And indeed as Peter said we are
> visibly missing one call to perl2host in 010_pg_basebackup.pl.
>
> Another thing I spotted is that Win32::Symlink does not allow to
> detect properly if a path is a symlink using -l, causing one of the
> tests of pg_basebackup to fail when checking if a tablespace path has
> been updated.  It would be good to get more people to test this patch
> with different environments than mine.  I am also adding Andrew
> Dunstan in CC as the owner of the buildfarm animals running currently
> TAP tests for confirmation about the presence of Win32::Symlink
> there as I am afraid it would cause failures: drongo, fairywen,
> jacana and bowerbird.



Not one of them has it.


I think we'll need a dynamic test for its presence rather than just
assuming it's there. (Use require in an eval for this).


However, since all of them would currently fail we wouldn't actually
have any test coverage. I could see about installing it on one or two
animals (jacana would be a problem, it's using a very old and limited
perl to run TAP tests.)


cheers


andrew



-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: Operator class parameters and sgml docs

2020-06-16 Thread Alexander Korotkov
On Thu, May 28, 2020 at 11:02 PM Alexander Korotkov
 wrote:
>
> On Thu, May 21, 2020 at 3:17 AM Alexander Korotkov
>  wrote:
> >
> > On Thu, May 21, 2020 at 12:37 AM Peter Geoghegan  wrote:
> > > Commit 911e7020770 added a variety of new support routines to index
> > > AMs. For example, it added a support function 5 to btree (see
> > > BTOPTIONS_PROC), but didn't document this alongside the other support
> > > functions in btree.sgml.
> > >
> > > It looks like the new support functions are fundamentally different to
> > > the existing ones in that they exist only as a way of supplying
> > > parameters to other support functions. The idea was to preserve
> > > compatibility with the old support function signatures. Even still, I
> > > think that the new support functions should get some mention alongside
> > > the older support functions.
> > >
> > > I also wonder whether or not xindex.sgml needs to be updated to
> > > account for opclass parameters.
> >
> > Thank you for pointing.  I'm going to take a look on this in next few days.
>
> I'm sorry for the delay.  I was very busy with various stuff.  I'm
> going to post docs patch next week.


Thank you for your patience.  The documentation patch is attached.  I think
it requires review by a native English speaker.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


opclass_options_doc.patch
Description: Binary data


Re: factorial of negative numbers

2020-06-16 Thread Peter Eisentraut

On 2020-06-16 11:49, Dean Rasheed wrote:

With [1], we could return 'Infinity', which would be more correct from
a mathematical point of view, and might be preferable to erroring-out
in some contexts.


But the limit of the gamma function is either negative or positive 
infinity, depending on from what side you come, so we can't just return 
one of those two.


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Parallel Seq Scan vs kernel read ahead

2020-06-16 Thread Amit Kapila
On Mon, Jun 15, 2020 at 8:59 PM Robert Haas  wrote:
>
> On Sat, Jun 13, 2020 at 2:13 AM Amit Kapila  wrote:
> > The performance can vary based on qualification where some workers
> > discard more rows as compared to others, with the current system with
> > step-size as one, the probability of unequal work among workers is
> > quite low as compared to larger step-sizes.
>
> It seems like this would require incredibly bad luck, though.
>

I agree that won't be a common scenario, but apart from that I am also
not sure we can conclude that the proposed patch won't cause any
regressions.  See one of the tests [1] done by Soumyadeep, where the
patch caused a regression in one of the cases; now we can either try
to improve the patch and verify that it doesn't cause regressions, or
assume that those are minority cases we don't care about.  Another point
is that this thread started with a theory that this idea can give
benefits on certain filesystems, and AFAICS we have tested it on only
one other type of system, so I am not sure that is sufficient.

Having said that, I just want to clarify that I am positive about this
work but just not very sure that it is a universal win based on the
testing done till now.

[1] - 
https://www.postgresql.org/message-id/CADwEdoqirzK3H8bB%3DxyJ192EZCNwGfcCa_WJ5GHVM7Sv8oenuA%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: factorial of negative numbers

2020-06-16 Thread Juan José Santamaría Flecha
On Tue, Jun 16, 2020 at 11:50 AM Dean Rasheed 
wrote:

> On Tue, 16 Jun 2020 at 10:09, Juan José Santamaría Flecha
>  wrote:
> >
> > It is defined as NaN (or undefined), which is not in the realm of
> integer numbers. You might get a clear idea of the logic from [1], where
> they also make a case for the error being ERRCODE_DIVISION_BY_ZERO.
> >
> > [1] http://mathforum.org/library/drmath/view/60851.html
> >
>
> Hmm, I think ERRCODE_DIVISION_BY_ZERO should probably be reserved for
> actual division functions.
>
> With [1], we could return 'Infinity', which would be more correct from
> a mathematical point of view, and might be preferable to erroring-out
> in some contexts.
>
> [1]
> https://www.postgresql.org/message-id/606717.1591924582%40sss.pgh.pa.us


Returning division-by-zero would be confusing for the user.

I think that out-of-range would be a reasonable solution for "FUNCTION
factorial(integer) RETURNS integer", because it could only return an
integer when the input is a positive integer, but for "FUNCTION
factorial(integer) RETURNS numeric" the returned value should be 'NaN'
without error.

Regards,

Juan José Santamaría Flecha


Re: create database with template doesn't copy database ACL

2020-06-16 Thread Bruce Momjian
On Mon, Jun 15, 2020 at 10:10:32AM -0400, Bruce Momjian wrote:
> On Mon, Jun 15, 2020 at 12:14:55AM -0400, Tom Lane wrote:
> > Bruce Momjian  writes:
> > > Well, I thought we copied everything except things tha can be specified
> > > as different in CREATE DATABASE, though I can see why we would not copy
> > > them.  Should we document this or issue a notice about not copying
> > > non-default database attributes?
> > 
> > We do not need a notice for behavior that (a) has stood for twenty years
> > or so, and (b) is considerably less broken than any alternative would be.
> > If you feel the docs need improvement, have at that.
> 
> Well, I realize it has been this way for a long time, and that no one
> else has complained, but there should be a way for people to know what
> is being copied from the template and what is not.  Do we have a clear
> description of what is copied and skipped?

We already mentioned that ALTER DATABASE settings are not copied, so the
attached patch adds a mention that GRANT-level permissions are not
copied either.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee

diff --git a/doc/src/sgml/ref/create_database.sgml b/doc/src/sgml/ref/create_database.sgml
index 504c1b0224..d116b321bc 100644
--- a/doc/src/sgml/ref/create_database.sgml
+++ b/doc/src/sgml/ref/create_database.sgml
@@ -236,8 +236,8 @@ CREATE DATABASE name
 
   <para>
    Database-level configuration parameters (set via <xref
-   linkend="sql-alterdatabase"/>) are not copied from the template
-   database.
+   linkend="sql-alterdatabase"/>) and database-level permissions (set via
+   <xref linkend="sql-grant"/>) are not copied from the template database.
   </para>
 
 </refsect1>


Re: Transactions involving multiple postgres foreign servers, take 2

2020-06-16 Thread Amit Kapila
On Mon, Jun 15, 2020 at 7:06 PM Masahiko Sawada
 wrote:
>
> On Mon, 15 Jun 2020 at 15:20, Amit Kapila  wrote:
> >
> >
> > > > Even the local
> > > > server is not modified, since a resolver process commits prepared
> > > > foreign transactions one by one another user could see an inconsistent
> > > > result. Providing globally consistent snapshots to transactions
> > > > involving foreign servers is one of the solutions.
> >
> > How would it be able to do that?  Say, when it decides to take a
> > snapshot the transaction on the foreign server appears to be committed
> > but the transaction on the local server won't appear to be committed,
> > so the consistent data visibility problem as mentioned above could
> > still arise.
>
> There are many solutions. For instance, in Postgres-XC/X2 (and maybe
> XL), there is a GTM node that is responsible for providing global
> transaction IDs (GXID) and globally consistent snapshots. All
> transactions need to access GTM when checking the distributed
> transaction status as well as starting transactions and ending
> transactions. IIUC if a global transaction accesses a tuple whose GXID
> is included in its global snapshot it waits for that transaction to be
> committed or rolled back.
>

Is there some mapping between GXID and XIDs allocated for each node or
will each node use the GXID as XID to modify the data?   Are we fine
with parking the work for global snapshots and atomic visibility to a
separate patch and just proceed with the design proposed by this
patch?  I am asking because I thought there might be some impact on
the design of this patch based on what we decide for that work.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Recording test runtimes with the buildfarm

2020-06-16 Thread David Rowley
On Fri, 12 Jun 2020 at 04:32, Tom Lane  wrote:
>
> Magnus Hagander  writes:
> > On Thu, Jun 11, 2020 at 4:56 PM Andrew Dunstan <
> > It seems pretty trivial to for example get all the steps out of check.log
> > and their timing with a regexp.
>
> Yeah, I don't see why we can't scrape this data from the existing
> buildfarm output, at least for the core regression tests.  New
> infrastructure could make it easier/cheaper, but I don't think we
> should invest in that until we've got a provably useful tool.

Looking at the data mentioned in the logs for install-check-C, it
seems some animals are quite stable with their timings but others are
quite unstable.

Taking the lowest half of the timings of each animal,test combination,
where the animal ran the test at least 50 times, the top 3 animals
with the most consistent timing are.

postgres=# select animal, avg(stddev) from (select animal,testname,
stddev(ms) from (select
animal,testname,unnest(ar[:array_length(ar,1)/2]) ms from (select
animal,testname,array_agg(ms order by ms) ar from run_details group by
animal,testname) a)b group by 1,2  having count(*) > 50) c group by 1
order by avg(stddev) limit 3;
   animal   |        avg
------------+-------------------
 mule       | 4.750935819647279
 massasauga | 5.410419286413067
 eider      |  6.06834118301505
(3 rows)

And the least consistent 3 are:

postgres=# select animal, avg(stddev) from (select animal,testname,
stddev(ms) from (select
animal,testname,unnest(ar[:array_length(ar,1)/2]) ms from (select
animal,testname,array_agg(ms order by ms) ar from run_details group by
animal,testname) a)b group by 1,2  having count(*) > 50) c group by 1
order by avg(stddev) desc limit 3;
  animal  |        avg
----------+-------------------
 lorikeet | 830.9292062818336
 gharial  | 725.9874198764217
 dory     | 683.5182140482121
(3 rows)

If I look at a random test from mule:

postgres=# select commit,time,result,ms from run_details where animal
= 'mule' and testname = 'join' order by time desc limit 10;
 commit  |        time         | result | ms
---------+---------------------+--------+-----
 e532b1d | 2020-06-15 19:30:03 | ok | 563
 7a3543c | 2020-06-15 15:30:03 | ok | 584
 3baa7e3 | 2020-06-15 11:30:03 | ok | 596
 47d4d0c | 2020-06-15 07:30:03 | ok | 533
 decbe2b | 2020-06-14 15:30:04 | ok | 535
 378badc | 2020-06-14 07:30:04 | ok | 563
 23cbeda | 2020-06-13 19:30:04 | ok | 557
 8f5b596 | 2020-06-13 07:30:04 | ok | 553
 6472572 | 2020-06-13 03:30:04 | ok | 580
 9a7fccd | 2020-06-12 23:30:04 | ok | 561
(10 rows)

and from lorikeet:

postgres=# select commit,time,result,ms from run_details where animal
= 'lorikeet' and testname = 'join' order by time desc limit 10;
 commit  |        time         | result |  ms
---------+---------------------+--------+------
 47d4d0c | 2020-06-15 08:28:35 | ok | 8890
 decbe2b | 2020-06-14 20:28:33 | ok | 8878
 378badc | 2020-06-14 08:28:35 | ok | 8854
 cc07264 | 2020-06-14 05:22:59 | ok | 8883
 8f5b596 | 2020-06-13 10:36:14 | ok | 8942
 2f48ede | 2020-06-12 20:28:41 | ok | 8904
 ffd2582 | 2020-06-12 08:29:52 | ok | 2016
 7aa4fb5 | 2020-06-11 23:21:26 | ok | 1939
 ad9291f | 2020-06-11 09:56:48 | ok | 1924
 c2bd1fe | 2020-06-10 23:01:42 | ok | 1873
(10 rows)

I can supply the data I used for this, just send me an offlist email.
It's about 19MB using bzip2.

I didn't look at the make check data and I do see some animals use the
parallel_schedule for make installcheck, which my regex neglected to
account for, so that data is missing from the set.

David




Re: Parallel copy

2020-06-16 Thread Amit Kapila
On Mon, Jun 15, 2020 at 7:41 PM Ashutosh Sharma  wrote:
>
> Thanks Amit for the clarifications. Regarding partitioned table, one of the 
> question was - if we are loading data into a partitioned table using COPY 
> command, then the input file would contain tuples for different tables 
> (partitions) unlike the normal table case where all the tuples in the input 
> file would belong to the same table. So, in such a case, how are we going to 
> accumulate tuples into the DSM? I mean will the leader process check which 
> tuple needs to be routed to which partition and accordingly accumulate them 
> into the DSM. For e.g. let's say in the input data file we have 10 tuples 
> where the 1st tuple belongs to partition1, 2nd belongs to partition2 and 
> likewise. So, in such cases, will the leader process accumulate all the 
> tuples belonging to partition1 into one DSM and tuples belonging to 
> partition2 into some other DSM and assign them to the worker process or we 
> have taken some other approach to handle this scenario?
>

No, all the tuples (for all partitions) will be accumulated in a
single DSM, and the workers/leader will route each tuple to the
appropriate partition.

> Further, I haven't got much time to look into the links that you have shared 
> in your previous response. Will have a look into those and will also slowly 
> start looking into the patches  as and when I get some time. Thank you.
>

Yeah, it will be good if you go through all the emails once, because
most of the decisions (and the design) in the patch are supposed to be
based on the discussion in this thread.

Note - Please don't top post, try to give inline replies.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: factorial of negative numbers

2020-06-16 Thread Dean Rasheed
On Tue, 16 Jun 2020 at 10:09, Juan José Santamaría Flecha
 wrote:
>
> It is defined as NaN (or undefined), which is not in the realm of integer 
> numbers. You might get a clear idea of the logic from [1], where they also 
> make a case for the error being ERRCODE_DIVISION_BY_ZERO.
>
> [1] http://mathforum.org/library/drmath/view/60851.html
>

Hmm, I think ERRCODE_DIVISION_BY_ZERO should probably be reserved for
actual division functions.

With [1], we could return 'Infinity', which would be more correct from
a mathematical point of view, and might be preferable to erroring-out
in some contexts.

[1] https://www.postgresql.org/message-id/606717.1591924582%40sss.pgh.pa.us

Regards,
Dean




Re: factorial of negative numbers

2020-06-16 Thread Dean Rasheed
On Tue, 16 Jun 2020 at 09:55, Bruce Momjian  wrote:
>
> On Tue, Jun 16, 2020 at 08:31:21AM +0100, Dean Rasheed wrote:
> >
> > Most common implementations do regard factorial as undefined for
> > anything other than positive integers, as well as following the
> > convention that factorial(0) = 1. Some implementations extend the
> > factorial to non-integer inputs, negative inputs, or even complex
> > inputs by defining it in terms of the gamma function. However, even
> > then, it is undefined for negative integer inputs.
>
> Wow, they define it for negative inputs, but not negative integer
> inputs?  I am curious what the logic is behind that.
>

That's just the way the maths works out. The gamma function is
well-defined for all real and complex numbers except for zero and
negative integers, where it has poles (singularities/infinities).
Actually the gamma function isn't the only possible extension of the
factorial function, but it's the one nearly everyone uses, if they
bother at all (most people don't).
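
For reference, the standard identities (not something from this
thread): n! = Γ(n+1), and the reflection formula

    Γ(z) Γ(1 - z) = π / sin(πz)

makes the poles visible: sin(πz) vanishes at every integer z, and at
z = 0, -1, -2, ... it is Γ(z) itself that must blow up, since
Γ(1 - z) is finite there; at negative non-integers both sides stay
finite and Γ is perfectly well-defined.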

Curiously, the most widespread implementation is probably the
calculator in MS Windows.

Regards,
Dean




Re: pg_upgrade fails if vacuum_defer_cleanup_age > 0

2020-06-16 Thread Bruce Momjian
On Tue, Jun 16, 2020 at 08:39:57AM +0200, Laurenz Albe wrote:
> On Mon, 2020-06-15 at 20:59 -0400, Bruce Momjian wrote:
> > On Sat, Jun 13, 2020 at 08:46:36AM -0400, Bruce Momjian wrote:
> > > On Wed, Jun 10, 2020 at 04:07:05PM +0200, Laurenz Albe wrote:
> > > > A customer's upgrade failed, and it took me a while to
> > > > figure out that the problem was that they had set
> > > > "vacuum_defer_cleanup_age=1" on the new cluster.
> > > > 
> > > > The consequence was that the "vacuumdb --freeze" that
> > > > takes place before copying commit log files failed to
> > > > freeze "pg_database".
> > > > That caused later updates to the table to fail with
> > > > "Could not open file "pg_xact/": No such file or directory."
> > > > 
> > > > I think it would increase the robustness of pg_upgrade to
> > > > force "vacuum_defer_cleanup_age" to 0 on the new cluster.
> >
> > Thank you, applied to all supported PG versions.
> 
> Thanks for picking this up and taking care of it.

Sure.  I never noticed that this setting, when it was added in 2009,
could affect pg_upgrade, but it certainly can:

commit efc16ea520
Author: Simon Riggs 
Date:   Sat Dec 19 01:32:45 2009 +

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: factorial of negative numbers

2020-06-16 Thread Juan José Santamaría Flecha
On Tue, Jun 16, 2020 at 10:55 AM Bruce Momjian  wrote:

> On Tue, Jun 16, 2020 at 08:31:21AM +0100, Dean Rasheed wrote:
> >
> > Most common implementations do regard factorial as undefined for
> > anything other than positive integers, as well as following the
> > convention that factorial(0) = 1. Some implementations extend the
> > factorial to non-integer inputs, negative inputs, or even complex
> > inputs by defining it in terms of the gamma function. However, even
> > then, it is undefined for negative integer inputs.
>
> Wow, they define it for negative inputs, but not negative integer
> inputs?  I am curious what the logic is behind that.
>

It is defined as NaN (or undefined), which is not in the realm of integer
numbers. You might get a clear idea of the logic from [1], where they also
make a case for the error being ERRCODE_DIVISION_BY_ZERO.

[1] http://mathforum.org/library/drmath/view/60851.html

Regards,

Juan José Santamaría Flecha


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2020-06-16 Thread Amit Kapila
On Mon, Jun 15, 2020 at 6:29 PM Amit Kapila  wrote:
>
> I have few more comments on the patch
> 0013-Change-buffile-interface-required-for-streaming-.patch:
>

Review comments on 0014-Worker-tempfile-use-the-shared-buffile-infrastru:
1.
The subxact file is only create if there
+ * are any suxact info under this xid.
+ */
+typedef struct StreamXidHash

Let's slightly reword that part of the comment as "The subxact file is
created iff there is any suxact info under this xid."

2.
@@ -710,6 +740,9 @@ apply_handle_stream_stop(StringInfo s)
  subxact_info_write(MyLogicalRepWorker->subid, stream_xid);
  stream_close_file();

+ /* Commit the per-stream transaction */
+ CommitTransactionCommand();

Before calling commit, ensure that we are in a valid transaction.  I
think we can have an Assert for IsTransactionState().

3.
@@ -761,11 +791,13 @@ apply_handle_stream_abort(StringInfo s)

  int64 i;
  int64 subidx;
- int fd;
+ BufFile*fd;
  bool found = false;
  char path[MAXPGPATH];
+ StreamXidHash *ent;

  subidx = -1;
+ ensure_transaction();
  subxact_info_read(MyLogicalRepWorker->subid, xid);

Why call ensure_transaction here?  Is there any reason that we
won't have a valid transaction by now?  If not, then it's better to
have an Assert for IsTransactionState().

4.
- if (write(fd, , sizeof(nsubxacts)) != sizeof(nsubxacts))
+ if (BufFileWrite(fd, , sizeof(nsubxacts)) != sizeof(nsubxacts))
  {
- int save_errno = errno;
+ int save_errno = errno;

- CloseTransientFile(fd);
+ BufFileClose(fd);

On error, won't these files be closed automatically?  If so, why do we
need to close this here and before the other errors?

5.
if ((len > 0) && ((BufFileRead(fd, subxacts, len)) != len))
{
int save_errno = errno;

BufFileClose(fd);
errno = save_errno;
ereport(ERROR,
(errcode_for_file_access(),
errmsg("could not read file \"%s\": %m",

Can we change the error message to "could not read from streaming
transactions file .." or something like that and similarly we can
change the message for failure in reading changes file?

6.
if (BufFileWrite(fd, , sizeof(nsubxacts)) != sizeof(nsubxacts))
{
int save_errno = errno;

BufFileClose(fd);
errno = save_errno;
ereport(ERROR,
(errcode_for_file_access(),
errmsg("could not write to file \"%s\": %m",

Similar to previous, can we change it to "could not write to streaming
transactions file

7.
@@ -2855,17 +2844,32 @@ stream_open_file(Oid subid, TransactionId xid,
bool first_segment)
  * for writing, in append mode.
  */
  if (first_segment)
- flags = (O_WRONLY | O_CREAT | O_EXCL | PG_BINARY);
- else
- flags = (O_WRONLY | O_APPEND | PG_BINARY);
+ {
+ /*
+ * Shared fileset handle must be allocated in the persistent context.
+ */
+ SharedFileSet *fileset =
+ MemoryContextAlloc(ApplyContext, sizeof(SharedFileSet));

- stream_fd = OpenTransientFile(path, flags);
+ PrepareTempTablespaces();
+ SharedFileSetInit(fileset, NULL);

Why are we calling PrepareTempTablespaces here? It is already called
in SharedFileSetInit.

8.
+ /*
+ * Start a transaction on stream start, this transaction will be committed
+ * on the stream stop.  We need the transaction for handling the buffile,
+ * used for serializing the streaming data and subxact info.
+ */
+ ensure_transaction();

I think we need this for PrepareTempTablespaces to set the
temptablespaces.  Also, isn't it required for a cleanup of buffile
resources at the transaction end?  Are there any other reasons for it
as well?  The comment should be a bit clearer about why we need a
transaction here.

9.
* Open a file for streamed changes from a toplevel transaction identified
 * by stream_xid (global variable). If it's the first chunk of streamed
 * changes for this transaction, perform cleanup by removing existing
 * files after a possible previous crash.
..
stream_open_file(Oid subid, TransactionId xid, bool first_segment)

The above part comment atop stream_open_file needs to be changed after
new implementation.

10.
 * enabled.  This context is reeset on each stream stop.
*/
LogicalStreamingContext = AllocSetContextCreate(ApplyContext,

/reeset/reset

11.
stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok)
{
..
+ /* No entry created for this xid so simply return. */
+ if (ent == NULL)
+ return;
..
}

Is there any reason or scenario where this ent can be NULL?  If not,
it will be better to have an Assert for the same.

12.
subxact_info_write(Oid subid, TransactionId xid)
{
..
+ /*
+ * If there is no subtransaction then nothing to do,  but if already have
+ * subxact file then delete that.
+ */
+ if (nsubxacts == 0)
  {
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not create file \"%s\": %m",
- path)));
+ if (ent->subxact_fileset)
+ {
+ cleanup_subxact_info();
+ BufFileDeleteShared(ent->subxact_fileset, path);
+ ent->subxact_fileset = NULL;
..
}

Here don't we need to free the subxact_fileset before setting it to NULL?

13.
+ /*
+ * Scan complete hash and delete the underlying files for the xids.
+ * Also delete 

Re: factorial of negative numbers

2020-06-16 Thread Bruce Momjian
On Tue, Jun 16, 2020 at 08:31:21AM +0100, Dean Rasheed wrote:
> On Tue, 16 Jun 2020 at 06:00, Ashutosh Bapat
>  wrote:
> >
> > Division by zero is really undefined, 12345678 * 12345678 (just some 
> > numbers) is out of range of say int4, but factorial of a negative number 
> > has some meaning and is defined but PostgreSQL does not support it.
> >
> 
> Actually, I think undefined/out-of-range is the right error to throw here.
> 
> Most common implementations do regard factorial as undefined for
> anything other than positive integers, as well as following the
> convention that factorial(0) = 1. Some implementations extend the
> factorial to non-integer inputs, negative inputs, or even complex
> inputs by defining it in terms of the gamma function. However, even
> then, it is undefined for negative integer inputs.

Wow, they define it for negative inputs, but not negative integer
inputs?  I am curious what the logic is behind that.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: Asynchronous Append on postgres_fdw nodes.

2020-06-16 Thread Kyotaro Horiguchi
Thanks.

My conclusion on this is that the async patch is not the cause of the
behavior change mentioned here.

At Mon, 15 Jun 2020 14:59:18 +0500, "Andrey V. Lepikhov" 
 wrote in 
> > Could you tell me how did you get the first plan?
> 
> 1. Use a clean, current vanilla master.
> 
> 2. Start two instances with the script 'frgn2n.sh' from attachment.
> There are I set GUCs:
> enable_partitionwise_join = true
> enable_partitionwise_aggregate = true
> 
> 3. Execute query:
> explain analyze SELECT sum(parts.b)
>   FROM parts, second
>   WHERE parts.a = second.a AND second.b < 100;
> 
> That's all.

With master/HEAD, I got the second (local join) plan for a while at
first, then got the first (remote join) one.  The cause of the plan
change turned out to be autovacuum on the remote node.

Before the vacuum the result of remote estimation was as follows.

Node2 (remote)
=#  EXPLAIN SELECT r4.b FROM (public.part_1 r4 INNER JOIN public.second_1 r8 ON 
(((r4.a = r8.a)) AND ((r8.b < 100))));
QUERY PLAN 
---
 Merge Join  (cost=2269.20..3689.70 rows=94449 width=4)
   Merge Cond: (r8.a = r4.a)
   ->  Sort  (cost=74.23..76.11 rows=753 width=4)
 Sort Key: r8.a
 ->  Seq Scan on second_1 r8  (cost=0.00..38.25 rows=753 width=4)
   Filter: (b < 100)
   ->  Sort  (cost=2194.97..2257.68 rows=25086 width=8)
 Sort Key: r4.a
 ->  Seq Scan on part_1 r4  (cost=0.00..361.86 rows=25086 width=8)
(9 rows)

After running a vacuum it changes as follows.

                           QUERY PLAN                            
-----------------------------------------------------------------
 Hash Join  (cost=5.90..776.31 rows=9741 width=4)
   Hash Cond: (r4.a = r8.a)
   ->  Seq Scan on part_1 r4  (cost=0.00..360.78 rows=24978 width=8)
   ->  Hash  (cost=4.93..4.93 rows=78 width=4)
 ->  Seq Scan on second_1 r8  (cost=0.00..4.93 rows=78 width=4)
   Filter: (b < 100)
(6 rows)

That changes the plan on the local side the way you saw.  I saw
exactly the same behavior with the async execution patch.
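
To trigger the same estimate shift by hand instead of waiting for
autovacuum, something like this on the remote node should do it (an
untested sketch, using the tables from the script above):

VACUUM ANALYZE part_1;
VACUUM ANALYZE second_1;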

regards.




FYI, the explain results for the other (local join) plan changed as
follows.  The two scans together are estimated to return 25839 rows,
which is far less than 94449, so the local join beat the remote join.

=# EXPLAIN SELECT a, b FROM public.part_1 ORDER BY a ASC NULLS LAST;
QUERY PLAN
--
 Sort  (cost=2194.97..2257.68 rows=25086 width=8)
   Sort Key: a
   ->  Seq Scan on part_1  (cost=0.00..361.86 rows=25086 width=8)
(3 rows)
=# EXPLAIN SELECT a FROM public.second_1 WHERE ((b < 100)) ORDER BY a ASC NULLS 
LAST;
   QUERY PLAN
-
 Sort  (cost=74.23..76.11 rows=753 width=4)
   Sort Key: a
   ->  Seq Scan on second_1  (cost=0.00..38.25 rows=753 width=4)
 Filter: (b < 100)
(4 rows)

Are changed to:

=# EXPLAIN SELECT a, b FROM public.part_1 ORDER BY a ASC NULLS LAST;
QUERY PLAN
--
 Sort  (cost=2185.22..2247.66 rows=24978 width=8)
   Sort Key: a
   ->  Seq Scan on part_1  (cost=0.00..360.78 rows=24978 width=8)
(3 rows)

horiguti=# EXPLAIN SELECT a FROM public.second_1 WHERE ((b < 100)) ORDER BY a 
ASC NULLS LAST;
  QUERY PLAN   
---
 Sort  (cost=7.38..7.57 rows=78 width=4)
   Sort Key: a
   ->  Seq Scan on second_1  (cost=0.00..4.93 rows=78 width=4)
 Filter: (b < 100)
(4 rows)

They return 25056 rows in total, which is far more than 9741 rows.
So the remote join won.

Of course the number of returned rows is not the only factor in the
cost change, but it is the most significant one in this case.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Infinities in type numeric

2020-06-16 Thread Vik Fearing
On 6/12/20 7:00 PM, Tom Lane wrote:
> Robert Haas  writes:
>> On Thu, Jun 11, 2020 at 9:16 PM Tom Lane  wrote:
>>> We had a discussion recently about how it'd be a good idea to support
>>> infinity values in type numeric [1].
> 
>> FWIW, I don't particularly like the idea. Back when I was an
>> application developer, I remember having to insert special cases into
>> any code that dealt with double precision to deal with +/-Inf and NaN.
>> I was happy that I didn't need them for numeric, too. So this change
>> would have made me sad.
> 
> Well, you're already stuck with special-casing numeric NaN, so I'm
> not sure that Inf makes your life noticeably worse on that score.
> 
> This does tie into something I have a question about in the patch's
> comments though.  As the patch stands, numeric(numeric, integer)
> (that is, the typmod-enforcement function) just lets infinities
> through regardless of the typmod, on the grounds that it is/was also
> letting NaNs through regardless of typmod.  But you could certainly
> make the argument that Inf should only be allowed in an unconstrained
> numeric column, because by definition it overflows any finite precision
> restriction.  If we did that, you'd never see Inf in a
> standard-conforming column, since SQL doesn't allow unconstrained
> numeric columns IIRC.


It does.  The precision and scale are both optional.

If the precision is missing, it's implementation-defined; if the
scale is missing, it's 0.
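
For example, all of the following are conforming declarations (a
minimal sketch):

CREATE TABLE t (
    a numeric,         -- precision is implementation-defined
    b numeric(10),     -- scale defaults to 0
    c numeric(10, 2)
);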
-- 
Vik Fearing




snowball ASCII stemmer configuration

2020-06-16 Thread Peter Eisentraut
While I was updating the snowball code, I noticed something strange.  In 
src/backend/snowball/Makefile:


# first column is language name and also name of dictionary for not-all-ASCII
# words, second is name of dictionary for all-ASCII words
# Note order dependency: use of some other language as ASCII dictionary
# must come after creation of that language
LANGUAGES=  \
arabic  arabic  \
basque  basque  \
catalan catalan \
etc.

There are two cases where these two columns are not the same:

hindi   english \
russian english \

The second one is old; the first one I added using the second one as 
an example.  But I wonder what the rationale for this is.  Maybe for 
hindi one could make some kind of cultural argument, but for russian 
this seems entirely arbitrary.  Perhaps using "simple" would be more 
sound here.
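
At the SQL level, that suggestion would amount to something like the
following (a sketch of the resulting mapping, not of the Makefile
change itself):

ALTER TEXT SEARCH CONFIGURATION russian
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart
    WITH simple;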


Moreover, AFAIK, the following other languages do not use Latin-based 
alphabets:


arabic  arabic  \
greek   greek   \
nepali  nepali  \
tamil   tamil   \

So I wonder by what rationale they use their own stemmer for the ASCII 
fallback, which is probably not going to produce anything significant.


What's the general idea here?

--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: TAP tests and symlinks on Windows

2020-06-16 Thread Juan José Santamaría Flecha
On Mon, Jun 15, 2020 at 8:23 AM Michael Paquier  wrote:

>
> Another thing I spotted is that Win32::Symlink does not allow to
> detect properly if a path is a symlink using -l, causing one of the
> tests of pg_basebackup to fail when checking if a tablespace path has
> been updated.  It would be good to get more people to test this patch
> with different environments than mine.  I am also adding Andrew
> Dunstan in CC as the owner of the buildfarm animals running currently
> TAP tests for confirmation about the presence of Win32::Symlink
> there as I am afraid it would cause failures: drongo, fairywen,
> jacana and bowerbird.
>

This patch works on my Windows 10 / Visual Studio 2019 / Perl 5.30.2
machine.

Regards,

Juan José Santamaría Flecha


Re: factorial of negative numbers

2020-06-16 Thread Dean Rasheed
On Tue, 16 Jun 2020 at 06:00, Ashutosh Bapat
 wrote:
>
> Division by zero is really undefined, 12345678 * 12345678 (just some numbers) 
> is out of range of say int4, but factorial of a negative number has some 
> meaning and is defined but PostgreSQL does not support it.
>

Actually, I think undefined/out-of-range is the right error to throw here.

Most common implementations do regard factorial as undefined for
anything other than positive integers, as well as following the
convention that factorial(0) = 1. Some implementations extend the
factorial to non-integer inputs, negative inputs, or even complex
inputs by defining it in terms of the gamma function. However, even
then, it is undefined for negative integer inputs.
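
To spell that out with the underlying math: the gamma-based extension
defines

    x! = \Gamma(x + 1),  where  \Gamma(z) = \int_0^\infty t^{z-1} e^{-t} \, dt,

and \Gamma has simple poles at z = 0, -1, -2, ....  So, for example,
(-1/2)! = \Gamma(1/2) = \sqrt{\pi} is perfectly finite, while
(-1)! = \Gamma(0) hits a pole and remains undefined.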

Regards,
Dean

[1] https://en.wikipedia.org/wiki/Factorial
[2] https://en.wikipedia.org/wiki/Gamma_function




Re: BufFileRead() error signalling

2020-06-16 Thread Michael Paquier
On Tue, Jun 16, 2020 at 05:41:31PM +1200, Thomas Munro wrote:
> Thanks!  Pushed.  I went ahead and made it void in master only.

Thanks.
--
Michael


signature.asc
Description: PGP signature


Re: Physical replication slot advance is not persistent

2020-06-16 Thread Michael Paquier
On Wed, Jun 10, 2020 at 08:57:17PM +0300, Alexey Kondratov wrote:
> New test reproduces this issue well. Left it running for a couple of hours
> in repeat and it seems to be stable.

Thanks for testing.  I have been thinking about the minimum xmin and
LSN computations on advancing, and actually I have switched the
recomputing to be called at the end of pg_replication_slot_advance().
This may be a waste if no advancing is done, but it could also be an
advantage to enforce a recalculation of the thresholds for each
function call.  And that's more consistent with the slot copy, drop
and creation.
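
The user-visible effect is what the new test below checks; done by
hand it looks roughly like this (a sketch):

SELECT pg_replication_slot_advance('phys_slot', pg_current_wal_lsn());
CHECKPOINT;  -- WAL retained only for the slot can now be recycled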

> we can safely use $current_lsn used for pg_replication_slot_advance(), since
> restart_lsn is set as-is there. It may make the test a bit simpler as well.

We could do that.  However, I found the direct comparison of
pg_replication_slots.restart_lsn before and after the restart cleaner,
so I have kept it.
--
Michael
diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index 1b929a603e..bfa4c874d5 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -621,6 +621,13 @@ pg_replication_slot_advance(PG_FUNCTION_ARGS)
 	values[0] = NameGetDatum(&MyReplicationSlot->data.name);
 	nulls[0] = false;
 
+	/*
+	 * Recompute the minimum LSN and xmin across all slots to adjust with
+	 * the advancing potentially done.
+	 */
+	ReplicationSlotsComputeRequiredXmin(false);
+	ReplicationSlotsComputeRequiredLSN();
+
 	ReplicationSlotRelease();
 
 	/* Return the reached position. */
diff --git a/src/test/recovery/t/001_stream_rep.pl b/src/test/recovery/t/001_stream_rep.pl
index 0c316c1808..778f11b28b 100644
--- a/src/test/recovery/t/001_stream_rep.pl
+++ b/src/test/recovery/t/001_stream_rep.pl
@@ -3,7 +3,7 @@ use strict;
 use warnings;
 use PostgresNode;
 use TestLib;
-use Test::More tests => 35;
+use Test::More tests => 36;
 
 # Initialize master node
 my $node_master = get_new_node('master');
@@ -364,15 +364,26 @@ my $is_replayed = $node_standby_2->safe_psql('postgres',
 	qq[SELECT 1 FROM replayed WHERE val = $newval]);
 is($is_replayed, qq(1), "standby_2 didn't replay master value $newval");
 
+# Drop any existing slots on the primary, for the follow-up tests.
+$node_master->safe_psql('postgres',
+	"SELECT pg_drop_replication_slot(slot_name) FROM pg_replication_slots;");
+
 # Test physical slot advancing and its durability.  Create a new slot on
 # the primary, not used by any of the standbys. This reserves WAL at creation.
 my $phys_slot = 'phys_slot';
 $node_master->safe_psql('postgres',
 	"SELECT pg_create_physical_replication_slot('$phys_slot', true);");
+# Generate some WAL, and switch to a new segment, used to check that
+# the previous segment is correctly getting recycled as the slot advancing
+# would recompute the minimum LSN calculated across all slots.
+my $segment_removed = $node_master->safe_psql('postgres',
+	'SELECT pg_walfile_name(pg_current_wal_lsn())');
+chomp($segment_removed);
 $node_master->psql(
 	'postgres', "
 	CREATE TABLE tab_phys_slot (a int);
-	INSERT INTO tab_phys_slot VALUES (generate_series(1,10));");
+	INSERT INTO tab_phys_slot VALUES (generate_series(1,10));
+	SELECT pg_switch_wal();");
 my $current_lsn =
   $node_master->safe_psql('postgres', "SELECT pg_current_wal_lsn();");
 chomp($current_lsn);
@@ -392,3 +403,9 @@ my $phys_restart_lsn_post = $node_master->safe_psql('postgres',
 chomp($phys_restart_lsn_post);
 ok( ($phys_restart_lsn_pre cmp $phys_restart_lsn_post) == 0,
 	"physical slot advance persists across restarts");
+
+# Check if the previous segment gets correctly recycled after the
+# server stopped cleanly, causing a shutdown checkpoint to be generated.
+my $master_data = $node_master->data_dir;
+ok(!-f "$master_data/pg_wal/$segment_removed",
+	"WAL segment $segment_removed recycled after physical slot advancing");


signature.asc
Description: PGP signature


Re: language cleanups in code and docs

2020-06-16 Thread Magnus Hagander
On Tue, Jun 16, 2020 at 2:23 AM Andres Freund  wrote:

> Hi,
>
> On 2020-06-15 19:54:25 -0400, Tom Lane wrote:
> > Daniel Gustafsson  writes:
> > > On 15 Jun 2020, at 20:22, Andres Freund  wrote:
> > >> 1) 'postmaster'. As changing that would be somewhat invasive, the word
> > >> is a bit more ambiguous, and it's largely just internal, I've left
> > >> this alone for now. I personally would rather see this renamed as
> > >> supervisor, which'd imo actually would also be a lot more
> > >> descriptive. I'm willing to do the work, but only if there's at least
> > >> some agreement.
> >
> > > FWIW, I've never really liked the name postmaster as I don't think it
> conveys
> > > meaning.  I support renaming to supervisor or a similar term.
> >
> > Meh.  That's carrying PC naming foibles to the point where they
> > negatively affect our users (by breaking start scripts and such).
> > I think we should leave this alone.
>
> postmaster is just a symlink, which we very well could just leave in
> place... I was really just thinking of the code level stuff. And I think
> there's some clarity reasons to rename it as well (see comments by
> others in the thread).
>
>
Is the symlink even used? If not we could just get rid of it.

I'd be more worried about, for example, postmaster.pid, which would
break a *lot* of scripts and integrations. postmaster is also exposed
in the system catalogs.
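
For instance, one such user-visible exposure among the built-in
functions:

SELECT pg_postmaster_start_time();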


-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: Re: Re: how to create index concurrently on partitioned table

2020-06-16 Thread 李杰(慎追)
> I am afraid that this makes the error handling more complicated, with
> risks of having inconsistent partition trees.  That's the point you
> raised.  This one is going to need more thoughts.
> CIC is an operation that exists to allow reads and writes to still
> happen in parallel, so that's fine IMO if it takes time.  Now it
> is true that it may not scale well so we should be careful with the
> approach taken.  Also, I think that the case of REINDEX would require
> less work overall because we already have some code in place to gather
> multiple indexes from one or more relations and work on these
> consistently, all at once.


I'm with you on that.

(Scheme 1)
We can follow the implementation of ReindexRelationConcurrently: in
each of the three phases of REINDEX CONCURRENTLY, the indexes of all
partitions are processed one by one.  In this way we can keep the
index of the entire partitioned table consistent.  Once CIC is
implemented this way, we could also support concurrent reindexing of
partitioned table indexes (which is not possible today).

Looking back, let's talk about our original scheme.

(Scheme 2)
If CIC is performed on each partition one by one, how do we let it
continue with the remaining partitions when an error occurs on, say,
the second partition?  If that problem can be solved, I could accept
this approach, because a CIC failure on one partition is tolerable:
that partition can simply be reindexed later.  The index on the
partitioned table itself is only a placeholder that real queries
never use directly, while the indexes on the partitions are the ones
that actually matter.

Scheme 1 is more elegant and would also give us REINDEX CONCURRENTLY
on partitioned tables, but its implementation is more complex.
Scheme 2 is simpler to implement, but not as friendly; see the sketch
below.
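
For reference, Scheme 2 is essentially what one can already do by hand
today (a sketch with invented names):

CREATE INDEX CONCURRENTLY parts_1_a_idx ON parts_1 (a);
CREATE INDEX CONCURRENTLY parts_2_a_idx ON parts_2 (a);
CREATE INDEX parts_a_idx ON ONLY parts (a);  -- parent index, still invalid
ALTER INDEX parts_a_idx ATTACH PARTITION parts_1_a_idx;
ALTER INDEX parts_a_idx ATTACH PARTITION parts_2_a_idx;
-- the parent index becomes valid once every partition's index is attached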

Hi Justin,
Which scheme do you think is the better way to implement CIC?






--
From: Michael Paquier 
Sent: Tuesday, June 16, 2020 09:02
To: 李杰(慎追) 
Cc: Justin Pryzby ; pgsql-hackers 
; 曾文旌(义从) ; 
Alvaro Herrera 
Subject: Re: Re: Re: how to create index concurrently on partitioned table

On Mon, Jun 15, 2020 at 09:33:05PM +0800, 李杰(慎追) wrote:
> This is a good idea.
> We should maintain the consistency of the entire partition table.
> However, this is not a small change in the code.
> Maybe we need to make a new design for the DefineIndex function

Indeed.  I have looked at the patch set a bit and here is the related
bit for CIC in 0001, meaning that you handle the basic index creation
for a partition tree within one transaction for each:
+   /*
+* CIC needs to mark a partitioned table as VALID, which itself
+* requires setting READY, which is unset for CIC (even though
+* it's meaningless for an index without storage).
+*/
+   if (concurrent)
+   {
+   PopActiveSnapshot();
+   CommitTransactionCommand();
+   StartTransactionCommand();
+   index_set_state_flags(indexRelationId, INDEX_CREATE_SET_READY);
+
+   CommandCounterIncrement();
+   index_set_state_flags(indexRelationId, INDEX_CREATE_SET_VALID);
}

I am afraid that this makes the error handling more complicated, with
risks of having inconsistent partition trees.  That's the point you
raised.  This one is going to need more thoughts.

> But most importantly, if we use three steps to implement CIC,
> it will take more time than for an ordinary table, especially in
> highly concurrent cases.  While waiting on one partition that users
> access frequently, the creation of the other partitions' indexes is
> delayed.  Is it worth doing this?

CIC is an operation that exists to allow reads and writes to still
happen in parallel, so that's fine IMO if it takes time.  Now it
is true that it may not scale well so we should be careful with the
approach taken.  Also, I think that the case of REINDEX would require
less work overall because we already have some code in place to gather
multiple indexes from one or more relations and work on these
consistently, all at once.
--
Michael


Re: pg_upgrade fails if vacuum_defer_cleanup_age > 0

2020-06-16 Thread Laurenz Albe
On Mon, 2020-06-15 at 20:59 -0400, Bruce Momjian wrote:
> On Sat, Jun 13, 2020 at 08:46:36AM -0400, Bruce Momjian wrote:
> > On Wed, Jun 10, 2020 at 04:07:05PM +0200, Laurenz Albe wrote:
> > > A customer's upgrade failed, and it took me a while to
> > > figure out that the problem was that they had set
> > > "vacuum_defer_cleanup_age=1" on the new cluster.
> > > 
> > > The consequence was that the "vacuumdb --freeze" that
> > > takes place before copying commit log files failed to
> > > freeze "pg_database".
> > > That caused later updates to the table to fail with
> > > "Could not open file "pg_xact/": No such file or directory."
> > > 
> > > I think it would increase the robustness of pg_upgrade to
> > > force "vacuum_defer_cleanup_age" to 0 on the new cluster.
>
> Thank you, applied to all supported PG versions.

Thanks for picking this up and taking care of it.

Yours,
Laurenz Albe





Re: Getting ERROR with FOR UPDATE/SHARE for partitioned table.

2020-06-16 Thread Amit Langote
On Fri, Jun 12, 2020 at 9:22 PM Ashutosh Bapat
 wrote:
> On Wed, Jun 3, 2020 at 12:44 PM Amit Langote  wrote:
> > Are you saying that the planner should take into account the state of
> > the cursor specified in WHERE CURRENT OF to determine which of the
> > tables to scan for the UPDATE?  Note that neither partition pruning
> > nor constraint exclusion know that CurrentOfExpr can possibly allow to
> > exclude children of the UPDATE target.
>
> Yes. But it may not be possible to know the value of CURRENT OF at
> the time of planning since that need not be a plan-time constant.
> This pruning has to happen at run time.

Good point about not doing anything at planning time.

I wonder if it wouldn't be okay to simply make execCurrentOf() return
false if it can't find either a row mark or a Scan node in the cursor
matching the table being updated/deleted from, instead of giving an
error message?  I mean what do we gain by erroring out here instead of
simply not doing anything?  Now, it would be nicer if we could do so
only if the table being updated/deleted from is a child table, but it
seems pretty inconvenient to tell that from the bottom of the plan
tree, which is where execCurrentOf() is called from.
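
For context, the kind of case being discussed looks roughly like this
(a sketch with invented names, where parted is a partitioned table):

BEGIN;
DECLARE c CURSOR FOR SELECT * FROM parted WHERE a = 1 FOR UPDATE;
FETCH 1 FROM c;
UPDATE parted SET b = b + 1 WHERE CURRENT OF c;  -- execCurrentOf() errors here
COMMIT;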

The other option would be to have some bespoke "pruning" logic in,
say, ExecInitModifyTable() that fetches the current active table from
the cursor and processes only the matching child result relation.  Or
maybe wait until we have run-time pruning for ModifyTable, because the
result relation code restructuring required for that will also be
something we'd need for this.

-- 
Amit Langote
EnterpriseDB: http://www.enterprisedb.com