Re: BufFileRead() error signalling

2020-06-15 Thread Thomas Munro
On Tue, Jun 9, 2020 at 2:32 PM Michael Paquier  wrote:
> Looks fine to me.  Are you planning to send an extra patch to switch
> BufFileWrite() to void for 14~?

Thanks!  Pushed.  I went ahead and made it void in master only.
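
For context, the gist of the master-only change is that callers no longer
check a byte count, since BufFileWrite() now raises an error itself on
failure -- roughly (a sketch, not the exact committed hunk):

-extern size_t BufFileWrite(BufFile *file, void *ptr, size_t size);
+extern void BufFileWrite(BufFile *file, void *ptr, size_t size);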




Re: Failures with installcheck and low work_mem value in 13~

2020-06-15 Thread Michael Paquier
On Mon, Jun 15, 2020 at 10:29:41PM +0900, Michael Paquier wrote:
> Attempting to run installcheck with 13~ and a value of work_mem lower
> than the default causes two failures, both related to incremental
> sorts (here work_mem = 1MB):
> 1) Test incremental_sort:
> @@ -4,12 +4,13 @@
>  select * from (select * from tenk1 order by four) t order by four, ten;
>  QUERY PLAN 
>  ----------------------------------
> - Sort
> + Incremental Sort
> Sort Key: tenk1.four, tenk1.ten
> +   Presorted Key: tenk1.four
> ->  Sort
>   Sort Key: tenk1.four
>   ->  Seq Scan on tenk1
> -(5 rows)
> +(6 rows)

Looking at this one, it happens that this is the first test in
incremental_sort.sql, and we have the following comment:
-- When we have to sort the entire table, incremental sort will
-- be slower than plain sort, so it should not be used.
explain (costs off)
select * from (select * from tenk1 order by four) t order by four, ten;

When using such a low value of work_mem, why do we switch to an
incremental sort if we know that it is going to be slower than a plain
sort?  Something looks wrong in the planner choice here.
--
Michael




Re: language cleanups in code and docs

2020-06-15 Thread Peter Geoghegan
On Mon, Jun 15, 2020 at 4:54 PM Tom Lane  wrote:
> Meh.  That's carrying PC naming foibles to the point where they
> negatively affect our users (by breaking start scripts and such).
> I think we should leave this alone.

+1. Apart from the practical considerations, I just don't see a
problem with the word postmaster. My mother is a postmistress.

I'm in favor of updating any individual instances of the word "master"
to the preferred equivalent in code and code comments, though.

-- 
Peter Geoghegan




Re: Review for GetWALAvailability()

2020-06-15 Thread Kyotaro Horiguchi
At Tue, 16 Jun 2020 01:46:21 +0900, Fujii Masao  
wrote in 
> > In short, it is known behavior but it was judged as useless to prevent
> > that.
> > That can happen when checkpointer removes up to the segment that is
> > being read by walsender.  I think that that doesn't happen (or
> > happens within a narrow time window?) for physical replication but
> > happens for logical replication.
> > During development, I once added code to walsender to exit for that
> > reason, but finally it was moved to InvalidateObsoleteReplicationSlots
> > as a somewhat different function.
> 
> BTW, I read the code of InvalidateObsoleteReplicationSlots() and
> probably
> found some issues in it.
> 
> 1. Each cycle of the "for" loop in
> InvalidateObsoleteReplicationSlots()
> emits the log message  "terminating walsender ...". This means that
> if it takes more than 10ms for walsender to exit after it's signaled,
> the second and subsequent cycles would happen and output the same
> log message several times. IMO that log message should be output
> only once.

Sounds reasonable.
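
A minimal sketch of that, remembering the last PID we logged about so the
message is emitted only on the first cycle (the variable names here are
hypothetical, not from an actual patch, and the message text is illustrative):

	if (active_pid != last_signaled_pid)
	{
		ereport(LOG,
				(errmsg("terminating walsender %d because replication slot \"%s\" is too far behind",
						active_pid, NameStr(slotname))));
		(void) kill(active_pid, SIGTERM);
		last_signaled_pid = active_pid;
	}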

> 2. InvalidateObsoleteReplicationSlots() uses the loop to scan
> replication 
> slots array and uses the "for" loop in each scan. Also it calls
> ReplicationSlotAcquire() for each "for" loop cycle, and
> ReplicationSlotAcquire() uses another loop to scan replication slots
> array. I don't think this is good design.
> 
> ISTM that we can get rid of ReplicationSlotAcquire()'s loop because
> InvalidateObsoleteReplicationSlots() already know the index of the
> slot
> that we want to find. The attached patch does that. Thought?

The inner loop is expected to run at most several times per
checkpoint, which won't be a serious problem. However, it is better if
we can get rid of that in a reasonable way.
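
For illustration, a minimal sketch of acquiring a slot directly by array
index (the function name is hypothetical, and this skips the SAB_Block
condition-variable handling that the real ReplicationSlotAcquire() does):

static ReplicationSlot *
ReplicationSlotAcquireByIndex(int index)
{
	ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[index];

	SpinLockAcquire(&s->mutex);
	if (!s->in_use || s->active_pid != 0)
	{
		SpinLockRelease(&s->mutex);
		return NULL;			/* slot is gone, or someone else holds it */
	}
	s->active_pid = MyProcPid;
	SpinLockRelease(&s->mutex);

	MyReplicationSlot = s;
	return s;
}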

The attached patch changes the behavior for SAB_Block. Before the
patch, it rescans from the first slot for the same name, but with the
patch it just rechecks the same slot.  The only caller of the function
with SAB_Block is ReplicationSlotDrop, and I can't come up with a case
where another slot with the same name is created at a different place
before the condition variable fires. But I'm not sure the change is
completely safe. Maybe some assertion is needed?

> 3. There is a corner case where the termination of walsender cleans up
> the temporary replication slot while
> InvalidateObsoleteReplicationSlots()
> is sleeping on ConditionVariableTimedSleep(). In this case,
> ReplicationSlotAcquire() is called in the subsequent cycle of the
> "for"
> loop, cannot find the slot and then emits ERROR message. This leads
> to the failure of checkpoint by the checkpointer.

Agreed.

> To avoid this case, if SAB_Inquire is specified,
> ReplicationSlotAcquire()
> should return the special value instead of emitting ERROR even when
> it cannot find the slot. Also InvalidateObsoleteReplicationSlots()
> should
> handle that special returned value.

I was thinking the same thing when I heard that.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: factorial of negative numbers

2020-06-15 Thread Ashutosh Bapat
On Tue, 16 Jun 2020 at 08:48, Peter Eisentraut <
peter.eisentr...@2ndquadrant.com> wrote:

> On 2020-06-15 13:15, Ashutosh Bapat wrote:
> > Here are some comments
> > I see below in the .out but there's no corresponding SQL in .sql.
> > +SELECT factorial(-4);
> > + factorial
> > +-----------
> > +         1
> > +(1 row)
> > +
> >
> > Should we also add -4! to cover both the function as well as the operator?
>
> I will add that.  I wasn't actually sure about the precedence of these
> operators, so it is interesting as a regression test for that as well.
>
> +1.


> > +if (num < 0)
> > +ereport(ERROR,
> > +(errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
> >
> > This looks more like ERRCODE_FEATURE_NOT_SUPPORTED, esp. since factorial
> > of negative numbers is defined but we are not supporting it. I looked
> > at some other usages of this error code. All of them really are
> > out-of-range value errors.
>
> The proposed error message says this is undefined.  If we use an error
> code that says it's not implemented, then the message should also
> reflect that.


Yes. BTW, OUT_OF_RANGE is not exactly "undefined" either. I searched for an
error code for "UNDEFINED" result but didn't find any.


> But that would in turn open an invitation for someone to
> implement it, and I'm not sure we want that.


 It will be more complex code, and so more difficult to implement, but why
would we prevent it?


> It could go either way,
> but we should be clear on what we want.
>
>
Division by zero is really undefined, and 12345678 * 12345678 (just some
numbers) is out of range of, say, int4; but the factorial of a negative number
has some meaning and is defined, yet PostgreSQL does not support it.
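
For concreteness, a sketch of the two options being weighed (both error codes
exist in PostgreSQL; the message texts here are illustrative only):

	/* "undefined": matches ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE */
	if (num < 0)
		ereport(ERROR,
				(errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
				 errmsg("factorial of a negative number is undefined")));

	/* "not supported": matches ERRCODE_FEATURE_NOT_SUPPORTED */
	if (num < 0)
		ereport(ERROR,
				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
				 errmsg("factorials of negative numbers are not supported")));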

-- 
Best Wishes,
Ashutosh


Re: [PATCH] Missing links between system catalog documentation pages

2020-06-15 Thread Fabien COELHO



Hello Dagfinn,


The attached patch makes the first mention of another system catalog or
view (as well as pg_hba.conf in pg_hba_file_lines) a link, for easier
navigation.


Why only the first mention? It seems unlikely that I would ever read such a
chapter linearly, and even then that I would want to jump especially on the
first occurrence but not the others, so ISTM that it should be done for all
mentions?


--
Fabien.




Re: Postgresql13_beta1 (could not rename temporary statistics file) Windows 64bits

2020-06-15 Thread Justin Pryzby
On Mon, Jun 15, 2020 at 11:49:33PM -0300, Ranier Vilela wrote:
> I can confirm that the problem is in pgrename (dirmod.c),
> something is not OK, with MoveFileEx, even with the
> (MOVEFILE_REPLACE_EXISTING) flag.
> 
> Replacing MoveFileEx with:
> unlink(to);
> rename(from, to);
> 
> #if defined(WIN32) && !defined(__CYGWIN__)
> unlink(to);
> while (rename(from, to) != 0)
> #else
> while (rename(from, to) < 0)
> #endif
> 
> The log messages have disappeared.
> 
> I suspect that if the target (to) file exists, MoveFileEx is failing to
> rename, even with the flag enabled.
> 
> Windows has the native rename function (
> https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/rename-wrename?view=vs-2019
> )
> However, it fails if the target (to) file exists.
> 
> Question: is it acceptable to delete the target file, if it exists, to
> guarantee the rename?

I don't think so - the idea behind writing $f.tmp and renaming it to $f is that
it's *atomic*.

I found an earlier thread addressing the same problem - we probably both
should've found this earlier.
https://commitfest.postgresql.org/27/2230/
https://www.postgresql.org/message-id/flat/CAPpHfds9trA6ipezK3BsuuOSQwEmESiqj8pkOxACFJpoLpcoNw%40mail.gmail.com#9b04576b717175e9dbf03cc991977d3f

That thread goes back to 2017, so I don't think this is a new problem in v13.
I'm not sure why you wouldn't also see the same behavior in v12.

There are a couple of patches in that thread, and the latest patch was rejected.

Maybe you'd want to test them out and provide feedback.

BTW, the first patch does:

!   if (filePresent)
!   {
!   if (ReplaceFile(to, from, NULL, 
REPLACEFILE_IGNORE_MERGE_ERRORS, 0, 0))
!   break;
!   }
!   else
!   {
!   if (MoveFileEx(from, to, MOVEFILE_REPLACE_EXISTING))
!   break;
!   }

Since it's racy to first check if the file exists, I would've thought we should
instead do:

ret = ReplaceFile()
if ret == OK:
break
else if ret == FileDoesntExist:
if MoveFileEx():
break

Or, should we just try to create the file first, to allow ReplaceFile() to
always work?
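
For what it's worth, a minimal sketch of that non-racy ordering in C, relying
on ReplaceFile() failing with ERROR_FILE_NOT_FOUND when the target is missing
(error handling abbreviated):

	if (!ReplaceFile(to, from, NULL, REPLACEFILE_IGNORE_MERGE_ERRORS, 0, 0))
	{
		if (GetLastError() == ERROR_FILE_NOT_FOUND)
		{
			/* target doesn't exist yet, so fall back to a plain move */
			if (!MoveFileEx(from, to, MOVEFILE_REPLACE_EXISTING))
				return -1;		/* report GetLastError() */
		}
		else
			return -1;			/* report GetLastError() */
	}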

-- 
Justin




Re: Failures with wal_consistency_checking and 13~

2020-06-15 Thread Michael Paquier
On Mon, Jun 15, 2020 at 11:33:42PM -0300, Ranier Vilela wrote:
> With Postgres 13, Windows 10 (home), msvc 2019 64 bits,
> Shutting down, without interrupting the database, consistently, this log
> has appeared.
> 
> 2020-06-15 21:40:35.998 -03 [3380] LOG:  database system shutdown was
> interrupted; last known up at 2020-06-15 21:36:01 -03
> 2020-06-15 21:40:37.978 -03 [3380] LOG:  database system was not properly
> shut down; automatic recovery in progress
> 2020-06-15 21:40:37.992 -03 [3380] LOG:  redo starts at 0/8A809C78
> 2020-06-15 21:40:37.992 -03 [3380] LOG:  invalid record length at
> 0/8A809CB0: wanted 24, got 0
> 2020-06-15 21:40:37.992 -03 [3380] LOG:  redo done at 0/8A809C78

Could you please keep discussions on their own threads?  After
stopping Postgres in a sudden way (immediate mode or just SIGKILL), it
is normal to see crash recovery happen, and it is also normal to see
an "invalid record length" in the logs when the end of WAL is reached,
meaning the end of recovery.

Thanks.
--
Michael




Re: Postgresql13_beta1 (could not rename temporary statistics file) Windows 64bits

2020-06-15 Thread Michael Paquier
On Mon, Jun 15, 2020 at 09:49:31AM -0300, Ranier Vilela wrote:
> I already reported on another thread that vcregress is failing with
> (float8 and partitionprune), and now these messages are showing up.
> No buildfarm animal has that combination, but as Postgres officially
> supports it ..

Well, it happens that I do have a VM with MSVC 2019 64-bit and Windows
10 (17763.rs5_release.180914-1434 coming directly from Microsoft's
website).  Testing with it, I am not seeing any issues related to
temporary file renames, though when the disk filled up I could find some
errors in writing and/or closing the temporary statistics file because of
ENOSPC; but that's no surprise in this situation, and the regression
tests just work fine.  Honestly, I see nothing to act on based on the
information provided in this thread.
--
Michael




Re: factorial of negative numbers

2020-06-15 Thread Peter Eisentraut

On 2020-06-15 13:15, Ashutosh Bapat wrote:

> Here are some comments
> I see below in the .out but there's no corresponding SQL in .sql.
> +SELECT factorial(-4);
> + factorial
> +-----------
> +         1
> +(1 row)
> +
> 
> Should we also add -4! to cover both the function as well as the operator?


I will add that.  I wasn't actually sure about the precedence of these 
operators, so it is interesting as a regression test for that as well.



> +if (num < 0)
> +ereport(ERROR,
> +(errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
> 
> This looks more like ERRCODE_FEATURE_NOT_SUPPORTED, esp. since factorial
> of negative numbers is defined but we are not supporting it. I looked
> at some other usages of this error code. All of them really are
> out-of-range value errors.


The proposed error message says this is undefined.  If we use an error 
code that says it's not implemented, then the message should also 
reflect that.  But that would in turn open an invitation for someone to 
implement it, and I'm not sure we want that.  It could go either way, 
but we should be clear on what we want.


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Review for GetWALAvailability()

2020-06-15 Thread Kyotaro Horiguchi
At Mon, 15 Jun 2020 18:59:49 +0900, Fujii Masao  
wrote in 
> > It was kind of hard to decide. Even when max_slot_wal_keep_size is
> > smaller than max_wal_size, the segments more than
> > max_slot_wal_keep_size are not guaranteed to be kept.  In that case
> > the state transits as NORMAL->LOST skipping the "RESERVED" state.
> > Putting aside whether the setting is useful or not, I thought that the
> > state transition is somewhat abrupt.
> 
> IMO the direct transition of the state from normal to lost is ok to me
> if each state is clearly defined.
> 
> >> Or, if that condition is really necessary, the document should be
> >> updated so that the note about the condition is added.
> > Does the following make sense?
> > https://www.postgresql.org/docs/13/view-pg-replication-slots.html
> > normal means that the claimed files are within max_wal_size.
> > + If max_slot_wal_keep_size is smaller than max_wal_size, this state
> > + will not appear.
> 
> I don't think this change is enough. For example, when
> max_slot_wal_keep_size
> is smaller than max_wal_size and the amount of WAL files claimed by
> the slot
> is smaller than max_slot_wal_keep_size, "reserved" is reported. But
> that is
> inconsistent with the meaning of "reserved" in the docs.

You're right.

> To consider what should be reported in wal_status, could you tell me
> for what
> purpose and how users are expected to use this information?

I saw "reserved" as the state where slots are working to retain
segments, and "normal" as the state indicating that "WAL
segments are within max_wal_size", which is orthogonal to the notion
of "reserved".  So it seems useless to me when the retained WAL
segments cannot exceed max_wal_size.

With longer descriptions they would be:

"reserved under max_wal_size"
"reserved over max_wal_size"
"lost some segments"

Come to think of it, I realized that my trouble was just the
wording.  Do the following wordings make sense to you?

"reserved"  - retained within max_wal_size
"extended"  - retained over max_wal_size
"lost"  - lost some segments

With these wordings I can live with "not extended"=>"lost". Of course
more appropriate wordings are welcome.
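
To make the proposal concrete, a sketch of how those wordings could map onto
the WALAvailability enum in xlog.h (the member names here are illustrative,
not committed ones):

	typedef enum WALAvailability
	{
		WALAVAIL_INVALID_LSN,	/* slot has no restart_lsn */
		WALAVAIL_RESERVED,		/* retained within max_wal_size */
		WALAVAIL_EXTENDED,		/* retained over max_wal_size */
		WALAVAIL_LOST			/* some segments have been lost */
	} WALAvailability;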

> Even if walsender is terminated during the state "lost", unless
> checkpointer
> removes the required WAL files, the state can go back to "reserved"
> after a
> new replication connection is established. Is this the same as what
> you're
> explaining above?

GetWALAvailability checks restart_lsn against lastRemovedSegNo, thus
"lost" cannot be seen unless checkpointer has actually removed
the segment at restart_lsn (and restart_lsn has not been invalidated).
However, walsenders are killed before those segments are actually
removed, so there are cases where the physical walreceiver reconnects
before RemoveOldXlogFiles removes all segments, which are then removed
after reconnection. "lost" can go back to "reserved" in that case. (A
physical walreceiver can connect to a slot with an invalid restart_lsn.)

I noticed another issue. If some required WALs are removed, the
slot will be "invalidated", that is, restart_lsn is set to an invalid
value. As a result we hardly ever see the "lost" state.

It can be "fixed" by remembering the validity of a slot separately
from restart_lsn. Is that worth doing?
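
A sketch of that idea: keep restart_lsn intact on invalidation and record the
invalidation in a separate flag (the field name here is hypothetical), so
GetWALAvailability() can keep reporting "lost":

	/* in ReplicationSlotPersistentData */
	XLogRecPtr	restart_lsn;	/* left intact even after invalidation */
	bool		invalidated;	/* set instead of clearing restart_lsn */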

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Postgresql13_beta1 (could not rename temporary statistics file) Windows 64bits

2020-06-15 Thread Ranier Vilela
I can confirm that the problem is in pgrename (dirmod.c),
something is not OK, with MoveFileEx, even with the
(MOVEFILE_REPLACE_EXISTING) flag.

Replacing MoveFileEx with:
unlink(to);
rename(from, to);

#if defined(WIN32) && !defined(__CYGWIN__)
unlink(to);
while (rename(from, to) != 0)
#else
while (rename(from, to) < 0)
#endif

The log messages have disappeared.

I suspect that if the target (to) file exists, MoveFileEx is failing to
rename, even with the flag enabled.

Windows has the native rename function (
https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/rename-wrename?view=vs-2019
)
However, it fails if the target (to) file exists.

Question: is it acceptable to delete the target file, if it exists, to
guarantee the rename?

regards,
Ranier Vilela




Re: hashagg slowdown due to spill changes

2020-06-15 Thread Jeff Davis
On Mon, 2020-06-15 at 11:19 -0400, Robert Haas wrote:
> On Mon, Jun 15, 2020 at 9:34 AM Tomas Vondra
>  wrote:
> > But just reverting 4cad2534d will make this much worse, I think, as
> > illustrated by the benchmarks I did in [1].
> 
> I share this concern, although I do not know what we should do about
> it.

I attached an updated version of Melanie's patch, combined with the
changes to copy only the necessary attributes to a new slot before
spilling. There are a couple changes:

* I didn't see a reason to descend into a GroupingFunc node, so I
removed that.

* I used a flag in the context rather than two separate callbacks to
the expression walker.

This patch gives the space benefits that we see on master, without the
regression for small numbers of tuples. I saw a little bit of noise in
my test results, but I'm pretty sure it's a win all around. It could
use some review/cleanup though.

Regards,
Jeff Davis

diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 331acee2814..5b5aa96c517 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -358,6 +358,14 @@ typedef struct HashAggBatch
 	int64		input_tuples;	/* number of tuples in this batch */
 } HashAggBatch;
 
+/* used to find referenced colnos */
+typedef struct FindColsContext
+{
+	bool	   is_aggref;		/* is under an aggref */
+	Bitmapset *aggregated;		/* column references under an aggref */
+	Bitmapset *unaggregated;	/* other column references */
+} FindColsContext;
+
 static void select_current_set(AggState *aggstate, int setno, bool is_hash);
 static void initialize_phase(AggState *aggstate, int newphase);
 static TupleTableSlot *fetch_input_tuple(AggState *aggstate);
@@ -390,8 +398,9 @@ static void finalize_aggregates(AggState *aggstate,
 AggStatePerAgg peragg,
 AggStatePerGroup pergroup);
 static TupleTableSlot *project_aggregates(AggState *aggstate);
-static Bitmapset *find_unaggregated_cols(AggState *aggstate);
-static bool find_unaggregated_cols_walker(Node *node, Bitmapset **colnos);
+static void find_cols(AggState *aggstate, Bitmapset **aggregated,
+	  Bitmapset **unaggregated);
+static bool find_cols_walker(Node *node, FindColsContext *context);
 static void build_hash_tables(AggState *aggstate);
 static void build_hash_table(AggState *aggstate, int setno, long nbuckets);
 static void hashagg_recompile_expressions(AggState *aggstate, bool minslot,
@@ -424,8 +433,8 @@ static MinimalTuple hashagg_batch_read(HashAggBatch *batch, uint32 *hashp);
 static void hashagg_spill_init(HashAggSpill *spill, HashTapeInfo *tapeinfo,
 			   int used_bits, uint64 input_tuples,
 			   double hashentrysize);
-static Size hashagg_spill_tuple(HashAggSpill *spill, TupleTableSlot *slot,
-uint32 hash);
+static Size hashagg_spill_tuple(AggState *aggstate, HashAggSpill *spill,
+TupleTableSlot *slot, uint32 hash);
 static void hashagg_spill_finish(AggState *aggstate, HashAggSpill *spill,
  int setno);
 static void hashagg_tapeinfo_init(AggState *aggstate);
@@ -1374,26 +1383,28 @@ project_aggregates(AggState *aggstate)
 }
 
 /*
- * find_unaggregated_cols
- *	  Construct a bitmapset of the column numbers of un-aggregated Vars
- *	  appearing in our targetlist and qual (HAVING clause)
+ * Walk tlist and qual to find referenced colnos, dividing them into
+ * aggregated and unaggregated sets.
  */
-static Bitmapset *
-find_unaggregated_cols(AggState *aggstate)
+static void
+find_cols(AggState *aggstate, Bitmapset **aggregated, Bitmapset **unaggregated)
 {
-	Agg		   *node = (Agg *) aggstate->ss.ps.plan;
-	Bitmapset  *colnos;
-
-	colnos = NULL;
-	(void) find_unaggregated_cols_walker((Node *) node->plan.targetlist,
-										 &colnos);
-	(void) find_unaggregated_cols_walker((Node *) node->plan.qual,
-										 &colnos);
-	return colnos;
+	Agg *agg = (Agg *) aggstate->ss.ps.plan;
+	FindColsContext context;
+
+	context.is_aggref = false;
+	context.aggregated = NULL;
+	context.unaggregated = NULL;
+
+	(void) find_cols_walker((Node *) agg->plan.targetlist, &context);
+	(void) find_cols_walker((Node *) agg->plan.qual, &context);
+
+	*aggregated = context.aggregated;
+	*unaggregated = context.unaggregated;
 }
 
 static bool
-find_unaggregated_cols_walker(Node *node, Bitmapset **colnos)
+find_cols_walker(Node *node, FindColsContext *context)
 {
 	if (node == NULL)
 		return false;
@@ -1404,16 +1415,24 @@ find_unaggregated_cols_walker(Node *node, Bitmapset **colnos)
 		/* setrefs.c should have set the varno to OUTER_VAR */
 		Assert(var->varno == OUTER_VAR);
 		Assert(var->varlevelsup == 0);
-		*colnos = bms_add_member(*colnos, var->varattno);
+		if (context->is_aggref)
+			context->aggregated = bms_add_member(context->aggregated,
+ var->varattno);
+		else
+			context->unaggregated = bms_add_member(context->unaggregated,
+   var->varattno);
 		return false;
 	}
-	if (IsA(node, Aggref) || IsA(node, GroupingFunc))
+	if (IsA(node, Aggref))
 	{
-		/* do not descend into 

Re: Failures with wal_consistency_checking and 13~

2020-06-15 Thread Ranier Vilela
On Mon, Jun 15, 2020 at 10:14 AM, Michael Paquier 
wrote:

> Hi all,
>
> I have begun my annual study of WAL consistency across replays, and
> wal_consistency_checking = 'all' is pointing out at some issues with
> at least VACUUM and SPGist:
> FATAL:  inconsistent page found, rel 1663/16385/22133, forknum 0,
> blkno 15
> CONTEXT:  WAL redo at 0/739CEDE8 for SPGist/VACUUM_REDIRECT: newest
> XID 4619
>
> It may be possible that there are other failures, I have just run
> installcheck and this is the first failure I saw after replaying all
> the generated WAL on a standby.  Please note that I have also checked
> 12, and installcheck passes.
>
With Postgres 13, Windows 10 (home), msvc 2019 64 bits,
Shutting down, without interrupting the database, consistently, this log
has appeared.

2020-06-15 21:40:35.998 -03 [3380] LOG:  database system shutdown was
interrupted; last known up at 2020-06-15 21:36:01 -03
2020-06-15 21:40:37.978 -03 [3380] LOG:  database system was not properly
shut down; automatic recovery in progress
2020-06-15 21:40:37.992 -03 [3380] LOG:  redo starts at 0/8A809C78
2020-06-15 21:40:37.992 -03 [3380] LOG:  invalid record length at
0/8A809CB0: wanted 24, got 0
2020-06-15 21:40:37.992 -03 [3380] LOG:  redo done at 0/8A809C78

regards,
Ranier Vilela


Re: global barrier & atomics in signal handlers (Re: Atomic operations within spinlocks)

2020-06-15 Thread Andres Freund
Hi,

On 2020-06-09 17:04:42 -0400, Robert Haas wrote:
> On Tue, Jun 9, 2020 at 3:37 PM Andres Freund  wrote:
> > Hm. Looking at this again, perhaps the better fix would be to simply not
> > look at the concrete values of the barrier inside the signal handler?
> > E.g. we could have a new PROCSIG_GLOBAL_BARRIER, which just triggers
> > ProcSignalBarrierPending to be set. And then have
> > ProcessProcSignalBarrier do the check that's currently in
> > CheckProcSignalBarrier()?
> 
> That seems like a good idea.

What do you think about 0002?


With regard to the cost of the expensive test in 0003, I'm somewhat
inclined to add that to the buildfarm for a few days and see how it
actually affects the few bf animals without atomics. We can rip it out
after we get some additional coverage (or leave it in if it turns out to
be cheap enough in comparison).


> Also, I wonder if someone would be willing to set up a BF animal for this.

FWIW, I've requested a buildfarm animal id for this a few days ago, but
haven't received a response yet...

Greetings,

Andres Freund
>From 36601fe27dfefae7f5c1221fbfd364c6d7def4b5 Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Mon, 8 Jun 2020 15:25:49 -0700
Subject: [PATCH v1 1/4] spinlock emulation: Fix bug when more than INT_MAX
 spinlocks are initialized.

Once the counter goes negative we ended up with spinlocks that errored
out on first use (due to check in tas_sema).

Author: Andres Freund
Discussion: https://postgr.es/m/20200606023103.avzrctgv7476x...@alap3.anarazel.de
Backpatch: 9.5-
---
 src/backend/storage/lmgr/spin.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/backend/storage/lmgr/spin.c b/src/backend/storage/lmgr/spin.c
index 4d2a4c6641a..753943e46d6 100644
--- a/src/backend/storage/lmgr/spin.c
+++ b/src/backend/storage/lmgr/spin.c
@@ -106,7 +106,7 @@ SpinlockSemaInit(void)
 void
 s_init_lock_sema(volatile slock_t *lock, bool nested)
 {
-	static int	counter = 0;
+	static uint32 counter = 0;
 
 	*lock = ((++counter) % NUM_SPINLOCK_SEMAPHORES) + 1;
 }
-- 
2.25.0.114.g5b0ca878e0

>From 73d72693fc7a55ae327212f3932d2a45eb47d13b Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Mon, 15 Jun 2020 18:23:10 -0700
Subject: [PATCH v1 2/4] Avoid potential spinlock use inside a signal handler
 for global barriers.

On platforms without support for 64bit atomic operations where we also
cannot rely on 64bit reads to have single copy atomicity, such atomics
are implemented using a spinlock based fallback. That means it's not
safe to even read such atomics from within a signal handler (since the
signal handler might run when the spinlock already is held).

To avoid this issue defer global barrier processing out of the signal
handler. Instead of checking local / shared barrier generation to
determine whether to set ProcSignalBarrierPending, introduce
PROCSIGNAL_BARRIER and always set ProcSignalBarrierPending when
receiving such a signal. Additionally avoid redundant work in
ProcessProcSignalBarrier if ProcSignalBarrierPending is set unnecessarily.

Also do a small amount of other polishing.

Author: Andres Freund
Discussion: https://postgr.es/m/20200609193723.eu5ilsjxwdpyx...@alap3.anarazel.de
Backpatch: 13-, where the code was introduced.
---
 src/include/storage/procsignal.h |  1 +
 src/backend/storage/ipc/procsignal.c | 87 
 2 files changed, 52 insertions(+), 36 deletions(-)

diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index a0c0bc3ce55..5cb39697f38 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -33,6 +33,7 @@ typedef enum
 	PROCSIG_NOTIFY_INTERRUPT,	/* listen/notify interrupt */
 	PROCSIG_PARALLEL_MESSAGE,	/* message from cooperating parallel backend */
 	PROCSIG_WALSND_INIT_STOPPING,	/* ask walsenders to prepare for shutdown  */
+	PROCSIG_BARRIER,			/* global barrier interrupt  */
 
 	/* Recovery conflict reasons */
 	PROCSIG_RECOVERY_CONFLICT_DATABASE,
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index c809196d06a..4fa385b0ece 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -320,7 +320,7 @@ SendProcSignal(pid_t pid, ProcSignalReason reason, BackendId backendId)
 uint64
 EmitProcSignalBarrier(ProcSignalBarrierType type)
 {
-	uint64		flagbit = UINT64CONST(1) << (uint64) type;
+	uint32		flagbit = 1 << (uint32) type;
 	uint64		generation;
 
 	/*
@@ -363,7 +363,11 @@ EmitProcSignalBarrier(ProcSignalBarrierType type)
 		pid_t		pid = slot->pss_pid;
 
 		if (pid != 0)
+		{
+			/* see SendProcSignal for details */
+			slot->pss_signalFlags[PROCSIG_BARRIER] = true;
 			kill(pid, SIGUSR1);
+		}
 	}
 
 	return generation;
@@ -383,6 +387,8 @@ WaitForProcSignalBarrier(uint64 generation)
 {
 	long		timeout = 125L;
 
+	Assert(generation <= pg_atomic_read_u64(&ProcSignal->psh_barrierGeneration));
+
 	for (int i = NumProcSignalSlots - 1; i >= 0; i--)
 	{
 		volatile ProcSignalSlot *slot 

Re: Re: Re: how to create index concurrently on partitioned table

2020-06-15 Thread Michael Paquier
On Mon, Jun 15, 2020 at 09:33:05PM +0800, 李杰(慎追) wrote:
> This is a good idea.
> We should maintain the consistency of the entire partitioned table.
> However, this is not a small change in the code.
> Maybe we need to make a new design for the DefineIndex function.

Indeed.  I have looked at the patch set a bit and here is the related
bit for CIC in 0001, meaning that you handle the basic index creation
for a partition tree within one transaction for each:
+   /*
+* CIC needs to mark a partitioned table as VALID, which itself
+* requires setting READY, which is unset for CIC (even though
+* it's meaningless for an index without storage).
+*/
+   if (concurrent)
+   {
+   PopActiveSnapshot();
+   CommitTransactionCommand();
+   StartTransactionCommand();
+   index_set_state_flags(indexRelationId, INDEX_CREATE_SET_READY);
+
+   CommandCounterIncrement();
+   index_set_state_flags(indexRelationId, INDEX_CREATE_SET_VALID);
}

I am afraid that this makes the error handling more complicated, with
risks of having inconsistent partition trees.  That's the point you
raised.  This one is going to need more thoughts.

> But most importantly, if we use three steps to implement CIC,
> we will spend more time than for ordinary tables, especially in high
> concurrency cases.  While waiting for one of the partitions that users
> use frequently, the creation of the other partitions' indexes is delayed.
> Is it worth doing this?

CIC is an operation that exists precisely to allow reads and writes to
still happen in parallel, so IMO it's fine if it takes time.  Now it
is true that it may not scale well, so we should be careful with the
approach taken.  Also, I think that the case of REINDEX would require
less work overall because we already have some code in place to gather
multiple indexes from one or more relations and work on these
consistently, all at once.
--
Michael




Re: [bug?] Is the format of tables in the documentation broken in PG 13?

2020-06-15 Thread Tom Lane
"tsunakawa.ta...@fujitsu.com"  writes:
> From: Daniel Gustafsson 
>> That's less good.  The W3C Web Accessibility Initiative has guidance for
>> multi-level header tables which might be useful here.
>> https://www.w3.org/WAI/tutorials/tables/multi-level/
>> Maybe if we use the id and headers attributes we can give screen readers
>> enough clues to make sense of the information?

> Hm, I think I'll look into this.

Please do.  I looked at the referenced link a bit, but it wasn't clear
to me that they suggested anything useful to do :-(.

It'd probably be best to discuss this on the pgsql-docs list.

regards, tom lane




Re: pg_upgrade fails if vacuum_defer_cleanup_age > 0

2020-06-15 Thread Bruce Momjian
On Sat, Jun 13, 2020 at 08:46:36AM -0400, Bruce Momjian wrote:
> On Wed, Jun 10, 2020 at 04:07:05PM +0200, Laurenz Albe wrote:
> > A customer's upgrade failed, and it took me a while to
> > figure out that the problem was that they had set
> > "vacuum_defer_cleanup_age=1" on the new cluster.
> > 
> > The consequence was that the "vacuumdb --freeze" that
> > takes place before copying commit log files failed to
> > freeze "pg_database".
> > That caused later updates to the table to fail with
> > "Could not open file "pg_xact/": No such file or directory."
> > 
> > I think it would increase the robustness of pg_upgrade to
> > force "vacuum_defer_cleanup_age" to 0 on the new cluster.
> 
> Wow, I can see setting "vacuum_defer_cleanup_age" to a non-zero value as
> causing big problems for pg_upgrade, and being hard to diagnose.  I will
> apply this patch to all branches.

Thank you, applied to all supported PG versions.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: language cleanups in code and docs

2020-06-15 Thread Andres Freund
Hi,

On 2020-06-15 19:54:25 -0400, Tom Lane wrote:
> Daniel Gustafsson  writes:
> > On 15 Jun 2020, at 20:22, Andres Freund  wrote:
> >> 1) 'postmaster'. As changing that would be somewhat invasive, the word
> >> is a bit more ambiguous, and it's largely just internal, I've left
> >> this alone for now. I personally would rather see this renamed as
> >> supervisor, which'd imo actually would also be a lot more
> >> descriptive. I'm willing to do the work, but only if there's at least
> >> some agreement.
>
> > FWIW, I've never really liked the name postmaster as I don't think it 
> > conveys
> > meaning.  I support renaming to supervisor or a similar term.
>
> Meh.  That's carrying PC naming foibles to the point where they
> negatively affect our users (by breaking start scripts and such).
> I think we should leave this alone.

postmaster is just a symlink, which we very well could just leave in
place... I was really just thinking of the code level stuff. And I think
there's some clarity reasons to rename it as well (see comments by
others in the thread).

Anyway, for now my focus is on patches in the series...

- Andres




RE: [bug?] Is the format of tables in the documentation broken in PG 13?

2020-06-15 Thread tsunakawa.ta...@fujitsu.com
From: Daniel Gustafsson 
> Yes, this was a deliberate change made to be able to fit more expansive
> descriptions of columns etc.

Thanks for your quick response and information.  I'm relieved to know that it 
was not broken.


> That's less good.  The W3C Web Accessibility Initiative has guidance for
> multi-level header tables which might be useful here.
> 
>   https://www.w3.org/WAI/tutorials/tables/multi-level/
> 
> Maybe if we use the id and headers attributes we can give screen readers
> enough clues to make sense of the information?

Hm, I think I'll look into this.


Regards
Takayuki Tsunakawa





Re: language cleanups in code and docs

2020-06-15 Thread Tom Lane
Daniel Gustafsson  writes:
> On 15 Jun 2020, at 20:22, Andres Freund  wrote:
>> 1) 'postmaster'. As changing that would be somewhat invasive, the word
>> is a bit more ambiguous, and it's largely just internal, I've left
>> this alone for now. I personally would rather see this renamed as
>> supervisor, which'd imo actually would also be a lot more
>> descriptive. I'm willing to do the work, but only if there's at least
>> some agreement.

> FWIW, I've never really liked the name postmaster as I don't think it conveys
> meaning.  I support renaming to supervisor or a similar term.

Meh.  That's carrying PC naming foibles to the point where they
negatively affect our users (by breaking start scripts and such).
I think we should leave this alone.

regards, tom lane




Re: language cleanups in code and docs

2020-06-15 Thread Bruce Momjian
On Tue, Jun 16, 2020 at 09:53:34AM +1200, Thomas Munro wrote:
> On Tue, Jun 16, 2020 at 7:04 AM Daniel Gustafsson  wrote:
> > > On 15 Jun 2020, at 20:22, Andres Freund  wrote:
> >
> > Thanks for picking this up!
> >
> > > 1) 'postmaster'. As changing that would be somewhat invasive, the word
> > >   is a bit more ambiguous, and it's largely just internal, I've left
> > >   this alone for now. I personally would rather see this renamed as
> > >   supervisor, which'd imo actually would also be a lot more
> > >   descriptive. I'm willing to do the work, but only if there's at least
> > >   some agreement.
> >
> > FWIW, I've never really liked the name postmaster as I don't think it 
> > conveys
> > meaning.  I support renaming to supervisor or a similar term.
> 
> +1.  Postmaster has always sounded like a mailer daemon or something,
> which we ain't.

Postmaster is historically confused with the postmaster email account:

https://en.wikipedia.org/wiki/Postmaster_(computing)

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: language cleanups in code and docs

2020-06-15 Thread Thomas Munro
On Tue, Jun 16, 2020 at 7:04 AM Daniel Gustafsson  wrote:
> > On 15 Jun 2020, at 20:22, Andres Freund  wrote:
>
> Thanks for picking this up!
>
> > 1) 'postmaster'. As changing that would be somewhat invasive, the word
> >   is a bit more ambiguous, and it's largely just internal, I've left
> >   this alone for now. I personally would rather see this renamed as
> >   supervisor, which'd imo actually would also be a lot more
> >   descriptive. I'm willing to do the work, but only if there's at least
> >   some agreement.
>
> FWIW, I've never really liked the name postmaster as I don't think it conveys
> meaning.  I support renaming to supervisor or a similar term.

+1.  Postmaster has always sounded like a mailer daemon or something,
which we ain't.




Re: Parallel Seq Scan vs kernel read ahead

2020-06-15 Thread David Rowley
On Tue, 16 Jun 2020 at 03:29, Robert Haas  wrote:
>
> On Sat, Jun 13, 2020 at 2:13 AM Amit Kapila  wrote:
> > The performance can vary based on qualification where some workers
> > discard more rows as compared to others, with the current system with
> > step-size as one, the probability of unequal work among workers is
> > quite low as compared to larger step-sizes.
>
> It seems like this would require incredibly bad luck, though. If the
> step size is less than 1/1024 of the relation size, and we ramp down
> for, say, the last 5% of the relation, then the worst case is that
> chunk 972 of 1024 is super-slow compared to all the other blocks, so
> that it takes longer to process chunk 972 only than it does to process
> chunks 973-1024 combined. It is not impossible, but that chunk has to
> be like 50x worse than all the others, which doesn't seem like
> something that is going to happen often enough to be worth worrying
> about very much. I'm not saying it will never happen. I'm just
> skeptical about the benefit of adding a GUC or reloption for a corner
> case like this. I think people will fiddle with it when it isn't
> really needed, and won't realize it exists in the scenarios where it
> would have helped.

I'm trying to think of likely scenarios where "lots of work at the
end" is going to be common.  I can think of queue processing, but
everything I can think of there requires an UPDATE to the processed
flag, which won't be using parallel query anyway. Then there's
processing something based on some period of time like "the last
hour" or "today". For append-only tables the latest information is
likely to be at the end of the heap.  For that, anyone who's getting
a SeqScan on a large relation should likely have added an index. If a
btree is too costly, then BRIN is pretty perfect for that case.

FWIW, I'm not really keen on adding a reloption or a GUC.  I've also
voiced here that I'm not even keen on the ramp-up.

To summarise what's all been proposed so far:

1. Use a constant, (e.g. 64) as the parallel step size
2. Ramp up the step size over time
3. Ramp down the step size towards the end of the scan.
4. Auto-determine a good stepsize based on the size of the relation.
5. Add a GUC to allow users to control or override the step size.
6. Add a reloption to allow users to control or override the step size.


Here are my thoughts on each of those:

#1 is a bad idea as there are legitimate use-cases for using parallel
query on small tables, e.g. calling some expensive parallel-safe
function. Small tables are more likely to be cached.
#2 I don't quite understand why this is useful
#3 I understand this is to try to make it so workers all complete
around about the same time.
#4 We really should be doing it this way.
#5 Having a global knob to control something that is very specific to
the size of a relation does not make much sense to me.
#6. I imagine someone will have some weird use-case that works better
when parallel workers get 1 page at a time. I'm not convinced that
they're not doing something else wrong.

So my vote is for 4 with possibly 3, if we can come up with something
smart enough * that works well in parallel.  I think there's less of a
need for this if we divide the relation into more chunks, e.g. 8192
or 16384.

* Perhaps when there are less than 2 full chunks remaining, workers
can just take half of what is left. Or more specifically
Max(pg_next_power2(remaining_blocks) / 2, 1), which ideally would work
out allocating an amount of pages proportional to the amount of beer
each mathematician receives in the "An infinite number of
mathematicians walk into a bar" joke, obviously with the exception
that we stop dividing when we get to 1. However, I'm not quite sure
how well that can be made to work with multiple bartenders working in
parallel.
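
A sketch of that last allocation rule, assuming pg_nextpower2_32() from
pg_bitutils.h and hypothetical chunk bookkeeping (remaining_blocks must be
greater than zero):

	static BlockNumber
	next_chunk_size(BlockNumber remaining_blocks, BlockNumber full_chunk_size)
	{
		/* plenty left: hand out a full chunk */
		if (remaining_blocks >= 2 * full_chunk_size)
			return full_chunk_size;

		/* fewer than two full chunks remain: halve, but never below 1 */
		return Max(pg_nextpower2_32(remaining_blocks) / 2, 1);
	}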

David




Re: tar-related code in PostgreSQL

2020-06-15 Thread Robert Haas
On Mon, Apr 27, 2020 at 2:07 PM Robert Haas  wrote:
> > I'd lean mildly to holding 0002 until after we branch.  It probably
> > won't break anything, but it probably won't fix anything either.
>
> True.

Committed now.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




pg_dump --where option

2020-06-15 Thread Surafel Temesgen
Internally, pg_dump has the capability to filter the table data to dump with a
filter clause, but it has no interface to use it. The patch here [1] adds an
interface for it, but it has at least two issues: one is that the error
message in the case of an incorrect WHERE clause specification is somewhat
hard to debug and strange for pg_dump. The other issue is that it applies the
same filter clause to multiple tables if pattern matching returns multiple
tables, which seems like undesired behavior to me, because mostly we don't
want to apply the same WHERE clause specification to multiple tables. The
attached patch contains a fix for both issues.

[1].
https://www.postgresql.org/message-id/flat/CAGiT_HNav5B=OfCdfyFoqTa+oe5W1vG=pxktetcxxg4kcut...@mail.gmail.com


regards

Surafel
diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml
index 2f0807e912..1c43eaa9de 100644
--- a/doc/src/sgml/ref/pg_dump.sgml
+++ b/doc/src/sgml/ref/pg_dump.sgml
@@ -1103,6 +1103,21 @@ PostgreSQL documentation
   
  
 
+ 
+  --where=table:filter_clause
+  
+   
+When dumping data for table, only include rows
+that meet the filter_clause condition.
+This option is useful when you want to dump only a subset of a particular table.
+--where can be given more than once to provide different filters for multiple tables.
+Note that if multiple options refer to the same table, only the first filter_clause will be applied.
+If necessary, use quotes in your shell to provide an argument that contains spaces.
+E.g. --where=mytable:"created_at >= '2018-01-01' AND test = 'f'"
+   
+  
+ 
+
  
-?
--help
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 89d598f856..566469cdb7 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -121,6 +121,8 @@ static SimpleStringList tabledata_exclude_patterns = {NULL, NULL};
 static SimpleOidList tabledata_exclude_oids = {NULL, NULL};
 static SimpleStringList foreign_servers_include_patterns = {NULL, NULL};
 static SimpleOidList foreign_servers_include_oids = {NULL, NULL};
+static SimpleStringList tabledata_where_patterns = {NULL, NULL};
+static SimpleOidList tabledata_where_oids = {NULL, NULL};
 
 static const CatalogId nilCatalogId = {0, 0};
 
@@ -156,7 +158,8 @@ static void expand_foreign_server_name_patterns(Archive *fout,
 static void expand_table_name_patterns(Archive *fout,
 	   SimpleStringList *patterns,
 	   SimpleOidList *oids,
-	   bool strict_names);
+	   bool strict_names,
+	   bool match_data);
 static NamespaceInfo *findNamespace(Archive *fout, Oid nsoid);
 static void dumpTableData(Archive *fout, TableDataInfo *tdinfo);
 static void refreshMatViewData(Archive *fout, TableDataInfo *tdinfo);
@@ -386,6 +389,7 @@ main(int argc, char **argv)
		{"on-conflict-do-nothing", no_argument, &dopt.do_nothing, 1},
 		{"rows-per-insert", required_argument, NULL, 10},
 		{"include-foreign-data", required_argument, NULL, 11},
+		{"where", required_argument, NULL, 12},
 
 		{NULL, 0, NULL, 0}
 	};
@@ -603,6 +607,10 @@ main(int argc, char **argv)
 		  optarg);
 break;
 
+			case 12:/* table data WHERE clause */
+simple_string_list_append(&tabledata_where_patterns, optarg);
+break;
+
 			default:
 fprintf(stderr, _("Try \"%s --help\" for more information.\n"), progname);
 exit_nicely(1);
@@ -805,17 +813,26 @@ main(int argc, char **argv)
 	{
		expand_table_name_patterns(fout, &table_include_patterns,
   &table_include_oids,
-   strict_names);
+   strict_names, false);
 		if (table_include_oids.head == NULL)
 			fatal("no matching tables were found");
 	}
+
+	if (tabledata_where_patterns.head != NULL)
+	{
+		expand_table_name_patterns(fout, &tabledata_where_patterns,
+   &tabledata_where_oids,
+   true, true);
+		if (tabledata_where_oids.head == NULL)
+			fatal("no matching table was found");
+	}
	expand_table_name_patterns(fout, &table_exclude_patterns,
			   &table_exclude_oids,
-			   false);
+			   false, false);
 
	expand_table_name_patterns(fout, &tabledata_exclude_patterns,
			   &tabledata_exclude_oids,
-			   false);
+			   false, false);
 
	expand_foreign_server_name_patterns(fout, &foreign_servers_include_patterns,
		&foreign_servers_include_oids);
@@ -1046,6 +1063,7 @@ help(const char *progname)
 	printf(_("  --use-set-session-authorization\n"
 			 "   use SET SESSION AUTHORIZATION commands instead of\n"
 			 "   ALTER OWNER commands to set ownership\n"));
+	printf(_("  --where=TABLE:WHERE_CLAUSE   only dump selected rows for the given table\n"));
 
 	printf(_("\nConnection options:\n"));
 	printf(_("  -d, --dbname=DBNAME  database to dump\n"));
@@ -1393,16 +1411,20 @@ expand_foreign_server_name_patterns(Archive *fout,
 /*
  * Find the OIDs of all tables matching the given list of patterns,
  * and append them to the given OID list. See also 

Re: language cleanups in code and docs

2020-06-15 Thread Daniel Gustafsson
> On 15 Jun 2020, at 20:22, Andres Freund  wrote:

Thanks for picking this up!

> 1) 'postmaster'. As changing that would be somewhat invasive, the word
>   is a bit more ambiguous, and it's largely just internal, I've left
>   this alone for now. I personally would rather see this renamed as
>   supervisor, which'd imo actually would also be a lot more
>   descriptive. I'm willing to do the work, but only if there's at least
>   some agreement.

FWIW, I've never really liked the name postmaster as I don't think it conveys
meaning.  I support renaming to supervisor or a similar term.

cheers ./daniel




Re: [PATCH] Missing links between system catalog documentation pages

2020-06-15 Thread Dagfinn Ilmari Mannsåker
ilm...@ilmari.org (Dagfinn Ilmari Mannsåker) writes:

> ilm...@ilmari.org (Dagfinn Ilmari Mannsåker) writes:
>
>> The attached patch makes the first mention of another system catalog or
>> view (as well as pg_hba.conf in pg_hba_file_lines) a link, for easier
>> navigation.
>
> bump...

Added to the current commitfest:

https://commitfest.postgresql.org/28/2599/

- ilmari
-- 
- Twitter seems more influential [than blogs] in the 'gets reported in
  the mainstream press' sense at least.   - Matt McLeod
- That'd be because the content of a tweet is easier to condense down
  to a mainstream media article.  - Calle Dybedahl




[PATCH] Missing links between system catalog documentation pages

2020-06-15 Thread Dagfinn Ilmari Mannsåker
ilm...@ilmari.org (Dagfinn Ilmari Mannsåker) writes:

> The attached patch makes the first mention of another system catalog or
> view (as well as pg_hba.conf in pg_hba_file_lines) a link, for easier
> navigation.

bump...

- ilmari
-- 
"A disappointingly low fraction of the human race is,
 at any given time, on fire." - Stig Sandbeck Mathisen




Re: Review for GetWALAvailability()

2020-06-15 Thread Fujii Masao



On 2020/06/15 13:42, Kyotaro Horiguchi wrote:

At Sat, 13 Jun 2020 01:38:49 +0900, Fujii Masao  
wrote in

Hi,

The document explains that the "lost" value that
pg_replication_slots.wal_status reports means

 some WAL files are definitely lost and this slot cannot be used to
 resume replication anymore.

However, I observed "lost" value while inserting lots of records,
but replication could continue normally. So I wonder if
pg_replication_slots.wal_status may have a bug.

wal_status is calculated in GetWALAvailability(), and probably I found
some issues in it.


keepSegs = ConvertToXSegs(Max(max_wal_size_mb, wal_keep_segments),
  wal_segment_size) +
  1;

max_wal_size_mb is a number of megabytes. wal_keep_segments is
a number of WAL segment files. So it's strange to calculate the max of
them.


Oops!  I don't want to believe I did that but it's definitely wrong.


The above should be the following?

 Max(ConvertToXSegs(max_wal_size_mb, wal_segment_size),
 wal_keep_segments) + 1


Looks reasonable.


if ((max_slot_wal_keep_size_mb <= 0 ||
 max_slot_wal_keep_size_mb >= max_wal_size_mb) &&
oldestSegMaxWalSize <= targetSeg)
return WALAVAIL_NORMAL;

This code means that wal_status reports "normal" only when
max_slot_wal_keep_size is negative or larger than max_wal_size.
Why is this condition necessary? The document explains "normal
  means that the claimed files are within max_wal_size". So whatever
  max_slot_wal_keep_size value is, IMO that "normal" should be
  reported if the WAL files claimed by the slot are within max_wal_size.
  Thought?


It was kind of hard to decide. Even when max_slot_wal_keep_size is
smaller than max_wal_size, the segments more than
max_slot_wal_keep_size are not guaranteed to be kept.  In that case
the state transits as NORMAL->LOST skipping the "RESERVED" state.
Putting aside whether the setting is useful or not, I thought that the
state transition is somewhat abrupt.


Or, if that condition is really necessary, the document should be
updated so that the note about the condition is added.


Does the following make sense?

https://www.postgresql.org/docs/13/view-pg-replication-slots.html

normal means that the claimed files are within max_wal_size.
+ If max_slot_wal_keep_size is smaller than max_wal_size, this state
+ will not appear.


If the WAL files claimed by the slot exceed max_slot_wal_keep_size
but those WAL files have not been removed yet, wal_status seems
to report "lost". Is this expected behavior? Per the meaning of "lost"
described in the document, "lost" should be reported only when
some claimed files have been removed, I think. Thought?

Or this behavior is expected and the document is incorrect?


In short, it is known behavior but it was judged as useless to prevent
that.

That can happen when checkpointer removes up to the segment that is
being read by walsender.  I think that that doesn't happen (or
happens within a narrow time window?) for physical replication but
happens for logical replication.

During development, I once added code to walsender to exit for that
reason, but finally it was moved to InvalidateObsoleteReplicationSlots
as a somewhat different function.


BTW, I read the code of InvalidateObsoleteReplicationSlots() and probably
found some issues in it.

1. Each cycle of the "for" loop in InvalidateObsoleteReplicationSlots()
emits the log message  "terminating walsender ...". This means that
if it takes more than 10ms for walsender to exit after it's signaled,
the second and subsequent cycles would happen and output the same
log message several times. IMO that log message should be output
only once.

2. InvalidateObsoleteReplicationSlots() uses the loop to scan replication
slots array and uses the "for" loop in each scan. Also it calls
ReplicationSlotAcquire() for each "for" loop cycle, and
ReplicationSlotAcquire() uses another loop to scan replication slots
array. I don't think this is good design.

ISTM that we can get rid of ReplicationSlotAcquire()'s loop because
InvalidateObsoleteReplicationSlots() already know the index of the slot
that we want to find. The attached patch does that. Thought?

3. There is a corner case where the termination of walsender cleans up
the temporary replication slot while InvalidateObsoleteReplicationSlots()
is sleeping on ConditionVariableTimedSleep(). In this case,
ReplicationSlotAcquire() is called in the subsequent cycle of the "for"
loop, cannot find the slot and then emits ERROR message. This leads
to the failure of checkpoint by the checkpointer.

To avoid this case, if SAB_Inquire is specified, ReplicationSlotAcquire()
should return the special value instead of emitting ERROR even when
it cannot find the slot. Also 

Re: hashagg slowdown due to spill changes

2020-06-15 Thread Tom Lane
Robert Haas  writes:
> On Mon, Jun 15, 2020 at 9:34 AM Tomas Vondra
>  wrote:
>> But just reverting 4cad2534d will make this much worse, I think, as
>> illustrated by the benchmarks I did in [1].

> I share this concern, although I do not know what we should do about it.

Well, it's only June.  Let's put it on the open issues list for v13
and continue to think about it.  I concur that the hashagg spill patch
has made this something that we should worry about for v13, so just
reverting without a better answer isn't very appetizing.

regards, tom lane




Re: Postgresql13_beta1 (could not rename temporary statistics file) Windows 64bits

2020-06-15 Thread Ranier Vilela
Attached a log.

I hacked dirmod.c (pgrename) to print GetLastError():

MoveFIleEx from: pg_stat_tmp/global.tmp
MoveFIleEx to: pg_stat_tmp/global.stat
MoveFIleEx win32 error code 5
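
(For reference, Win32 error code 5 is ERROR_ACCESS_DENIED.  The
instrumentation described would look roughly like this; the exact hack
wasn't posted:)

	if (!MoveFileEx(from, to, MOVEFILE_REPLACE_EXISTING))
		fprintf(stderr, "MoveFileEx win32 error code %lu\n",
				GetLastError());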

regards,
Ranier Vilela




Re: Parallel Seq Scan vs kernel read ahead

2020-06-15 Thread Robert Haas
On Sat, Jun 13, 2020 at 2:13 AM Amit Kapila  wrote:
> The performance can vary based on qualification where some workers
> discard more rows as compared to others, with the current system with
> step-size as one, the probability of unequal work among workers is
> quite low as compared to larger step-sizes.

It seems like this would require incredibly bad luck, though. If the
step size is less than 1/1024 of the relation size, and we ramp down
for, say, the last 5% of the relation, then the worst case is that
chunk 972 of 1024 is super-slow compared to all the other blocks, so
that it takes longer to process chunk 972 only than it does to process
chunks 973-1024 combined. It is not impossible, but that chunk has to
be like 50x worse than all the others, which doesn't seem like
something that is going to happen often enough to be worth worrying
about very much. I'm not saying it will never happen. I'm just
skeptical about the benefit of adding a GUC or reloption for a corner
case like this. I think people will fiddle with it when it isn't
really needed, and won't realize it exists in the scenarios where it
would have helped. And then, because we have the setting, we'll have
to keep it around forever, even as we improve the algorithm in other
ways, which could become a maintenance burden. I think it's better to
treat stuff like this as an implementation detail rather than
something we expect users to adjust.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: hashagg slowdown due to spill changes

2020-06-15 Thread Robert Haas
On Mon, Jun 15, 2020 at 9:34 AM Tomas Vondra
 wrote:
> But just reverting 4cad2534d will make this much worse, I think, as
> illustrated by the benchmarks I did in [1].

I share this concern, although I do not know what we should do about it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: factorial of negative numbers

2020-06-15 Thread Tom Lane
... oh, one slightly more important nit-pick: per the catalogs and
code, the function is factorial(bigint):

   Schema   |   Name    | Result data type | Argument data types | Type 
------------+-----------+------------------+---------------------+------
 pg_catalog | factorial | numeric          | bigint              | func

but you have it documented as factorial(numeric).

regards, tom lane




Re: create database with template doesn't copy database ACL

2020-06-15 Thread Bruce Momjian
On Mon, Jun 15, 2020 at 12:14:55AM -0400, Tom Lane wrote:
> Bruce Momjian  writes:
> > Well, I thought we copied everything except things that can be specified
> > as different in CREATE DATABASE, though I can see why we would not copy
> > them.  Should we document this or issue a notice about not copying
> > non-default database attributes?
> 
> We do not need a notice for behavior that (a) has stood for twenty years
> or so, and (b) is considerably less broken than any alternative would be.
> If you feel the docs need improvement, have at that.

Well, I realize it has been this way for a long time, and that no one
else has complained, but there should be a way for people to know what
is being copied from the template and what is not.  Do we have a clear
description of what is copied and skipped?
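
For illustration, a minimal sketch of the behavior in question (hypothetical
names, untested):

    CREATE DATABASE tmpl_db;
    GRANT CONNECT ON DATABASE tmpl_db TO some_user;
    CREATE DATABASE new_db TEMPLATE tmpl_db;
    -- new_db does not inherit the GRANT made on tmpl_db:
    SELECT datname, datacl FROM pg_database
      WHERE datname IN ('tmpl_db', 'new_db');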

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: Parallel copy

2020-06-15 Thread Ashutosh Sharma
Thanks Amit for the clarifications. Regarding partitioned tables, one of the
questions was: if we are loading data into a partitioned table using the COPY
command, then the input file would contain tuples for different tables
(partitions) unlike the normal table case where all the tuples in the input
file would belong to the same table. So, in such a case, how are we going
to accumulate tuples into the DSM? I mean will the leader process check
which tuple needs to be routed to which partition and accordingly
accumulate them into the DSM. For example, let's say the input data file
has 10 tuples, where the 1st tuple belongs to partition1, the 2nd belongs to
partition2 and likewise. So, in such cases, will the leader process
accumulate all the tuples belonging to partition1 into one DSM and tuples
belonging to partition2 into some other DSM and assign them to the worker
process or we have taken some other approach to handle this scenario?

Further, I haven't had much time to look into the links that you shared
in your previous response. I will have a look at those and will also
gradually start looking into the patches as and when I get some time. Thank
you.

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com

On Sat, Jun 13, 2020 at 9:42 AM Amit Kapila  wrote:

> On Fri, Jun 12, 2020 at 4:57 PM Ashutosh Sharma 
> wrote:
> >
> > Hi All,
> >
> > I've spent little bit of time going through the project discussion that
> has happened in this email thread and to start with I have few questions
> which I would like to put here:
> >
> > Q1) Are we also planning to read the input data in parallel or is it
> only about performing the multi-insert operation in parallel? AFAIU, the
> data reading part will be done by the leader process alone so no
> parallelism is involved there.
> >
>
> Yes, your understanding is correct.
>
> > Q2) How are we going to deal with the partitioned tables?
> >
>
> I haven't studied the patch but my understanding is that we will
> support parallel copy for partitioned tables with a few restrictions
> as explained in my earlier email [1].  See, Case-2 (b) in the email.
>
> > I mean will there be some worker process dedicated for each partition or
> how is it?
>
> No, the split is just based on the input; otherwise each worker
> should insert just as we would have done without any workers.
>
> > Q3) Incase of toast tables, there is a possibility of having a single
> tuple in the input file which could be of a very big size (probably in GB)
> eventually resulting in a bigger file size. So, in this case, how are we
> going to decide the number of worker processes to be launched. I mean,
> although the file size is big, but the number of tuples to be processed is
> just one or few of them, so, can we decide the number of the worker
> processes to be launched based on the file size?
> >
>
> Yeah, such situations would be tricky, so we should have an option for
> user to specify the number of workers.
>
> > Q4) Who is going to process constraints (preferably the deferred
> constraint) that is supposed to be executed at the COMMIT time? I mean is
> it the leader process or the worker process or in such cases we won't be
> choosing the parallelism at all?
> >
>
> In the first version, we won't do parallelism for this.  Again, see
> one of my earlier email [1] where I have explained this and other
> cases where we won't be supporting parallel copy.
>
> > Q5) Do we have any risk of table bloating when the data is loaded in
> parallel. I am just asking this because incase of parallelism there would
> be multiple processes performing bulk insert into a table. There is a
> chance that the table file might get extended even if there is some space
> into the file being written into, but that space is locked by some other
> worker process and hence that might result in a creation of a new block for
> that table. Sorry, if I am missing something here.
> >
>
> Hmm, each worker will operate at page level, after first insertion,
> the same worker will try to insert in the same page in which it has
> inserted last, so there shouldn't be such a problem.
>
> > Please note that I haven't gone through all the emails in this thread so
> there is a possibility that I might have repeated the question that has
> already been raised and answered here. If that is the case, I am sorry for
> that, but it would be very helpful if someone could point out that thread
> so that I can go through it. Thank you.
> >
>
> No problem, I understand sometimes it is difficult to go through each
> and every email especially when the discussion is long.  Anyway,
> thanks for showing the interest in the patch.
>
> [1] -
> https://www.postgresql.org/message-id/CAA4eK1%2BANNEaMJCCXm4naweP5PLY6LhJMvGo_V7-Pnfbh6GsOA%40mail.gmail.com
>
>
> --
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com
>


Re: factorial of negative numbers

2020-06-15 Thread Tom Lane
Peter Eisentraut  writes:
> Adjacent to the discussion in [0] I wanted to document the factorial() 
> function and expand the tests for that slightly with some edge cases.
> ...
> I propose to change this to error out for negative numbers.

+1 for all of this, with a couple trivial nitpicks about the error
changes:

* I'd have written the error as "factorial of a negative number is
undefined" ... not sure what a grammar stickler would say about it,
but that seems more natural to me.

* I'd leave the "if (num <= 1)" test after the error check as-is;
it's probably a shade cheaper than "if (num == 0 || num == 1)".

regards, tom lane




Re: Transactions involving multiple postgres foreign servers, take 2

2020-06-15 Thread Masahiko Sawada
On Mon, 15 Jun 2020 at 15:20, Amit Kapila  wrote:
>
> On Sun, Jun 14, 2020 at 2:21 PM Tatsuo Ishii  wrote:
> >
> > >> Won't it create an inconsistency in viewing the data from the
> > >> different servers?  Say, such a transaction inserts one row into a
> > >> local server and another into the foreign server.  Now, if we follow
> > >> the above protocol, the user will be able to see the row from the
> > >> local server but not from the foreign server.
> > >
> > > Yes, you're right. This atomic commit feature doesn't guarantee such
> > > consistent visibility so-called atomic visibility.
>
> Okay, I understand that the purpose of this feature is to provide
> atomic commit which means the transaction on all servers involved will
> either commit or rollback.  However, I think we should at least see at
> a high level how the visibility will work because it might influence
> the implementation of this feature.
>
> > > Even the local
> > > server is not modified, since a resolver process commits prepared
> > > foreign transactions one by one another user could see an inconsistent
> > > result. Providing globally consistent snapshots to transactions
> > > involving foreign servers is one of the solutions.
>
> How would it be able to do that?  Say, when it decides to take a
> snapshot the transaction on the foreign server appears to be committed
> but the transaction on the local server won't appear to be committed,
> so the consistent data visibility problem as mentioned above could
> still arise.

There are many solutions. For instance, in Postgres-XC/X2 (and maybe
XL), there is a GTM node that is responsible for providing global
transaction IDs (GXID) and globally consistent snapshots. All
transactions need to access the GTM when checking the distributed
transaction status as well as when starting and ending transactions.
IIUC, if a global transaction accesses a tuple whose GXID is included
in its global snapshot, it waits for that transaction to be committed
or rolled back.

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: hashagg slowdown due to spill changes

2020-06-15 Thread Tomas Vondra

On Sun, Jun 14, 2020 at 11:09:55PM -0700, Jeff Davis wrote:
> On Sun, 2020-06-14 at 11:14 -0700, Andres Freund wrote:
> > I'm somewhat inclined to think that we should revert 4cad2534da6 and
> > then look at how precisely to tackle this in 14.
>
> I'm fine with that.



I don't see how we could just revert 4cad2534d and leave this for v14.

The hashagg spilling is IMHO almost guaranteed to be a pain point for
some users, as it will force some queries to serialize large amounts of
data. Yes, some of this is a cost for hashagg enforcing work_mem at
runtime, I'm fine with that. We'd get reports about that too, but we can
justify that cost ...

But just reverting 4cad2534d will make this much worse, I think, as
illustrated by the benchmarks I did in [1]. And no, this is not really
fixable by tweaking the cost parameters - even with the current code
(i.e. 4cad2534d in place) I had to increase random_page_cost to 60 on
the temp tablespace (on SATA RAID) to get good plans with parallelism
enabled. I haven't tried, but I presume without 4cad2534d I'd have to
push r_p_c even further ...

[1] 
https://www.postgresql.org/message-id/20200519151202.u2p2gpiawoaznsv2%40development
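
For reference, the kind of per-tablespace tuning mentioned above can be done
like this (hypothetical tablespace name; a sketch, not a recommendation):

    ALTER TABLESPACE temp_spill_ts SET (random_page_cost = 60);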


> > It'd probably make sense to request small tlists when the number of
> > estimated groups is large, and not otherwise.
>
> That seems like a nice compromise that would be non-invasive, at least
> for create_agg_plan().



Maybe. It'd certainly be better than nothing. It's not clear to me what
a good threshold would be, though. And it's not going to handle cases of
under-estimates.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Re: how to create index concurrently on partitioned table

2020-06-15 Thread 李杰(慎追)
> That's a problem.  I really think that we should make the steps of the
> concurrent operation consistent across all relations, meaning that all
> the indexes should be created as invalid for all the parts of the
> partition tree, including partitioned tables as well as their
> partitions, in the same transaction.  Then a second new transaction
> gets used for the index build, followed by a third one for the
> validation that switches the indexes to become valid.

This is a good idea.
We should maintain the consistency of the entire partitioned table.
However, this is not a small change in the code.
Maybe we need a new design for the DefineIndex function.

But most importantly, if we use three steps to implement CIC,
we will spend more time than on ordinary tables, especially in highly
concurrent cases.
While waiting on one partition that users access frequently,
the creation of the other partitions' indexes is delayed.
Is it worth doing this?

 Regards,  Adger





--
From: Michael Paquier 
Sent: Monday, June 15, 2020 20:37
To: 李杰(慎追) 
Cc: Justin Pryzby ; pgsql-hackers
; 曾文旌(义从) ;
Alvaro Herrera 
Subject: Re: Re: how to create index concurrently on partitioned table

On Mon, Jun 15, 2020 at 08:15:05PM +0800, 李杰(慎追) wrote:
> As shown above, an error occurred while creating the index on the second
> partition.
> It can be clearly seen that the index on the partitioned table is invalid,
> the index on the first partition is normal, the index on the second
> partition is invalid,
> and the index on the third partition does not exist at all.

That's a problem.  I really think that we should make the steps of the
concurrent operation consistent across all relations, meaning that all
the indexes should be created as invalid for all the parts of the
partition tree, including partitioned tables as well as their
partitions, in the same transaction.  Then a second new transaction
gets used for the index build, followed by a third one for the
validation that switches the indexes to become valid.
--
Michael


Failures with installcheck and low work_mem value in 13~

2020-06-15 Thread Michael Paquier
Hi all,

Attempting to run installcheck with 13~ and a value of work_mem lower
than the default causes two failures, both related to incremental
sorts (here work_mem = 1MB):
1) Test incremental_sort:
@@ -4,12 +4,13 @@
 select * from (select * from tenk1 order by four) t order by four, ten;
 QUERY PLAN 
 ---
- Sort
+ Incremental Sort
Sort Key: tenk1.four, tenk1.ten
+   Presorted Key: tenk1.four
->  Sort
  Sort Key: tenk1.four
  ->  Seq Scan on tenk1
-(5 rows)
+(6 rows)

2) Test join:
@@ -2368,12 +2368,13 @@
->  Merge Left Join
  Merge Cond: (x.thousand = y.unique2)
  Join Filter: ((x.twothousand = y.hundred) AND (x.fivethous = 
y.unique2))
- ->  Sort
+ ->  Incremental Sort
Sort Key: x.thousand, x.twothousand, x.fivethous
-   ->  Seq Scan on tenk1 x
+   Presorted Key: x.thousand
+   ->  Index Scan using tenk1_thous_tenthous on tenk1 x
  ->  Materialize
->  Index Scan using tenk1_unique2 on tenk1 y
-(9 rows)
+(10 rows)

There are of course regression failures when changing the relation
page size or such, but the tests should be more portable when it
comes to work_mem (this issue does not exist in ~12), or people running
installcheck on a new instance would be surprised.  Please note that I
have not looked at the problem in detail, but a simple solution would
be to enforce work_mem in those code paths to keep the two plans
stable.
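
As a sketch of that idea (untested), the test file could simply pin the GUC
around the affected queries:

    SET work_mem = '4MB';  -- the default value, pinned explicitly
    EXPLAIN (COSTS OFF)
    SELECT * FROM (SELECT * FROM tenk1 ORDER BY four) t ORDER BY four, ten;
    RESET work_mem;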

Thanks,
--
Michael


signature.asc
Description: PGP signature


Failures with wal_consistency_checking and 13~

2020-06-15 Thread Michael Paquier
Hi all,

I have begun my annual study of WAL consistency across replays, and
wal_consistency_checking = 'all' is pointing out at some issues with
at least VACUUM and SPGist:
FATAL:  inconsistent page found, rel 1663/16385/22133, forknum 0,
blkno 15
CONTEXT:  WAL redo at 0/739CEDE8 for SPGist/VACUUM_REDIRECT: newest
XID 4619

It may be possible that there are other failures, I have just run
installcheck and this is the first failure I saw after replaying all
the generated WAL on a standby.  Please note that I have also checked
12, and installcheck passes.

Thanks,
--
Michael


signature.asc
Description: PGP signature


Re: Postgresql13_beta1 (could not rename temporary statistics file) Windows 64bits

2020-06-15 Thread Ranier Vilela
On Sun, Jun 14, 2020 at 11:53 PM, Justin Pryzby 
wrote:

> On Fri, Jun 12, 2020 at 03:15:52PM -0300, Ranier Vilela wrote:
> > Posgres13_beta1, is consistently writing to the logs, "could not rename
> > temporary statistics file".
> > When analyzing the source that writes the log, I simplified the part that
> > writes the logs a little.
>
> What windows version and compiler ?
>
Windows 10 (2004), msvc 2019 (64 bits).
No special configuration, only git clone and build.bat.


>
> Please show the full CSV log for this event, and not an excerpt.
> Preferably with several lines of "context" for the stats process PID, with
> log_min_messages=debug or debug2 and log_error_verbosity=verbose, so that
> you
> get the file location where it's erroring, if you don't already know that.
>
> https://wiki.postgresql.org/wiki/Guide_to_reporting_problems
>
Ok, I will provide.



> > 1. I changed from if else if to if
> > 2. For the user, better to have more errors recorded, which can help in
> > discovering the problem
> > 3. Errors are independent of each other
> > 4. If I can't release tmpfile, there's no way to delete it (unlink)
> > 5. If I can rename, there is no need to delete it (unlink) tmpfile.
> >
> > Attached is the patch that proposes these changes.
> > Now, the problem has not been solved.
>
> It sounds like you haven't yet found the problem, right?  These are all
> unrelated changes which are confusing the problem report and discussion.
> And introducing behavior regressions, like renaming files with write
> errors on
> top of known good files.
>
It is certainly in pgstat.c.
Yes, I have not yet discovered the real cause.
But while checking the code, I thought I could improve the error checking,
and to avoid creating a new thread about it, I took advantage of this thread.
It would certainly be better to separate them, but this list is busy and I
tried not to create any more confusion.
If I can't write, can't close, or can't rename, seeing each of these in the
logs certainly helps to solve the problem rather than confuse it.
In addition, the flow is cleaner, with fewer instructions, and in this
specific case it continues with the same expected behavior.
Only the rename message is showing, so the expected behavior is the same.


> I think you'll want to 1) identify where the problem is occuring, and
> attach a
> debugger there.
>
I will try.


>
> 2) figure out when the problem was introduced.  If this problem doesn't
> happen
> under v12:
>
I don't think so, but for v12 I'm using the officially distributed version.


>
> git log --cherry-pick -p origin/REL_12_STABLE...origin/REL_13_STABLE --
> src/backend/postmaster/pgstat.c
> or just:
> git log -p origin/REL_12_STABLE.. src/backend/postmaster/pgstat.c
>
> You could try git-bisecting between v12..v13, but there's only 30 commits
> which
> touched pgstat.c (assuming that's where the ERROR is being thrown)

Thanks for the hint.


> .
>
> Do you have a special value of stats_temp_directory?
>
No.


> Or a symlink or junction at pg_stat_tmp ?
>
No. Only the C drive (one volume).

Did you see the patch regarding the flag that I suggested?


> > 1. statfile, is it really closed or does it not exist in the directory?
> >  There is no way to rename a file, which is open and in use.
> > 2. Why delete (pgstat_stat_filename), if permanent is true:
> > const char * statfile = permanent? PGSTAT_STAT_PERMANENT_FILENAME:
> > pgstat_stat_filename;
> > statfile is PGSTAT_STAT_PERMANENT_FILENAME and not pgstat_stat_filename
>
> You can find answers to a lot of questions in the git history.  In this
> case,
> 70d756970 and 187492b6c.
>
Ok.

Thanks for answer Justin.

regards,
Ranier Vilela




Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2020-06-15 Thread Amit Kapila
On Mon, Jun 15, 2020 at 9:12 AM Dilip Kumar  wrote:
>
> On Fri, Jun 12, 2020 at 4:35 PM Amit Kapila  wrote:
> >
>
> > >  Basically, this part is still
> > > I have to work upon, once we get the consensus then I can remove those
> > > extra wait event from the patch.
> > >
> >
> > Okay, feel free to send an updated patch with the above change.
>
> Sure, I will do that in the next patch set.
>

I have a few more comments on the patch
0013-Change-buffile-interface-required-for-streaming-.patch:

1.
- * temp_file_limit of the caller, are read-only and are automatically closed
- * at the end of the transaction but are not deleted on close.
+ * temp_file_limit of the caller, are read-only if the flag is set and are
+ * automatically closed at the end of the transaction but are not deleted on
+ * close.
  */
 File
-PathNameOpenTemporaryFile(const char *path)
+PathNameOpenTemporaryFile(const char *path, int mode)

No need to say "are read-only if the flag is set".  I don't see any
flag passed to the function, so that part of the comment doesn't seem
appropriate.

2.
@@ -68,7 +68,8 @@ SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg)
  }

  /* Register our cleanup callback. */
- on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset));
+ if (seg)
+ on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset));
 }

Add comments atop function to explain when we don't want to register
the dsm detach stuff?

3.
+ */
+ newFile = file->numFiles - 1;
+ newOffset = FileSize(file->files[file->numFiles - 1]);
  break;

FileSize can return negative lengths to indicate failure, which we
should handle.  See other places in the code where FileSize is used.
But I have another question here, which is why we need to implement
SEEK_END at all.  How do other usages of the BufFile interface take
care of this?  I see an API BufFileTell which can give the current
read/write location in the file; isn't that sufficient for your usage?
Also, how was this handled in the patch before the BufFile usage?

4.
+ /* Loop over all the  files upto the fileno which we want to truncate. */
+ for (i = file->numFiles - 1; i >= fileno; i--)

"the  files", extra space in the above part of the comment.

5.
+ /*
+ * Except the fileno,  we can directly delete other files.

Before 'we', there is extra space.

6.
+ else
+ {
+ FileTruncate(file->files[i], offset, WAIT_EVENT_BUFFILE_READ);
+ newOffset = offset;
+ }

The wait event passed here doesn't seem to be appropriate.  You might
want to introduce a new wait event WAIT_EVENT_BUFFILE_TRUNCATE.  Also,
the error handling for FileTruncate is missing.

7.
+ if ((i != fileno || offset == 0) && fileno != 0)
+ {
+ SharedSegmentName(segment_name, file->name, i);
+ SharedFileSetDelete(file->fileset, segment_name, true);
+ newFile--;
+ newOffset = MAX_PHYSICAL_FILESIZE;
+ }

Similar to the previous comment, I think we should handle the failure
of SharedFileSetDelete.

8. I think the comments related to BufFile shared API usage need to be
expanded in the code to explain the new usage.  For ex., see the below
comments atop buffile.c
* BufFile supports temporary files that can be made read-only and shared with
* other backends, as infrastructure for parallel execution.  Such files need
* to be created as a member of a SharedFileSet that all participants are
* attached to.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Postgresql13_beta1 (could not rename temporary statistics file) Windows 64bits

2020-06-15 Thread Ranier Vilela
On Sun, Jun 14, 2020 at 11:08 PM, Michael Paquier 
wrote:

> On Fri, Jun 12, 2020 at 03:15:52PM -0300, Ranier Vilela wrote:
> > Posgres13_beta1, is consistently writing to the logs, "could not rename
> > temporary statistics file".
> > When analyzing the source that writes the log, I simplified the part that
> > writes the logs a little.
>
> FWIW, I have been running a server on Windows for some time with
> pgbench running in the background, combined with some starts and stops
> but I cannot see that.  This log entry uses LOG, which is the level we
> use for the TAP tests and please note that there are four buildfarm
> animals for Windows able to run the TAP tests and they don't seem to
> show that problem either: drongo, fairywen, jacana and bowerbird.  I
> may be missing something of course

Hi Michael, thanks for the answer.
Yeah, something is wrong with Postgres and Windows 10 (not Server) with
msvc 2019 (64 bits).
I already reported on another thread that vcregress is failing with
(float8 and partitionprune), and now these messages are showing up.
No buildfarm animal has that combination, but as Postgres officially
supports it ..

regards,
Ranier Vilela




Re: Re: how to create index concurrently on partitioned table

2020-06-15 Thread Michael Paquier
On Mon, Jun 15, 2020 at 08:15:05PM +0800, 李杰(慎追) wrote:
> As shown above, an error occurred while creating the index on the second
> partition.
> It can be clearly seen that the index on the partitioned table is invalid,
> the index on the first partition is normal, the index on the second
> partition is invalid,
> and the index on the third partition does not exist at all.

That's a problem.  I really think that we should make the steps of the
concurrent operation consistent across all relations, meaning that all
the indexes should be created as invalid for all the parts of the
partition tree, including partitioned tables as well as their
partitions, in the same transaction.  Then a second new transaction
gets used for the index build, followed by a third one for the
validation that switches the indexes to become valid.
--
Michael


signature.asc
Description: PGP signature


Re: doc review for v13

2020-06-15 Thread Michael Paquier
On Fri, Jun 12, 2020 at 09:13:02PM +0900, Michael Paquier wrote:
> I have merged 0003 and 0004 together and applied them.  0005 seems to
> have a separate issue as mentioned upthread, and I have not really
> looked at 0001 and 0002.  Thanks.

And committed 0001 and 0002 after some tiny adjustments as of
7a3543c.
--
Michael


signature.asc
Description: PGP signature


回复:how to create index concurrently on partitioned table

2020-06-15 Thread 李杰(慎追)
> My (tentative) understanding is that these types of things should use a
> "subtransaction" internally.  So if the toplevel transaction rolls back, its
> changes are lost.  In some cases, it might be desirable to not roll back, in
> which case the user (client) should first create indexes (concurrently if
> needed) on every child, and then later create the index on the parent (that
> has the advantage of working on older servers, too).

Hi Justin, 
I have a case here, you see if it meets your expectations.

`
postgres=# CREATE TABLE prt1 (a int, b int, c varchar) PARTITION BY RANGE(a);
CREATE TABLE
postgres=# CREATE TABLE prt1_p1 PARTITION OF prt1 FOR VALUES FROM (0) TO (25);
CREATE TABLE
postgres=# CREATE TABLE prt1_p2 PARTITION OF prt1 FOR VALUES FROM (25) TO (50);
CREATE TABLE
postgres=# CREATE TABLE prt1_p3 PARTITION OF prt1 FOR VALUES FROM (50) TO (75);
CREATE TABLE
postgres=# INSERT INTO prt1 SELECT i, i % 25, to_char(i, 'FM') FROM 
generate_series(0, 74) i;
INSERT 0 75
postgres=# insert into prt1 values (26,1,'FM0026');
INSERT 0 1
postgres=# create unique index CONCURRENTLY idexpart_cic on prt1 (a);
ERROR: could not create unique index "prt1_p2_a_idx"
DETAIL: Key (a)=(26) is duplicated.
postgres=# \d+ prt1
 Partitioned table "public.prt1"
 Column |       Type        | Collation | Nullable | Default | Storage  | Stats target | Description
--------+-------------------+-----------+----------+---------+----------+--------------+-------------
 a      | integer           |           |          |         | plain    |              |
 b      | integer           |           |          |         | plain    |              |
 c      | character varying |           |          |         | extended |              |
Partition key: RANGE (a)
Indexes:
 "idexpart_cic" UNIQUE, btree (a) INVALID
Partitions: prt1_p1 FOR VALUES FROM (0) TO (25),
 prt1_p2 FOR VALUES FROM (25) TO (50),
 prt1_p3 FOR VALUES FROM (50) TO (75)

postgres=# \d+ prt1_p1
 Table "public.prt1_p1"
 Column |       Type        | Collation | Nullable | Default | Storage  | Stats target | Description
--------+-------------------+-----------+----------+---------+----------+--------------+-------------
 a      | integer           |           |          |         | plain    |              |
 b      | integer           |           |          |         | plain    |              |
 c      | character varying |           |          |         | extended |              |
Partition of: prt1 FOR VALUES FROM (0) TO (25)
Partition constraint: ((a IS NOT NULL) AND (a >= 0) AND (a < 25))
Indexes:
 "prt1_p1_a_idx" UNIQUE, btree (a)
Access method: heap

postgres=# \d+ prt1_p2
 Table "public.prt1_p2"
 Column |       Type        | Collation | Nullable | Default | Storage  | Stats target | Description
--------+-------------------+-----------+----------+---------+----------+--------------+-------------
 a      | integer           |           |          |         | plain    |              |
 b      | integer           |           |          |         | plain    |              |
 c      | character varying |           |          |         | extended |              |
Partition of: prt1 FOR VALUES FROM (25) TO (50)
Partition constraint: ((a IS NOT NULL) AND (a >= 25) AND (a < 50))
Indexes:
 "prt1_p2_a_idx" UNIQUE, btree (a) INVALID
Access method: heap

postgres=# \d+ prt1_p3
 Table "public.prt1_p3"
 Column |       Type        | Collation | Nullable | Default | Storage  | Stats target | Description
--------+-------------------+-----------+----------+---------+----------+--------------+-------------
 a      | integer           |           |          |         | plain    |              |
 b      | integer           |           |          |         | plain    |              |
 c      | character varying |           |          |         | extended |              |
Partition of: prt1 FOR VALUES FROM (50) TO (75)
Partition constraint: ((a IS NOT NULL) AND (a >= 50) AND (a < 75))
Access method: heap
```
As shown above, an error occurred while creating the index on the second
partition.
It can be clearly seen that the index on the partitioned table is invalid,
the index on the first partition is normal, the index on the second
partition is invalid,
and the index on the third partition does not exist at all.

But we do the following tests again:
```
postgres=# truncate table prt1;
TRUNCATE TABLE
postgres=# INSERT INTO prt1 SELECT i, i % 25, to_char(i, 'FM') FROM 
generate_series(0, 74) i;
INSERT 0 75
postgres=# insert into prt1 values (51,1,'FM0051');
INSERT 0 1
postgres=# drop index idexpart_cic;
DROP INDEX
postgres=# create unique index CONCURRENTLY idexpart_cic on prt1 (a);
ERROR: could not create unique index "prt1_p3_a_idx"
DETAIL: Key (a)=(51) is duplicated.
postgres=# \d+ prt1
 Partitioned table "public.prt1"
 Column |       Type        | Collation | Nullable | Default | Storage  | Stats target | Description
--------+-------------------+-----------+----------+---------+----------+--------------+-------------
 a      | integer           |           |          |         | plain    |              |
 b      | integer           |           |          |         | plain    |              |
 c      | character varying |           |          |         | extended |              |
Partition key: RANGE (a)
Indexes:
 "idexpart_cic" UNIQUE, btree (a) INVALID
Partitions: prt1_p1 FOR VALUES FROM (0) TO (25),
 prt1_p2 FOR VALUES FROM (25) TO (50),
 prt1_p3 FOR VALUES FROM (50) TO (75)

postgres=# \d+ prt1_p1
 Table "public.prt1_p1"
 Column |       Type        | Collation | Nullable | Default | Storage  | Stats target | Description
--------+-------------------+-----------+----------+---------+----------+--------------+-------------
 a      | integer           |           |          |         | plain    |              |
 b      | integer           |           |          |         | plain    |              |
 c      | character varying |  

Re: Removal of currtid()/currtid2() and some table AM cleanup

2020-06-15 Thread Inoue, Hiroshi

Sorry for the late reply.

On 2020/06/08 15:52, Michael Paquier wrote:

On Fri, Jun 05, 2020 at 10:25:00PM +0900, Inoue, Hiroshi wrote:

Keyset-driven cursors always detect changes made by other applications
(and themselves). currtid() is necessary to detect the changes.
CTIDs are changed by updates unfortunately.

You mean currtid2() here and not currtid(), right?


Yes.


   We have two
problems here then:
1) We cannot actually really remove currtid2() from the backend yet
without removing the dependency in the driver, or that may break some
users.


I think only ODBC driver uses currtid2().


2) The driver does not include tests for that stuff yet.


The SQLSetPos(.., .., SQL_REFRESH, ..) call in positioned-update-test passes
the stuff when the 'Use Declare/Fetch' option is turned off. In other words,
keyset-driven cursors are not supported when 'Use Declare/Fetch' is turned
on. Probably keyset-driven cursor support would be lost regardless of the
'Use Declare/Fetch' option after the
removal of currtid2().


--
Michael






Re: factorial of negative numbers

2020-06-15 Thread Ashutosh Bapat
On Mon, Jun 15, 2020 at 12:41 PM Peter Eisentraut
 wrote:
>
> Adjacent to the discussion in [0] I wanted to document the factorial()
> function and expand the tests for that slightly with some edge cases.
>
> I noticed that the current implementation returns 1 for the factorial of
> all negative numbers:
>
> SELECT factorial(-4);
>   factorial
> ---
>   1
>
> While there are some advanced mathematical constructions that define
> factorials for negative numbers, they certainly produce different
> answers than this.
>
> Curiously, before the reimplementation of factorial using numeric
> (04a4821adef38155b7920ba9eb83c4c3c29156f8), it returned 0 for negative
> numbers, which is also not correct by any theory I could find.
>
> I propose to change this to error out for negative numbers.

+1.
Here are some comments:
I see the below in the .out file but there's no corresponding SQL in the .sql file.
+SELECT factorial(-4);
+ factorial
+---
+ 1
+(1 row)
+

Should we also add -4! to cover both the function as well as the operator?
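
For example, a sketch of the suggested extra coverage (the expected output
depends on the patch; the parentheses make the minus apply before the
postfix operator):

    SELECT factorial(-4);  -- function form
    SELECT (-4)!;          -- operator form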

+if (num < 0)
+ereport(ERROR,
+(errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),

This looks more like ERRCODE_FEATURE_NOT_SUPPORTED, esp. since the factorial
of negative numbers is defined but we are not supporting it. I looked
at some other usages of this error code. All of them really are
out-of-range value errors.

Otherwise the patches look good to me.




Re: Parallel copy

2020-06-15 Thread Bharath Rupireddy
Hi,

Attached is the patch supporting parallel copy for binary format files.

The performance improvement achieved with different numbers of workers is
shown below. The dataset used has 10 million tuples and is 5.3GB in size.

Test case 1: copy from binary file, 2 indexes on integer columns and 1 index
on text column.
Test case 2: copy from binary file, 1 gist index on text column.
Test case 3: copy from binary file, 3 indexes on integer columns.
(exec times in sec; best time per test case marked with *)

workers | test case 1       | test case 2       | test case 3
--------+-------------------+-------------------+------------------
 0      | 1106.899 (1X)     | 772.758 (1X)      | 171.338 (1X)
 1      | 1094.165 (1.01X)  | 757.365 (1.02X)   | 163.018 (1.05X)
 2      |  618.397 (1.79X)  | 428.304 (1.8X)    | 117.508 (1.46X)
 4      |  320.511 (3.45X)  | 231.938 (3.33X)   |  80.297 (2.13X)
 8      |  172.462 (6.42X)  | 150.212 (5.14X)   | *71.518 (2.39X)*
 16     |  110.460 (10.02X) | *124.929 (6.18X)* |  91.308 (1.88X)
 20     |  *98.470 (11.24X)*|  137.313 (5.63X)  |  95.289 (1.79X)
 30     |  109.229 (10.13X) |  173.54 (4.45X)   |  95.799 (1.78X)

Design followed for developing this patch:

The leader reads data from the file into the DSM data blocks, each of 64KB
size. It also identifies each tuple's data block id, start offset, end
offset, and tuple size, and updates this information in the ring data
structure. Workers read the tuple information from the ring data structure
and the actual tuple data from the data blocks in parallel, and insert the
tuples into the table in parallel.

Please note that this patch can be applied on the series of patches that
were posted previously[1] for parallel copy for csv/text files.
The correct order to apply all the patches is -
0001-Copy-code-readjustment-to-support-parallel-copy.patch

0002-Framework-for-leader-worker-in-parallel-copy.patch

0003-Allow-copy-from-command-to-process-data-from-file-ST.patch

0004-Documentation-for-parallel-copy.patch

and
0005-Parallel-Copy-For-Binary-Format-Files.patch

The above tests were run with the configuration attached config.txt, which
is the same used for performance tests of csv/text files posted earlier in
this mail chain.

Request the community to take this patch up for review along with the
parallel copy for csv/text file patches and provide feedback.

[1] -
https://www.postgresql.org/message-id/CALDaNm3uyHpD9sKoFtB0EnMO8DLuD6H9pReFm%3Dtm%3D9ccEWuUVQ%40mail.gmail.com

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
Postgres configuration used for above testing:
shared_buffers = 40GB
max_worker_processes = 32
max_parallel_maintenance_workers = 24
max_parallel_workers = 32
synchronous_commit = off
checkpoint_timeout = 1d
max_wal_size = 24GB
min_wal_size = 15GB
autovacuum = off

System Configuration:
RAM: 503GB
Disk Type:   SSD
Disk Size:   1.6TB
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                128
On-line CPU(s) list:   0-127
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             8
NUMA node(s):          8
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 47
Model name:            Intel(R) Xeon(R) CPU E7- 8830  @ 2.13GHz
Stepping:              2
CPU MHz:               1064.000
CPU max MHz:           2129.
CPU min MHz:           1064.
BogoMIPS:              4266.62
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              24576K


0005-Parallel-Copy-For-Binary-Format-Files.patch
Description: Binary data


Re: Transactions involving multiple postgres foreign servers, take 2

2020-06-15 Thread Amit Kapila
On Mon, Jun 15, 2020 at 12:30 PM Tatsuo Ishii  wrote:
>
> >> Another approach to the atomic visibility problem is to control
> >> snapshot acquisition timing and commit timing (plus using REPEATABLE
> >> READ). In the REPEATABLE READ transaction isolation level, PostgreSQL
> >> assigns a snapshot at the time when the first command is executed in a
> >> transaction. If we could prevent any commit while any transaction is
> >> acquiring snapshot, and we could prevent any snapshot acquisition while
> >> committing, visibility inconsistency which Amit explained can be
> >> avoided.
> >>
> >
> > I think the problem mentioned above can occur with this as well or if
> > I am missing something then can you explain in further detail how it
> > won't create problem in the scenario I have used above?
>
> So the problem you mentioned above is like this? (S1/S2 denote
> transactions (sessions), N1/N2 are the PostgreSQL servers.)  Since S1
> already committed on N1, S2 sees the row on N1.  However, S2 does not
> see the row on N2 since S1 has not committed on N2 yet.
>

Yeah, something along these lines, but S2 can execute the query on N1
directly, which should fetch the data from both N1 and N2.  Even if
there is a solution using the REPEATABLE READ isolation level, we might
not prefer to use that as the only level for distributed transactions,
as it might be too costly, but let us first see how it solves the
problem.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: jsonpath versus NaN

2020-06-15 Thread Alexander Korotkov
On Thu, Jun 11, 2020 at 10:00 PM Tom Lane  wrote:
>
> Alexander Korotkov  writes:
> > On Thu, Jun 11, 2020 at 3:45 PM Tom Lane  wrote:
> >> It is entirely clear from the code, the documentation,
> >> and the relevant RFCs that JSONB does not allow NaNs as numeric
> >> values.
>
> > The JSONB itself doesn't store number NaNs.  It stores the string "NaN".
>
> Yeah, but you have a numeric NaN within the JsonbValue tree between
> executeItemOptUnwrapTarget and convertJsonbScalar.  Who's to say that
> that illegal-per-the-data-type structure won't escape to somewhere else?
> Or perhaps more likely, that we'll need additional warts in other random
> places in the JSON code to keep from spitting up on the transiently
> invalid structure.


I would propose to split two things: user-visible behavior and
internal implementation.  Internal implementation, which allows
numeric NaN within the JsonbValue, isn't perfect and we could improve
it.  But I'd like to determine desired user-visible behavior first,
then we can decide how to fix the implementation.

>
> > I found the relevant part of the standard.  Unfortunately, I can't
> > post the full standard here due to its license, but I think I can cite
> > the relevant part.
>
> I don't think this is very relevant.  The SQL standard has not got the
> concepts of Inf or NaN either (see 4.4.2 Characteristics of numbers),
> therefore their definition is only envisioning that a string representing
> a normal finite number should be castable to DOUBLE PRECISION.  Thus,
> both of the relevant standards think that "numbers" are just finite
> numbers.
>
> So when neither JSON nor SQL consider that "NaN" is an allowed sort
> of number, why are you doing violence to the code to allow it in a
> jsonpath?

Yes, I see.  No standard insists we should support NaN.  However, the
standard claims .double() should behave the same as a CAST to double.
So, I think that if CAST supports NaN but .double() doesn't, it's still
a violation.

> And if you insist on doing such violence, why didn't you
> do some more and kluge it to the point where "Inf" would work too?


Yep, according to the standard, .double() should support "Inf" as long as
CAST to double does.  The reason it wasn't implemented is that we use
numeric as the internal storage for all numbers, and numeric doesn't
support Inf yet.

> (It would require slightly less klugery in the wake of the infinities-
> in-numeric patch that I'm going to post soon ... but that doesn't make
> it a good idea.)


If numeric supported infinities, we could follow the standard and make
the .double() method work the same way as CAST to double does.  Now, I
get that there is not much reason to keep the current behaviour, which
supports NaN but doesn't support Inf.  I think we should either support
both NaN and Inf or support neither of them.  The latter is a violation
of the standard, but provides us with a simpler and cleaner
implementation.  What do you think?
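
To illustrate the asymmetry (a sketch assuming the current v13 behavior
described above):

    SELECT jsonb_path_query('"NaN"', '$.double()');       -- accepted today
    SELECT jsonb_path_query('"Infinity"', '$.double()');  -- rejected today
    SELECT 'NaN'::float8, 'Infinity'::float8;             -- CAST accepts both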

BTW, we found what the standard says about serialization of SQL/JSON items.

9.37 Serializing an SQL/JSON item (page 695)
ii) Let JV be an implementation-dependent value of type TT and
encoding ENC such that these two conditions hold:
1) JV is a JSON text.
2) When applying the General Rules of Subclause 9.36, “Parsing JSON
text” with JV as JSON TEXT, FO as FORMAT OPTION, and WITHOUT UNIQUE
KEYS as UNIQUENESS CONSTRAINT, the returned STATUS is successful
completion and the returned SQL/JSON ITEM is an SQL/JSON item that is
equivalent to SJI.
If there is no such JV, then let ST be the exception condition: data
exception — invalid JSON text.

Basically it says that the resulting text should produce the same
SQL/JSON item when parsed.  I think this literally means that
serialization of numeric NaN is impossible, since it's impossible
to get numeric NaN as the result of JSON parsing.  However, in the same
way this would mean that serialization of datetime is also impossible,
but that seems like nonsense.  So, I think this paragraph of the
standard is ill-conceived.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Review for GetWALAvailability()

2020-06-15 Thread Fujii Masao




On 2020/06/15 13:42, Kyotaro Horiguchi wrote:

At Sat, 13 Jun 2020 01:38:49 +0900, Fujii Masao  
wrote in

Hi,

The document explains that "lost" value that
pg_replication_slots.wal_status reports means

 some WAL files are definitely lost and this slot cannot be used to
 resume replication anymore.

However, I observed "lost" value while inserting lots of records,
but replication could continue normally. So I wonder if
pg_replication_slots.wal_status may have a bug.

wal_status is calculated in GetWALAvailability(), and probably I found
some issues in it.


keepSegs = ConvertToXSegs(Max(max_wal_size_mb, wal_keep_segments),
  wal_segment_size) +
  1;

max_wal_size_mb is the number of megabytes. wal_keep_segments is
the number of WAL segment files. So it's strange to calculate max of
them.


Oops!  I don't want to believe I did that but it's definitely wrong.


The above should be the following?

 Max(ConvertToXSegs(max_wal_size_mb, wal_segment_size),
 wal_keep_segments) + 1


Looks reasonable.


if ((max_slot_wal_keep_size_mb <= 0 ||
 max_slot_wal_keep_size_mb >= max_wal_size_mb) &&
oldestSegMaxWalSize <= targetSeg)
return WALAVAIL_NORMAL;

This code means that wal_status reports "normal" only when
max_slot_wal_keep_size is negative or larger than max_wal_size.
Why is this condition necessary? The document explains "normal
  means that the claimed files are within max_wal_size". So whatever
  max_slot_wal_keep_size value is, IMO that "normal" should be
  reported if the WAL files claimed by the slot are within max_wal_size.
  Thought?


It was kind of hard to decide. Even when max_slot_wal_keep_size is
smaller than max_wal_size, segments beyond
max_slot_wal_keep_size are not guaranteed to be kept.  In that case
the state transitions as NORMAL->LOST, skipping the "RESERVED" state.
Putting aside whether such a setting is useful or not, I thought that the
state transition was somewhat abrupt.


The direct transition of the state from normal to lost is OK to me
if each state is clearly defined.



Or, if that condition is really necessary, the document should be
updated so that the note about the condition is added.


Does the following make sense?

https://www.postgresql.org/docs/13/view-pg-replication-slots.html

normal means that the claimed files are within max_wal_size.
+ If max_slot_wal_keep_size is smaller than max_wal_size, this state
+ will not appear.


I don't think this change is enough. For example, when max_slot_wal_keep_size
is smaller than max_wal_size and the amount of WAL files claimed by the slot
is smaller than max_slot_wal_keep_size, "reserved" is reported. But that is
inconsistent with the meaning of "reserved" in the docs.

To consider what should be reported in wal_status, could you tell me for
what purpose and how users are expected to use this information?
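
For reference, this is how a user would consult the information in question:

    SELECT slot_name, wal_status FROM pg_replication_slots;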



If the WAL files claimed by the slot exceeds max_slot_wal_keep_size
but any those WAL files have not been removed yet, wal_status seems
to report "lost". Is this expected behavior? Per the meaning of "lost"
described in the document, "lost" should be reported only when
any claimed files are removed, I think. Thought?

Or this behavior is expected and the document is incorrect?


In short, it is known behavior but it was judged as useless to prevent
that.

That can happen when checkpointer removes up to the segment that is
being read by walsender.  I think that that doesn't happen (or
happens within a narrow time window?) for physical replication but
happens for logical replication.

During development, I once added code to walsender to exit for that
reason, but finally it was moved to InvalidateObsoleteReplicationSlots
as a somewhat different function.  With the current mechanism, there's a
case where a once-invalidated slot comes back to life, but we decided to
allow that behavior and forgot to document it.

Anyway if you see "lost", something bad is being happening.

- lost means that some WAL files are definitely lost and this slot
- cannot be used to resume replication anymore.
+ lost means that some required WAL files are removed and this slot is
+ no longer usable after once disconnected during this status.

If it is crucial that the "lost" state may come back to reserved or
normal state,

+ Note that there are cases where the state moves back to reserved or
+ normal state when all wal senders have left the just removed segment
+ before being terminated.

There is a case where the state moves back to the reserved or normal state
when walsenders leave the just-removed segment before being terminated.


Even if walsender is terminated during the "lost" state, unless checkpointer
removes the required WAL files, the state can go back to "reserved" after a
new replication connection is established. This is the 

Re: Asynchronous Append on postgres_fdw nodes.

2020-06-15 Thread Andrey V. Lepikhov

On 6/15/20 1:29 PM, Kyotaro Horiguchi wrote:

> Thanks for testing, but..
>
> At Mon, 15 Jun 2020 08:51:23 +0500, "Andrey V. Lepikhov"
>  wrote in
>> The patch has a problem with partitionwise aggregates.
>>
>> Asynchronous append does not allow the planner to use partial
>> aggregates. An example is in the attachment. I can't understand why:
>> the costs of the partitionwise join are less.
>> The initial script and explains of the query with and without the patch
>> are in the attachment.
>
> I had more or less the same plan with the second one without the patch
> (that is, vanilla master/HEAD, but used merge joins instead).
>
> I'm not sure what prevented join pushdown, but the difference between
> the two is whether each partitionwise join is pushed down to
> remote or not. That hardly seems related to the async execution
> patch.
>
> Could you tell me how did you get the first plan?


1. Use a clean current vanilla master.

2. Start two instances with the script 'frgn2n.sh' from the attachment.
There I set these GUCs:
enable_partitionwise_join = true
enable_partitionwise_aggregate = true

3. Execute query:
explain analyze SELECT sum(parts.b)
FROM parts, second
WHERE parts.a = second.a AND second.b < 100;

That's all.

--
Andrey Lepikhov
Postgres Professional
https://postgrespro.com


frgn2n.sh
Description: application/shellscript


Re: SEARCH and CYCLE clauses

2020-06-15 Thread Peter Eisentraut
Here is an updated patch that I think fixes all the cases that you 
identified.  (The issue of what kinds of constants or expressions to 
accept for cycle marks has not been touched.)  To fix the star expansion 
I had to add a little bit of infrastructure that could also be used as a 
more general facility "don't include this column in star expansion", so 
this could perhaps use some consideration from a more general angle as well.


More tests and breakage reports welcome.

--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From f4fc5f63c53ba954066d38f968f54fb8fbf76c59 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Mon, 15 Jun 2020 11:43:52 +0200
Subject: [PATCH v2] SEARCH and CYCLE clauses

Discussion: 
https://www.postgresql.org/message-id/flat/db80ceee-6f97-9b4a-8ee8-3ba0c58e5...@2ndquadrant.com
---
 doc/src/sgml/queries.sgml| 204 +++-
 doc/src/sgml/ref/select.sgml |  41 ++
 src/backend/nodes/copyfuncs.c|  39 ++
 src/backend/nodes/equalfuncs.c   |  35 ++
 src/backend/nodes/nodeFuncs.c|   6 +
 src/backend/nodes/outfuncs.c |  35 ++
 src/backend/nodes/readfuncs.c|  43 ++
 src/backend/parser/gram.y|  56 ++-
 src/backend/parser/parse_cte.c   | 117 +
 src/backend/parser/parse_relation.c  |  54 ++-
 src/backend/parser/parse_target.c|  21 +-
 src/backend/rewrite/rewriteHandler.c | 673 +++
 src/backend/utils/adt/ruleutils.c|  47 ++
 src/include/nodes/nodes.h|   2 +
 src/include/nodes/parsenodes.h   |  28 +-
 src/include/parser/kwlist.h  |   2 +
 src/include/parser/parse_node.h  |   1 +
 src/test/regress/expected/with.out   | 364 +++
 src/test/regress/sql/with.sql| 179 +++
 19 files changed, 1917 insertions(+), 30 deletions(-)

diff --git a/doc/src/sgml/queries.sgml b/doc/src/sgml/queries.sgml
index 572e968273..f98d8d3dc1 100644
--- a/doc/src/sgml/queries.sgml
+++ b/doc/src/sgml/queries.sgml
@@ -1979,6 +1979,10 @@ SELECT in 
WITH
but we'd have needed two levels of nested sub-SELECTs.  
It's a bit
easier to follow this way.
   
+ 
+
+ 
+  Recursive Queries
 
   

@@ -2082,6 +2086,144 @@ Recursive Query Evaluation
 
   
 
+  
+   Search Order
+
+   
+When computing a tree traversal using a recursive query, you might want to
+order the results in either depth-first or breadth-first order.  This can
+be done by computing an ordering column alongside the other data columns
+and using that to sort the results at the end.  Note that this does not
+actually control in which order the query evaluation visits the rows; that
+is as always in SQL implementation-dependent.  This approach merely
+provides a convenient way to order the results afterwards.
+   
+
+   
+To create a depth-first order, we compute for each result row an array of
+rows that we have visited so far.  For example, consider the following
+query that searches a table tree using a
+link field:
+
+
+WITH RECURSIVE search_tree(id, link, data) AS (
+SELECT t.id, t.link, t.data
+FROM tree t
+  UNION ALL
+SELECT t.id, t.link, t.data
+FROM tree t, search_tree st
+WHERE t.id = st.link
+)
+SELECT * FROM search_tree;
+
+
+To add depth-first ordering information, you can write this:
+
+
+WITH RECURSIVE search_tree(id, link, data, path) AS (
+SELECT t.id, t.link, t.data, ARRAY[t.id]
+FROM tree t
+  UNION ALL
+SELECT t.id, t.link, t.data, path || t.id
+FROM tree t, search_tree st
+WHERE t.id = st.link
+)
+SELECT * FROM search_tree ORDER BY path;
+
+   
+
+  
+   In the general case where more than one field needs to be used to identify
+   a row, use an array of rows.  For example, if we needed to track fields
+   f1 and f2:
+
+
+WITH RECURSIVE search_tree(id, link, data, path) AS (
+SELECT t.id, t.link, t.data, ARRAY[ROW(g.f1, g.f2)]
+FROM tree t
+  UNION ALL
+SELECT t.id, t.link, t.data, path || ROW(g.f1, g.f2)
+FROM tree t, search_tree st
+WHERE t.id = st.link
+)
+SELECT * FROM search_tree ORDER BY path;
+
+  
+
+  
+   
+Omit the ROW() syntax in the common case where only one
+field needs to be tracked.  This allows a simple array rather than a
+composite-type array to be used, gaining efficiency.
+   
+  
+
+  
+   To create a breadth-first order, you can add a column that tracks the depth
+   of the search, for example:
+
+
+WITH RECURSIVE search_tree(id, link, data, level) AS (
+SELECT t.id, t.link, t.data, 0
+FROM tree t
+  UNION ALL
+SELECT t.id, t.link, t.data, level + 1
+FROM tree t, search_tree st
+WHERE t.id = st.link
+)
+SELECT * FROM search_tree ORDER BY level;
+
+
+   To get a stable sort, add data columns as secondary sorting columns.
+  
+
+  
+   
+The recursive query evaluation algorithm produces its output in
+breadth-first search order.  

Re: Infinities in type numeric

2020-06-15 Thread Dean Rasheed
On Fri, 12 Jun 2020 at 02:16, Tom Lane  wrote:
>
> * I had to invent some semantics for non-standardized functions,
> particularly numeric_mod, numeric_gcd, numeric_lcm.  This area
> could use review to be sure that I chose desirable behaviors.
>

I think the semantics you've chosen for numeric_mod() are reasonable,
and I think they're consistent with POSIX fmod().

However, it looks like you've chosen gcd('Inf', x) = x, whereas I'd
say that the result should be 'NaN'.

One way to look at it is that the GCD result should exactly divide
both inputs with no remainder, but the remainder when you divide 'Inf'
by x is undefined, so you can't say that x exactly divides 'Inf'.

Another way to look at it is that gcd('Inf', x) is limit(n -> 'Inf',
gcd(n, x)), but that limit isn't well-defined. For example, suppose
x=10, then gcd('Inf', 10) = limit(n -> 'Inf', gcd(n, 10)), but gcd(n,
10) is either 1,2,5 or 10 depending on n, and it does not converge to
any particular value in the limit n -> 'Inf'.

A third way to look at it would be to apply one round of Euclid's
algorithm to it: gcd('Inf', x) = gcd(x, mod('Inf', x)) = gcd(x, 'NaN')
= 'NaN'.

Now you could argue that x=0 is a special case, and gcd('Inf', 0) =
'Inf' on the grounds that gcd(a, 0) = a for all finite 'a'. However, I
don't think that's particularly useful, and it fails the first test
that the result exactly divides both inputs because mod('Inf', 'Inf')
is undefined ('NaN').

Similar arguments apply to lcm(), so I'd say that both gcd() and lcm()
should return 'NaN' if either input is 'Inf' or 'NaN'.
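
Under the proposed semantics, that would look like this (a hypothetical
sketch; 'Inf'::numeric assumes the infinities-in-numeric patch is applied):

    SELECT gcd('Inf'::numeric, 10::numeric);  -- would yield NaN
    SELECT lcm('NaN'::numeric, 10::numeric);  -- would yield NaN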

Regards,
Dean




Re: [bug?] Is the format of tables in the documentation broken in PG 13?

2020-06-15 Thread Daniel Gustafsson
> On 15 Jun 2020, at 09:49, tsunakawa.ta...@fujitsu.com wrote:

> Is this intentional?

Yes, this was a deliberate change made to be able to fit more expansive
descriptions of columns etc.

>  It has become a bit unfriendly to read for me, a visually impaired user who 
> uses screen reader software.

That's less good.  The W3C Web Accessibility Initiative has guidance for
multi-level header tables which might be useful here.

https://www.w3.org/WAI/tutorials/tables/multi-level/

Maybe if we use the id and headers attributes we can give screen readers enough
clues to make sense of the information?

cheers ./daniel



Re: Asynchronous Append on postgres_fdw nodes.

2020-06-15 Thread Kyotaro Horiguchi
Thanks for testing, but..

At Mon, 15 Jun 2020 08:51:23 +0500, "Andrey V. Lepikhov" 
 wrote in 
> The patch has a problem with partitionwise aggregates.
> 
> Asynchronous append do not allow the planner to use partial
> aggregates. Example you can see in attachment. I can't understand why:
> costs of partitionwise join are less.
> Initial script and explains of the query with and without the patch
> you can see in attachment.

I had more or less the same plan with the second one without the patch
(that is, vanilla master/HEAD, but used merge joins instead).

I'm not sure what prevented join pushdown, but the difference between
the two is whether each partitionwise join is pushed down to
remote or not. That hardly seems related to the async execution
patch.

Could you tell me how did you get the first plan?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




[bug?] Is the format of tables in the documentation broken in PG 13?

2020-06-15 Thread tsunakawa.ta...@fujitsu.com
Hello,


The tables for pg_stat_ views in the following page, starting from Table 27.3, 
have only one column in PG 13.  They had 3 columns in PG 12 and earlier.

https://www.postgresql.org/docs/13/monitoring-stats.html

Is this intentional?  It has become a bit unfriendly to read for me, a visually 
impaired user who uses screen reader software.

The tables for functions in Chapter 9 are similar.


Regards
Takayuki Tsunakawa






Re: POC and rebased patch for CSN based snapshots

2020-06-15 Thread Fujii Masao




On 2020/06/12 18:41, movead...@highgo.ca wrote:
> Hello hackers,
>
> Currently, I do some changes based on the last version:
> 1. Catch up to the current commit (c2bd1fec32ab54).
> 2. Add regression and document.
> 3. Add support to switch from xid-based snapshot to csn-based snapshot,
> and the same with the standby side.


Andrey also seems to be proposing a similar patch [1] that introduces CSN
into core. Could you tell me what the difference between his patch and
yours is? If they are almost the same, we should focus on one together
rather than working separately.

Regards,

[1]
https://www.postgresql.org/message-id/9964cf46-9294-34b9-4858-971e9029f...@postgrespro.ru

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: doc examples for pghandler

2020-06-15 Thread Michael Paquier
On Sun, Jun 14, 2020 at 08:45:17PM -0700, Mark Wong wrote:
> Sounds good to me.  Something more like the attached patch?

That's the idea.  I have not gone into the details of what you have here,
but perhaps it would make sense to do a bit more and show how things
are done in the context of a PL function called in a trigger?  Your
patch removes from the docs a code block that outlined that.
--
Michael


signature.asc
Description: PGP signature


Re: [HACKERS] SERIALIZABLE on standby servers

2020-06-15 Thread Thomas Munro
On Mon, Jun 15, 2020 at 5:00 PM Michael Paquier  wrote:
> On Fri, Dec 28, 2018 at 02:21:44PM +1300, Thomas Munro wrote:
> > Just to be clear, although this patch is registered in the commitfest
> > and currently applies and tests pass, it is prototype/WIP code with
> > significant problems that remain to be resolved.  I sort of wish there
> > were a way to indicate that in the CF but since there isn't, I'm
> > saying that here.  What I hope to get from Kevin, Simon or other
> > reviewers is some feedback on the general approach and problems
> > discussed upthread (and other problems and better ideas I might have
> > missed).  So it's not seriously proposed for commit in this CF.
>
> No feedback has actually come, so I have moved it to next CF.

Having been nerd-sniped by SSI again, I spent some time this weekend
rebasing this old patch, making a few improvements, and reformulating
the problems to be solved as I see them.  It's very roughly based on
Kevin Grittner and Dan Ports' description of how you could give
SERIALIZABLE a useful meaning on hot standbys.  The short version of
the theory is that you can make it work like SERIALIZABLE READ ONLY
DEFERRABLE by adding a bit of extra information into the WAL stream.
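
As a reminder of the user-facing shape this aims for, the idea is that
a standby session could eventually run what already works on a primary
today (the table name is purely illustrative):

BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE READ ONLY DEFERRABLE;
-- blocks until a "safe" snapshot is available; after that, the
-- transaction can neither suffer nor cause a serialization anomaly
SELECT count(*) FROM accounts;
COMMIT;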

Problems:

1.  As a prerequisite, we'd need to teach primary servers to make
transactions visible in the same order that they log commits.
Otherwise, we permit nonsense like seeing TX1 but not TX2 on the
primary, and TX2 but not TX1 on the replica.  You can probably argue
that our read replicas don't satisfy the lower isolation levels, let
alone serializable.

2.  Similarly, it's probably not OK that
PreCommit_CheckForSerializationFailure() determines
MySerializableXact->snapshotSafetyAfterThisCommit.  That may not
happen in exactly the same order as commits are logged.  Or maybe
there is some argument for why that is OK, based on what we're doing
with prepareSeqNo, or maybe we can do something with that to detect
disorder.

3.  The patch doesn't yet attempt to checkpoint the snapshot safety
state.  That's needed to start up in a sane state, without having to
wait for WAL activity.

4.  XactLogSnapshotSafetyRecord() flushes the WAL an extra time after
a commit is flushed, which I put in for testing; that's silly...
somehow it needs to be better integrated so we don't generate two sync
I/Os in a row.

5.  You probably want a way to turn off the extra WAL records and
SERIALIZABLEXACT consumption if you're using SERIALIZABLE on a primary
but not on the standby.  Or maybe there is some way to make it come on
automatically.

I think I have cleared up the matter of xmin tracking for
"hypothetical" SERIALIZABLEXACTs mentioned earlier.  It's not needed,
so should be set to InvalidTransactionId, and I added a comment to
explain why.

I also wrote a TAP test to exercise this thing.  It is the same
schedule as src/test/isolation/specs/read-only-anomaly-3.spec, except
that transaction 3 runs on a streaming replica.

One thing to point out is that this patch only aims to make it so that
streaming replicas can't observe a state that would have caused a
transaction to abort if it had been observed on the primary.  The TAP
test still has to insert its own wait-for-LSN loop to make sure step
"s1c" is replayed before "s3r" runs.  We could use
synchronous_commit=remote_apply, and that'd probably work just as well
for this particular test, but I'm not sure how to square that with
fixing problem #1 above.

The perl hackery I used to do overlapping transactions in a TAP test
is pretty crufty.  I guess we'd ideally have the isolation tester
support per-session connection strings, and somehow get some perl code
to orchestrate the cluster setup but then run the real isolation
tester.  Or something like that.
From 30b25251964b494e75a8ad2d48cd49583f32ebc5 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Sun, 14 Jun 2020 21:46:02 +1200
Subject: [PATCH v5 1/3] SERIALIZABLE READ ONLY DEFERRABLE on streaming
 replicas.

Allow streaming replicas to wait for "safe" snapshots.  Do this by
injecting extra information into the WAL, so that replica servers can
determine when a safe snapshot can be taken.

WORK IN PROGRESS -- see thread for list of problems

Discussion: https://postgr.es/m/CAEepm%3D2b9TV%2BvJ4UeSBixDrW7VUiTjxPwWq8K3QwFSWx0pTXHQ%40mail.gmail.com
Discussion: https://postgr.es/m/4d3735e3022500039...@gw.wicourts.gov
---
 doc/src/sgml/high-availability.sgml   |   9 -
 doc/src/sgml/mvcc.sgml|  21 --
 doc/src/sgml/ref/set_transaction.sgml |  12 +
 src/backend/access/rmgrdesc/xactdesc.c|  56 
 src/backend/access/transam/xact.c |  69 -
 src/backend/access/transam/xlog.c |   7 +
 src/backend/commands/variable.c   |   8 -
 src/backend/storage/lmgr/predicate.c  | 337 --
 src/backend/storage/lmgr/proc.c   |   5 +
 src/include/access/xact.h |  15 +-
 src/include/access/xlogdefs.h |  12 +
 

Re: min_safe_lsn column in pg_replication_slots view

2020-06-15 Thread Kyotaro Horiguchi
At Mon, 15 Jun 2020 13:44:31 +0900, Michael Paquier  wrote 
in 
> On Mon, Jun 15, 2020 at 12:40:03PM +0900, Fujii Masao wrote:
> > BTW, I just wonder why each row in pg_replication_slots needs to have
> > min_safe_lsn column? Basically min_safe_lsn should be the same between
> > every replication slots.
> 
> Indeed, that's confusing in its current shape.  I would buy putting
> this value into pg_replication_slots if there were some consistency of
> the data to worry about because of locking issues, but here this data
> is controlled within info_lck, which is out of the the repslot LW
> lock.  So I think that it is incorrect to put this data in this view
> and that we should remove it, and that instead we had better push for
> a system view that maps with the contents of XLogCtl.

The column was once the difference between the safe LSN and
restart_lsn, which is distinct for each slot. Then it was changed to
the safe LSN itself. I agree with the discussion above, but the value
is needed somewhere, since without it no one can know the remaining
margin before a slot moves to the "lost" state.  (Note that currently
even wal_status and min_safe_lsn can be mutually inconsistent within a
single row.)

To get both cross-row and within-row consistency, we could simply
remember the result of XLogGetLastRemovedSegno() before taking
info_lck in the function.  That doesn't make a practical difference
but makes the view look consistent.
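
For reference, a minimal query showing the per-row exposure under
discussion (min_safe_lsn repeats the same cluster-wide value on every
row):

SELECT slot_name, restart_lsn, wal_status, min_safe_lsn
  FROM pg_replication_slots;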

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index d6fe205eb4..f57d67a900 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -9498,7 +9498,7 @@ CreateRestartPoint(int flags)
  * * WALAVAIL_INVALID_LSN means the slot hasn't been set to reserve WAL.
  */
 WALAvailability
-GetWALAvailability(XLogRecPtr targetLSN)
+GetWALAvailability(XLogRecPtr targetLSN, XLogSegNo last_removed_seg)
 {
 	XLogRecPtr	currpos;		/* current write LSN */
 	XLogSegNo	currSeg;		/* segid of currpos */
@@ -9524,7 +9524,7 @@ GetWALAvailability(XLogRecPtr targetLSN)
 	 * the first WAL segment file since startup, which causes the status being
 	 * wrong under certain abnormal conditions but that doesn't actually harm.
 	 */
-	oldestSeg = XLogGetLastRemovedSegno() + 1;
+	oldestSeg = last_removed_seg + 1;
 
 	/* calculate oldest segment by max_wal_size and wal_keep_segments */
 	XLByteToSeg(currpos, currSeg, wal_segment_size);
diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index 1b929a603e..bd2e3e84ed 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -243,6 +243,7 @@ pg_get_replication_slots(PG_FUNCTION_ARGS)
 	MemoryContext per_query_ctx;
 	MemoryContext oldcontext;
 	int			slotno;
+	XLogSegNo	last_removed_seg;
 
 	/* check to see if caller supports us returning a tuplestore */
 	if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
@@ -272,6 +273,14 @@ pg_get_replication_slots(PG_FUNCTION_ARGS)
 	rsinfo->setResult = tupstore;
 	rsinfo->setDesc = tupdesc;
 
+	/*
+	 * Remember the last removed segment at this point for consistency within
+	 * this view.  Since there's no interlock between the slot data and the
+	 * checkpointer, a segment can be removed in between, but that doesn't
+	 * make any practical difference.
+	 */
+	last_removed_seg = XLogGetLastRemovedSegno();
+
 	MemoryContextSwitchTo(oldcontext);
 
 	LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
@@ -282,7 +291,6 @@ pg_get_replication_slots(PG_FUNCTION_ARGS)
 		Datum		values[PG_GET_REPLICATION_SLOTS_COLS];
 		bool		nulls[PG_GET_REPLICATION_SLOTS_COLS];
 		WALAvailability walstate;
-		XLogSegNo	last_removed_seg;
 		int			i;
 
 		if (!slot->in_use)
@@ -342,7 +350,8 @@ pg_get_replication_slots(PG_FUNCTION_ARGS)
 		else
 			nulls[i++] = true;
 
-		walstate = GetWALAvailability(slot_contents.data.restart_lsn);
+		walstate = GetWALAvailability(slot_contents.data.restart_lsn,
+	  last_removed_seg);
 
 		switch (walstate)
 		{
@@ -368,7 +377,7 @@ pg_get_replication_slots(PG_FUNCTION_ARGS)
 
 		if (max_slot_wal_keep_size_mb >= 0 &&
 			(walstate == WALAVAIL_NORMAL || walstate == WALAVAIL_RESERVED) &&
-			((last_removed_seg = XLogGetLastRemovedSegno()) != 0))
+			(last_removed_seg != 0))
 		{
 			XLogRecPtr	min_safe_lsn;
 
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index e917dfe92d..c05a18148d 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -326,7 +326,8 @@ extern void ShutdownXLOG(int code, Datum arg);
 extern void InitXLOGAccess(void);
 extern void CreateCheckPoint(int flags);
 extern bool CreateRestartPoint(int flags);
-extern WALAvailability GetWALAvailability(XLogRecPtr restart_lsn);
+extern WALAvailability GetWALAvailability(XLogRecPtr targetLSN,
+		  XLogSegNo last_removed_seg);
 extern XLogRecPtr CalculateMaxmumSafeLSN(void);
 extern void XLogPutNextOid(Oid nextOid);
 extern XLogRecPtr XLogRestorePoint(const char *rpName);


Re: Review for GetWALAvailability()

2020-06-15 Thread Kyotaro Horiguchi
At Mon, 15 Jun 2020 13:42:25 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> Oops!  I don't want to believe I did that but it's definitely wrong.

Hmm, quite disappointing. The patch was just crap.
This is the right patch.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 700271fd40..199053dd4a 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -11240,18 +11240,25 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx

 
  normal means that the claimed files
-  are within max_wal_size.
+  are within max_wal_size. If
+  max_slot_wal_keep_size is smaller than
+  max_wal_size, this state will not appear.
 
 
  reserved means
   that max_wal_size is exceeded but the files are
   still held, either by some replication slot or
-  by wal_keep_segments.
+  by wal_keep_segments.
+  
 
 
- lost means that some WAL files are
-  definitely lost and this slot cannot be used to resume replication
-  anymore.
+  
+lost means that some required WAL files have
+			been removed and this slot can no longer be used to resume
+			replication once it has been disconnected while in this state.
+			Note that the state can move back to reserved or normal when all
+			WAL senders have moved past the just-removed segment before being terminated.
+  
 

The last two states are seen only when
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 55cac186dc..d6fe205eb4 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -9528,8 +9528,8 @@ GetWALAvailability(XLogRecPtr targetLSN)
 
 	/* calculate oldest segment by max_wal_size and wal_keep_segments */
 	XLByteToSeg(currpos, currSeg, wal_segment_size);
-	keepSegs = ConvertToXSegs(Max(max_wal_size_mb, wal_keep_segments),
-			  wal_segment_size) + 1;
+	keepSegs = Max(ConvertToXSegs(max_wal_size_mb, wal_segment_size),
+   wal_keep_segments) + 1;
 
 	if (currSeg > keepSegs)
 		oldestSegMaxWalSize = currSeg - keepSegs;
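
To make the fix concrete with illustrative numbers: with 16MB segments,
max_wal_size = 1GB (64 segments) and wal_keep_segments = 100, the old
code computed ConvertToXSegs(Max(1024, 100)) + 1 = 65, wrongly taking
the maximum of megabytes and a segment count; the fixed code computes
Max(64, 100) + 1 = 101 segments, so wal_keep_segments is honored.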


factorial of negative numbers

2020-06-15 Thread Peter Eisentraut
Adjacent to the discussion in [0] I wanted to document the factorial() 
function and expand the tests for that slightly with some edge cases.


I noticed that the current implementation returns 1 for the factorial of 
all negative numbers:


SELECT factorial(-4);
 factorial
-----------
         1

While there are some advanced mathematical constructions that define 
factorials for negative numbers, they certainly produce different 
answers than this.


Curiously, before the reimplementation of factorial using numeric 
(04a4821adef38155b7920ba9eb83c4c3c29156f8), it returned 0 for negative 
numbers, which is also not correct by any theory I could find.


I propose to change this to error out for negative numbers.
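
Under that proposal, the example above would instead fail with
something along these lines (exact message wording to be decided):

SELECT factorial(-4);
ERROR:  factorial of a negative number is undefined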

See attached patches for test and code changes.


[0]: 
https://www.postgresql.org/message-id/flat/38ca86db-42ab-9b48-2902-337a0d6b8311%402ndquadrant.com


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 700f6618fdbd72defcfcd597a00690ce30e7dba1 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Mon, 15 Jun 2020 08:51:46 +0200
Subject: [PATCH 1/3] doc: Document factorial function

This has existed for a very long time, equivalent to the ! and !!
operators, but it was never documented.
---
 doc/src/sgml/func.sgml | 17 +
 1 file changed, 17 insertions(+)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index b65aa28f34..5006c35fb7 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -1347,6 +1347,23 @@ Mathematical Functions

   
 
+  
+   
+
+ factorial
+
+factorial ( numeric )
+numeric
+   
+   
+Factorial
+   
+   
+factorial(5)
+120
+   
+  
+
   

 
-- 
2.27.0

From 4e8b05d209f61be222c7eb776edffa206b706299 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Mon, 15 Jun 2020 08:59:19 +0200
Subject: [PATCH 2/3] Expand tests for factorial

Move from int4 to numeric test.  (They were originally int4 functions,
but were reimplemented for numeric in
04a4821adef38155b7920ba9eb83c4c3c29156f8.)  Add some tests for edge
cases.
---
 src/test/regress/expected/int4.out| 12 -
 src/test/regress/expected/numeric.out | 35 +++
 src/test/regress/sql/int4.sql |  4 ---
 src/test/regress/sql/numeric.sql  | 10 
 4 files changed, 45 insertions(+), 16 deletions(-)

diff --git a/src/test/regress/expected/int4.out 
b/src/test/regress/expected/int4.out
index c384af18ee..77f43739a7 100644
--- a/src/test/regress/expected/int4.out
+++ b/src/test/regress/expected/int4.out
@@ -299,18 +299,6 @@ SELECT int4 '1000' < int4 '999' AS false;
  f
 (1 row)
 
-SELECT 4! AS twenty_four;
- twenty_four 
--------------
-          24
-(1 row)
-
-SELECT !!3 AS six;
- six 
------
-   6
-(1 row)
-
 SELECT 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 AS ten;
  ten 
 -
diff --git a/src/test/regress/expected/numeric.out 
b/src/test/regress/expected/numeric.out
index c7fe63d037..05e9a2b9e9 100644
--- a/src/test/regress/expected/numeric.out
+++ b/src/test/regress/expected/numeric.out
@@ -2315,3 +2315,38 @@ FROM (VALUES (0::numeric, 0::numeric),
 
 SELECT lcm(9999 * (10::numeric)^131068 + (10::numeric^131068 - 1), 2); -- overflow
 ERROR:  value overflows numeric format
+--
+-- Tests for factorial
+--
+SELECT 4!;
+ ?column? 
+----------
+       24
+(1 row)
+
+SELECT !!3;
+ ?column? 
+----------
+        6
+(1 row)
+
+SELECT factorial(15);
+   factorial   
+---------------
+ 1307674368000
+(1 row)
+
+SELECT 100000!;
+ERROR:  value overflows numeric format
+SELECT 0!;
+ ?column? 
+----------
+        1
+(1 row)
+
+SELECT factorial(-4);
+ factorial 
+-----------
+         1
+(1 row)
+
diff --git a/src/test/regress/sql/int4.sql b/src/test/regress/sql/int4.sql
index a9e90a96c4..b00c9dea2a 100644
--- a/src/test/regress/sql/int4.sql
+++ b/src/test/regress/sql/int4.sql
@@ -114,10 +114,6 @@ CREATE TABLE INT4_TBL(f1 int4);
 
 SELECT int4 '1000' < int4 '999' AS false;
 
-SELECT 4! AS twenty_four;
-
-SELECT !!3 AS six;
-
 SELECT 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 AS ten;
 
 SELECT 2 + 2 / 2 AS three;
diff --git a/src/test/regress/sql/numeric.sql b/src/test/regress/sql/numeric.sql
index 41475a9a24..483c81b9cd 100644
--- a/src/test/regress/sql/numeric.sql
+++ b/src/test/regress/sql/numeric.sql
@@ -,3 +,13 @@ CREATE TABLE num_input_test (n1 numeric);
  (4232.820::numeric, 132.72000::numeric)) AS v(a, b);
 
 SELECT lcm(9999 * (10::numeric)^131068 + (10::numeric^131068 - 1), 2); -- overflow
+
+--
+-- Tests for factorial
+--
+SELECT 4!;
+SELECT !!3;
+SELECT factorial(15);
+SELECT 100000!;
+SELECT 0!;
+SELECT factorial(-4);
-- 
2.27.0

From 9ae0d1135d0a75b4560f2a4c22be8b8bbfde0803 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Mon, 15 Jun 2020 09:02:14 +0200
Subject: [PATCH 3/3] Disallow factorial of negative numbers

The 

Re: Transactions involving multiple postgres foreign servers, take 2

2020-06-15 Thread Tatsuo Ishii
>> Another approach to the atomic visibility problem is to control
>> snapshot acquisition timing and commit timing (plus using REPEATABLE
>> READ). In the REPEATABLE READ transaction isolation level, PostgreSQL
>> assigns a snapshot at the time when the first command is executed in a
>> transaction. If we could prevent any commit while any transaction is
>> acquiring snapshot, and we could prevent any snapshot acquisition while
>> committing, visibility inconsistency which Amit explained can be
>> avoided.
>>
> 
> I think the problem mentioned above can occur with this as well or if
> I am missing something then can you explain in further detail how it
> won't create problem in the scenario I have used above?

So, is the problem you mentioned above like this? (S1/S2 denote
transactions (sessions); N1/N2 are the PostgreSQL servers.)  Since S1
has already committed on N1, S2 sees the row on N1.  However, S2 does
not see the row on N2, since S1 has not committed on N2 yet.

S1/N1: DROP TABLE t1;
DROP TABLE
S1/N1: CREATE TABLE t1(i int);
CREATE TABLE
S1/N2: DROP TABLE t1;
DROP TABLE
S1/N2: CREATE TABLE t1(i int);
CREATE TABLE
S1/N1: BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
BEGIN
S1/N2: BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
BEGIN
S2/N1: BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
BEGIN
S1/N1: INSERT INTO t1 VALUES (1);
INSERT 0 1
S1/N2: INSERT INTO t1 VALUES (1);
INSERT 0 1
S1/N1: PREPARE TRANSACTION 's1n1';
PREPARE TRANSACTION
S1/N2: PREPARE TRANSACTION 's1n2';
PREPARE TRANSACTION
S2/N1: PREPARE TRANSACTION 's2n1';
PREPARE TRANSACTION
S1/N1: COMMIT PREPARED 's1n1';
COMMIT PREPARED
S2/N1: SELECT * FROM t1; -- see the row
 i 
---
 1
(1 row)

S2/N2: SELECT * FROM t1; -- doesn't see the row
 i 
---
(0 rows)

S1/N2: COMMIT PREPARED 's1n2';
COMMIT PREPARED
S2/N1: COMMIT PREPARED 's2n1';
COMMIT PREPARED

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp




Re: Read access for pg_monitor to pg_replication_origin_status view

2020-06-15 Thread Michael Paquier
On Mon, Jun 15, 2020 at 12:44:02AM -0400, Tom Lane wrote:
> FWIW, I'd have included a catversion bump in this, to enforce that
> the modified backend functions are used with matching pg_proc entries.
> It's not terribly important at this phase of the devel cycle, but still
> somebody might wonder why the regression tests are failing for them
> (if they tried to skip an initdb).

Thanks, I am fixing that now.

(It might be worth printing a T-shirt for that at some point and giving
it to the people scoring the most in this area...)
--
Michael


signature.asc
Description: PGP signature


Re: TAP tests and symlinks on Windows

2020-06-15 Thread Michael Paquier
On Sat, Jun 13, 2020 at 03:00:54PM +0900, Michael Paquier wrote:
> My take would be to actually enforce that as a requirement for 14~ if
> that works reliably, and of course not backpatch that change as that's
> clearly an improvement and not a bug fix.  It would be good to check
> the status of each buildfarm member first though.  And I would need to
> also check my own stuff to begin with..

So, I have been looking at that.  And indeed as Peter said we are
visibly missing one call to perl2host in 010_pg_basebackup.pl.

Another thing I spotted is that Win32::Symlink does not allow -l to
properly detect whether a path is a symlink, causing one of the
pg_basebackup tests to fail when checking if a tablespace path has
been updated.  It would be good to get more people to test this patch
with different environments than mine.  I am also adding Andrew
Dunstan in CC as the owner of the buildfarm animals running currently
TAP tests for confirmation about the presence of Win32::Symlink
there as I am afraid it would cause failures: drongo, fairywen,
jacana and bowerbird.
--
Michael
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index 208df557b8..17643489d1 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -17,7 +17,7 @@ my $tempdir = TestLib::tempdir;
 my $node = get_new_node('main');
 
 # Set umask so test directories and files are created with default permissions
-umask(0077);
+umask(0077) if (!$windows_os);
 
 # Initialize node without replication settings
 $node->init(extra => ['--data-checksums']);
@@ -211,154 +211,168 @@ $node->command_fails(
 	'pg_basebackup tar with long name fails');
 unlink "$pgdata/$superlongname";
 
-# The following tests test symlinks. Windows doesn't have symlinks, so
-# skip on Windows.
-SKIP:
+# Move pg_replslot out of $pgdata and create a symlink to it.
+$node->stop;
+
+# Set umask so test directories and files are created with group permissions
+if (!$windows_os)
 {
-	skip "symlinks not supported on Windows", 18 if ($windows_os);
-
-	# Move pg_replslot out of $pgdata and create a symlink to it.
-	$node->stop;
-
-	# Set umask so test directories and files are created with group permissions
-	umask(0027);
+	umask(0027);
 
 	# Enable group permissions on PGDATA
 	chmod_recursive("$pgdata", 0750, 0640);
+}
 
-	rename("$pgdata/pg_replslot", "$tempdir/pg_replslot")
-	  or BAIL_OUT "could not move $pgdata/pg_replslot";
-	symlink("$tempdir/pg_replslot", "$pgdata/pg_replslot")
-	  or BAIL_OUT "could not symlink to $pgdata/pg_replslot";
+rename("$pgdata/pg_replslot", "$tempdir/pg_replslot")
+  or BAIL_OUT "could not move $pgdata/pg_replslot";
+symlink("$tempdir/pg_replslot", "$pgdata/pg_replslot")
+  or BAIL_OUT "could not symlink to $pgdata/pg_replslot";
 
-	$node->start;
+$node->start;
 
-	# Create a temporary directory in the system location and symlink it
-	# to our physical temp location.  That way we can use shorter names
-	# for the tablespace directories, which hopefully won't run afoul of
-	# the 99 character length limit.
-	my $shorter_tempdir = TestLib::tempdir_short . "/tempdir";
-	symlink "$tempdir", $shorter_tempdir;
+# Create a temporary directory in the system location and symlink it
+# to our physical temp location.  That way we can use shorter names
+# for the tablespace directories, which hopefully won't run afoul of
+# the 99 character length limit.
+my $shorter_tempdir = TestLib::tempdir_short . "/tempdir";
+my $realTsDir   = TestLib::perl2host("$shorter_tempdir/tblspc1");
+symlink "$tempdir", $shorter_tempdir;
 
-	mkdir "$tempdir/tblspc1";
-	$node->safe_psql('postgres',
-		"CREATE TABLESPACE tblspc1 LOCATION '$shorter_tempdir/tblspc1';");
-	$node->safe_psql('postgres',
-		"CREATE TABLE test1 (a int) TABLESPACE tblspc1;");
-	$node->command_ok([ 'pg_basebackup', '-D', "$tempdir/tarbackup2", '-Ft' ],
-		'tar format with tablespaces');
-	ok(-f "$tempdir/tarbackup2/base.tar", 'backup tar was created');
-	my @tblspc_tars = glob "$tempdir/tarbackup2/[0-9]*.tar";
-	is(scalar(@tblspc_tars), 1, 'one tablespace tar was created');
-	rmtree("$tempdir/tarbackup2");
+mkdir "$tempdir/tblspc1";
+$node->safe_psql('postgres',
+	"CREATE TABLESPACE tblspc1 LOCATION '$realTsDir';");
+$node->safe_psql('postgres',
+	"CREATE TABLE test1 (a int) TABLESPACE tblspc1;");
+$node->command_ok([ 'pg_basebackup', '-D', "$tempdir/tarbackup2", '-Ft' ],
+	'tar format with tablespaces');
+ok(-f "$tempdir/tarbackup2/base.tar", 'backup tar was created');
+my @tblspc_tars = glob "$tempdir/tarbackup2/[0-9]*.tar";
+is(scalar(@tblspc_tars), 1, 'one tablespace tar was created');
+rmtree("$tempdir/tarbackup2");
 
-	# Create an unlogged table to test that forks other than init are not copied.
-	$node->safe_psql('postgres',
-		'CREATE UNLOGGED TABLE tblspc1_unlogged (id int) TABLESPACE tblspc1;'
-	);
+# Create an unlogged table to test that forks other than init are not 

Re: Transactions involving multiple postgres foreign servers, take 2

2020-06-15 Thread Amit Kapila
On Sun, Jun 14, 2020 at 2:21 PM Tatsuo Ishii  wrote:
>
> >> Won't it create an inconsistency in viewing the data from the
> >> different servers?  Say, such a transaction inserts one row into a
> >> local server and another into the foreign server.  Now, if we follow
> >> the above protocol, the user will be able to see the row from the
> >> local server but not from the foreign server.
> >
> > Yes, you're right. This atomic commit feature doesn't guarantee such
> > consistent visibility so-called atomic visibility.

Okay, I understand that the purpose of this feature is to provide
atomic commit, which means the transactions on all servers involved
will either commit or roll back.  However, I think we should at least
see at a high level how visibility will work, because it might
influence the implementation of this feature.

> > Even the local
> > server is not modified, since a resolver process commits prepared
> > foreign transactions one by one another user could see an inconsistent
> > result. Providing globally consistent snapshots to transactions
> > involving foreign servers is one of the solutions.

How would it be able to do that?  Say, when it decides to take a
snapshot, the transaction appears to be committed on the foreign
server but not yet on the local server, so the consistent data
visibility problem mentioned above could still arise.

>
> Another approach to the atomic visibility problem is to control
> snapshot acquisition timing and commit timing (plus using REPEATABLE
> READ). In the REPEATABLE READ transaction isolation level, PostgreSQL
> assigns a snapshot at the time when the first command is executed in a
> transaction. If we could prevent any commit while any transaction is
> acquiring snapshot, and we could prevent any snapshot acquisition while
> committing, visibility inconsistency which Amit explained can be
> avoided.
>

I think the problem mentioned above can occur with this as well or if
I am missing something then can you explain in further detail how it
won't create problem in the scenario I have used above?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: hashagg slowdown due to spill changes

2020-06-15 Thread Jeff Davis
On Sun, 2020-06-14 at 11:14 -0700, Andres Freund wrote:
> I'm somewhat inclined to think that we should revert 4cad2534da6 and
> then look at how precisely to tackle this in 14.

I'm fine with that.

> It'd probably make sense to request small tlists when the number of
> estimated groups is large, and not otherwise.

That seems like a nice compromise that would be non-invasive, at least
for create_agg_plan().

Regards,
Jeff Davis