Re: fixing old_snapshot_threshold's time->xid mapping

2020-04-21 Thread Dilip Kumar
On Tue, Apr 21, 2020 at 4:52 PM Dilip Kumar  wrote:
>
> On Tue, Apr 21, 2020 at 3:44 PM Thomas Munro  wrote:
> >
> > On Tue, Apr 21, 2020 at 2:05 PM Thomas Munro  wrote:
> > > As before, these two apply on top of Robert's patches (or at least his
> > > 0001 and 0002).
> >
> > While trying to figure out if Robert's 0003 patch was correct, I added
> > yet another patch to this stack to test it.  0006 does basic xid map
> > maintenance that exercises the cases 0003 fixes, and I think it
> > demonstrates that they now work correctly.
>
> +1.  I think we should also add a way to test the case where we
> advance the timestamp by multiple slots.  I see that you have such a
> case, e.g.:
> +# test adding minutes while the map is not full
> +set_time('3000-01-01 02:01:00Z');
> +is(summarize_mapping(), "2|02:00:00|02:01:00");
> +set_time('3000-01-01 02:05:00Z');
> +is(summarize_mapping(), "6|02:00:00|02:05:00");
> +set_time('3000-01-01 02:19:00Z');
> +is(summarize_mapping(), "20|02:00:00|02:19:00");
>
> But I think we should try to extend it to test that we have put the
> new xid only in those slots where we are supposed to, and not in
> other slots.

I feel that we should probably fix this check as well, because if ts >
update_ts then it will go to the else branch and finally end up in the
last slot anyway, so I think we can use this case as a fast exit too.

diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 93a0c04..644d9b1 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -1831,7 +1831,7 @@ TransactionIdLimitedForOldSnapshots(TransactionId recentXmin,
 
 	if (!same_ts_as_threshold)
 	{
-		if (ts == update_ts)
+		if (ts >= update_ts)
 		{
 			xlimit = latest_xmin;
 			if (NormalTransactionIdFollows(xlimit, recentXmin))

This patch can be applied on top of other v5 patches.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


v5-0007-Fix-check-while-computing-transaction-xid-limit.patch
Description: Binary data


Re: HEAPDEBUGALL is broken

2020-04-21 Thread Alexander Lakhin
21.04.2020 21:01, Peter Eisentraut wrote:
> On 2020-04-19 22:00, Alexander Lakhin wrote:
>> To the point, I've tried to use HAVE_ALLOCINFO on master today and it
>> failed too:
>
> Do you have a proposed patch?
>
As this is broken at least since the invention of the generational
allocator (2017-11-23, a4ccc1ce), I believe that no one uses this (and
slab is broken too).  Nonetheless, HAVE_ALLOCINFO in aset.c is still
working, so it could be left alone, though the output is too chatty for
general use (`make check` produces a postmaster log of 3.8GB).  I think
someone would still need to insert some extra conditions to use it, or
find another way to debug memory allocations.

So I would just remove this debug macro.  The proposed patch is attached.

Best regards,
Alexander
diff --git a/src/backend/utils/mmgr/aset.c b/src/backend/utils/mmgr/aset.c
index c0623f106d2..dd8d7a33a4f 100644
--- a/src/backend/utils/mmgr/aset.c
+++ b/src/backend/utils/mmgr/aset.c
@@ -50,9 +50,6 @@
 #include "utils/memdebug.h"
 #include "utils/memutils.h"
 
-/* Define this to detail debug alloc information */
-/* #define HAVE_ALLOCINFO */
-
 /*
  * Chunk freelist k holds chunks of size 1 << (k + ALLOC_MINBITS),
  * for k = 0 .. ALLOCSET_NUM_FREELISTS-1.
@@ -298,21 +295,6 @@ static const MemoryContextMethods AllocSetMethods = {
 #endif
 };
 
-/* --
- * Debug macros
- * --
- */
-#ifdef HAVE_ALLOCINFO
-#define AllocFreeInfo(_cxt, _chunk) \
-			fprintf(stderr, "AllocFree: %s: %p, %zu\n", \
-(_cxt)->header.name, (_chunk), (_chunk)->size)
-#define AllocAllocInfo(_cxt, _chunk) \
-			fprintf(stderr, "AllocAlloc: %s: %p, %zu\n", \
-(_cxt)->header.name, (_chunk), (_chunk)->size)
-#else
-#define AllocFreeInfo(_cxt, _chunk)
-#define AllocAllocInfo(_cxt, _chunk)
-#endif
 
 /* --
  * AllocSetFreeIndex -
@@ -796,8 +778,6 @@ AllocSetAlloc(MemoryContext context, Size size)
 			set->blocks = block;
 		}
 
-		AllocAllocInfo(set, chunk);
-
 		/* Ensure any padding bytes are marked NOACCESS. */
 		VALGRIND_MAKE_MEM_NOACCESS((char *) AllocChunkGetPointer(chunk) + size,
    chunk_size - size);
@@ -835,8 +815,6 @@ AllocSetAlloc(MemoryContext context, Size size)
 		randomize_mem((char *) AllocChunkGetPointer(chunk), size);
 #endif
 
-		AllocAllocInfo(set, chunk);
-
 		/* Ensure any padding bytes are marked NOACCESS. */
 		VALGRIND_MAKE_MEM_NOACCESS((char *) AllocChunkGetPointer(chunk) + size,
    chunk->size - size);
@@ -996,8 +974,6 @@ AllocSetAlloc(MemoryContext context, Size size)
 	randomize_mem((char *) AllocChunkGetPointer(chunk), size);
 #endif
 
-	AllocAllocInfo(set, chunk);
-
 	/* Ensure any padding bytes are marked NOACCESS. */
 	VALGRIND_MAKE_MEM_NOACCESS((char *) AllocChunkGetPointer(chunk) + size,
 			   chunk_size - size);
@@ -1021,8 +997,6 @@ AllocSetFree(MemoryContext context, void *pointer)
 	/* Allow access to private part of chunk header. */
 	VALGRIND_MAKE_MEM_DEFINED(chunk, ALLOCCHUNK_PRIVATE_LEN);
 
-	AllocFreeInfo(set, chunk);
-
 #ifdef MEMORY_CONTEXT_CHECKING
 	/* Test for someone scribbling on unused space in chunk */
 	if (chunk->requested_size < chunk->size)
diff --git a/src/backend/utils/mmgr/generation.c b/src/backend/utils/mmgr/generation.c
index 56651d06931..af52616e575 100644
--- a/src/backend/utils/mmgr/generation.c
+++ b/src/backend/utils/mmgr/generation.c
@@ -178,22 +178,6 @@ static const MemoryContextMethods GenerationMethods = {
 #endif
 };
 
-/* --
- * Debug macros
- * --
- */
-#ifdef HAVE_ALLOCINFO
-#define GenerationFreeInfo(_cxt, _chunk) \
-			fprintf(stderr, "GenerationFree: %s: %p, %lu\n", \
-(_cxt)->name, (_chunk), (_chunk)->size)
-#define GenerationAllocInfo(_cxt, _chunk) \
-			fprintf(stderr, "GenerationAlloc: %s: %p, %lu\n", \
-(_cxt)->name, (_chunk), (_chunk)->size)
-#else
-#define GenerationFreeInfo(_cxt, _chunk)
-#define GenerationAllocInfo(_cxt, _chunk)
-#endif
-
 
 /*
  * Public routines
@@ -383,8 +367,6 @@ GenerationAlloc(MemoryContext context, Size size)
 		/* add the block to the list of allocated blocks */
 		dlist_push_head(&set->blocks, &block->node);
 
-		GenerationAllocInfo(set, chunk);
-
 		/* Ensure any padding bytes are marked NOACCESS. */
 		VALGRIND_MAKE_MEM_NOACCESS((char *) GenerationChunkGetPointer(chunk) + size,
    chunk_size - size);
@@ -460,8 +442,6 @@ GenerationAlloc(MemoryContext context, Size size)
 	randomize_mem((char *) GenerationChunkGetPointer(chunk), size);
 #endif
 
-	GenerationAllocInfo(set, chunk);
-
 	/* Ensure any padding bytes are marked NOACCESS. */
 	VALGRIND_MAKE_MEM_NOACCESS((char *) GenerationChunkGetPointer(chunk) + size,
 			   chunk_size - size);
diff --git a/src/backend/utils/mmgr/slab.c b/src/backend/utils/mmgr/slab.c
index c928476c479..e9e962d7674 100644
--- a/src/backend/utils/mmgr/slab.c
+++ b/src/backend/utils/mmgr/slab.c
@@ -157,22 +157,6 @@ static const MemoryContextMethods SlabMethods = {
 #endif
 };
 
-/* --
- * Debug macros
- * --
- */
-#ifdef 

Re: WAL usage calculation patch

2020-04-21 Thread Justin Pryzby
On Wed, Apr 22, 2020 at 09:15:08AM +0530, Amit Kapila wrote:
> > > > And add the acronym to the docs:
> > > >
> > > > $ git grep 'full page' '*/explain.sgml'
> > > > doc/src/sgml/ref/explain.sgml:  number of records, number of full page writes and amount of WAL bytes
> > > >
> > > > "..full page writes (FPW).."
> > >
> > > Indeed!  Fixed (using lowercase to match current output).
> >
> > I searched through the documentation and AFAICS most occurrences of
> > "full page" are followed by "image", and full_page_writes is used only
> > as the parameter name.
> >
> > I'm fine with fpw as the acronym, but "fpw means the number of full
> > page images" looks odd..
> >
> 
> I don't understand this.  Where are we using such a description of fpw?

I suggested adding " (FPW)" to the new docs for "explain(wal)".
But the documentation before this commit mostly refers to "full page images",
so the implication is that maybe we should use that language (and the FPI acronym).

The only pre-existing use of "full page writes" seems to be here:
$ git grep -iC2 'full page write' origin doc
origin:doc/src/sgml/wal.sgml-  Internal data structures such as pg_xact, pg_subtrans, pg_multixact,
origin:doc/src/sgml/wal.sgml-  pg_serial, pg_notify, pg_stat, pg_snapshots are not directly
origin:doc/src/sgml/wal.sgml:  checksummed, nor are pages protected by full page writes. However, where

And we're not using either acronym.

-- 
Justin




Re: WAL usage calculation patch

2020-04-21 Thread Amit Kapila
On Mon, Apr 20, 2020 at 1:17 PM Kyotaro Horiguchi
 wrote:
>
> At Sun, 19 Apr 2020 16:22:26 +0200, Julien Rouhaud  wrote 
> in
> > Hi Justin,
> >
> > Thanks for the review!
> >
> > On Sat, Apr 18, 2020 at 10:41 PM Justin Pryzby  wrote:
> > >
> > > Should capitalize at least the non-text one ?  And maybe the text one for
> > > consistency ?
> > >
> > > +   ExplainPropertyInteger("WAL fpw", NULL,
> >
> > I think we should keep both version consistent, whether lower or upper
> > case.  The uppercase version is probably more correct, but it's a
> > little bit weird to have it being the only upper case label in all
> > output, so I kept it lower case.

I think we can keep upper case for all non-text ones in the case of WAL
usage, something like WAL Records, WAL FPW, WAL Bytes.  The buffer
usage output seems to follow a similar convention.
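
For illustration, the non-text output would then use labels along these
lines (a sketch only; the "usage" struct and its field names here are
assumptions, not the committed code):

    /* Sketch: upper-case labels for non-text EXPLAIN output of WAL usage */
    ExplainPropertyInteger("WAL Records", NULL, usage->wal_records, es);
    ExplainPropertyInteger("WAL FPW", NULL, usage->wal_fpw, es);
    ExplainPropertyInteger("WAL Bytes", NULL, usage->wal_bytes, es);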

>
> One space followed by an acronym looks perfect.  I'd prefer capital
> letters but small-letters also works well.
>
> > > And add the acronym to the docs:
> > >
> > > $ git grep 'full page' '*/explain.sgml'
> > > doc/src/sgml/ref/explain.sgml:  number of records, number of full page writes and amount of WAL bytes
> > >
> > > "..full page writes (FPW).."
> >
> > Indeed!  Fixed (using lowercase to match current output).
>
> I searched through the documentation and AFAICS most occurrences of
> "full page" are followed by "image", and full_page_writes is used only
> as the parameter name.
>
> I'm fine with fpw as the acronym, but "fpw means the number of full
> page images" looks odd..
>

I don't understand this.  Where are we using such a description of fpw?


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: WIP: Aggregation push-down

2020-04-21 Thread Andy Fan
>
> > 1) v14-0001-Introduce-RelInfoList-structure.patch
> > -
> >
> > - I'm not entirely sure why we need this change. We had the list+hash
> > before, so I assume we do this because we need the output functions?
>
> I believe that this is what Tom proposed in [1], see "Maybe an appropriate
> preliminary patch is to refactor the support code ..." near the end of that
> message. The point is that now we need two lists: one for "plain" relations
> and one for grouped ones.
>
>
I guess what Tom suggested[1] is to store the grouped ones in
root->upper_rels rather than in a separate member; see fetch_upper_rel
or UpperRelationKind.  If we do need the list+hash method for long-list
lookup, we can merge it into upper_rels.  If we want this benefit
somewhere other than root->upper_rels, we can store a generic type in
RelInfoList (see the sketch below); currently it looks bound to
RelAggInfo besides RelOptInfo.  But overall, personally I think we can
ignore such optimization at the first stage, to save the attention of
the core reviewers, since they are too precious :)  Just FYI.

[1] https://www.postgresql.org/message-id/9726.1542577...@sss.pgh.pa.us
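
For context, the list+hash structure under discussion is roughly a
generalization of the planner's existing join_rel_list/join_rel_hash
pair; a sketch under that assumption (not the exact patch contents):

    /* Sketch: a list of planner rel infos plus an optional hash index,
     * built only once the list grows long enough that linear scans get
     * expensive (mirroring how join_rel_hash works today). */
    typedef struct RelInfoList
    {
        List       *items;      /* RelOptInfo or RelAggInfo entries */
        struct HTAB *hash;      /* optional index into the list */
    } RelInfoList;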

Best Regards
Andy Fan


Re: Parallel Append can break run-time partition pruning

2020-04-21 Thread David Rowley
On Tue, 21 Apr 2020 at 15:03, David Rowley  wrote:
> I wonder if the fix should be more something along the lines of trying
> to merge things so we only generate a single partial path.  That way
> we wouldn't be at the mercy of the logic in add_partial_path() to
> accept or reject the path based on the order the paths are added.

I took a shot at doing things this way.

First, I'll recap on the problem this is trying to solve:

add_paths_to_append_rel() attempts to create two separate partial
Append paths. I'll describe both of these below:

Path1: This path is generated regardless of whether Parallel Append is
enabled and contains all the cheapest partial paths from each child
relation.  If parallel append is enabled this will become a Parallel
Append.  If it's not, then a non-parallel Append will be created
containing the list of partial subpaths.  Here's an example from
select_parallel.out:

  QUERY PLAN
--
 Finalize Aggregate
   ->  Gather
 Workers Planned: 1
 ->  Partial Aggregate
   ->  Append
 ->  Parallel Seq Scan on a_star a_star_1
 ->  Parallel Seq Scan on b_star a_star_2
 ->  Parallel Seq Scan on c_star a_star_3
 ->  Parallel Seq Scan on d_star a_star_4
 ->  Parallel Seq Scan on e_star a_star_5
 ->  Parallel Seq Scan on f_star a_star_6

Path2: We only ever consider this one when enable_parallel_append ==
true and the append rel's consider_parallel == true.  When this path
is generated, it'll always be for a Parallel Append.  This path may
contain a mix of partial paths for child rels and parallel-safe child
paths, the latter of which will only be visited by a single worker.

The problem is that path1 does not pull up child Appends when the child
append path contains a mix of partial and parallel-safe paths (i.e. a
sub-path2, per above).  Since we create path2 in addition to path1, the
costs come out the same even though path1 couldn't pull up the
sub-Append paths.  Unfortunately, since the costs are the same, path1
is preferred because it's added first; add_partial_path() just rejects
path2 based on it being too similar in cost to the existing path1.

In the attached, I'm trying to solve this by creating only one partial
Append path in the first place.  This path will always try to use the
cheapest partial path, or the cheapest parallel-safe path, if parallel
append is allowed and that path is cheaper than the cheapest partial
path.
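
In pseudocode, the per-child choice described above looks roughly like
this (variable names are mine, not necessarily the patch's):

    /* Sketch: pick the subpath for the single partial Append path */
    Path   *subpath = cheapest_partial_path;    /* may be NULL */

    if (parallel_append_allowed && cheapest_parallel_safe_path != NULL &&
        (subpath == NULL ||
         cheapest_parallel_safe_path->total_cost < subpath->total_cost))
        subpath = cheapest_parallel_safe_path;  /* visited by one worker only */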

I believe the attached gives us what we want.  Additionally, since it
should now always pull up the sub-Appends, there's no need to consider
adjusting partitioned_rels based on whether the pull-up occurred or
not; those should now be right in all cases.  This should also fix the
run-time pruning issue, since in my original test case it'll pull up
the sub-Append, which means that the code added in 8edd0e794 is no
longer going to do anything with it, as the top-level Append will never
contain just 1 subpath.

I'm reasonably certain that this is correct, but I did find it a bit
mind-bending considering all the possible cases, so it could do with
some more eyes on it.   I've not really done a final polish of the
comments yet. I'll do that if the patch is looking promising.

The output of the original test with the attached is as follows:

postgres=# explain (analyze on, costs off, timing off, summary off)
postgres-# select * from list where a = (select 1) and b > 0;
QUERY PLAN
---
 Gather (actual rows=0 loops=1)
   Workers Planned: 2
   Params Evaluated: $0
   Workers Launched: 2
   InitPlan 1 (returns $0)
 ->  Result (actual rows=1 loops=1)
   ->  Parallel Append (actual rows=0 loops=3)
 ->  Parallel Seq Scan on list_12_2 list_2 (never executed)
   Filter: ((b > 0) AND (a = $0))
 ->  Parallel Seq Scan on list_12_1 list_1 (actual rows=0 loops=1)
   Filter: ((b > 0) AND (a = $0))
(11 rows)


postgres=# -- force the 2nd subnode of the Append to be non-parallel.
postgres=# alter table list_12_1 set (parallel_workers=0);
ALTER TABLE
postgres=# explain (analyze on, costs off, timing off, summary off)
postgres-# select * from list where a = (select 1) and b > 0;
 QUERY PLAN

 Gather (actual rows=0 loops=1)
   Workers Planned: 2
   Params Evaluated: $0
   Workers Launched: 2
   InitPlan 1 (returns $0)
 ->  Result (actual rows=1 loops=1)
   ->  Parallel Append (actual rows=0 loops=3)
 ->  Seq Scan on list_12_1 list_1 (actual rows=0 loops=1)
   Filter: ((b > 0) AND (a = $0))
 ->  Parallel Seq Scan on list_12_2 list_2 (never executed)
   Filter: ((b > 0) AND (a = $0))
(11 rows)

Notice we get "(never 

Re: Remove non-fast promotion Re: Should we remove a fallback promotion? take 2

2020-04-21 Thread Kyotaro Horiguchi
At Wed, 22 Apr 2020 11:51:42 +0900, Fujii Masao  
wrote in 
> > I meant that we always have EOR at the end of recovery.  So in the
> > missing-latest-checkpoint (and crash recovery) case, we insert EOR
> > after the immediate checkpoint.  That also means we no longer set
> > CHECKPOINT_END_OF_RECOVERY on that checkpoint.
> 
> Could you tell me what the benefit of this change is? Even with this
> change,
> the server still needs to wait for the checkpoint to complete before
> becoming the master and starting the service, unlike fast
> promotion. No?

There's no performance benefit.  It's just for simplicity, by
signalling end-of-recovery in a unified way.

Long ago, we had only non-fast promotion, which is marked by a
CHECKPOINT_END_OF_RECOVERY checkpoint.  When we introduced fast
promotion, it was marked by the END_OF_RECOVERY record instead, since a
checkpoint record is not inserted at promotion time.  However, we
internally fall back to non-fast promotion when we need to make a
checkpoint immediately.  If we remove the non-fast path, we don't need
two means of signalling end-of-recovery.
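
Schematically, the two signalling mechanisms being unified are
(simplified from StartupXLOG(); not the exact code):

    /* Sketch: today's two ways of marking the end of recovery */
    if (fast_promote)
        CreateEndOfRecoveryRecord();        /* lightweight XLOG_END_OF_RECOVERY */
    else
        RequestCheckpoint(CHECKPOINT_END_OF_RECOVERY |
                          CHECKPOINT_IMMEDIATE |
                          CHECKPOINT_WAIT); /* full end-of-recovery checkpoint */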

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Remove non-fast promotion Re: Should we remove a fallback promotion? take 2

2020-04-21 Thread Fujii Masao




On 2020/04/22 9:13, Kyotaro Horiguchi wrote:

At Tue, 21 Apr 2020 22:08:56 +0900, Fujii Masao  
wrote in



On 2020/04/21 17:15, Kyotaro Horiguchi wrote:

At Mon, 20 Apr 2020 15:26:16 +0900, Fujii Masao
 wrote in

Patch attached. I will add this into the first CF for v14.

-	if (!fast_promoted)
+	if (!promoted)
		RequestCheckpoint(CHECKPOINT_END_OF_RECOVERY |
						  CHECKPOINT_IMMEDIATE |
						  CHECKPOINT_WAIT);

If we don't find the checkpoint record just before, we don't insert an
End-Of-Recovery record and instead run an immediate checkpoint.  I
think if we nuke the non-fast promotion, shouldn't we insert the EOR
record even in that case?


I'm not sure if that's safe.  What if the server crashes before the
checkpoint completes in that case?  Since the last checkpoint record is
not available, the subsequent crash recovery will fail.  This would
mean that the server will never start up.  Right?  Currently ISTM that


Yes, that's right.


end-of-recovery-checkpoint
is executed to avoid such trouble in that case.


I meant that we always have EOR at the end of recovery.  So in the
missing-latest-checkpoint (and crash recovery) case, we insert EOR
after the immediate checkpoint.  That also means we no longer set
CHECKPOINT_END_OF_RECOVERY on that checkpoint.


Could you tell me what the benefit of this change is? Even with this change,
the server still needs to wait for the checkpoint to complete before
becoming the master and starting the service, unlike fast promotion. No?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Remove non-fast promotion Re: Should we remove a fallback promotion? take 2

2020-04-21 Thread Fujii Masao




On 2020/04/22 10:53, Kyotaro Horiguchi wrote:

At Wed, 22 Apr 2020 10:28:07 +0900, Ian Barwick  
wrote in

On 2020/04/22 6:53, Alvaro Herrera wrote:

On 2020-Apr-21, Jehan-Guillaume de Rorthais wrote:


On Tue, 21 Apr 2020 15:36:22 +0900
Michael Paquier  wrote:



Also in that case, non-fast promotion is triggered. Since my patch
tries to remove non-fast promotion, it's intentional to prevent them
from doing that. But you think that we should not drop that because
there are still some users for that?


It would be good to ask around to folks maintaining HA solutions about
that change at least, as there could be a point in still letting
promotion to happen in this case, but switch silently to the fast
path.


FWIW, PAF relies on pg_ctl promote. No need for non-fast promotion.

AFAICT repmgr uses 'pg_ctl promote', and has since version 3.0
(released
in mid 2015).  It was only 3.3.2 (mid 2017) that supported Postgres
10,
so it seems fairly safe to assume that the removal won't be a problem.


Correct, repmgr uses "pg_ctl promote" or pg_promote() (if available),
and
won't be affected by this change.


For the record, the pgsql resource agent uses "pg_ctl promote" and
works with fast promote.  Its auxiliary tools also assume fast
promote.


Thanks all for checking whether the change affects each HA solution!

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Remove non-fast promotion Re: Should we remove a fallback promotion? take 2

2020-04-21 Thread Fujii Masao



On 2020/04/22 6:57, Alvaro Herrera wrote:

On 2020-Apr-20, Fujii Masao wrote:


+   /*
+* In 9.1 and 9.2 the postmaster unlinked the promote file inside the
+* signal handler. It now leaves the file in place and lets the
+* Startup process do the unlink.
+*/
+	if (IsPromoteSignaled() && stat(PROMOTE_SIGNAL_FILE, &stat_buf) == 0)
{
-   /*
-* In 9.1 and 9.2 the postmaster unlinked the promote file 
inside the
-* signal handler. It now leaves the file in place and lets the
-* Startup process do the unlink. This allows Startup to know 
whether
-* it should create a full checkpoint before starting up 
(fallback
-* mode). Fast promotion takes precedence.
-*/


It seems pointless to leave a very old comment that documents what the
code no longer does.  I think it would be better to reword it to indicate
what the code does do, i.e. something like "Leave the signal file in
place; it will be removed by the startup process when ..."


Agreed.  And while reading the related code, I thought that it's more
appropriate to place this comment in CheckPromoteSignal() rather than in
CheckForStandbyTrigger(), because CheckPromoteSignal() actually does
what the comment says, i.e., leaves the promote signal file in place and
lets the startup process do the unlink.

Attached is the updated version of the patch.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 11e32733c4..43b6d8da69 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -298,9 +298,6 @@ bool	wal_receiver_create_temp_slot = false;
 /* are we currently in standby mode? */
 bool	StandbyMode = false;
 
-/* whether request for fast promotion has been made yet */
-static bool fast_promote = false;
-
 /*
  * if recoveryStopsBefore/After returns true, it saves information of the stop
  * point here
@@ -6311,7 +6308,7 @@ StartupXLOG(void)
 	DBState		dbstate_at_startup;
 	XLogReaderState *xlogreader;
 	XLogPageReadPrivate private;
-	bool		fast_promoted = false;
+	bool		promoted = false;
 	struct stat st;
 
 	/*
@@ -7698,14 +7695,14 @@
 		 * the rule that TLI only changes in shutdown checkpoints, which
 		 * allows some extra error checking in xlog_redo.
 		 *
-		 * In fast promotion, only create a lightweight end-of-recovery record
+		 * In promotion, only create a lightweight end-of-recovery record
 		 * instead of a full checkpoint. A checkpoint is requested later,
 		 * after we're fully out of recovery mode and already accepting
 		 * queries.
 		 */
 		if (bgwriterLaunched)
 		{
-			if (fast_promote)
+			if (LocalPromoteIsTriggered)
 			{
 				checkPointLoc = ControlFile->checkPoint;
 
@@ -7716,7 +7713,7 @@
 				record = ReadCheckpointRecord(xlogreader, checkPointLoc, 1, false);
 				if (record != NULL)
 				{
-					fast_promoted = true;
+					promoted = true;
 
 					/*
 					 * Insert a special WAL record to mark the end of
@@ -7733,7 +7730,7 @@
 				}
 			}
 
-			if (!fast_promoted)
+			if (!promoted)
 				RequestCheckpoint(CHECKPOINT_END_OF_RECOVERY |
 								  CHECKPOINT_IMMEDIATE |
 								  CHECKPOINT_WAIT);
@@ -7924,12 +7921,12 @@ StartupXLOG(void)
 	WalSndWakeup();
 
 	/*
-	 * If this was a fast promotion, request an (online) checkpoint now. This
+	 * If this was a promotion, request an (online) checkpoint now. This
 	 * isn't required for consistency, but the last restartpoint might be far
 	 * back, and in case of a crash, recovering from it might take a longer
 	 * time than is appropriate now that we're not in standby mode anymore.
 	 */
-	if (fast_promoted)
+	if (promoted)
 		RequestCheckpoint(CHECKPOINT_FORCE);
 }
 
@@ -12532,29 +12529,10 @@ CheckForStandbyTrigger(void)
 	if (LocalPromoteIsTriggered)
 		return true;
 
-	if (IsPromoteSignaled())
+	if (IsPromoteSignaled() && stat(PROMOTE_SIGNAL_FILE, &stat_buf)

Re: Add "-Wimplicit-fallthrough" to default flags (was Re: pgsql: Support FETCH FIRST WITH TIES)

2020-04-21 Thread Tom Lane
Alvaro Herrera  writes:
> Do we intend to see this done in the current cycle?

I don't have an objection to doing it now.  It's just a new compiler
warning option, it shouldn't be able to break any code.  (Plus there's
plenty of time to revert, if somehow it causes a problem.)

regards, tom lane




Re: [PATCH] Implement INSERT SET syntax

2020-04-21 Thread movead li
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   tested, passed
Documentation:tested, passed

The build fails when the patch is applied to the latest code version; I tried
head '73025140885c889410b9bfc4a30a3866396fc5db', which works well.

The new status of this patch is: Waiting on Author


Re: Remove non-fast promotion Re: Should we remove a fallback promotion? take 2

2020-04-21 Thread Kyotaro Horiguchi
At Wed, 22 Apr 2020 10:28:07 +0900, Ian Barwick  
wrote in 
> On 2020/04/22 6:53, Alvaro Herrera wrote:
> > On 2020-Apr-21, Jehan-Guillaume de Rorthais wrote:
> > 
> >> On Tue, 21 Apr 2020 15:36:22 +0900
> >> Michael Paquier  wrote:
> > 
>  Also in that case, non-fast promotion is triggered. Since my patch
>  tries to remove non-fast promotion, it's intentional to prevent them
>  from doing that. But you think that we should not drop that because
>  there are still some users for that?
> >>>
> >>> It would be good to ask around to folks maintaining HA solutions about
> >>> that change at least, as there could be a point in still letting
> >>> promotion to happen in this case, but switch silently to the fast
> >>> path.
> >>
> >> FWIW, PAF relies on pg_ctl promote. No need for non-fast promotion.
> > AFAICT repmgr uses 'pg_ctl promote', and has since version 3.0
> > (released
> > in mid 2015).  It was only 3.3.2 (mid 2017) that supported Postgres
> > 10,
> > so it seems fairly safe to assume that the removal won't be a problem.
> 
> Correct, repmgr uses "pg_ctl promote" or pg_promote() (if available),
> and
> won't be affected by this change.

For the record, the pgsql resource agent uses "pg_ctl promote" and
works with fast promote.  Its auxiliary tools also assume fast
promote.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Remove non-fast promotion Re: Should we remove a fallback promotion? take 2

2020-04-21 Thread Ian Barwick

On 2020/04/22 6:53, Alvaro Herrera wrote:

On 2020-Apr-21, Jehan-Guillaume de Rorthais wrote:


On Tue, 21 Apr 2020 15:36:22 +0900
Michael Paquier  wrote:



Also in that case, non-fast promotion is triggered. Since my patch
tries to remove non-fast promotion, it's intentional to prevent them
from doing that. But you think that we should not drop that because
there are still some users for that?


It would be good to ask around to folks maintaining HA solutions about
that change at least, as there could be a point in still letting
promotion to happen in this case, but switch silently to the fast
path.


FWIW, PAF relies on pg_ctl promote. No need for non-fast promotion.


AFAICT repmgr uses 'pg_ctl promote', and has since version 3.0 (released
in mid 2015).  It was only 3.3.2 (mid 2017) that supported Postgres 10,
so it seems fairly safe to assume that the removal won't be a problem.


Correct, repmgr uses "pg_ctl promote" or pg_promote() (if available), and
won't be affected by this change.


Regards

Ian Barwick


--
Ian Barwick   https://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: MYSQL_FDW trigger BEFORE UPDATE changes to NEW on a col not in the update statement don't go through

2020-04-21 Thread Etsuro Fujita
Hi Francois,

On Wed, Apr 22, 2020 at 8:09 AM Francois Payette
 wrote:
> I create a trigger on an imported foreign table. In the procedure, I change 
> the value of a column that is not in the triggering update statement. This 
> change does not make it to the mysql side.

I'm not an expert on mysql_fdw, so maybe I'm missing something, but I
think we had the same issue in postgres_fdw.  See this:

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=8b6da83d162cb0ac9f6d21082727bbd45c972c53;hp=7dc6ae37def50b5344c157eee5e029a09359f8ee

Best regards,
Etsuro Fujita




Re: Remove page-read callback from XLogReaderState.

2020-04-21 Thread Kyotaro Horiguchi
At Tue, 21 Apr 2020 17:04:27 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
>

Mmm.  The message body seems to have disappeared for some reason.

cd12323440 conflicts with this. Rebased.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From b3b780c2143ae70b82fb1e8256e771f11276af31 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Thu, 5 Sep 2019 20:21:55 +0900
Subject: [PATCH v9 1/4] Move callback-call from ReadPageInternal to
 XLogReadRecord.

The current WAL record reader reads page data using a callback
function.  Although that is not so problematic on its own, it would
become a problem if we add tasks, like encryption, that are performed
on page data before the WAL reader reads them.  To avoid the record
reader facility having to grow a new code path for every new callback,
this patch separates the page reader from the WAL record reading
facility by turning the current WAL record reader into a state
machine.

As the first step of that change, this patch moves the page reader
function out of ReadPageInternal; the remaining tasks of that function
are taken over by the new function XLogNeedData.  As a result,
XLogPageRead directly calls the page reader callback function
according to the feedback from XLogNeedData.
---
 src/backend/access/transam/xlog.c   |  16 +-
 src/backend/access/transam/xlogreader.c | 278 ++--
 src/backend/access/transam/xlogutils.c  |  12 +-
 src/backend/replication/walsender.c |  10 +-
 src/bin/pg_rewind/parsexlog.c   |  21 +-
 src/bin/pg_waldump/pg_waldump.c |   8 +-
 src/include/access/xlogreader.h |  23 +-
 src/include/access/xlogutils.h  |   2 +-
 8 files changed, 218 insertions(+), 152 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 11e32733c4..b0ad9376d6 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -908,7 +908,7 @@ static bool InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
 static int	XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
 		 XLogSource source, bool notfoundOk);
 static int	XLogFileReadAnyTLI(XLogSegNo segno, int emode, XLogSource source);
-static int	XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr,
+static bool	XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr,
 		 int reqLen, XLogRecPtr targetRecPtr, char *readBuf);
 static bool WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 		bool fetching_ckpt, XLogRecPtr tliRecPtr);
@@ -4328,7 +4328,6 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 	XLogRecord *record;
 	XLogPageReadPrivate *private = (XLogPageReadPrivate *) xlogreader->private_data;
 
-	/* Pass through parameters to XLogPageRead */
 	private->fetching_ckpt = fetching_ckpt;
 	private->emode = emode;
 	private->randAccess = (xlogreader->ReadRecPtr == InvalidXLogRecPtr);
@@ -11810,7 +11809,7 @@ CancelBackup(void)
  * XLogPageRead() to try fetching the record from another source, or to
  * sleep and retry.
  */
-static int
+static bool
 XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
 			 XLogRecPtr targetRecPtr, char *readBuf)
 {
@@ -11869,7 +11868,8 @@ retry:
 			readLen = 0;
 			readSource = XLOG_FROM_ANY;
 
-			return -1;
+			xlogreader->readLen = -1;
+			return false;
 		}
 	}
 
@@ -11964,7 +11964,8 @@ retry:
 		goto next_record_is_invalid;
 	}
 
-	return readLen;
+	xlogreader->readLen = readLen;
+	return true;
 
 next_record_is_invalid:
 	lastSourceFailed = true;
@@ -11978,8 +11979,9 @@ next_record_is_invalid:
 	/* In standby-mode, keep trying */
 	if (StandbyMode)
 		goto retry;
-	else
-		return -1;
+
+	xlogreader->readLen = -1;
+	return false;
 }
 
 /*
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 79ff976474..6250093dd9 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -36,8 +36,8 @@
 static void report_invalid_record(XLogReaderState *state, const char *fmt,...)
 			pg_attribute_printf(2, 3);
 static bool allocate_recordbuf(XLogReaderState *state, uint32 reclength);
-static int	ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr,
-			 int reqLen);
+static bool XLogNeedData(XLogReaderState *state, XLogRecPtr pageptr,
+		 int reqLen, bool header_inclusive);
 static void XLogReaderInvalReadState(XLogReaderState *state);
 static bool ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
   XLogRecPtr PrevRecPtr, XLogRecord *record, bool randAccess);
@@ -272,7 +272,6 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 	uint32		targetRecOff;
 	uint32		pageHeaderSize;
 	bool		gotheader;
-	int			readOff;
 
 	/*
 	 * randAccess indicates whether to verify the previous-record pointer of
@@ -322,14 +321,20 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 	 * byte to cover the whole record header, or at 

Re: DETACH PARTITION and FOR EACH ROW triggers on partitioned tables

2020-04-21 Thread Justin Pryzby
On Tue, Apr 21, 2020 at 07:03:30PM -0400, Alvaro Herrera wrote:
> On 2020-Apr-20, Justin Pryzby wrote:
> 
> > On Mon, Apr 20, 2020 at 06:35:44PM +0900, Amit Langote wrote:
> 
> > > Also, how about, for consistency, making the parent table labeling of
> > > the trigger look similar to that for the foreign constraint, so
> > > Triggers:
> > > TABLE "f1" TRIGGER "trig" BEFORE INSERT ON f11 FOR EACH ROW EXECUTE 
> > > FUNCTION trigfunc()
> > 
> > I'll leave that for committer to decide.
> 
> Pushed.  Many thanks for this!

Thanks for polishing it.

I was just about to convince myself of the merits of doing it Amit's way :)

I noticed a few issues:

 - should put \n's around Amit's subquery to make psql -E look pretty;
 - maybe should quote the TABLE, like \"%s\" ?

#3 is that *if* we did it Amit's way, I *think* maybe we should show the
parent's triggerdef, not the child's.
It seems strange to me to say "TABLE trigpart .* INSERT ON trigpart3"

-TABLE "trigpart" TRIGGER trg1 AFTER INSERT ON trigpart3 FOR EACH ROW EXECUTE FUNCTION trigger_nothing()
+TABLE "trigpart" TRIGGER trg1 AFTER INSERT ON trigpart FOR EACH ROW EXECUTE FUNCTION trigger_nothing()


-- 
Justin
From 86ff4e592afe56d2611e22b63fa3f1268b583e58 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Fri, 3 Apr 2020 22:43:26 -0500
Subject: [PATCH v6 1/2] fixups: c33869cc3bfc42bce822251f2fa1a2a346f86cc5

---
 src/bin/psql/describe.c| 14 +++---
 src/test/regress/expected/triggers.out |  2 +-
 src/test/regress/sql/triggers.sql  |  2 +-
 3 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 8dca6d8bb4..5ef56f7a9d 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -2947,12 +2947,12 @@ describeOneTableDetails(const char *schemaname,
 		   pset.sversion >= 80300 ?
 		   "t.tgconstraint <> 0 AS tgisinternal" :
 		   "false AS tgisinternal"),
-		  (pset.sversion >= 13 ?
-		   "(SELECT (NULLIF(a.relid, t.tgrelid))::pg_catalog.regclass"
-		   " FROM pg_catalog.pg_trigger AS u, "
-		   "  pg_catalog.pg_partition_ancestors(t.tgrelid) AS a"
-		   " WHERE u.tgname = t.tgname AND u.tgrelid = a.relid"
-		   "   AND u.tgparentid = 0) AS parent" :
+		  (pset.sversion >= 13 ? "\n"
+		   "  (SELECT (NULLIF(a.relid, t.tgrelid))::pg_catalog.regclass\n"
+		   "   FROM pg_catalog.pg_trigger AS u,\n"
+		   "   pg_catalog.pg_partition_ancestors(t.tgrelid) AS a\n"
+		   "   WHERE u.tgname = t.tgname AND u.tgrelid = a.relid\n"
+		   "AND u.tgparentid = 0) AS parent" :
 		   "NULL AS parent"),
 		  oid);
 		if (pset.sversion >= 11)
@@ -3073,7 +3073,7 @@ describeOneTableDetails(const char *schemaname,
 
 	/* Visually distinguish inherited triggers */
 	if (!PQgetisnull(result, i, 4))
-		appendPQExpBuffer(&buf, ", ON TABLE %s",
+		appendPQExpBuffer(&buf, ", ON TABLE \"%s\"",
 						  PQgetvalue(result, i, 4));
 
 	printTableAddFooter(&cont, buf.data);
diff --git a/src/test/regress/expected/triggers.out b/src/test/regress/expected/triggers.out
index 5e76b3a47e..4104711c29 100644
--- a/src/test/regress/expected/triggers.out
+++ b/src/test/regress/expected/triggers.out
@@ -2065,7 +2065,7 @@ create trigger trg1 after insert on trigpart3 for each row execute procedure tri
 Triggers:
 trg1 AFTER INSERT ON trigpart3 FOR EACH ROW EXECUTE FUNCTION trigger_nothing()
 
-alter table trigpart attach partition trigpart3 FOR VALUES FROM (2000) to (3000); -- fail
+alter table trigpart attach partition trigpart3 for values from (2000) to (3000); -- fail
 ERROR:  trigger "trg1" for relation "trigpart3" already exists
 drop table trigpart3;
 drop table trigpart;
diff --git a/src/test/regress/sql/triggers.sql b/src/test/regress/sql/triggers.sql
index e228d0a8a5..c359c6d3fa 100644
--- a/src/test/regress/sql/triggers.sql
+++ b/src/test/regress/sql/triggers.sql
@@ -1398,7 +1398,7 @@ select tgrelid::regclass::text, tgname, tgfoid::regproc, tgenabled, tgisinternal
 create table trigpart3 (like trigpart);
 create trigger trg1 after insert on trigpart3 for each row execute procedure trigger_nothing();
 \d trigpart3
-alter table trigpart attach partition trigpart3 FOR VALUES FROM (2000) to (3000); -- fail
+alter table trigpart attach partition trigpart3 for values from (2000) to (3000); -- fail
 drop table trigpart3;
 
 drop table trigpart;
-- 
2.17.0

From 4712253f66e03fd57a7a7a971a9f00c492c6 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Mon, 20 Apr 2020 13:46:06 -0500
Subject: [PATCH v6 2/2] show inherited triggers Amit's way..

..this also updates the query to avoid saying things like:
TABLE trigpart .* INSERT ON trigpart3
---
 src/bin/psql/describe.c| 42 +-
 src/test/regress/expected/triggers.out |  2 +-
 2 files changed, 28 insertions(+), 16 deletions(-)

diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 

Re: forgotten initalization of a variable

2020-04-21 Thread Kyotaro Horiguchi
At Wed, 22 Apr 2020 08:13:02 +0900, Michael Paquier  wrote 
in 
> On Tue, Apr 21, 2020 at 06:09:30PM +0900, Kyotaro Horiguchi wrote:
> > At Tue, 21 Apr 2020 17:34:26 +0900, Michael Paquier  
> > wrote in 
> >> a checkpoint record now, but this routine could be called elsewhere in
> >> the future.  Please see the attached.
> > 
> > It looks fine to me.
> 
> Fixed this way, then.  Thanks for the report!

Thanks for fixing this!

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Remove non-fast promotion Re: Should we remove a fallback promotion? take 2

2020-04-21 Thread Kyotaro Horiguchi
At Tue, 21 Apr 2020 22:08:56 +0900, Fujii Masao  
wrote in 
> 
> 
> On 2020/04/21 17:15, Kyotaro Horiguchi wrote:
> > At Mon, 20 Apr 2020 15:26:16 +0900, Fujii Masao
> >  wrote in
> >> Patch attached. I will add this into the first CF for v14.
> > -	if (!fast_promoted)
> > +	if (!promoted)
> > 	RequestCheckpoint(CHECKPOINT_END_OF_RECOVERY |
> > 					  CHECKPOINT_IMMEDIATE |
> > 					  CHECKPOINT_WAIT);
> > If we don't find the checkpoint record just before, we don't insert an
> > End-Of-Recovery record and instead run an immediate checkpoint.  I think
> > if we nuke the non-fast promotion, shouldn't we insert the EOR record
> > even in that case?
> 
> I'm not sure if that's safe.  What if the server crashes before the
> checkpoint completes in that case?  Since the last checkpoint record is
> not available, the subsequent crash recovery will fail.  This would
> mean that the server will never start up.  Right?  Currently ISTM that

Yes, that's right.

> end-of-recovery-checkpoint
> is executed to avoid such trouble in that case.

I meant that we always have EOR at the end of recovery.  So in the
missing-latest-checkpoint (and crash recovery) case, we insert EOR
after the immediate checkpoint.  That also means we no longer set
CHECKPOINT_END_OF_RECOVERY on that checkpoint.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: ALTER TABLE ... SET STORAGE does not propagate to indexes

2020-04-21 Thread Alvaro Herrera
I'm surprised that this hasn't been applied yet, because:

On 2020-Apr-09, Peter Eisentraut wrote:

> One thing to remember is that the current situation is broken.  While you
> can set index columns to have different storage than the corresponding table
> columns, pg_dump does not preserve that, because it dumps indexes after
> ALTER TABLE commands.  So at the moment, having these two things different
> isn't really supported.

So I have to ask -- are you planning to get this patch pushed and
backpatched?

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: 2pc leaks fds

2020-04-21 Thread Alvaro Herrera
On 2020-Apr-08, Antonin Houska wrote:

> Specifically for 2PC, isn't it better to close some OS-level FD of an
> unrelated table scan and then succeed than to ERROR immediately? Anyway,
> 0dc8ead46 hasn't changed this.

I think for full generality of the interface, we would pass a "close"
callback in addition to the "open" callback.  But if we were to pass it
only to WALRead, then there would be no way to call it during
XLogReaderFree.

I think the fix Andres applied is okay as far as it goes, but for the
long term we may want to change the interface even further -- maybe by
having these functions be part of the XLogReader state struct.  I have
this code paged out of my head ATM, but maybe tomorrow I can give it a
little think.
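
For illustration, "part of the XLogReader state struct" might look
something like this (hypothetical names; a sketch of the idea, not a
worked-out design):

    /* Sketch: let the reader own its resource-management callbacks, so
     * that XLogReaderFree() can invoke the close callback itself. */
    typedef struct XLogReaderCallbacks
    {
        int     (*read_page) (struct XLogReaderState *state);     /* supply WAL pages */
        void    (*open_segment) (struct XLogReaderState *state);  /* acquire an FD */
        void    (*close_segment) (struct XLogReaderState *state); /* release the FD */
    } XLogReaderCallbacks;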

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Add "-Wimplicit-fallthrough" to default flags (was Re: pgsql: Support FETCH FIRST WITH TIES)

2020-04-21 Thread Alvaro Herrera
Do we intend to see this done in the current cycle?

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: forgotten initalization of a variable

2020-04-21 Thread Michael Paquier
On Tue, Apr 21, 2020 at 06:09:30PM +0900, Kyotaro Horiguchi wrote:
> At Tue, 21 Apr 2020 17:34:26 +0900, Michael Paquier  
> wrote in 
>> a checkpoint record now, but this routine could be called elsewhere in
>> the future.  Please see the attached.
> 
> It looks fine to me.

Fixed this way, then.  Thanks for the report!
--
Michael


signature.asc
Description: PGP signature


MYSQL_FDW trigger BEFORE UPDATE changes to NEW on a col not in the update statement don't go through

2020-04-21 Thread Francois Payette
Hi All,
 I was pleasantly surprised to see that triggers can be created on FDW tables. 
I'm running into a problem.

I create a trigger on an imported foreign table. In the procedure, I change the 
value of a column that is not in the triggering update statement. This change 
does not make it to the mysql side.

CREATE OR REPLACE FUNCTION aatrigger_up() returns trigger
AS $$
DECLARE
BEGIN

IF NOT(row_to_json(NEW)->'pgrti' is NULL) THEN
NEW.pgrti = 20*random();
END IF;
RAISE NOTICE 'aatrigger_up %', row_to_json(NEW)::text;
  return NEW;

END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER aatrigger_up BEFORE UPDATE ON mysql.users FOR EACH ROW EXECUTE
PROCEDURE aatrigger_up();
update mysql.users set email = 'ad...@example.com' where id = 1;
I can see that the value for pgrti is updated in the NOTICE in postgres.
In mysql the value is not updated.  If I add the target column to the
statement, it does go through:

update mysql.users set email = 'ad...@example.com', pgrti=0 where id = 1;   
I need this to work to be able to detect CRUD coming from PG in a little
daemon that calls pg_triggers for updates coming from mysqld; without a
means to detect changes originating from pg, the triggers would fire
twice.  Any idea where I'd change MYSQL_FDW to do this (i.e. also add
fields that are updated in the trigger before firing off to mysql)?

I'm seeing in https://github.com/EnterpriseDB/mysql_fdw/blob/master/deparse.c
in mysql_deparse_update that the actual update statement is used to
generate the column mapping, so any column referred to only in triggers
would be ignored…


TIA, stay safe!
Francois Payette

Re: DETACH PARTITION and FOR EACH ROW triggers on partitioned tables

2020-04-21 Thread Alvaro Herrera
On 2020-Apr-20, Justin Pryzby wrote:

> On Mon, Apr 20, 2020 at 06:35:44PM +0900, Amit Langote wrote:

> > Also, how about, for consistency, making the parent table labeling of
> > the trigger look similar to that for the foreign constraint, so
> > Triggers:
> > TABLE "f1" TRIGGER "trig" BEFORE INSERT ON f11 FOR EACH ROW EXECUTE 
> > FUNCTION trigfunc()
> 
> I'll leave that for committer to decide.

Pushed.  Many thanks for this!

Changes: I thought that printing the "ON TABLE" bit when it's defined in
the same table is pointless and ugly, so I added a NULLIF to prevent it
in that case (it's not every day that you can put NULLIF to work).  I
also changed the empty string to NULL for the case with older servers,
so that it doesn't print a lame "ON TABLE " clause for them.  Lastly,
added pg_catalog qualifications everywhere needed.

Contrary to what I had said, I decided to leave the output as submitted;
the constraint lines are not really precedent against it:

55432 13devel 24286=# \d lev3
  Partitioned table "public.lev3"
 Column │  Type   │ Collation │ Nullable │ Default 
┼─┼───┼──┼─
 a  │ integer │   │ not null │ 
Partition of: lev2 FOR VALUES IN (3)
Partition key: LIST (a)
Indexes:
"lev3_pkey" PRIMARY KEY, btree (a)
Foreign-key constraints:
TABLE "lev1" CONSTRAINT "lev1_a_fkey" FOREIGN KEY (a) REFERENCES lev1(a)
Referenced by:
TABLE "lev1" CONSTRAINT "lev1_a_fkey" FOREIGN KEY (a) REFERENCES lev1(a)
Triggers:
tt AFTER UPDATE ON lev3 FOR EACH ROW EXECUTE FUNCTION trigger_nothing(), ON 
TABLE lev2
Number of partitions: 1 (Use \d+ to list them.)

In the "FK constraints" and "referenced by" entries, it looks natural
since the constraint refers to a table.  Not so in the trigger case.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: design for parallel backup

2020-04-21 Thread Andres Freund
Hi,

On 2020-04-21 17:09:50 -0400, Robert Haas wrote:
> On Tue, Apr 21, 2020 at 4:14 PM Andres Freund  wrote:
> > It was local TCP. The speeds I can reach are faster than the 10GiB/s
> > (unidirectional) I can do between the laptop & workstation, so testing
> > it over "actual" network isn't informative - I basically can reach line
> > speed between them with any method.
>
> Is that really a conclusive test, though? In the case of either local
> TCP or a fast local interconnect, you'll have negligible latency. It
> seems at least possible that saturating the available bandwidth is
> harder on a higher-latency connection. Cross-region data center
> connections figure to have way higher latency than a local wired
> network, let alone the loopback interface.

Sure. But that's what the TCP window etc should take care of. You might
have to tune the OS if you have a high latency multi-GBit link, but
you'd have to do that regardless of whether a single process or multiple
processes are used.  And the number of people with high-latency
multi-gbit links isn't that high, compared to the number taking backups
within a datacenter.
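
(For the record, the per-socket knob meant here is the buffer size that
bounds the TCP window, which must cover the bandwidth-delay product; a
sketch, with an arbitrary 16MB figure:)

    /* Sketch: request large socket buffers for a high-latency,
     * high-bandwidth link.  Effective values are still capped by kernel
     * limits such as net.core.rmem_max/wmem_max on Linux. */
    #include <sys/socket.h>

    static void
    set_large_socket_buffers(int sock)
    {
        int     bufsize = 16 * 1024 * 1024; /* ~ bandwidth * round-trip time */

        setsockopt(sock, SOL_SOCKET, SO_RCVBUF, &bufsize, sizeof(bufsize));
        setsockopt(sock, SOL_SOCKET, SO_SNDBUF, &bufsize, sizeof(bufsize));
    }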


> > It was in kernel buffer cache. But I can reach 100% utilization of
> > storage too (which is slightly slower than what I can do over unix
> > socket).
> >
> > pg_basebackup --manifest-checksums=none -h /tmp/ -D- -Ft -cfast -Xnone |pv -B16M -r -a > /dev/null
> > 2.59GiB/s
> > find /srv/dev/pgdev-dev/base/ -type f -exec dd if={} bs=32k status=none \; |pv -B16M -r -a > /dev/null
> > 2.53GiB/s
> > find /srv/dev/pgdev-dev/base/ -type f -exec cat {} + |pv -B16M -r -a > /dev/null
> > 2.42GiB/s
> >
> > I tested this with a -s 5000 DB, FWIW.
>
> But that's not a real test either, because you're not writing the data
> anywhere. It's going to be a whole lot easier to saturate the read
> side if the write side is always zero latency.

I also stored data elsewhere in separate threads. But the bottleneck of
that is lower (my storage is faster on reads than on writes, at least
after the ram on the nvme is exhausted)...


> > > It seems to me that the interesting cases may involve having lots of
> > > available CPUs and lots of disk spindles, but a comparatively slow
> > > pipe between the machines.
> >
> > Hm, I'm not sure I am following. If network is the bottleneck, we'd
> > immediately fill the buffers, and that'd be that?
> >
> > ISTM all of this is only really relevant if either pg_basebackup or
> > walsender is the bottleneck?
>
> I agree that if neither pg_basebackup nor walsender is the bottleneck,
> parallelism is unlikely to be very effective. I have realized as a
> result of your comments that I actually don't care intrinsically about
> parallel backup; what I actually care about is making backups very,
> very fast. I suspect that parallelism is a useful means to that end,
> but I interpret your comments as questioning that, and specifically
> drawing attention to the question of where the bottlenecks might be.
> So I'm trying to think about that.

I agree that trying to make backups very fast is a good goal (or well, I
think not very slow would be a good descriptor for the current
situation). I am just trying to make sure we tackle the right problems
for that. My gut feeling is that we have to tackle compression first,
because without addressing that "all hope is lost" ;)

FWIW, here's the base backup from pgbench -i -s 5000 compressed a number
of ways. The uncompressed backup is 64622701911 bytes. Unfortunately
pgbench -i -s 5000 is not a particularly good example, it's just too
compressible.


method   level  parallelism  wall-time  cpu-user-time  cpu-kernel-time  size        rate  format
gzip     1      1            380.79     368.46         12.15            3892457816  16.6  .gz
gzip     6      1            976.05     963.10         12.84            3594605389  18.0  .gz
pigz     1      10           34.35      364.14         23.55            3892401867  16.6  .gz
pigz     6      10           101.27     1056.85        28.98            3620724251  17.8  .gz
zstd-gz  1      1            278.14     265.31         12.81            3897174342  15.6  .gz
zstd-gz  1      6            906.67     893.58         12.52            3598238594  18.0  .gz
zstd     1      1            82.95      67.97          11.82            2853193736  22.6  .zstd
zstd     1      6            228.58     214.65         13.92            2687177334  24.0  .zstd
zstd     1      10           25.05      151.84         13.35            2847414913  22.7  .zstd
zstd     6      10           43.47      374.30         12.37            2745211100  23.5  .zstd
zstd     6      20           32.50      468.18         13.44            2745211100  23.5  .zstd
zstd     9      20           57.99      949.91         14.13

Re: Do we need to handle orphaned prepared transactions in the server?

2020-04-21 Thread Bruce Momjian
On Tue, Apr 21, 2020 at 04:03:53PM -0400, Tom Lane wrote:
> Bruce Momjian  writes:
> > On Tue, Apr 21, 2020 at 01:52:46PM +0900, Michael Paquier wrote:
> >> On Mon, Apr 20, 2020 at 10:35:15PM -0400, Bruce Momjian wrote:
> >>> On Thu, Apr 16, 2020 at 03:11:51PM -0400, Tom Lane wrote:
>  If we were going to go down the path of periodically logging warnings
>  about old prepared transactions, some single-instance background task
>  like the checkpointer would be a better place to do the work in.  But
>  I'm not really recommending that, because I agree with Robert that
>  we just plain don't want this functionality.
> 
> > Sorry, I meant something in the Postgres logs at postmaster start.
> 
> That seems strictly worse than periodic logging as far as the probability
> that somebody will notice the log entry goes.  In any case it would only
> help people when they restart their postmaster, which ought to be pretty
> infrequent in a production situation.

I thought if something was wrong, they might look at the server logs
after a restart, or they might have a higher probability of having
orphaned prepared transactions after a restart.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: Remove non-fast promotion Re: Should we remove a fallback promotion? take 2

2020-04-21 Thread Alvaro Herrera
On 2020-Apr-20, Fujii Masao wrote:

> + /*
> +  * In 9.1 and 9.2 the postmaster unlinked the promote file inside the
> +  * signal handler. It now leaves the file in place and lets the
> +  * Startup process do the unlink.
> +  */
> +	if (IsPromoteSignaled() && stat(PROMOTE_SIGNAL_FILE, &stat_buf) == 0)
>   {
> - /*
> -  * In 9.1 and 9.2 the postmaster unlinked the promote file 
> inside the
> -  * signal handler. It now leaves the file in place and lets the
> -  * Startup process do the unlink. This allows Startup to know 
> whether
> -  * it should create a full checkpoint before starting up 
> (fallback
> -  * mode). Fast promotion takes precedence.
> -  */

It seems pointless to leave a very old comment that documents what the
code no longer does.  I think it would be better to reword it to indicate
what the code does do, i.e. something like "Leave the signal file in
place; it will be removed by the startup process when ..."

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Remove non-fast promotion Re: Should we remove a fallback promotion? take 2

2020-04-21 Thread Alvaro Herrera
On 2020-Apr-21, Jehan-Guillaume de Rorthais wrote:

> On Tue, 21 Apr 2020 15:36:22 +0900
> Michael Paquier  wrote:

> > > Also in that case, non-fast promotion is triggered. Since my patch
> > > tries to remove non-fast promotion, it's intentional to prevent them
> > > from doing that. But you think that we should not drop that because
> > > there are still some users for that?  
> > 
> > It would be good to ask around to folks maintaining HA solutions about
> > that change at least, as there could be a point in still letting
> > promotion to happen in this case, but switch silently to the fast
> > path.
> 
> FWIW, PAF relies on pg_ctl promote. No need for non-fast promotion.

AFAICT repmgr uses 'pg_ctl promote', and has since version 3.0 (released
in mid 2015).  It was only 3.3.2 (mid 2017) that supported Postgres 10,
so it seems fairly safe to assume that the removal won't be a problem.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Remove non-fast promotion Re: Should we remove a fallback promotion? take 2

2020-04-21 Thread Jehan-Guillaume de Rorthais
Hello,

On Tue, 21 Apr 2020 15:36:22 +0900
Michael Paquier  wrote:

> On Tue, Apr 21, 2020 at 03:29:54PM +0900, Fujii Masao wrote:
> > Yeah, but that's not documented. So I don't think that we need to keep
> > the backward-compatibility for that.
> > 
> > Also in that case, non-fast promotion is triggered. Since my patch
> > tries to remove non-fast promotion, it's intentional to prevent them
> > from doing that. But you think that we should not drop that because
> > there are still some users for that?  
> 
> It would be good to ask around to folks maintaining HA solutions about
> that change at least, as there could be a point in still letting
> promotion to happen in this case, but switch silently to the fast
> path.

FWIW, PAF relies on pg_ctl promote. No need for non-fast promotion.

Regards,




Re: [IBM z Systems] Getting server crash when jit_above_cost =0

2020-04-21 Thread Thomas Munro
On Wed, Apr 22, 2020 at 2:34 AM tushar  wrote:
> (gdb) bt
> #0  0x03ffa9841220 in raise () from /lib64/libc.so.6
> #1  0x03ffa9842aa8 in abort () from /lib64/libc.so.6
> #2  0x03ff9f7881b4 in __gnu_cxx::__verbose_terminate_handler() () from 
> /lib64/libstdc++.so.6
> #3  0x03ff9f785c7e in ?? () from /lib64/libstdc++.so.6
> #4  0x03ff9f785cb6 in std::terminate() () from /lib64/libstdc++.so.6
> #5  0x03ff9f785f60 in __cxa_throw () from /lib64/libstdc++.so.6
> #6  0x03ff9f7e4468 in std::__throw_bad_function_call() () from 
> /lib64/libstdc++.so.6
> #7  0x03ffa139e5c4 in 
> std::function std::default_delete > ()>::operator()() 
> const () from /opt/rh/llvm-toolset-6.0/root/usr/lib64/libLLVM-6.0.so
> #8  0x03ffa139f2a8 in LLVMOrcCreateInstance () from 
> /opt/rh/llvm-toolset-6.0/root/usr/lib64/libLLVM-6.0.so
> #9  0x03ffa9c8a984 in llvm_session_initialize () at llvmjit.c:670
> #10 llvm_create_context (jitFlags=<optimized out>) at llvmjit.c:146
> #11 0x03ffa9c98992 in llvm_compile_expr (state=0xa8c52218) at 
> llvmjit_expr.c:131
> #12 0x80219986 in ExecReadyExpr (state=state@entry=0xa8c52218) at 
> execExpr.c:628
> #13 0x8021cd6e in ExecBuildProjectionInfo (targetList=<optimized out>, econtext=<optimized out>, slot=<optimized out>, 
> parent=parent@entry=0xa8c51e30, inputDesc=inputDesc@entry=0x0) at 
> execExpr.c:472
> #14 0x80232ed6 in ExecAssignProjectionInfo 
> (planstate=planstate@entry=0xa8c51e30, inputDesc=inputDesc@entry=0x0) at 
> execUtils.c:504
> #15 0x80250178 in ExecInitResult (node=node@entry=0xa8c4fb98, 
> estate=estate@entry=0xa8c51bf0, eflags=eflags@entry=16) at nodeResult.c:221
> #16 0x8022c72c in ExecInitNode (node=node@entry=0xa8c4fb98, 
> estate=estate@entry=0xa8c51bf0, eflags=eflags@entry=16) at execProcnode.c:164
> #17 0x8022675e in InitPlan (eflags=16, queryDesc=0xa8c4f7d0) at 
> execMain.c:1020
> #18 standard_ExecutorStart (queryDesc=0xa8c4f7d0, eflags=16) at execMain.c:266
> #19 0x80388868 in PortalStart (portal=portal@entry=0xa8c91c80, 
> params=params@entry=0x0, eflags=eflags@entry=0, snapshot=snapshot@entry=0x0) 
> at pquery.c:518
> #20 0x80384b2e in exec_simple_query 
> (query_string=query_string@entry=0xa8c06170 "select 5;") at postgres.c:1176
> #21 0x803852e0 in PostgresMain (argc=<optimized out>, 
> argv=argv@entry=0xa8c55db8, dbname=0xa8c55c80 "postgres", username=<optimized out>) at postgres.c:4247
> #22 0x8008007e in BackendRun (port=<optimized out>, port=<optimized out>) at postmaster.c:4437
> #23 BackendStartup (port=0xa8c4dc10) at postmaster.c:4128
> #24 ServerLoop () at postmaster.c:1704
> #25 0x8030c89e in PostmasterMain (argc=argc@entry=3, 
> argv=argv@entry=0xa8c00cc0) at postmaster.c:1377
> #26 0x800811f4 in main (argc=<optimized out>, argv=0xa8c00cc0) at 
> main.c:228

Hi Tushar,

When testing this stuff on a few different platforms, I ran into a
switch statement in llvm that returned an empty std::function<> that
would throw std::bad_function_call like that, on architectures other
than (IIRC) x86 and ARM:

https://www.postgresql.org/message-id/CAEepm%3D39F_B3Ou8S3OrUw%2BhJEUP3p%3DwCu0ug-TTW67qKN53g3w%40mail.gmail.com

I'm not sure if you're seeing the same problem or another similar one,
but I know that Andres got a patch along those lines into llvm.  Maybe
you could try on a more recent llvm release?




Re: design for parallel backup

2020-04-21 Thread Robert Haas
On Tue, Apr 21, 2020 at 4:14 PM Andres Freund  wrote:
> It was local TCP. The speeds I can reach are faster than the 10GiB/s
> (unidirectional) I can do between the laptop & workstation, so testing
> it over "actual" network isn't informative - I basically can reach line
> speed between them with any method.

Is that really a conclusive test, though? In the case of either local
TCP or a fast local interconnect, you'll have negligible latency. It
seems at least possible that saturating the available bandwidth is
harder on a higher-latency connection. Cross-region data center
connections figure to have way higher latency than a local wired
network, let alone the loopback interface.
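
To put rough numbers on that (back-of-the-envelope arithmetic, not a
measurement): sustained TCP throughput is bounded by window/RTT, so a
4MB effective window over a 50ms cross-region link tops out around
80MB/s no matter how fat the pipe is, while the same window over a
0.1ms local link could in principle carry ~40GB/s. Multiple connections
are one blunt way to raise that ceiling.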

> It was in kernel buffer cache. But I can reach 100% utilization of
> storage too (which is slightly slower than what I can do over unix
> socket).
>
> pg_basebackup --manifest-checksums=none -h /tmp/ -D- -Ft -cfast -Xnone |pv 
> -B16M -r -a > /dev/null
> 2.59GiB/s
> find /srv/dev/pgdev-dev/base/ -type f -exec dd if={} bs=32k status=none \; 
> |pv -B16M -r -a > /dev/null
> 2.53GiB/s
> find /srv/dev/pgdev-dev/base/ -type f -exec cat {} + |pv -B16M -r -a > 
> /dev/null
> 2.42GiB/s
>
> I tested this with a -s 5000 DB, FWIW.

But that's not a real test either, because you're not writing the data
anywhere. It's going to be a whole lot easier to saturate the read
side if the write side is always zero latency.

> > How do you expect to take advantage of I/O parallelism without
> > multiple processes/connections?
>
> Which kind of I/O parallelism are you thinking of? Independent
> tablespaces? Or devices that can handle multiple in-flight IOs? WRT the
> latter, at least linux will keep many IOs in-flight for sequential
> buffered reads.

Both. I know that the kernel will prefetch for sequential reads, but
it won't know what file you're going to access next, so I think you'll
tend to stall when you reach the end of each file. It also seems
possible that on a large disk array, you could read N files at a time
with greater aggregate bandwidth than you can read a single file.
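
Just as a sketch of a single-process mitigation for the end-of-file
stall (assuming POSIX; "next_path" and the fd names are hypothetical):
hint the next file while the current one is still being read.

    #include <fcntl.h>      /* open(), posix_fadvise() */

    /* "next_path" is whatever file the backup will read after this one */
    int         next_fd = open(next_path, O_RDONLY);

    if (next_fd >= 0)
        (void) posix_fadvise(next_fd, 0, 0, POSIX_FADV_WILLNEED);
    /* ... keep reading current_fd; switch to next_fd at its EOF ... */

That still wouldn't address the aggregate-bandwidth point, of course.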

> > It seems to me that the interesting cases may involve having lots of
> > available CPUs and lots of disk spindles, but a comparatively slow
> > pipe between the machines.
>
> Hm, I'm not sure I am following. If network is the bottleneck, we'd
> immediately fill the buffers, and that'd be that?
>
> ISTM all of this is only really relevant if either pg_basebackup or
> walsender is the bottleneck?

I agree that if neither pg_basebackup nor walsender is the bottleneck,
parallelism is unlikely to be very effective. I have realized as a
result of your comments that I actually don't care intrinsically about
parallel backup; what I actually care about is making backups very,
very fast. I suspect that parallelism is a useful means to that end,
but I interpret your comments as questioning that, and specifically
drawing attention to the question of where the bottlenecks might be.
So I'm trying to think about that.

> I think it's fairly obvious that we need faster compression - and that
> while we clearly can win a lot by just using a faster
> algorithm/implementation than standard zlib, we'll likely also need
> parallelism in some form.  I'm doubtful that using multiple connections
> and multiple backends is the best way to achieve that, but it'd be a
> way.

I think it has a good chance of being pretty effective, but it's
certainly worth casting about for other possibilities that might
deliver more benefit or be less work. In terms of better compression,
I did a little looking around and it seems like LZ4 is generally
agreed to be a lot faster than gzip, and also significantly faster
than most other things that one might choose to use. On the other
hand, the compression ratio may not be as good; e.g.
https://facebook.github.io/zstd/ cites a 2.1 ratio (on some data set)
for lz4 and a 2.9 ratio for zstd. While the compression and
decompression speeds are slower, they are close enough that you might
be able to make up the difference by using 2x the cores for
compression and 3x for decompression. I don't know if that sort of
thing is worth considering. If your limitation is the line speed, and
you have CPU cores to burn, a significantly higher compression
ratio means significantly faster backups. On the other hand, if you're
backing up over the LAN and the machine is heavily taxed, that's
probably not an appealing trade.
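
To make that concrete with the ratios above (rough arithmetic, not a
benchmark): on a 1Gbit/s (~120MB/s) link, lz4 at 2.1x would deliver
~250MB/s of logical data and zstd at 2.9x about ~350MB/s, so as long as
enough cores keep the compressor fed, the better ratio wins end-to-end
even though zstd burns more CPU per byte.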

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Do we need to handle orphaned prepared transactions in the server?

2020-04-21 Thread Tom Lane
Bruce Momjian  writes:
> On Tue, Apr 21, 2020 at 01:52:46PM +0900, Michael Paquier wrote:
>> On Mon, Apr 20, 2020 at 10:35:15PM -0400, Bruce Momjian wrote:
>>> On Thu, Apr 16, 2020 at 03:11:51PM -0400, Tom Lane wrote:
 If we were going to go down the path of periodically logging warnings
 about old prepared transactions, some single-instance background task
 like the checkpointer would be a better place to do the work in.  But
 I'm not really recommending that, because I agree with Robert that
 we just plain don't want this functionality.

> Sorry, I meant something in the Postgres logs at postmaster start.

That seems strictly worse than periodic logging as far as the probability
that somebody will notice the log entry goes.  In any case it would only
help people when they restart their postmaster, which ought to be pretty
infrequent in a production situation.

regards, tom lane




Re: More efficient RI checks - take 2

2020-04-21 Thread Tom Lane
Alvaro Herrera  writes:
> I do wonder if the RI stuff would actually end up being faster without
> SPI.  If not, we'd only end up writing more code to do the same thing.
> Now that tables can be partitioned, it is much more of a pain than when
> only regular tables could be supported.  Obviously without SPI you
> wouldn't *have* to go through the planner, which might be a win in
> itself if the execution tree to use were always perfectly clear ... but
> now that the queries get more complex per partitioning and this
> optimization, is it?

AFAIK, we do not have any code besides the planner that is capable of
building a plan tree at all, and I'd be pretty hesitant to try to create
such; those things are complicated.

It'd likely only make sense to bypass the planner if the required work
is predictable enough that you don't need a plan tree at all, but can
just hard-wire what has to be done.  That seems a bit unlikely in the
presence of partitioning.

Instead of a plan tree, you could build a parse tree to pass through the
planner, rather than building a SQL statement string that has to be
parsed.  The code jumps through enough hoops to make sure the string will
be parsed "just so" that this might net out to about an equal amount of
code in ri_triggers.c, and it'd save a nontrivial amount of parsing work.
But you'd have to abandon SPI, probably, or at least it's not clear how
much that'd be doing for you anymore.

regards, tom lane




Re: design for parallel backup

2020-04-21 Thread Andres Freund
Hi,

On 2020-04-21 14:01:28 -0400, Robert Haas wrote:
> On Tue, Apr 21, 2020 at 11:36 AM Andres Freund  wrote:
> > It's all CRC overhead. I don't see a difference with
> > --manifest-checksums=none anymore. We really should look for a better
> > "fast" checksum.
>
> Hmm, OK. I'm wondering exactly what you tested here. Was this over
> your 20GiB/s connection between laptop and workstation, or was this
> local TCP?

It was local TCP. The speeds I can reach are faster than the 10GiB/s
(unidirectional) I can do between the laptop & workstation, so testing
it over "actual" network isn't informative - I basically can reach line
speed between them with any method.


> Also, was the database being read from persistent storage, or was it
> RAM-cached?

It was in kernel buffer cache. But I can reach 100% utilization of
storage too (which is slightly slower than what I can do over unix
socket).

pg_basebackup --manifest-checksums=none -h /tmp/ -D- -Ft -cfast -Xnone |pv 
-B16M -r -a > /dev/null
2.59GiB/s
find /srv/dev/pgdev-dev/base/ -type f -exec dd if={} bs=32k status=none \; |pv 
-B16M -r -a > /dev/null
2.53GiB/s
find /srv/dev/pgdev-dev/base/ -type f -exec cat {} + |pv -B16M -r -a > /dev/null
2.42GiB/s

I tested this with a -s 5000 DB, FWIW.


> How do you expect to take advantage of I/O parallelism without
> multiple processes/connections?

Which kind of I/O parallelism are you thinking of? Independent
tablespaces? Or devices that can handle multiple in-flight IOs? WRT the
latter, at least linux will keep many IOs in-flight for sequential
buffered reads.


> - UNIX socket was slower than a local TCP socket, and about the same
> speed as a TCP socket with SSL.

Hm. Interesting. Wonder if that's a question of the unix socket buffer
size?

> - CRC-32C is about 10% slower than no manifest and/or no checksums in
> the manifest. SHA256 is 1.5-2x slower, but less when compression is
> also used (see below).
> - Plain format is a little slower than tar format; tar with gzip is
> typically >~5x slower, but less when the checksum algorithm is SHA256
> (again, see below).

I see about 250MB/s with -Z1 (from the source side). If I hack
pg_basebackup.c to specify a deflate level of 0 to gzsetparams, which
zlib docs says should disable compression, I get up to 700MB/s. Which
still is a factor of ~3.7 to uncompressed.
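
(For reference, the hack is roughly a one-liner in the tar-file setup
path; the variable name below is from memory, so treat it as
pseudo-code:

    /* level 0 = zlib "stored" mode: no compression, but still a CRC */
    if (gzsetparams(ztarfile, 0, Z_DEFAULT_STRATEGY) != Z_OK)
        ...existing error handling...

gzip computes a CRC-32 of the input even at level 0.)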

This seems largely due to zlib's crc32 computation not being hardware
accelerated:
-   99.75% 0.05%  pg_basebackup  pg_basebackup   [.] BaseBackup
   - 99.95% BaseBackup
  - 81.60% writeTarData
 - gzwrite
 - gz_write
- gz_comp.constprop.0
   - 85.11% deflate
  - 97.66% deflate_stored
 + 87.45% crc32_z
 + 9.53% __memmove_avx_unaligned_erms
 + 3.02% _tr_stored_block
2.27% __memmove_avx_unaligned_erms
   + 14.86% __libc_write
  + 18.40% pqGetCopyData3
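
(The SSE4.2 crc32 instruction implements only the Castagnoli polynomial,
i.e. CRC-32C, so it can't help zlib's classic CRC-32; CRC-32C, as used
by the backup manifest, can be hardware-assisted.  A sketch of that kind
of loop, roughly what src/port/pg_crc32c_sse42.c does:

    #include <nmmintrin.h>  /* SSE4.2 intrinsics; build with -msse4.2 */
    #include <stdint.h>
    #include <string.h>

    static uint32_t
    crc32c_sse42(uint32_t crc, const unsigned char *p, size_t len)
    {
        /* fold eight input bytes per instruction ... */
        while (len >= 8)
        {
            uint64_t    v;

            memcpy(&v, p, 8);
            crc = (uint32_t) _mm_crc32_u64(crc, v);
            p += 8;
            len -= 8;
        }
        /* ... then mop up the tail a byte at a time */
        while (len-- > 0)
            crc = _mm_crc32_u8(crc, *p++);
        return crc;
    }

Even hardware-assisted, that's real per-byte work on top of shoveling
the data.)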



> It seems to me that the interesting cases may involve having lots of
> available CPUs and lots of disk spindles, but a comparatively slow
> pipe between the machines.

Hm, I'm not sure I am following. If network is the bottleneck, we'd
immediately fill the buffers, and that'd be that?

ISTM all of this is only really relevant if either pg_basebackup or
walsender is the bottleneck?


> I mean, if it takes 36 hours to read the
> data from disk, you can't realistically expect to complete a full
> backup in less than 36 hours. Incremental backup might help, but
> otherwise you're just dead. On the other hand, if you can read the
> data from the disk in 2 hours but it takes 36 hours to complete a
> backup, it seems like you have more justification for thinking that
> the backup software could perhaps do better. In such cases efficient
> server-side compression may help a lot, but even then, I wonder
> whether you can you read the data at maximum speed with only a single
> process? I tend to doubt it, but I guess you only have to be fast
> enough to saturate the network. Hmm.

Well, I can do >8GByte/s of buffered reads in a single process
(obviously cached, because I don't have storage quite that fast -
uncached I can read at nearly 3GByte/s, the disk's speed). So sure,
there's a limit to what a single process can do, but I think we're
fairly far away from it.

I think it's fairly obvious that we need faster compression - and that
while we clearly can win a lot by just using a faster
algorithm/implementation than standard zlib, we'll likely also need
parallelism in some form.  I'm doubtful that using multiple connections
and multiple backends is the best way to achieve that, but it'd be a
way.

Greetings,

Andres Freund




Re: Do we need to handle orphaned prepared transactions in the server?

2020-04-21 Thread Bruce Momjian
On Tue, Apr 21, 2020 at 01:52:46PM +0900, Michael Paquier wrote:
> On Mon, Apr 20, 2020 at 10:35:15PM -0400, Bruce Momjian wrote:
> > On Thu, Apr 16, 2020 at 03:11:51PM -0400, Tom Lane wrote:
> >> If we were going to go down the path of periodically logging warnings
> >> about old prepared transactions, some single-instance background task
> >> like the checkpointer would be a better place to do the work in.  But
> >> I'm not really recommending that, because I agree with Robert that
> >> we just plain don't want this functionality.
> > 
> > I thought we would just emit a warning at boot time.
> 
> That's more tricky than boot time (did you mean postmaster context?),
> especially if you are starting a cluster from a base backup as you
> have no guarantee that the 2PC information is consistent by just
> looking at what's on disk (some of the 2PC files may still be in WAL
> records to-be-replayed), so a natural candidate to gather the
> information wanted here would be RecoverPreparedTransactions() for a
> primary, and StandbyRecoverPreparedTransactions() for a standby.

Sorry, I meant something in the Postgres logs at postmaster start.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: HEAPDEBUGALL is broken

2020-04-21 Thread Tom Lane
Peter Eisentraut  writes:
> On 2020-04-19 15:37, Tom Lane wrote:
>> +1 for removing both.  There are a lot of such debug "features"
>> in the code, and few of them are worth anything IME.

> removed

I don't see a commit?

regards, tom lane




Re: HEAPDEBUGALL is broken

2020-04-21 Thread Peter Eisentraut

On 2020-04-19 15:37, Tom Lane wrote:

Peter Eisentraut  writes:

The HEAPDEBUGALL define has been broken since PG12 due to tableam
changes.  Should we just remove this?  It doesn't look very useful.
It's been around since Postgres95.
If we opt for removing: PG12 added an analogous HEAPAMSLOTDEBUGALL
(which still compiles correctly).  Would we want to keep that?


+1 for removing both.  There are a lot of such debug "features"
in the code, and few of them are worth anything IME.


removed

--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: design for parallel backup

2020-04-21 Thread Robert Haas
On Tue, Apr 21, 2020 at 11:36 AM Andres Freund  wrote:
> It's all CRC overhead. I don't see a difference with
> --manifest-checksums=none anymore. We really should look for a better
> "fast" checksum.

Hmm, OK. I'm wondering exactly what you tested here. Was this over
your 20GiB/s connection between laptop and workstation, or was this
local TCP? Also, was the database being read from persistent storage,
or was it RAM-cached? How do you expect to take advantage of I/O
parallelism without multiple processes/connections?

Meanwhile, I did some local-only testing on my new 16GB MacBook Pro
laptop with all combinations of:

- UNIX socket, local TCP socket, local TCP socket with SSL
- Plain format, tar format, tar format with gzip
- No manifest ("omit"), manifest with no checksums, manifest with
CRC-32C checksums, manifest with SHA256 checksums.

The database is a fresh scale-factor 1000 pgbench database. No
concurrent database load. Observations:

- UNIX socket was slower than a local TCP socket, and about the same
speed as a TCP socket with SSL.
- CRC-32C is about 10% slower than no manifest and/or no checksums in
the manifest. SHA256 is 1.5-2x slower, but less when compression is
also used (see below).
- Plain format is a little slower than tar format; tar with gzip is
typically >~5x slower, but less when the checksum algorithm is SHA256
(again, see below).
- SHA256 + tar format with gzip is the slowest combination, but it's
"only" about 15% slower than no manifest, and about 3.3x slower than
no compression, presumably because the checksumming is slowing down
the server and the compression is slowing down the client.
- Fastest speeds I see in any test are ~650MB/s, and slowest are
~65MB/s, obviously benefiting greatly from the fact that this is a
local-only test.
- The time for a raw cp -R of the backup directory is about 10s, and
the fastest time to take a backup (tcp+tar+m:omit) is about 22s.
- In all cases I've checked so far both pg_basebackup and the server
backend are pegged at 98-100% CPU usage. I haven't looked into where
that time is going yet.

Full results and test script attached. I and/or my colleagues will try
to test out some other environments, but I'm not sure we have easy
access to anything as high-powered as a 20GiB/s interconnect.

It seems to me that the interesting cases may involve having lots of
available CPUs and lots of disk spindles, but a comparatively slow
pipe between the machines. I mean, if it takes 36 hours to read the
data from disk, you can't realistically expect to complete a full
backup in less than 36 hours. Incremental backup might help, but
otherwise you're just dead. On the other hand, if you can read the
data from the disk in 2 hours but it takes 36 hours to complete a
backup, it seems like you have more justification for thinking that
the backup software could perhaps do better. In such cases efficient
server-side compression may help a lot, but even then, I wonder
whether you can read the data at maximum speed with only a single
process? I tend to doubt it, but I guess you only have to be fast
enough to saturate the network. Hmm.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
unix+plain+m:omit 45.330231 39.744768 42.264874
unix+tar+m:omit 32.767868 36.227577 35.742676
unix+targzip+m:omit 189.888759 188.50978 195.031579
tcp+plain+m:omit 29.758328 29.752194 30.242722
tcp+tar+m:omit 22.723285 24.207906 25.760343
tcp+targzip+m:omit 169.138959 169.258639 175.542641
tcpssl+plain+m:omit 39.846619 39.846117 39.813978
tcpssl+tar+m:omit 33.796067 33.26529 36.273268
tcpssl+targzip+m:omit 176.152361 188.631385 188.13747
unix+plain+m:none 39.791435 41.266778 41.276581
unix+tar+m:none 34.730879 35.743352 35.228618
unix+targzip+m:none 187.237583 196.353209 191.903126
tcp+plain+m:none 29.486182 31.252909 30.75195
tcp+tar+m:none 23.328765 24.725305 24.726108
tcp+targzip+m:none 165.605943 174.549782 176.558254
tcpssl+plain+m:none 39.403247 40.286618 40.27026
tcpssl+tar+m:none 34.288039 35.298867 32.776008
tcpssl+targzip+m:none 176.625429 186.062193 188.132955
unix+plain+m:crc32c 44.262171 47.808442 46.307146
unix+tar+m:crc32c 38.256322 41.773116 39.2641
unix+targzip+m:crc32c 191.790014 200.942023 197.461197
tcp+plain+m:crc32c 29.945689 30.731167 30.757778
tcp+tar+m:crc32c 23.712764 24.246618 25.229781
tcp+targzip+m:crc32c 167.55035 174.614658 172.695148
tcpssl+plain+m:crc32c 39.334947 39.818292 40.317529
tcpssl+tar+m:crc32c 34.786403 33.2 35.283009
tcpssl+targzip+m:crc32c 175.731637 188.610379 186.590181
unix+plain+m:sha256 70.567034 74.90739 72.873767
unix+tar+m:sha256 67.808388 68.342512 70.912017
unix+targzip+m:sha256 219.002669 229.611438 230.585041
tcp+plain+m:sha256 46.369327 46.8214 47.857381
tcp+tar+m:sha256 46.843969 46.356686 46.847939
tcp+targzip+m:sha256 172.02733 174.575485 177.553958
tcpssl+plain+m:sha256 54.344198 54.362139 55.845257
tcpssl+tar+m:sha256 53.341627 54.377519 56.382529
tcpssl+targzip+m:sha256 

Re: HEAPDEBUGALL is broken

2020-04-21 Thread Peter Eisentraut

On 2020-04-19 22:00, Alexander Lakhin wrote:
To the point, I've tried to use HAVE_ALLOCINFO on master today and it 
failed too:
$ CPPFLAGS="-DHAVE_ALLOCINFO" ./configure --enable-tap-tests 
--enable-debug --enable-cassert  >/dev/null && make -j16 >/dev/null

generation.c: In function ‘GenerationAlloc’:
generation.c:191:11: error: ‘GenerationContext {aka struct 
GenerationContext}’ has no member named ‘name’

  (_cxt)->name, (_chunk), (_chunk)->size)
    ^
generation.c:386:3: note: in expansion of macro ‘GenerationAllocInfo’
    GenerationAllocInfo(set, chunk);
    ^~~
generation.c:191:11: error: ‘GenerationContext {aka struct 
GenerationContext}’ has no member named ‘name’

  (_cxt)->name, (_chunk), (_chunk)->size)
    ^
generation.c:463:2: note: in expansion of macro ‘GenerationAllocInfo’
   GenerationAllocInfo(set, chunk);
   ^~~


Do you have a proposed patch?

--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: DETACH PARTITION and FOR EACH ROW triggers on partitioned tables

2020-04-21 Thread Justin Pryzby
On Tue, Apr 21, 2020 at 12:20:38PM -0400, Alvaro Herrera wrote:
> diff --git a/doc/src/sgml/ref/alter_table.sgml 
> b/doc/src/sgml/ref/alter_table.sgml
> index 7595e609b5..233905552c 100644
> --- a/doc/src/sgml/ref/alter_table.sgml
> +++ b/doc/src/sgml/ref/alter_table.sgml
> @@ -941,13 +943,14 @@ WITH ( MODULUS  class="parameter">numeric_literal, REM
>  DETACH PARTITION  class="parameter">partition_name
>  
>   
>This form detaches specified partition of the target table.  The 
> detached
>partition continues to exist as a standalone table, but no longer has 
> any
>ties to the table from which it was detached.  Any indexes that were
> -  attached to the target table's indexes are detached.
> +  attached to the target table's indexes are detached.  Any triggers that
> +  were created to mirror those in the target table are removed.

Can you say "cloned" here instead of mirror ?
> +  attached to the target table's indexes are detached.  Any triggers that
> +  were created as clones of triggers in the target table are removed.

Also, I see in the surrounding context a missing word? 
This form detaches THE specified partition of the target table.

-- 
Justin




Re: Autovacuum on partitioned table (autoanalyze)

2020-04-21 Thread yuzuko
Hello,

On Sat, Apr 18, 2020 at 2:08 PM Justin Pryzby  wrote:
>
> On Fri, Apr 17, 2020 at 10:09:07PM +0900, Amit Langote wrote:
> > On Thu, Apr 16, 2020 at 11:19 PM Justin Pryzby  wrote:
> > > On Thu, Apr 16, 2020 at 06:16:45PM +0900, yuzuko wrote:
> > > I don't think that adequately allows what's needed.
> ...(paragraph with my typos elided)...
> > > For example, say a new customer has bunch of partitioned tables which each
> > > currently have only one partition (for the current month), and that's 
> > > expected
> > > to grow to at least 20+ partitions (2+ years of history).  How does one 
> > > set the
> > > partitioned table's auto-analyze parameters to analyze whenever any child 
> > > is
> > > analyzed ?  I don't think it should be needed to update it every month 
> > > after
> > > computing sum(child tuples).
> > >
> > > Possibly you could allow that behavior for some special values of the
> > > threshold.  Like if autovacuum_analyze_threshold=-2, then analyze the 
> > > parent
> > > whenever any of its children are analyzed.
> > >
> > > I think that use case and that need would be common, but I'd like to hear 
> > > what
> > > others think.
> >
> > Having to constantly pay attention to whether a parent's
> > analyze_threshold/scale_factor is working as intended would surely be
> > an annoyance, so I tend to agree that we might need more than just the
> > ability to set analyze_threshold/scale_factor on parent tables.
> > However, I think we can at least start with being able to do
> > *something* here. :)  Maybe others think that this shouldn't be
> > considered committable until we figure out a good analyze threshold
> > calculation formula to apply to parent tables.
> >
> > Considering that, how about having, say, a
> > autovacuum_analyze_partition_parent_frequency, with string values
> > 'default', 'partition'? -- 'default' assumes the same formula as
> > regular tables, whereas with 'partition', parent is analyzed as soon
> > as a partition is.
>
> I assume you mean a reloption to be applied only to partitioned tables,
>
> Your "partition" setting would mean that the scale/threshold values would have
> no effect, which seems kind of unfortunate.
>
> I think it should be called something else, and done differently, like maybe:
> autovacuum_analyze_mode = {off,sum,max,...}
>
Will the above reloption you suggested be applied to all tables?
Users might not use it for partitions, so I think we should add "parent"
to the reloption's name, as Amit suggested.

> The threshold would be threshold + scale*tuples, as always, but would be
> compared with f(changes) as determined by the relopt.
>
> sum(changes) would do what you called "default", comparing the sum(changes)
> across all partitions to the threshold, which is itself computed using
> sum(reltuples) AS reltuples.
>
> max(changes) would compute max(changes) compared to the threshold, and the
> threshold would be computed separately for each partition's reltuples:
> threshold_N = parent_threshold + parent_scale * part_N_tuples.  If *any*
> partition exceeds that threshold, the partition itself is analyzed.  This
> allows what I want for time-series.  Maybe this would have an alias called
> "any".
>
I may be wrong, but I think the formula,
> threshold_N = parent_threshold + parent_scale * part_N_tuples
should use the ordinary table's threshold, not the parent's.  If it uses
parent_threshold, the parent might not be analyzed even if one of its
partitions is analyzed, when parent_threshold is larger than the normal
threshold.  I'm worried whether that case meets the requirements for
time-series.
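
To illustrate the concern with concrete numbers (a made-up example):
with parent_threshold = 50 and parent_scale = 0.2, a partition holding
1,000,000 rows would need 50 + 0.2 * 1,000,000 = 200,050 changed tuples
before the parent is analyzed, while the partition itself (default
threshold 50, scale factor 0.1) is analyzed after only ~100,050
changes, so the parent could keep lagging behind its partitions.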

> I'm not sure if there's any other useful modes, like avg(changes)?  I guess we
> can add them later if someone thinks of a good use case.
>
> Also, for me, the v7 patch warns:
> |src/backend/postmaster/autovacuum.c:3117:70: warning: ‘reltuples’ may be 
> used uninitialized in this function [-Wmaybe-uninitialized]
> |   vacinsthresh = (float4) vac_ins_base_thresh + vac_ins_scale_factor * 
> reltuples;
> ..which seems to be a false positive, but easily avoided.
>
Thank you for testing the patch.
I got it.  I'll update the patch soon.

>
> This patch includes partitioned tables in pg_stat_*_tables, which is great; I
> complained awhile ago that they were missing [0].  It might be useful if that
> part was split out into a separate 0001 patch (?).
>
If the partitioned table's statistics are used for other purposes, I think
it would be better to split the patch.  Does anyone have any opinion?

---
Best regards,
Yuzuko Hosoya
NTT Open Source Software Center




Re: DETACH PARTITION and FOR EACH ROW triggers on partitioned tables

2020-04-21 Thread Alvaro Herrera
I think I also owe the attached doc updates.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/doc/src/sgml/ref/alter_table.sgml b/doc/src/sgml/ref/alter_table.sgml
index 7595e609b5..233905552c 100644
--- a/doc/src/sgml/ref/alter_table.sgml
+++ b/doc/src/sgml/ref/alter_table.sgml
@@ -869,13 +869,15 @@ WITH ( MODULUS numeric_literal, REM
   one will be created in the attached table; or, if an equivalent
   index already exists, it will be attached to the target table's index,
   as if ALTER INDEX ATTACH PARTITION had been executed.
   Note that if the existing table is a foreign table, it is currently not
   allowed to attach the table as a partition of the target table if there
   are UNIQUE indexes on the target table.  (See also
-  .)
+  .)  For each user-defined
+  row-level trigger that exists in the target table, a corresponding one
+  is created in the attached table.
  
 
  
   A partition using FOR VALUES uses same syntax for
   partition_bound_spec as
   .  The partition bound specification
@@ -941,13 +943,14 @@ WITH ( MODULUS numeric_literal, REM
 DETACH PARTITION partition_name
 
  
   This form detaches specified partition of the target table.  The detached
   partition continues to exist as a standalone table, but no longer has any
   ties to the table from which it was detached.  Any indexes that were
-  attached to the target table's indexes are detached.
+  attached to the target table's indexes are detached.  Any triggers that
+  were created to mirror those in the target table are removed.
  
 

 
   
   
diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml
index 155866c7c8..c49c770dd0 100644
--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -396,13 +396,15 @@ WITH ( MODULUS numeric_literal, REM
 PARTITION OF parent_table { FOR VALUES partition_bound_spec | DEFAULT }
 
  
   Creates the table as a partition of the specified
   parent table. The table can be created either as a partition for specific
   values using FOR VALUES or as a default partition
-  using DEFAULT.
+  using DEFAULT.  Any indexes, constraints and
+  user-defined row-level triggers that exist in the parent table are cloned
+  on the new partition.
  
 
  
   The partition_bound_spec
   must correspond to the partitioning method and partition key of the
   parent table, and must not overlap with any existing partition of that


Re: DETACH PARTITION and FOR EACH ROW triggers on partitioned tables

2020-04-21 Thread Alvaro Herrera
On 2020-Apr-20, Alvaro Herrera wrote:

> + while (HeapTupleIsValid(trigtup = systable_getnext(scan)))
> + {
> + Form_pg_trigger pg_trigger = (Form_pg_trigger) 
> GETSTRUCT(trigtup);
> + ObjectAddress trig;
> +
> + /* Ignore triggers that weren't cloned */
> + if (!OidIsValid(pg_trigger->tgparentid) ||
> + !pg_trigger->tgisinternal ||
> + !TRIGGER_FOR_ROW(pg_trigger->tgtype))
> + continue;

Actually, shouldn't we be checking just "!OidIsValid(pg_trigger->tgparentid)"
here?  Surely the other two conditions should already not matter either
way if tgparentid is set.  I can't see us starting to clone
for-statement triggers, but I'm not sure I trust the internal marking to
remain one way or the other.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: design for parallel backup

2020-04-21 Thread Andres Freund
Hi,

On 2020-04-21 07:18:20 -0400, Robert Haas wrote:
> On Tue, Apr 21, 2020 at 2:44 AM Andres Freund  wrote:
> > FWIW, I just tested pg_basebackup locally.
> >
> > Without compression and a stock postgres I get:
> > unixtcp  tcp+ssl:
> > 1.74GiB/s   1.02GiB/s699MiB/s
> >
> > That turns out to be bottlenecked by the backup manifest generation.
> 
> Whoa. That's unexpected, at least for me. Is that because of the
> CRC-32C overhead, or something else? What do you get with
> --manifest-checksums=none?

It's all CRC overhead. I don't see a difference with
--manifest-checksums=none anymore. We really should look for a better
"fast" checksum.

Regards,

Andres




Re: More efficient RI checks - take 2

2020-04-21 Thread Alvaro Herrera
On 2020-Apr-20, Corey Huinker wrote:

> > I can imagine removal of the SPI from the current implementation (and
> > constructing the plans "manually"), but note that the queries I use in my
> > patch are no longer that trivial. So the SPI makes sense to me because it
> > ensures regular query planning.
> 
> As an intermediate step, in the case where we have one row, it should be
> simple enough to extract that row manually, and do an SPI call with fixed
> values rather than the join to the ephemeral table, yes?

I do wonder if the RI stuff would actually end up being faster without
SPI.  If not, we'd only end up writing more code to do the same thing.
Now that tables can be partitioned, it is much more of a pain than when
only regular tables could be supported.  Obviously without SPI you
wouldn't *have* to go through the planner, which might be a win in
itself if the execution tree to use were always perfectly clear ... but
now that the queries get more complex per partitioning and this
optimization, is it?

You could remove the crosscheck_snapshot feature from SPI, I suppose,
but that's not that much code.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [PATCH] distinct aggregates within a window function WIP

2020-04-21 Thread Andreas Karlsson

On 4/21/20 5:06 PM, Eugen Konkov wrote:

Hi. I read the thread.

Probably this fiddle will be helpful for testing:

https://dbfiddle.uk/?rdbms=postgres_12&fiddle=abe845142a5099d921d3729043fb8491

I recently encountered a problem:
Why do window-specific functions not allow DISTINCT to be used within the
function argument list?

sum( DISTINCT order_cost ) OVER ( PARTITION BY invoice_id ORDER BY invoice_id,
group_id RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING )

behavior is quite deterministic:

ORDER BY will create peers in partition
DISTINCT will get only one peer

I resolved my problem via two subqueries, but it seems this logic may
be applied to window functions (I did not check this for other functions, though).


Sorry, I do not follow. What problem did you encounter?

Andreas





[PATCH] distinct aggregates within a window function WIP

2020-04-21 Thread Eugen Konkov
Hi. I read the thread. 

Probably this fiddle will be helpful for testing:

https://dbfiddle.uk/?rdbms=postgres_12&fiddle=abe845142a5099d921d3729043fb8491

I recently encountered a problem:
Why do window-specific functions not allow DISTINCT to be used within the
function argument list?

sum( DISTINCT order_cost ) OVER ( PARTITION BY invoice_id ORDER BY invoice_id,
group_id RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING )

behavior is quite deterministic:

ORDER BY will create peers in partition
DISTINCT will get only one peer

I resolved my problem via two subqueries, but it seems this logic may
be applied to window functions (I did not check this for other functions, though).
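
For the record, a sketch of that two-subquery workaround (table and
column names invented for illustration):

SELECT i.*, d.distinct_sum
FROM order_items i
JOIN (SELECT invoice_id, sum(order_cost) AS distinct_sum
      FROM (SELECT DISTINCT invoice_id, order_cost
            FROM order_items) s
      GROUP BY invoice_id) d USING (invoice_id);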

-- 
Best regards,
Eugen Konkov





[IBM z Systems] Getting server crash when jit_above_cost =0

2020-04-21 Thread tushar

Hi,

We are getting a server crash on a zlinux machine if we set
jit_above_cost=0 in the postgresql.conf file after configuring the PG v12
server with --with-llvm (llvm-toolset-6.0).


We configured the PG v12 sources with the --with-llvm switch (after setting
these variables on the command prompt):


 export 
LD_LIBRARY_PATH=/opt/rh/llvm-toolset-6.0/root/usr/lib64:$LD_LIBRARY_PATH

 export LLVM_CONFIG=/opt/rh/llvm-toolset-6.0/root/usr/bin/llvm-config
 export CLANG=/opt/rh/llvm-toolset-6.0/root/usr/bin/clang
 export LDFLAGS="-Wl,-rpath,/opt/rh/llvm-toolset-6.0/root/lib64 
${LDFLAGS}"; export LDFLAGS


postgresql.conf file -

"
shared_preload_libraries=$libdir/llvmjit' ,
jit_provider = 'llvmjit'  ,
jit_above_cost = 0
jit=on,
"

We are able to see the crash against any SQL query:

psql (12.2)
Type "help" for help.

postgres=# select 5;
2020-04-21 07:33:15.980 CDT [48149] DEBUG:  StartTransaction(1) name: 
unnamed; blockState: DEFAULT; state: INPROGRESS, xid/subid/cid: 0/1/0
2020-04-21 07:33:15.980 CDT [48149] DEBUG:  probing availability of JIT 
provider at /home/edb/pg/edb/edbpsql/lib/postgresql/llvmjit.so
2020-04-21 07:33:15.980 CDT [48149] DEBUG:  successfully loaded JIT 
provider in current session
2020-04-21 07:33:15.981 CDT [48149] DEBUG:  LLVMJIT detected CPU "z13", 
with features ""

terminate called after throwing an instance of 'std::bad_function_call'
  what():  bad_function_call
server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.
The connection to the server was lost. Attempting reset: 2020-04-21 
07:33:16.476 CDT [48137] DEBUG:  reaping dead processes


*Stack trace*

[edb@etpgabc bin]$ gdb -q -c data/core.31542 postgres
Reading symbols from /home/edb/pg/edb/edbpsql/bin/postgres...done.
[New LWP 31542]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `postgres: edb postgres [local] SELECT '.
Program terminated with signal 6, Aborted.
#0  0x03ffa9841220 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install 
glibc-2.17-260.el7.s390x libedit-3.0-12.20121213cvs.el7.s390x 
libffi-3.0.13-18.el7.s390x libgcc-4.8.5-39.el7.s390x 
libstdc++-4.8.5-39.el7.s390x 
llvm-toolset-6.0-llvm-libs-6.0.1-5.el7.s390x 
ncurses-libs-5.9-14.20130511.el7_4.s390x zlib-1.2.7-18.el7.s390x

(gdb) bt
#0  0x03ffa9841220 in raise () from /lib64/libc.so.6
#1  0x03ffa9842aa8 in abort () from /lib64/libc.so.6
#2  0x03ff9f7881b4 in __gnu_cxx::__verbose_terminate_handler() () 
from /lib64/libstdc++.so.6

#3  0x03ff9f785c7e in ?? () from /lib64/libstdc++.so.6
#4  0x03ff9f785cb6 in std::terminate() () from /lib64/libstdc++.so.6
#5  0x03ff9f785f60 in __cxa_throw () from /lib64/libstdc++.so.6
#6  0x03ff9f7e4468 in std::__throw_bad_function_call() () from 
/lib64/libstdc++.so.6
#7  0x03ffa139e5c4 in 
std::functionstd::default_delete > ()>::operator()() 
const () from /opt/rh/llvm-toolset-6.0/root/usr/lib64/libLLVM-6.0.so
#8  0x03ffa139f2a8 in LLVMOrcCreateInstance () from 
/opt/rh/llvm-toolset-6.0/root/usr/lib64/libLLVM-6.0.so

#9  0x03ffa9c8a984 in llvm_session_initialize () at llvmjit.c:670
#10 llvm_create_context (jitFlags=<optimized out>) at llvmjit.c:146
#11 0x03ffa9c98992 in llvm_compile_expr (state=0xa8c52218) at 
llvmjit_expr.c:131
#12 0x80219986 in ExecReadyExpr (state=state@entry=0xa8c52218) 
at execExpr.c:628
#13 0x8021cd6e in ExecBuildProjectionInfo (targetList=<optimized out>, econtext=<optimized out>, slot=<optimized out>, 
parent=parent@entry=0xa8c51e30, inputDesc=inputDesc@entry=0x0) at 
execExpr.c:472
#14 0x80232ed6 in ExecAssignProjectionInfo 
(planstate=planstate@entry=0xa8c51e30, inputDesc=inputDesc@entry=0x0) at 
execUtils.c:504
#15 0x80250178 in ExecInitResult (node=node@entry=0xa8c4fb98, 
estate=estate@entry=0xa8c51bf0, eflags=eflags@entry=16) at nodeResult.c:221
#16 0x8022c72c in ExecInitNode (node=node@entry=0xa8c4fb98, 
estate=estate@entry=0xa8c51bf0, eflags=eflags@entry=16) at 
execProcnode.c:164
#17 0x8022675e in InitPlan (eflags=16, queryDesc=0xa8c4f7d0) at 
execMain.c:1020
#18 standard_ExecutorStart (queryDesc=0xa8c4f7d0, eflags=16) at 
execMain.c:266
#19 0x80388868 in PortalStart (portal=portal@entry=0xa8c91c80, 
params=params@entry=0x0, eflags=eflags@entry=0, 
snapshot=snapshot@entry=0x0) at pquery.c:518
#20 0x80384b2e in exec_simple_query 
(query_string=query_string@entry=0xa8c06170 "select 5;") at postgres.c:1176
#21 0x803852e0 in PostgresMain (argc=<optimized out>, 
argv=argv@entry=0xa8c55db8, dbname=0xa8c55c80 "postgres", 
username=<optimized out>) at postgres.c:4247
#22 0x8008007e in BackendRun (port=<optimized out>, 
port=<optimized out>) at postmaster.c:4437

#23 BackendStartup (port=0xa8c4dc10) at postmaster.c:4128
#24 ServerLoop () at postmaster.c:1704
#25 0x8030c89e in PostmasterMain (argc=argc@entry=3, 
argv=argv@entry=0xa8c00cc0) at postmaster.c:1377
#26 0x800811f4 in main (argc=<optimized out>, 

Re: WIP/PoC for parallel backup

2020-04-21 Thread Amit Kapila
On Tue, Apr 21, 2020 at 5:26 PM Ahsan Hadi  wrote:
>
> On Tue, Apr 21, 2020 at 4:50 PM Amit Kapila  wrote:
>>
>> On Tue, Apr 21, 2020 at 5:18 PM Amit Kapila  wrote:
>> >
>> > On Tue, Apr 21, 2020 at 1:00 PM Asif Rehman  wrote:
>> > >
>> > > I did some tests a while back, and here are the results. The tests were 
>> > > done to simulate
>> > > a live database environment using pgbench.
>> > >
>> > > machine configuration used for this test:
>> > > Instance Type:t2.xlarge
>> > > Volume Type  :io1
>> > > Memory (MiB) :16384
>> > > vCPU #   :4
>> > > Architecture:X86_64
>> > > IOP :16000
>> > > Database Size (GB) :102
>> > >
>> > > The setup consist of 3 machines.
>> > > - one for database instances
>> > > - one for pg_basebackup client and
>> > > - one for pgbench with some parallel workers, simulating SELECT loads.
>> > >
>> > >basebackup | 4 workers | 8 Workers  | 
>> > > 16 workers
>> > > Backup Duration(Min):   69.25|  20.44  | 19.86  | 
>> > > 20.15
>> > > (pgbench running with 50 parallel client simulating SELECT load)
>> > >
>> > > Backup Duration(Min):   154.75   |  49.28 | 45.27 | 20.35
>> > > (pgbench running with 100 parallel client simulating SELECT load)
>> > >
>> >
>> > Thanks for sharing the results, these show nice speedup!  However, I
>> > think we should try to find what exactly causes this speed up.  If you
>> > see the recent discussion on another thread related to this topic,
>> > Andres pointed out that he doesn't think that we can gain much by
>> > having multiple connections[1].  It might be due to some internal
>> > limitations (like small buffers) [2] due to which we are seeing these
>> > speedups.  It might help if you can share the perf reports of the
>> > server-side and pg_basebackup side.
>> >
>>
>> Just to be clear, we need perf reports both with and without patch-set.
>
>
> These tests were done a while back; I think it would be good to run the 
> benchmark again with the latest patches of parallel backup and share the 
> results and perf reports.
>

Sounds good. I think we should also try to run the test with 1 worker.
The reason it will be good to see the results with 1 worker is that we
can know whether the technique of sending file-by-file, as done in this
patch, is better or worse than the current HEAD code.  So, it
will be good to see the results of an unpatched code, 1 worker, 2
workers, 4 workers, etc.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Fix for pg_statio_all_tables

2020-04-21 Thread Tom Lane
Alexander Korotkov  writes:
> On Tue, Apr 21, 2020 at 7:58 AM Tom Lane  wrote:
>> Yeah, but that was for a security hole.  I am doubtful that the
>> severity of this problem is bad enough to justify jumping through
>> similar hoops.  Even if we fixed it and documented it, how many
>> users would bother to apply the manual correction?

> Sure, only most conscious users will do the manual correction.  But if
> there are only two option: backpatch it this way or don't backpatch at
> all, then I would choose the first one.

Well, if it were something that you could just do and forget, then
maybe.  But actually, you are proposing to invest a lot of *other*
people's time --- notably me, as the likely author of the next
set of release notes --- so it's not entirely up to you.

Given the lack of field complaints, I'm still of the opinion that
this isn't really worth back-patching.

regards, tom lane




[SSPI] Windows group support

2020-04-21 Thread The Dude
Hi,

I have some code that I've been using in production that supports adding
and authenticating Windows groups via the pg_ident file.  It has a new
indicator (+) that signifies the identifier is a Windows group, as in the
following example:

# MAPNAME   SYSTEM-USERNAME   PG-USERNAME
"Users" "+User group"   postgres

A new function was added to test if a user token is in the Windows group:

/*
 * Check if the user (sspiToken) is a member of the specified group
 */
static BOOL
sspi_user_is_in_group(HANDLE sspiToken, LPCTSTR groupName)
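{
    /*
     * A rough sketch of what the body could look like using stock Win32
     * APIs (assumes the usual <windows.h> includes) -- an illustration
     * only, not the production code; error handling and domain-qualified
     * group names are elided.
     */
    BYTE        sidbuf[SECURITY_MAX_SID_SIZE];
    DWORD       sidsize = sizeof(sidbuf);
    TCHAR       domain[256];
    DWORD       domsize = sizeof(domain) / sizeof(TCHAR);
    SID_NAME_USE nameuse;
    BOOL        ismember = FALSE;

    /* resolve the group name to a SID */
    if (!LookupAccountName(NULL, groupName, (PSID) sidbuf, &sidsize,
                           domain, &domsize, &nameuse))
        return FALSE;

    /*
     * Ask Windows whether the token carries that SID.  Note that
     * CheckTokenMembership() expects an impersonation token; a primary
     * token has to go through DuplicateToken() first.
     */
    if (!CheckTokenMembership(sspiToken, (PSID) sidbuf, &ismember))
        return FALSE;

    return ismember;
}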

I wanted to share this as a patch for the latest, as soon as I port it to
v12.  Does this sound reasonable?

thanks,
Russell


Re: Remove non-fast promotion Re: Should we remove a fallback promotion? take 2

2020-04-21 Thread Fujii Masao




On 2020/04/21 17:15, Kyotaro Horiguchi wrote:

At Mon, 20 Apr 2020 15:26:16 +0900, Fujii Masao  
wrote in

Patch attached. I will add this into the first CF for v14.


-   if (!fast_promoted)
+   if (!promoted)
RequestCheckpoint(CHECKPOINT_END_OF_RECOVERY |
  
CHECKPOINT_IMMEDIATE |
  
CHECKPOINT_WAIT);

If we don't find the checkpoint record just before, we don't insert the
End-Of-Recovery record and instead run an immediate checkpoint.  If we
nuke the non-fast promotion, shouldn't we insert the EOR record even
in that case?


I'm not sure if that's safe. What if the server crashes before the checkpoint
completes in that case? Since the last checkpoint record is not available,
the subsequent crash recovery will fail. This would mean that the server
will never start up. Right? Currently ISTM that the end-of-recovery
checkpoint is executed to avoid such trouble in that case.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: PG compilation error with Visual Studio 2015/2017/2019

2020-04-21 Thread Juan José Santamaría Flecha
On Tue, Apr 21, 2020 at 2:22 PM Ranier Vilela  wrote:

> A few more comments.
>
> 1. Comments about order:
> /*
>  * Callback function for EnumSystemLocalesEx.
>  * Stop enumerating if a match is found for a locale with the format
>  * <language>_<country>.
>  * The order for search locale is essential:
>  * Find LCType first as LOCALE_SNAME, if not found try
> LOCALE_SENGLISHLANGUAGENAME and
>  * finally LOCALE_SENGLISHCOUNTRYNAME, before return.
>  */
>
> Typo "enumarating".
>

I would not call the order essential, it is just meant to try the easier
ways first: already "ISO" formatted? if not -> just a "language"? if not ->
a full "language_country" tag.

I take note of "enumarating".

> 2. Maybe the failure is here:
>
> if (hyphen == NULL || underscore == NULL)
>
> Change || to &&; is the logic wrong?
>

If the Windows locale does not have a hyphen ("aa") *or*  the lc_message
does not have an underscore ("Afar"), only a comparison on language is
needed.

> 3. Why iso_lc_messages[0] = '\0'?
>
> If we go on to call strchr() soon after, it's a waste.
>

Less code churn, and strchr() against an empty string did not look too
awful.

I would like to find where the errors come from before sending a new
version; can you reproduce them?

Regards,

Juan José Santamaría Flecha

>
>


Re: Concurrency bug in amcheck

2020-04-21 Thread Alexander Korotkov
On Tue, Apr 21, 2020 at 12:54 PM Alexander Korotkov
 wrote:
> I found concurrency bug in amcheck running on replica.  When
> btree_xlog_unlink_page() replays changes to replica, deleted page is
> left with no items.  But if amcheck steps on such deleted page
> palloc_btree_page() expects it would have items.

I forgot to mention that I've reproduced it on master.  Quick check
shows bug should exist since 11.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: PG compilation error with Visual Studio 2015/2017/2019

2020-04-21 Thread Ranier Vilela
On Tue, Apr 21, 2020 at 09:02, Juan José Santamaría Flecha <
juanjo.santama...@gmail.com> wrote:

>
> On Tue, Apr 21, 2020 at 12:41 PM Amit Kapila 
> wrote:
>
>> On Tue, Apr 21, 2020 at 12:51 PM Amit Kapila 
>> wrote:
>> >
>> > I have tried a simple test with the latest patch and it failed for me.
>> >
>> > Set LC_MESSAGES="English_United Kingdom";
>> > -- returns en-GB, then code changes it to en_NZ when _create_locale()
>> > is used whereas with the patch it returns "" (empty string).
>> >
>> > There seem to be two problems here (a) The input to enum_locales_fn
>> > doesn't seem to get the input name as "English_United Kingdom" due to
>> > which it can't find a match even if the same exists. (b) After
>> > executing EnumSystemLocalesEx, there is no way the patch can detect if
>> > it is successful in finding the passed name due to which it appends
>> > empty string in such cases.
>>
>> Few more comments:
>> 1. I have tried the first one in the list provided by you and that
>> also didn't work. Basically, I got  empty string when I tried Set
>> LC_MESSAGES='Afar';
>>
>
> I cannot reproduce any of these errors on my end. When using
> _create_locale(),  returning "en_NZ" is also a wrong result.
>
>
>> 2. Getting below warning
>> pg_locale.c(1072): warning C4133: 'function': incompatible types -
>> from 'const char *' to 'const wchar_t *'
>>
>
> Yes, that is a regression.
>
>
>> 3.
>> + if (GetLocaleInfoEx(pStr, LOCALE_SENGLISHCOUNTRYNAME,
>> + test_locale + len, LOCALE_NAME_MAX_LENGTH - len) > 0)
>>
>> All > or <= 0 checks should be changed to "!" types, which means to
>> check whether the call to GetLocaleInfoEx succeeds or not.
>>
>
> MSVC does not recommend "!" in all cases, but GetLocaleInfoEx() looks
> fine, so agreed.
>
> 4. In the patch, first, we try to get with LCType as LOCALE_SNAME and
>> then with LOCALE_SENGLISHLANGUAGENAME and LOCALE_SENGLISHCOUNTRYNAME.
>> I think we should add comments indicating why we try to get the locale
>> information with three LCTypes and why the specific order of trying
>> those types is required.
>>
>
> Agreed.
>
>
>> 5. In one of the previous emails, you asked whether we have a list of
>> supported locales.  I don't find any such list. I think it depends on
>> Windows locales for which you can get the information from
>>
>> https://docs.microsoft.com/en-us/openspecs/windows_protocols/ms-lcid/a9eac961-e77d-41a6-90a5-ce1a8b0cdb9c
>
>
> Yes, that is the information we get from EnumSystemLocalesEx(), without
> the additional entries _create_locale() has.
>
> Please find attached a new version addressing the points mentioned above,
> and also adding a debug message to try to get more information on the
> failed cases.
>
A few more comments.

1. Comments about order:
/*
 * Callback function for EnumSystemLocalesEx.
 * Stop enumerating if a match is found for a locale with the format
 * <language>_<country>.
 * The order for search locale is essential:
 * Find LCType first as LOCALE_SNAME, if not found try
LOCALE_SENGLISHLANGUAGENAME and
 * finally LOCALE_SENGLISHCOUNTRYNAME, before return.
 */

Typo "enumarating".

2. Maybe the failure is here:

if (hyphen == NULL || underscore == NULL)

Change || to &&; is the logic wrong?

3. Why iso_lc_messages[0] = '\0'?

If we go on to call strchr() soon after, it's a waste.

regards,
Ranier Vilela


Re: PG compilation error with Visual Studio 2015/2017/2019

2020-04-21 Thread Juan José Santamaría Flecha
On Tue, Apr 21, 2020 at 12:41 PM Amit Kapila 
wrote:

> On Tue, Apr 21, 2020 at 12:51 PM Amit Kapila 
> wrote:
> >
> > I have tried a simple test with the latest patch and it failed for me.
> >
> > Set LC_MESSAGES="English_United Kingdom";
> > -- returns en-GB, then code changes it to en_NZ when _create_locale()
> > is used whereas with the patch it returns "" (empty string).
> >
> > There seem to be two problems here (a) The input to enum_locales_fn
> > doesn't seem to get the input name as "English_United Kingdom" due to
> > which it can't find a match even if the same exists. (b) After
> > executing EnumSystemLocalesEx, there is no way the patch can detect if
> > it is successful in finding the passed name due to which it appends
> > empty string in such cases.
>
> Few more comments:
> 1. I have tried the first one in the list provided by you and that
> also didn't work. Basically, I got  empty string when I tried Set
> LC_MESSAGES='Afar';
>

I cannot reproduce any of these errors on my end. When using
_create_locale(),  returning "en_NZ" is also a wrong result.


> 2. Getting below warning
> pg_locale.c(1072): warning C4133: 'function': incompatible types -
> from 'const char *' to 'const wchar_t *'
>

Yes, that is a regression.


> 3.
> + if (GetLocaleInfoEx(pStr, LOCALE_SENGLISHCOUNTRYNAME,
> + test_locale + len, LOCALE_NAME_MAX_LENGTH - len) > 0)
>
> All > or <= 0 checks should be changed to "!" types, which means to
> check whether the call to GetLocaleInfoEx succeeds or not.
>

MSVC does not recommend "!" in all cases, but GetLocaleInfoEx() looks fine,
so agreed.

4. In the patch, first, we try to get with LCType as LOCALE_SNAME and
> then with LOCALE_SENGLISHLANGUAGENAME and LOCALE_SENGLISHCOUNTRYNAME.
> I think we should add comments indicating why we try to get the locale
> information with three LCTypes and why the specific order of trying
> those types is required.
>

Agreed.


> 5. In one of the previous emails, you asked whether we have a list of
> supported locales.  I don't find any such list. I think it depends on
> Windows locales for which you can get the information from
>
> https://docs.microsoft.com/en-us/openspecs/windows_protocols/ms-lcid/a9eac961-e77d-41a6-90a5-ce1a8b0cdb9c


Yes, that is the information we get from EnumSystemLocalesEx(), without the
additional entries _create_locale() has.

Please find attached a new version addressing the points mentioned above,
and also adding a debug message to try to get more information on the
failed cases.

Regards,

Juan José Santamaría Flecha

>
>


0001-PG-compilation-error-with-VS-2015-2017-2019_v10.patch
Description: Binary data


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2020-04-21 Thread Dilip Kumar
On Sat, Apr 18, 2020 at 6:12 PM Erik Rijkers  wrote:
>
> On 2020-04-18 11:10, Erik Rijkers wrote:
> > On 2020-04-18 11:07, Erik Rijkers wrote:
>  Hi Erik,
> 
>  While setting up the cascading replication I have hit one issue on
>  base code[1].  After fixing that I have got one crash with streaming
>  on patch.  I am not sure whether you are facing any of these 2
>  issues
> >>  or any other issue.  If your issue is not any of these then please
>  share the callstack and steps to reproduce.
> >>
> >> I figured out a few things about this. Attached is a bash script
> >> test.sh, to reproduce:
> >
> > And the attached file, test.sh.  (sorry)
>
> It turns out I must have been mistaken somewhere.  (I probably missed
> bugfix_in_schema_sent.patch)
>
> I have just now rebuilt all the instances on top of master with these
> patches:
>
> > [v14-0001-Immediately-WAL-log-assignments.patch]
> > [v14-0002-Issue-individual-invalidations-with.patch]
> > [v14-0003-Extend-the-output-plugin-API-with-stream-methods.patch]
> > [v14-0004-Gracefully-handle-concurrent-aborts-of-uncommitt.patch]
> > [v14-0005-Implement-streaming-mode-in-ReorderBuffer.patch]
> > [v14-0006-Add-support-for-streaming-to-built-in-replicatio.patch]
> > [v14-0007-Track-statistics-for-streaming.patch]
> > [v14-0008-Enable-streaming-for-all-subscription-TAP-tests.patch]
> > [v14-0009-Add-TAP-test-for-streaming-vs.-DDL.patch]
> > [v14-0010-Bugfix-handling-of-incomplete-toast-tuple.patch]
> > [bugfix_in_schema_sent.patch]
>
> (by the way: this build's regression tests  'ddl', 'toast', and
> 'spill' fail)

Yeah, this is a known issue; while streaming the
transaction, the output message is changed.  I have a plan to work on
this part.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: WIP/PoC for parallel backup

2020-04-21 Thread Ahsan Hadi
On Tue, Apr 21, 2020 at 4:50 PM Amit Kapila  wrote:

> On Tue, Apr 21, 2020 at 5:18 PM Amit Kapila 
> wrote:
> >
> > On Tue, Apr 21, 2020 at 1:00 PM Asif Rehman 
> wrote:
> > >
> > > I did some tests a while back, and here are the results. The tests
> were done to simulate
> > > a live database environment using pgbench.
> > >
> > > machine configuration used for this test:
> > > Instance Type:t2.xlarge
> > > Volume Type  :io1
> > > Memory (MiB) :16384
> > > vCPU #   :4
> > > Architecture:X86_64
> > > IOP :16000
> > > Database Size (GB) :102
> > >
> > > The setup consist of 3 machines.
> > > - one for database instances
> > > - one for pg_basebackup client and
> > > - one for pgbench with some parallel workers, simulating SELECT loads.
> > >
> > >                         basebackup | 4 workers | 8 Workers | 16 workers
> > > Backup Duration(Min):   69.25      | 20.44     | 19.86     | 20.15
> > > (pgbench running with 50 parallel clients simulating SELECT load)
> > >
> > > Backup Duration(Min):   154.75     | 49.28     | 45.27     | 20.35
> > > (pgbench running with 100 parallel clients simulating SELECT load)
> > >
> >
> > Thanks for sharing the results; these show a nice speedup!  However, I
> > think we should try to find what exactly causes this speedup.  If you
> > see the recent discussion on another thread related to this topic,
> > Andres pointed out that he doesn't think that we can gain much by
> > having multiple connections[1].  It might be due to some internal
> > limitations (like small buffers) [2] that we are seeing these
> > speedups.  It might help if you can share the perf reports of the
> > server side and pg_basebackup side.
> >
>
> Just to be clear, we need perf reports both with and without patch-set.
>

These tests were done a while back; I think it would be good to run the
benchmark again with the latest patches of parallel backup and share the
results and perf reports.

>
> --
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com
>
>
>

-- 
Highgo Software (Canada/China/Pakistan)
URL : http://www.highgo.ca
ADDR: 10318 WHALLEY BLVD, Surrey, BC
EMAIL: mailto: ahsan.h...@highgo.ca


Re: WIP/PoC for parallel backup

2020-04-21 Thread Amit Kapila
On Tue, Apr 21, 2020 at 5:18 PM Amit Kapila  wrote:
>
> On Tue, Apr 21, 2020 at 1:00 PM Asif Rehman  wrote:
> >
> > I did some tests a while back, and here are the results. The tests were 
> > done to simulate
> > a live database environment using pgbench.
> >
> > machine configuration used for this test:
> > Instance Type:t2.xlarge
> > Volume Type  :io1
> > Memory (MiB) :16384
> > vCPU #   :4
> > Architecture:X86_64
> > IOP :16000
> > Database Size (GB) :102
> >
> > The setup consists of 3 machines.
> > - one for database instances
> > - one for pg_basebackup client and
> > - one for pgbench with some parallel workers, simulating SELECT loads.
> >
> >                         basebackup | 4 workers | 8 Workers | 16 workers
> > Backup Duration(Min):   69.25      | 20.44     | 19.86     | 20.15
> > (pgbench running with 50 parallel clients simulating SELECT load)
> >
> > Backup Duration(Min):   154.75     | 49.28     | 45.27     | 20.35
> > (pgbench running with 100 parallel clients simulating SELECT load)
> >
>
> Thanks for sharing the results; these show a nice speedup!  However, I
> think we should try to find what exactly causes this speedup.  If you
> see the recent discussion on another thread related to this topic,
> Andres pointed out that he doesn't think that we can gain much by
> having multiple connections[1].  It might be due to some internal
> limitations (like small buffers) [2] that we are seeing these
> speedups.  It might help if you can share the perf reports of the
> server side and pg_basebackup side.
>

Just to be clear, we need perf reports both with and without patch-set.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: WIP/PoC for parallel backup

2020-04-21 Thread Amit Kapila
On Tue, Apr 21, 2020 at 1:00 PM Asif Rehman  wrote:
>
> I did some tests a while back, and here are the results. The tests were done 
> to simulate
> a live database environment using pgbench.
>
> machine configuration used for this test:
> Instance Type:t2.xlarge
> Volume Type  :io1
> Memory (MiB) :16384
> vCPU #   :4
> Architecture:X86_64
> IOP :16000
> Database Size (GB) :102
>
> The setup consists of 3 machines.
> - one for database instances
> - one for pg_basebackup client and
> - one for pgbench with some parallel workers, simulating SELECT loads.
>
>                         basebackup | 4 workers | 8 Workers | 16 workers
> Backup Duration(Min):   69.25      | 20.44     | 19.86     | 20.15
> (pgbench running with 50 parallel clients simulating SELECT load)
>
> Backup Duration(Min):   154.75     | 49.28     | 45.27     | 20.35
> (pgbench running with 100 parallel clients simulating SELECT load)
>

Thanks for sharing the results; these show a nice speedup!  However, I
think we should try to find what exactly causes this speedup.  If you
see the recent discussion on another thread related to this topic,
Andres pointed out that he doesn't think that we can gain much by
having multiple connections[1].  It might be due to some internal
limitations (like small buffers) [2] that we are seeing these
speedups.  It might help if you can share the perf reports of the
server side and pg_basebackup side.  We don't need a pgbench-type
workload to see what caused the speedup.

[1] - 
https://www.postgresql.org/message-id/20200420201922.55ab7ovg6535suyz%40alap3.anarazel.de
[2] - 
https://www.postgresql.org/message-id/20200421064420.z7eattzqbunbutz3%40alap3.anarazel.de

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: fixing old_snapshot_threshold's time->xid mapping

2020-04-21 Thread Dilip Kumar
On Tue, Apr 21, 2020 at 3:44 PM Thomas Munro  wrote:
>
> On Tue, Apr 21, 2020 at 2:05 PM Thomas Munro  wrote:
> > As before, these two apply on top of Robert's patches (or at least his
> > 0001 and 0002).
>
> While trying to figure out if Robert's 0003 patch was correct, I added
> yet another patch to this stack to test it.  0006 does basic xid map
> maintenance that exercises the cases 0003 fixes, and I think it
> demonstrates that they now work correctly.

+1.  I think we should also add a way to test the case where we
advance the timestamp by multiple slots.  I see that you have such a
case, e.g.:
+# test adding minutes while the map is not full
+set_time('3000-01-01 02:01:00Z');
+is(summarize_mapping(), "2|02:00:00|02:01:00");
+set_time('3000-01-01 02:05:00Z');
+is(summarize_mapping(), "6|02:00:00|02:05:00");
+set_time('3000-01-01 02:19:00Z');
+is(summarize_mapping(), "20|02:00:00|02:19:00");

But I think we should try to extend it to test that we have put the
new xid only in those slots where we are supposed to, and not in other
slots.


--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: design for parallel backup

2020-04-21 Thread Robert Haas
On Tue, Apr 21, 2020 at 2:44 AM Andres Freund  wrote:
> FWIW, I just tested pg_basebackup locally.
>
> Without compression and a stock postgres I get:
> unixtcp  tcp+ssl:
> 1.74GiB/s   1.02GiB/s699MiB/s
>
> That turns out to be bottlenecked by the backup manifest generation.

Whoa. That's unexpected, at least for me. Is that because of the
CRC-32C overhead, or something else? What do you get with
--manifest-checksums=none?

> Without compression and a stock postgres I get, and --no-manifest
> unixtcp  tcp+ssl:
> 2.51GiB/s   1.63GiB/s1.00GiB/s
>
> I.e. all of them area already above 10Gbit/s network.
>
> Looking at a profile it's clear that our small output buffer is the
> bottleneck:
> 64kB Buffers + --no-manifest:
> unixtcp  tcp+ssl:
> 2.99GiB/s   2.56GiB/s1.18GiB/s
>
> At this point the backend is not actually the bottleneck anymore,
> instead it's pg_basebackup. Which is in part due to the small buffer
> used for output data (i.e. libc's FILE buffering), and in part because
> we spend too much time memmove()ing data, because of the "left-justify"
> logic in pqCheckInBufferSpace().

Hmm.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: PG compilation error with Visual Studio 2015/2017/2019

2020-04-21 Thread Amit Kapila
On Tue, Apr 21, 2020 at 12:51 PM Amit Kapila  wrote:
>
> On Mon, Apr 20, 2020 at 6:32 PM Amit Kapila  wrote:
> >
> > On Sun, Apr 19, 2020 at 3:46 PM Juan José Santamaría Flecha
> >  wrote:
> > >
> > > I cannot find a single place where all supported locales are listed, but 
> > > I have created a small test program (WindowsNLSLocales.c) based on: 
> > > [_] format locales [1], additional supported language 
> > > strings [2], and additional supported country and region strings [3]. 
> > > Based on the results from this test program, it is possible to do a
> > > good job with the [_] types using the proposed logic,
> > > but the two latter cases are Windows specific, and there is no way
> > > around a lookup table.
> > >
> > > The attached results (WindowsNLSLocales.ods) come from Windows 10 (1903) 
> > > and Visual C++ build 1924, 64-bit.
> > >
> >
> > I think these are quite intensive tests, but I wonder whether we need to
> > test some locales with code_page?  Generally speaking, in this case it
> > should not matter as we are trying to get the locale name, but no harm
> > in testing.  Also, I think it would be good if we can test how this
> > impacts the error messages, as Davinder is trying to do.
> >
>
>
> I have tried a simple test with the latest patch and it failed for me.
>
> Set LC_MESSAGES="English_United Kingdom";
> -- returns en-GB, then the code changes it to en_NZ when _create_locale()
> is used, whereas with the patch it returns "" (empty string).
>
> There seem to be two problems here: (a) The input to enum_locales_fn
> doesn't seem to get the input name as "English_United Kingdom", due to
> which it can't find a match even if the same exists. (b) After
> executing EnumSystemLocalesEx, there is no way the patch can detect if
> it was successful in finding the passed name, due to which it appends
> an empty string in such cases.
>

A few more comments:
1. I have tried the first one in the list provided by you and that
also didn't work. Basically, I got an empty string when I tried Set
LC_MESSAGES='Afar';

2. Getting below warning
pg_locale.c(1072): warning C4133: 'function': incompatible types -
from 'const char *' to 'const wchar_t *'

3.
+ if (GetLocaleInfoEx(pStr, LOCALE_SENGLISHCOUNTRYNAME,
+ test_locale + len, LOCALE_NAME_MAX_LENGTH - len) > 0)

All the > or <= 0 checks should be changed to "!"-style checks, which
means checking whether the call to GetLocaleInfoEx succeeds or not (a
small sketch addressing comments 2 and 3 follows after comment 5).

4. In the patch, we first try to get the locale with LCType LOCALE_SNAME and
then with LOCALE_SENGLISHLANGUAGENAME and LOCALE_SENGLISHCOUNTRYNAME.
I think we should add comments indicating why we try to get the locale
information with three LCTypes and why the specific order of trying
those types is required.

5. In one of the previous emails, you asked whether we have a list of
supported locales.  I don't find any such list. I think it depends on
Windows locales for which you can get the information from
https://docs.microsoft.com/en-us/openspecs/windows_protocols/ms-lcid/a9eac961-e77d-41a6-90a5-ce1a8b0cdb9c
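
To illustrate comments 2 and 3 together, here is a minimal sketch (an
illustration only, not the patch's actual code) of converting the narrow
locale name to wide characters before calling GetLocaleInfoEx, with the
call checked for success via "!":

#include <windows.h>
#include <stdbool.h>

/*
 * Hypothetical helper: resolve a locale name via GetLocaleInfoEx.
 * The explicit wide-char conversion avoids the C4133 warning from
 * comment 2, and each Win32 call is tested for success rather than
 * compared against 0, per comment 3.
 */
static bool
resolve_locale_name(const char *winlocname, char *result, int resultlen)
{
	wchar_t		wc_locale_name[LOCALE_NAME_MAX_LENGTH];
	wchar_t		buffer[LOCALE_NAME_MAX_LENGTH];

	/* convert the narrow locale name to wide characters */
	if (MultiByteToWideChar(CP_ACP, 0, winlocname, -1,
							wc_locale_name, LOCALE_NAME_MAX_LENGTH) == 0)
		return false;

	if (!GetLocaleInfoEx(wc_locale_name, LOCALE_SNAME,
						 buffer, LOCALE_NAME_MAX_LENGTH))
		return false;			/* report failure instead of appending "" */

	/* convert the result back to a narrow string for the caller */
	if (WideCharToMultiByte(CP_ACP, 0, buffer, -1,
							result, resultlen, NULL, NULL) == 0)
		return false;

	return true;
}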

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




RE: extension patch of CREATE OR REPLACE TRIGGER

2020-04-21 Thread osumi.takami...@fujitsu.com
Dear Tom Lane

Thanks for so many fruitful comments!

I have fixed my patch again.
However, there are some questions left
that I'd like to discuss.

> * You missed updating equalfuncs.c/copyfuncs.c.  Pretty much any change in a
> Node struct will require touching backend/nodes/ functions, and in general 
> it's a
> good idea to grep for uses of the struct to see what else might be affected.
Yeah, thanks.

> * Did you use a dartboard while deciding where to add the new field in struct
> CreateTrigger?  Its placement certainly seems quite random.
> Maybe we should put both "replace" and "isconstraint" up near the front, to 
> match
> up with the statement's syntax.
Following the statement's syntax for CREATE TRIGGER,
I've listed the places I should change and fixed their order.

> * Speaking of which, I think you broke the isInternal case by insisting on 
> doing a
> lookup first.  isInternal should *not* do a lookup, period, especially not 
> with the
> name it's initially given which will not be the final trigger name.  A 
> conflict on that
> name is irrelevant, and so is the OID of any pre-existing trigger.
Sorry for this.
I inserted code to skip the first lookup in the isInternal case.
As a result, when isInternal is set true, the trigger_exists flag never becomes
true.
Doing a lookup first is otherwise necessary to fetch information,
such as existing_constraint_oid, for the code that runs
CreateConstraintEntry().

> * I'm not entirely sure that this patch interacts gracefully with the 
> provisions for
> per-partition triggers, either.  Is the change correctly cascaded to 
> per-partition
> triggers if there are any?
Yes.
Please check the 4 added test cases, which prove that replacement of a trigger
cascades to the partitions' triggers when there are other triggers on the relation.

> * The patch doesn't appear to have any defenses against being asked to
> replace the definition of, say, a foreign key trigger.  It might be
> sufficient to refuse to replace an entry that has tgisinternal set,
> though I'm not sure if that covers all cases that we'd want to disallow.

> Do we disallow making a change on a child partition trigger rather than its 
> parent?
> (Checking tgisinternal is going to be bound up in that, since it looks like 
> somebody
> decided to set that for child triggers.  I'm inclined to think that that was 
> a dumb
> idea; we may need to break out a separate tgischild flag so that we can tell 
> what's
> what.)
Does this mean I need to add a new catalog column named 'tgischild' to
pg_trigger?
That change sounds quite invasive, since it means touching many other
files as well.
Isn't there any other way to distinguish a trigger on a partition
from an internally generated trigger?
Otherwise, I need to change a lot of code to protect
internally generated triggers from being replaced.

> * I'm a little bit concerned about the semantics of changing the
> tgdeferrable/tginitdeferred properties of an existing trigger.  If there are 
> trigger
> events pending, and the trigger is redefined in such a way that those events
> should already have been fired, what then? 
OK. This point needs discussion.
I think there are two possible ways to define the behavior of this
semantics change.
The first idea is to throw an error saying that
a trigger with *pending* events can't be replaced during the session.
The second one is just to replace the trigger and fire the new trigger
at the end of the transaction when its tginitdeferred is set to true.
For me, the first one sounds safer. Yet, I'd like to know other opinions.

> regression=# create constraint trigger my_trig after insert on trig_table
> deferrable initially deferred for each row execute procedure
> before_replacement();
> CREATE TRIGGER
> regression=# begin;
> BEGIN
> regression=*# insert into trig_table default values;
> INSERT 0 1
> regression=*# drop trigger my_trig on trig_table;
> DROP TRIGGER
> regression=*# commit;
> ERROR:  relation 38489 has no triggers
I could reproduce this bug using the current master without my patch,
so this is a separate issue.
I'm thinking that throwing an error when a trigger with *pending*
events is dropped makes sense. Does everyone agree?

> * Not the fault of this patch exactly, but trigger.c seems to have an 
> annoyingly
> large number of copies of the code to look up a trigger by name.  I wonder if 
> we
> could refactor that, say by extracting the guts of
> get_trigger_oid() into an internal function that's passed an already-open
> pg_trigger relation.
While waiting for other reviews and comments, I'm willing to give it a try. 
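
For illustration, here is a rough sketch of what the shared guts might
look like, mirroring the lookup pattern trigger.c already uses (the
function name and signature are only my assumptions, nothing settled):

/*
 * Hypothetical common helper: look up a trigger by name in an
 * already-open pg_trigger relation.  Returns InvalidOid when the
 * trigger does not exist.  Assumes the usual trigger.c includes
 * (access/genam.h, access/htup_details.h, catalog/indexing.h,
 * catalog/pg_trigger.h, utils/fmgroids.h).
 */
static Oid
lookup_trigger_oid(Relation tgrel, Oid relid, const char *trigname)
{
	ScanKeyData skey[2];
	SysScanDesc tgscan;
	HeapTuple	tuple;
	Oid			oid = InvalidOid;

	/* find the trigger by (tgrelid, tgname) using the unique index */
	ScanKeyInit(&skey[0],
				Anum_pg_trigger_tgrelid,
				BTEqualStrategyNumber, F_OIDEQ,
				ObjectIdGetDatum(relid));
	ScanKeyInit(&skey[1],
				Anum_pg_trigger_tgname,
				BTEqualStrategyNumber, F_NAMEEQ,
				CStringGetDatum(trigname));

	tgscan = systable_beginscan(tgrel, TriggerRelidNameIndexId, true,
								NULL, 2, skey);
	tuple = systable_getnext(tgscan);
	if (HeapTupleIsValid(tuple))
		oid = ((Form_pg_trigger) GETSTRUCT(tuple))->oid;
	systable_endscan(tgscan);

	return oid;
}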

> * Upthread you were complaining about ShareRowExclusiveLock not being a
> strong enough lock, but I think that's nonsense, for the same reason that 
> it's a
> sufficient lock for plain CREATE TRIGGER: if we have that lock then no other
> session can have pending trigger events of any sort on the relation, nor can 
> new
> ones get made before we commit.  But there's no reason to lock out 

Re: Fix for pg_statio_all_tables

2020-04-21 Thread Alexander Korotkov
On Tue, Apr 21, 2020 at 7:58 AM Tom Lane  wrote:
> Michael Paquier  writes:
> > On Tue, Apr 21, 2020 at 02:44:45AM +0300, Alexander Korotkov wrote:
> >> As a bugfix, I think this should be backpatched.  But this patch
> >> requires a catalog change.  Were there similar cases before?  If so,
> >> how did we resolve them?
>
> > A backpatch can happen in such cases, see for example b6e39ca9.  In
> > this case, the resolution was done with a backpatch to
> > system_views.sql and the release notes include an additional note
> > saying that the fix applies itself only on already-initialized
> > clusters.  For other clusters, it was necessary to apply a SQL query,
> > given also in the release notes, to fix the issue (just grep for
> > CVE-2017-7547 in release-9.6.sgml on the REL9_6_STABLE branch).
>
> Yeah, but that was for a security hole.  I am doubtful that the
> severity of this problem is bad enough to justify jumping through
> similar hoops.  Even if we fixed it and documented it, how many
> users would bother to apply the manual correction?

Sure, only the most conscientious users will apply the manual correction.
But if there are only two options, backpatch it this way or don't
backpatch at all, then I would choose the first one.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Fix for pg_statio_all_tables

2020-04-21 Thread Alexander Korotkov
On Tue, Apr 21, 2020 at 4:38 AM Michael Paquier  wrote:
> On Tue, Apr 21, 2020 at 02:44:45AM +0300, Alexander Korotkov wrote:
> > Among all the joined tables, only "pg_index I" is expected to have
> > multiple rows associated with a single relation.  But we do sum() for
> > the toast index "pg_index X" as well.  As a result, we multiply the
> > statistics for the toast index by the number of the relation's
> > indexes.  This is obviously wrong.
>
> Oops.
>
> > As a bugfix, I think this should be backpatched.  But this patch
> > requires a catalog change.  Were there similar cases before?  If so,
> > how did we resolve them?
>
> A backpatch can happen in such cases, see for example b6e39ca9.  In
> this case, the resolution was done with a backpatch to
> system_views.sql and the release notes include an additional note
> saying that the fix applies itself only on already-initialized
> clusters.  For other clusters, it was necessary to apply a SQL query,
> given also in the release notes, to fix the issue (just grep for
> CVE-2017-7547 in release-9.6.sgml on the REL9_6_STABLE branch).

Thank you for pointing that out!

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: WIP/PoC for parallel backup

2020-04-21 Thread Asif Rehman
On Tue, 21 Apr 2020 at 2:36 PM, Jeevan Ladhe 
wrote:

> Hi Asif,
>
> On Tue, Apr 21, 2020 at 1:00 PM Asif Rehman 
> wrote:
>
>> Hi,
>>
>> I did some tests a while back, and here are the results. The tests were
>> done to simulate
>> a live database environment using pgbench.
>>
>> machine configuration used for this test:
>> Instance Type:t2.xlarge
>> Volume Type  :io1
>> Memory (MiB) :16384
>> vCPU #   :4
>> Architecture:X86_64
>> IOP :16000
>> Database Size (GB) :102
>>
>> The setup consists of 3 machines.
>> - one for database instances
>> - one for pg_basebackup client and
>> - one for pgbench with some parallel workers, simulating SELECT loads.
>>
>>                         basebackup | 4 workers | 8 Workers | 16 workers
>> Backup Duration(Min):   69.25      | 20.44     | 19.86     | 20.15
>> (pgbench running with 50 parallel clients simulating SELECT load)
>>
>
>
> Well, that looks a bit strange. All the 4-, 8- and 16-worker backup
> configurations seem to have taken the same time. Is it because the
> machine has only 4 CPUs? In that case, did you try to run with 2
> workers and compare that with the 4-worker time?
>
> Also, just to clarify and be sure: was there anything else running on
> any of these 3 machines while the backup was in progress?
>

The tests were performed only with 4, 8 and 16 workers at the time, and
there was nothing else running on any of the machines.


> Regards,
> Jeevan Ladhe
>
>
>> Backup Duration(Min):   154.75     | 49.28     | 45.27     | 20.35
>> (pgbench running with 100 parallel clients simulating SELECT load)
>>
>>
>>
>> On Tue, Apr 21, 2020 at 9:27 AM Amit Kapila 
>> wrote:
>>
>>> On Tue, Apr 14, 2020 at 8:07 PM Asif Rehman 
>>> wrote:
>>>

 I forgot to make a check for no-manifest. Fixed. Attached is the
 updated patch.


>>> Have we done any performance testing with this patch to see the
>>> benefits? If so, can you point me to the results? If not, then can we
>>> perform some tests on large backups to see the benefits of this patch/idea?
>>>
>>> --
>>> With Regards,
>>> Amit Kapila.
>>> EnterpriseDB: http://www.enterprisedb.com
>>>
>>
>>
>> --
>> --
>> Asif Rehman
>> Highgo Software (Canada/China/Pakistan)
>> URL : www.highgo.ca
>>
>> --
--
Asif Rehman
Highgo Software (Canada/China/Pakistan)
URL : www.highgo.ca


Re: fixing old_snapshot_threshold's time->xid mapping

2020-04-21 Thread Thomas Munro
On Tue, Apr 21, 2020 at 2:05 PM Thomas Munro  wrote:
> As before, these two apply on top of Robert's patches (or at least his
> 0001 and 0002).

While trying to figure out if Robert's 0003 patch was correct, I added
yet another patch to this stack to test it.  0006 does basic xid map
maintenance that exercises the cases 0003 fixes, and I think it
demonstrates that they now work correctly.  Also some minor perl
improvements to 0005.  I'll attach 0001-0004 again but they are
unchanged.

Since confusion about head vs tail seems to have been at the root of
the bugs addressed by 0003, I wonder if we should also rename
head_{timestamp,offset} to oldest_{timestamp,offset}.
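
For anyone following along, here is a sketch of the circular-buffer
arithmetic that renaming is meant to clarify.  The function below is
only an illustration of the conventions described in the
OldSnapshotControlData comments; the real logic lives in snapmgr.c,
where OLD_SNAPSHOT_TIME_MAP_ENTRIES is a private macro:

/*
 * Illustration only: given a minute-aligned timestamp at or after
 * head_timestamp, find its slot in xid_by_minute[].  head_offset and
 * head_timestamp describe the oldest tracked entry, and the buffer
 * wraps around modulo its size.
 */
static int
slot_for_timestamp(TimestampTz ts)
{
	int			minutes_past_head;

	minutes_past_head = (ts - oldSnapshotControl->head_timestamp)
		/ USECS_PER_MINUTE;

	return (oldSnapshotControl->head_offset + minutes_past_head)
		% OLD_SNAPSHOT_TIME_MAP_ENTRIES;
}
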
From 192ff883fd834371cc1f674d6a4cdfee89729cb0 Mon Sep 17 00:00:00 2001
From: Robert Haas 
Date: Thu, 16 Apr 2020 09:37:31 -0400
Subject: [PATCH v5 1/6] Expose oldSnapshotControl.

---
 src/backend/utils/time/snapmgr.c | 55 +--
 src/include/utils/old_snapshot.h | 75 
 2 files changed, 77 insertions(+), 53 deletions(-)
 create mode 100644 src/include/utils/old_snapshot.h

diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 1c063c592c..abaaea569a 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -63,6 +63,7 @@
 #include "storage/spin.h"
 #include "utils/builtins.h"
 #include "utils/memutils.h"
+#include "utils/old_snapshot.h"
 #include "utils/rel.h"
 #include "utils/resowner_private.h"
 #include "utils/snapmgr.h"
@@ -74,59 +75,7 @@
  */
 int			old_snapshot_threshold; /* number of minutes, -1 disables */
 
-/*
- * Structure for dealing with old_snapshot_threshold implementation.
- */
-typedef struct OldSnapshotControlData
-{
-	/*
-	 * Variables for old snapshot handling are shared among processes and are
-	 * only allowed to move forward.
-	 */
-	slock_t		mutex_current;	/* protect current_timestamp */
-	TimestampTz current_timestamp;	/* latest snapshot timestamp */
-	slock_t		mutex_latest_xmin;	/* protect latest_xmin and next_map_update */
-	TransactionId latest_xmin;	/* latest snapshot xmin */
-	TimestampTz next_map_update;	/* latest snapshot valid up to */
-	slock_t		mutex_threshold;	/* protect threshold fields */
-	TimestampTz threshold_timestamp;	/* earlier snapshot is old */
-	TransactionId threshold_xid;	/* earlier xid may be gone */
-
-	/*
-	 * Keep one xid per minute for old snapshot error handling.
-	 *
-	 * Use a circular buffer with a head offset, a count of entries currently
-	 * used, and a timestamp corresponding to the xid at the head offset.  A
-	 * count_used value of zero means that there are no times stored; a
-	 * count_used value of OLD_SNAPSHOT_TIME_MAP_ENTRIES means that the buffer
-	 * is full and the head must be advanced to add new entries.  Use
-	 * timestamps aligned to minute boundaries, since that seems less
-	 * surprising than aligning based on the first usage timestamp.  The
-	 * latest bucket is effectively stored within latest_xmin.  The circular
-	 * buffer is updated when we get a new xmin value that doesn't fall into
-	 * the same interval.
-	 *
-	 * It is OK if the xid for a given time slot is from earlier than
-	 * calculated by adding the number of minutes corresponding to the
-	 * (possibly wrapped) distance from the head offset to the time of the
-	 * head entry, since that just results in the vacuuming of old tuples
-	 * being slightly less aggressive.  It would not be OK for it to be off in
-	 * the other direction, since it might result in vacuuming tuples that are
-	 * still expected to be there.
-	 *
-	 * Use of an SLRU was considered but not chosen because it is more
-	 * heavyweight than is needed for this, and would probably not be any less
-	 * code to implement.
-	 *
-	 * Persistence is not needed.
-	 */
-	int			head_offset;	/* subscript of oldest tracked time */
-	TimestampTz head_timestamp; /* time corresponding to head xid */
-	int			count_used;		/* how many slots are in use */
-	TransactionId xid_by_minute[FLEXIBLE_ARRAY_MEMBER];
-} OldSnapshotControlData;
-
-static volatile OldSnapshotControlData *oldSnapshotControl;
+volatile OldSnapshotControlData *oldSnapshotControl;
 
 
 /*
diff --git a/src/include/utils/old_snapshot.h b/src/include/utils/old_snapshot.h
new file mode 100644
index 00..284af7d508
--- /dev/null
+++ b/src/include/utils/old_snapshot.h
@@ -0,0 +1,75 @@
+/*-
+ *
+ * old_snapshot.h
+ *		Data structures for 'snapshot too old'
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/include/utils/old_snapshot.h
+ *
+ *-
+ */
+
+#ifndef OLD_SNAPSHOT_H
+#define OLD_SNAPSHOT_H
+
+#include "datatype/timestamp.h"
+#include "storage/s_lock.h"
+
+/*
+ * Structure for dealing with 

Concurrency bug in amcheck

2020-04-21 Thread Alexander Korotkov
Hi!

I found a concurrency bug in amcheck running on a replica.  When
btree_xlog_unlink_page() replays changes on a replica, the deleted page
is left with no items.  But if amcheck steps on such a deleted page,
palloc_btree_page() expects it to have items.

(lldb_on_primary) b btbulkdelete

primary=# drop table test;
primary=# create table test as (select random() x from
generate_series(1,100) i);
primary=# create index test_x_idx on test(x);
primary=# delete from test;
primary=# vacuum test;

(lldb_on_replica) b bt_check_level_from_leftmost

replica=# select bt_index_check('test_x_idx');

# skip to internal level
(lldb_on_replica) c
(lldb_on_replica) b palloc_btree_page
# skip to non-leftmost page
(lldb_on_replica) c
(lldb_on_replica) c
# concurrently delete btree pages
(lldb_on_primary) c
# continue with pages
(lldb_on_replica) c

Finally, the replica gets an error:
ERROR:  internal block 289 in index "test_x_idx" lacks high key and/or
at least one downlink

The proposed fix is attached.  Spotted by Konstantin Knizhnik;
reproduction case and fix are from me.
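
Since the patch appears in this archive only as binary data, here is a
sketch of the kind of guard involved (my paraphrase of the idea, not
necessarily the patch itself).  In palloc_btree_page(), the item-count
sanity check could simply tolerate deleted pages, since
btree_xlog_unlink_page() leaves a deleted page with no items on a
standby; page, blocknum and state come from the surrounding function:

	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
	maxoffset = PageGetMaxOffsetNumber(page);

	/* deleted pages legitimately have no high key or downlinks */
	if (!P_ISLEAF(opaque) && !P_ISDELETED(opaque) &&
		maxoffset < P_FIRSTDATAKEY(opaque))
		ereport(ERROR,
				(errcode(ERRCODE_INDEX_CORRUPTED),
				 errmsg("internal block %u in index \"%s\" lacks high key and/or at least one downlink",
						blocknum, RelationGetRelationName(state->rel))));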

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


fix_amcheck_concurrency.patch
Description: Binary data


Re: WIP/PoC for parallel backup

2020-04-21 Thread Jeevan Ladhe
Hi Asif,

On Tue, Apr 21, 2020 at 1:00 PM Asif Rehman  wrote:

> Hi,
>
> I did some tests a while back, and here are the results. The tests were
> done to simulate
> a live database environment using pgbench.
>
> machine configuration used for this test:
> Instance Type:t2.xlarge
> Volume Type  :io1
> Memory (MiB) :16384
> vCPU #   :4
> Architecture:X86_64
> IOP :16000
> Database Size (GB) :102
>
> The setup consists of 3 machines.
> - one for database instances
> - one for pg_basebackup client and
> - one for pgbench with some parallel workers, simulating SELECT loads.
>
>                         basebackup | 4 workers | 8 Workers | 16 workers
> Backup Duration(Min):   69.25      | 20.44     | 19.86     | 20.15
> (pgbench running with 50 parallel clients simulating SELECT load)
>


Well, that looks a bit strange. All the 4-, 8- and 16-worker backup
configurations seem to have taken the same time. Is it because the
machine has only 4 CPUs? In that case, did you try to run with 2 workers
and compare that with the 4-worker time?

Also, just to clarify and be sure: was there anything else running on
any of these 3 machines while the backup was in progress?

Regards,
Jeevan Ladhe


> Backup Duration(Min):   154.75     | 49.28     | 45.27     | 20.35
> (pgbench running with 100 parallel clients simulating SELECT load)
>
>
>
> On Tue, Apr 21, 2020 at 9:27 AM Amit Kapila 
> wrote:
>
>> On Tue, Apr 14, 2020 at 8:07 PM Asif Rehman 
>> wrote:
>>
>>>
>>> I forgot to make a check for no-manifest. Fixed. Attached is the updated
>>> patch.
>>>
>>>
>> Have we done any performance testing with this patch to see the benefits?
>> If so, can you point me to the results? If not, then can we perform some
>> tests on large backups to see the benefits of this patch/idea?
>>
>> --
>> With Regards,
>> Amit Kapila.
>> EnterpriseDB: http://www.enterprisedb.com
>>
>
>
> --
> --
> Asif Rehman
> Highgo Software (Canada/China/Pakistan)
> URL : www.highgo.ca
>
>


Re: forgotten initialization of a variable

2020-04-21 Thread Kyotaro Horiguchi
At Tue, 21 Apr 2020 17:34:26 +0900, Michael Paquier  wrote 
in 
> On Tue, Apr 21, 2020 at 03:08:30PM +0900, Kyotaro Horiguchi wrote:
> > The commit a7e8ece41c adds a new member restoreCommand to
> > XLogPageReadPrivate. readOneRecord doesn't make use of it but forgets
> > to set it to NULL. That can lead to illegal pointer access.
> 
> That's an oversight of the original commit.  Now, instead of failing
> even if there is a restore command, wouldn't it be better to pass down
> the restore_command to readOneRecord() so that we can optionally
> improve the stability of a single record lookup?  This only applies to

Oops! You're right.

> a checkpoint record now, but this routine could be called elsewhere in
> the future.  Please see the attached.

It looks fine to me.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: WIP: Aggregation push-down

2020-04-21 Thread Andy Fan
On Thu, Feb 27, 2020 at 4:50 PM Antonin Houska  wrote:

> legrand legrand  wrote:
>
> > Antonin Houska-2 wrote
>
> > > Right now I recall two problems: 1) is the way I currently store
> > > RelOptInfo for the grouped relations correct?, 2) how should we handle
> > > types for which logical equality does not imply physical (byte-wise)
> > > equality?
> > >
> > > Fortunately it seems now that I'm not the only one who cares about 2),
> so
> > > this
> > > problem might be resolved soon:
> > >
> > >
> https://www.postgresql.org/message-id/CAH2-Wzn3Ee49Gmxb7V1VJ3-AC8fWn-Fr8pfWQebHe8rYRxt5OQ%40mail.gmail.com
> > >
> > > But 1) still remains.
> > >
> >
> > Hello
> > would "pgsql: Add equalimage B-Tree support functions."
> >
> https://www.postgresql.org/message-id/e1j72ny-0002gi...@gemulon.postgresql.org
>
> Yes, it seems so. I'll adapt the patch soon, hopefully next week. Thanks
> for the reminder.
>
> --
> Antonin Houska
> Web: https://www.cybertec-postgresql.com
>
>
Hi Antonin:

The more I test your patch, the more powerful I feel it is! At the same
time, I think the most difficult part of your design to understand is
that you can accept any number of generic join clauses, so I guess more
explanation of this part may be helpful.

At the code level, I made some slight changes to init_grouping_targets
which may make the code easier to read.  You are free to use them or not.

Hope your patch gets more attention soon!

Best Regards
Andy Fan


v2-0001-tiny-changes-for-init_grouping_targets.patch
Description: Binary data


Re: forgotten initialization of a variable

2020-04-21 Thread Michael Paquier
On Tue, Apr 21, 2020 at 03:08:30PM +0900, Kyotaro Horiguchi wrote:
> The commit a7e8ece41c adds a new member restoreCommand to
> XLogPageReadPrivate. readOneRecord doesn't make use of it but forgets
> to set it to NULL. That can lead to illegal pointer access.

That's an oversight of the original commit.  Now, instead of failing
even if there is a restore command, wouldn't it be better to pass down
the restore_command to readOneRecord() so that we can optionally
improve the stability of a single record lookup?  This only applies to
a checkpoint record now, but this routine could be called elsewhere in
the future.  Please see the attached.
--
Michael
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index 14a5db5433..c51b5db315 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -106,7 +106,8 @@ extractPageMap(const char *datadir, XLogRecPtr startpoint, int tliIndex,
  * doing anything with the record itself.
  */
 XLogRecPtr
-readOneRecord(const char *datadir, XLogRecPtr ptr, int tliIndex)
+readOneRecord(const char *datadir, XLogRecPtr ptr, int tliIndex,
+			  const char *restoreCommand)
 {
 	XLogRecord *record;
 	XLogReaderState *xlogreader;
@@ -115,6 +116,7 @@ readOneRecord(const char *datadir, XLogRecPtr ptr, int tliIndex)
 	XLogRecPtr	endptr;
 
 	private.tliIndex = tliIndex;
+	private.restoreCommand = restoreCommand;
 	xlogreader = XLogReaderAllocate(WalSegSz, datadir, ,
 	);
 	if (xlogreader == NULL)
diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c
index 101f0911be..633955f7be 100644
--- a/src/bin/pg_rewind/pg_rewind.c
+++ b/src/bin/pg_rewind/pg_rewind.c
@@ -339,7 +339,8 @@ main(int argc, char **argv)
 			/* Read the checkpoint record on the target to see where it ends. */
 			chkptendrec = readOneRecord(datadir_target,
 		ControlFile_target.checkPoint,
-		targetNentries - 1);
+		targetNentries - 1,
+		restore_command);
 
 			/*
 			 * If the histories diverged exactly at the end of the shutdown
diff --git a/src/bin/pg_rewind/pg_rewind.h b/src/bin/pg_rewind/pg_rewind.h
index b122ae43e5..5cf5f17bb5 100644
--- a/src/bin/pg_rewind/pg_rewind.h
+++ b/src/bin/pg_rewind/pg_rewind.h
@@ -50,7 +50,7 @@ extern void findLastCheckpoint(const char *datadir, XLogRecPtr searchptr,
 			   XLogRecPtr *lastchkptredo,
 			   const char *restoreCommand);
 extern XLogRecPtr readOneRecord(const char *datadir, XLogRecPtr ptr,
-int tliIndex);
+int tliIndex, const char *restoreCommand);
 
 /* in pg_rewind.c */
 extern void progress_report(bool force);


signature.asc
Description: PGP signature


Re: Remove non-fast promotion Re: Should we remove a fallback promotion? take 2

2020-04-21 Thread Kyotaro Horiguchi
At Mon, 20 Apr 2020 15:26:16 +0900, Fujii Masao  
wrote in 
> Patch attached. I will add this into the first CF for v14.

-   if (!fast_promoted)
+   if (!promoted)
        RequestCheckpoint(CHECKPOINT_END_OF_RECOVERY |
                          CHECKPOINT_IMMEDIATE |
                          CHECKPOINT_WAIT);

If we don't find the checkpoint record just before, we don't insert an
End-Of-Recovery record and instead run an immediate checkpoint.  If we
nuke non-fast promotion, shouldn't we insert the EOR record even in
that case?

Or, as Andres suggested upthread, should we always insert it?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: WAL page magic errors (and plenty others) got hard to debug.

2020-04-21 Thread Andres Freund
Hi,

On 2020-04-05 15:49:16 -0700, Andres Freund wrote:
> When starting on a data directory with an older WAL page magic we
> currently make that hard to debug. E.g.:
> 
> 2020-04-05 15:31:04.314 PDT [1896669][:0] LOG:  database system was shut down 
> at 2020-04-05 15:24:56 PDT
> 2020-04-05 15:31:04.314 PDT [1896669][:0] LOG:  invalid primary checkpoint 
> record
> 2020-04-05 15:31:04.314 PDT [1896669][:0] PANIC:  could not locate a valid 
> checkpoint record
> 2020-04-05 15:31:04.315 PDT [1896668][:0] LOG:  startup process (PID 1896669) 
> was terminated by signal 6: Aborted
> 2020-04-05 15:31:04.315 PDT [1896668][:0] LOG:  aborting startup due to 
> startup process failure
> 2020-04-05 15:31:04.316 PDT [1896668][:0] LOG:  database system is shut down
> 
> As far as I can tell this is not just the case for a wrong page magic,
> but for all page level validation errors.
> 
> I think this largely originates in:
> 
> commit 0668719801838aa6a8bda330ff9b3d20097ea844
> Author: Heikki Linnakangas 
> Date:   2018-05-05 01:34:53 +0300
> 
> Fix scenario where streaming standby gets stuck at a continuation record.

Heikki, Kyotaro, it'd be good if you could comment on what motivated
this approach. Because it sure as hell hides a lot of useful information
when there's a problem with WAL. Or well, all information.

- Andres

Greetings,

Andres Freund




Re: Remove page-read callback from XLogReaderState.

2020-04-21 Thread Kyotaro Horiguchi
I found this conflicts with a7e8ece41cf7a96d8a9c4c037cdfef304d950831.
Rebased on it.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From e67149578c750977a2a2872d3f4dbb3a86c82a6d Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Thu, 5 Sep 2019 20:21:55 +0900
Subject: [PATCH v8 1/4] Move callback-call from ReadPageInternal to
 XLogReadRecord.

The current WAL record reader reads page data using a callback
function.  Although it is not so problematic alone, it would be a
problem if we are going to add tasks like encryption which are
performed on page data before the WAL reader reads them.  To avoid
having the record reader facility grow a new code path corresponding
to every new callback, this patch separates the page reader from the
WAL record reading facility by modifying the current WAL record reader
into a state machine.

As the first step of that change, this patch moves the page reader
function out of ReadPageInternal; the remaining tasks of that function
are taken over by the new function XLogNeedData.  As a result,
XLogPageRead directly calls the page reader callback function
according to the feedback from XLogNeedData.
---
 src/backend/access/transam/xlog.c   |  16 +-
 src/backend/access/transam/xlogreader.c | 278 ++--
 src/backend/access/transam/xlogutils.c  |  12 +-
 src/backend/replication/walsender.c |  10 +-
 src/bin/pg_rewind/parsexlog.c   |  21 +-
 src/bin/pg_waldump/pg_waldump.c |   8 +-
 src/include/access/xlogreader.h |  23 +-
 src/include/access/xlogutils.h  |   2 +-
 8 files changed, 218 insertions(+), 152 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 11e32733c4..b0ad9376d6 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -908,7 +908,7 @@ static bool InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
 static int	XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
 		 XLogSource source, bool notfoundOk);
 static int	XLogFileReadAnyTLI(XLogSegNo segno, int emode, XLogSource source);
-static int	XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr,
+static bool	XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr,
 		 int reqLen, XLogRecPtr targetRecPtr, char *readBuf);
 static bool WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 		bool fetching_ckpt, XLogRecPtr tliRecPtr);
@@ -4328,7 +4328,6 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 	XLogRecord *record;
 	XLogPageReadPrivate *private = (XLogPageReadPrivate *) xlogreader->private_data;
 
-	/* Pass through parameters to XLogPageRead */
 	private->fetching_ckpt = fetching_ckpt;
 	private->emode = emode;
 	private->randAccess = (xlogreader->ReadRecPtr == InvalidXLogRecPtr);
@@ -11810,7 +11809,7 @@ CancelBackup(void)
  * XLogPageRead() to try fetching the record from another source, or to
  * sleep and retry.
  */
-static int
+static bool
 XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
 			 XLogRecPtr targetRecPtr, char *readBuf)
 {
@@ -11869,7 +11868,8 @@ retry:
 			readLen = 0;
 			readSource = XLOG_FROM_ANY;
 
-			return -1;
+			xlogreader->readLen = -1;
+			return false;
 		}
 	}
 
@@ -11964,7 +11964,8 @@ retry:
 		goto next_record_is_invalid;
 	}
 
-	return readLen;
+	xlogreader->readLen = readLen;
+	return true;
 
 next_record_is_invalid:
 	lastSourceFailed = true;
@@ -11978,8 +11979,9 @@ next_record_is_invalid:
 	/* In standby-mode, keep trying */
 	if (StandbyMode)
 		goto retry;
-	else
-		return -1;
+
+	xlogreader->readLen = -1;
+	return false;
 }
 
 /*
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 79ff976474..6250093dd9 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -36,8 +36,8 @@
 static void report_invalid_record(XLogReaderState *state, const char *fmt,...)
 			pg_attribute_printf(2, 3);
 static bool allocate_recordbuf(XLogReaderState *state, uint32 reclength);
-static int	ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr,
-			 int reqLen);
+static bool XLogNeedData(XLogReaderState *state, XLogRecPtr pageptr,
+		 int reqLen, bool header_inclusive);
 static void XLogReaderInvalReadState(XLogReaderState *state);
 static bool ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
   XLogRecPtr PrevRecPtr, XLogRecord *record, bool randAccess);
@@ -272,7 +272,6 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 	uint32		targetRecOff;
 	uint32		pageHeaderSize;
 	bool		gotheader;
-	int			readOff;
 
 	/*
 	 * randAccess indicates whether to verify the previous-record pointer of
@@ -322,14 +321,20 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 	 * byte to cover the whole record header, or at least the part of it that
 	 * fits on the same page.
 	 */
-	readOff = ReadPageInternal(state, 

Re: Remove non-fast promotion Re: Should we remove a fallback promotion? take 2

2020-04-21 Thread Kyotaro Horiguchi
At Tue, 21 Apr 2020 15:48:02 +0900, Fujii Masao  
wrote in 
> 
> 
> On 2020/04/21 15:36, Michael Paquier wrote:
> > On Tue, Apr 21, 2020 at 03:29:54PM +0900, Fujii Masao wrote:
> >> Yeah, but that's not documented. So I don't think that we need to keep
> >> backward compatibility for that.
> >>
> >> Also in that case, non-fast promotion is triggered. Since my patch
> >> tries to remove non-fast promotion, it's intentional to prevent them
> >> from doing that. But do you think that we should not drop it because
> >> there are still some users of it?
> > It would be good to ask around among folks maintaining HA solutions about
> > that change at least, as there could be a point in still letting
> > promotion happen in this case, but switching silently to the fast
> > path.
> 
> *If* there are some HA solutions doing that, IMO they should be changed
> so that the documented official way to trigger promotion (i.e., pg_ctl
> promote, pg_promote or trigger_file) is used instead.

The difference between fast and non-fast promotion is far more trivial
than the difference between promotion happening or not.  I think
everyone cares whether the new version actually promotes with the steps
they have been using, but few of them even notice the difference
between fast and non-fast.  Those who are using non-fast promotion for
a certain reason should notice the change of promotion behavior in the
release notes.

This is similar to the change of the default waiting behavior of
pg_ctl in PG 10.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: WIP/PoC for parallel backup

2020-04-21 Thread Asif Rehman
Hi,

I did some tests a while back, and here are the results. The tests were
done to simulate
a live database environment using pgbench.

machine configuration used for this test:
Instance Type:t2.xlarge
Volume Type  :io1
Memory (MiB) :16384
vCPU #   :4
Architecture:X86_64
IOP :16000
Database Size (GB) :102

The setup consists of 3 machines.
- one for database instances
- one for pg_basebackup client and
- one for pgbench with some parallel workers, simulating SELECT loads.

                        basebackup | 4 workers | 8 Workers | 16 workers
Backup Duration(Min):   69.25      | 20.44     | 19.86     | 20.15
(pgbench running with 50 parallel clients simulating SELECT load)

Backup Duration(Min):   154.75     | 49.28     | 45.27     | 20.35
(pgbench running with 100 parallel clients simulating SELECT load)



On Tue, Apr 21, 2020 at 9:27 AM Amit Kapila  wrote:

> On Tue, Apr 14, 2020 at 8:07 PM Asif Rehman 
> wrote:
>
>>
>> I forgot to make a check for no-manifest. Fixed. Attached is the updated
>> patch.
>>
>>
> Have we done any performance testing with this patch to see the benefits?
> If so, can you point me to the results? If not, then can we perform some
> tests on large backups to see the benefits of this patch/idea?
>
> --
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com
>


-- 
--
Asif Rehman
Highgo Software (Canada/China/Pakistan)
URL : www.highgo.ca


Re: PG compilation error with Visual Studio 2015/2017/2019

2020-04-21 Thread Amit Kapila
On Mon, Apr 20, 2020 at 6:32 PM Amit Kapila  wrote:
>
> On Sun, Apr 19, 2020 at 3:46 PM Juan José Santamaría Flecha
>  wrote:
> >
> > I cannot find a single place where all supported locales are listed, but I 
> > have created a small test program (WindowsNLSLocales.c) based on: 
> > [_] format locales [1], additional supported language 
> > strings [2], and additional supported country and region strings [3]. Based 
> > on the results from this test program, it is possible to do a good job
> > with the [_] types using the proposed logic, but the
> > two latter cases are Windows specific, and there is no way around a
> > lookup table.
> >
> > The attached results (WindowsNLSLocales.ods) come from Windows 10 (1903) 
> > and Visual C++ build 1924, 64-bit.
> >
>
> I think these are quite intensive tests, but I wonder whether we need to
> test some locales with code_page?  Generally speaking, in this case it
> should not matter as we are trying to get the locale name, but no harm
> in testing.  Also, I think it would be good if we can test how this
> impacts the error messages, as Davinder is trying to do.
>


I have tried a simple test with the latest patch and it failed for me.

Set LC_MESSAGES="English_United Kingdom";
-- returns en-GB, then the code changes it to en_NZ when _create_locale()
is used, whereas with the patch it returns "" (empty string).

There seem to be two problems here: (a) The input to enum_locales_fn
doesn't seem to get the input name as "English_United Kingdom", due to
which it can't find a match even if the same exists. (b) After
executing EnumSystemLocalesEx, there is no way the patch can detect if
it was successful in finding the passed name, due to which it appends
an empty string in such cases.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Remove non-fast promotion Re: Should we remove a fallback promotion? take 2

2020-04-21 Thread Fujii Masao




On 2020/04/21 15:36, Michael Paquier wrote:

On Tue, Apr 21, 2020 at 03:29:54PM +0900, Fujii Masao wrote:

Yeah, but that's not documented. So I don't think that we need to keep
backward compatibility for that.

Also in that case, non-fast promotion is triggered. Since my patch
tries to remove non-fast promotion, it's intentional to prevent them
from doing that. But do you think that we should not drop it because
there are still some users of it?


It would be good to ask around among folks maintaining HA solutions about
that change at least, as there could be a point in still letting
promotion happen in this case, but switching silently to the fast
path.


*If* there are some HA solutions doing that, IMO they should be changed
so that the documented official way to trigger promotion (i.e., pg_ctl promote,
pg_promote or trigger_file) is used instead.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: design for parallel backup

2020-04-21 Thread Andres Freund
Hi,

On 2020-04-20 22:31:49 -0700, Andres Freund wrote:
> On 2020-04-21 10:20:01 +0530, Amit Kapila wrote:
> > It is quite likely that compression can benefit more from parallelism
> > as compared to the network I/O as that is mostly a CPU intensive
> > operation but I am not sure if we can just ignore the benefit of
> > utilizing the network bandwidth.  In our case, after copying from the
> > network we do write that data to disk, so during filesystem I/O the
> > network can be used if there is some other parallel worker processing
> > other parts of data.
>
> Well, as I said, network and FS IO as done by server / pg_basebackup are
> both fully buffered by the OS. Unless the OS throttles the userland
> process, a large chunk of the work will be done by the kernel, in
> separate kernel threads.
>
> My workstation and my laptop can, in a single thread each, get close
> 20GBit/s of network IO (bidirectional 10GBit, I don't have faster - it's
> a thunderbolt 10gbe card) and iperf3 is at 55% CPU while doing so. Just
> connecting locally it's 45Gbit/s. Or over 8GBbyte/s of buffered
> filesystem IO. And it doesn't even have that high per-core clock speed.
>
> I just don't see this being the bottleneck for now.

FWIW, I just tested pg_basebackup locally.

Without compression and a stock postgres I get:
unixtcp  tcp+ssl:
1.74GiB/s   1.02GiB/s699MiB/s

That turns out to be bottlenecked by the backup manifest generation.

Without compression and a stock postgres I get, and --no-manifest
unixtcp  tcp+ssl:
2.51GiB/s   1.63GiB/s1.00GiB/s

I.e. all of them area already above 10Gbit/s network.

Looking at a profile it's clear that our small output buffer is the
bottleneck:
64kB Buffers + --no-manifest:
unixtcp  tcp+ssl:
2.99GiB/s   2.56GiB/s1.18GiB/s

At this point the backend is not actually the bottleneck anymore,
instead it's pg_basebackup. Which is in part due to the small buffer
used for output data (i.e. libc's FILE buffering), and in part because
we spend too much time memmove()ing data, because of the "left-justify"
logic in pqCheckInBufferSpace().
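
As one way to see the effect of the stdio half of that, a quick
experiment (an illustration only, not a proposed patch) is simply to
give pg_basebackup's output FILE a larger buffer before any data is
written to it:

#include <stdio.h>
#include <stdlib.h>

/*
 * Experiment only: enlarge the stdio buffer used for tar output to
 * 64kB, matching the buffer size in the numbers above.  setvbuf()
 * must be called before any I/O is done on the stream, and the
 * buffer must stay valid for the stream's lifetime, so it is
 * intentionally never freed.
 */
static void
use_large_stdio_buffer(FILE *tarfile)
{
	char	   *buf = malloc(65536);

	if (buf != NULL)
		setvbuf(tarfile, buf, _IOFBF, 65536);
}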


- Andres




Re: Remove non-fast promotion Re: Should we remove a fallback promotion? take 2

2020-04-21 Thread Fujii Masao




On 2020/04/21 14:54, Michael Paquier wrote:

On Tue, Apr 21, 2020 at 02:27:20PM +0900, Fujii Masao wrote:

On 2020/04/21 10:59, Michael Paquier wrote:

With your patch, this code
now means that in order to finish recovery you need to send SIGUSR2 to
the startup process *and* to create the promote signal file.


Yes, but isn't this the same as the way to trigger fast promotion in HEAD?


Yep, but my point is that some users who have been relying only on
SIGUSR2 sent to the startup process for a promotion may be surprised
to see that doing the same operation does not trigger a promotion
anymore.


Yeah, but that's not documented. So I don't think that we need to keep
backward compatibility for that.

Also in that case, non-fast promotion is triggered. Since my patch
tries to remove non-fast promotion, it's intentional to prevent them
from doing that. But do you think that we should not drop it because
there are still some users of it?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




forgotten initialization of a variable

2020-04-21 Thread Kyotaro Horiguchi
Hello.

The commit a7e8ece41c adds a new member restoreCommand to
XLogPageReadPrivate. readOneRecord doesn't make use of it but forgets
to set it to NULL. That can lead to illegal pointer access.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 2260cf859ffa570639fd0b04cc94540a937e6042 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Tue, 21 Apr 2020 14:15:39 +0900
Subject: [PATCH] Properly initialize a variable.

Commit a7e8ece41c adds a new member restoreCommand to
XLogPageReadPrivate, but forgets to initialize it to NULL in
readOneRecord. That leads to illegal pointer access in
SimpleXLogPageRead.
---
 src/bin/pg_rewind/parsexlog.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index 14a5db5433..542160c493 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -115,6 +115,7 @@ readOneRecord(const char *datadir, XLogRecPtr ptr, int tliIndex)
 	XLogRecPtr	endptr;
 
 	private.tliIndex = tliIndex;
+	private.restoreCommand = NULL;
 	xlogreader = XLogReaderAllocate(WalSegSz, datadir, ,
 	);
 	if (xlogreader == NULL)
-- 
2.18.2