Re: [PATCH] Add extra statistics to explain for Nested Loop

2020-10-16 Thread Pavel Stehule
On Sat, Oct 17, 2020 at 6:26, Julien Rouhaud wrote:

> On Sat, Oct 17, 2020 at 12:15 PM Pavel Stehule 
> wrote:
> >
> > On Sat, Oct 17, 2020 at 0:11, Anastasia Lubennikova <a.lubennik...@postgrespro.ru> wrote:
> >>
> >> On 16.10.2020 12:07, Julien Rouhaud wrote:
> >>
> >> On Fri, Oct 16, 2020 at 16:12, Pavel Stehule wrote:
> >>>
> >>>
> >>>
> >>> On Fri, Oct 16, 2020 at 9:43,  wrote:
> 
>  Hi, hackers.
>  For some distributions of data in tables, different loops in nested loop
>  joins can take different amounts of time and process different numbers of
>  entries. That makes the average statistics returned by EXPLAIN ANALYZE not
>  very useful for the DBA.
>  To fix this, here is a patch that adds printing of min and max statistics
>  for time and rows across all loops in a Nested Loop to EXPLAIN ANALYSE.
>  Please don't hesitate to share any thoughts on this topic!
> >>>
> >>>
> >>> +1
> >>>
> >>> This is a great feature - the current limited format can sometimes be pretty messy.
> >>
> >>
> >> +1, this can be very handy!
> >>
> >> Cool.
> >> I have added your patch to the commitfest, so it won't get lost.
>
> Thanks!  I'll also try to review it next week.
>
> >> https://commitfest.postgresql.org/30/2765/
> >>
> >> I will review the code next week.  Unfortunately, I cannot give any
> feedback about usability of this feature.
> >>
> >> User visible change is:
> >>
> >> -   ->  Nested Loop (actual rows=N loops=N)
> >> +  ->  Nested Loop (actual min_rows=0 rows=0 max_rows=0
> loops=2)
> >
> >
> > This interface is ok - there is not too much space for creativity.
>
> Yes I also think it's ok. We should also consider usability for tools
> like explain.depesz.com, I don't know if the current output is best.
> I'm adding Depesz and Pierre, who are both working on this kind of
> tool for additional input.
>
> > I can imagine displaying variance or average - but I am afraid of very bad performance impacts.
>
> The original counter (rows here) is already an average right?
> Variance could be nice too.  Instrumentation will already spam
> gettimeofday() calls for nested loops, I don't think that computing
> variance would add that much overhead?
>

It is not a problem to write a benchmark for the worst case and test it.
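
To make the proposal concrete, output with the patch applied would look roughly
like the following (a hand-written, hypothetical plan fragment that just follows
the format quoted above; the plan shape and numbers are invented):

   ->  Nested Loop (actual min_rows=8 rows=250 max_rows=1024 loops=4)
         ->  Index Scan using b_pkey on b (actual rows=1 loops=4)
         ->  Index Scan using c_b_id_idx on c (actual rows=250 loops=4)

The min/max figures expose the skew across loops that the per-loop average in
"rows" hides, which is exactly the information the original poster is after.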


Re: [PATCH] Add extra statistics to explain for Nested Loop

2020-10-16 Thread Julien Rouhaud
On Sat, Oct 17, 2020 at 12:15 PM Pavel Stehule  wrote:
>
> On Sat, Oct 17, 2020 at 0:11, Anastasia Lubennikova
>  wrote:
>>
>> On 16.10.2020 12:07, Julien Rouhaud wrote:
>>
>> On Fri, Oct 16, 2020 at 16:12, Pavel Stehule wrote:
>>>
>>>
>>>
>>> On Fri, Oct 16, 2020 at 9:43,  wrote:

 Hi, hackers.
 For some distributions of data in tables, different loops in nested loop
 joins can take different amounts of time and process different numbers of
 entries. That makes the average statistics returned by EXPLAIN ANALYZE not
 very useful for the DBA.
 To fix this, here is a patch that adds printing of min and max statistics
 for time and rows across all loops in a Nested Loop to EXPLAIN ANALYSE.
 Please don't hesitate to share any thoughts on this topic!
>>>
>>>
>>> +1
>>>
>>> This is a great feature - the current limited format can sometimes be pretty messy.
>>
>>
>> +1, this can be very handy!
>>
>> Cool.
>> I have added your patch to the commitfest, so it won't get lost.

Thanks!  I'll also try to review it next week.

>> https://commitfest.postgresql.org/30/2765/
>>
>> I will review the code next week.  Unfortunately, I cannot give any feedback 
>> about usability of this feature.
>>
>> User visible change is:
>>
>> -   ->  Nested Loop (actual rows=N loops=N)
>> +  ->  Nested Loop (actual min_rows=0 rows=0 max_rows=0 loops=2)
>
>
> This interface is ok - there is not too much space for creativity.

Yes I also think it's ok. We should also consider usability for tools
like explain.depesz.com, I don't know if the current output is best.
I'm adding Depesz and Pierre, who are both working on this kind of
tool for additional input.

> I can imagine displaying variance or average - but I am afraid of very bad
> performance impacts.

The original counter (rows here) is already an average right?
Variance could be nice too.  Instrumentation will already spam
gettimeofday() calls for nested loops, I don't think that computing
variance would add that much overhead?




Re: [PATCH] Add extra statistics to explain for Nested Loop

2020-10-16 Thread Pavel Stehule
On Sat, Oct 17, 2020 at 0:11, Anastasia Lubennikova <
a.lubennik...@postgrespro.ru> wrote:

> On 16.10.2020 12:07, Julien Rouhaud wrote:
>
> On Fri, Oct 16, 2020 at 16:12, Pavel Stehule wrote:
>
>>
>>
>> On Fri, Oct 16, 2020 at 9:43,  wrote:
>>
>>> Hi, hackers.
>>> For some distributions of data in tables, different loops in nested loop
>>> joins can take different amounts of time and process different numbers of
>>> entries. That makes the average statistics returned by EXPLAIN ANALYZE not
>>> very useful for the DBA.
>>> To fix this, here is a patch that adds printing of min and max statistics
>>> for time and rows across all loops in a Nested Loop to EXPLAIN ANALYSE.
>>> Please don't hesitate to share any thoughts on this topic!
>>>
>>
>> +1
>>
>> This is a great feature - the current limited format can sometimes be pretty messy.
>>
>
> +1, this can be very handy!
>
>> Cool.
> I have added your patch to the commitfest, so it won't get lost.
> https://commitfest.postgresql.org/30/2765/
>
> I will review the code next week.  Unfortunately, I cannot give any
> feedback about usability of this feature.
>
> User visible change is:
>
> -   ->  Nested Loop (actual rows=N loops=N)
> +  ->  Nested Loop (actual min_rows=0 rows=0 max_rows=0
> loops=2)
>

This interface is ok - there is not too much space for creativity. I can
imagine displaying variance or average - but I am afraid of very bad
performance impacts.

Regards

Pavel

>
> Pavel, Julien, could you please say if it looks good?
>
> --
> Anastasia Lubennikova
> Postgres Professional: http://www.postgrespro.com
> The Russian Postgres Company
>
>


Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?

2020-10-16 Thread Julien Rouhaud
On Fri, Oct 16, 2020 at 11:04 PM Bruce Momjian  wrote:
>
> On Thu, Oct 15, 2020 at 11:41:23AM +0800, Julien Rouhaud wrote:
> > On Wed, Oct 14, 2020 at 10:40 PM Bruce Momjian  wrote:
> > > There is that, and log_line_prefix, which I can imagine being useful.
> > > My point is that if the queryid is visible, there should be a reason it
> > > defaults to show empty.
> >
> > I did some naive benchmarking.  Using a custom pgbench script with this 
> > query:
> >
> > SELECT *
> > FROM pg_class c
> > JOIN pg_attribute a ON a.attrelid = c.oid
> > ORDER BY 1 DESC
> > LIMIT 1;
> >
> > I can see around 2% overhead (this query is reported with ~ 3ms
> > latency average).  Adding a few joins, overhead goes down to 1%.
>
> That number is too high to enable this by default.  I suggest we either
> improve the performance of this, or clearly document that you have to
> enable the hash computation to see the pg_stat_activity and
> log_line_prefix fields.

I realize that I didn't update the documentation part to reflect the
new GUC.  I'll fix that and add more warnings about the requirements
to have values displayed in pg_stat_activity and log_line_prefix.
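
For orientation, the user-facing piece being documented would be used roughly
as follows (a hypothetical sketch; the GUC name, the pg_stat_activity column
name, and the log_line_prefix escape are assumptions here, since their final
spelling was still being discussed):

-- in postgresql.conf (names assumed, not final):
--   compute_query_id = on
--   log_line_prefix = '%m [%p] queryid=%Q '
SELECT pid, query_id, state, left(query, 60) AS query
FROM pg_stat_activity
WHERE state = 'active';

Without the hash computation enabled, the query-id column and the matching log
field simply stay empty, which is what the added documentation warnings need to
spell out.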




Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?

2020-10-16 Thread Julien Rouhaud
On Sat, Oct 17, 2020 at 12:23 AM Tom Lane  wrote:
>
> Alvaro Herrera  writes:
> > In this case, I suppose using pg_stat_statement would require to have it
> > enabled, and it'd just not collect anything if disabled.

Yes, my idea was to be able to have pg_stat_statements enabled even if
no queryid is computed without that being a problem, and the patch I
sent should handle that properly, as pgss_store (and a few other
places) check for a non-zero queryid before doing any work.

Also, we can't have pg_stat_statements have any specific behavior
based on the new GUC, as there could alternatively be another module
that handles the queryid generation.

> Alternatively, pg_stat_statement might be able to force it on
> (applying a non-overridable PGC_INTERNAL-level setting) on load?
> Not sure if that'd be desirable or not.
>
> If the behavior of pg_stat_statement is to do nothing when it
> sees a query without the ID calculated (which I guess it'd have to)

Yes that's what it does.

> then there's a potential security issue if the GUC is USERSET level:
> a user could hide her queries from pg_stat_statement by turning the
> GUC off.  So this line of thought suggests the GUC needs to be at
> least SUSET, and maybe higher ... doesn't pg_stat_statement need it
> to have the same value cluster-wide?

Well, I don't think that there's any guarantee that pg_stat_statements
will display all activity that has been run, since there's a limited
number of (userid, dbid, queryid) entries that can be stored, but I agree
that allowing a random user to hide their activity isn't nice.  Note that
I defined the GUC as SUSET, but maybe it should be SIGHUP?
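
To spell out the concern about a USERSET GUC, the abuse would look like this
(a hypothetical sketch; the GUC name is assumed, and the pgss_store() behavior
of skipping a zero queryid is as described above):

SET compute_query_id = off;   -- only possible per-session if the GUC were USERSET
SELECT * FROM payroll;        -- hypothetical query the user wants to hide
-- pgss_store() sees queryid = 0 for this statement and does nothing, so the
-- query never shows up in pg_stat_statements for this session.

Making the GUC SUSET (or SIGHUP) closes that hole, at the cost of per-session
flexibility.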




Re: CREATE TABLE .. PARTITION OF fails to preserve tgenabled for inherited row triggers

2020-10-16 Thread Alvaro Herrera
On 2020-Oct-16, Alvaro Herrera wrote:

> I also just noticed that ALTER TABLE ONLY recurses to children, which it
> should not.

Apparently I wrote (bogus) bespoke code to handle recursion in
EnableDisableTrigger instead of using ATSimpleRecursion.  This patch
seems to fix this problem.

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 511f015a86..c8d6f78da2 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -4321,6 +4321,7 @@ ATPrepCmd(List **wqueue, Relation rel, AlterTableCmd *cmd,
 		case AT_DisableTrigAll:
 		case AT_DisableTrigUser:
 			ATSimplePermissions(rel, ATT_TABLE | ATT_FOREIGN_TABLE);
+			ATSimpleRecursion(wqueue, rel, cmd, recurse, lockmode, context);
 			pass = AT_PASS_MISC;
 			break;
 		case AT_EnableRule:		/* ENABLE/DISABLE RULE variants */
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index 3b4fbdadf4..b28b49bdfa 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -1530,27 +1530,6 @@ EnableDisableTrigger(Relation rel, const char *tgname,
 
 			heap_freetuple(newtup);
 
-			/*
-			 * When altering FOR EACH ROW triggers on a partitioned table, do
-			 * the same on the partitions as well.
-			 */
-			if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE &&
-				(TRIGGER_FOR_ROW(oldtrig->tgtype)))
-			{
-				PartitionDesc partdesc = RelationGetPartitionDesc(rel);
-				int			i;
-
-				for (i = 0; i < partdesc->nparts; i++)
-				{
-					Relation	part;
-
-					part = relation_open(partdesc->oids[i], lockmode);
-					EnableDisableTrigger(part, NameStr(oldtrig->tgname),
-										 fires_when, skip_system, lockmode);
-					table_close(part, NoLock);	/* keep lock till commit */
-				}
-			}
-
 			changed = true;
 		}
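
For reference, the misbehavior being fixed can be reproduced with a short
script along these lines (a hypothetical sketch with made-up object names;
before the fix above, the ONLY form still recursed into the partition):

create table parent (a int) partition by list (a);
create table child partition of parent for values in (1);
create function noop_tg() returns trigger language plpgsql
  as $$ begin return new; end $$;
create trigger trg after insert on parent for each row execute function noop_tg();

alter table only parent disable trigger trg;   -- should affect only "parent"
select tgrelid::regclass AS rel, tgenabled
  from pg_trigger where tgname = 'trg';        -- with the bug, "child" shows 'D' too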
 


Re: Internal key management system

2020-10-16 Thread Bruce Momjian
On Fri, Oct 16, 2020 at 04:56:47PM -0400, Tom Lane wrote:
> Bruce Momjian  writes:
> > Second, in testing starting/stopping the server, pg_ctl doesn't allow
> > the cluster_passphrase_command to read from /dev/tty, which I think is a
> > requirement because the command could likely require a user-supplied
> > unlock key, even if that is not the actual passphrase, just like ssl
> > keys.  This is because pg_ctl calls setsid() just before calling execl()
> > to start the server, and setsid() disassociates itself from the
> > controlling terminal.  I think the fix is to remove setsid() from pg_ctl
> > and add a postmaster flag to call setsid() after it has potentially
> > called cluster_passphrase_command, and pg_ctl would use that flag.
> 
> We discussed that and rejected it in the thread leading up to
> bb24439ce [1].  The primary problem being that it's not very clear
> when the postmaster should daemonize itself, and later generally
> isn't better.  I doubt that this proposal is doing anything to
> clarify that situation.

Agreed.  No reason to destabilize the postmaster for this.  What about
having pg_ctl open /dev/tty and then passing in an open file descriptor
that is a copy of /dev/tty, which can be closed by the postmaster after
running the cluster_passphrase_command?  I just tested this and it worked.

I am thinking we would pass the file descriptor number to the postmaster
via a command-line argument.  Ideally we would pass in the device name
of /dev/tty, but I don't know of a good way to do that.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Stats collector's idx_blks_hit value is highly misleading in practice

2020-10-16 Thread Peter Geoghegan
It occurs to me that statistics collector stats such as
pg_statio_*_tables.idx_blks_hit are highly misleading in practice
because they fail to take account of the difference between internal
pages and leaf pages in B-Tree indexes. These two types of pages are
in fundamentally different categories, and I think that failing to
recognize that at the level of these system views makes them much less
useful. Somebody should probably write a patch that makes this
difference clear from the system views. Possibly by using some
generalized notion of "record" pages instead of leaf pages, and
"metadata" pages instead of internal pages. That would even work with
hash indexes, I think.

Consider the following example, which is based on a standard nbtree
index, but could work in almost the same way with other index access
methods:

We have a pgbench_accounts pkey after initialization by pgbench at
scale 1500. It has 409,837 leaf pages and 1,451 internal pages,
meaning that about one third of one percent of all pages in the index
are internal pages. Occasionally, with indexes on large text strings
we might notice that as many as 1% of all index pages are internal
pages, but that's very much on the high side. Generally speaking,
we're virtually guaranteed to have *all* internal pages in
shared_buffers once a steady state has been reached. Once the cache
warms up, point lookups (like the queries pgbench performs) will only
have to access one leaf page at most, which amounts to only one I/O at
most. (This asymmetry is the main reason why B-Trees are generally
very effective when buffered in a buffer cache.)

If we run the pgbench queries against this database/example index
we'll find that we have to access 4 index pages per query execution --
the root, two additional internal pages, plus a leaf page. Based on
the reasonable assumptions I'm making, 3 out of 4 of those pages will
be hits when steady state is reached with pgbench's SELECT-only
workload, regardless of how large shared_buffers is or how bloated the
index is (we only need 1451 buffers for that, and those are bound to
get hot quickly).

The overall effect is idx_blks_hit changes over time in a way that
makes no sense -- even to an expert. Let's say we start with this
entire 3213 MB pgbench index in shared_buffers. We should only get
increments in idx_blks_hit, never increments in idx_blks_read - that
much makes sense. If we then iteratively shrink shared_buffers (or
equivalently, make the index grow without adding a new level), the
proportion of page accesses that increment idx_blks_read (rather than
incrementing idx_blks_hit) goes up roughly linearly as misses increase
linearly - which also makes sense. But here is the silly part: we
cannot really have a hit rate of less than 75% if you compare
idx_blks_hit to idx_blks_read, unless and until we can barely even fit
1% of the index in memory (at which point it's hard to distinguish
from noise). So if we naively consume the current view we'll see a hit
rate that starts at 100%, and very slowly shrinks to 75%, which is
where we bottom out (more or less, roughly speaking). This behavior
seems pretty hard to defend to me.
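
For concreteness, the "hit rate" being criticized is the kind of ratio people
routinely compute straight from the view, e.g. (a generic catalog query, not
taken from this message):

SELECT relname, indexrelname,
       round(idx_blks_hit::numeric
             / nullif(idx_blks_hit + idx_blks_read, 0), 4) AS hit_rate
FROM pg_statio_user_indexes
WHERE relname = 'pgbench_accounts';

By the reasoning above, roughly 3 of the 4 page accesses per point lookup are
internal pages that are effectively always cached, so this number can hardly
drop below ~0.75 no matter how poorly the leaf pages fit in memory.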

If somebody fixed this by putting internal pages into their own bucket
in the system view, then motivated users would quickly learn that
internal page stats aren't really useful -- they are only included for
completeness. They're such a small contributor to the overall hit rate
that they can simply be ignored completely. The thing that users ought
to focus on is leaf page hit rate. Now index hit rate (by which I mean
leaf page hit rate) actually makes sense. Note that Heroku promoted
simple heuristics like this for many years.

I suppose that a change like this could end up affecting other things,
such as EXPLAIN ANALYZE statistics. OTOH we only break out index pages
separately for bitmap scans at the moment, so maybe it could be fairly
well targeted. And, maybe this is unappealing given the current
statistics collector limitations. I'm not volunteering to work on it
right now, but it would be nice to fix this. Please don't wait for me
to do it.

-- 
Peter Geoghegan




Re: upcoming API changes for LLVM 12

2020-10-16 Thread Alvaro Herrera
On 2020-Oct-16, Andres Freund wrote:

> A related question is whether it'd be time to prune the oldest supported
> LLVM version. 3.9.0 was released 2016-08-31 (and 3.9.1, the only point
> release, was 2016-12-13). There's currently no *pressing* reason to
> reduce it, but it is the cause of few #ifdefs - but more importantly it
> increases the test matrix.

Is there a matrix of LLVM versions supported by live distros?  It sounds
like pruning away 3.9 from branch master would be reasonable enough;
OTOH looking at the current LLVM support code in Postgres it doesn't
look like you would actually save all that much.  Maybe the picture
changes with things you're doing now, but it's not evident from what's
in the tree now.

> I'm inclined to just have a deterministic policy that we apply around
> release time going forward. E.g. don't support versions that are newer
> than the newest available LLVM version in the second newest
> long-term-supported distribution release of RHEL, Ubuntu, Debian?

It seems fair to think that new Postgres releases should be put in
production only with the newest LTS release of each OS -- no need to go
back to the second newest.  But I think we should use such a criterion to
drive discussion rather than as a battle axe chopping stuff away.




Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

2020-10-16 Thread Alvaro Herrera
On 2020-Sep-24, Amit Langote wrote:

Hello Amit,

> Sorry I totally failed to see the v2 you had posted and a couple of
> other emails where you mentioned the issues I brought up.

No worries, I appreciate you reviewing this.

> However, I am a bit curious about including detached partitions in
> some cases while not in others, which can result in a (to me)
> surprising behavior as follows:
[ snip ]
> begin;
> insert into foo values (2);  -- ok
> select * from foo;
> select * from foo;  -- ?!
>  a | b
> ---+---
> (0 rows)
> 
> Maybe, it's fine to just always exclude detached partitions, although
> perhaps I am missing some corner cases that you have thought of?

Well, this particular case can be fixed by changing
ExecInitPartitionDispatchInfo specifically, from including detached
partitions to excluding them, as in the attached version.  Given your
example I think the case is fairly good that they should be excluded
there.  I can't think of a case that this change would break.

However I'm not sure that excluding them everywhere is sensible.  There
are currently two cases where they are included (search for calls to
CreatePartitionDirectory if you're curious).  One is snapshot-isolation
transactions (repeatable read and serializable) in
set_relation_partition_info, per the example from Hao Wu.  If we simply
exclude detached partitions there, repeatable read no longer works
properly; rows will just go missing for no apparent reason.  I don't
think this is acceptable.
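
As an illustration, the failure mode under repeatable read would look roughly
like this (a hypothetical two-session sketch, not Hao Wu's original example):

-- session 1
begin isolation level repeatable read;
select count(*) from foo;        -- the snapshot still sees the rows in partition foo_1

-- session 2
alter table foo detach partition foo_1 concurrently;

-- session 1, same snapshot: if planning now silently excluded detached
-- partitions, foo_1's rows would vanish in the middle of the transaction
select count(*) from foo;
commit;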

The other case is ExecCreatePartitionPruneState().  The whole point of
including detached partitions here is to make them available for queries
that were planned before the detach and executed after the detach.  My
fear is that the pruning plan will contain references (from planner) to
partitions that the executor doesn't know about.  If there are extra
partitions at the executor side, it shouldn't harm anything (and it
shouldn't change query results); but I'm not sure that things will work
OK if partitions seen by the planner disappear from under the executor.

I'm posting this version just as a fresh rebase -- it is not yet
intended for commit.  I haven't touched the constraint stuff.  I still
think that that can be removed before commit, which is a simple change;
but even if not, the problem with the duplicated constraints should be
easy to fix.

Again, my thanks for reviewing.
>From 2cf81404fb4fa4e85665d6695361830d015acac0 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Mon, 13 Jul 2020 20:15:30 -0400
Subject: [PATCH v3 1/2] Let ALTER TABLE exec routines deal with the relation

This means that ATExecCmd relies on AlteredRelationInfo->rel instead of
keeping the relation as a local variable; this is useful if the
subcommand needs to modify the relation internally.  For example, a
subcommand that internally commits its transaction needs this.
---
 src/backend/commands/tablecmds.c | 19 +--
 1 file changed, 13 insertions(+), 6 deletions(-)

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 511f015a86..2784fb11f9 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -157,6 +157,8 @@ typedef struct AlteredTableInfo
 	Oid			relid;			/* Relation to work on */
 	char		relkind;		/* Its relkind */
 	TupleDesc	oldDesc;		/* Pre-modification tuple descriptor */
+	/* Transiently set during Phase 2, normally set to NULL */
+	Relation	rel;
 	/* Information saved by Phase 1 for Phase 2: */
 	List	   *subcmds[AT_NUM_PASSES]; /* Lists of AlterTableCmd */
 	/* Information saved by Phases 1/2 for Phase 3: */
@@ -354,7 +356,7 @@ static void ATPrepCmd(List **wqueue, Relation rel, AlterTableCmd *cmd,
 	  AlterTableUtilityContext *context);
 static void ATRewriteCatalogs(List **wqueue, LOCKMODE lockmode,
 			  AlterTableUtilityContext *context);
-static void ATExecCmd(List **wqueue, AlteredTableInfo *tab, Relation rel,
+static void ATExecCmd(List **wqueue, AlteredTableInfo *tab,
 	  AlterTableCmd *cmd, LOCKMODE lockmode, int cur_pass,
 	  AlterTableUtilityContext *context);
 static AlterTableCmd *ATParseTransformCmd(List **wqueue, AlteredTableInfo *tab,
@@ -4393,7 +4395,6 @@ ATRewriteCatalogs(List **wqueue, LOCKMODE lockmode,
 		{
 			AlteredTableInfo *tab = (AlteredTableInfo *) lfirst(ltab);
 			List	   *subcmds = tab->subcmds[pass];
-			Relation	rel;
 			ListCell   *lcmd;
 
 			if (subcmds == NIL)
@@ -4402,10 +4403,10 @@ ATRewriteCatalogs(List **wqueue, LOCKMODE lockmode,
 			/*
 			 * Appropriate lock was obtained by phase 1, needn't get it again
 			 */
-			rel = relation_open(tab->relid, NoLock);
+			tab->rel = relation_open(tab->relid, NoLock);
 
 			foreach(lcmd, subcmds)
-ATExecCmd(wqueue, tab, rel,
+ATExecCmd(wqueue, tab,
 		  castNode(AlterTableCmd, lfirst(lcmd)),
 		  lockmode, pass, context);
 
@@ -4417,7 +4418,11 @@ ATRewriteCatalogs(List **wqueue, LOCKMODE lockmode,
 			if (pass == AT_PASS_ALTER_TYPE)
 

Re: [PATCH] Add extra statistics to explain for Nested Loop

2020-10-16 Thread Anastasia Lubennikova

On 16.10.2020 12:07, Julien Rouhaud wrote:
On Fri, Oct 16, 2020 at 16:12, Pavel Stehule wrote:




On Fri, Oct 16, 2020 at 9:43, <e.sokol...@postgrespro.ru> wrote:

Hi, hackers.
For some distributions of data in tables, different loops in nested loop
joins can take different amounts of time and process different numbers of
entries. That makes the average statistics returned by EXPLAIN ANALYZE not
very useful for the DBA.
To fix this, here is a patch that adds printing of min and max statistics
for time and rows across all loops in a Nested Loop to EXPLAIN ANALYSE.
Please don't hesitate to share any thoughts on this topic!


+1

This is a great feature - the current limited format can sometimes be pretty messy.


+1, this can be very handy!


Cool.
I have added your patch to the commitfest, so it won't get lost.
https://commitfest.postgresql.org/30/2765/

I will review the code next week.  Unfortunately, I cannot give any 
feedback about usability of this feature.


User visible change is:

-   ->  Nested Loop (actual rows=N loops=N)
+  ->  Nested Loop (actual min_rows=0 rows=0 max_rows=0 loops=2)

Pavel, Julien, could you please say if it looks good?

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: upcoming API changes for LLVM 12

2020-10-16 Thread Tom Lane
Andres Freund  writes:
> A related question is whether it'd be time to prune the oldest supported
> LLVM version. 3.9.0 was released 2016-08-31 (and 3.9.1, the only point
> release, was 2016-12-13). There's currently no *pressing* reason to
> reduce it, but it is the cause of few #ifdefs - but more importantly it
> increases the test matrix.

> I'm inclined to just have a deterministic policy that we apply around
> release time going forward. E.g. don't support versions that are newer
> than the newest available LLVM version in the second newest
> long-term-supported distribution release of RHEL, Ubuntu, Debian?

Meh.  I do not think these should be mechanistic one-size-fits-all
decisions.  A lot hinges on just how messy it is to continue support
for a given tool.  Moreover, the policy you propose above is
completely out of line with our approach to every other toolchain
we use.

I'd rather see an approach along the lines of "it's time to drop
support for LLVM version X because it can't do Y", rather than
"... because Z amount of time has passed".

regards, tom lane




Re: should INSERT SELECT use a BulkInsertState?

2020-10-16 Thread Justin Pryzby
On Sat, Sep 19, 2020 at 08:32:15AM -0500, Justin Pryzby wrote:
> On Sun, Jul 12, 2020 at 08:57:00PM -0500, Justin Pryzby wrote:
> > On Thu, Jun 04, 2020 at 10:30:47AM -0700, Andres Freund wrote:
> > > On 2020-05-08 02:25:45 -0500, Justin Pryzby wrote:
> > > > Seems to me it should, at least conditionally.  At least if there's a 
> > > > function
> > > > scan or a relation or ..
> > > 
> > > Well, the problem is that this can cause very very significant
> > > regressions. As in 10x slower or more. The ringbuffer can cause constant
> > > XLogFlush() calls (due to the lsn interlock), and the eviction from
> > > shared_buffers (regardless of actual available) will mean future vacuums
> > > etc will be much slower.  I think this is likely to cause pretty
> > > widespread regressions on upgrades.
> > > 
> > > Now, it sucks that we have this problem in the general facility that's
> > > supposed to be used for this kind of bulk operation. But I don't really
> > > see it realistic as expanding use of bulk insert strategies unless we
> > > have some more fundamental fixes.
> > 
> > I made this conditional on BEGIN BULK/SET bulk, so I'll solicit comments on 
> > that.
> 
> @cfbot: rebased

again

-- 
Justin
>From 7f856d0597a34a98be848c337612f1671497f52f Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Fri, 8 May 2020 02:17:32 -0500
Subject: [PATCH v4] Allow INSERT SELECT to use a BulkInsertState

---
 src/backend/executor/nodeModifyTable.c | 22 --
 src/backend/parser/gram.y  |  7 ++-
 src/backend/tcop/utility.c |  4 
 src/backend/utils/misc/guc.c   | 11 +++
 src/include/executor/nodeModifyTable.h |  2 ++
 src/include/nodes/execnodes.h  |  2 ++
 src/include/parser/kwlist.h|  1 +
 7 files changed, 46 insertions(+), 3 deletions(-)

diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 0c055ed408..d19d6d1a85 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -76,6 +76,8 @@ static ResultRelInfo *getTargetResultRelInfo(ModifyTableState *node);
 static void ExecSetupChildParentMapForSubplan(ModifyTableState *mtstate);
 static TupleConversionMap *tupconv_map_for_subplan(ModifyTableState *node,
    int whichplan);
+/* guc */
+bool insert_in_bulk = false;
 
 /*
  * Verify that the tuples to be produced by INSERT or UPDATE match the
@@ -598,7 +600,7 @@ ExecInsert(ModifyTableState *mtstate,
 			table_tuple_insert_speculative(resultRelationDesc, slot,
 		   estate->es_output_cid,
 		   0,
-		   NULL,
+		   NULL, /* Bulk insert not supported */
 		   specToken);
 
 			/* insert index entries for tuple */
@@ -638,7 +640,7 @@ ExecInsert(ModifyTableState *mtstate,
 			/* insert the tuple normally */
 			table_tuple_insert(resultRelationDesc, slot,
 			   estate->es_output_cid,
-			   0, NULL);
+			   0, mtstate->bistate);
 
 			/* insert index entries for tuple */
 			if (resultRelInfo->ri_NumIndices > 0)
@@ -2365,6 +2367,16 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 
 	mtstate->mt_arowmarks = (List **) palloc0(sizeof(List *) * nplans);
 	mtstate->mt_nplans = nplans;
+	mtstate->bistate = NULL;
+	if (operation == CMD_INSERT &&
+			node->onConflictAction != ONCONFLICT_UPDATE &&
+			node->rootRelation == 0)
+	{
+		Assert(nplans == 1);
+
+		if (insert_in_bulk)
+			mtstate->bistate = GetBulkInsertState();
+	}
 
 	/* set up epqstate with dummy subplan data for the moment */
 	EvalPlanQualInit(&mtstate->mt_epqstate, estate, NULL, NIL, node->epqParam);
@@ -2805,6 +2817,12 @@ ExecEndModifyTable(ModifyTableState *node)
 		   resultRelInfo);
 	}
 
+	if (node->bistate)
+	{
+		FreeBulkInsertState(node->bistate);
+		table_finish_bulk_insert((getTargetResultRelInfo(node))->ri_RelationDesc, 0);
+	}
+
 	/*
 	 * Close all the partitioned tables, leaf partitions, and their indices
 	 * and release the slot used for tuple routing, if set.
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 480d168346..b87e31e36a 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -631,7 +631,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 	ASSERTION ASSIGNMENT ASYMMETRIC AT ATTACH ATTRIBUTE AUTHORIZATION
 
 	BACKWARD BEFORE BEGIN_P BETWEEN BIGINT BINARY BIT
-	BOOLEAN_P BOTH BY
+	BOOLEAN_P BOTH BY BULK
 
 	CACHE CALL CALLED CASCADE CASCADED CASE CAST CATALOG_P CHAIN CHAR_P
 	CHARACTER CHARACTERISTICS CHECK CHECKPOINT CLASS CLOSE
@@ -9873,6 +9873,9 @@ transaction_mode_item:
 			| NOT DEFERRABLE
 	{ $$ = makeDefElem("transaction_deferrable",
 	   makeIntConst(false, @1), @1); }
+			| BULK
+	{ $$ = makeDefElem("bulk",
+	   makeIntConst(true, @1), @1); }
 		;
 
 /* Syntax with commas is SQL-spec, without commas is Postgres historical */
@@ -15079,6 +15082,7 @@ unreserved_keyword:
 			| BACKWARD
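
The patch excerpt is cut off by the digest at this point, but the user-visible
knobs it adds would be exercised roughly like this (a sketch based on the
"BEGIN BULK/SET bulk" description and the grammar/GUC hunks above; the exact
spelling is an assumption and may change before commit):

SET bulk = on;                                     -- GUC spelling taken from the description above
INSERT INTO target_table SELECT * FROM staging;    -- hypothetical tables

BEGIN BULK;                                        -- transaction_mode_item added in gram.y above
INSERT INTO target_table SELECT * FROM staging;
COMMIT;

In either case ExecInitModifyTable() only sets up a BulkInsertState for a plain
INSERT (no ON CONFLICT DO UPDATE, no partitioned target), which is the condition
visible in the nodeModifyTable.c hunk.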
 

Re: Deleting older versions in unique indexes to avoid page splits

2020-10-16 Thread Peter Geoghegan
On Fri, Oct 16, 2020 at 1:00 PM Victor Yegorov  wrote:
> I really like these results, great work!

Thanks Victor!

> I'm also wondering how IO numbers changed due to these improvements, 
> shouldn't be difficult to look into.

Here is pg_statio_user_indexes for the patch, for the same run:

 schemaname |     relname      |       indexrelname        | idx_blks_read |  idx_blks_hit
------------+------------------+---------------------------+---------------+---------------
 public     | pgbench_accounts | aid_pkey_include_abalance |    12,828,736 |   534,819,826
 public     | pgbench_accounts | one                       |    12,750,275 |   534,486,742
 public     | pgbench_accounts | two                       |     2,474,893 | 2,216,047,568
(3 rows)

And for master:

 schemaname |     relname      |       indexrelname        | idx_blks_read |  idx_blks_hit
------------+------------------+---------------------------+---------------+---------------
 public     | pgbench_accounts | aid_pkey_include_abalance |    29,526,568 |   292,705,432
 public     | pgbench_accounts | one                       |    28,239,187 |   293,164,160
 public     | pgbench_accounts | two                       |     6,505,615 | 1,318,164,692
(3 rows)

Here is pg_statio_user_tables for the patch:

 schemaname |     relname      | heap_blks_read | heap_blks_hit | idx_blks_read |  idx_blks_hit | toast_blks_read | toast_blks_hit | tidx_blks_read | tidx_blks_hit
------------+------------------+----------------+---------------+---------------+---------------+-----------------+----------------+----------------+---------------
 public     | pgbench_accounts |    123,195,496 |   696,805,485 |    28,053,904 | 3,285,354,136 |                 |                |                |
 public     | pgbench_branches |             11 |         1,553 |               |               |                 |                |                |
 public     | pgbench_history  |              0 |             0 |               |               |                 |                |                |
 public     | pgbench_tellers  |             86 |        15,416 |               |               |                 |                |                |
(4 rows)

And the pg_statio_user_tables for master:

 schemaname |     relname      | heap_blks_read | heap_blks_hit | idx_blks_read |  idx_blks_hit | toast_blks_read | toast_blks_hit | tidx_blks_read | tidx_blks_hit
------------+------------------+----------------+---------------+---------------+---------------+-----------------+----------------+----------------+---------------
 public     | pgbench_accounts |    106,502,089 |   334,875,058 |    64,271,370 | 1,904,034,284 |                 |                |                |
 public     | pgbench_branches |             11 |         1,553 |               |               |                 |                |                |
 public     | pgbench_history  |              0 |             0 |               |               |                 |                |                |
 public     | pgbench_tellers  |             86 |        15,416 |               |               |                 |                |                |
(4 rows)

Of course, it isn't fair to make a direct comparison because we're
doing ~1.7x more work with the patch. But even still, idx_blks_read
is less than half with the patch.

BTW, the extra heap_blks_hit from the patch are not only due to the
fact that the system does more directly useful work. It's also due to
the extra garbage collection triggered in indexes. The same is
probably *not* true with heap_blks_read, though. I minimize the number
of heap pages accessed by the new cleanup mechanism each time, and
temporal locality will help a lot. I think that we delete index
entries pointing to garbage in the heap at pretty predictable
intervals. Heap pages full of LP_DEAD line pointer garbage only get
processed a few times, close together in time, after which they're
bound to either get VACUUM'd or get evicted from shared buffers.

> Peter, according to cfbot the patch no longer compiles.
> Can you send an update, please?

Attached is v3, which is rebased against the master branch as of
today. No real changes, though.

-- 
Peter Geoghegan


v3-0001-Add-delete-deduplication-to-nbtree.patch
Description: Binary data


Re: Internal key management system

2020-10-16 Thread Tom Lane
Bruce Momjian  writes:
> Second, in testing starting/stopping the server, pg_ctl doesn't allow
> the cluster_passphrase_command to read from /dev/tty, which I think is a
> requirement because the command could likely require a user-supplied
> unlock key, even if that is not the actual passphrase, just like ssl
> keys.  This is because pg_ctl calls setsid() just before calling execl()
> to start the server, and setsid() disassociates itself from the
> controlling terminal.  I think the fix is to remove setsid() from pg_ctl
> and add a postmaster flag to call setsid() after it has potentially
> called cluster_passphrase_command, and pg_ctl would use that flag.

We discussed that and rejected it in the thread leading up to
bb24439ce [1].  The primary problem being that it's not very clear
when the postmaster should daemonize itself, and later generally
isn't better.  I doubt that this proposal is doing anything to
clarify that situation.

regards, tom lane

[1] 
https://www.postgresql.org/message-id/flat/CAEET0ZH5Bf7dhZB3mYy8zZQttJrdZg_0Wwaj0o1PuuBny1JkEw%40mail.gmail.com




Re: upcoming API changes for LLVM 12

2020-10-16 Thread Andres Freund
Hi,

On 2020-10-16 10:22:57 -0400, Tom Lane wrote:
> Yeah.  As long as we're not breaking the ability to build against older
> LLVM, I can't see a reason not to apply and back-patch these changes.
> We usually want all supported PG versions to build against newer tool
> chains, and this seems to fall into that category.

Cool! I just ran that branch against 3.9 (the currently oldest supported
version), and that still works.


A related question is whether it'd be time to prune the oldest supported
LLVM version. 3.9.0 was released 2016-08-31 (and 3.9.1, the only point
release, was 2016-12-13). There's currently no *pressing* reason to
reduce it, but it is the cause of few #ifdefs - but more importantly it
increases the test matrix.

I'm inclined to just have a deterministic policy that we apply around
release time going forward. E.g. don't support versions that are newer
than the newest available LLVM version in the second newest
long-term-supported distribution release of RHEL, Ubuntu, Debian?

Regards,

Andres




Re: Internal key management system

2020-10-16 Thread Bruce Momjian
On Fri, Jul 31, 2020 at 04:06:38PM +0900, Masahiko Sawada wrote:
> > Given that the purpose of the key manager is to help TDE, discussing
> > the SQL interface part (i.e., the second patch) deviates from the
> > original purpose. I think we should discuss the design and
> > implementation of the key manager first and then other additional
> > interfaces. So I’ve attached a new version patch and removed the
> > second patch part so that we can focus on only the key manager part.
> >
> 
> Since the previous patch sets conflicts with the current HEAD, I've
> attached the rebased patch set.

I have updated the attached patch and am hoping to move this feature
forward.  The changes I made are:

*  handle merge conflicts
*  changed ssl initialization to match other places in our code
*  changed StrNCpy() to strlcpy
*  update the docs

The first three were needed to get it to compile.  I then ran some tests
using the attached shell script as my password script.  First, I found
that initdb called the script twice.  The first call worked fine, but
the second call would accept a password that didn't match the first
call.   This is because there are no keys defined, so there is nothing
for kmgr_verify_passphrase() to check for passkey verification, so it
just succeeds.   In fact, I can't figure out how to create any keys with
the patch, and pg_encrypt() is documented, but not defined anywhere.

Second, in testing starting/stopping the server, pg_ctl doesn't allow
the cluster_passphrase_command to read from /dev/tty, which I think is a
requirement because the command could likely require a user-supplied
unlock key, even if that is not the actual passphrase, just like ssl
keys.  This is because pg_ctl calls setsid() just before calling execl()
to start the server, and setsid() disassociates itself from the
controlling terminal.  I think the fix is to remove setsid() from pg_ctl
and add a postmaster flag to call setsid() after it has potentially
called cluster_passphrase_command, and pg_ctl would use that flag.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
new file mode 100644
index 2768c85..44e0c1e
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
*** COPY postgres_log FROM '/full/path/to/lo
*** 7793,7798 
--- 7793,7833 
  
 
  
+
+ Encryption Key Management
+ 
+ 
+  
+   cluster_passphrase_command (string)
+   
+cluster_passphrase_command configuration parameter
+   
+   
+   
+
+ This option specifies an external command to be invoked when a
+ passphrase for key management system needs to be obtained.
+
+
+ The command must print the passphrase to the standard
+ output and have a zero exit code.  In the parameter value,
+ %p is replaced by a prompt string.  (Write
+ %% for a literal %.)
+ Note that the prompt string will probably contain whitespace,
+ so be sure to quote its use adequately.  A single newline is
+ stripped from the end of the output if present.  The passphrase
+ must be at least 64 bytes.
+
+
+ This parameter can only be set in the
+ postgresql.conf file or on the server
+ command line.
+
+   
+  
+ 
+
+ 
 
  Client Connection Defaults
  
*** dynamic_library_path = 'C:\tools\postgre
*** 9636,9641 
--- 9671,9692 
 

   
+ 
+   
+   key_management_enabled (boolean)
+   
+Key management configuration parameter parameter
+   
+   
+   
+
+ Reports whether encryption key management
+ is enabled for this cluster.  See  for more
+ information.
+
+   
+  
  
   
data_directory_mode (integer)
diff --git a/doc/src/sgml/database-encryption.sgml b/doc/src/sgml/database-encryption.sgml
new file mode 100644
index ...db84b0d
*** a/doc/src/sgml/database-encryption.sgml
--- b/doc/src/sgml/database-encryption.sgml
***
*** 0 
--- 1,292 
+ 
+ 
+ 
+  Database Encryption
+ 
+  
+   Server Side Encryption
+  
+ 
+  
+   The purpose of database encryption is to protect the confidential data
+   stored in a database from being revealed.
+  
+ 
+  
+   Encryption Key Management
+ 
+   
+PostgreSQL supports internal
+Encryption Key Management System, which is designed
+to manage the life cycles of cryptographic keys within the
+PostgreSQL.  This includes dealing with their
+generation, storage, usage and rotation.
+   
+ 
+   
+Encryption key management system is enabled when
+PostgreSQL is built with
+--with-openssl and
+ is specified during
+initdb.  The cluster passphrase provided by the
+ 

Re: Commitfest manager 2020-11

2020-10-16 Thread Anastasia Lubennikova

On 16.10.2020 21:57, Tom Lane wrote:

Anastasia Lubennikova  writes:

On the other hand, I noticed a lot of stalled threads that haven't been
updated in months. Some of them seem to pass several CFs without any
activity at all. I believe that this is wrong for many reasons, the major
one of which IMHO is the frustration of the authors. Can we come up with
something to improve this situation?

Yeah, that's a perennial problem.  Part of the issue is just a shortage
of people --- there are always more patches than we can review and
commit in one month.  IMO, another cause is that we have a hard time
saying "no".  If a particular patch isn't too well liked, we tend to
just let it slide to the next CF rather than making the uncomfortable
decision to reject it.  If you've got thoughts about that, or any other
ways to improve the process, for sure speak up.



From a CFM perspective, we can try the following things:

- Write recaps for long-running threads, listing open questions and TODOs.
This one is my personal pain. Some threads do look scary and it is less 
likely that someone will even start a review if they have to catch up 
with a year-long discussion of 10 people.


- Mark patches from first-time contributors with some tag.
Probably, these entries are simple enough to handle faster.
It will also be a good reminder to be a bit less demanding with
beginners. See Dmitry's statistics about how many people have sent a patch
only once [1].


- Proactively ask committers whether they are going to work on the upcoming
CF and whether they will need any specific help.
Maybe we can also ask about their preferred code areas and check what is
left uncovered. It's really bad if there is no one working on, let's say,
WAL internals during the CF. TBH, I have no idea what we are going to do
with this knowledge, but I think it's better to know.


- From time to time call a council of several committers and make tough 
decisions about patches that are in discussion for too long (let's say 4 
commitfests).
Hopefully, it will be easier to reach a consensus in a "real-time" 
discussion, or we can toss a coin. This problem was raised in previous 
discussions too [2].


[1] 
https://www.postgresql.org/message-id/CA+q6zcXtg7cFwX-c+BoOwk65+jtR-sQGZ=1mqg-vgmvzuh8...@mail.gmail.com
[2] 
https://www.postgresql.org/message-id/flat/CAA8%3DA7-owFLugBVZ0JjehTZJue7brEs2qTjVyZFRDq-B%3D%2BNwNg%40mail.gmail.com



--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: allow online change primary_conninfo

2020-10-16 Thread Sergei Kornilov
Hello

Yep, I think it's useful, and I already posted a patch in this thread:
https://www.postgresql.org/message-id/flat/3090621585393698%40myt5-1466095fe4e5.qloud-c.yandex.net#ee6574e93982b5628d140f15cb44
There is currently no consensus on it.

regards, Sergei




Re: Deleting older versions in unique indexes to avoid page splits

2020-10-16 Thread Victor Yegorov
On Fri, Oct 16, 2020 at 21:12, Peter Geoghegan wrote:

> I ran the script for two hours and 16 clients with the patch, then for
> another two hours with master. After that time, all 3 indexes were
> exactly the same size with the patch, but had almost doubled in size
> on master:
>
> aid_pkey_include_abalance: 784,264 pages (or ~5.983 GiB)
> one: 769,545 pages (or ~5.871 GiB)
> two: 770,295 pages (or ~5.876 GiB)
>
> (With the patch, all three indexes were 100% pristine -- they remained
> precisely 411,289 pages in size by the end, which is ~3.137 GiB.)
>
> …
>
> The TPS/throughput is about what you'd expect for the two hour run:
>
> 18,988.762398 TPS for the patch
> 11,123.551707 TPS for the master branch.
>

I really like these results, great work!

I'm also wondering how IO numbers changed due to these improvements,
shouldn't be difficult to look into.

Peter, according to cfbot the patch no longer compiles.
Can you send an update, please?


-- 
Victor Yegorov


Re: More aggressive vacuuming of temporary tables

2020-10-16 Thread Peter Geoghegan
On Wed, Sep 9, 2020 at 11:02 AM Andres Freund  wrote:
> Obviously lookup in such a more complicated structure isn't free. Nor is
> building it. So we'd need some heuristics about when to do so. It'd
> probably be OK to occasionally look at the age of the oldest in-progress
> statement, to infer the time for old snapshots based on that. And then
> we could have a GUC that says when it's worth doing more complicated
> lookups.
>
> I don't have a handle on how to deal with the ctid chaining for
> intermediate row versions.

I wonder if it makes sense to add the feature to nbtree first. This
side-steps the practical problem of having to figure out ctid
chaining. You can prove the idea in a relatively low risk way, while
still creating significant value for users.

We can reason about versioning within indexes without having
authoritative information about it close at hand. The trick is to
delay everything until it looks like we'll have to split the page. I
can see very substantial improvements to index bloat from non-HOT
updates with the deduplication-deletion patch I've been working on:

https://postgr.es/m/cah2-wznjqqwspsxbiz6wqybikqfmfdjodeqp28otqw551t7...@mail.gmail.com

Importantly, the patch tends to bound the number of distinct index
entries *per logical row* in the presence of non-HOT updates (at least
for indexes that are not logically modified by the update). ISTM that
this is what really matters. The patch doesn't just reduce index bloat
-- it greatly caps the number of heap pages any given primary key
point lookup query will do. So in practice we eagerly kill precisely
the garbage index tuples that would have caused us the most trouble
without the new mechanism in place.

Avoiding a page split is a big win, so we can probably justify
relatively expensive lookup calls to make this work. This seems
especially likely to be true because we can probably ratchet up to
that only when we see that it will win. If a unique index has 10
entries for one value (10 versions), which is inevitable once HOT
pruning fails, then how many of those 10 do we really need? The answer
is perhaps just 2 or 3 maybe 99%+ of the time. I strongly suspect that
preventing page splits is more valuable than opportunistic heap
pruning (unless the pruning itself prevents page splits, which it may
or may not). Logically unnecessary page splits have obvious lasting
consequences, are usually highly correlated, and affect performance in
ways that are non-linear and likely very harmful in the real world.

--
Peter Geoghegan




Re: Deleting older versions in unique indexes to avoid page splits

2020-10-16 Thread Peter Geoghegan
On Wed, Oct 14, 2020 at 7:40 AM Peter Geoghegan  wrote:
> Right. The trick is to pay only a fixed low cost (maybe as low as one
> heap page access) when we start out, and ratchet it up only if the
> first heap page access looks promising.

Just as an example of how the patch can help, consider the following
pgbench variant script:

\set aid1 random_gaussian(1, 10 * :scale, 2.0)
\set aid2 random_gaussian(1, 10 * :scale, 2.5)
\set aid3 random_gaussian(1, 10 * :scale, 2.2)
\set bid random(1, 1 * :scale)
\set tid random(1, 10 * :scale)
\set delta random(-5000, 5000)
BEGIN;
UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid1;
SELECT abalance FROM pgbench_accounts WHERE aid = :aid2;
SELECT abalance FROM pgbench_accounts WHERE aid = :aid3;
END;

(These details you see here are a bit arbitrary; don't worry about the
specifics too much.)

Before running the script with pgbench, I initialized pgbench to scale
1500, and made some changes to the indexing (in order to actually test
the patch). There was no standard pgbench_accounts PK. Instead, I
created a unique index that had an include column, which is enough to
make every update a non-HOT update. I also added two more redundant
non-unique indexes to create more overhead from non-HOT updates. It
looked like this:

create unique index aid_pkey_include_abalance on pgbench_accounts
(aid) include (abalance);
create index one on pgbench_accounts (aid);
create index two on pgbench_accounts (aid);

So 3 indexes on the accounts table.

I ran the script for two hours and 16 clients with the patch, then for
another two hours with master. After that time, all 3 indexes were
exactly the same size with the patch, but had almost doubled in size
on master:

aid_pkey_include_abalance: 784,264 pages (or ~5.983 GiB)
one: 769,545 pages (or ~5.871 GiB)
two: 770,295 pages (or ~5.876 GiB)

(With the patch, all three indexes were 100% pristine -- they remained
precisely 411,289 pages in size by the end, which is ~3.137 GiB.)
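
For anyone wanting to double-check those numbers, the page counts translate to
the sizes quoted via relpages * 8 kB (784,264 * 8 kB is about 5.98 GiB, and
411,289 * 8 kB about 3.14 GiB), and a generic catalog query reproduces them;
this query is not from the thread itself:

SELECT relname, relpages,
       pg_size_pretty(pg_relation_size(oid)) AS size
FROM pg_class
WHERE relname IN ('aid_pkey_include_abalance', 'one', 'two')
ORDER BY relname;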

Note that traditional deduplication is used by the indexes I've called
"one" and "two" here, but not the include index called
"aid_pkey_include_abalance". But it makes little difference, for
reasons that will be obvious if you think about what this looks like
at the leaf page level. Cases that Postgres 13 deduplication does
badly with are often the ones that this new mechanism does well with.
Deduplication by deleting and by merging are truly complementary -- I
haven't just structured the code that way because it was convenient to
use dedup infrastructure just to get the dups at the start. (Yes, it
*was* convenient, but there clearly are cases where each mechanism
competes initially, before nbtree converges on the best strategy at
the local level. So FWIW this patch is a natural and logical extension
of the deduplication work in my mind.)

The TPS/throughput is about what you'd expect for the two hour run:

18,988.762398 TPS for the patch
11,123.551707 TPS for the master branch.

This is a ~1.7x improvement, but I can get more than 3x by changing
the details at the start -- just add more indexes. I don't think that
the precise throughput difference you see here matters. The important
point is that we've more or less fixed a pathological set of behaviors
that have poorly understood cascading effects. Full disclosure: I rate
limited pgbench to 20k for this run, which probably wasn't significant
because neither patch nor master hit that limit for long.

Big latency improvements for that same run, too:

Patch:

statement latencies in milliseconds:
 0.001  \set aid1 random_gaussian(1, 10 * :scale, 2.0)
 0.000  \set aid2 random_gaussian(1, 10 * :scale, 2.5)
 0.000  \set aid3 random_gaussian(1, 10 * :scale, 2.2)
 0.000  \set bid random(1, 1 * :scale)
 0.000  \set tid random(1, 10 * :scale)
 0.000  \set delta random(-5000, 5000)
 0.057  BEGIN;
 0.294  UPDATE pgbench_accounts SET abalance = abalance +
:delta WHERE aid = :aid1;
 0.204  SELECT abalance FROM pgbench_accounts WHERE aid = :aid2;
 0.195  SELECT abalance FROM pgbench_accounts WHERE aid = :aid3;
 0.090  END;

Master:

statement latencies in milliseconds:
 0.002  \set aid1 random_gaussian(1, 10 * :scale, 2.0)
 0.001  \set aid2 random_gaussian(1, 10 * :scale, 2.5)
 0.001  \set aid3 random_gaussian(1, 10 * :scale, 2.2)
 0.001  \set bid random(1, 1 * :scale)
 0.000  \set tid random(1, 10 * :scale)
 0.001  \set delta random(-5000, 5000)
 0.084  BEGIN;
 0.604  UPDATE pgbench_accounts SET abalance = abalance +
:delta WHERE aid = :aid1;
 0.317  SELECT abalance FROM pgbench_accounts WHERE aid = :aid2;
 0.311  SELECT abalance FROM pgbench_accounts WHERE aid = :aid3;
 0.120  END;

Note that the mechanism added by the patch naturally limits the number
of versions that are in the index for each 

Re: Commitfest manager 2020-11

2020-10-16 Thread Tom Lane
Anastasia Lubennikova  writes:
> I was looking for this message, to find out who is the current CFM. 
> Apparently, the November commitfest is not in progress yet.

Nope, nor have we officially appointed a CFM for it yet.  We're seldom
organized enough to do that much in advance of the CF's start.

> Still, I have a question. Should we also maintain statuses of the 
> patches in the "Open" commitfest?

Yes, absolutely, if you notice something out-of-date there, go ahead
and fix it.  If nothing else, you'll save the eventual CFM some time.

> On the other hand, I noticed a lot of stalled threads that haven't been
> updated in months. Some of them seem to pass several CFs without any
> activity at all. I believe that this is wrong for many reasons, the major
> one of which IMHO is the frustration of the authors. Can we come up with
> something to improve this situation?

Yeah, that's a perennial problem.  Part of the issue is just a shortage
of people --- there are always more patches than we can review and
commit in one month.  IMO, another cause is that we have a hard time
saying "no".  If a particular patch isn't too well liked, we tend to
just let it slide to the next CF rather than making the uncomfortable
decision to reject it.  If you've got thoughts about that, or any other
ways to improve the process, for sure speak up.

> P.S. I have a few more ideas about CF management. I suppose that
> they are usually discussed at pgcon meetings, but those won't
> happen anytime soon. Is there a special place for such discussions, or may
> I continue this thread?

This thread seems like an OK place for the discussion.  As you say,
there are not likely to be any in-person meetings for awhile :-(

regards, tom lane




Re: speed up unicode decomposition and recomposition

2020-10-16 Thread Daniel Verite
John Naylor wrote:

> I'd be curious how it compares to ICU now

I've made another run of the test in [1] with your v2 patches
from this thread against icu_ext built with ICU-67.1.
The results show the times in milliseconds to process
about 10 million short strings:

 operation  | unpatched | patched | icu_ext
------------+-----------+---------+---------
 nfc check  |      7968 |    5989 |    4076
 nfc conv   |    700894 |   15163 |    6808
 nfd check  |     16399 |    9852 |    3849
 nfd conv   |     17391 |   10916 |    6758
 nfkc check |      8259 |    6092 |    4301
 nfkc conv  |    700241 |   15354 |    7034
 nfkd check |     16585 |   10049 |    4038
 nfkd conv  |     17587 |   11109 |    7086

The ICU implementation still wins by a large margin, but
the improvements brought by the patch are gorgeous,
especially for the conversion to NFC/NFKC.
This test ran on a slower machine than what I used for [1], so
that's why all queries take longer.
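
For readers who have not used these operations: the "check" and "conv" rows
correspond to the SQL-standard syntax available in stock PostgreSQL 13,
roughly as below (orientation only, not part of the benchmark):

-- "check": is the string already in the requested normal form?
SELECT U&'\00E4' IS NFC NORMALIZED;                   -- precomposed "ä": true
-- "conv": convert the string to the requested normal form
SELECT normalize(U&'\0061\0308', NFC) = U&'\00E4';    -- a + combining diaeresis: true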

For the two queries upthread, I get this:

1)
select count(normalize(t, NFC)) from (
select md5(i::text) as t from
generate_series(1,10) as i
) s;
count  

 10
(1 row)

Time: 371.043 ms

VS ICU:

select count(icu_normalize(t, 'NFC')) from (
select md5(i::text) as t from
generate_series(1,10) as i
) s;
 count  

 10
(1 row)

Time: 125.809 ms


2)
select count(normalize(t, NFC)) from (
select repeat(U&'\00E4\00C5\0958\00F4\1EBF\3300\1FE2\3316\2465\322D', i % 3
+ 1) as t from
generate_series(1,10) as i
) s;
 count  

 10
(1 row)
Time: 428.214 ms


VS ICU:

select count(icu_normalize(t, 'NFC')) from (
select repeat(U&'\00E4\00C5\0958\00F4\1EBF\3300\1FE2\3316\2465\322D', i % 3
+ 1) as t from
generate_series(1,10) as i
) s;
 count  

 10
(1 row)

Time: 147.713 ms


[1]
https://www.postgresql.org/message-id/2c5e8df9-43b8-41fa-88e6-286e8634f00a%40manitou-mail.org


Best regards,
-- 
Daniel Vérité
PostgreSQL-powered mailer: https://www.manitou-mail.org
Twitter: @DanielVerite




Re: Commitfest manager 2020-11

2020-10-16 Thread Anastasia Lubennikova

On 24.08.2020 16:08, gkokola...@pm.me wrote:

Hi all,

Admittedly quite ahead of time, I would like to volunteer as Commitfest manager 
for 2020-11.

If the role is not filled and there are no objections, I can reach out again in 
October for confirmation.

//Georgios


Wow, that was well in advance) I am willing to assist if you need any help.

I was looking for this message, to find out who is the current CFM. 
Apparently, the November commitfest is not in progress yet.


Still, I have a question. Should we also maintain statuses of the
patches in the "Open" commitfest? 21 patches were already committed
during this CF, which shows that even the "open" CF is quite active. I've
updated a few patches that were sent by my colleagues. If there are no
objections, I can do that for other entries too.


On the other hand, I noticed a lot of stalled threads that haven't been 
updated in months. Some of them seem to pass through several CFs without 
any activity at all. I believe that is wrong for many reasons, the major 
one IMHO being the frustration of the authors. Can we come up with 
something to improve this situation?


P.S. I have a few more ideas about the CF management. I suppose, that 
they are usually being discussed at pgcon meetings, but those won't 
happen anytime soon. Is there a special place for such discussions, or I 
may continue this thread?


--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: Assertion failure with LEFT JOINs among >500 relations

2020-10-16 Thread Tom Lane
I wrote:
> David Rowley  writes:
>> I've ended up leaving the NaN checks in the join costing functions.
>> There was no case mentioned in [1] that showed how we hit that
>> reported test case, so I'm not really confident enough to know I'm not
>> just reintroducing the same problem again by removing that.  The path
>> row estimate that had the NaN might not have been through
>> clamp_row_est(). Many don't.

> Hmm, I will try to find some time tomorrow to reconstruct that.

I'm confused now, because the v2 patch does remove those isnan calls?

I rechecked the archives, and I agree that there's no data about
exactly how we could have gotten a NaN here.  My guess though is
infinity-times-zero in some earlier relation size estimate.  So
hopefully the clamp to 1e100 will make that impossible, or if it
doesn't then clamp_row_est() should still prevent a NaN from
propagating to the next level up.
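
(For reference, the clamping being discussed amounts to something
roughly like this; a sketch, not the exact patch text, with
MAXIMUM_ROWCOUNT standing in for the 1e100 cap:

    double
    clamp_row_est(double nrows)
    {
        /* collapse NaN or absurdly large estimates to the cap */
        if (nrows > MAXIMUM_ROWCOUNT || isnan(nrows))
            nrows = MAXIMUM_ROWCOUNT;
        else if (nrows <= 1.0)
            nrows = 1.0;        /* force at least one row */
        else
            nrows = rint(nrows);
        return nrows;
    }

so a NaN produced further down would get replaced before it can feed
into the next level of join estimates.)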

I'm good with the v2 patch.

regards, tom lane




Re: CREATE TABLE .. PARTITION OF fails to preserve tgenabled for inherited row triggers

2020-10-16 Thread Alvaro Herrera
Same, with a little test.

I also just noticed that ALTER TABLE ONLY recurses to children, which it
should not.
From 2fb3a3122bdbbb1eb5aa6608b5132b8ab07096d4 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Fri, 16 Oct 2020 10:58:54 -0300
Subject: [PATCH] When cloning triggers, preserve enabling state

---
 src/backend/commands/tablecmds.c   |  8 +++---
 src/backend/commands/trigger.c | 30 
 src/include/commands/trigger.h |  5 
 src/test/regress/expected/triggers.out | 39 ++
 src/test/regress/sql/triggers.sql  | 24 
 5 files changed, 96 insertions(+), 10 deletions(-)

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 511f015a86..f20b877f68 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -16879,10 +16879,10 @@ CloneRowTriggersToPartition(Relation parent, Relation partition)
 		trigStmt->initdeferred = trigForm->tginitdeferred;
 		trigStmt->constrrel = NULL; /* passed separately */
 
-		CreateTrigger(trigStmt, NULL, RelationGetRelid(partition),
-	  trigForm->tgconstrrelid, InvalidOid, InvalidOid,
-	  trigForm->tgfoid, trigForm->oid, qual,
-	  false, true);
+		CreateTriggerFiringOn(trigStmt, NULL, RelationGetRelid(partition),
+			  trigForm->tgconstrrelid, InvalidOid, InvalidOid,
+			  trigForm->tgfoid, trigForm->oid, qual,
+			  false, true, trigForm->tgenabled);
 
 		MemoryContextSwitchTo(oldcxt);
 		MemoryContextReset(perTupCxt);
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index 3b4fbdadf4..b826b3595c 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -158,6 +158,24 @@ CreateTrigger(CreateTrigStmt *stmt, const char *queryString,
 			  Oid relOid, Oid refRelOid, Oid constraintOid, Oid indexOid,
 			  Oid funcoid, Oid parentTriggerOid, Node *whenClause,
 			  bool isInternal, bool in_partition)
+{
+	return
+		CreateTriggerFiringOn(stmt, queryString, relOid, refRelOid,
+			  constraintOid, indexOid, funcoid,
+			  parentTriggerOid, whenClause, isInternal,
+			  in_partition, TRIGGER_FIRES_ON_ORIGIN);
+}
+
+/*
+ * Like the above; additionally the firing condition
+ * (always/origin/replica/disabled) can be specified.
+ */
+ObjectAddress
+CreateTriggerFiringOn(CreateTrigStmt *stmt, const char *queryString,
+	  Oid relOid, Oid refRelOid, Oid constraintOid,
+	  Oid indexOid, Oid funcoid, Oid parentTriggerOid,
+	  Node *whenClause, bool isInternal, bool in_partition,
+	  char trigger_fires_when)
 {
 	int16		tgtype;
 	int			ncolumns;
@@ -800,7 +818,7 @@ CreateTrigger(CreateTrigStmt *stmt, const char *queryString,
 			 CStringGetDatum(trigname));
 	values[Anum_pg_trigger_tgfoid - 1] = ObjectIdGetDatum(funcoid);
 	values[Anum_pg_trigger_tgtype - 1] = Int16GetDatum(tgtype);
-	values[Anum_pg_trigger_tgenabled - 1] = CharGetDatum(TRIGGER_FIRES_ON_ORIGIN);
+	values[Anum_pg_trigger_tgenabled - 1] = trigger_fires_when;
 	values[Anum_pg_trigger_tgisinternal - 1] = BoolGetDatum(isInternal || in_partition);
 	values[Anum_pg_trigger_tgconstrrelid - 1] = ObjectIdGetDatum(constrrelid);
 	values[Anum_pg_trigger_tgconstrindid - 1] = ObjectIdGetDatum(indexOid);
@@ -1130,11 +1148,11 @@ CreateTrigger(CreateTrigStmt *stmt, const char *queryString,
 map_partition_varattnos((List *) qual, PRS2_NEW_VARNO,
 		childTbl, rel);
 
-			CreateTrigger(childStmt, queryString,
-		  partdesc->oids[i], refRelOid,
-		  InvalidOid, indexOnChild,
-		  funcoid, trigoid, qual,
-		  isInternal, true);
+			CreateTriggerFiringOn(childStmt, queryString,
+  partdesc->oids[i], refRelOid,
+  InvalidOid, indexOnChild,
+  funcoid, trigoid, qual,
+  isInternal, true, trigger_fires_when);
 
 			table_close(childTbl, NoLock);
 
diff --git a/src/include/commands/trigger.h b/src/include/commands/trigger.h
index a40ddf5db5..40b8154876 100644
--- a/src/include/commands/trigger.h
+++ b/src/include/commands/trigger.h
@@ -162,6 +162,11 @@ extern ObjectAddress CreateTrigger(CreateTrigStmt *stmt, const char *queryString
    Oid relOid, Oid refRelOid, Oid constraintOid, Oid indexOid,
    Oid funcoid, Oid parentTriggerOid, Node *whenClause,
    bool isInternal, bool in_partition);
+extern ObjectAddress CreateTriggerFiringOn(CreateTrigStmt *stmt, const char *queryString,
+		   Oid relOid, Oid refRelOid, Oid constraintOid,
+		   Oid indexOid, Oid funcoid, Oid parentTriggerOid,
+		   Node *whenClause, bool isInternal, bool in_partition,
+		   char trigger_fires_when);
 
 extern void RemoveTriggerById(Oid trigOid);
 extern Oid	get_trigger_oid(Oid relid, const char *name, bool missing_ok);
diff --git a/src/test/regress/expected/triggers.out b/src/test/regress/expected/triggers.out
index 5e76b3a47e..d85471a3a9 100644
--- a/src/test/regress/expected/triggers.out
+++ 

Re: allow online change primary_conninfo

2020-10-16 Thread Maxim Orlov

On 2020-03-28 02:39, Alvaro Herrera wrote:

On 2020-Mar-27, Alvaro Herrera wrote:


On 2020-Mar-27, Alvaro Herrera wrote:

> I pushed the wal_receiver_create_temp_slot bugfix, because I realized
> after looking for long enough at WalReceiverMain() that the code was
> beyond saving.  I'll be pushing the rest of this later today.

So here's the next one.  I still have to go over the doc changes here,
but at least the tests are passing for me.


Pushed with some documentation tweaks.

Many thanks for working on this!


Nice work! What do you think of possibility to add restore_command?

Is it a good idea to make restore_command reloadable via SIGHUP 
too?


On the one hand it looks reasonable, since primary_conninfo got such an 
ability. On the other hand, we don't know exactly whether the actual 
process would use the "new" command or the "old" one after a config 
reload, since it may take a long time for the old command to finish 
running (in the case of scp or something similar).


Also, setting a faulty restore_command leads to server termination, which 
is rather unexpected.
If anyone makes a minor typo in restore_command, say writing 
restore_command='cp1 /mnt/srv/%f %p' instead of restore_command='cp 
/mnt/srv/%f %p', and then does SELECT pg_reload_conf(), the server will 
terminate.


Do you consider this feature useful? Any suggestions?

---
Best regards,
Maxim Orlov.

diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index a62d64eaa47..87fd5939246 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -3699,7 +3699,7 @@ static struct config_string ConfigureNamesString[] =
 	},
 
 	{
-		{"restore_command", PGC_POSTMASTER, WAL_ARCHIVE_RECOVERY,
+		{"restore_command", PGC_SIGHUP, WAL_ARCHIVE_RECOVERY,
 			gettext_noop("Sets the shell command that will be called to retrieve an archived WAL file."),
 			NULL
 		},
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 9cb571f7cc7..9c9091e601e 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -253,7 +253,6 @@
 # placeholders: %p = path of file to restore
 #   %f = file name only
 # e.g. 'cp /mnt/server/archivedir/%f %p'
-# (change requires restart)
 #archive_cleanup_command = ''	# command to execute at every restartpoint
 #recovery_end_command = ''	# command to execute at completion of recovery
 


Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?

2020-10-16 Thread Bruce Momjian
On Fri, Oct 16, 2020 at 01:03:55PM -0300, Álvaro Herrera wrote:
> On 2020-Oct-16, Bruce Momjian wrote:
> 
> > On Thu, Oct 15, 2020 at 11:41:23AM +0800, Julien Rouhaud wrote:
> 
> > > I did some naive benchmarking.  Using a custom pgbench script with this 
> > > query:
> 
> > > I can see around 2% overhead (this query is reported with ~ 3ms
> > > latency average).  Adding a few joins, overhead goes down to 1%.
> > 
> > That number is too high to enable this by default.  I suggest we either
> > improve the performance of this, or clearly document that you have to
> > enable the hash computation to see the pg_stat_activity and
> > log_line_prefix fields.
> 
> Agreed.  This is similar to how we used to deal with query strings: an
> optional feature, disabled by default (cf.  commit b13c9686d084).
> 
> In this case, I suppose using pg_stat_statement would require to have it
> enabled, and it'd just not collect anything if disabled.  Similarly, the
> field would show NULL in pg_stat_activity or an empty string in
> log_line_prefix/CSV logs.

Yes, and at each use point, e.g., pg_stat_activity, log_line_prefix, we
have to remind people how to turn hash compuation on.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?

2020-10-16 Thread Tom Lane
Alvaro Herrera  writes:
> In this case, I suppose using pg_stat_statement would require to have it
> enabled, and it'd just not collect anything if disabled.

Alternatively, pg_stat_statement might be able to force it on
(applying a non-overridable PGC_INTERNAL-level setting) on load?
Not sure if that'd be desirable or not.
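
For illustration, forcing it at load time might look roughly like this
(just a sketch; the GUC name here is hypothetical, since we haven't
settled on one):

    void
    _PG_init(void)
    {
        /* force query-ID computation on, at a non-overridable level */
        SetConfigOption("compute_query_id", "on",
                        PGC_INTERNAL, PGC_S_OVERRIDE);
    }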

If the behavior of pg_stat_statement is to do nothing when it
sees a query without the ID calculated (which I guess it'd have to)
then there's a potential security issue if the GUC is USERSET level:
a user could hide her queries from pg_stat_statement by turning the
GUC off.  So this line of thought suggests the GUC needs to be at
least SUSET, and maybe higher ... doesn't pg_stat_statement need it
to have the same value cluster-wide?

regards, tom lane




Re: Sometimes the output to the stdout in Windows disappears

2020-10-16 Thread Tom Lane
Alexander Lakhin  writes:
> I've managed to make a simple reproducer. Please look at the patch attached.
> There are two things crucial for reproducing the bug:
>     ioctlsocket(sock, FIONBIO, &ioctlsocket_ret); // from pgwin32_socket()
> and
>     WSACleanup();

Oh, very interesting.

Now that you have it somewhat in captivity, maybe you could determine
some things:

1. Is it only stdout that's affected?  What of other stdio streams?
(Note that testing stderr might be tricky because it's probably
line-buffered.)

2. Does an fflush() just before, or just after, WSACleanup() fix it?

> I see no solution for this on the postgres side for now, but this
> information about Windows quirks could be useful in case someone
> stumbled upon it too.

Depending on your answers to the above, maybe some hack like this
would be acceptable:

free(conn);
 
 #ifdef WIN32
+   fflush(NULL);
WSACleanup();
 #endif
 }

It's not very nice for a library to be doing global things like that,
but if the alternative is loss of output, maybe we should.

But wait a minute: I just looked at Microsoft's docs [1] and found

In a multithreaded environment, WSACleanup terminates Windows Sockets
operations for all threads.

This makes me (a) wonder if that explains the side-effects on stdio,
and (b) question why libpq is calling WSACleanup at all.
What if we arranged to call WSAStartup just once, during the first
libpq connection-open in a process, and then never did WSACleanup?
Surely the OS can cope with that, and it eliminates any risk that
WSACleanup breaks something.
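
Something like this, say (just a sketch, ignoring thread-safety):

    static bool wsastartup_done = false;

    static void
    pqWSAStartupOnce(void)
    {
        WSADATA     wsaData;

        if (!wsastartup_done &&
            WSAStartup(MAKEWORD(2, 2), &wsaData) == 0)
            wsastartup_done = true;
    }

called from connection startup, with the WSACleanup() call simply
removed.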

regards, tom lane

[1] 
https://docs.microsoft.com/en-us/windows/win32/api/winsock/nf-winsock-wsacleanup




Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?

2020-10-16 Thread Alvaro Herrera
On 2020-Oct-16, Bruce Momjian wrote:

> On Thu, Oct 15, 2020 at 11:41:23AM +0800, Julien Rouhaud wrote:

> > I did some naive benchmarking.  Using a custom pgbench script with this 
> > query:

> > I can see around 2% overhead (this query is reported with ~ 3ms
> > latency average).  Adding a few joins, overhead goes down to 1%.
> 
> That number is too high to enable this by default.  I suggest we either
> improve the performance of this, or clearly document that you have to
> enable the hash computation to see the pg_stat_activity and
> log_line_prefix fields.

Agreed.  This is similar to how we used to deal with query strings: an
optional feature, disabled by default (cf.  commit b13c9686d084).

In this case, I suppose using pg_stat_statement would require to have it
enabled, and it'd just not collect anything if disabled.  Similarly, the
field would show NULL in pg_stat_activity or an empty string in
log_line_prefix/CSV logs.

So users that want it can easily have it, and users that don't are not
paying the price.

For maximum user-friendliness, pg_stat_statement could be loaded and
shmem-initialized even when query ID computation is turned off, and
you'd be able to enable query ID computation with just SIGHUP; so you
don't have to restart the server in order to enable statement tracking.
(I suppose we would forbid users from disabling query ID with SET,
though.)




Re: [Patch] Using Windows groups for SSPI authentication

2020-10-16 Thread Stephen Frost
Greetings,

* Russell Foster (russell.foster.cod...@gmail.com) wrote:
> On Thu, Oct 15, 2020 at 11:31 AM Stephen Frost  wrote:
> 
> > Please don't top-post on these lists..
> Didn't even know what that was, had to look it up. Hopefully it is
> resolved. Gmail does too many things for you!

Indeed!  This looks much better, thanks!

> > While not exactly the same, of course, they are more-or-less equivalent
> > to Unix groups (it's even possible using NSS to get Unix groups to be
> > backed by Windows groups) and so calling it 'system-group' does seem
> > like it'd make sense, rather than calling it "Windows groups" or
> > similar.
> >
> > One unfortunate thing regarding this is that, unless things have
> > changed, this won't end up working with GSS (unless we add the unix
> > group support and that's then backed by AD as I described above) since
> > the ability to check group membership using SSPI is an extension to the
> > Kerberos protocol, which never included group membership information in
> > it, and therefore while this would work for Windows clients connecting
> > to Windows servers, it won't work for Windows clients connecting to Unix
> > servers with GSSAPI authentication.
> >
> > The direction I had been thinking of addressing that was to add an
> > option to pg_hba.conf's 'gss' auth method which would allow reaching out
> > to check group membership against an AD server.  In a similar vein, we
> > could add an option to the 'sspi' auth method to check the group
> > membership, rather than having this done in pg_ident.conf, which is
> > really intended to allow mapping between system usernames and PG
> > usernames which are different, not really for controlling authentication
> > based on group membership when the username is the same.
> >
> > Russell, thoughts on that..?
> 
> So are you saying something like this where its an option to the sspi method?
> 
> # TYPE DATABASE USERADDRESS  MASKMETHOD
> hostssl all  some_user   0.0.0.0  0.0.0.0 sspi group="Windows Group"

Yes, something along those lines.

> I guess the code wouldn't change much, unless you mean for it to do a
> more generic ldap query. Seems OK to me, but I guess the hba could
> become more verbose. The map is nice as it allows your HBA to be very
> precise in how your connections and database users are represented,
> and the ident map file is there to group those external identities. I
> can't say I have a strong opinion either way though.

No, no, not suggesting you need to rewrite it as a generic LDAP query-
that would be a patch that I'd like to see but is a different feature
from this and wouldn't even be applicable to SSPI (it'd be for GSS..
and perhaps some other methods, but with SSPI we should use the SSPI
methods- I can't think of a reason to go to an LDAP query when the group
membership is directly available from SSPI, can you?).

The pg_ident is specifically intended to be a mapping from external user
identities to PG users.  Reading back through the thread, in the end it
seems like it really depends on what we're trying to solve here and
perhaps it's my fault for misunderstanding your original goal, but maybe
we get two features out of this in the end, and for not much more code.

Based on your example pg_ident.conf (which I took as more of a "this is
what using this would look like" and not as literally as I think you
meant it, now that I read back through it), there's a use-case of:

"Allow anyone in this group to log in as this *specific* PG user"

The other use-case is:

"Allow users in this group to be able to log into this PG server"

(The latter use-case potentially being further extended to
"automatically create the PG user if it doesn't already exist",
something which has been discussed elsewhere previously and is what
folks coming from other database systems may be used to).

The former would be more appropriate in pg_ident.conf, the latter would
fit into pg_hba.conf, both are useful.

To the prior discussion around pg_ident.conf, I do think having the
keyword being 'system-group' would fit well, but something we need to
think about is that multiple auth methods work with pg_ident and we need
to either implement the functionality for each of them, or make it clear
that it doesn't work- in particular, if you have 'system-group' as an
option in pg_ident.conf and you're using 'peer' auth on a Unix system,
we either need to make it work (which should be pretty easy..?), or
refuse to accept that map for that auth-method if it's not going to
work.

As it relates to pg_hba.conf- if you don't think it'd be much additional
code and you'd be up for it, I do think it'd be awesome to address that
use-case as well, but I do agree it's a separate feature and probably
committed independently.

Or, if I've managed to misunderstand again, please let me know. :)

Thanks!

Stephen


signature.asc
Description: PGP signature


Re: Sometimes the output to the stdout in Windows disappears

2020-10-16 Thread Alexander Lakhin
Hello hackers,

13.09.2020 21:37, Tom Lane wrote:
> I happened to try googling for other similar reports, and I found
> a very interesting recent thread here:
>
> https://github.com/nodejs/node/issues/33166
>
> It might not have the same underlying cause, of course, but it sure
> sounds familiar.  If Node.js are really seeing the same effect,
> that would point to an underlying Windows bug rather than anything
> Postgres is doing wrong.
>
> It doesn't look like the Node.js crew got any closer to
> understanding the issue than we have, unfortunately.  They made
> their problem mostly go away by reverting a seemingly-unrelated
> patch.  But I can't help thinking that it's a timing-related bug,
> and that patch was just unlucky enough to change the timing of
> their tests so that they saw the failure frequently.
I've managed to make a simple reproducer. Please look at the patch attached.
There are two things crucial for reproducing the bug:
    ioctlsocket(sock, FIONBIO, &ioctlsocket_ret); // from pgwin32_socket()
and
    WSACleanup();

I still can't understand what affects the effect. With this reproducer I
get:
vcregress taptest src\test\modules\connect
...
t/000_connect.pl .. # test
#
t/000_connect.pl .. 13346/10
#   Failed test at t/000_connect.pl line 24.
t/000_connect.pl .. 16714/10
#   Failed test at t/000_connect.pl line 24.
t/000_connect.pl .. 26216/10
#   Failed test at t/000_connect.pl line 24.
t/000_connect.pl .. 30077/10
#   Failed test at t/000_connect.pl line 24.
t/000_connect.pl .. 36505/10
#   Failed test at t/000_connect.pl line 24.
t/000_connect.pl .. 43647/10
#   Failed test at t/000_connect.pl line 24.
t/000_connect.pl .. 53070/10
#   Failed test at t/000_connect.pl line 24.
t/000_connect.pl .. 54402/10
#   Failed test at t/000_connect.pl line 24.
t/000_connect.pl .. 55685/10
#   Failed test at t/000_connect.pl line 24.
t/000_connect.pl .. 83193/10
#   Failed test at t/000_connect.pl line 24.
t/000_connect.pl .. 2/10 # Looks like you failed 10 tests of 10.
t/000_connect.pl .. Dubious, test returned 10 (wstat 2560, 0xa00)
Failed 10/10 subtests

But in our test farm the pgbench test (from the installcheck-world
suite that we run using msys) can fail roughly on every third run.
Perhaps it depends on I/O load. It seems that searching files/scanning
the disk in parallel increases the probability of the glitch.
I see no solution for this on the postgres side for now, but this
information about Windows quirks could be useful in case someone
stumbled upon it too.

Best regards,
Alexander
diff --git a/src/test/modules/connect/connect.c b/src/test/modules/connect/connect.c
new file mode 100644
index 00..c875b7647b
--- /dev/null
+++ b/src/test/modules/connect/connect.c
@@ -0,0 +1,26 @@
+#include <winsock2.h>
+#include <stdio.h>
+
+int
+main(int argc, char *argv[])
+{
+	WSADATA	  wsaData;
+	SOCKET sock;
+	int port = 5432;
+	struct sockaddr_in ai_addr;
+	unsigned long ioctlsocket_ret = 1;
+
+	if (argc > 1) port = atoi(argv[1]);
+
+	ai_addr.sin_family = AF_INET;
+	ai_addr.sin_addr.s_addr = inet_addr("127.0.0.1");
+	ai_addr.sin_port = htons(port);
+
+	WSAStartup(MAKEWORD(1, 1), &wsaData);
+	sock = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
+	ioctlsocket(sock, FIONBIO, &ioctlsocket_ret);
+	connect(sock, (SOCKADDR *) & ai_addr, sizeof (ai_addr));
+	fprintf(stdout, "test\n");
+	WSACleanup();
+	return 0;
+}
diff --git a/src/test/modules/connect/t/000_connect.pl b/src/test/modules/connect/t/000_connect.pl
new file mode 100644
index 00..bd23d1e419
--- /dev/null
+++ b/src/test/modules/connect/t/000_connect.pl
@@ -0,0 +1,25 @@
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More;
+use Test::More tests => 10;
+
+# start a server
+my $node = get_new_node('main');
+$node->init;
+$node->start;
+
+my $stdout;
+my $stderr;
+my @cmd = ('connect', $node->port);
+
+IPC::Run::run(\@cmd, '>', \$stdout, '2>', \$stderr);
+diag($stdout);
+diag($stderr);
+
+for (my $i =0; $i < 10; $i++) {
+	IPC::Run::run(\@cmd, '>', \$stdout, '2>', \$stderr);
+	ok(defined $stdout && $stdout ne '');
+}
diff --git a/src/tools/msvc/Mkvcbuild.pm b/src/tools/msvc/Mkvcbuild.pm
index 20da7985c1..70959fbb24 100644
--- a/src/tools/msvc/Mkvcbuild.pm
+++ b/src/tools/msvc/Mkvcbuild.pm
@@ -43,6 +43,7 @@ my $contrib_extrasource = {
 	'seg'  => [ 'contrib/seg/segscan.l',   'contrib/seg/segparse.y' ],
 };
 my @contrib_excludes = (
+	'connect',
 	'bool_plperl',  'commit_ts',
 	'hstore_plperl','hstore_plpython',
 	'intagg',   'jsonb_plperl',
@@ -423,6 +424,10 @@ sub mkvcbuild
 	$zic->AddDirResourceFile('src/timezone');
 	$zic->AddReference($libpgcommon, $libpgport);
 
+	my $test_connect = $solution->AddProject('connect', 'exe', 'utils');
+	$test_connect->AddFile('src/test/modules/connect/connect.c');
+	$test_connect->AddLibrary('ws2_32.lib');
+
 	if (!$solution->{options}->{xml})
 	{
 		push @contrib_excludes, 'xml2';


Re: Potential use of uninitialized context in pgcrypto

2020-10-16 Thread Tom Lane
Daniel Gustafsson  writes:
> Even though we know that the digest algorithm exists when we reach the second
> call, we must check the returnvalue from each call to px_find_digest to handle
> allocation errors.

Agreed, it's a bug.  Will push in a bit.

regards, tom lane




Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?

2020-10-16 Thread Bruce Momjian
On Thu, Oct 15, 2020 at 11:41:23AM +0800, Julien Rouhaud wrote:
> On Wed, Oct 14, 2020 at 10:40 PM Bruce Momjian  wrote:
> > There is that, and log_line_prefix, which I can imaging being useful.
> > My point is that if the queryid is visible, there should be a reason it
> > defaults to show empty.
> 
> I did some naive benchmarking.  Using a custom pgbench script with this query:
> 
> SELECT *
> FROM pg_class c
> JOIN pg_attribute a ON a.attrelid = c.oid
> ORDER BY 1 DESC
> LIMIT 1;
> 
> I can see around 2% overhead (this query is reported with ~ 3ms
> latency average).  Adding a few joins, overhead goes down to 1%.

That number is too high to enable this by default.  I suggest we either
improve the performance of this, or clearly document that you have to
enable the hash computation to see the pg_stat_activity and
log_line_prefix fields.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: [patch] Fix checksum verification in base backups for zero page headers

2020-10-16 Thread Anastasia Lubennikova

On 07.10.2020 11:18, Michael Paquier wrote:

On Fri, Sep 25, 2020 at 08:53:27AM +0200, Michael Banck wrote:

Oh right, I've fixed up the commit message in the attached V4.

Not much a fan of what's proposed here, for a couple of reasons:
- If the page is not new, we should check if the header is sane or
not.
- It may be better to actually count checksum failures if these are
repeatable in a given block, and report them.
- The code would be more simple with a "continue" added for a block
detected as new or with a LSN newer than the backup start.
- The new error messages are confusing, and I think that these are
hard to translate in a meaningful way.

I am interested in moving this patch forward, so here is the updated v5.
This version uses PageIsVerified. All error messages are left unchanged.

So I think that we should try to use PageIsVerified() directly in the
code path of basebackup.c, and this requires a couple of changes to
make the routine more extensible:
- Addition of a dboid argument for pgstat_report_checksum_failure(),
whose call needs to be changed to use the *_in_db() flavor.


In the current patch, PageIsVerified called from basebackup.c simply 
doesn't report failures to pgstat, because in basebackup.c we already 
report checksum failures separately, once per file.

        pgstat_report_checksum_failures_in_db(dboid, checksum_failures);

Should we change this too? I don't see any difference.


- Addition of an option to decide if a log should be generated or
not.
- Addition of an option to control if a checksum failure should be
reported to pgstat or not, even if this is actually linked to the
previous point, I have seen cases where being able to control both
separately would be helpful, particularly the log part.

For the last two ones, I would use an extensible argument based on
bits32 with a set of flags that the caller can set at will.

Done.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

commit 553d62dbcdec1a00139f3c5ab6a325ed857b6c9d
Author: anastasia 
Date:   Fri Oct 16 17:36:02 2020 +0300

Fix checksum verification in base backups for zero page headers

We currently do not checksum a page if it is considered new by PageIsNew().
However, this means that if the whole page header is zero, PageIsNew() will
consider that page new due to the pd_upper field being zero and no subsequent
checksum verification is done.

To fix, use PageIsVerified() in pg_basebackup code.

Reported-By: Andres Freund
Author: Michael Banck
Reviewed-By: Asif Rehman, Anastasia Lubennikova
Discussion: https://postgr.es/m/20190319200050.ncuxejradurja...@alap3.anarazel.de

diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index dbbd3aa31f..b4e050c9b8 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -443,7 +443,7 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
 
 		smgrread(src, forkNum, blkno, buf.data);
 
-		if (!PageIsVerified(page, blkno))
+		if (!PageIsVerified(page, blkno, PIV_LOG_WARNING | PIV_REPORT_STAT))
 			ereport(ERROR,
 	(errcode(ERRCODE_DATA_CORRUPTED),
 	 errmsg("invalid page in block %u of relation %s",
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index b89df01fa7..aa8d52c1aa 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -1682,11 +1682,14 @@ sendFile(const char *readfilename, const char *tarfilename,
  * this case. We also skip completely new pages, since they
  * don't have a checksum yet.
  */
-if (!PageIsNew(page) && PageGetLSN(page) < startptr)
+if (PageGetLSN(page) < startptr)
 {
-	checksum = pg_checksum_page((char *) page, blkno + segmentno * RELSEG_SIZE);
-	phdr = (PageHeader) page;
-	if (phdr->pd_checksum != checksum)
+	int piv_flags = (checksum_failures < 5) ? (PIV_LOG_WARNING) : 0;
+	/*
+	 * Do not report checksum failures to pgstats from
+	 *  PageIsVerified, since we will do it later.
+	 */
+	if (!PageIsVerified(page, blkno, piv_flags))
 	{
 		/*
 		 * Retry the block on the first failure.  It's
@@ -1732,13 +1735,6 @@ sendFile(const char *readfilename, const char *tarfilename,
 
 		checksum_failures++;
 
-		if (checksum_failures <= 5)
-			ereport(WARNING,
-	(errmsg("checksum verification failed in "
-			"file \"%s\", block %d: calculated "
-			"%X but expected %X",
-			readfilename, blkno, checksum,
-			phdr->pd_checksum)));
 		if (checksum_failures == 5)
 			ereport(WARNING,
 	(errmsg("further checksum verification "
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index e549fa1d30..971c96c7de 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -917,7 +917,7 @@ ReadBuffer_common(SMgrRelation 

Re: partition routing layering in nodeModifyTable.c

2020-10-16 Thread Alvaro Herrera
On 2020-Oct-16, Amit Langote wrote:

> On Thu, Oct 15, 2020 at 11:59 PM Heikki Linnakangas  wrote:
> > On Wed, Oct 14, 2020 at 6:04 PM Heikki Linnakangas  wrote:

> > And if we removed
> > ri_PartitionInfo->pi_PartitionToRootMap, and always used
> > ri_ChildToRootMap for it.
> 
> Done in the attached.

Hmm...  Overall I like the simplification.

> > Maybe remove PartitionRoutingInfo struct altogether, and just move its
> > fields directly to ResultRelInfo.
> 
> If we do that, we'll end up with 3 notations for the same thing across
> releases: In v10 and v11, PartitionRoutingInfos members are saved in
> arrays in ModifyTableState, totally detached from the partition
> ResultRelInfos.  In v12 (3f2393edef), we moved them into ResultRelInfo
> but chose to add them into a sub-struct (PartitionRoutingInfo), which
> in retrospect was not a great decision.  Now if we pull them into
> ResultRelInfo, we'll have invented the 3rd notation.  Maybe that makes
> things hard when back-patching bug-fixes?

I don't necessarily agree that PartitionRoutingInfo was such a bad idea.
In fact I wonder if we shouldn't move *more* stuff into it
(ri_PartitionCheckExpr), and keep struct ResultRelInfo clean of
partitioning-related stuff (other than ri_PartitionInfo and
ri_PartitionRoot); there are plenty of ResultRelInfos that are not
partitions, so I think it makes sense to keep the split.  I'm thinking
that the ChildToRootMap should continue to be in PartitionRoutingInfo.

Maybe what we need in order to keep the initialization "lazy enough" is
some inline functions that act as getters, initializing members of
PartitionRoutingInfo when first needed.  (This would probably need
boolean flags, to distinguish "hasn't been set up yet" from "it is not
needed for this partition" for each member that requires it).

BTW it is curious that ExecInitRoutingInfo is called both in
ExecInitPartitionInfo() (from ExecFindPartition when the ResultRelInfo
for the partition is not found) *and* from ExecFindPartition again, when
the ResultRelInfo for the partition *is* found.  Doesn't this mean that
ri_PartitionInfo is set up twice for the same partition?





Re: upcoming API changes for LLVM 12

2020-10-16 Thread Tom Lane
Andres Freund  writes:
> On 2020-10-16 02:45:51 -0300, Alvaro Herrera wrote:
>>> 2) When do we want to add LLVM 12 support? PG will soon stop compiling
>>> against LLVM 12, which will be released in about 6 months. I worked
>>> with Lang to make most of the breaking changes in a branch (to be
>>> merged in the next few days), but it's possible that there will be a
>>> few smaller changes.

>> hmm, how regular are LLVM releases?  I mean, what if pg14 ends up being
>> released sooner than LLVM12 – would there be a problem?

> Pretty unlikely - they're half yearly releases, and come out on a
> somewhat regular schedule. They've moved a few weeks but not more. And
> even if they did - having a few #ifdefs for LLVM 12 would be ok anyway.

Yeah.  As long as we're not breaking the ability to build against older
LLVM, I can't see a reason not to apply and back-patch these changes.
We usually want all supported PG versions to build against newer tool
chains, and this seems to fall into that category.

regards, tom lane




Re: Improper use about DatumGetInt32

2020-10-16 Thread Alvaro Herrera
On 2020-Sep-23, Ashutosh Bapat wrote:

> > You're ignoring the xid use-case, for which DatumGetUInt32 actually is
> > the right thing.
> 
> There is DatumGetTransactionId() which should be used instead.
> That made me search if there's PG_GETARG_TRANSACTIONID() and yes it's
> there but only defined in xid.c. So pg_xact_commit_timestamp(),
> pg_xact_commit_timestamp_origin() and pg_get_multixact_members() use
> PG_GETARG_UNIT32. IMO those should be changed to use
> PG_GETARG_TRANSACTIONID. That would require moving
> PG_GETARG_TRANSACTIONID somewhere outside xid.c; may be fmgr.h where
> other PG_GETARG_* are.

Hmm, yeah, I think this would be a good idea.
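
IIRC the macro in xid.c is just

    #define PG_GETARG_TRANSACTIONID(n)  DatumGetTransactionId(PG_GETARG_DATUM(n))

so moving it to fmgr.h next to the other PG_GETARG_* macros should be a
mechanical change.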

> get_raw_page() also does similar thing but the effect is not as dangerous
> SELECT octet_length(get_raw_page('test1', 'main', -1)) AS main_1;
>   ERROR:  block number 4294967295 is out of range for relation "test1"
> Similarly for bt_page_stats() and bt_page_items()

Hmm, but page numbers above signed INT_MAX are valid.  So this would
prevent reading all legitimate pages past that.





Re: CREATE TABLE .. PARTITION OF fails to preserve tgenabled for inherited row triggers

2020-10-16 Thread Alvaro Herrera
On 2020-Sep-30, Justin Pryzby wrote:

> postgres=# SELECT tgrelid::regclass, tgenabled FROM pg_trigger WHERE 
> tgrelid::regclass::text IN ('t1','t2');
>  tgrelid | tgenabled 
> -+---
>  t1  | D
>  t2  | O
> (2 rows)
> 
> I consider this a bug,

Yeah.

> but CreateTrigStmt doesn't have any "enabled" member
> (since it's impossible to CREATE TRIGGER .. DISABLED), so I'm not sure where
> the fix should be.

I suggest we add a new function, as in the attached.  It's important to
keep the ABI of CreateTrigger unchanged, for the sake of
backpatchability, but ISTM it's equally important to keep its API
unchanged, because outside callers such as ProcessUtility_slow does not
have to care about the new trigger's enabled state.
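
For example (roughly the scenario from your report; names made up), with
the fix the cloned trigger keeps the parent trigger's firing state:

  create table t (a int) partition by list (a);
  create table t1 partition of t for values in (1);
  create function trigfn() returns trigger language plpgsql
    as $$ begin return new; end $$;
  create trigger tg after insert on t for each row execute function trigfn();
  alter table t disable trigger tg;
  create table t2 partition of t for values in (2);
  -- tgenabled in pg_trigger should now show 'D' for t2's trigger as well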
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 511f015a86..f20b877f68 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -16879,10 +16879,10 @@ CloneRowTriggersToPartition(Relation parent, Relation partition)
 		trigStmt->initdeferred = trigForm->tginitdeferred;
 		trigStmt->constrrel = NULL; /* passed separately */
 
-		CreateTrigger(trigStmt, NULL, RelationGetRelid(partition),
-	  trigForm->tgconstrrelid, InvalidOid, InvalidOid,
-	  trigForm->tgfoid, trigForm->oid, qual,
-	  false, true);
+		CreateTriggerFiringOn(trigStmt, NULL, RelationGetRelid(partition),
+			  trigForm->tgconstrrelid, InvalidOid, InvalidOid,
+			  trigForm->tgfoid, trigForm->oid, qual,
+			  false, true, trigForm->tgenabled);
 
 		MemoryContextSwitchTo(oldcxt);
 		MemoryContextReset(perTupCxt);
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index 3b4fbdadf4..b826b3595c 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -158,6 +158,24 @@ CreateTrigger(CreateTrigStmt *stmt, const char *queryString,
 			  Oid relOid, Oid refRelOid, Oid constraintOid, Oid indexOid,
 			  Oid funcoid, Oid parentTriggerOid, Node *whenClause,
 			  bool isInternal, bool in_partition)
+{
+	return
+		CreateTriggerFiringOn(stmt, queryString, relOid, refRelOid,
+			  constraintOid, indexOid, funcoid,
+			  parentTriggerOid, whenClause, isInternal,
+			  in_partition, TRIGGER_FIRES_ON_ORIGIN);
+}
+
+/*
+ * Like the above; additionally the firing condition
+ * (always/origin/replica/disabled) can be specified.
+ */
+ObjectAddress
+CreateTriggerFiringOn(CreateTrigStmt *stmt, const char *queryString,
+	  Oid relOid, Oid refRelOid, Oid constraintOid,
+	  Oid indexOid, Oid funcoid, Oid parentTriggerOid,
+	  Node *whenClause, bool isInternal, bool in_partition,
+	  char trigger_fires_when)
 {
 	int16		tgtype;
 	int			ncolumns;
@@ -800,7 +818,7 @@ CreateTrigger(CreateTrigStmt *stmt, const char *queryString,
 			 CStringGetDatum(trigname));
 	values[Anum_pg_trigger_tgfoid - 1] = ObjectIdGetDatum(funcoid);
 	values[Anum_pg_trigger_tgtype - 1] = Int16GetDatum(tgtype);
-	values[Anum_pg_trigger_tgenabled - 1] = CharGetDatum(TRIGGER_FIRES_ON_ORIGIN);
+	values[Anum_pg_trigger_tgenabled - 1] = trigger_fires_when;
 	values[Anum_pg_trigger_tgisinternal - 1] = BoolGetDatum(isInternal || in_partition);
 	values[Anum_pg_trigger_tgconstrrelid - 1] = ObjectIdGetDatum(constrrelid);
 	values[Anum_pg_trigger_tgconstrindid - 1] = ObjectIdGetDatum(indexOid);
@@ -1130,11 +1148,11 @@ CreateTrigger(CreateTrigStmt *stmt, const char *queryString,
 map_partition_varattnos((List *) qual, PRS2_NEW_VARNO,
 		childTbl, rel);
 
-			CreateTrigger(childStmt, queryString,
-		  partdesc->oids[i], refRelOid,
-		  InvalidOid, indexOnChild,
-		  funcoid, trigoid, qual,
-		  isInternal, true);
+			CreateTriggerFiringOn(childStmt, queryString,
+  partdesc->oids[i], refRelOid,
+  InvalidOid, indexOnChild,
+  funcoid, trigoid, qual,
+  isInternal, true, trigger_fires_when);
 
 			table_close(childTbl, NoLock);
 
diff --git a/src/include/commands/trigger.h b/src/include/commands/trigger.h
index a40ddf5db5..40b8154876 100644
--- a/src/include/commands/trigger.h
+++ b/src/include/commands/trigger.h
@@ -162,6 +162,11 @@ extern ObjectAddress CreateTrigger(CreateTrigStmt *stmt, const char *queryString
    Oid relOid, Oid refRelOid, Oid constraintOid, Oid indexOid,
    Oid funcoid, Oid parentTriggerOid, Node *whenClause,
    bool isInternal, bool in_partition);
+extern ObjectAddress CreateTriggerFiringOn(CreateTrigStmt *stmt, const char *queryString,
+		   Oid relOid, Oid refRelOid, Oid constraintOid,
+		   Oid indexOid, Oid funcoid, Oid parentTriggerOid,
+		   Node *whenClause, bool isInternal, bool in_partition,
+		   char trigger_fires_when);
 
 extern void RemoveTriggerById(Oid trigOid);
 extern Oid	get_trigger_oid(Oid relid, const char *name, bool missing_ok);


Re: partition routing layering in nodeModifyTable.c

2020-10-16 Thread Amit Langote
On Thu, Oct 15, 2020 at 11:59 PM Heikki Linnakangas  wrote:
> On Wed, Oct 14, 2020 at 6:04 PM Heikki Linnakangas  wrote:
> > I'll continue with the last couple of patches in this thread.
>
> I committed the move of the cross-partition logic to new
> ExecCrossPartitionUpdate() function, with just minor comment editing and
> pgindent. I left out the refactoring around the calls to AFTER ROW
> INSERT/DELETE triggers. I stared at the change for a while, and wasn't
> sure if I liked the patched or the unpatched new version better, so I
> left it alone.

Okay, thanks for committing that.

> Looking at the last patch, "Revise child-to-root tuple conversion map
> management", that's certainly an improvement. However, I find it
> confusing that sometimes the mapping from child to root is in
> relinfo->ri_ChildToRootMap, and sometimes in
> relinfo->ri_PartitionInfo->pi_PartitionToRootMap. When is each of those
> filled in? If both are set, is it well defined which one is initialized
> first?

It is ri_ChildToRootMap that is set first, because it's only set in
child UPDATE target relations which are initialized in
ExecInitModifyTable(), that is, way before partition tuple routing
comes into the picture.

ri_PartitionInfo and hence pi_PartitionToRootMap is set in tuple
routing target partition's ResultRelInfos, which are lazily
initialized when tuples land into them.

If a tuple routing target partition happens to be an UPDATE target
relation and we need to initialize the partition-to-root map, which
for a tuple routing target partition is to be saved in
pi_PartitionToRootMap, with the patch, we will try to reuse
ri_ChildToRootMap because it would already be initialized.

But as you say below, maybe we don't need to have two fields for the
same thing, which I agree with.  Having only ri_ChildToRootMap as you
suggest below suffices.

> In general, I'm pretty confused by the initialization of
> ri_PartitionInfo. Where is initialized, and when? In execnodes.h, the
> definition of ResultRelInfo says:
>
> >   /* info for partition tuple routing (NULL if not set up yet) */
> >   struct PartitionRoutingInfo *ri_PartitionInfo;
>
> That implies that the field is initialized lazily. But in
> ExecFindPartition, we have this:
>
> >   if (partidx == partdesc->boundinfo->default_index)
> >   {
> >   PartitionRoutingInfo *partrouteinfo = 
> > rri->ri_PartitionInfo;
> >
> >   /*
> >* The tuple must match the partition's layout for 
> > the constraint
> >* expression to be evaluated successfully.  If the 
> > partition is
> >* sub-partitioned, that would already be the case 
> > due to the code
> >* above, but for a leaf partition the tuple still 
> > matches the
> >* parent's layout.
> >*
> >* Note that we have a map to convert from root to 
> > current
> >* partition, but not from immediate parent to 
> > current partition.
> >* So if we have to convert, do it from the root 
> > slot; if not, use
> >* the root slot as-is.
> >*/
> >   if (partrouteinfo)
> >   {
> >   TupleConversionMap *map = 
> > partrouteinfo->pi_RootToPartitionMap;
> >
> >   if (map)
> >   slot = 
> > execute_attr_map_slot(map->attrMap, rootslot,
> > 
> >partrouteinfo->pi_PartitionTupleSlot);
> >   else
> >   slot = rootslot;
> >   }
> >
> >   ExecPartitionCheck(rri, slot, estate, true);
> >   }
>
> That check implies that it's not just lazily initialized, the code will
> work differently if ri_PartitionInfo is set or not.
>
> I think all this would be more clear if ri_PartitionInfo and
> ri_ChildToRootMap were both truly lazily initialized, the first time
> they're needed.

So, we initialize these maps when we initialize a partition's
ResultRelInfo.  I mean if the partition has a different tuple
descriptor than root, we know we are going to need to convert tuples
between them (in either direction), so we might as well initialize the
maps when the ResultRelInfo is built, which we do lazily for tuple
routing target relations at least.  In that sense, at least
root-to-partition maps are initialized lazily, that is only when a
partition receives a tuple via routing.

Partition-to-root maps' initialization though is not always lazy,
because they are also needed by UPDATE target relations, whose
ResultRelInfo are initialized in ExecInitModifyTable(), which is not
lazy enough.  That will change with my 

Potential use of uninitialized context in pgcrypto

2020-10-16 Thread Daniel Gustafsson
In px_crypt_md5() we have this section, with the second assignment to err being
unchecked:

   /* */
   err = px_find_digest("md5", &ctx);
   if (err)
       return NULL;
   err = px_find_digest("md5", &ctx1);

Even though we know that the digest algorithm exists when we reach the second
call, we must check the return value from each call to px_find_digest to handle
allocation errors.  Depending on which library is backing pgcrypto, px_find_digest
may perform resource allocation, which can fail on the subsequent call.  It does
fall in the not-terribly-likely-to-happen category, but there is a non-zero risk
which would lead to using a broken context.  The attached patch checks the err
return value and exits in case it indicates an error.
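
In other words, something along these lines:

    err = px_find_digest("md5", &ctx);
    if (err)
        return NULL;
    err = px_find_digest("md5", &ctx1);
    if (err)
        return NULL;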

cheers ./daniel



pgcrypto_digest_error.patch
Description: Binary data



Re: Feature improvement for pg_stat_statements

2020-10-16 Thread Kyotaro Horiguchi
At Fri, 16 Oct 2020 10:47:50 +, Yuki Seino  wrote 
in 
> The following review has been posted through the commitfest application:
> make installcheck-world:  tested, passed
> Implements feature:   tested, failed
> Spec compliant:   tested, passed
> Documentation:tested, failed
> 
> The patch applies cleanly and looks fine to me.I'm going to list a few points 
> that I think need to be fixed.
> 
> 1.There are unnecessary difference lines 89 in "pg_stat_statements.c".
> 2.It is confusing that the initial value of reset_time is the current date 
> and time, so why not set it to null?
> 3.How about pg_stat_statements_reset_time creates a view?
> 4.How about counting the number of resets?
> 5."pgstatstatstatements.sgml" would need to include the function name in the 
> following section.
> -   these statistics, the module provides a view, 
> pg_stat_statements,
> -   and the utility functions 
> pg_stat_statements_reset and
> -   pg_stat_statements.  These are not available 
> globally but
> -   can be enabled for a specific database with
> +   these statistics, the module provides views, 
> pg_stat_statements
> +   and pg_stat_statements_ctl,
> +   and the utility functions 
> pg_stat_statements_reset,
> +   pg_stat_statements, and 
> pg_stat_statements_reset_time.
> +   These are not available globally but can be enabled for a specific 
> database with
> 
> It would be nice to be able to keep the timing of resetting the userid, dbid 
> and queryid as well, but I think the status quo is appropriate for management 
> in memory.
> 
> The new status of this patch is: Waiting on Author


+   SpinLockAcquire(&pgss->mutex);

You might have noticed, but there is a purpose to using the following
idiom. Without it, the compiler might optimize out the comparison,
assuming *pgss won't change.

>   volatile pgssSharedState *s = (volatile pgssSharedState *) pgss; \
>   SpinLockAcquire(&s->mutex); \

+   SpinLockAcquire(&pgss->mutex);
+   pgss->reset_time = GetCurrentTimestamp();

We should avoid (possibly) time-consuming calls like GetCurrentTimestamp()
while holding the spinlock.
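
Something like this, for example (just a sketch):

	TimestampTz	now = GetCurrentTimestamp();	/* taken outside the lock */
	{
		volatile pgssSharedState *s = (volatile pgssSharedState *) pgss;

		SpinLockAcquire(&s->mutex);
		s->reset_time = now;
		SpinLockRelease(&s->mutex);
	}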


+  this function shows last reset time only when 
pg_stat_statements_reset(0,0,0) is called.

This reads as "this function displays the last reset time only at the
moment pg_stat_statements_reset(0,0,0) is called", which I think is
different from what is intended.

And the wording is not alike with the documentation for similar functions.

https://www.postgresql.org/docs/13/functions-datetime.html
> current_timestamp → timestamp with time zone
> Current date and time (start of current transaction); see Section 9.9.4

https://www.postgresql.org/docs/13/monitoring-stats.html
pg_stat_archiver view
> stats_reset timestamp with time zone
> Time at which these statistics were last reset

So something like the following?

Time at which pg_stat_statements_reset(0,0,0) was last called.

or

Time at which statistics were last discarded by calling 
pg_stat_statements_reset(0,0,0).


regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Add session statistics to pg_stat_database

2020-10-16 Thread Ahsan Hadi
Hi Laurenz,

I have applied the latest patch on master; all the regression test cases
are passing and the implemented functionality is also looking fine. The
point that I raised about idle connections not being included is also addressed.

thanks,
Ahsan

On Wed, Oct 14, 2020 at 2:28 PM Laurenz Albe 
wrote:

> Thanks for the --- as always --- valuable review!
>
> On Tue, 2020-10-13 at 17:55 -0500, Justin Pryzby wrote:
> > On Tue, Oct 13, 2020 at 01:44:41PM +0200, Laurenz Albe wrote:
> > > Attached is v3 with improvements.
> >
> > +  
> > +   Time spent in database sessions in this database, in
> milliseconds.
> > +  
> >
> > Should say "Total time spent *by* DB sessions..." ?
>
> That is indeed better.  Fixed.
>
> > I think these counters are only accurate as of the last state change,
> right?
> > So a session which has been idle for 1hr, that 1hr is not included.  I
> think
> > the documentation should explain that, or (ideally) the implementation
> would be
> > more precise.  Maybe the timestamps should only be updated after a
> session
> > terminates (and the docs should say so).
>
> I agree, and I have added an explanation that the value doesn't include
> the duration of the current state.
>
> Of course it would be nice to have totally accurate values, but I think
> that the statistics are by nature inaccurate (datagrams can get lost),
> and more frequent statistics updates increase the work load.
> I don't think that is worth the effort.
>
> > +  
> > +   connections bigint
> > +  
> > +  
> > +   Number of connections established to this database.
> >
> > *Total* number of connections established, otherwise it sounds like it
> might
> > mean "the number of sessions [currently] established".
>
> Fixed like that.
>
> > +   Number of database sessions to this database that did not end
> > +   with a regular client disconnection.
> >
> > Does that mean "sessions which ended irregularly" ?  Or does it also
> include
> > "sessions which have not ended" ?
>
> I have added an explanation for that.
>
> > +   msg.m_aborted = (!disconnect || pgStatSessionDisconnected) ? 0 :
> 1;
> >
> > I think this can be just:
> > msg.m_aborted = (bool) (disconnect && !pgStatSessionDisconnected);
>
> I mulled over this and finally decided to leave it as it is.
>
> Since "m_aborted" gets added to the total counter, I'd prefer to
> have it be an "int".
>
> Your proposed code works (the cast is actually not necessary, right?).
> But I think that my version is more readable if you think of
> "m_aborted" as a counter rather than a flag.
>
> > +   if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
> > +   result = 0;
> > +   else
> > +   result = ((double) dbentry->n_session_time) / 1000.0;
> >
> > I think these can say:
> > > double result = 0;
> > > if ((dbentry=..) != NULL)
> > >  result = (double) ..;
> >
> > That not only uses fewer LOC, but also the assignment to zero is (known
> to be)
> > done at compile time (BSS) rather than runtime.
>
> I didn't know about the performance difference.
> Concise code (if readable) is good, so I changed the code like you propose.
>
> The code pattern is actually copied from neighboring functions,
> which then should also be changed like this, but that is outside
> the scope of this patch.
>
> Attached is v4 of the patch.
>
> Yours,
> Laurenz Albe
>


-- 
Highgo Software (Canada/China/Pakistan)
URL : http://www.highgo.ca
ADDR: 10318 WHALLEY BLVD, Surrey, BC
EMAIL: mailto: ahsan.h...@highgo.ca


Re: Enumize logical replication message actions

2020-10-16 Thread Amit Kapila
On Fri, Oct 16, 2020 at 12:55 PM Ashutosh Bapat
 wrote:
>
> Hi All,
>  The logical replication protocol uses single-byte characters to identify
> different chunks of logical replication messages. The code uses
> character literals for them, and these literals are used as bare
> constants in the code as well. That's true for almost all the code that
> deals with the wire protocol. With that, it becomes difficult to identify
> the code which deals with a particular message, for example code that
> deals with message type 'B'. In various protocols 'B' has different
> meanings, and it gets difficult and time-consuming to differentiate one
> usage from another and find all the places which deal with one usage. Here's
> a patch simplifying that for top-level logical replication messages.
>

+1. I think this will make the code easier to read and understand. I
think it would be good to do this in some other parts as well but
starting with logical replication is a good idea as that area is still
evolving.
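
For reference, I guess the proposed enum looks something like this
(sketch; exact names are up to the patch):

typedef enum LogicalRepMsgType
{
	LOGICAL_REP_MSG_BEGIN = 'B',
	LOGICAL_REP_MSG_COMMIT = 'C',
	LOGICAL_REP_MSG_ORIGIN = 'O',
	LOGICAL_REP_MSG_INSERT = 'I',
	LOGICAL_REP_MSG_UPDATE = 'U',
	LOGICAL_REP_MSG_DELETE = 'D',
	LOGICAL_REP_MSG_TRUNCATE = 'T',
	LOGICAL_REP_MSG_RELATION = 'R',
	LOGICAL_REP_MSG_TYPE = 'Y',
} LogicalRepMsgType;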

-- 
With Regards,
Amit Kapila.




Re: New statistics for tuning WAL buffer size

2020-10-16 Thread Masahiro Ikeda

On 2020-10-15 19:49, Fujii Masao wrote:

On 2020/10/13 11:57, Masahiro Ikeda wrote:

On 2020-10-06 15:57, Masahiro Ikeda wrote:

2. Number of times a new WAL file is created and zero-filled.

As Fujii-san already commented, I think it's good for tuning.

Just idea; it may be worth exposing the number of when new WAL file 
is created and zero-filled. This initialization may have impact on 
the performance of write-heavy workload generating lots of WAL. If 
this number is reported high, to reduce the number of this 
initialization, we can tune WAL-related parameters so that more 
"recycled" WAL files can be hold.


+1

But it might be better to track the number of times a new WAL file is
created, whether it's zero-filled or not, if file creation and sync itself
takes time.


OK, I changed it to track the number of times a new WAL file is created.




3. Number of WAL logfile segment switches.

This is similar to 2, but this also counts the number of times a WAL file
is recycled.
I think it's useful for tuning "wal_segment_size":
if the number is high relative to the uptime, "wal_segment_size"
must be bigger.


You're thinking of counting all the WAL file switches? That number is equal
to the number of WAL files generated since the last reset of
pg_stat_wal?


Yes. I think it might be better to count it because I think the ratio at 
which a new WAL file is created is important.

To calculate it, we need to count all the WAL file switches.



4. Number of times WAL is flushed

I think it's useful for tuning "synchronous_commit" and 
"commit_delay"

for query executions.
If the number of WAL is flushed is high, users can know
"synchronous_commit" is useful for the workload.

Also, it's useful for tuning "wal_writer_delay" and
"wal_writer_flush_after" for wal writer.
If the number is high, users can change the parameter for 
performance.


I think it's better to separate this for backends and wal writer.


+1


Thanks, I separated the statistics for backends and wal writer.
When checkpointer process flushes the WAL, the statistics for backends 
are counted now.
Although I think its impact is not big, is it better to make statistics 
for checkpointer?




5.  Wait time when WAL is flushed.

This is the accumulated time when wal is flushed.
If the time becomes much higher, users can detect the possibility of
disk failure.


This should be tracked, e.g., only when track_io_timing is enabled?
Otherwise, tracking that may cause performance overhead.


OK,  I changed the implementation.


Since users can see how much flash time occupies of the query 
execution time,

it may lead to query tuning and so on.

Since there is the above reason, I think it's better to separate this
for backends and wal writer.



I'm afraid that this counter for a backend may be a bit confusing, because
when the counter indicates a small time, we may think that the walwriter
writes almost all the WAL data and a backend doesn't take time to write
WAL. But a backend may just be waiting for the walwriter to write WAL.


Thanks for your comments. I agreed.



Now, the following is the view implemented in the attached patch.
If you have any other comments, please let me know.

```
postgres=# SELECT * FROM pg_stat_wal;
-[ RECORD 1 ]-------+------------------------------
wal_records         | 1000128    # Total number of WAL records generated
wal_fpi             | 1          # Total number of WAL full page images generated
wal_bytes           | 124013682  # Total amount of WAL generated, in bytes
wal_buffers_full    | 7952       # Total number of times WAL data was written to disk because the WAL buffers were full
wal_file            | 14         # Total number of times a WAL file segment was created or a pre-existing one was opened
wal_init_file       | 7          # Total number of times a new WAL file segment was created
wal_write_backend   | 7956       # Total number of times WAL data was written to disk by backends
wal_write_walwriter | 27         # Total number of times WAL data was written to disk by the walwriter
wal_write_time      | 40         # Total amount of time spent writing WAL data to disk by backends and the walwriter, in milliseconds
wal_sync_backend    | 1          # Total number of times WAL data was synced to disk by backends
wal_sync_walwriter  | 6          # Total number of times WAL data was synced to disk by the walwriter
wal_sync_time       | 0          # Total amount of time spent syncing WAL data to disk by backends and the walwriter, in milliseconds
stats_reset         | 2020-10-16 19:41:01.892272+09
```
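
As a rough illustration of how these counters could be used for the tuning 
discussed above (column names as in this version of the patch, so they may 
still change), a DBA could run something like:

```
-- Share of newly created (zero-filled) segments among all segment
-- switches, and average time per write/sync call.
SELECT wal_init_file::float / NULLIF(wal_file, 0) AS new_segment_ratio,
       wal_write_time::float /
           NULLIF(wal_write_backend + wal_write_walwriter, 0) AS ms_per_write,
       wal_sync_time::float /
           NULLIF(wal_sync_backend + wal_sync_walwriter, 0) AS ms_per_sync
FROM pg_stat_wal;
```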

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 6656676..ee70b7b 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -3406,12 +3406,115 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
 
 
 
+ 
+  
+   wal_records bigint
+  
+  
+   Total number of WAL records generated
+  
+ 
+
+ 
+  
+   

Re: Feature improvement for pg_stat_statements

2020-10-16 Thread Yuki Seino
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, failed
Spec compliant:   tested, passed
Documentation:tested, failed

The patch applies cleanly and looks fine to me. I'm going to list a few points 
that I think need to be fixed.

1. There are unnecessary diff lines (line 89) in "pg_stat_statements.c".
2. It is confusing that the initial value of reset_time is the current date and 
time, so why not set it to null?
3. How about exposing pg_stat_statements_reset_time through a view?
4. How about counting the number of resets?
5. "pgstatstatements.sgml" would need to include the function name in the 
following section.
-   these statistics, the module provides a view, 
pg_stat_statements,
-   and the utility functions pg_stat_statements_reset 
and
-   pg_stat_statements.  These are not available 
globally but
-   can be enabled for a specific database with
+   these statistics, the module provides views, 
pg_stat_statements
+   and pg_stat_statements_ctl,
+   and the utility functions pg_stat_statements_reset,
+   pg_stat_statements, and 
pg_stat_statements_reset_time.
+   These are not available globally but can be enabled for a specific 
database with

It would be nice to be able to keep the time of resetting the userid, dbid 
and queryid as well, but I think the status quo is appropriate for managing 
this in memory.
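
As a sketch of points 2-4 (the column names here are purely illustrative, not 
from the patch):

```
SELECT pg_stat_statements_reset();
-- hypothetical columns exposing the last reset time and a reset counter
SELECT stats_reset, reset_count FROM pg_stat_statements_ctl;
```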

The new status of this patch is: Waiting on Author


Re: Implementing Incremental View Maintenance

2020-10-16 Thread Yugo NAGATA
Hi,

I have reviewed the past discussions in this thread on the IVM implementation
of the proposed patch[1], and summarized them as follows. We would appreciate
any comments or suggestions on the patch with regard to them.

* Aggregate support

The current patch supports several built-in aggregates, that is, count, sum, 
avg, min, and max. Other built-in aggregates or user-defined aggregates are
not supported.

Aggregates in a materialized view definition are checked for support
using the OIDs of the aggregate functions. To this end, Gen_fmgrtab.pl is changed to
output aggregate functions' OIDs to fmgroids.h
(by 0006-Change-Gen_fmgrtab.pl-to-output-aggregate-function-s.patch). 
The logic for each aggregate function to update aggregate values in
materialized views is embedded in a trigger function.

There was another option in the past discussion: we could add one
or more new attributes to pg_aggregate which provide information about whether
each aggregate function supports IVM and its logic[2]. If we had a mechanism
to support IVM in pg_aggregate, we could use more general aggregate functions,
including user-defined aggregates, in materialized views for IVM.

For example, the current pg_aggregate has the aggcombinefn attribute for
supporting partial aggregation.  Maybe we could use combine functions to
calculate new aggregate values in materialized views when tuples are
inserted into a base table.  However, in the context of IVM, we also need
another function used when tuples are deleted from a base table, so we cannot
use partial aggregation for IVM in the current implementation. 

Maybe we could support the deletion case by adding a new support function,
say, an "inverse combine function".  The "inverse combine function" would take 
the aggregate value in a materialized view and the aggregate value calculated 
from a delta of the view, and produce the new aggregate value which equals the 
result after tuples in a base table are deleted.
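
As a toy illustration of that idea (not the patch's code; the table and column
names below are made up), for count and sum the "inverse combine" step is
essentially a subtraction of the delta computed from the deleted tuples:

```
UPDATE mv
   SET sum_v       = mv.sum_v       - d.sum_v,
       __ivm_count = mv.__ivm_count - d.cnt
  FROM (SELECT grp, count(*) AS cnt, sum(v) AS sum_v
          FROM deleted_tuples GROUP BY grp) AS d
 WHERE mv.grp = d.grp;
```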

However, we don't have a concrete plan for the new design of pg_aggregate.  
In addition, even if we made a new support function in pg_aggregate for IVM, 
we couldn't use it in the current IVM code, because our code uses SQL via SPI
in order to update a materialized view and we can't call an "internal"-type
function directly in SQL.

For these reasons, in the current patch, we decided to leave support for
general aggregates to the next version for simplicity, so the current patch
supports only some built-in aggregates and checks whether they can be used in IVM
by their OIDs.

* Hidden columns

To support aggregates, DISTINCT, and EXISTS, the current implementation
automatically creates hidden columns whose names start with "__ivm_" in
materialized views.

The columns starting with "__ivm_" are hidden, so when "SELECT * FROM ..." is
issued against a materialized view, they are invisible to users. Users cannot
use such names as user columns in materialized views with IVM support. 

As for how to make internal columns invisible to SELECT *, there have
previously been discussions about doing that using a new flag in pg_attribute[3]. 
However, that discussion is no longer active. So, for now, we decided to use the
column name to check whether a column is special in our IVM implementation.
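
For example (simplified; the exact hidden column names the patch generates may
differ):

```
CREATE INCREMENTAL MATERIALIZED VIEW mv AS
    SELECT grp, count(*) AS cnt FROM tbl GROUP BY grp;

SELECT * FROM mv;                      -- shows only grp and cnt
SELECT grp, cnt, __ivm_count FROM mv;  -- hidden column appears only when named explicitly
```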

* TRUNCATE support

Currently, TRUNCATE on base tables is not supported. When a TRUNCATE command
is executed on a base table, it is ignored and nothing happens to the materialized
views. 

There are other options, as follows:

- Raise an error or warning when a base table is TRUNCATEd.
- Make the view non-scannable (like REFRESH WITH NO DATA).
- Update the view in some way. It would be easy for inner joins
  or aggregate views, but there is some difficulty with outer joins.

Which is the best way? Should we support TRUNCATE in the first version?
Any suggestions would be greatly appreciated.

[1] https://wiki.postgresql.org/wiki/Incremental_View_Maintenance
[2] 
https://www.postgresql.org/message-id/20191129173328.e5a0e9f81e369a3769c4fd0c%40sraoss.co.jp
[3] 
https://www.postgresql.org/message-id/flat/CAEepm%3D3ZHh%3Dp0nEEnVbs1Dig_UShPzHUcMNAqvDQUgYgcDo-pA%40mail.gmail.com


Regards,
Yugo Nagata

-- 
Yugo NAGATA 




Re: Parallel INSERT (INTO ... SELECT ...)

2020-10-16 Thread Amit Kapila
On Fri, Oct 16, 2020 at 2:16 PM Greg Nancarrow  wrote:
>
> On Fri, Oct 16, 2020 at 3:43 PM Amit Kapila  wrote:
> >
>
> > Also, I noticed that you have allowed for
> > parallelism only when all expressions/functions involved with Insert
> > are parallel-safe, can't we allow parallel-restricted case because
> > anyway Inserts have to be performed by the leader for this patch.
> >
>
> Yes, I think you're right.
> "A parallel restricted operation is one which cannot be performed in a
> parallel worker, but which can be performed in the leader while
> parallel query is in use."
> I'll make the change and test that everything works OK.
>

Cool, let me try to explain my thoughts a bit more. The idea is first
(in standard_planner) we check if there is any 'parallel_unsafe'
function/expression (via max_parallel_hazard) in the query tree. If we
don't find anything 'parallel_unsafe' then we mark parallelModeOk. At
this stage, the patch is checking whether there is any
'parallel_unsafe' or 'parallel_restricted' expression/function in the
target relation and if there is none then we mark parallelModeOK as
true. So, if there is anything 'parallel_restricted' then we will mark
parallelModeOK as false which doesn't seem right to me.
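
To make the distinction concrete, here is a hedged example (not taken from the
patch) of a parallel-restricted expression on the target relation which,
following the above, should not disable parallelism, since the leader performs
the inserts:

```
CREATE FUNCTION restricted_note() RETURNS text
    LANGUAGE sql PARALLEL RESTRICTED
    AS $$ SELECT current_setting('application_name') $$;

CREATE TABLE target (a int, note text DEFAULT restricted_note());

-- The SELECT part can still use parallel workers; only the inserts must
-- stay in the leader.
INSERT INTO target (a) SELECT i FROM generate_series(1, 1000000) AS i;
```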

Then later in the planner during set_rel_consider_parallel, we
determine if a particular relation can be scanned from within a
worker, then we consider that relation for parallelism. Here, we
determine if certain things are parallel-restricted then we don't
consider this for parallelism. Then we create partial paths for the
relations that are considered for parallelism. I think we don't need
to change anything for the current patch in these later stages because
we anyway are not considering Insert to be pushed into workers.
However, in the second patch where we are thinking to push Inserts in
workers, we might need to do something to filter parallel-restricted
cases during this stage of the planner.

-- 
With Regards,
Amit Kapila.




Re: Wrong statistics for size of XLOG_SWITCH during pg_waldump.

2020-10-16 Thread movead...@highgo.ca

>It looks to me "We can know that length by subtracting the LSN of
>XLOG_SWITCH from the next record's LSN so it doesn't add any
>information."
Sorry, maybe I missed this before.
But I think it will be better to show it clearly.

>So the length of the junk in this case is:
> 
>LOC(SEG A+1) - ReadRecPtr - LEN(short header) - LEN(XLOG_SWITCH)
Thanks, I should not have missed this; fixed.




Regards,
Highgo Software (Canada/China/Pakistan) 
URL : www.highgo.ca 
EMAIL: mailto:movead(dot)li(at)highgo(dot)ca


fix_waldump_size_for_wal_switch_v5.patch
Description: Binary data


Re: Enumize logical replication message actions

2020-10-16 Thread Ashutosh Bapat
On Fri, 16 Oct 2020 at 14:06, Kyotaro Horiguchi 
wrote:

> At Fri, 16 Oct 2020 08:08:40 +, Li Japin  wrote
> in
> >
> > > On Oct 16, 2020, at 3:25 PM, Ashutosh Bapat <
> ashutosh.bapat@gmail.com> wrote:
>
> >
> > What about ’N’ for new tuples, ‘O’ for old tuple follows, ‘K’ for old
> key follows?
> > Those are also logical replication protocol message, I think.
>
> They are flags stored in a message so they can be seen as different
> from the message type letters.
>

I think converting those into macros/enums will help, but for now I have
tackled only the top-level message types.


>
> Anyway, if the values are chosen based on some meaning, I'm not sure
> whether enumerizing them is a good thing or not.  In other words, 'U' conveys
> almost the same amount of information as LOGICAL_REP_MSG_UPDATE in the
> context of the logical replication protocol.
>
> We have the same code pattern in PostgresMain and perhaps we aren't
> going to change those into enums.
>

That's exactly the problem I am trying to solve. Take for example 'B' as I
have mentioned before. That string literal appears in 64 different places
in the master branch. Which of those are the ones related to a "BEGIN"
message in logical replication protocol is not clear, unless I thumb
through each of those. In PostgresMain it's used to indicate a BIND
message. Which of those 64 instances are also using 'B' to mean a bind
message? Using enums or macros makes it clear. Just look
up LOGICAL_REP_MSG_BEGIN. Converting all 'B' to their respective macros
will help but might be problematic for back-patching. So that's arguable.
But doing that in something as new as logical replication will be helpful,
before it gets too late to change that.

Further, the logical replication protocol uses the same literal, e.g. 'O', to
mean origin in some places and old tuple in others. While comments
there help, it's not easy to locate all the code that deals with one
meaning or the other. This change will help with that. That's another reason
to do this for logical replication.
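
For reference, a rough sketch of what such an enum could look like
(illustrative only; the actual patch may define it differently):

```
typedef enum LogicalRepMsgType
{
	LOGICAL_REP_MSG_BEGIN = 'B',
	LOGICAL_REP_MSG_COMMIT = 'C',
	LOGICAL_REP_MSG_ORIGIN = 'O',
	LOGICAL_REP_MSG_INSERT = 'I',
	LOGICAL_REP_MSG_UPDATE = 'U',
	LOGICAL_REP_MSG_DELETE = 'D',
	LOGICAL_REP_MSG_TRUNCATE = 'T',
	LOGICAL_REP_MSG_RELATION = 'R'
} LogicalRepMsgType;
```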
-- 
Best Wishes,
Ashutosh


Re: gs_group_1 crashing on 13beta2/s390x

2020-10-16 Thread Christoph Berg
Re: Andres Freund
> I had a successful check-world run with maximum jittery on s390x. But I
> did hit the issue in different places than you did, so it'd be cool if
> you could re-enable JIT for s390x - I think you have a package tracking
> HEAD?

Cool, thanks!

I'm tracking PG14 head with apt.postgresql.org, but that doesn't have
s390x.

I'll pull the patches for PG13, re-enable JIT on some more
architectures, and use the opportunity to bump the LLVM version used
to 11.

Christoph




Re: [PATCH] Add extra statistics to explain for Nested Loop

2020-10-16 Thread Julien Rouhaud
Le ven. 16 oct. 2020 à 16:12, Pavel Stehule  a
écrit :

>
>
> pá 16. 10. 2020 v 9:43 odesílatel  napsal:
>
>> Hi, hackers.
>> For some distributions of data in tables, different loops in nested loop
>> joins can take different time and process different amounts of entries.
>> It makes average statistics returned by explain analyze not very useful
>> for DBA.
>> To fix it, here is the patch that add printing of min and max statistics
>> for time and rows across all loops in Nested Loop to EXPLAIN ANALYSE.
>> Please don't hesitate to share any thoughts on this topic!
>>
>
> +1
>
> This is great feature - sometimes it can be pretty messy current limited
> format
>

+1, this can be very handy!

>


Possible typo in nodeAgg.c

2020-10-16 Thread Hou, Zhijie
Hi

In /src/backend/executor/nodeAgg.c

I found that the following comment still uses "work mem".
Since hash_mem has been introduced, is it more accurate to use hash_mem here?

@@ -1827,7 +1827,7 @@ hash_agg_set_limits(double hashentrysize, double 
input_groups, int used_bits,
/*
 * Don't set the limit below 3/4 of hash_mem. In that case, we are at 
the
 * minimum number of partitions, so we aren't going to dramatically 
exceed
-* work mem anyway.
+* hash_mem anyway.

Best regards,
houzj




0001-fix-typo-in-nodeAgg.c.patch
Description: 0001-fix-typo-in-nodeAgg.c.patch


Re: Wrong statistics for size of XLOG_SWITCH during pg_waldump.

2020-10-16 Thread Kyotaro Horiguchi
At Fri, 16 Oct 2020 16:21:47 +0800, "movead...@highgo.ca"  
wrote in 
> Thanks for all the suggestion, and new patch attached.
> 
> >Andres suggested that we don't need that description with per-record
> >basis. Do you have a reason to do that?  (For clarity, I'm not
> >suggesting that you should reving it.)
> I think Andres is saying if we just log it in xlog_desc() then we cannot get
> the result for the '--stats=record' case, and the patch solves that problem.

Mmm.

>  and
> for that including it in the record description is useless. When looking
> at plain records the length is sufficiently deducable by looking at the
> next record's LSN.

It looks to me "We can know that length by subtracting the LSN of
XLOG_SWITCH from the next record's LSN so it doesn't add any
information."

> >+ XLByteToSeg(record->EndRecPtr - 1, endSegNo, record->segcxt.ws_segsize);
> >We use XLByteToPrevSeg instead for this purpose.
> Thanks and follow the suggestion.
> 
> >You forgot to add a correction for short headers.
> In fact, we need to consider this matter when the remaining size of a WAL
> segment cannot hold an XLogRecord struct for XLOG_SWITCH. 
> That means that if record->ReadRecPtr is in WAL segment A, then
> record->EndRecPtr is in WAL segment (A+2). So we need to subtract
> the long page header size in WAL segment (A+1).
> In my understanding we need not care about the short header; is my
> understanding wrong?

Maybe.

When a page doesn't have sufficient space for a record, the record is
split into two pieces and the last half is recorded after the header of
the next page. If the next page is in the next segment, the header is a
long header, and a short header otherwise.

If it were the last page of a segment:

(diagram: the switch record starts at ReadRecPtr near the end of SEG A and
crosses the segment boundary into SEG A+1, so a long page header is inserted;
the junk then runs up to the start of SEG A+2)

So the length of the junk is:

  LOC(SEG A+2) - ReadRecPtr - LEN(long header) - LEN(XLOG_SWITCH)


If not, that is, if it were not the last page of a segment:

(diagram: the switch record crosses a page boundary within SEG A, so a short
page header is inserted; the junk then runs up to the start of SEG A+1)

So the length of the junk in this case is:

  LOC(SEG A+1) - ReadRecPtr - LEN(short header) - LEN(XLOG_SWITCH)

> >I'm not confident on which is better, but I feel that this is not a
> >work for the display side, but for the recorder side, like attached.
> The patch really seems clearer, but the new 'OTHERS' may confuse
> users, and it is hard to handle with the '--rmgr=RMGR' option. So I have
> not used this design in this patch; let's see others' opinions.

Yeah, I don't like the "OTHERS" either.

> >Sorry for the confusion, but it would be a separate topic even if we
> >are going to fix that..
> Sorry, I removed the code; it makes sense that we should discuss it in a
> separate topic.

Agreed.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Parallel INSERT (INTO ... SELECT ...)

2020-10-16 Thread Greg Nancarrow
On Fri, Oct 16, 2020 at 3:43 PM Amit Kapila  wrote:
>
> On Thu, Oct 15, 2020 at 6:13 PM Amit Kapila  wrote:
> >
> > On Thu, Oct 15, 2020 at 9:56 AM Greg Nancarrow  wrote:
> > >
> > > Also, I'm seeing a partition-related error when running
> > > installcheck-world - I'm investigating that.
> > >
> >
> > Okay.
> >
>
> The attached patch fixes this partition case for me. Basically, we
> need to check the parallel-safety of PartitionKey. I have only checked
> for partsupfunc but I think we need to check the parallel-safety of
> partexprs as well.

Thanks, I had already added the parallel-safety check for partexprs
when I saw this, so your patch hopefully completes all the
partition-related checks required.

> Also, I noticed that you have allowed for
> parallelism only when all expressions/functions involved with Insert
> are parallel-safe, can't we allow parallel-restricted case because
> anyway Inserts have to be performed by the leader for this patch.
>

Yes, I think you're right.
"A parallel restricted operation is one which cannot be performed in a
parallel worker, but which can be performed in the leader while
parallel query is in use."
I'll make the change and test that everything works OK.

Regards,
Greg Nancarrow
Fujitsu Australia




Re: Enumize logical replication message actions

2020-10-16 Thread Kyotaro Horiguchi
At Fri, 16 Oct 2020 08:08:40 +, Li Japin  wrote in 
> 
> > On Oct 16, 2020, at 3:25 PM, Ashutosh Bapat  
> > wrote:
> > 
> > Hi All,
> > Logical replication protocol uses single byte character to identify
> > different chunks of logical replication messages. The code uses
> > character literals for the same. These literals are used as bare
> > constants in code as well. That's true for almost all the code that
> > deals with wire protocol. With that it becomes difficult to identify
> > the code which deals with a particular message. For example code that
> > deals with message type 'B'. In various protocol 'B' has different
> > meaning and it gets difficult and time consuming to differentiate one
> > usage from other and find all places which deal with one usage. Here's
> > a patch simplifying that for top level logical replication messages.
> > 
> > I think I have covered the places that need change. But I might have
> > missed something, given that these literals are used at several other
> > places (a problem this patch tries to fix :)).
> > 
> > Initially I had used #define for the same, but Peter E suggested using
> > Enums so that switch cases can detect any remaining items along with
> > stronger type checks.
> > 
> > Pavan offleast suggested to create a wrapper
> > pg_send_logical_rep_message() on top of pg_sendbyte(), similarly for
> > pg_getmsgbyte(). I wanted to see if this change is acceptable. If so,
> > I will change that as well. Comments/suggestions welcome.
> > 
> > -- 
> > Best Wishes,
> > Ashutosh Bapat
> > <0001-Enumize-top-level-logical-replication-actions.patch>
> 
> What about ’N’ for new tuples, ‘O’ for old tuple follows, ‘K’ for old key 
> follows?
> Those are also logical replication protocol message, I think.

They are flags stored in a message so they can be seen as different
from the message type letters.

Anyway, if the values are chosen based on some meaning, I'm not sure
whether enumerizing them is a good thing or not.  In other words, 'U' conveys
almost the same amount of information as LOGICAL_REP_MSG_UPDATE in the
context of the logical replication protocol.

We have the same code pattern in PostgresMain and perhaps we aren't
going to change those into enums.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center


Re: Wrong statistics for size of XLOG_SWITCH during pg_waldump.

2020-10-16 Thread movead...@highgo.ca
Thanks for all the suggestion, and new patch attached.

>Andres suggested that we don't need that description with per-record
>basis. Do you have a reason to do that?  (For clarity, I'm not
>suggesting that you should reving it.)
I think Andres is saying if we just log it in xlog_desc() then we cannot get
the result for the '--stats=record' case, and the patch solves that problem.

>+ XLByteToSeg(record->EndRecPtr - 1, endSegNo, record->segcxt.ws_segsize);
>We use XLByteToPrevSeg instead for this purpose.
Thanks and follow the suggestion.

>You forgot to add a correction for short headers.
In fact, we need to consider this matter when the remaining size of a WAL
segment cannot hold an XLogRecord struct for XLOG_SWITCH. 
That means that if record->ReadRecPtr is in WAL segment A, then
record->EndRecPtr is in WAL segment (A+2). So we need to subtract
the long page header size in WAL segment (A+1).
In my understanding we need not care about the short header; is my
understanding wrong?

>+ if(RM_XLOG_ID == rmid && XLOG_SWITCH == info)
> 
>We need a comment for the special code path.
>We don't follow this kind of convention. Rather use "variable =
>constant".
>+ {
>+ junk_len = xlog_switch_junk_len(record);
>+ stats->count_xlog_switch++;
>+ stats->junk_size += junk_len;
> 
>junk_len is used only the last line above.  We don't need that
>temporary variable.
> 
>+ * If the wal switch record spread on two segments, we should extra minus the
>This comment is sticking out of the 80-column border.  However, I'm not
>sure if we have reached a conclusion about the target console width.
>+ info = (rj << 4) & ~XLR_INFO_MASK;
>+ switch_junk_id = "XLOG/SWITCH_JUNK";
>+ if( XLOG_SWITCH == info && stats->count_xlog_switch > 0)
> 
>This line order is strange. At least switch_junk_id is used only in
>the if-then block.
Thanks and follow the suggestions.

 
>I'm not confident on which is better, but I feel that this is not a
>work for the display side, but for the recorder side, like attached.
The patch really seems clearer, but the new 'OTHERS' may confuse
users, and it is hard to handle with the '--rmgr=RMGR' option. So I have
not used this design in this patch; let's see others' opinions.

>Sorry for the confusion, but it would be a separate topic even if we
>are going to fix that..
Sorry, I removed the code; it makes sense that we should discuss it in a
separate topic.



Regards,
Highgo Software (Canada/China/Pakistan) 
URL : www.highgo.ca 
EMAIL: mailto:movead(dot)li(at)highgo(dot)ca


fix_waldump_size_for_wal_switch_v4.patch
Description: Binary data


Re: [PATCH] Add extra statistics to explain for Nested Loop

2020-10-16 Thread Pavel Stehule
pá 16. 10. 2020 v 9:43 odesílatel  napsal:

> Hi, hackers.
> For some distributions of data in tables, different loops in nested loop
> joins can take different time and process different amounts of entries.
> It makes average statistics returned by explain analyze not very useful
> for DBA.
> To fix it, here is the patch that add printing of min and max statistics
> for time and rows across all loops in Nested Loop to EXPLAIN ANALYSE.
> Please don't hesitate to share any thoughts on this topic!
>

+1

This is great feature - sometimes it can be pretty messy current limited
format

Pavel

-- 
> Ekaterina Sokolova
> Postgres Professional: http://www.postgrespro.com
> The Russian Postgres Company


Re: Enumize logical replication message actions

2020-10-16 Thread Li Japin

> On Oct 16, 2020, at 3:25 PM, Ashutosh Bapat  
> wrote:
> 
> Hi All,
> Logical replication protocol uses single byte character to identify
> different chunks of logical replication messages. The code uses
> character literals for the same. These literals are used as bare
> constants in code as well. That's true for almost all the code that
> deals with wire protocol. With that it becomes difficult to identify
> the code which deals with a particular message. For example code that
> deals with message type 'B'. In various protocol 'B' has different
> meaning and it gets difficult and time consuming to differentiate one
> usage from other and find all places which deal with one usage. Here's
> a patch simplifying that for top level logical replication messages.
> 
> I think I have covered the places that need change. But I might have
> missed something, given that these literals are used at several other
> places (a problem this patch tries to fix :)).
> 
> Initially I had used #define for the same, but Peter E suggested using
> Enums so that switch cases can detect any remaining items along with
> stronger type checks.
> 
> Pavan offleast suggested to create a wrapper
> pg_send_logical_rep_message() on top of pg_sendbyte(), similarly for
> pg_getmsgbyte(). I wanted to see if this change is acceptable. If so,
> I will change that as well. Comments/suggestions welcome.
> 
> -- 
> Best Wishes,
> Ashutosh Bapat
> <0001-Enumize-top-level-logical-replication-actions.patch>

What about ’N’ for new tuples, ‘O’ for old tuple follows, ‘K’ for old key 
follows?
Those are also logical replication protocol message, I think.

--
Best regards
Japin Li



Re: shared-memory based stats collector

2020-10-16 Thread Kyotaro Horiguchi
It occurred to me that I forgot to mention the most
significant outcome of this patch.

At Thu, 08 Oct 2020 16:03:26 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> At Tue, 06 Oct 2020 10:06:44 +0900 (JST), Kyotaro Horiguchi 
>  wrote in 
> > The previous version failed to flush local database stats for certain
> > condition. That behavior caused useless retries and finally a forced
> > flush that leads to contention. I fixed that and will measure
> > performance with this version.
> 
> I (we) got some performance numbers.
> 
> - Fetching 1 tuple from 1 of 100 tables from 100 to 800 clients.
> - Fetching 1 tuple from 1 of 10 tables from 100 to 800 clients.
> 
> Those showed speed of over 400,000 TPS at maximum, and no significant
> difference is seen between patched and unpatched across the whole range of
> the test. I tried 5 seconds as PGSTAT_MIN_INTERVAL (10s in the patch)
> but that made no difference.
> 
> - Fetching 1 tuple from 1 table from 800 clients.
> 
> No graph is attached for this, but this test shows speed of over 42
> TPS with or without the v39 patch.

Under a heavy load and with many tables, the *reader* side takes
seconds or more to read the stats table.  With this patch, it
takes almost no time (maybe on the order of ms?) for the same operation.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: upcoming API changes for LLVM 12

2020-10-16 Thread Andres Freund
Hi,

On 2020-10-16 02:45:51 -0300, Alvaro Herrera wrote:
> Whee, sounds pretty good ... (am I dreaming too much if I hope
> execution starts with non-jitted and switches on the fly to jitted
> once background compilation finishes?)

There's some more work needed to get there, but yes, the basics for that
are there now. It'd perhaps be doable with threads now, but it's not
clear we want that... We probably could build it with processes too -
it'd require some memory management fun, but it's doable.


> > 2) When do we want to add LLVM 12 support? PG will soon stop compiling
> >against LLVM 12, which will be released in about 6 months. I worked
> >with Lang to make most of the breaking changes in a branch (to be
> >merged in the next few days), but it's possible that there will be a
> >few smaller changes.
> 
> hmm, how regular are LLVM releases?  I mean, what if pg14 ends up being
> released sooner than LLVM12 – would there be a problem?

Pretty unlikely - they're half yearly releases, and come out on a
somewhat regular schedule. They've moved a few weeks but not more. And
even if they did - having a few #ifdefs for LLVM 12 would be ok anyway.

Greetings,

Andres Freund




Re: Parallel Inserts in CREATE TABLE AS

2020-10-16 Thread Luc Vlaming

On 16.10.20 08:23, Bharath Rupireddy wrote:

On Fri, Oct 16, 2020 at 11:33 AM Luc Vlaming  wrote:


Really looking forward to this ending up in postgres as I think it's a
very nice improvement.

Whilst reviewing your patch I was wondering: is there a reason you did
not introduce a batch insert in the destreceiver for the CTAS? For me
this makes a huge difference in ingest speed as otherwise the inserts do
not really scale so well as lock contention start to be a big problem.
If you like I can make a patch to introduce this on top?



Thanks for your interest. You are right, we can get maximum
improvement if we have multi inserts in destreceiver for the CTAS on
the similar lines to COPY FROM command. I specified this point in my
first mail [1]. You may want to take a look at an already existing
patch [2] for multi inserts, I think there are some review comments to
be addressed in that patch. I would love to see the multi insert patch
getting revived.

[1] - 
https://www.postgresql.org/message-id/CALj2ACWFq6Z4_jd9RPByURB8-Y8wccQWzLf%2B0-Jg%2BKYT7ZO-Ug%40mail.gmail.com
[2] - 
https://www.postgresql.org/message-id/CAEET0ZG31mD5SWjTYsAt0JTLReOejPvusJorZ3kGZ1%3DN1AC-Fw%40mail.gmail.com

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com



Sorry, I had not seen that pointer in your first email.

I'll first finish some other patches I'm working on and then I'll try to 
revive that patch. Thanks for the pointers.


Kind regards,
Luc
Swarm64




Enumize logical replication message actions

2020-10-16 Thread Ashutosh Bapat
Hi All,
 Logical replication protocol uses single byte character to identify
different chunks of logical replication messages. The code uses
character literals for the same. These literals are used as bare
constants in code as well. That's true for almost all the code that
deals with wire protocol. With that it becomes difficult to identify
the code which deals with a particular message. For example code that
deals with message type 'B'. In various protocol 'B' has different
meaning and it gets difficult and time consuming to differentiate one
usage from other and find all places which deal with one usage. Here's
a patch simplifying that for top level logical replication messages.

I think I have covered the places that need change. But I might have
missed something, given that these literals are used at several other
places (a problem this patch tries to fix :)).

Initially I had used #define for the same, but Peter E suggested using
Enums so that switch cases can detect any remaining items along with
stronger type checks.

Pavan offleast suggested to create a wrapper
pg_send_logical_rep_message() on top of pg_sendbyte(), similarly for
pg_getmsgbyte(). I wanted to see if this change is acceptable. If so,
I will change that as well. Comments/suggestions welcome.

-- 
Best Wishes,
Ashutosh Bapat
From fa2447ff73cc94b27e3641eda1a3f3d26fd72381 Mon Sep 17 00:00:00 2001
From: Ashutosh Bapat 
Date: Fri, 16 Oct 2020 12:31:35 +0530
Subject: [PATCH] Enumize top level logical replication actions

Logical replication protocol uses single byte character to identify
different chunks of logical replication messages. The code uses string
literals for the same. Enumize those so that
1. All the string literals used can be found at a single place. This
makes it easy to add more actions without the risk of conflicts.
2. It's easy to locate the code handling a given action.

We could create such enums even for the string literals that
differentiate parts of a message, but I have not attempted that in this
commit.

Ashutosh Bapat
---
 src/backend/replication/logical/proto.c  | 26 ++--
 src/backend/replication/logical/worker.c | 52 
 src/include/replication/logicalproto.h   | 19 +
 3 files changed, 58 insertions(+), 39 deletions(-)

diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index eb19142b48..09bce22d96 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -44,7 +44,7 @@ static const char *logicalrep_read_namespace(StringInfo in);
 void
 logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn)
 {
-	pq_sendbyte(out, 'B');		/* BEGIN */
+	pq_sendbyte(out, LOGICAL_REP_MSG_BEGIN);		/* BEGIN */
 
 	/* fixed fields */
 	pq_sendint64(out, txn->final_lsn);
@@ -76,7 +76,7 @@ logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 {
 	uint8		flags = 0;
 
-	pq_sendbyte(out, 'C');		/* sending COMMIT */
+	pq_sendbyte(out, LOGICAL_REP_MSG_COMMIT);		/* sending COMMIT */
 
 	/* send the flags field (unused for now) */
 	pq_sendbyte(out, flags);
@@ -112,7 +112,7 @@ void
 logicalrep_write_origin(StringInfo out, const char *origin,
 		XLogRecPtr origin_lsn)
 {
-	pq_sendbyte(out, 'O');		/* ORIGIN */
+	pq_sendbyte(out, LOGICAL_REP_MSG_ORIGIN);		/* ORIGIN */
 
 	/* fixed fields */
 	pq_sendint64(out, origin_lsn);
@@ -141,7 +141,7 @@ void
 logicalrep_write_insert(StringInfo out, TransactionId xid, Relation rel,
 		HeapTuple newtuple, bool binary)
 {
-	pq_sendbyte(out, 'I');		/* action INSERT */
+	pq_sendbyte(out, LOGICAL_REP_MSG_INSERT);		/* action INSERT */
 
 	/* transaction ID (if not valid, we're not streaming) */
 	if (TransactionIdIsValid(xid))
@@ -185,7 +185,7 @@ void
 logicalrep_write_update(StringInfo out, TransactionId xid, Relation rel,
 		HeapTuple oldtuple, HeapTuple newtuple, bool binary)
 {
-	pq_sendbyte(out, 'U');		/* action UPDATE */
+	pq_sendbyte(out, LOGICAL_REP_MSG_UPDATE);		/* action UPDATE */
 
 	Assert(rel->rd_rel->relreplident == REPLICA_IDENTITY_DEFAULT ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
@@ -263,7 +263,7 @@ logicalrep_write_delete(StringInfo out, TransactionId xid, Relation rel,
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
 		   rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX);
 
-	pq_sendbyte(out, 'D');		/* action DELETE */
+	pq_sendbyte(out, LOGICAL_REP_MSG_DELETE);		/* action DELETE */
 
 	/* transaction ID (if not valid, we're not streaming) */
 	if (TransactionIdIsValid(xid))
@@ -317,7 +317,7 @@ logicalrep_write_truncate(StringInfo out,
 	int			i;
 	uint8		flags = 0;
 
-	pq_sendbyte(out, 'T');		/* action TRUNCATE */
+	pq_sendbyte(out, LOGICAL_REP_MSG_TRUNCATE);		/* action TRUNCATE */
 
 	/* transaction ID (if not valid, we're not streaming) */
 	if (TransactionIdIsValid(xid))
@@ -369,7 +369,7 @@ logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel)
 {
 	char	   *relname;
 
-	pq_sendbyte(out, 'R');		

Sending logical replication data only after synchronous replication happened

2020-10-16 Thread Andrey Borodin
Hi!

At PgCon 2020 we discussed some caveats of synchronous replication 
[0] related to data durability in HA Postgres installations.

But an important concern was also raised about streaming logical replication 
only after it has "actually happened" for the HA cluster.
Is anyone working on this? If not, I propose to discuss the design of this feature.

Why is it important? It's important for change data capture (CDC).
For physical replication we can apply changes forward (just replay WAL) and 
backward (with the help of pg_rewind).
But there is no clean way to undo logical replication.

Consider someone having a data publication from HA cluster A to another 
Postgres installation B. A consists of primary A1 and standby A2.

When a failover happens from A1 to A2, some part of A1's history can be committed 
locally on A1 and streamed to B via logical replication. After the failover to A2, B 
cannot continue CDC from A2, because B has already applied part of the history from A1 
which never existed for A2.

During the unconference session [0] a GUC was proposed, 
'post_synchronous_standby_names', listing standbys that can't get data until the 
transaction has been sent to the sync standbys.
This will do the trick, though I'm not sure it's the best possible interface for 
the feature.
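
For illustration only, a configuration on the primary could then look roughly 
like this (the setting name comes from the unconference discussion; the syntax 
is just a sketch, not an agreed design):

```
# postgresql.conf on A1 (sketch only)
synchronous_standby_names      = 'A2'   # physical sync standby
post_synchronous_standby_names = 'B'    # B receives WAL for logical
                                        # decoding only after A2 has
                                        # acknowledged the transaction
```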
Any ideas on the feature will be appreciated.

Thanks!


Best regards, Andrey Borodin.


[0] 
https://wiki.postgresql.org/wiki/PgCon_2020_Developer_Unconference/Edge_cases_of_synchronous_replication_in_HA_solutions



Re: Parallel Inserts in CREATE TABLE AS

2020-10-16 Thread Bharath Rupireddy
On Fri, Oct 16, 2020 at 11:33 AM Luc Vlaming  wrote:
>
> Really looking forward to this ending up in postgres as I think it's a
> very nice improvement.
>
> Whilst reviewing your patch I was wondering: is there a reason you did
> not introduce a batch insert in the destreceiver for the CTAS? For me
> this makes a huge difference in ingest speed as otherwise the inserts do
> not really scale so well as lock contention start to be a big problem.
> If you like I can make a patch to introduce this on top?
>

Thanks for your interest. You are right, we can get maximum
improvement if we have multi inserts in destreceiver for the CTAS on
the similar lines to COPY FROM command. I specified this point in my
first mail [1]. You may want to take a look at an already existing
patch [2] for multi inserts, I think there are some review comments to
be addressed in that patch. I would love to see the multi insert patch
getting revived.

[1] - 
https://www.postgresql.org/message-id/CALj2ACWFq6Z4_jd9RPByURB8-Y8wccQWzLf%2B0-Jg%2BKYT7ZO-Ug%40mail.gmail.com
[2] - 
https://www.postgresql.org/message-id/CAEET0ZG31mD5SWjTYsAt0JTLReOejPvusJorZ3kGZ1%3DN1AC-Fw%40mail.gmail.com

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: Parallel Inserts in CREATE TABLE AS

2020-10-16 Thread Luc Vlaming

On 14.10.20 11:16, Bharath Rupireddy wrote:

On Tue, Oct 6, 2020 at 10:58 AM Amit Kapila  wrote:



Yes we do a bunch of catalog changes related to the created new table.
We will have both the txn id and command id assigned when catalogue
changes are being made. But, right after the table is created in the
leader, the command id is incremented (CommandCounterIncrement() is
called from create_ctas_internal()) whereas the txn id remains the
same. The new command id is marked as GetCurrentCommandId(true); in
intorel_startup, then the parallel mode is entered. The txn id and
command id are serialized into parallel DSM, they are then available
to all parallel workers. This is discussed in [1].

Few changes I have to make in the parallel worker code: set
currentCommandIdUsed = true;, may be via a common API
SetCurrentCommandIdUsedForWorker() proposed in [1] and remove the
extra command id sharing from the leader to workers.

I will add a few comments in the upcoming patch related to the above info.



Yes, that would be good.



Added comments.




But how does that work for SELECT INTO? Are you prohibiting
that? ...



In case of SELECT INTO, a new table gets created and I'm not
prohibiting the parallel inserts and I think we don't need to.



So, in this case, also do we ensure that table is created before we
launch the workers. If so, I think you can explain in comments about
it and what you need to do that to ensure the same.



For SELECT INTO, the table gets created by the leader in
create_ctas_internal(), then ExecInitParallelPlan() gets called which
launches the workers and then the leader(if asked to do so) and the
workers insert the rows. So, we don't need to do any extra work to
ensure the table gets created before the workers start inserting
tuples.



While skimming through the patch, a small thing I noticed:
+ /*
+ * SELECT part of the CTAS is parallelizable, so we can make
+ * each parallel worker insert the tuples that are resulted
+ * in it's execution into the target table.
+ */
+ if (!is_matview &&
+ IsA(plan->planTree, Gather))
+ ((DR_intorel *) dest)->is_parallel = true;
+

I am not sure at this stage if this is the best way to make CTAS as
parallel but if so, then probably you can expand the comments a bit to
say why you consider only Gather node (and that too when it is the
top-most node) and why not another parallel node like GatherMerge?



If somebody expects to preserve the order of the tuples that are
coming from GatherMerge node of the select part in CTAS or SELECT INTO
while inserting, now if parallelism is allowed, that may not be the
case i.e. the order of insertion of tuples may vary. I'm not quite
sure, if someone wants to use order by in the select parts of CTAS or
SELECT INTO in a real world use case. Thoughts?



Right, for now, I think you can simply remove that check from the code
instead of just commenting it. We will see if there is a better
check/Assert we can add there.



Done.

I also worked on some of the open points I listed earlier in my mail.



3. Need to restrict parallel inserts, if CTAS tries to create temp/global 
tables as the workers will not have access to those tables.



Done.



Need to analyze whether to allow parallelism if CTAS has prepared statements or 
with no data.



For prepared statements, the parallelism will not be picked and so is
parallel insertion.
For CTAS with no data option case the select part is not even planned,
and so the parallelism will also not be picked.



4. Need to stop unnecessary parallel shared state such as tuple queue being 
created and shared to workers.



Done.

I'm listing the things that are still pending.

1. How to represent the parallel insert for CTAS in explain plans? The
explain CTAS shows the plan for only the SELECT part. How about having
some textual info along with the Gather node? I'm not quite sure on
this point, any suggestions are welcome.
2. Addition of new test cases. Testing with more scenarios and
different data sets, sizes, tablespaces, select into. Analysis on the
2 mismatches in write_parallel.sql regression test.

Attaching v2 patch, thoughts and comments are welcome.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com



Hi,

Really looking forward to this ending up in postgres as I think it's a 
very nice improvement.


Whilst reviewing your patch I was wondering: is there a reason you did 
not introduce a batch insert in the destreceiver for the CTAS? For me 
this makes a huge difference in ingest speed as otherwise the inserts do 
not really scale so well as lock contention starts to be a big problem. 
If you like I can make a patch to introduce this on top?


Kind regards,
Luc
Swarm64