Re: archive status ".ready" files may be created too early

2021-08-28 Thread Andres Freund
Hi,

On 2021-08-23 15:55:03 -0400, alvhe...@alvh.no-ip.org wrote:
> On 2021-Aug-23, Bossart, Nathan wrote:
> 
> > Ah, okay.  BTW the other changes you mentioned made sense to me.
> 
> Thanks.  I've pushed this now to all live branches.

While rebasing the aio patchset on top of HEAD I noticed that this commit added
another atomic operation to XLogWrite() with archiving enabled. The WAL write
path is really quite hot, and at least some of the
NotifySegmentsReadyForArchive() calls are done while WALWriteLock is held.

I think we should at least try to make the fast path, where no segment
boundaries were crossed, use no atomic operations.
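
To make that concrete, something along these lines (a hypothetical
sketch, not the actual xlog.c code):

    XLogSegNo   startSeg,
                endSeg;

    XLByteToSeg(StartPos, startSeg, wal_segment_size);
    XLByteToSeg(EndPos, endSeg, wal_segment_size);

    /* fast path: no segment boundary crossed -> no atomics needed */
    if (startSeg != endSeg)
        NotifySegmentsReadyForArchive(EndPos);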

Greetings,

Andres Freund




Re: Added schema level support for publication.

2021-08-28 Thread vignesh C
On Sat, Aug 28, 2021 at 3:19 PM Amit Kapila  wrote:
>
> On Fri, Aug 27, 2021 at 6:09 PM vignesh C  wrote:
> >
> > On Fri, Aug 27, 2021 at 4:57 PM Amit Kapila  wrote:
> > >
> > > On Fri, Aug 27, 2021 at 11:43 AM vignesh C  wrote:
> > > >
> > > > On Wed, Aug 25, 2021 at 3:07 PM vignesh C  wrote:
> > > >
> > > > I have implemented this in the 0003 patch, I have kept it separate to
> > > > reduce the testing effort and also it will be easier if someone
> > > > disagrees with the syntax. I will merge it to the main patch later
> > > > based on the feedback. Attached v22 patch has the changes for the
> > > > same.
> > > >
> > >
> > > Few comments on v22-0001-Added-schema-level-support-for-publication:
> > > 
> > > 1. Why in publication_add_schema(), you are registering invalidation
> > > for all schema relations? It seems this is to allow rebuilding the
> > > publication info for decoding sessions. But that is not required as
> > > you are registering rel_sync_cache_publication_cb for
> > > publication_schema relation. In rel_sync_cache_publication_cb, we are
> > > marking replicate_valid as false for each entry which will allow
> > > publication info to be rebuilt in get_rel_sync_entry.
> > >
> > > I see that it is done for a single relation in the current code in
> > > function publication_add_relation but I guess that is also not
> > > required. You can once test this. If you still think it is required,
> > > can you please explain me why and then we can accordingly add some
> > > comments in the patch.
> >
> > I felt this is required for handling the following concurrency scenario:
> > create schema sch1;
> > create table sch1.t1(c1 int);
> > insert into sch1.t1 values(10);
> > update sch1.t1 set c1 = 11;
> > # update will be successful and relation cache will update publication
> > actions based on the current state i.e no publication
> > create publication pub1 for all tables in schema sch1;
> > # After the above publication is created the relations present in this
> > schema should be invalidated so that the next update should fail.
> >
>
> What do you mean by update should fail? I think all the relations in
> RelationSyncCache will be invalidated via rel_sync_cache_publication_cb
> because you have updated pg_publication_schema and that should register
> a syscache invalidation.

By "update should fail", I meant updates done without setting a replica
identity, before the decoding context is created. The scenario is like
below (all in the same session, the subscription is not created):
create schema sch1;
create table sch1.t1(c1 int);
insert into sch1.t1 values(10);
# Before updating we will check CheckCmdReplicaIdentity; as there are
no publications on this table, rd_pubactions will be set accordingly in
the relcache entry.
update sch1.t1 set c1 = 11;
# Now we create the publication after rd_pubactions has been set in the
cache. At this point we should invalidate the relations present in the
schema, so that when the next update happens the publication actions are
checked again in CheckCmdReplicaIdentity and the update fails if replica
identity has not been set.
create publication pub1 for all tables in schema sch1;
# After the above publication is created the relations present in this
schema will be invalidated. Now we will check the publication actions
again in CheckCmdReplicaIdentity and the update will fail.
update sch1.t1 set c1 = 11;
ERROR:  cannot update table "t1" because it does not have a replica
identity and publishes updates
HINT:  To enable updating the table, set REPLICA IDENTITY using ALTER TABLE.

The rel_sync_cache_publication_cb invalidations are registered once
the decoding context is created. Since the decoding context is not
created at this point of time we should explicitly invalidate all the
relations present in this schema by calling InvalidatePublicationRels.
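
Roughly, the explicit invalidation looks like this (the helper fetching
the schema's relations is named hypothetically here; only
InvalidatePublicationRels is from the patch):

    /* sketch: invalidate all relations belonging to the published schema */
    List       *schemaRels;

    schemaRels = GetSchemaPublicationRelations(schemaoid);  /* hypothetical */
    InvalidatePublicationRels(schemaRels);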

> > If
> > the relations are not invalidated the updates will be successful based
> > on previous publication actions.
> > update sch1.t1 set c1 = 11;
> >
>
> I think even without special relcache invalidations the relations
> should be invalidated because of syscache invalidation as mentioned in
> the previous point. Am I missing something here?

The rel_sync_cache_publication_cb invalidations are registered once
the decoding context is created. Since the decoding context is not
created at this point of time we should explicitly invalidate all the
relations present in this schema by calling InvalidatePublicationRels.
This is to prevent updates for which the replica identity is not set,
I have mentioned more details above.

> >
> > > 5. Don't we need to expose a view for publication schemas similar to
> > > pg_publication_tables?
> >
> > pg_publication_tables is a common view for both "FOR TABLE", "FOR ALL
> > TABLES" and "FOR ALL TABLES IN SCHEMA", this view will internally
> > access pg_publication_rel and pg_publication_schema to get the
> > corresponding tables. I felt we don't need a separate view for
> > publication schemas. Thoughts?

Re: Autovacuum on partitioned table (autoanalyze)

2021-08-28 Thread Álvaro Herrera
On 2021-Aug-17, Justin Pryzby wrote:

> I suggest the attached (which partially reverts the revert), to allow showing
> correct data for analyze_count and last_analyzed.

Yeah, that makes sense and my keeping of the pg_stat_all_tables entries
seems pretty useless without this change.  I have pushed a slightly
modified version of this to 14 and master.

> Arguably these should be reported as null in v14 for partitioned tables, since
> they're not "known to be zero", but rather "currently unpopulated".
> 
> n_mod_since_analyze | 0
> n_ins_since_vacuum  | 0

I don't disagree, but it's not easy to implement this at present.  I
think almost all counters should be nulls for partitioned tables.  For
some of them one could make a case that it'd be more convenient to
propagate numbers up from partitions.

-- 
Álvaro Herrera  Valdivia, Chile  —  https://www.EnterpriseDB.com/




Re: pg_receivewal: remove extra conn = NULL; in StreamLog

2021-08-28 Thread Daniel Gustafsson
> On 28 Aug 2021, at 14:10, Bharath Rupireddy wrote:

> It seems there's a redundant assignment statement conn = NULL in
> pg_receivewal's StreamLog function. Attaching a tiny patch herewith.
> Thoughts?

Agreed, while harmless this is superfluous since conn is already set to NULL
after the PQfinish call a few lines up (which was added in a4205fa00d526c3).
Unless there are objections I’ll apply this tomorrow or Monday.
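
For context, the shape of the code in StreamLog() is roughly the
following (paraphrased here, not the literal source):

    PQfinish(conn);
    conn = NULL;        /* already cleared here (added in a4205fa00d526c3) */
    ...
    conn = NULL;        /* the superfluous assignment being removed */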

--
Daniel Gustafsson   https://vmware.com/





Re: extended stats objects are the only thing written like "%s"."%s"

2021-08-28 Thread Tom Lane
Alvaro Herrera  writes:
> I think using "%s.%s" as is done everywhere else is pretty much
> pointless.  It's not usable as an object identifier, since you have to
> make sure to remove the existing quotes, and unless the names work
> without quotes, you have to add different quotes.  So it looks «nice»
> but it's functionally more work.

I think what we are doing there is following the message style
guideline that says to put double quotes around inserted strings.
In this case schema.object (as a whole) is the inserted string.
People often confuse this with SQL double-quoted identifiers, but it
has nothing whatsoever to do with SQL's rules.  (It's easier to make
sense of this rule in translations where the quote marks are not
ASCII double-quotes ... like your example with «nice».)

In short: Justin is right, this should not be done this way.

regards, tom lane




Re: extended stats objects are the only thing written like "%s"."%s"

2021-08-28 Thread Alvaro Herrera
On 2021-Aug-28, Justin Pryzby wrote:

> Note that check constraints and indexes have the same schema as their
> table, so \d doesn't show a schema at all, and quotes the name of the
> object.  That distinction may be relevant to how stats objects ended up
> being quoted like this.

Yeah, this was the rationale for including the schema name here.

I think using "%s.%s" as is done everywhere else is pretty much
pointless.  It's not usable as an object identifier, since you have to
make sure to remove the existing quotes, and unless the names work
without quotes, you have to add different quotes.  So it looks «nice»
but it's functionally more work.

-- 
Álvaro Herrera   39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/
"But static content is just dynamic content that isn't moving!"
http://smylers.hates-software.com/2007/08/15/fe244d0c.html




extended stats objects are the only thing written like "%s"."%s"

2021-08-28 Thread Justin Pryzby
commit a4d75c86bf15220df22de0a92c819ecef9db3849
Author: Tomas Vondra 
Date:   Fri Mar 26 23:22:01 2021 +0100

Extended statistics on expressions

This commit added to psql/describe.c:

+   /* statistics object name (qualified with namespace) */
+   appendPQExpBuffer(&buf, "\"%s\".\"%s\"",
+                     PQgetvalue(result, i, 2),
+                     PQgetvalue(result, i, 3));

Everywhere else the double quotes are around the whole "schema.object" rather
than both separately: "schema"."object".  The code handling servers before v14
has the same thing, since:

commit bc085205c8a425fcaa54e27c6dcd83101130439b
Author: Alvaro Herrera 
Date:   Fri May 12 14:59:23 2017 -0300

Change CREATE STATISTICS syntax

src/bin/psql/describe.c-        /* statistics object name (qualified with namespace) */
src/bin/psql/describe.c:        appendPQExpBuffer(&buf, "\"%s\".\"%s\" (",
src/bin/psql/describe.c-                          PQgetvalue(result, i, 2),
src/bin/psql/describe.c-                          PQgetvalue(result, i, 3));

That seems to have been first added in the patch here, but AFAICT not
specifically discussed.
https://www.postgresql.org/message-id/20170511221330.5akgbsoyx6wm4u34%40alvherre.pgsql

At the time the patch was commited, it was the only place that used
"schema"."object":
$ git show bc085205c8a425fcaa54e27c6dcd83101130439b:src/bin/psql/describe.c |grep '\\"\.\\"'
                appendPQExpBuffer(&buf, "\"%s\".\"%s\" (",

And it's still the only place, not just in describe.c, but the entire project.
$ git grep -Fc '\"%s\".\"%s\"' '*.c'
src/bin/psql/describe.c:2

I actually don't like writing it as "a.b" since it doesn't work to copy+paste
that, because that means an object called "a.b" in the default schema.
But I think for consistency it should be done the same here as everywhere else.

I noticed that Peter E recently changed amcheck in the direction of consistency:
| 4279e5bc8c pg_amcheck: Message style and structuring improvements

I propose to change extended stats objects to be shown the same as everywhere
else, with double quotes around the whole %s.%s:
$ git grep '\\"%s\.%s\\"' '*.c'  |wc -l
126

This affects 9 lines of output in regression tests.
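
Concretely, the describe.c hunk would become something like this (a
sketch of the proposal, not a committed change):

    /* statistics object name (qualified with namespace) */
    appendPQExpBuffer(&buf, "\"%s.%s\"",
                      PQgetvalue(result, i, 2),
                      PQgetvalue(result, i, 3));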

Note that check constraints and indexes have the same schema as their table, so
\d doesn't show a schema at all, and quotes the name of the object.  That
distinction may be relevant to how stats objects ended up being quoted like
this.

-- 
Justin




Re: Use extended statistics to estimate (Var op Var) clauses

2021-08-28 Thread Mark Dilger



> On Aug 28, 2021, at 10:18 AM, Zhihong Yu  wrote:
> 
> I wonder if the queries with (a=a) or (a<a) are worth the complexity to
> address.  Has anyone seen such clauses in production queries?

You might expect clauses like WHERE FALSE to be unusual, but that phrase gets 
added quite a lot by query generators.  Somebody could add "WHERE a != a" in a 
misguided attempt to achieve the same thing.

I wouldn't be terribly surprised if query generators might generate queries 
with a variable number of tables joined together with comparisons between the 
joined tables, and in the degenerate case of only one table end up with a table 
column compared against itself.

You could argue that those people need to fix their queries/generators to not 
do this sort of thing, but the end user affected by such queries may have 
little ability to fix it.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: Use extended statistics to estimate (Var op Var) clauses

2021-08-28 Thread Zhihong Yu
On Sat, Aug 28, 2021 at 9:30 AM Mark Dilger 
wrote:

>
>
> > On Aug 28, 2021, at 6:52 AM, Tomas Vondra 
> wrote:
> >
> > Part 0003 fixes handling of those clauses so that we don't treat them as
> simple, but it does that by tweaking statext_is_compatible_clause(), as
> suggested by Dean.
>
> Function examine_opclause_args() doesn't set issimple to anything in the
> IsA(rightop, Const) case, but assigns *issimplep = issimple at the bottom.
> The compiler is not complaining about using a possibly uninitialized
> variable, but if I change the "return true" on the very next line to
> "return issimple", the compiler complains quite loudly.
>
>
> Some functions define bool *issimple, others bool *issimplep and bool
> issimple.  You might want to standardize the naming.
>
> It's difficult to know what "simple" means in extended_stats.c.  There is
> no file-global comment explaining the concept, and functions like
> compare_scalars_simple don't have correlates named compare_scalars_complex
> or such, so the reader cannot infer by comparison what the difference might
> be between a "simple" case and some non-"simple" case.  The functions'
> issimple (or issimplep) argument are undocumented.
>
> There is a comment:
>
> /*
>  * statext_mcv_clauselist_selectivity
>  *  Estimate clauses using the best multi-column statistics.
>   
>  *
>  * - simple selectivity:  Computed without extended statistics, i.e. as if
> the
>  * columns/clauses were independent.
>  *
> 
>  */
>
> but it takes a while to find if you search for "issimple".
>
>
> In both scalarineqsel_wrapper() and eqsel_internal(), the call to
> matching_restriction_variables() should usually return false, since
> comparing a variable to itself is an unusual case.  The next call is to
> get_restriction_variable(), which repeats the work of examining the left
> and right variables.  So in almost all cases, after throwing away the
> results of:
>
> examine_variable(root, left, varRelid, &ldata);
> examine_variable(root, right, varRelid, &rdata);
>
> performed in matching_restriction_variables(), we'll do exactly the same
> work again (with one variable named differently) in
> get_restriction_variable():
>
> examine_variable(root, left, varRelid, vardata);
> examine_variable(root, right, varRelid, &rdata);
>
> That'd be fine if examine_variable() were a cheap function, but it appears
> not to be.  Do you think you could save the results rather than recomputing
> them?  It's a little messy, since these are the only two functions out of
> about ten which follow this pattern, so you'd have to pass NULLs into
> get_restriction_variable() from the other eight callers, but it still looks
> like that would be a win.
>
> —
> Mark Dilger
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>
>
Hi,
I wonder if the queries with (a=a) or (a<a) are worth the complexity to
address.  Has anyone seen such clauses in production queries?

Re: Use extended statistics to estimate (Var op Var) clauses

2021-08-28 Thread Mark Dilger



> On Aug 28, 2021, at 6:52 AM, Tomas Vondra  
> wrote:
> 
> Part 0003 fixes handling of those clauses so that we don't treat them as 
> simple, but it does that by tweaking statext_is_compatible_clause(), as 
> suggested by Dean.

Function examine_opclause_args() doesn't set issimple to anything in the 
IsA(rightop, Const) case, but assigns *issimplep = issimple at the bottom.  The 
compiler is not complaining about using a possibly uninitialized variable, but 
if I change the "return true" on the very next line to "return issimple", the 
compiler complains quite loudly.


Some functions define bool *issimple, others bool *issimplep and bool issimple. 
 You might want to standardize the naming.

It's difficult to know what "simple" means in extended_stats.c.  There is no 
file-global comment explaining the concept, and functions like 
compare_scalars_simple don't have correlates named compare_scalars_complex or 
such, so the reader cannot infer by comparison what the difference might be 
between a "simple" case and some non-"simple" case.  The functions' issimple 
(or issimplep) argument are undocumented.

There is a comment:

/*
 * statext_mcv_clauselist_selectivity
 *  Estimate clauses using the best multi-column statistics.
  
 *
 * - simple selectivity:  Computed without extended statistics, i.e. as if the
 * columns/clauses were independent.
 *

 */

but it takes a while to find if you search for "issimple".


In both scalarineqsel_wrapper() and eqsel_internal(), the call to 
matching_restriction_variables() should usually return false, since comparing a 
variable to itself is an unusual case.  The next call is to 
get_restriction_variable(), which repeats the work of examining the left and 
right variables.  So in almost all cases, after throwing away the results of:

examine_variable(root, left, varRelid, &ldata);
examine_variable(root, right, varRelid, &rdata);

performed in matching_restriction_variables(), we'll do exactly the same work 
again (with one variable named differently) in get_restriction_variable():

examine_variable(root, left, varRelid, vardata);
examine_variable(root, right, varRelid, &rdata);

That'd be fine if examine_variable() were a cheap function, but it appears not 
to be.  Do you think you could save the results rather than recomputing them?  
It's a little messy, since these are the only two functions out of about ten 
which follow this pattern, so you'd have to pass NULLs into 
get_restriction_variable() from the other eight callers, but it still looks 
like that would be a win. 
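
For illustration, one hypothetical shape of that refactoring (all
parameter names here are invented): matching_restriction_variables()
hands the already-examined stats back to the caller, which passes them
on so get_restriction_variable() can skip re-examining:

    static bool matching_restriction_variables(PlannerInfo *root, List *args,
                                               int varRelid,
                                               VariableStatData *ldata_out,
                                               VariableStatData *rdata_out);

    bool get_restriction_variable(PlannerInfo *root, List *args, int varRelid,
                                  VariableStatData *vardata, Node **other,
                                  bool *varonleft,
                                  VariableStatData *cached_ldata,   /* or NULL */
                                  VariableStatData *cached_rdata);  /* or NULL */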

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: Use extended statistics to estimate (Var op Var) clauses

2021-08-28 Thread Zhihong Yu
On Sat, Aug 28, 2021 at 6:53 AM Tomas Vondra 
wrote:

> Hi,
>
> The attached patch series is modified to improve estimates for these
> special clauses (Var op Var with the same var on both sides) without
> extended statistics. This is done in 0001, and it seems fairly simple
> and cheap.
>
> The 0002 part is still the same patch as on 2021/07/20. Part 0003 fixes
> handling of those clauses so that we don't treat them as simple, but it
> does that by tweaking statext_is_compatible_clause(), as suggested by
> Dean. It does work, although it's a bit more invasive than simply
> checking the shape of clause in statext_mcv_clauselist_selectivity.
>
> I do have results for the randomly generated queries, and this does
> improve the situation a lot - pretty much all the queries with (a=a) or
> (a<a) clauses are now estimated much more accurately.
> That being said, I'm still not sure if this is an issue in real-world
> applications, or whether we're solving something because of synthetic
> queries generated by the randomized generator. But the checks seem
> fairly cheap, so maybe it doesn't matter too much.
>
>
> regards
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>
Hi,
For 0001-Improve-estimates-for-Var-op-Var-with-the-s-20210828.patch :

+ * form (variable op variable) with the save variable on both sides.

typo: save -> same

Cheers


Re: Use extended statistics to estimate (Var op Var) clauses

2021-08-28 Thread Tomas Vondra

Hi,

The attached patch series is modified to improve estimates for these 
special clauses (Var op Var with the same var on both sides) without 
extended statistics. This is done in 0001, and it seems fairly simple 
and cheap.


The 0002 part is still the same patch as on 2021/07/20. Part 0003 fixes 
handling of those clauses so that we don't treat them as simple, but it 
does that by tweaking statext_is_compatible_clause(), as suggested by 
Dean. It does work, although it's a bit more invasive than simply 
checking the shape of clause in statext_mcv_clauselist_selectivity.


I do have results for the randomly generated queries, and this does 
improve the situation a lot - pretty much all the queries with (a=a) or 
(a<a) clauses are now estimated much more accurately.

That being said, I'm still not sure if this is an issue in real-world 
applications, or whether we're solving something because of synthetic 
queries generated by the randomized generator. But the checks seem 
fairly cheap, so maybe it doesn't matter too much.



regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
From 708587e495483aff4b84487103a545b6e860d0c0 Mon Sep 17 00:00:00 2001
From: Tomas Vondra 
Date: Thu, 26 Aug 2021 23:01:04 +0200
Subject: [PATCH 1/3] Improve estimates for Var op Var with the same Var

When estimating (Var op Var) conditions, we can treat the case with the
same Var on both sides as a special case, and we can provide better
selectivity estimate than for the generic case.

For example for (a = a) we know it's 1.0, because all rows are expected
to match. Similarly for (a != a), which has selectivity 0.0. And the
same logic can be applied to inequality comparisons, like (a < a) etc.

In principle, those clauses are a bit strange and queries are unlikely
to use them. But query generators sometimes do silly things, and these
checks are quite cheap so it's likely a win.
---
 src/backend/utils/adt/selfuncs.c | 75 +++-
 1 file changed, 74 insertions(+), 1 deletion(-)

diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 0c8c05f6c2..0baeaa040d 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -210,7 +210,8 @@ static bool get_actual_variable_endpoint(Relation heapRel,
 		 MemoryContext outercontext,
 		 Datum *endpointDatum);
 static RelOptInfo *find_join_input_rel(PlannerInfo *root, Relids relids);
-
+static bool matching_restriction_variables(PlannerInfo *root, List *args,
+		   int varRelid);
 
 /*
  *		eqsel			- Selectivity of "=" for any data types.
@@ -256,6 +257,14 @@ eqsel_internal(PG_FUNCTION_ARGS, bool negate)
 		}
 	}
 
+	/*
+	 * If it's (variable = variable) with the same variable on both sides, it's
+	 * a special case and we know it does not filter anything, which means the
+	 * selectivity is 1.0.
+	 */
+	if (matching_restriction_variables(root, args, varRelid))
+		return (negate) ? 0.0 : 1.0;
+
 	/*
 	 * If expression is not variable = something or something = variable, then
 	 * punt and return a default estimate.
@@ -1408,6 +1417,20 @@ scalarineqsel_wrapper(PG_FUNCTION_ARGS, bool isgt, bool iseq)
 	Oid			consttype;
 	double		selec;
 
+	/*
+	 * Handle (variable < variable) and (variable <= variable) with the same
+	 * variable on both sides as a special case. The strict inequality should
+	 * not match anything, hence selectivity is 0.0. The other case is about
+	 * the same as equality, so selectivity is 1.0.
+	 */
+	if (matching_restriction_variables(root, args, varRelid))
+	{
+		if (iseq)
+			PG_RETURN_FLOAT8(1.0);
+		else
+			PG_RETURN_FLOAT8(0.0);
+	}
+
 	/*
 	 * If expression is not variable op something or something op variable,
 	 * then punt and return a default estimate.
@@ -4871,6 +4894,56 @@ get_restriction_variable(PlannerInfo *root, List *args, int varRelid,
 	return false;
 }
 
+
+/*
+ * matching_restriction_variables
+ *		Examine the args of a restriction clause to see if it's of the
+ *		form (variable op variable) with the save variable on both sides.
+ *
+ * Inputs:
+ *	root: the planner info
+ *	args: clause argument list
+ *	varRelid: see specs for restriction selectivity functions
+ *
+ * Returns true if the same variable is on both sides, otherwise false.
+ */
+static bool
+matching_restriction_variables(PlannerInfo *root, List *args, int varRelid)
+{
+	Node	   *left,
+			   *right;
+	VariableStatData ldata;
+	VariableStatData rdata;
+	bool		res = false;
+
+	/* Fail if not a binary opclause (probably shouldn't happen) */
+	if (list_length(args) != 2)
+		return false;
+
+	left = (Node *) linitial(args);
+	right = (Node *) lsecond(args);
+
+	/*
+	 * Examine both sides.  Note that when varRelid is nonzero, Vars of other
+	 * relations will be treated as pseudoconstants.
+	 */
+	examine_variable(root, left, varRelid, &ldata);
+	examine_variable(root, right, varRelid, &rdata);
+
+	/*
+	 * If both sides are variable, and are equal, we win.
+	 */
+	if ((ldata.rel != NULL && rdata.rel 

Re: \dP and \dX use ::regclass without "pg_catalog."

2021-08-28 Thread Álvaro Herrera
On 2021-Aug-27, Justin Pryzby wrote:

> I noticed that for \dP+ since 1c5d9270e, regclass is written without
> "pg_catalog." (Alvaro and I failed to notice it in 421a2c483, too).

Oops, will fix shortly.



-- 
Álvaro Herrera   39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/




pg_receivewal: remove extra conn = NULL; in StreamLog

2021-08-28 Thread Bharath Rupireddy
Hi,

It seems there's a redundant assignment statement conn = NULL in
pg_receivewal's StreamLog function. Attaching a tiny patch herewith.
Thoughts?

Regards,
Bharath Rupireddy.


v1-0001-pg_receivewal-remove-extra-conn-NULL-in-StreamLog.patch
Description: Binary data


Re: pg_receivewal starting position

2021-08-28 Thread Bharath Rupireddy
On Thu, Aug 26, 2021 at 5:45 PM Ronan Dunklau  wrote:
> order to fail early if the replication slot doesn't exist, so please find
> attached v2 for that.

Thanks for the patches. Here are some comments:

1) While the intent of these patches looks good, I have the following
concern with the new replication command READ_REPLICATION_SLOT: what if
pg_receivewal exits (because the user issued a SIGINT or for some other
reason) after flushing the received WAL to disk, but before it sends
feedback (sendFeedback) to the server's walsender, so the walsender
doesn't get a chance to update the restart_lsn in the replication slot
via PhysicalConfirmReceivedLocation? If pg_receivewal is started
again, isn't it going to get the previous restart_lsn and receive the
last chunk of flushed WAL again?

2) What is the significance of READ_REPLICATION_SLOT for logical
replication slots? I read above that somebody suggested restricting
the walsender to handling READ_REPLICATION_SLOT only for physical
replication slots, so that other callers see a command failure. But I
tend to think it is cleaner to have this command common to both physical
and logical replication slots, and the callers can have an Assert(type
== 'physical').

3) Isn't it useful to send the active and active_pid info of the
replication slot via READ_REPLICATION_SLOT? pg_receivewal could use
Assert(active == true && active_pid == getpid()) as an assertion to
ensure that it is the sole owner of the replication slot. Also, would
it be good to send the wal_status info as well?

4) I think the below messages should start with a lowercase letter, and
there are also some typos:
+ pg_log_warning("Could not fetch the replication_slot \"%s\" information "
+ pg_log_warning("Server does not suport fetching the slot's position, "
something like:
+ pg_log_warning("could not fetch replication slot \"%s\" information, "
+"resuming from current server position instead", replication_slot);
+ pg_log_warning("server does not support fetching replication slot
information, "
+"resuming from current server position instead");

5) How about emitting the above messages in case of "verbose"?

6) How about an assertion like below?
+ if (stream.startpos == InvalidXLogRecPtr)
+ {
+ stream.startpos = serverpos;
+ stream.timeline = servertli;
+ }
+
+   Assert(stream.startpos != InvalidXLogRecPtr);

7) How about we let pg_receivewal use READ_REPLICATION_SLOT as an option?

8) Just an idea: how about we store pg_receivewal's lastFlushPosition
in a file before pg_receivewal exits and compare it with the
restart_lsn that it receives from the replication slot? If
lastFlushPosition == received_restart_lsn, well and good; if not, then
something would have happened and we always start at
lastFlushPosition.
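
A minimal sketch of what (8) could look like (the file name and the
helper are invented here, purely illustrative):

    /* hypothetical helper, not part of the posted patches */
    static void
    save_last_flush_position(XLogRecPtr lastFlushPosition)
    {
        FILE       *fp = fopen("pg_receivewal.lastflush", "w");

        if (fp != NULL)
        {
            /* standard LSN text format, e.g. "0/3000148" */
            fprintf(fp, "%X/%X",
                    (uint32) (lastFlushPosition >> 32),
                    (uint32) lastFlushPosition);
            fclose(fp);
        }
    }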

Regards,
Bharath Rupireddy.




Re: Gather performance analysis

2021-08-28 Thread Zhihong Yu
On Sat, Aug 28, 2021 at 4:29 AM Zhihong Yu  wrote:

>
>
> On Sat, Aug 28, 2021 at 12:11 AM Dilip Kumar 
> wrote:
>
>> On Tue, Aug 24, 2021 at 8:48 PM Robert Haas 
>> wrote:
>>
>>> On Fri, Aug 6, 2021 at 4:31 AM Dilip Kumar 
>>> wrote:
>>> > Results: (query EXPLAIN ANALYZE SELECT * FROM t;)
>>> > 1) Non-parallel (default)
>>> >  Execution Time: 31627.492 ms
>>> >
>>> > 2) Parallel with 4 workers (force by setting parallel_tuple_cost to 0)
>>> >  Execution Time: 37498.672 ms
>>> >
>>> > 3) Same as above (2) but with the patch.
>>> > Execution Time: 23649.287 ms
>>>
>>> This strikes me as an amazingly good result. I guess before seeing
>>> these results, I would have said that you can't reasonably expect
>>> parallel query to win on a query like this because there isn't enough
>>> for the workers to do. It's not like they are spending time evaluating
>>> filter conditions or anything like that - they're just fetching tuples
>>> off of disk pages and sticking them into a queue. And it's unclear to
>>> me why it should be better to have a bunch of processes doing that
>>> instead of just one. I would have thought, looking at just (1) and
>>> (2), that parallelism gained nothing and communication overhead lost 6
>>> seconds.
>>>
>>> But what this suggests is that parallelism gained at least 8 seconds,
>>> and communication overhead lost at least 14 seconds. In fact...
>>>
>>
>> Right, good observation.
>>
>>
>>> > - If I apply both Experiment#1 and Experiment#2 patches together then,
>>> > we can further reduce the execution time to 20963.539 ms (with 4
>>> > workers and 4MB tuple queue size)
>>>
>>> ...this suggests that parallelism actually gained at least 10-11
>>> seconds, and the communication overhead lost at least 15-16 seconds.
>>>
>>
>> Yes
>>
>>
>>> If that's accurate, it's pretty crazy. We might need to drastically
>>> reduce the value of parallel_tuple_cost if these results hold up and
>>> this patch gets committed.
>>>
>>
>> In one of my experiments[Test1] I have noticed that even on the head the
>> force parallel plan is significantly faster compared to the non-parallel
>> plan, but with patch it is even better.  The point is now also there might
>> be some cases where the force parallel plans are faster but we are not sure
>> whether we can reduce the parallel_tuple_cost or not.  But with the patch
>> it is definitely sure that the parallel tuple queue is faster compared to
>> what we have now, So I agree we should consider reducing the
>> parallel_tuple_cost after this patch.
>>
>> Additionally, I've done some more experiments with artificial workloads,
>> as well as workloads where the parallel plan is selected by default, and in
>> all cases I've seen a significant improvement.  The gain is directly
>> proportional to the load on the tuple queue, as expected.
>>
>> Test1: (Worker returns all tuples but only a few tuples return to the
>> client)
>> 
>> INSERT INTO t SELECT i%10, repeat('a', 200) from
>> generate_series(1,2) as i;
>> set max_parallel_workers_per_gather=4;
>>
>> Target Query: SELECT random() FROM t GROUP BY a;
>>
>> Non-parallel (default plan): 77170.421 ms
>> Parallel (parallel_tuple_cost=0):  53794.324 ms
>> Parallel with patch (parallel_tuple_cost=0): 42567.850 ms
>>
>> 20% gain compared to force parallel, 45% gain compared to default plan.
>>
>> Test2: (Parallel case with default parallel_tuple_cost)
>> --
>> INSERT INTO t SELECT i, repeat('a', 200) from
>> generate_series(1,2) as i;
>>
>> set max_parallel_workers_per_gather=4;
>> SELECT * from t WHERE a < 1750;
>> Parallel(default plan): 23730.054 ms
>> Parallel with patch (default plan): 21614.251 ms
>>
>> 8 to 10 % gain compared to the default parallel plan.
>>
>> I have done cleanup in the patch and I will add this to the September
>> commitfest.
>>
>> I am planning to do further testing for identifying the optimal batch
>> size in different workloads.  With the above workload I am seeing similar
>> results with batch size 4k to 16k (1/4 of the ring size), so in the attached
>> patch I have kept it as 1/4 of the ring size.  We might change that based on
>> more analysis and testing.
>>
>> --
>> Regards,
>> Dilip Kumar
>> EnterpriseDB: http://www.enterprisedb.com
>>
> Hi,
> Some minor comments.
> For shm_mq.c, existing comment says:
>
>  * mqh_partial_bytes, mqh_expected_bytes, and mqh_length_word_complete
>
> +   Size            mqh_send_pending;
>     bool            mqh_length_word_complete;
>     bool            mqh_counterparty_attached;
>
> I wonder if mqh_send_pending should be declared
> after mqh_length_word_complete - this way, the order of fields matches the
> order of explanation for the fields.
>
> +   if (mqh->mqh_send_pending > mq->mq_ring_size / 4 || force_flush)
>
> The above can be written as:
>
> +   if (force_flush || mqh->mqh_send_pending > (mq->mq_ring_size >> 2))
>
> so that when force_flush is true, the other condition is not evaluated.

Re: Gather performance analysis

2021-08-28 Thread Zhihong Yu
On Sat, Aug 28, 2021 at 12:11 AM Dilip Kumar  wrote:

> On Tue, Aug 24, 2021 at 8:48 PM Robert Haas  wrote:
>
>> On Fri, Aug 6, 2021 at 4:31 AM Dilip Kumar  wrote:
>> > Results: (query EXPLAIN ANALYZE SELECT * FROM t;)
>> > 1) Non-parallel (default)
>> >  Execution Time: 31627.492 ms
>> >
>> > 2) Parallel with 4 workers (force by setting parallel_tuple_cost to 0)
>> >  Execution Time: 37498.672 ms
>> >
>> > 3) Same as above (2) but with the patch.
>> > Execution Time: 23649.287 ms
>>
>> This strikes me as an amazingly good result. I guess before seeing
>> these results, I would have said that you can't reasonably expect
>> parallel query to win on a query like this because there isn't enough
>> for the workers to do. It's not like they are spending time evaluating
>> filter conditions or anything like that - they're just fetching tuples
>> off of disk pages and sticking them into a queue. And it's unclear to
>> me why it should be better to have a bunch of processes doing that
>> instead of just one. I would have thought, looking at just (1) and
>> (2), that parallelism gained nothing and communication overhead lost 6
>> seconds.
>>
>> But what this suggests is that parallelism gained at least 8 seconds,
>> and communication overhead lost at least 14 seconds. In fact...
>>
>
> Right, good observation.
>
>
>> > - If I apply both Experiment#1 and Experiment#2 patches together then,
>> > we can further reduce the execution time to 20963.539 ms (with 4
>> > workers and 4MB tuple queue size)
>>
>> ...this suggests that parallelism actually gained at least 10-11
>> seconds, and the communication overhead lost at least 15-16 seconds.
>>
>
> Yes
>
>
>> If that's accurate, it's pretty crazy. We might need to drastically
>> reduce the value of parallel_tuple_cost if these results hold up and
>> this patch gets committed.
>>
>
> In one of my experiments[Test1] I have noticed that even on the head the
> force parallel plan is significantly faster compared to the non-parallel
> plan, but with patch it is even better.  The point is now also there might
> be some cases where the force parallel plans are faster but we are not sure
> whether we can reduce the parallel_tuple_cost or not.  But with the patch
> it is definitely sure that the parallel tuple queue is faster compared to
> what we have now, So I agree we should consider reducing the
> parallel_tuple_cost after this patch.
>
> Additionally, I've done some more experiments with artificial workloads,
> as well as workloads where the parallel plan is selected by default, and in
> all cases I've seen a significant improvement.  The gain is directly
> proportional to the load on the tuple queue, as expected.
>
> Test1: (Worker returns all tuples but only a few tuples return to the
> client)
> 
> INSERT INTO t SELECT i%10, repeat('a', 200) from
> generate_series(1,2) as i;
> set max_parallel_workers_per_gather=4;
>
> Target Query: SELECT random() FROM t GROUP BY a;
>
> Non-parallel (default plan): 77170.421 ms
> Parallel (parallel_tuple_cost=0):  53794.324 ms
> Parallel with patch (parallel_tuple_cost=0): 42567.850 ms
>
> 20% gain compared to force parallel, 45% gain compared to default plan.
>
> Test2: (Parallel case with default parallel_tuple_cost)
> --
> INSERT INTO t SELECT i, repeat('a', 200) from generate_series(1,2)
> as i;
>
> set max_parallel_workers_per_gather=4;
> SELECT * from t WHERE a < 1750;
> Parallel(default plan): 23730.054 ms
> Parallel with patch (default plan): 21614.251 ms
>
> 8 to 10 % gain compared to the default parallel plan.
>
> I have done cleanup in the patch and I will add this to the September
> commitfest.
>
> I am planning to do further testing for identifying the optimal batch size
> in different workloads.  With the above workload I am seeing similar results
> with batch size 4k to 16k (1/4 of the ring size), so in the attached patch I
> have kept it as 1/4 of the ring size.  We might change that based on more
> analysis and testing.
>
> --
> Regards,
> Dilip Kumar
> EnterpriseDB: http://www.enterprisedb.com
>
Hi,
Some minor comments.
For shm_mq.c, existing comment says:

 * mqh_partial_bytes, mqh_expected_bytes, and mqh_length_word_complete

+   Size            mqh_send_pending;
    bool            mqh_length_word_complete;
    bool            mqh_counterparty_attached;

I wonder if mqh_send_pending should be declared
after mqh_length_word_complete - this way, the order of fields matches the
order of explanation for the fields.

+   if (mqh->mqh_send_pending > mq->mq_ring_size / 4 || force_flush)

The above can be written as:

+   if (force_flush || mqh->mqh_send_pending > (mq->mq_ring_size >> 2))

so that when force_flush is true, the other condition is not evaluated.

Cheers


Re: Next Steps with Hash Indexes

2021-08-28 Thread Amit Kapila
On Fri, Aug 27, 2021 at 4:27 PM Sadhuprasad Patro  wrote:
>
> IMHO, as discussed above, since other databases also have the
> limitation that if you create a multi-column hash index then the hash
> index can not be used until all the key columns are used in the search
> condition. So my point is that users might be using the hash index
> with this limitation and their use case might be that they want to
> gain the best performance when they use this particular case and they
> might not be looking for much flexibility like we provide in BTREE.
>
> For reference:
> https://dev.mysql.com/doc/refman/8.0/en/index-btree-hash.html
> https://docs.microsoft.com/en-us/sql/relational-databases/in-memory-oltp/indexes-for-memory-optimized-tables?view=sql-server-ver15
>
> We already know that performance will be better with a single hash for
> multiple columns, but still I just wanted to check the performance
> difference in PG. This might help us to decide the approach we need to
> go for. With a quick POC of both the ideas, I have observed there is a
> major performance advantage with single combined hash for multi-key
> columns.
>
> Performance Test Details: (Used PGBENCH Tool)
> Initialize cmd: "./pgbench -i -s 100 -d postgres"
>
> postgres=# \d+ pgbench_accounts
>
>                                Table "public.pgbench_accounts"
>   Column  |     Type      | Collation | Nullable | Default | Storage  | Compression | Stats target | Description
> ----------+---------------+-----------+----------+---------+----------+-------------+--------------+-------------
>  aid      | integer       |           | not null |         | plain    |             |              |
>  bid      | integer       |           |          |         | plain    |             |              |
>  abalance | integer       |           |          |         | plain    |             |              |
>  filler   | character(84) |           |          |         | extended |             |              |
>
> Indexes:
>     "pgbench_accounts_idx" hash (aid, bid)
> Access method: heap
> Options: fillfactor=100
>
> Test Command: "./pgbench -j 1 postgres -C -M prepared -S -T 300"
>
> Performance Test Results:
> Idea-1: Single Hash value for multiple key columns
>  TPS = ~292
>
> Idea-2: Separate Hash values for each key column. But use only the
> first one to search the bucket. Other hash values are used as payload
> to get to the matching tuple before going to the heap.
>  TPS = ~212
>
> Note: Here we got nearly 25% better performance with the single combined
> hash approach, with only TWO key columns.
>

That's a significant difference. Have you checked via perf or some
other way what causes this difference? I have seen that sometimes
single client performance with pgbench is not stable, so can you
please once check with 4 clients or so and possibly with a larger
dataset as well.

One more thing to consider is that it seems that the planner requires
a condition for the first column of an index before considering an
indexscan plan. See Tom's email [1] in this regard. I think it would
be better to see what kind of work is involved there if you want to
explore a single hash value for all columns idea.
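
For reference, "Idea-1" essentially amounts to folding the per-column
hash values into a single bucket hash, e.g. with the existing
hash_combine() helper from src/include/common/hashfn.h (a sketch with
invented variable names, not the actual POC):

    uint32      hashkey = 0;
    int         i;

    /* nkeys and column_hash[] are invented names for this illustration */
    for (i = 0; i < nkeys; i++)
        hashkey = hash_combine(hashkey, column_hash[i]);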

[1] - https://www.postgresql.org/message-id/29263.1506483172%40sss.pgh.pa.us

-- 
With Regards,
Amit Kapila.




Re: Schema variables - new implementation for Postgres 15

2021-08-28 Thread Gilles Darold
Hi,

Review resume:


This patch implements Schema Variables that are database objects that
can hold a single or composite value following the data type used at
variable declaration. Schema variables, like relations, exist within a
schema and their access is controlled via GRANT and REVOKE commands. The
schema variable can be created by the CREATE VARIABLE command, altered
using ALTER VARIABLE and removed using DROP VARIABLE.

The value of a schema variable is local to the current session.
Retrieving a variable's value returns either a NULL or a default value,
unless its value is set to something else in the current session with a
LET command. The content of a variable is not transactional. This is the
same as in regular variables in PL languages.

Schema variables are retrieved by the SELECT SQL command. Their value is
set with the LET SQL command. While schema variables share properties
with tables, their value cannot be updated with an UPDATE command.


The patch applies with the patch command without problems and
compilation reports no warnings or errors. Regression tests pass
successfully using make check or make installcheck. The patch also
includes all documentation and regression tests.

Performance is close to that of setting plpgsql variables, which is
impressive:

do $$
declare var1 int ; i int;
begin
  for i in 1..100
  loop
    var1 := i;
  end loop;
end;
$$;
DO
Time: 71,515 ms

CREATE VARIABLE var1 AS integer;
do $$
declare i int ;
begin
  for i in 1..100
  loop
    let var1 = i;
  end loop;
end;
$$;
DO
Time: 94,658 ms

There is just one thing that puzzles me. We can use:

    CREATE VARIABLE var1 AS date NOT NULL;
    postgres=# SELECT var1;
    ERROR:  null value is not allowed for NOT NULL schema variable "var1"

which I understand and is the right behavior. But if we use:

    CREATE IMMUTABLE VARIABLE var1 AS date NOT NULL;
    postgres=# SELECT var1;
    ERROR:  null value is not allowed for NOT NULL schema variable "var1"
    DETAIL:  The schema variable was not initialized yet.
    postgres=# LET var1=current_date;
    ERROR:  schema variable "var1" is declared IMMUTABLE

It would probably be better to not allow NOT NULL when IMMUTABLE is
used, because the variable can not be used at all.  Probably IMMUTABLE
without a DEFAULT value should also be restricted, as it makes no
sense. If the user wants the variable to be NULL he must use DEFAULT
NULL. This is just a thought; the above error messages are explicit and
the user can understand what was wrong with the declaration.

Apart from that, I think this patch is ready for committers, so if there
is no other opinion in favor of restricting the use of IMMUTABLE with
NOT NULL and DEFAULT, I will change the status to ready for committers.

-- 
Gilles Darold
http://www.darold.net/



Re: Added schema level support for publication.

2021-08-28 Thread Amit Kapila
On Fri, Aug 27, 2021 at 6:09 PM vignesh C  wrote:
>
> On Fri, Aug 27, 2021 at 4:57 PM Amit Kapila  wrote:
> >
> > On Fri, Aug 27, 2021 at 11:43 AM vignesh C  wrote:
> > >
> > > On Wed, Aug 25, 2021 at 3:07 PM vignesh C  wrote:
> > >
> > > I have implemented this in the 0003 patch, I have kept it separate to
> > > reduce the testing effort and also it will be easier if someone
> > > disagrees with the syntax. I will merge it to the main patch later
> > > based on the feedback. Attached v22 patch has the changes for the
> > > same.
> > >
> >
> > Few comments on v22-0001-Added-schema-level-support-for-publication:
> > 
> > 1. Why in publication_add_schema(), you are registering invalidation
> > for all schema relations? It seems this is to allow rebuilding the
> > publication info for decoding sessions. But that is not required as
> > you are registering rel_sync_cache_publication_cb for
> > publication_schema relation. In rel_sync_cache_publication_cb, we are
> > marking replicate_valid as false for each entry which will allow
> > publication info to be rebuilt in get_rel_sync_entry.
> >
> > I see that it is done for a single relation in the current code in
> > function publication_add_relation but I guess that is also not
> > required. You can once test this. If you still think it is required,
> > can you please explain me why and then we can accordingly add some
> > comments in the patch.
>
> I felt this is required for handling the following concurrency scenario:
> create schema sch1;
> create table sch1.t1(c1 int);
> insert into sch1.t1 values(10);
> update sch1.t1 set c1 = 11;
> # update will be successful and relation cache will update publication
> actions based on the current state i.e no publication
> create publication pub1 for all tables in schema sch1;
> # After the above publication is created the relations present in this
> schema should be invalidated so that the next update should fail.
>

What do you mean by update should fail? I think all the relations in
RelationSyncCache will be invalidated via rel_sync_cache_publication_cb
because you have updated pg_publication_schema and that should register
a syscache invalidation.

> If
> the relations are not invalidated the updates will be successful based
> on previous publication actions.
> update sch1.t1 set c1 = 11;
>

I think even without special relcache invalidations the relations
should be invalidated because of syscache invalidation as mentioned in
the previous point. Am I missing something here?

>
> > 5. Don't we need to expose a view for publication schemas similar to
> > pg_publication_tables?
>
> pg_publication_tables is a common view for both "FOR TABLE", "FOR ALL
> TABLES" and "FOR ALL TABLES IN SCHEMA", this view will internally
> access pg_publication_rel and pg_publication_schema to get the
> corresponding tables. I felt we don't need a separate view for
> publication schemas. Thoughts?
>

Don't you think some users might want to know all the schema names for
a publication? I am not completely sure on this point but I think it
is good to have that information for users. It might also be useful to
have pg_publication_objects where we can display object types (like
table, schema, sequence, etc.) and then object names. If you are not
convinced then we can wait and see what others think about this.

-- 
With Regards,
Amit Kapila.




Re: [PATCH] pgbench: add multiconnect option

2021-08-28 Thread Fabien COELHO


Hello David,


round-robin and random make sense.  I am wondering how round-robin
would work with -C, though?  Would you just reuse the same connection
string as the one chosen at the starting point.


Well, not necessarily, but this is debatable.


My expectation for such a behavior would be that it would reconnect to
a random connstring each time, otherwise what's the point of using
this with -C?  If we needed to forbid some option combinations that is
also an option.


Yep. ISTM that it should follow the connection policy/strategy, whatever
it is.



I was thinking of allowing a list of conninfo strings with
repeated options, e.g. --conninfo "foo" --conninfo "bla"…


That was my first thought when reading the subject of this thread:
create a list of connection strings and pass one of them to
doConnect() to grab the properties looked for.  That's a bit confusing
though as pgbench does not directly support connection strings,


They are supported because libpq silently assumes that "dbname" can be a
full connection string.


and we should be careful to keep fallback_application_name intact.


Hmmm. See attached patch, ISTM that it does the right thing.


I guess the multiple --conninfo approach is fine; I personally liked
having the list come from a file, as you could benchmark different
groups/clusters based on a file, much easier than constructing
multiple pgbench invocations depending.  I can see an argument for
both approaches.  The PGSERVICEFILE was an idea I'd had to store
easily indexed groups of connection information in a way that I didn't
need to know all the details, could easily parse, and could later pass
in the ENV so libpq could just pull out the information.


The attached version does work with the service file if the user provides 
"service=whatever" on the command line. The main difference is that it 
sticks to the libpq policy to use an explicit connection string or list of 
connection strings.


Also, note that the patch I sent dropped the --conninfo option.
Connections are simply the last arguments to pgbench.



I'll see if I can take a look at your latest patch.


Thanks!

I was also wondering about how we should handle `pgbench -i` with 
multiple connection strings; currently it would only initialize with the 
first DSN it gets, but it probably makes sense to run initialize against 
all of the databases (or at least attempt to).


I'll tend to disagree on this one. Pgbench's whole expectation is to run 
against "one" system, which might be composed of several nodes because of 
replication. I do not think that it is desirable to jump to "several 
fully independent databases".


Maybe this is one argument for the multiple --conninfo handling, since 
you could explicitly pass the databases you want.  (Not that it is hard 
to just loop over connection info and `pgbench -i` with ENV, or any 
other number of ways to accomplish the same thing.)


Yep.

--
Fabien.

Summary of GSoC 2021

2021-08-28 Thread Trafalgar Ricardo Lu
Hi Everyone,

I am Jianhui Lu, a student participating in GSoC 2021, and my project is
'add monitoring of pg_stat_statements to pg_systat'. The following is a
summary of the work I have done during the past 10 weeks.

The first part is about adding new features to pg_systat. The first and
most important feature is adding monitoring of pg_stat_statements. It
enables pg_systat to show statistics about query execution. The second
feature is adding monitoring of pg_stat_progress_copy, which is a new
view in pg14. And the third feature is monitoring of pg_buffercache,
which tracks the data in the shared buffer cache.

The second part is about compatibility. Since pg_stat_progress_copy is a
new feature in pg14, we won't show this view when we connect to an older
version. And pg_stat_statements added new columns in pg13 and changed
some column names; the new columns likewise won't show on older versions.

The third part of my work is about documentation. I made an asciinema
video to show how to use pg_systat, and in the right pane I have shown
the corresponding table in PostgreSQL, so users will know the
relationship between pg_systat and the database. Last but not least, I
rewrote the README using rst and added more information, including a
basic introduction and the homepage.

Here are links to my commit [1] and asciinema video[2].

It's really a wonderful experience to work with the community!

Best Regards,

Lu

[1] https://gitlab.com/trafalgar_lu/pg_systat/-/commits/main/

[2] https://asciinema.org/a/427202


Re: Gather performance analysis

2021-08-28 Thread Dilip Kumar
On Tue, Aug 24, 2021 at 8:48 PM Robert Haas  wrote:

> On Fri, Aug 6, 2021 at 4:31 AM Dilip Kumar  wrote:
> > Results: (query EXPLAIN ANALYZE SELECT * FROM t;)
> > 1) Non-parallel (default)
> >  Execution Time: 31627.492 ms
> >
> > 2) Parallel with 4 workers (force by setting parallel_tuple_cost to 0)
> >  Execution Time: 37498.672 ms
> >
> > 3) Same as above (2) but with the patch.
> > Execution Time: 23649.287 ms
>
> This strikes me as an amazingly good result. I guess before seeing
> these results, I would have said that you can't reasonably expect
> parallel query to win on a query like this because there isn't enough
> for the workers to do. It's not like they are spending time evaluating
> filter conditions or anything like that - they're just fetching tuples
> off of disk pages and sticking them into a queue. And it's unclear to
> me why it should be better to have a bunch of processes doing that
> instead of just one. I would have thought, looking at just (1) and
> (2), that parallelism gained nothing and communication overhead lost 6
> seconds.
>
> But what this suggests is that parallelism gained at least 8 seconds,
> and communication overhead lost at least 14 seconds. In fact...
>

Right, good observation.


> > - If I apply both Experiment#1 and Experiment#2 patches together then,
> > we can further reduce the execution time to 20963.539 ms (with 4
> > workers and 4MB tuple queue size)
>
> ...this suggests that parallelism actually gained at least 10-11
> seconds, and the communication overhead lost at least 15-16 seconds.
>

Yes


> If that's accurate, it's pretty crazy. We might need to drastically
> reduce the value of parallel_tuple_cost if these results hold up and
> this patch gets committed.
>

In one of my experiments[Test1] I have noticed that even on the head the
force parallel plan is significantly faster compared to the non-parallel
plan, but with patch it is even better.  The point is now also there might
be some cases where the force parallel plans are faster but we are not sure
whether we can reduce the parallel_tuple_cost or not.  But with the patch
it is definitely sure that the parallel tuple queue is faster compared to
what we have now, So I agree we should consider reducing the
parallel_tuple_cost after this patch.

Additionally, I've done some more experiments with artificial workloads, as
well as workloads where the parallel plan is selected by default, and in
all cases I've seen a significant improvement.  The gain is directly
proportional to the load on the tuple queue, as expected.

Test1: (Worker returns all tuples but only a few tuples return to the client)

INSERT INTO t SELECT i%10, repeat('a', 200) from
generate_series(1,2) as i;
set max_parallel_workers_per_gather=4;

Target Query: SELECT random() FROM t GROUP BY a;

Non-parallel (default plan): 77170.421 ms
Parallel (parallel_tuple_cost=0):  53794.324 ms
Parallel with patch (parallel_tuple_cost=0): 42567.850 ms

20% gain compared to force parallel, 45% gain compared to default plan.

Test2: (Parallel case with default parallel_tuple_cost)
--
INSERT INTO t SELECT i, repeat('a', 200) from generate_series(1,2)
as i;

set max_parallel_workers_per_gather=4;
SELECT * from t WHERE a < 1750;
Parallel(default plan): 23730.054 ms
Parallel with patch (default plan): 21614.251 ms

8 to 10 % gain compared to the default parallel plan.

I have done cleanup in the patch and I will add this to the September
commitfest.

I am planning to do further testing for identifying the optimal batch size
in different workloads.  With the above workload I am seeing similar results
with batch size 4k to 16k (1/4 of the ring size), so in the attached patch I
have kept it as 1/4 of the ring size.  We might change that based on more
analysis and testing.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From 1142c46eaa3dcadc4bb28133ce873476ba3067c4 Mon Sep 17 00:00:00 2001
From: Dilip Kumar 
Date: Wed, 4 Aug 2021 16:51:01 +0530
Subject: [PATCH v1] Optimize parallel tuple send (shm_mq_send_bytes)

Do not update shm_mq's mq_bytes_written until we have written
an amount of data greater than 1/4th of the ring size.  This
will prevent frequent CPU cache misses, and it will also avoid
frequent SetLatch() calls, which are quite expensive.
---
 src/backend/executor/tqueue.c |  2 +-
 src/backend/libpq/pqmq.c  |  7 +++-
 src/backend/storage/ipc/shm_mq.c  | 65 +--
 src/include/storage/shm_mq.h  |  8 +++--
 src/test/modules/test_shm_mq/test.c   |  7 ++--
 src/test/modules/test_shm_mq/worker.c |  2 +-
 6 files changed, 72 insertions(+), 19 deletions(-)

diff --git a/src/backend/executor/tqueue.c b/src/backend/executor/tqueue.c
index 7af9fbe..eb0cbd7 100644
--- a/src/backend/executor/tqueue.c
+++ b/src/backend/executor/tqueue.c
@@ -60,7 +60,7 @@