Planner chose a much slower plan in hashjoin, using a large table as the inner table.

2019-11-17 Thread Jinbao Chen
Hi Hackers,

The planner will use the big table as the inner table in a hash join if the
small table has fewer unique values. But this plan is much slower than using
the small table as the inner table. The problem occurs on the master branch
without parallel scan.

For example

create table t_small(a int);
create table t_big(b int);
insert into t_small select i%100 from generate_series(0, 3000) i;
insert into t_big select i%100000 from generate_series(1, 100000000) i;
analyze t_small;
analyze t_big;
set max_parallel_workers_per_gather = 0;

and the plan made by the planner is:
demo2=# explain select * from t_small, t_big where a = b;
  QUERY PLAN
---
 Hash Join  (cost=3083104.72..3508073.65 rows=3045990 width=8)
   Hash Cond: (t_small.a = t_big.b)
   ->  Seq Scan on t_small  (cost=0.00..44.01 rows=3001 width=4)
   ->  Hash  (cost=1442478.32..1442478.32 rows=10032 width=4)
 ->  Seq Scan on t_big  (cost=0.00..1442478.32 rows=10032
width=4)

and it runs for nearly 58 seconds:
demo2=# select * from t_small, t_big where a = b;
Time: 58544.525 ms (00:58.545)

But if we hack the planner to use the small table as the inner table, it runs in about 19 seconds:
demo2=# explain select * from t_small, t_big where a = b;
   QUERY PLAN
-
 Hash Join  (cost=81.52..1723019.82 rows=3045990 width=8)
   Hash Cond: (t_big.b = t_small.a)
   ->  Seq Scan on t_big  (cost=0.00..1442478.32 rows=10032 width=4)
   ->  Hash  (cost=44.01..44.01 rows=3001 width=4)
 ->  Seq Scan on t_small  (cost=0.00..44.01 rows=3001 width=4)

demo2=# select * from t_small, t_big where a = b;
Time: 18751.588 ms (00:18.752)


RCA:

The cost attributed to the inner table mainly comes from building the hash table:
startup_cost += (cpu_operator_cost * num_hashclauses + cpu_tuple_cost)
* inner_path_rows;

The cost attributed to the outer table mainly comes from probing the hash table.
Calculating the hash value:
run_cost += cpu_operator_cost * num_hashclauses * outer_path_rows;

Traversing the linked list in the bucket and comparing:
run_cost += hash_qual_cost.per_tuple * outer_path_rows *
clamp_row_est(inner_path_rows * innerbucketsize) * 0.5;

In general, the cost of building a hash table is higher than the cost of
probing it, so we tend to use the small table as the inner table. But if the
average chain length of the buckets is large, the situation is just the
opposite.

In the test case above, the small table has 3000 tuples and 100 distinct
values on column 'a'. If we use the small table as the inner table, the chain
length of each used bucket is 30, and the planner assumes we have to search
the whole chain on every probe of the hash table. So the estimated cost of
probing becomes bigger than the cost of building the hash table, and the
planner uses the big table as the inner table.

But in fact this is not true. We initialize 620,000 buckets in the hash
table, but only 100 buckets have chains of length 30; the other buckets are
empty, so for tuples that fall into them only hash values need to be
compared, which costs very little. The outer table has 100,000 distinct keys
and 100,000,000 tuples. Only (100/100,000) * tuple_num of the outer tuples
will search a whole chain; the other tuples (number =
(99,900/100,000) * tuple_num) just compare the hash value. So the actual cost
is much smaller than what the planner calculated. This is the reason why
using the small table as inner is faster.
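
To see why this cost model prefers the big table as the inner side, here is a
back-of-envelope sketch that applies only the three terms quoted above to both
orientations. It assumes the default cost parameters (cpu_operator_cost =
0.0025, cpu_tuple_cost = 0.01), a single int4 hash clause, and inner bucket
sizes of roughly 1/ndistinct; sequential scan and batching costs are ignored,
so the numbers are only indicative, not what EXPLAIN prints:

/* Back-of-envelope sketch, not cost_hashjoin() itself. */
#include <stdio.h>

static double
hash_probe_and_build_cost(double inner_rows, double outer_rows,
                          double innerbucketsize)
{
    const double cpu_operator_cost   = 0.0025;  /* default GUC value */
    const double cpu_tuple_cost      = 0.01;    /* default GUC value */
    const double hash_qual_per_tuple = cpu_operator_cost;  /* int4 "=" */

    double build   = (cpu_operator_cost + cpu_tuple_cost) * inner_rows;
    double hashing = cpu_operator_cost * outer_rows;
    double compare = hash_qual_per_tuple * outer_rows *
                     (inner_rows * innerbucketsize) * 0.5;

    return build + hashing + compare;
}

int
main(void)
{
    /* t_small: 3001 rows, 100 distinct; t_big: 1e8 rows, 100000 distinct */
    printf("small table as inner: %.0f\n",
           hash_probe_and_build_cost(3001, 1e8, 1.0 / 100));     /* ~4.0 million */
    printf("big table as inner:   %.0f\n",
           hash_probe_and_build_cost(1e8, 3001, 1.0 / 100000));  /* ~1.25 million */
    return 0;
}

By these terms alone the small-inner plan looks about three times more
expensive, which matches the planner's choice above, even though at execution
time the long chains are almost never traversed.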


Re: Conflict handling for COPY FROM

2019-11-17 Thread Surafel Temesgen
On Fri, Nov 15, 2019 at 6:24 PM Alexey Kondratov 
wrote:

> On 11.11.2019 16:00, Surafel Temesgen wrote:
> >
> >
> > Next, you use DestRemoteSimple for returning conflicting tuples back:
> >
> > +dest = CreateDestReceiver(DestRemoteSimple);
> > +dest->rStartup(dest, (int) CMD_SELECT, tupDesc);
> >
> > However, printsimple supports very limited subset of built-in
> > types, so
> >
> > CREATE TABLE large_test (id integer primary key, num1 bigint, num2
> > double precision);
> > COPY large_test FROM '/path/to/copy-test.tsv';
> > COPY large_test FROM '/path/to/copy-test.tsv' ERROR 3;
> >
> > fails with following error 'ERROR:  unsupported type OID: 701', which
> > seems to be very confusing from the end user perspective. I've
> > tried to
> > switch to DestRemote, but couldn't figure it out quickly.
> >
> >
> > fixed
>
> Thanks, now it works with my tests.
>
> 1) Maybe it is fine, but now I do not like this part:
>
> +portal = GetPortalByName("");
> +dest = CreateDestReceiver(DestRemote);
> +SetRemoteDestReceiverParams(dest, portal);
> +dest->rStartup(dest, (int) CMD_SELECT, tupDesc);
>
> Here you implicitly use the fact that portal with a blank name is always
> created in exec_simple_query before we get to this point. Next, you
> create new DestReceiver and set it to this portal, but it is also
> already created and set in the exec_simple_query.
>
> Would it be better if you just explicitly pass ready DestReceiver to
> DoCopy (similarly to how it is done for T_ExecuteStmt / ExecuteQuery),
>
>
Good idea. Thank you.


>
> 2) My second concern is that you use three internal flags to track
> errors limit:
>
> +    int         error_limit;        /* total number of error to ignore */
> +    bool        ignore_error;       /* is ignore error specified? */
> +    bool        ignore_all_error;   /* is error_limit -1 (ignore all error)
> +                                     * specified? */
>
> Though it seems that we can just leave error_limit as a user-defined
> constant and track errors with something like errors_count. In that case
> you do not need auxiliary ignore_all_error flag. But probably it is a
> matter of personal choice.
>
>
Using bool flags saves us from using an integer type as a boolean, and it
preserves the fact that an error limit was specified even if it has dropped
to zero; it also seems straightforward to me to treat ignore_all_error
separately.
Attached is the patch that uses the already created DestReceiver.

regards
Surafel
diff --git a/doc/src/sgml/ref/copy.sgml b/doc/src/sgml/ref/copy.sgml
index d9b7c4d0d4..ffcfe1e8d3 100644
--- a/doc/src/sgml/ref/copy.sgml
+++ b/doc/src/sgml/ref/copy.sgml
@@ -44,6 +44,7 @@ COPY { table_name [ ( column_name [, ...] )
 FORCE_NULL ( column_name [, ...] )
 ENCODING 'encoding_name'
+ERROR_LIMIT 'limit_number'
 
  
 
@@ -355,6 +356,23 @@ COPY { table_name [ ( 

 
+   
+ERROR_LIMIT
+
+ 
+  Specifies to return error record up to limit_number number.
+  specifying it to -1 returns all error record.
+ 
+
+ 
+  Currently, only unique or exclusion constraint violation
+  and same record formatting error is ignored.
+ 
+
+
+   
+

 WHERE
 
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index e17d8c760f..c911b3d0c2 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -24,6 +24,7 @@
 #include "access/tableam.h"
 #include "access/xact.h"
 #include "access/xlog.h"
+#include "access/printtup.h"
 #include "catalog/dependency.h"
 #include "catalog/pg_authid.h"
 #include "catalog/pg_type.h"
@@ -48,7 +49,9 @@
 #include "port/pg_bswap.h"
 #include "rewrite/rewriteHandler.h"
 #include "storage/fd.h"
+#include "storage/lmgr.h"
 #include "tcop/tcopprot.h"
+#include "tcop/pquery.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
@@ -154,6 +157,7 @@ typedef struct CopyStateData
 	List	   *convert_select; /* list of column names (can be NIL) */
 	bool	   *convert_select_flags;	/* per-column CSV/TEXT CS flags */
 	Node	   *whereClause;	/* WHERE condition (or NULL) */
+	int			error_limit;	/* total number of error to ignore */
 
 	/* these are just for error messages, see CopyFromErrorCallback */
 	const char *cur_relname;	/* table name for error messages */
@@ -183,6 +187,9 @@ typedef struct CopyStateData
 	bool		volatile_defexprs;	/* is any of defexprs volatile? */
 	List	   *range_table;
 	ExprState  *qualexpr;
+	bool		ignore_error;	/* is ignore error specified? */
+	bool		ignore_all_error;	/* is error_limit -1 (ignore all error)
+	 * specified? */
 
 	TransitionCaptureState *transition_capture;
 
@@ -837,7 +844,7 @@ CopyLoadRawBuf(CopyState cstate)
 void
 DoCopy(ParseState *pstate, const CopyStmt *stmt,
 	   int stmt_location, int stmt_len,
-	   uint64 *processed)
+	   uint64 *processed, DestReceiver *dest)
 {
 	CopyState	cstate;
 	bool		is_from = stmt->is_from;

Re: cost based vacuum (parallel)

2019-11-17 Thread Masahiko Sawada
On Fri, 15 Nov 2019 at 11:54, Amit Kapila  wrote:
>
> On Wed, Nov 13, 2019 at 10:02 AM Masahiko Sawada
>  wrote:
> >
> > I've done some tests while changing shared buffer size, delays and
> > number of workers. The overall results has the similar tendency as the
> > result shared by Dilip and looks reasonable to me.
> >
>
> Thanks, Sawada-san for repeating the tests.  I can see from yours,
> Dilip and Mahendra's testing that the delay is distributed depending
> upon the I/O done by a particular worker and the total I/O is also as
> expected in various kinds of scenarios.  So, I think this is a better
> approach.  Do you agree or you think we should still investigate more
> on another approach as well?
>
> I would like to summarize this approach.  The basic idea for parallel
> vacuum is to allow the parallel workers and master backend to have a
> shared view of vacuum cost related parameters (mainly
> VacuumCostBalance) and allow each worker to update it and then based
> on that decide whether it needs to sleep.  With this basic idea, we
> found that in some cases the throttling is not accurate as explained
> with an example in my email above [1] and then the tests performed by
> Dilip and others in the following emails (In short, the workers doing
> more I/O can be throttled less).  Then as discussed in an email later
> [2], we tried a way to avoid letting the workers sleep which has done
> less or no I/O as compared to other workers.  This ensured that
> workers who are doing more I/O got throttled more.  The idea is to
> allow any worker to sleep only if it has performed the I/O above a
> certain threshold and the overall balance is more than the cost_limit
> set by the system.  Then we will allow the worker to sleep
> proportional to the work done by it and reduce the
> VacuumSharedCostBalance by the amount which is consumed by the current
> worker.  This scheme leads to the desired throttling by different
> workers based on the work done by the individual worker.
>
> We have tested this idea with various kinds of workloads like by
> varying shared buffer size, delays and number of workers.  Then also,
> we have tried with a different number of indexes and workers.  In all
> the tests, we found that the workers are throttled proportional to the
> I/O being done by a particular worker.
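
A minimal sketch of the scheme summarized above (only VacuumSharedCostBalance
is a name from the thread; everything else here is invented for illustration,
and in the real design the shared balance lives in shared memory and is
updated atomically):

/* Sketch only: a worker adds its recent I/O cost to the shared balance, but
 * sleeps only if (a) its own unsettled share is above a minimum threshold
 * and (b) the shared balance has crossed the cost limit.  The sleep is
 * proportional to the worker's own share, which is then subtracted from the
 * shared balance so other workers are not throttled for it. */
#include <stdio.h>

static double VacuumSharedCostBalance;          /* stand-in for shared memory */

static const double VacuumCostLimit   = 200.0;  /* default vacuum_cost_limit */
static const double VacuumCostDelayMs = 2.0;    /* illustrative delay in ms */
static const double MinLocalBalance   = 50.0;   /* "enough I/O" threshold (made up) */

/* Returns how long (ms) the calling worker should sleep; 0 means no sleep. */
static double
parallel_vacuum_throttle(double *my_balance, double recent_io_cost)
{
    *my_balance += recent_io_cost;
    VacuumSharedCostBalance += recent_io_cost;

    if (*my_balance < MinLocalBalance)              /* did little I/O: no sleep */
        return 0.0;
    if (VacuumSharedCostBalance < VacuumCostLimit)  /* overall budget not exceeded */
        return 0.0;

    double sleep_ms = VacuumCostDelayMs * (*my_balance / VacuumCostLimit);
    VacuumSharedCostBalance -= *my_balance;         /* give back what we consumed */
    *my_balance = 0.0;
    return sleep_ms;
}

int
main(void)
{
    double w1 = 0.0, w2 = 0.0;      /* worker 1 does 20x the I/O of worker 2 */

    for (int i = 0; i < 10; i++)
    {
        double s1 = parallel_vacuum_throttle(&w1, 40.0);
        double s2 = parallel_vacuum_throttle(&w2, 2.0);

        if (s1 > 0.0)
            printf("round %d: worker 1 sleeps %.1f ms\n", i, s1);
        if (s2 > 0.0)
            printf("round %d: worker 2 sleeps %.1f ms\n", i, s2);
    }
    return 0;
}

Only the worker doing the heavy I/O ever sleeps here, which is the
proportional throttling the tests above confirm.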

Thank you for summarizing!

I agreed to this approach.

Regards,

-- 
Masahiko Sawada  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: dropdb --force

2019-11-17 Thread Pavel Stehule
po 18. 11. 2019 v 7:37 odesílatel Amit Kapila 
napsal:

> On Mon, Nov 18, 2019 at 10:59 AM Pavel Stehule 
> wrote:
> > po 18. 11. 2019 v 6:24 odesílatel Amit Kapila 
> napsal:
> >>
> >> On Mon, Nov 18, 2019 at 10:33 AM Pavel Stehule 
> wrote:
> >> >
> >> > po 18. 11. 2019 v 4:43 odesílatel vignesh C 
> napsal:
> >> >>
> >> >>
> >> >> I had seen that isolation test(src/test/isolation) has a framework to
> >> >> support this. You can have a look to see if it can be handled using
> >> >> that.
> >> >
> >> >
> >> > I'll look there
> >> >
> >>
> >> If we want to have a test for this, then you might want to look at
> >> test src/test/recovery/t/013_crash_restart.  In that test, we keep a
> >> connection open and then validate whether it is terminated.  Having
> >> said that, I think it might be better to add this as a separate test
> >> patch apart from main patch because it is a bit of a timing-dependent
> >> test and might fail on some slow machines.  We can always revert this
> >> if it turns out to be an unstable test.
> >
> >
> > +1
> >
>
> So, are you planning to give it a try?  I think even if we want to
> commit this separately, it is better to have a test for this before we
> commit.
>

I'll send this test today

Pavel


>
> --
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com
>


Re: [HACKERS] Block level parallel vacuum

2019-11-17 Thread Masahiko Sawada
On Mon, 18 Nov 2019 at 15:34, Amit Kapila  wrote:
>
> On Mon, Nov 18, 2019 at 11:37 AM Masahiko Sawada
>  wrote:
> >
> > On Wed, 13 Nov 2019 at 14:31, Amit Kapila  wrote:
> > >
> > >
> > > Based on these needs, we came up with a way to allow users to specify
> > > this information for IndexAm's. Basically, Indexam will expose a
> > > variable amparallelvacuumoptions which can have below options
> > >
> > > VACUUM_OPTION_NO_PARALLEL   1 << 0 # vacuum (neither bulkdelete nor
> > > vacuumcleanup) can't be performed in parallel
> >
> > I think VACUUM_OPTION_NO_PARALLEL can be 0 so that index AMs who don't
> > want to support parallel vacuum don't have to set anything.
> >
>
> make sense.
>
> > > VACUUM_OPTION_PARALLEL_BULKDEL   1 << 1 # bulkdelete can be done in
> > > parallel (Indexes nbtree, hash, gin, gist, spgist, bloom will set this
> > > flag)
> > > VACUUM_OPTION_PARALLEL_COND_CLEANUP  1 << 2 # vacuumcleanup can be
> > > done in parallel if bulkdelete is not performed (Indexes nbtree, brin,
> > > gin, gist,
> > > spgist, bloom will set this flag)
> > > VACUUM_OPTION_PARALLEL_CLEANUP  1 << 3 # vacuumcleanup can be done in
> > > parallel even if bulkdelete is already performed (Indexes gin, brin,
> > > and bloom will set this flag)
> >
> > I think gin and bloom don't need to set both but should set only
> > VACUUM_OPTION_PARALLEL_CLEANUP.
> >
> > And I'm going to disallow index AMs to set both
> > VACUUM_OPTION_PARALLEL_COND_CLEANUP and VACUUM_OPTION_PARALLEL_CLEANUP
> > by assertions, is that okay?
> >
>
> Sounds reasonable to me.
>
> Are you planning to include the changes related to I/O throttling
> based on the discussion in the nearby thread [1]?  I think you can do
> that if you agree with the conclusion in the last email[1], otherwise,
> we can explore it separately.

Yes, I agree. I'm going to include those changes in the next version of the
patches, and I think we will be able to discuss this further based on
the patch.

Regards,

--
Masahiko Sawada  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: dropdb --force

2019-11-17 Thread Amit Kapila
On Mon, Nov 18, 2019 at 10:59 AM Pavel Stehule  wrote:
> po 18. 11. 2019 v 6:24 odesílatel Amit Kapila  
> napsal:
>>
>> On Mon, Nov 18, 2019 at 10:33 AM Pavel Stehule  
>> wrote:
>> >
>> > po 18. 11. 2019 v 4:43 odesílatel vignesh C  napsal:
>> >>
>> >>
>> >> I had seen that isolation test(src/test/isolation) has a framework to
>> >> support this. You can have a look to see if it can be handled using
>> >> that.
>> >
>> >
>> > I'll look there
>> >
>>
>> If we want to have a test for this, then you might want to look at
>> test src/test/recovery/t/013_crash_restart.  In that test, we keep a
>> connection open and then validate whether it is terminated.  Having
>> said that, I think it might be better to add this as a separate test
>> patch apart from main patch because it is a bit of a timing-dependent
>> test and might fail on some slow machines.  We can always revert this
>> if it turns out to be an unstable test.
>
>
> +1
>

So, are you planning to give it a try?  I think even if we want to
commit this separately, it is better to have a test for this before we
commit.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Block level parallel vacuum

2019-11-17 Thread Amit Kapila
On Mon, Nov 18, 2019 at 11:37 AM Masahiko Sawada
 wrote:
>
> On Wed, 13 Nov 2019 at 14:31, Amit Kapila  wrote:
> >
> >
> > Based on these needs, we came up with a way to allow users to specify
> > this information for IndexAm's. Basically, Indexam will expose a
> > variable amparallelvacuumoptions which can have below options
> >
> > VACUUM_OPTION_NO_PARALLEL   1 << 0 # vacuum (neither bulkdelete nor
> > vacuumcleanup) can't be performed in parallel
>
> I think VACUUM_OPTION_NO_PARALLEL can be 0 so that index AMs who don't
> want to support parallel vacuum don't have to set anything.
>

make sense.

> > VACUUM_OPTION_PARALLEL_BULKDEL   1 << 1 # bulkdelete can be done in
> > parallel (Indexes nbtree, hash, gin, gist, spgist, bloom will set this
> > flag)
> > VACUUM_OPTION_PARALLEL_COND_CLEANUP  1 << 2 # vacuumcleanup can be
> > done in parallel if bulkdelete is not performed (Indexes nbtree, brin,
> > gin, gist,
> > spgist, bloom will set this flag)
> > VACUUM_OPTION_PARALLEL_CLEANUP  1 << 3 # vacuumcleanup can be done in
> > parallel even if bulkdelete is already performed (Indexes gin, brin,
> > and bloom will set this flag)
>
> I think gin and bloom don't need to set both but should set only
> VACUUM_OPTION_PARALLEL_CLEANUP.
>
> And I'm going to disallow index AMs to set both
> VACUUM_OPTION_PARALLEL_COND_CLEANUP and VACUUM_OPTION_PARALLEL_CLEANUP
> by assertions, is that okay?
>

Sounds reasonable to me.

Are you planning to include the changes related to I/O throttling
based on the discussion in the nearby thread [1]?  I think you can do
that if you agree with the conclusion in the last email[1], otherwise,
we can explore it separately.

[1] - 
https://www.postgresql.org/message-id/CAA4eK1%2BuDgLwfnAhQWGpAe66D85PdkeBygZGVyX96%2BovN1PbOg%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: SimpleLruTruncate() mutual exclusion

2019-11-17 Thread Noah Misch
On Sun, Nov 17, 2019 at 12:56:52PM +0100, Dmitry Dolgov wrote:
> > On Wed, Jul 31, 2019 at 11:51:17PM -0700, Noah Misch wrote:
> > > Also, if I understand the data-loss hazard properly, it's what you
> > > said in the other thread: the latest_page_number could advance after
> > > we make our decision about what to truncate, and then maybe we could
> > > truncate newly-added data.  We surely don't want to lock out the
> > > operations that can advance last_page_number, so how does serializing
> > > vac_truncate_clog help?
> > >
> > > I wonder whether what we need to do is add some state in shared
> > > memory saying "it is only safe to create pages before here", and
> > > make SimpleLruZeroPage or its callers check that.  The truncation
> > > logic would advance that value, but only after completing a scan.
> > > (Oh ..., hmm, maybe the point of serializing truncations is to
> > > ensure that we know that we can safely advance that?)
> >
> > vac_truncate_clog() already records "it is only safe to create pages before
> > here" in ShmemVariableCache->xidWrapLimit, which it updates after any 
> > unlinks.
> > The trouble comes when two vac_truncate_clog() run in parallel and you get a
> > sequence of events like this:
> >
> > vac_truncate_clog() instance 1 starts, considers segment ABCD eligible to 
> > unlink
> > vac_truncate_clog() instance 2 starts, considers segment ABCD eligible to 
> > unlink
> > vac_truncate_clog() instance 1 unlinks segment ABCD
> > vac_truncate_clog() instance 1 calls SetTransactionIdLimit()
> > vac_truncate_clog() instance 1 finishes
> > some backend calls SimpleLruZeroPage(), creating segment ABCD
> > vac_truncate_clog() instance 2 unlinks segment ABCD
> >
> > Serializing vac_truncate_clog() fixes that.
> 
> I'm probably missing something, so just wanted to clarify. Do I
> understand correctly, that thread [1] and this one are independent, and
> it is assumed in the scenario above that we're at the end of XID space,
> but not necessarily having rounding issues? I'm a bit confused, since
> the reproduce script does something about cutoffPage, and I'm not sure
> if it's important or not.

I think the repro recipe contained an early fix for the other thread's bug.
While they're independent in principle, I likely never reproduced this bug
without having a fix in place for the other thread's bug.  The bug in the
other thread was easier to hit.

> > > Can you post whatever script you used to detect/reproduce the problem
> > > in the first place?
> >
> > I've attached it, but it's pretty hackish.  Apply the patch on commit 
> > 7170268,
> > make sure your --bindir is in $PATH, copy "conf-test-pg" to your home
> > directory, and run "make trunc-check".  This incorporates xid-burn
> > acceleration code that Jeff Janes shared with -hackers at some point.
> 
> I've tried to use it to understand the problem better, but somehow
> couldn't reproduce it via the suggested script. I've applied it to 7170268
> (I also tried applying it rebased onto master, with the same results) with
> the conf-test-pg in place, but after going through all the steps there are
> no errors like:

That is unfortunate.

> Is there anything extra one needs to do to reproduce the problem, maybe
> adjust delays or something?

It wouldn't surprise me.  I did all my testing on one or two systems; the
hard-coded delays suffice there, but I'm sure there exist systems needing
different delays.

Though I did reproduce this bug, I'm motivated by the abstract problem more
than any particular way to reproduce it.  Commit 996d273 inspired me; by
removing a GetCurrentTransactionId(), it allowed the global xmin to advance at
times it previously could not.  That subtly changed the concurrency
possibilities.  I think safe, parallel SimpleLruTruncate() is difficult to
maintain and helps too rarely to justify such maintenance.  That's why I
propose eliminating the concurrency.

> [1]: 
> https://www.postgresql.org/message-id/flat/20190202083822.GC32531%40gust.leadboat.com




Re: [HACKERS] Block level parallel vacuum

2019-11-17 Thread Masahiko Sawada
On Wed, 13 Nov 2019 at 14:31, Amit Kapila  wrote:
>
> On Wed, Nov 13, 2019 at 9:48 AM Amit Kapila  wrote:
> >
> > Yeah, 0,2,3 and 4 sounds reasonable to me.  Earlier, Dilip also got
> > confused with option 1.
> >
>
> Let me try to summarize the discussion on this point and see if others
> have any opinion on this matter.

Thank you for summarizing.

>
> We need a way to allow IndexAm to specify whether it can participate
> in a parallel vacuum.  As we know there are two phases of
> index-vacuum, bulkdelete and vacuumcleanup and in many cases, the
> bulkdelete performs the main deletion work and then vacuumcleanup just
> returns index statistics. So, for such cases, we don't want the second
> phase to be performed by a parallel vacuum worker.  Now, if the
> bulkdelete phase is not performed, then vacuumcleanup can process the
> entire index in which case it is better to do that phase via parallel
> worker.
>
> OTOH, in some cases vacuumcleanup takes another pass over the index to
> reclaim empty pages and record the same in the FSM even if
> bulkdelete is performed.  This happens in gin and bloom indexes.
> Then, we have an index where we do all the work in cleanup phase like
> in the case of brin indexes.  Now, for this category of indexes, we
> want vacuumcleanup phase to be also performed by a parallel worker.
>
> In short different indexes have different requirements for which phase
> of index vacuum can be performed in parallel.  Just to be clear, we
> can't perform both the phases (bulkdelete and cleanup) in one-go as
> bulk-delete can happen multiple times on a large index whereas
> vacuumcleanup is done once at the end.
>
> Based on these needs, we came up with a way to allow users to specify
> this information for IndexAm's. Basically, Indexam will expose a
> variable amparallelvacuumoptions which can have below options
>
> VACUUM_OPTION_NO_PARALLEL   1 << 0 # vacuum (neither bulkdelete nor
> vacuumcleanup) can't be performed in parallel

I think VACUUM_OPTION_NO_PARALLEL can be 0 so that index AMs who don't
want to support parallel vacuum don't have to set anything.

> VACUUM_OPTION_PARALLEL_BULKDEL   1 << 1 # bulkdelete can be done in
> parallel (Indexes nbtree, hash, gin, gist, spgist, bloom will set this
> flag)
> VACUUM_OPTION_PARALLEL_COND_CLEANUP  1 << 2 # vacuumcleanup can be
> done in parallel if bulkdelete is not performed (Indexes nbtree, brin,
> gin, gist,
> spgist, bloom will set this flag)
> VACUUM_OPTION_PARALLEL_CLEANUP  1 << 3 # vacuumcleanup can be done in
> parallel even if bulkdelete is already performed (Indexes gin, brin,
> and bloom will set this flag)

I think gin and bloom don't need to set both but should set only
VACUUM_OPTION_PARALLEL_CLEANUP.

And I'm going to disallow index AMs to set both
VACUUM_OPTION_PARALLEL_COND_CLEANUP and VACUUM_OPTION_PARALLEL_CLEANUP
by assertions, is that okay?
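
For reference, a minimal sketch of the proposed bits and the assertion
discussed above (bit values as in the quoted summary, NO_PARALLEL taken as 0
per the earlier suggestion; the variable names and the use of plain int and
assert() here are only illustrative, not the actual IndexAmRoutine layout):

#include <assert.h>

#define VACUUM_OPTION_NO_PARALLEL           0
#define VACUUM_OPTION_PARALLEL_BULKDEL      (1 << 1)
#define VACUUM_OPTION_PARALLEL_COND_CLEANUP (1 << 2)
#define VACUUM_OPTION_PARALLEL_CLEANUP      (1 << 3)

/* e.g. nbtree/gist/spgist: parallel bulkdelete, and parallel cleanup only
 * when no bulkdelete was performed */
static const int btree_vacuum_options =
    VACUUM_OPTION_PARALLEL_BULKDEL | VACUUM_OPTION_PARALLEL_COND_CLEANUP;

/* e.g. gin/bloom, following the reply above: parallel bulkdelete plus
 * unconditional parallel cleanup (CLEANUP only, not COND_CLEANUP) */
static const int gin_vacuum_options =
    VACUUM_OPTION_PARALLEL_BULKDEL | VACUUM_OPTION_PARALLEL_CLEANUP;

/* The proposed sanity check: COND_CLEANUP and CLEANUP are mutually exclusive */
static void
check_vacuum_options(int opts)
{
    assert(!((opts & VACUUM_OPTION_PARALLEL_COND_CLEANUP) &&
             (opts & VACUUM_OPTION_PARALLEL_CLEANUP)));
}

int
main(void)
{
    check_vacuum_options(btree_vacuum_options);
    check_vacuum_options(gin_vacuum_options);
    return 0;
}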

Regards,

-- 
Masahiko Sawada  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: dropdb --force

2019-11-17 Thread Pavel Stehule
po 18. 11. 2019 v 6:24 odesílatel Amit Kapila 
napsal:

> On Mon, Nov 18, 2019 at 10:33 AM Pavel Stehule 
> wrote:
> >
> > po 18. 11. 2019 v 4:43 odesílatel vignesh C 
> napsal:
> >>
> >>
> >> When we don't specify -e option, the query used to drop db will not be
> >> printed like below:
> >> ./dropdb testdb1
> >> When we specify -e option, the query used to drop db will be printed
> like below:
> >> ./dropdb -e testdb2
> >> SELECT pg_catalog.set_config('search_path', '', false);
> >> DROP DATABASE testdb2;
> >> If we specify -e option, the query that is being used to drop db will
> >> be printed. In the existing test I could not see the inclusion of -e
> >> option. I was thinking to add a test including -e that way the query
> >> that includes force option gets validated.
> >
> >
> > still I don't understand. The created query is tested already by current
> test.
> >
> > Do you want to test just -e option?
> >
>
> Yeah, it seems Vignesh wants to do that.  It will help in verifying
> that the command generated by code is correct.  However, I think there
> is no pressing need to have an additional test for this.
>
> > Then it should be done as separate issue.
> >
>
> Yeah, I agree.  I think this can be done as a separate test patch to
> improve coverage if someone wants.
>
> >>
> >> >>
> >> >> Also should we include one test where one session is connected to db
> >> >> and another session tries dropping with -f option?
> >> >
> >> >
> >> > I'm afraid the test API doesn't allow asynchronous operations. Do you
> >> > have any idea how to do it?
> >> >
> >>
> >> I had seen that isolation test(src/test/isolation) has a framework to
> >> support this. You can have a look to see if it can be handled using
> >> that.
> >
> >
> > I'll look there
> >
>
> If we want to have a test for this, then you might want to look at
> test src/test/recovery/t/013_crash_restart.  In that test, we keep a
> connection open and then validate whether it is terminated.  Having
> said that, I think it might be better to add this as a separate test
> patch apart from main patch because it is a bit of a timing-dependent
> test and might fail on some slow machines.  We can always revert this
> if it turns out to be an unstable test.
>

+1


> I have slightly modified the main patch and attached is the result.
> Basically, I don't see any need to repeat what is mentioned in the
> Drop Database page.  Let me know what you guys think?
>

+1 from me - it all makes sense

>
> --
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com
>


Re: dropdb --force

2019-11-17 Thread Amit Kapila
On Mon, Nov 18, 2019 at 10:33 AM Pavel Stehule  wrote:
>
> po 18. 11. 2019 v 4:43 odesílatel vignesh C  napsal:
>>
>>
>> When we don't specify -e option, the query used to drop db will not be
>> printed like below:
>> ./dropdb testdb1
>> When we specify -e option, the query used to drop db will be printed like 
>> below:
>> ./dropdb -e testdb2
>> SELECT pg_catalog.set_config('search_path', '', false);
>> DROP DATABASE testdb2;
>> If we specify -e option, the query that is being used to drop db will
>> be printed. In the existing test I could not see the inclusion of -e
>> option. I was thinking to add a test including -e that way the query
>> that includes force option gets validated.
>
>
> still I don't understand. The created query is tested already by current test.
>
> Do you want to test just -e option?
>

Yeah, it seems Vignesh wants to do that.  It will help in verifying
that the command generated by code is correct.  However, I think there
is no pressing need to have an additional test for this.

> Then it should be done as separate issue.
>

Yeah, I agree.  I think this can be done as a separate test patch to
improve coverage if someone wants.

>>
>> >>
>> >> Also should we include one test where one session is connected to db
>> >> and another session tries dropping with -f option?
>> >
>> >
>> > I'm afraid the test API doesn't allow asynchronous operations. Do you have
>> > any idea how to do it?
>> >
>>
>> I had seen that isolation test(src/test/isolation) has a framework to
>> support this. You can have a look to see if it can be handled using
>> that.
>
>
> I'll look there
>

If we want to have a test for this, then you might want to look at
test src/test/recovery/t/013_crash_restart.  In that test, we keep a
connection open and then validate whether it is terminated.  Having
said that, I think it might be better to add this as a separate test
patch apart from main patch because it is a bit of a timing-dependent
test and might fail on some slow machines.  We can always revert this
if it turns out to be an unstable test.

I have slightly modified the main patch and attached is the result.
Basically, I don't see any need to repeat what is mentioned in the
Drop Database page.  Let me know what you guys think?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


dropdb-f-20191118.amit.patch
Description: Binary data


Re: dropdb --force

2019-11-17 Thread Pavel Stehule
po 18. 11. 2019 v 4:43 odesílatel vignesh C  napsal:

> On Sat, Nov 16, 2019 at 1:25 PM Pavel Stehule 
> wrote:
> >> >
> >> > updated patch attached
> >> >
> >>
> >> Thanks Pavel  for providing updated version.
> >> Few comments:
> >> I felt the help text  seems incomplete:
> >> @@ -159,6 +167,7 @@ help(const char *progname)
> >>  printf(_("\nOptions:\n"));
> >>  printf(_("  -e, --echoshow the commands being
> >> sent to the server\n"));
> >>  printf(_("  -i, --interactive prompt before deleting
> anything\n"));
> >> +printf(_("  -f, --force   try to terminate other
> >> connection before\n"));
> >>  printf(_("  -V, --version output version information,
> >> then exit\n"));
> >> we can change to:
> >> printf(_("  -f, --force   try to terminate other
> >> connection before dropping\n"));
> >>
> >
> > done. maybe alternative can be "first try to terminate other
> connections". It is shorter. The current text has 78 chars, what should be
> acceptable
> >
> >>
> >> We can add one test including -e option which validates the command
> >> generation including WITH (FORCE):
> >> +$node->safe_psql('postgres', 'CREATE DATABASE foobar2');
> >> +$node->issues_sql_like(
> >> +[ 'dropdb', '--force', 'foobar2' ],
> >> +qr/statement: DROP DATABASE foobar2 WITH \(FORCE\);/,
> >> +'SQL DROP DATABASE (FORCE) run');
> >> +
> >
> >
> > I don't understand to this point. It is effectively same like existing
> test
> >
>
> When we don't specify -e option, the query used to drop db will not be
> printed like below:
> ./dropdb testdb1
> When we specify -e option, the query used to drop db will be printed like
> below:
> ./dropdb -e testdb2
> SELECT pg_catalog.set_config('search_path', '', false);
> DROP DATABASE testdb2;
> If we specify -e option, the query that is being used to drop db will
> be printed. In the existing test I could not see the inclusion of -e
> option. I was thinking to add a test including -e that way the query
> that includes force option gets validated.
>

still I don't understand. The created query is tested already by current
test.

Do you want to test just -e option? Then it should be done as separate
issue. Doing this now is a little bit messy.


> >>
> >> Also should we include one test where one session is connected to db
> >> and another session tries dropping with -f option?
> >
> >
> > I'm afraid the test API doesn't allow asynchronous operations. Do you have
> > any idea how to do it?
> >
>
> I had seen that isolation test(src/test/isolation) has a framework to
> support this. You can have a look to see if it can be handled using
> that.
>

I'll look there


> Regards,
> Vignesh
> EnterpriseDB: http://www.enterprisedb.com
>


Re: [HACKERS] WAL logging problem in 9.4.3?

2019-11-17 Thread Noah Misch
On Tue, Nov 05, 2019 at 02:53:35PM -0800, Noah Misch wrote:
> I started pre-commit editing on 2019-10-28, and comment+README updates have
> been the largest part of that.  I'll check my edits against the things you
> list here, and I'll share on-list before committing.  I've now marked the CF
> entry Ready for Committer.

Having dedicated many days to that, I am attaching v24nm.  I know of two
remaining defects:

=== Defect 1: gistGetFakeLSN()

When I modified pg_regress.c to use wal_level=minimal for all suites,
src/test/isolation/specs/predicate-gist.spec failed the assertion in
gistGetFakeLSN().  One could reproduce the problem just by running this
sequence in psql:

  begin;
  create table gist_point_tbl(id int4, p point);
  create index gist_pointidx on gist_point_tbl using gist(p);
  insert into gist_point_tbl (id, p)
  select g, point(g*10, g*10) from generate_series(1, 1000) g;

I've included a wrong-in-general hack to make the test pass.  I see two main
options for fixing this:

(a) Introduce an empty WAL record that reserves an LSN and has no other
effect.  Make GiST use that for permanent relations that are skipping WAL.
Further optimizations are possible.  For example, we could use a backend-local
counter (like the one gistGetFakeLSN() uses for temp relations) until the
counter is greater than a recent real LSN.  That optimization is probably too
clever, though it would make the new WAL record almost never appear.

(b) Exempt GiST from most WAL skipping.  GiST index build could still skip
WAL, but it would do its own smgrimmedsync() in addition to the one done at
commit.  Regular GiST mutations would test RELPERSISTENCE_PERMANENT instead of
RelationNeedsWal(), and we'd need some hack for index_copy_data() and possibly
other AM-independent code that skips WAL.

Overall, I like the cleanliness of (a).  The main argument for (b) is that it
ensures we have all the features to opt-out of WAL skipping, which could be
useful for out-of-tree index access methods.  (I think we currently have the
features for a tableam to do so, but not for an indexam to do so.)  Overall, I
lean toward (a).  Any other ideas or preferences?

=== Defect 2: repetitive work when syncing many relations

For deleting relfilenodes, smgrDoPendingDeletes() collects a list for
smgrdounlinkall() to pass to DropRelFileNodesAllBuffers(), which is
sophisticated about optimizing the shared buffers scan.  Commit 279628a
introduced that, in 2013.  I think smgrDoPendingSyncs() should do likewise, to
further reduce the chance of causing performance regressions.  (One could,
however, work around the problem by raising wal_skip_threshold.)  Kyotaro, if
you agree, could you modify v24nm to implement that?


Notable changes in v24nm:

- Wrote section "Skipping WAL for New RelFileNode" in
  src/backend/access/transam/README to be the main source concerning the new
  coding rules.

- Updated numerous comments and doc sections.

- Eliminated the pendingSyncs list in favor of a "sync" field in
  pendingDeletes.  I mostly did this to eliminate the possibility of the lists
  getting out of sync.  This removed considerable parallel code for managing a
  second list at end-of-xact.  We now call smgrDoPendingSyncs() only when
  committing or preparing a top-level transaction.

- Whenever code sets an rd_*Subid field of a Relation, it must call
  EOXactListAdd().  swap_relation_files() was not doing so, so the field
  remained set during the next transaction.  I introduced
  RelationAssumeNewRelfilenode() to handle both tasks, and I located the call
  so it also affects the mapped relation case.

- In ALTER TABLE cases where TryReuseIndex() avoided an index rebuild,
  rd_createSubid remained set.  (That happened before this patch, but it has
  been harmless.)  I fixed this in heap_create().

- Made smgrDoPendingSyncs() stop exempting FSM_FORKNUM.  A sync is necessary
  when checksums are enabled.  Observe the precedent that
  RelationCopyStorage() has not been exempting FSM_FORKNUM.

- Pass log_newpage_range() a "false" for page_std, for the same reason
  RelationCopyStorage() does.

- log_newpage_range() ignored its forkNum and page_std arguments, so we logged
  the wrong data for non-main forks.  Before this patch, callers always passed
  MAIN_FORKNUM and "true", hence the lack of complaints.

- Restored table_finish_bulk_insert(), though heapam no longer provides a
  callback.  The API is still well-defined, and other table AMs might have use
  for it.  Removing it feels like a separate proposal.

- Removed TABLE_INSERT_SKIP_WAL.  Any out-of-tree code using it should revisit
  itself in light of this patch.

- Fixed smgrDoPendingSyncs() to reinitialize total_blocks for each relation;
  it was overcounting.

- Made us skip WAL after SET TABLESPACE, like we do after CLUSTER.

- Moved the wal_skip_threshold docs from "Resource Consumption" -> "Disk" to
  "Write Ahead Log" -> "Settings", between similar settings
  

Re: dropdb --force

2019-11-17 Thread vignesh C
On Sat, Nov 16, 2019 at 1:25 PM Pavel Stehule  wrote:
>> >
>> > updated patch attached
>> >
>>
>> Thanks Pavel  for providing updated version.
>> Few comments:
>> I felt the help text  seems incomplete:
>> @@ -159,6 +167,7 @@ help(const char *progname)
>>  printf(_("\nOptions:\n"));
>>  printf(_("  -e, --echoshow the commands being
>> sent to the server\n"));
>>  printf(_("  -i, --interactive prompt before deleting 
>> anything\n"));
>> +printf(_("  -f, --force   try to terminate other
>> connection before\n"));
>>  printf(_("  -V, --version output version information,
>> then exit\n"));
>> we can change to:
>> printf(_("  -f, --force   try to terminate other
>> connection before dropping\n"));
>>
>
> done. maybe alternative can be "first try to terminate other connections". It 
> is shorter. The current text has 78 chars, what should be acceptable
>
>>
>> We can add one test including -e option which validates the command
>> generation including WITH (FORCE):
>> +$node->safe_psql('postgres', 'CREATE DATABASE foobar2');
>> +$node->issues_sql_like(
>> +[ 'dropdb', '--force', 'foobar2' ],
>> +qr/statement: DROP DATABASE foobar2 WITH \(FORCE\);/,
>> +'SQL DROP DATABASE (FORCE) run');
>> +
>
>
> I don't understand to this point. It is effectively same like existing test
>

When we don't specify -e option, the query used to drop db will not be
printed like below:
./dropdb testdb1
When we specify -e option, the query used to drop db will be printed like below:
./dropdb -e testdb2
SELECT pg_catalog.set_config('search_path', '', false);
DROP DATABASE testdb2;
If we specify -e option, the query that is being used to drop db will
be printed. In the existing test I could not see the inclusion of -e
option. I was thinking to add a test including -e that way the query
that includes force option gets validated.

>>
>> Also should we include one test where one session is connected to db
>> and another session tries dropping with -f option?
>
>
> I'm afraid the test API doesn't allow asynchronous operations. Do you have
> any idea how to do it?
>

I had seen that isolation test(src/test/isolation) has a framework to
support this. You can have a look to see if it can be handled using
that.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com




Re: Reverse collations (initially for making keyset pagination cover more cases)

2019-11-17 Thread David Fetter
On Sun, Nov 17, 2019 at 11:23:23PM +, Andrew Gierth wrote:
> > "David" == David Fetter  writes:
> 
> First, in testing the patch I found there were indeed some missing
> cases: the sortsupport version of the comparator needs to be fixed too.
> I attach a draft addition to your patch, you should probably look at
> adding test cases that need this to work.
> 
>  David> (a, b, c) < ($1, $2 COLLATE "C_backwards", $3)
>  David> ...
>  David> ORDER BY a, b DESC, c
> 
> That would have to be:
> 
>  WHERE (a, b COLLATE "C_backwards", c) < ($1, $2, $3)
>  ...
>  ORDER BY a, b COLLATE "C_backwards", c
> 
> Adding the below patch to yours, I can get this on the regression test
> db (note that this is a -O0 asserts build, timings may be slow relative
> to a production build):
> 
> create collation "C_rev" ( LOCALE = "C", REVERSE = true );
> create index on tenk1 (hundred, (stringu1::text collate "C_rev"), string4);
> 
> explain analyze
>   select hundred, stringu1::text, string4
> from tenk1
>where (hundred, stringu1::text COLLATE "C_rev", string4)
>> (10, 'WK', 'xx')
>order by hundred, (stringu1::text collate "C_rev"), string4
>limit 5;
>QUERY 
> PLAN   
> 
>  Limit  (cost=0.29..1.28 rows=5 width=132) (actual time=0.029..0.038 rows=5 
> loops=1)
>->  Index Scan using tenk1_hundred_stringu1_string4_idx on tenk1  
> (cost=0.29..1768.49 rows=8900 width=132) (actual time=0.028..0.036 rows=5 
> loops=1)
>  Index Cond: (ROW(hundred, ((stringu1)::text)::text, string4) > 
> ROW(10, 'WK'::text, 'xx'::name))
>  Planning Time: 0.225 ms
>  Execution Time: 0.072 ms
> (5 rows)
> 
> and I checked the results, and they look correct now.

Here's that patch with your correction rolled in.

This will need more tests, and possibly more documentation.

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
>From 2096ea540a23c60b242e9f9c058d25e0d075f20a Mon Sep 17 00:00:00 2001
From: David Fetter 
Date: Sat, 16 Nov 2019 09:29:14 -0800
Subject: [PATCH v2] Reverse collations
To: hackers
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="2.23.0"

This is a multi-part message in MIME format.
--2.23.0
Content-Type: text/plain; charset=UTF-8; format=fixed
Content-Transfer-Encoding: 8bit


Make it possible to define collations which reverse the usual meanings
of <, <=, >, and >= for their corresponding collations.

This in turn makes it easier to do keyset pagination on text with mixes
of ASC and DESC in the ORDER BY.

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 55694c4368..3086936515 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -2106,6 +2106,13 @@ SCRAM-SHA-256$iteration count:
   Is the collation deterministic?
  
 
+ 
+  collisreverse
+  bool
+  
+  Is the collation reversed?
+ 
+
  
   collencoding
   int4
diff --git a/doc/src/sgml/ref/create_collation.sgml b/doc/src/sgml/ref/create_collation.sgml
index def4dda6e8..cb913871b7 100644
--- a/doc/src/sgml/ref/create_collation.sgml
+++ b/doc/src/sgml/ref/create_collation.sgml
@@ -24,7 +24,8 @@ CREATE COLLATION [ IF NOT EXISTS ] name (
 [ LC_CTYPE = lc_ctype, ]
 [ PROVIDER = provider, ]
 [ DETERMINISTIC = boolean, ]
-[ VERSION = version ]
+[ VERSION = version, ]
+[ REVERSE = boolean ]
 )
 CREATE COLLATION [ IF NOT EXISTS ] name FROM existing_collation
 
@@ -166,6 +167,16 @@ CREATE COLLATION [ IF NOT EXISTS ] name FROM 
 
 
+
+ REVERSE
+
+ 
+  
+   Specifies whether the collation should sort in reverse. Defaults to false.
+  
+ 
+
+
 
  existing_collation
 
@@ -225,6 +236,13 @@ CREATE COLLATION german_phonebook (provider = icu, locale = 'de-u-co-phonebk');
 
   
 
+  
+   To create a collation from the C locale that sorts backwards:
+
+CREATE COLLATION C_sdrawckab (locale = 'C', reverse = true);
+
+  
+
   
To create a collation from an existing collation:
 
diff --git a/src/backend/catalog/pg_collation.c b/src/backend/catalog/pg_collation.c
index dd99d53547..6198f77f36 100644
--- a/src/backend/catalog/pg_collation.c
+++ b/src/backend/catalog/pg_collation.c
@@ -47,6 +47,7 @@ CollationCreate(const char *collname, Oid collnamespace,
 Oid collowner,
 char collprovider,
 bool collisdeterministic,
+bool collisreverse,
 int32 collencoding,
 const char *collcollate, const char *collctype,
 const char *collversion,
@@ -162,6 +163,7 @@ CollationCreate(const char *collname, Oid collnamespace,
 	

Re: Invisible PROMPT2

2019-11-17 Thread Thomas Munro
On Mon, Nov 18, 2019 at 1:49 PM Alvaro Herrera  wrote:
> On 2019-Nov-18, Thomas Munro wrote:
> > Nice idea.  Here's one like that, that just does the counting at the
> > end and looks out for readline control codes.  It's pretty naive about
> > what "width" means though: you'll get two spaces for UTF-8 encoded é,
> > and I suppose a complete implementation would know about the half
> > width/full width thing for Chinese and Japanese etc.
>
> Hmm ... is this related to what Juan José posted at
> https://postgr.es/m/CAC+AXB28ADgwdNRA=aaowdypqo1dzr+5nto8ixgssfrxyvp...@mail.gmail.com
> ?  That's backend code of course, though.

Yeah.  Maybe pg_wcswidth() would be OK though, and it's available in
psql, though I guess you'd have to make a copy with the escaped bits
stripped out.




Re: Invisible PROMPT2

2019-11-17 Thread Alvaro Herrera
On 2019-Nov-18, Thomas Munro wrote:

> Nice idea.  Here's one like that, that just does the counting at the
> end and looks out for readline control codes.  It's pretty naive about
> what "width" means though: you'll get two spaces for UTF-8 encoded é,
> and I suppose a complete implementation would know about the half
> width/full width thing for Chinese and Japanese etc.

Hmm ... is this related to what Juan José posted at
https://postgr.es/m/CAC+AXB28ADgwdNRA=aaowdypqo1dzr+5nto8ixgssfrxyvp...@mail.gmail.com
?  That's backend code of course, though.

-- 
Álvaro Herrera  https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Reverse collations (initially for making keyset pagination cover more cases)

2019-11-17 Thread Andrew Gierth
> "David" == David Fetter  writes:

First, in testing the patch I found there were indeed some missing
cases: the sortsupport version of the comparator needs to be fixed too.
I attach a draft addition to your patch, you should probably look at
adding test cases that need this to work.

 David> (a, b, c) < ($1, $2 COLLATE "C_backwards", $3)
 David> ...
 David> ORDER BY a, b DESC, c

That would have to be:

 WHERE (a, b COLLATE "C_backwards", c) < ($1, $2, $3)
 ...
 ORDER BY a, b COLLATE "C_backwards", c

Adding the below patch to yours, I can get this on the regression test
db (note that this is a -O0 asserts build, timings may be slow relative
to a production build):

create collation "C_rev" ( LOCALE = "C", REVERSE = true );
create index on tenk1 (hundred, (stringu1::text collate "C_rev"), string4);

explain analyze
  select hundred, stringu1::text, string4
from tenk1
   where (hundred, stringu1::text COLLATE "C_rev", string4)
   > (10, 'WK', 'xx')
   order by hundred, (stringu1::text collate "C_rev"), string4
   limit 5;
   QUERY 
PLAN   

 Limit  (cost=0.29..1.28 rows=5 width=132) (actual time=0.029..0.038 rows=5 
loops=1)
   ->  Index Scan using tenk1_hundred_stringu1_string4_idx on tenk1  
(cost=0.29..1768.49 rows=8900 width=132) (actual time=0.028..0.036 rows=5 
loops=1)
 Index Cond: (ROW(hundred, ((stringu1)::text)::text, string4) > ROW(10, 
'WK'::text, 'xx'::name))
 Planning Time: 0.225 ms
 Execution Time: 0.072 ms
(5 rows)

and I checked the results, and they look correct now.

-- 
Andrew (irc:RhodiumToad)

diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index 02cbcbd23d..61ab9720c5 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -84,6 +84,7 @@ typedef struct
 	int			last_returned;	/* Last comparison result (cache) */
 	bool		cache_blob;		/* Does buf2 contain strxfrm() blob, etc? */
 	bool		collate_c;
+	bool		reverse;
 	Oid			typid;			/* Actual datatype (text/bpchar/bytea/name) */
 	hyperLogLogState abbr_card; /* Abbreviated key cardinality state */
 	hyperLogLogState full_card; /* Full key cardinality state */
@@ -2090,6 +2091,7 @@ varstr_sortsupport(SortSupport ssup, Oid typid, Oid collid)
 		/* Initialize */
 		sss->last_returned = 0;
 		sss->locale = locale;
+		sss->reverse = (locale != 0) && locale->reverse;
 
 		/*
 		 * To avoid somehow confusing a strxfrm() blob and an original string,
@@ -2401,6 +2403,9 @@ varstrfastcmp_locale(char *a1p, int len1, char *a2p, int len2, SortSupport ssup)
 		(!sss->locale || sss->locale->deterministic))
 		result = strcmp(sss->buf1, sss->buf2);
 
+	if (sss->reverse)
+		INVERT_COMPARE_RESULT(result);
+
 	/* Cache result, perhaps saving an expensive strcoll() call next time */
 	sss->cache_blob = false;
 	sss->last_returned = result;
@@ -2663,6 +2668,13 @@ done:
 	 */
 	res = DatumBigEndianToNative(res);
 
+	/*
+	 * Account for reverse-ordering locales by flipping the bits. Note that
+	 * Datum is an unsigned type (uintptr_t).
+	 */
+	if (sss->reverse)
+		res ^= ~(Datum)0;
+
 	/* Don't leak memory here */
 	if (PointerGetDatum(authoritative) != original)
 		pfree(authoritative);


Re: Reverse collations (initially for making keyset pagination cover more cases)

2019-11-17 Thread David Fetter
On Sun, Nov 17, 2019 at 02:30:35PM -0500, Tom Lane wrote:
> David Fetter  writes:
> > Please find attached a patch for $Subject.
> 
> I think there's a reason why this hasn't been proposed before.
> 
> Back before we had full support of ASC/DESC index sort order, there was
> interest in having reverse-sort operator classes, and there are bits and
> pieces still in the code that tried to cater to that.  But we never got
> it to the point where such things would really be pleasant to use.
> Now that we have ASC/DESC indexes, there's no value in a reverse-sort
> operator class, so the idea's pretty much been consigned to the dustbin.
> 
> This looks to me like it's trying to go down that same path at the
> collation level, and it seems like just as bad of an idea here.
> 
> The fundamental problem with what you propose is that it'd require
> a bunch of infrastructure (which you haven't even attempted) to teach
> the planner about the relationships between forward- and reverse-sort
> collation pairs, so that it could figure out that scanning some index
> backwards would satisfy a request for the reverse-sort collation,
> or vice versa.  Without such infrastructure, the feature is really
> just a gotcha, because queries won't get optimized the way users
> would expect them to.
> 
> And no, I don't think we should accept the feature and then go write
> that infrastructure.  If we couldn't make it work well at the opclass
> level, I don't think things will go better at the collation level.
> 
> Lastly, your proposed use-case has some attraction, but this proposal
> only supports it if the column you need to be differently sorted is
> textual.  What if the sort columns are all numerics and timestamps?

Those are pretty straightforward to generate: -column, and
-extract('epoch' FROM column), respectively.

> Thinking about that, it seems like what we'd want is some sort of
> more-general notion of row comparison, to express "bounded below in
> an arbitrary ORDER BY ordering".  Not quite sure what it ought to
> look like.

I'm not, either, but the one I'm proposing seems like a lot less
redundant code (and hence a lot less room for errors) than what people
generally see proposed for this use case, to wit:

(a, b, c) < ($1, $2 COLLATE "C_backwards", $3)
...
ORDER BY a, b DESC, c

as opposed to the "standard" way to do it

(a > $1) OR
(a = $1 AND b < $2) OR
(a = $1 AND b = $2 AND c > $3)
...
ORDER BY a, b DESC, c

which may not even get optimized correctly.

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate




Re: Invisible PROMPT2

2019-11-17 Thread Thomas Munro
On Fri, Nov 15, 2019 at 3:58 AM Tom Lane  wrote:
> Kyotaro Horiguchi  writes:
> > This seems assuming %x are a kind of stable (until semicolon)
> > function. But at least %`..` can be volatile.  So, I think the %w
> > thing in PROMPT2 should be able to refer the actual prompt string
> > resulted from PROMPT1.
>
> Oh, that's a good point.  But it actually leads to a much simpler
> definition and implementation than the other ideas we've kicked
> around: define %w as "whitespace equal to the length of the
> last-generated PROMPT1 string (initially empty)", and we just
> have to save PROMPT1 each time we generate it.
>
> Except ... I'm not sure how to deal with hidden escape sequences.
> We should probably assume that anything inside %[...%] has width
> zero, but how would we remember that?
>
> Maybe count the width of non-escape characters whenever we
> generate PROMPT1, and just save that number not the string?
> It'd add overhead that's useless when there's no %w, but
> probably not enough to care about.

Nice idea.  Here's one like that, that just does the counting at the
end and looks out for readline control codes.  It's pretty naive about
what "width" means though: you'll get two spaces for UTF-8 encoded é,
and I suppose a complete implementation would know about the half
width/full width thing for Chinese and Japanese etc.
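
A minimal sketch of that counting approach (not the attached patch): treat
anything between readline's RL_PROMPT_START_IGNORE ('\001') and
RL_PROMPT_END_IGNORE ('\002') markers, which is what psql's %[ ... %] escapes
turn into, as zero width, and count one column per remaining byte, with
exactly the multibyte/double-width naivety noted above:

#include <stdbool.h>

/* Visible width of a generated PROMPT1 string; PROMPT2's %w would then
 * emit this many spaces. */
static int
visible_prompt_width(const char *prompt)
{
    int     width = 0;
    bool    ignoring = false;

    for (const char *p = prompt; *p; p++)
    {
        if (*p == '\001')           /* RL_PROMPT_START_IGNORE */
            ignoring = true;
        else if (*p == '\002')      /* RL_PROMPT_END_IGNORE */
            ignoring = false;
        else if (!ignoring)
            width++;
    }
    return width;
}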


0001-Allow-invisible-PROMPT2-in-psql.patch
Description: Binary data


Re: could not stat promote trigger file leads to shutdown

2019-11-17 Thread Tom Lane
Alvaro Herrera  writes:
> On 2019-Nov-15, Tom Lane wrote:
>> If we add a GUC-check-hook test, then the problem of misconfiguration
>> is reduced to the previously unsolved problem that we have crappy
>> feedback for erroneous on-the-fly configuration changes.  So it's
>> still unsolved, but at least we've got one unsolved problem not two.

> I am now against this kind of behavior, because nowadays it is common
> to have external orchestrating systems stopping and starting postmaster
> on their own volition.

> If this kind of misconfiguration causes postmaster refuse to start, it
> can effectively become a service-shutdown scenario which requires the
> DBA to go temporarily mad.

By that argument, postgresql.conf could contain complete garbage
and the postmaster should still start.  I'm not willing to say
that an "external orchestrating system" doesn't need to take
responsibility for putting valid info into the config file.

regards, tom lane




Re: Reverse collations (initially for making keyset pagination cover more cases)

2019-11-17 Thread Andrew Gierth
> "Tom" == Tom Lane  writes:

 Tom> Lastly, your proposed use-case has some attraction, but this
 Tom> proposal only supports it if the column you need to be differently
 Tom> sorted is textual. What if the sort columns are all numerics and
 Tom> timestamps?

There are already trivial ways to reverse the orders of those, viz.
(-number) and (-extract(epoch from timestampcol)). The lack of any
equivalent method for text is what prompted this idea.

 Tom> Thinking about that, it seems like what we'd want is some sort of
 Tom> more-general notion of row comparison, to express "bounded below
 Tom> in an arbitrary ORDER BY ordering". Not quite sure what it ought
 Tom> to look like.

Well, one obvious completely general method is to teach the planner
(somehow) to spot conditions of the form

  (a > $1 OR (a = $1 AND b > $2) OR (a = $1 AND b = $2 AND c > $3) ...)
  
etc. and make them indexable if the sense of the > or < operator at
each step matched an ASC or DESC column in the index.

This would be a substantial win, because this kind of condition is one
often (incorrectly, for current PG) shown as an example of how to do
keyset pagination on multiple columns. But it would require some amount
of new logic in both the planner and, afaik, in the btree AM; I haven't
looked at how much.

-- 
Andrew (irc:RhodiumToad)




Re: Reverse collations (initially for making keyset pagination cover more cases)

2019-11-17 Thread Tom Lane
David Fetter  writes:
> Please find attached a patch for $Subject.

I think there's a reason why this hasn't been proposed before.

Back before we had full support of ASC/DESC index sort order, there was
interest in having reverse-sort operator classes, and there are bits and
pieces still in the code that tried to cater to that.  But we never got
it to the point where such things would really be pleasant to use.
Now that we have ASC/DESC indexes, there's no value in a reverse-sort
operator class, so the idea's pretty much been consigned to the dustbin.

This looks to me like it's trying to go down that same path at the
collation level, and it seems like just as bad of an idea here.

The fundamental problem with what you propose is that it'd require
a bunch of infrastructure (which you haven't even attempted) to teach
the planner about the relationships between forward- and reverse-sort
collation pairs, so that it could figure out that scanning some index
backwards would satisfy a request for the reverse-sort collation,
or vice versa.  Without such infrastructure, the feature is really
just a gotcha, because queries won't get optimized the way users
would expect them to.

And no, I don't think we should accept the feature and then go write
that infrastructure.  If we couldn't make it work well at the opclass
level, I don't think things will go better at the collation level.

Lastly, your proposed use-case has some attraction, but this proposal
only supports it if the column you need to be differently sorted is
textual.  What if the sort columns are all numerics and timestamps?
Thinking about that, it seems like what we'd want is some sort of
more-general notion of row comparison, to express "bounded below in
an arbitrary ORDER BY ordering".  Not quite sure what it ought to
look like.

regards, tom lane




Reverse collations (initially for making keyset pagination cover more cases)

2019-11-17 Thread David Fetter
Folks,

Please find attached a patch for $Subject.

Motivation:

When people are doing keyset pagination, the simple cases come down to
adding a WHERE clause that looks like

(a, b, c) > (most_recent_a, most_recent_b, most_recent_c)

which corresponds to an ORDER BY clause that looks like

ORDER BY a, b, c

The fun starts when there are mixes of ASC and DESC in the ORDER BY
clause. Reverse collations make this simpler by inverting the meaning
of > (and friends), which keeps the row values sortable in a single
dictionary order, so the pagination condition would look like:

(a, b, c) > (most_recent_a, most_recent_b COLLATE "C_backwards", most_recent_c)

with an ORDER BY like:

ORDER BY a, b DESC, c
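
For concreteness, with a hypothetical table t(a int, b text, c int)
whose text column otherwise uses the "C" collation, and a reverse
collation created with the syntax this patch adds, the full query would
look something like:

CREATE COLLATION "C_backwards" (locale = 'C', reverse = true);

SELECT a, b, c
  FROM t
 WHERE (a, b COLLATE "C_backwards", c)
     > (most_recent_a, most_recent_b COLLATE "C_backwards", most_recent_c)
 ORDER BY a, b COLLATE "C_backwards", c
 LIMIT 25;

which paginates in the same order as ORDER BY a, b DESC, c.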

What say?

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
>From cd603421da0d8e4fc19401f24f46da8a26d88d3c Mon Sep 17 00:00:00 2001
From: David Fetter 
Date: Sat, 16 Nov 2019 09:29:14 -0800
Subject: [PATCH v1] Reverse collations
To: hackers
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="2.23.0"

This is a multi-part message in MIME format.
--2.23.0
Content-Type: text/plain; charset=UTF-8; format=fixed
Content-Transfer-Encoding: 8bit


Make it possible to define collations which reverse the usual meanings
of <, <=, >, and >= for their corresponding collations.

This in turn makes it easier to do keyset pagination on text with mixes
of ASC and DESC in the ORDER BY.

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 55694c4368..3086936515 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -2106,6 +2106,13 @@ SCRAM-SHA-256$iteration count:
   Is the collation deterministic?
  
 
+ 
+  collisreverse
+  bool
+  
+  Is the collation reversed?
+ 
+
  
   collencoding
   int4
diff --git a/doc/src/sgml/ref/create_collation.sgml b/doc/src/sgml/ref/create_collation.sgml
index def4dda6e8..cb913871b7 100644
--- a/doc/src/sgml/ref/create_collation.sgml
+++ b/doc/src/sgml/ref/create_collation.sgml
@@ -24,7 +24,8 @@ CREATE COLLATION [ IF NOT EXISTS ] name (
 [ LC_CTYPE = lc_ctype, ]
 [ PROVIDER = provider, ]
 [ DETERMINISTIC = boolean, ]
-[ VERSION = version ]
+[ VERSION = version, ]
+[ REVERSE = boolean ]
 )
 CREATE COLLATION [ IF NOT EXISTS ] name FROM existing_collation
 
@@ -166,6 +167,16 @@ CREATE COLLATION [ IF NOT EXISTS ] name FROM 
 
 
+
+ REVERSE
+
+ 
+  
+   Specifies whether the collation should sort in reverse. Defaults to false.
+  
+ 
+
+
 
  existing_collation
 
@@ -225,6 +236,13 @@ CREATE COLLATION german_phonebook (provider = icu, locale = 'de-u-co-phonebk');
 
   
 
+  
+   To create a collation from the C locale that sorts backwards:
+
+CREATE COLLATION C_sdrawckab (locale = 'C', reverse = true);
+
+  
+
   
To create a collation from an existing collation:
 
diff --git a/src/backend/catalog/pg_collation.c b/src/backend/catalog/pg_collation.c
index dd99d53547..6198f77f36 100644
--- a/src/backend/catalog/pg_collation.c
+++ b/src/backend/catalog/pg_collation.c
@@ -47,6 +47,7 @@ CollationCreate(const char *collname, Oid collnamespace,
 Oid collowner,
 char collprovider,
 bool collisdeterministic,
+bool collisreverse,
 int32 collencoding,
 const char *collcollate, const char *collctype,
 const char *collversion,
@@ -162,6 +163,7 @@ CollationCreate(const char *collname, Oid collnamespace,
 	values[Anum_pg_collation_collowner - 1] = ObjectIdGetDatum(collowner);
 	values[Anum_pg_collation_collprovider - 1] = CharGetDatum(collprovider);
 	values[Anum_pg_collation_collisdeterministic - 1] = BoolGetDatum(collisdeterministic);
+	values[Anum_pg_collation_collisreverse - 1] = BoolGetDatum(collisreverse);
 	values[Anum_pg_collation_collencoding - 1] = Int32GetDatum(collencoding);
 	namestrcpy(&name_collate, collcollate);
 	values[Anum_pg_collation_collcollate - 1] = NameGetDatum(&name_collate);
diff --git a/src/backend/commands/collationcmds.c b/src/backend/commands/collationcmds.c
index 919e092483..ab37705b27 100644
--- a/src/backend/commands/collationcmds.c
+++ b/src/backend/commands/collationcmds.c
@@ -60,11 +60,13 @@ DefineCollation(ParseState *pstate, List *names, List *parameters, bool if_not_e
 	DefElem*lcctypeEl = NULL;
 	DefElem*providerEl = NULL;
 	DefElem*deterministicEl = NULL;
+	DefElem*reverseEl = NULL;
 	DefElem*versionEl = NULL;
 	char	   *collcollate = NULL;
 	char	   *collctype = NULL;
 	char	   *collproviderstr = NULL;
 	bool		collisdeterministic = true;
+	bool		collisreverse = false;
 	int			collencoding = 0;
 	char		collprovider = 0;
 	char	   *collversion = NULL;
@@ -95,6 +97,8 @@ DefineCollation(ParseState *pstate, List *names, List *parameters, bool if_not_e
 			defelp = &providerEl;
 		else if (strcmp(defel->defname, "deterministic") == 0)
 			defelp = &deterministicEl;
+		else if 

Re: Connections hang indefinitely while taking a gin index's LWLock buffer_content lock(PG10.7)

2019-11-17 Thread Alexander Korotkov
On Mon, Nov 11, 2019 at 2:42 AM Alexander Korotkov
 wrote:
> I'm sorry for late reply.  I was busy with various things.  Also
> digging into these details took some time.  Please find my explanation
> below.
>
> On Wed, Oct 30, 2019 at 2:34 AM Peter Geoghegan  wrote:
> > In general, it seems very important to be clear about exactly how the
> > key space works. The nbtree work for v12 greatly benefitted from
> > defining comparisons in a way that didn't really change how nbtree
> > worked, while at the same time minimizing I/O and making nbtree
> > faithful to Lehman & Yao's original design. It isn't obvious how
> > valuable it is to really carefully define how invariants and key
> > comparisons work, but it seems possible to solve a lot of problems
> > that way.
>
> gin packs downlinks and pivot keys into tuples in a different way
> than nbtree does.  Let's look at the logical structure of an internal
> B-tree page.  We can represent it as follows.
>
> downlink_1, key_1_2, downlink_2, key_2_3,  , downlink_n, highkey
>
> key_{i-1}_i is a pivot key, which splits the key space between
> downlink_{i-1} and downlink_i.  So, every key under downlink_{i-1} is
> <= key_{i-1}_i, every key under downlink_i is > key_{i-1}_i, and all
> underlying keys are <= highkey.
>
> nbtree packs that into tuples as follows (tuples are shown in parentheses).
>
> (highkey), (-inf, downlink_1), (key_1_2, downlink_2), ...
> (key_{n-1}_n,  downlink_n)
>
> So, we store highkey separately.  key_{i-1}_i and downlink_i form a
> tuple together.  downlink_1 is coupled with -inf key.
>
> gin packs tuples in a different way.
>
> (downlink_1, key_1_2), (downlink_2, key_2_3), ... , (downlink_n, highkey)
>
> So, in GIN key_{i-1}_i and downlink_{i-1} form a tuple.  Highkey is
> coupled with downlink_n.  And -inf key is not needed here.
>
> But there are a couple of notes about the highkey:
> 1) In the entry tree's rightmost page, the key coupled with downlink_n
> doesn't really matter.  The highkey is assumed to be infinity.
> 2) In the posting tree, the key coupled with downlink_n never matters.
> The highkey for non-rightmost pages is stored separately and accessed
> via GinDataPageGetRightBound().
>
> I think this explains the following:
> 1) The invariants in the GIN btree are the same as in nbtree; just the
> page layout is different.
> 2) The way entryLocateEntry() works, in particular why it's OK to
> follow the mid downlink if the corresponding key compares equal.
> 3) There is no "negative infinity" item in internal pages.
>
> Does the explanation above look clear to you?  If so, I can put it
> into the README.

So, I've put this explanation into the README patch.  I just changed
the notation to better match the Lehman & Yao paper and made some minor
enhancements.

I'm going to push this patchset if no objections.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


0002-Fix-traversing-to-the-deleted-GIN-page-via-downlin-4.patch
Description: Binary data


0001-Fix-deadlock-between-ginDeletePage-and-ginStepRigh-4.patch
Description: Binary data


0003-Revise-GIN-README-4.patch
Description: Binary data


Re: global / super barriers (for checksums)

2019-11-17 Thread Magnus Hagander
On Wed, Nov 13, 2019 at 8:45 PM Andres Freund  wrote:

> Hi,
>
> On 2019-11-13 12:26:34 -0500, Robert Haas wrote:
> > TL;DR: I'm not sure that we need 0001; I propose to commit 0002; and I
> > have some concerns about 0003 and am interested in working further on
> > it.
>
> Thanks for looking at the patch!
>

+1 (well, +, but there is a quota)


> > - The patch needs some general tidying up, like comments and naming
> > consistency and stuff like that.
>
> Yea. It really was a prototype to allow Magnus to continue...
>

And a very useful one! :) So thanks for that one as well.


> > Andres, Magnus, if neither of you are planning to work on this soon, I
> > might like to jump in and run with this. Please let me know your
> > thoughts.
>
> I'm not planning to do so in the near term - so I'd more than welcome
> you to do so.
>

I'm definitely happy to work with it, but I did not and do not feel I have
the skills for doing the "proper review" needed for it. So I am also very
happy for you to pick it up and run with it.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: SimpleLruTruncate() mutual exclusion

2019-11-17 Thread Dmitry Dolgov
> On Wed, Jul 31, 2019 at 11:51:17PM -0700, Noah Misch wrote:

Hi,

> > Also, if I understand the data-loss hazard properly, it's what you
> > said in the other thread: the latest_page_number could advance after
> > we make our decision about what to truncate, and then maybe we could
> > truncate newly-added data.  We surely don't want to lock out the
> > operations that can advance last_page_number, so how does serializing
> > vac_truncate_clog help?
> >
> > I wonder whether what we need to do is add some state in shared
> > memory saying "it is only safe to create pages before here", and
> > make SimpleLruZeroPage or its callers check that.  The truncation
> > logic would advance that value, but only after completing a scan.
> > (Oh ..., hmm, maybe the point of serializing truncations is to
> > ensure that we know that we can safely advance that?)
>
> vac_truncate_clog() already records "it is only safe to create pages before
> here" in ShmemVariableCache->xidWrapLimit, which it updates after any unlinks.
> The trouble comes when two vac_truncate_clog() run in parallel and you get a
> sequence of events like this:
>
> vac_truncate_clog() instance 1 starts, considers segment ABCD eligible to 
> unlink
> vac_truncate_clog() instance 2 starts, considers segment ABCD eligible to 
> unlink
> vac_truncate_clog() instance 1 unlinks segment ABCD
> vac_truncate_clog() instance 1 calls SetTransactionIdLimit()
> vac_truncate_clog() instance 1 finishes
> some backend calls SimpleLruZeroPage(), creating segment ABCD
> vac_truncate_clog() instance 2 unlinks segment ABCD
>
> Serializing vac_truncate_clog() fixes that.

I'm probably missing something, so I just wanted to clarify. Do I
understand correctly that thread [1] and this one are independent, and
that the scenario above assumes we're near the end of the XID space,
but not necessarily hitting rounding issues? I'm a bit confused, since
the reproducer script does something with cutoffPage, and I'm not sure
whether that's important or not.

> > Can you post whatever script you used to detect/reproduce the problem
> > in the first place?
>
> I've attached it, but it's pretty hackish.  Apply the patch on commit 7170268,
> make sure your --bindir is in $PATH, copy "conf-test-pg" to your home
> directory, and run "make trunc-check".  This incorporates xid-burn
> acceleration code that Jeff Janes shared with -hackers at some point.

I've tried to use it to understand the problem better, but somehow
couldn't reproduce the issue via the suggested script. I've applied it
to 7170268 (and also tried it rebased onto master, with the same
results) with conf-test-pg in place, but after going through all the
steps there are no errors like:

   ERROR:  could not access status of transaction 2149484247
   DETAIL:  Could not open file "pg_xact/0801": No such file or directory.

Is there anything extra one needs to do to reproduce the problem, maybe
adjust delays or something?

[1]: 
https://www.postgresql.org/message-id/flat/20190202083822.GC32531%40gust.leadboat.com




Re: SQL/JSON: JSON_TABLE

2019-11-17 Thread Pavel Stehule
Hi

út 12. 11. 2019 v 22:51 odesílatel Nikita Glukhov 
napsal:

> On 12.11.2019 20:54, Pavel Stehule wrote:
>
> > Hi
> >
> > please, can you rebase 0001-SQL-JSON-functions-v40.patch. I have a
> > problem with patching
> >
> > Pavel
>
> Attached 41th version of the patches rebased onto current master.
>

I'm testing the functionality, randomly trying some examples that I
found on the internet.

I found:

a) Oracle & MySQL support the EXISTS clause; this implementation does
not. I think it would be useful to support this clause too.

SELECT * FROM JSON_TABLE('...', '...' COLUMNS x INT EXISTS PATH ...

There is a question of how to map the boolean result to other data types.
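
For illustration, the Oracle/MySQL form I mean looks roughly like this
(their syntax, not something this patch accepts today); as far as I
understand, the column is 1 when the path matches and 0 when it does
not:

SELECT *
  FROM JSON_TABLE('[{"a": [1,2]}, {"b": 3}]', '$[*]'
         COLUMNS (has_a INT EXISTS PATH '$.a')) AS jt;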

b) When the searched value is not a scalar, it returns NULL. This
behavior can be suppressed by the FORMAT JSON clause. I found different
behavior, and maybe a bug. In MySQL this clause is the default for JSON
values (which makes sense).

SELECT *
 FROM
  JSON_TABLE(
'[{"a":[1,2]}]',
'$[*]'
COLUMNS(
 aj JSON PATH '$.a' DEFAULT '{"x": 333}' ON EMPTY
)
  ) AS tt;

It returns NULL, although it should return [1,2].

There is maybe another bug: although there is a DEFAULT clause, it
returns NULL.

I got the correct result when I used the FORMAT JSON clause. I think it
should be the default behavior for json and jsonb columns.
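
For reference, I mean a variant along these lines, with FORMAT JSON
between the data type and PATH (the standard position, assuming the
patch accepts it there) and without the DEFAULT clause:

SELECT *
 FROM
  JSON_TABLE(
    '[{"a":[1,2]}]',
    '$[*]'
    COLUMNS(
     aj JSON FORMAT JSON PATH '$.a'
    )
  ) AS tt;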

Another question: when I used the FORMAT JSON clause, I got a syntax
error on the DEFAULT keyword. Is that expected? Why can't I use the
FORMAT JSON and DEFAULT clauses together?

Note: this behavior is not described in the documentation.

Regards

Pavel




>
> --
> Nikita Glukhov
> Postgres Professional: http://www.postgrespro.com
> The Russian Postgres Company
>


Re: Append with naive multiplexing of FDWs

2019-11-17 Thread Thomas Munro
On Sat, Sep 28, 2019 at 4:20 AM Bruce Momjian  wrote:
> On Wed, Sep  4, 2019 at 06:18:31PM +1200, Thomas Munro wrote:
> > A few years back[1] I experimented with a simple readiness API that
> > would allow Append to start emitting tuples from whichever Foreign
> > Scan has data available, when working with FDW-based sharding.  I used
> > that primarily as a way to test Andres's new WaitEventSet stuff and my
> > kqueue implementation of that, but I didn't pursue it seriously
> > because I knew we wanted a more ambitious async executor rewrite and
> > many people had ideas about that, with schedulers capable of jumping
> > all over the tree etc.
> >
> > Anyway, Stephen Frost pinged me off-list to ask about that patch, and
> > asked why we don't just do this naive thing until we have something
> > better.  It's a very localised feature that works only between Append
> > and its immediate children.  The patch makes it work for postgres_fdw,
> > but it should work for any FDW that can get its hands on a socket.
> >
> > Here's a quick rebase of that old POC patch, along with a demo.  Since
> > 2016, Parallel Append landed, but I didn't have time to think about
> > how to integrate with that so I did a quick "sledgehammer" rebase that
> > disables itself if parallelism is in the picture.
>
> Yes, sharding has been waiting on parallel FDW scans.  Would this work
> for parallel partition scans if the partitions were FDWs?

Yeah, this works for partitions that are FDWs (as shown), but only for
Append, not for Parallel Append.  So you'd have parallelism in the
sense that your N remote shard servers are all doing stuff at the same
time, but it couldn't be in a parallel query on your 'home' server,
which is probably good for things that push down aggregation and bring
back just a few tuples from each shard, but bad for anything wanting
to ship back millions of tuples to chew on locally.  Do you think
that'd be useful enough on its own?

The problem is that parallel safe non-partial plans (like postgres_fdw
scans) are exclusively 'claimed' by one process under Parallel Append,
so with the patch as posted, if you modify it to allow parallelism
then it'll probably give correct answers but nothing prevents a single
process from claiming and starting all the scans and then waiting for
them to be ready, while the other processes miss out on doing any work
at all.  There's probably some kludgy solution involving not letting
any one worker start more than X, and some space cadet solution
involving passing sockets around and teaching libpq to hand over
connections at certain controlled phases of the protocol (due to lack
of threads), but nothing like that has jumped out as the right path so
far.

One idea that seems promising but requires a bunch more infrastructure
is to offload the libpq multiplexing to a background worker that owns
all the sockets, and have it push tuples into a multi-consumer shared
memory queue that regular executor processes could read from.  I have
been wondering if that would be best done by each FDW implementation,
or if there is a way to make a generic infrastructure for converting
parallel-safe executor nodes into partial plans by the use of a
'Scatter' (opposite of Gather) node that can spread the output of any
node over many workers.

If you had that, you'd still want a way for Parallel Append to be
readiness-based, but it would probably look a bit different to this
patch because it'd need to use (vapourware) multiconsumer shm queue
readiness, not fd readiness.  And another kind of fd-readiness
multiplexing would be going on inside the new (vapourware) worker that
handles all the libpq connections (and maybe other kinds of work for
other FDWs that are able to expose a socket).