Re: Use standard SIGHUP and SIGTERM handlers in autoprewarm module

2020-10-05 Thread Michael Paquier
On Mon, Oct 05, 2020 at 11:34:05PM +0900, Fujii Masao wrote:
> ISTM that we can also replace StartupProcSigHupHandler() in startup.c
> with SignalHandlerForConfigReload() by making the startup process use
> the general shared latch instead of its own one. POC patch attached.
> Thought?

That looks good to me.  Nice cleanup.

> Probably we can also replace sigHupHandler() in syslogger.c with
> SignalHandlerForConfigReload()? This would be separate patch, though.

+1.
--
Michael




Re: Use standard SIGHUP and SIGTERM handlers in autoprewarm module

2020-10-05 Thread Fujii Masao




On 2020/10/06 1:18, Bharath Rupireddy wrote:

On Mon, Oct 5, 2020 at 8:04 PM Fujii Masao  wrote:


On 2020/10/05 19:45, Bharath Rupireddy wrote:

Hi,

The autoprewarm module uses its own SIGHUP (apw_sighup_handler, got_sighup)
and SIGTERM (apw_sigterm_handler, got_sigterm) handlers, which are similar to
the standard signal handlers (except for a difference [1]). Isn't it better to
remove them and use the standard SignalHandlerForConfigReload and
SignalHandlerForShutdownRequest?
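
For readers not familiar with the generic handlers, here is a minimal sketch,
assuming the usual background-worker setup in autoprewarm.c, of what such a
replacement typically looks like (the helper names and main-loop fragment are
illustrative only; the actual change is in the attached patch):

#include "postgres.h"
#include <signal.h>
#include "postmaster/bgworker.h"
#include "postmaster/interrupt.h"
#include "storage/ipc.h"
#include "utils/guc.h"

/* illustrative helper, not the attached patch */
static void
apw_setup_signal_handlers(void)
{
	pqsignal(SIGHUP, SignalHandlerForConfigReload);
	pqsignal(SIGTERM, SignalHandlerForShutdownRequest);
	BackgroundWorkerUnblockSignals();
}

/* illustrative main-loop fragment, replacing the got_sighup / got_sigterm checks */
static void
apw_check_interrupts(void)
{
	if (ShutdownRequestPending)
		proc_exit(0);

	if (ConfigReloadPending)
	{
		ConfigReloadPending = false;
		ProcessConfigFile(PGC_SIGHUP);
	}
}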

Attaching the patch for the above changes.

Looks like the commit[2] replaced custom handlers with standard handlers.

Thoughts?


+1

The patch looks good to me.



Thanks.



ISTM that we can also replace StartupProcSigHupHandler() in startup.c
with SignalHandlerForConfigReload() by making the startup process use
the general shared latch instead of its own one. POC patch attached.
Thought?



I'm not quite sure whether this breaks something or not. I see that
WakeupRecovery(), which sets the startup process's
XLogCtl->recoveryWakeupLatch latch, is also used from the walreceiver
process. I may be wrong, but I have some concern about whether the
behaviour differs in the EXEC_BACKEND and Windows cases.


Unless I'm wrong, regarding MyLatch the behavior is the same whether
EXEC_BACKEND is used or not. In both cases, SwitchToSharedLatch()
is called and MyLatch is set to the shared latch, i.e., MyProc->procLatch.




Another concern is that we always use XLogCtl->recoveryWakeupLatch in
shared mode, which makes sense as this latch is also used by the
walreceiver. But sometimes MyLatch is created in non-shared mode as
well (see InitLatch(MyLatch)).


Yes, and then the startup process calls SwitchToSharedLatch() and
repoints MyLatch to the shared one.
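
For reference, a simplified sketch of what that switch amounts to (the real
SwitchToSharedLatch() in miscinit.c does a bit more, such as updating the
wait-event set); this is only meant to show why the standard handlers, which
call SetLatch(MyLatch), wake the startup process once it uses the shared latch:

/* simplified illustration, not the actual miscinit.c code */
static void
switch_to_shared_latch_sketch(void)
{
	Assert(MyProc != NULL);

	/* From here on, MyLatch is the shared per-process latch ... */
	MyLatch = &MyProc->procLatch;

	/* ... so set it once, in case a signal arrived during the switch. */
	SetLatch(MyLatch);
}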



Others may have better thoughts though.


Okay, I will reconsider the patch and post it separately later if necessary.






Probably we can also replace sigHupHandler() in syslogger.c with
SignalHandlerForConfigReload()? This would be separate patch, though.



+1 to replacing sigHupHandler() with SignalHandlerForConfigReload(), as
the latch and the functionality are pretty much the same.

WalReceiverMain(): I think we can also replace WalRcvShutdownHandler()
with SignalHandlerForShutdownRequest(), because walrcv->latch points to
&MyProc->procLatch, which is what MyLatch points to as well.

Thoughts? If okay, we can combine these into a single patch. I will
post an updated patch soon.


+1. Or it's also fine to make each patch separately.
Anyway, thanks!

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?

2020-10-05 Thread Michael Paquier
On Mon, Oct 05, 2020 at 11:23:50PM -0400, Bruce Momjian wrote:
> On Tue, Oct  6, 2020 at 11:11:27AM +0800, Julien Rouhaud wrote:
>> Maybe we could add a new hook for only queryid computation, and add a
>> GUC to let people choose between no queryid computed, core computation
>> (current pg_stat_statement) and 3rd party plugin?
> 
> That all seems very complicated.  If we go in that direction, I suggest
> we just give up getting any of this into core.

A GUC would at least have the advantage of making the computation
consistent for any system willing to consume it, with the option of not
paying any potential performance impact.  Though I have to admit that
just moving the query ID computation of PGSS into core may not be the
best option, as a query ID of 0 means the same thing for a utility, for
an initialization, and for a backend running a query with an unknown
value; but that could be worked out.

FWIW, I think that adding the system ID to the hash is too
restrictive, as it could be interesting for users to do stat
comparisons across multiple systems running the same major version.
It would be better not to give any strong guarantee that the computed
query ID will remain consistent across major versions, so that it
remains possible to keep improving it.  Also, if nothing has been done
that changes the hashing computation, I see little benefit in forcing a
breakage by adding something like PG_MAJORVERSION_NUM or such to the
hash computation.
--
Michael




Re: fixing old_snapshot_threshold's time->xid mapping

2020-10-05 Thread Thomas Munro
On Fri, Sep 25, 2020 at 2:00 PM Michael Paquier  wrote:
> On Thu, Sep 24, 2020 at 03:46:14PM -0400, Robert Haas wrote:
> > Committed.
>
> Cool, thanks.

+1

> > Thomas, with respect to your part of this patch set, I wonder if we
> > can make the functions that you're using to write tests safe enough
> > that we could add them to contrib/old_snapshot and let users run them
> > if they want. As you have them, they are hedged around with vague and
> > scary warnings, but is that really justified? And if so, can it be
> > fixed? It would be nicer not to end up with two loadable modules here,
> > and maybe the right sorts of functions could even have some practical
> > use.

Yeah, you may be right.  I am thinking about that.  In the meantime,
here is a rebase.  A quick recap of these remaining patches:

0001 replaces the current "magic test mode" that didn't really test
anything with a new test mode that verifies pruning and STO behaviour.
0002 fixes a separate bug that Andres reported: the STO XID map
suffers from wraparound-itis.
0003 adds a simple smoke test for Robert's commit 55b7e2f4.  Before
that fix landed, it failed.

> I have switched this item as waiting on author in the CF app then, as
> we are not completely done yet.

Thanks.  For the record, I think there is still one more complaint
from Andres that remains unaddressed even once these are in the tree:
there are thought to be some more places that lack
TestForOldSnapshot() calls.
From d2909b13737cc74c1b3d6896ac4375d611f32df2 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Mon, 20 Apr 2020 16:23:02 +1200
Subject: [PATCH v10 1/3] Rewrite the "snapshot_too_old" tests.

Previously the snapshot too old feature used a special test value for
old_snapshot_threshold.  Instead, use a new approach based on explicitly
advancing the "current" timestamp used in snapshot-too-old book keeping,
so that we can control the timeline precisely without having to resort
to sleeping or special test branches in the code.

Also check that early pruning actually happens, by vacuuming and
inspecting the visibility map at key points in the test schedule.

Discussion: https://postgr.es/m/CA%2BTgmoY%3Daqf0zjTD%2B3dUWYkgMiNDegDLFjo%2B6ze%3DWtpik%2B3XqA%40mail.gmail.com
---
 src/backend/utils/time/snapmgr.c  |  21 ---
 src/test/modules/snapshot_too_old/Makefile|  23 +--
 .../expected/sto_using_cursor.out |  75 -
 .../expected/sto_using_hash_index.out |  26 ++-
 .../expected/sto_using_select.out | 157 +++---
 .../specs/sto_using_cursor.spec   |  30 ++--
 .../specs/sto_using_hash_index.spec   |  19 ++-
 .../specs/sto_using_select.spec   |  50 --
 src/test/modules/snapshot_too_old/sto.conf|   2 +-
 .../snapshot_too_old/test_sto--1.0.sql|  14 ++
 src/test/modules/snapshot_too_old/test_sto.c  |  74 +
 .../modules/snapshot_too_old/test_sto.control |   5 +
 12 files changed, 366 insertions(+), 130 deletions(-)
 create mode 100644 src/test/modules/snapshot_too_old/test_sto--1.0.sql
 create mode 100644 src/test/modules/snapshot_too_old/test_sto.c
 create mode 100644 src/test/modules/snapshot_too_old/test_sto.control

diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 8c41483e87..1d80770136 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -1774,23 +1774,6 @@ TransactionIdLimitedForOldSnapshots(TransactionId recentXmin,
 	next_map_update_ts = oldSnapshotControl->next_map_update;
 	SpinLockRelease(&oldSnapshotControl->mutex_latest_xmin);
 
-	/*
-	 * Zero threshold always overrides to latest xmin, if valid.  Without some
-	 * heuristic it will find its own snapshot too old on, for example, a
-	 * simple UPDATE -- which would make it useless for most testing, but
-	 * there is no principled way to ensure that it doesn't fail in this way.
-	 * Use a five-second delay to try to get useful testing behavior, but this
-	 * may need adjustment.
-	 */
-	if (old_snapshot_threshold == 0)
-	{
-		if (TransactionIdPrecedes(latest_xmin, MyProc->xmin)
-			&& TransactionIdFollows(latest_xmin, xlimit))
-			xlimit = latest_xmin;
-
-		ts -= 5 * USECS_PER_SEC;
-	}
-	else
 	{
 		ts = AlignTimestampToMinuteBoundary(ts)
 			- (old_snapshot_threshold * USECS_PER_MINUTE);
@@ -1883,10 +1866,6 @@ MaintainOldSnapshotTimeMapping(TimestampTz whenTaken, TransactionId xmin)
 	if (!map_update_required)
 		return;
 
-	/* No further tracking needed for 0 (used for testing). */
-	if (old_snapshot_threshold == 0)
-		return;
-
 	/*
 	 * We don't want to do something stupid with unusual values, but we don't
 	 * want to litter the log with warnings or break otherwise normal
diff --git a/src/test/modules/snapshot_too_old/Makefile b/src/test/modules/snapshot_too_old/Makefile
index dfb4537f63..81836e9953 100644
--- a/src/test/modules/snapshot_too_old/Makefile
+++ b/src/test/modules/snapshot_too_old/Makefile
@@ -1,14 +1,22 @@
 # 

Re: [HACKERS] Custom compression methods

2020-10-05 Thread Dilip Kumar
On Mon, Oct 5, 2020 at 9:34 PM Tomas Vondra
 wrote:
>
> On Mon, Oct 05, 2020 at 07:57:41PM +0530, Dilip Kumar wrote:
> >On Mon, Oct 5, 2020 at 5:53 PM Tomas Vondra
> > wrote:
> >>
> >> On Mon, Oct 05, 2020 at 11:17:28AM +0530, Dilip Kumar wrote:
> >> >On Mon, Oct 5, 2020 at 3:37 AM Tomas Vondra
> >> > wrote:
> >> >
> >> >Thanks, Tomas for your feedback.
> >> >
> >> >> 9) attcompression ...
> >> >>
> >> >> The main issue I see is what the patch does with attcompression. Instead
> >> >> of just using it to store the compression method, it's also used to
> >> >> store the preserved compression methods. And using NameData to store
> >> >> this seems wrong too - if we really want to store this info, the correct
> >> >> way is either using text[] or inventing charvector or similar.
> >> >
> >> >The reason for using the NameData is to get it in the fixed part of
> >> >the data structure.
> >> >
> >>
> >> Why do we need that? It's possible to have varlena fields with direct
> >> access (see pg_index.indkey for example).
> >
> >I see.  While making it NameData I was wondering whether we have an
> >option to directly access the varlena. Thanks for pointing me there. I
> >will change this.
> >
> >> Adding NameData just to make
> >> it fixed-length means we're always adding 64B even if we just need a
> >> single byte, which means ~30% overhead for the FormData_pg_attribute.
> >> That seems a bit unnecessary, and might be an issue with many attributes
> >> (e.g. with many temp tables, etc.).
> >
> >You are right.  Even I did not like to keep 64B for this, so I will change 
> >it.
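
For reference, a rough sketch of the catalog pattern Tomas is pointing at: a
variable-length field declared before the CATALOG_VARLEN section, as is done
for pg_index.indkey, so that it can still be read directly from the C struct.
Applying it to attcompression here is purely illustrative (names and OIDs are
made up), not part of any posted patch:

CATALOG(pg_attribute_sketch,0,AttributeSketchRelationId)
{
	Oid			attrelid;		/* relation containing this attribute */
	NameData	attname;		/* attribute name */
	/* ... other fixed-width fields ... */

	/*
	 * Variable-length fields start here, but, as with pg_index.indkey,
	 * a field placed before CATALOG_VARLEN can still be accessed
	 * directly in the struct.
	 */
	oidvector	attcompression;	/* current + preserved compression AM OIDs */

#ifdef CATALOG_VARLEN			/* variable-length fields follow */
	text		attoptions[1];
#endif
} FormData_pg_attribute_sketch;
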
> >
> >>
> >> >> But to me this seems very much like a misuse of attcompression to track
> >> >> dependencies on compression methods, necessary because we don't have a
> >> >> separate catalog listing compression methods. If we had that, I think we
> >> >> could simply add dependencies between attributes and that catalog.
> >> >
> >> >Basically, up to this patch, we are having only built-in compression
> >> >methods and those can not be dropped so we don't need any dependency
> >> >at all.  We just want to know what the current compression method is
> >> >and which preserved compression methods are supported for this
> >> >attribute.  Maybe we can do it better than using the NameData,
> >> >but I don't think it makes sense to add a separate catalog?
> >> >
> >>
> >> Sure, I understand what the goal was - all I'm saying is that it looks
> >> very much like a workaround needed because we don't have the catalog.
> >>
> >> I don't quite understand how could we support custom compression methods
> >> without listing them in some sort of catalog?
> >
> >Yeah for supporting custom compression we need some catalog.
> >
> >> >> Moreover, having the catalog would allow adding compression methods
> >> >> (from extensions etc) instead of just having a list of hard-coded
> >> >> compression methods. Which seems like a strange limitation, considering
> >> >> this thread is called "custom compression methods".
> >> >
> >> >I think I forgot to mention while submitting the previous patch that
> >> >the next patch I am planning to submit is support for creating custom
> >> >compression methods, wherein we can use the pg_am catalog to insert the
> >> >new compression method.  And for dependency handling, we can create an
> >> >attribute dependency on the pg_am row.  Basically, we will create the
> >> >attribute dependency on the current compression method's AM as well as
> >> >on the preserved compression methods' AMs.  As part of this, we will
> >> >add two built-in AMs for zlib and pglz, and the attcompression field
> >> >will be converted to an oidvector (the first OID will be that of the
> >> >current compression method, followed by the preserved compression
> >> >methods' OIDs).
> >> >
> >>
> >> Hmmm, ok. Not sure pg_am is the right place - compression methods don't
> >> quite match what I thought AMs are about, but maybe it's just my fault.
> >>
> >> FWIW it seems a bit strange to first do the attcompression magic and
> >> then add the catalog later - I think we should start with the catalog
> >> right away. The advantage is that if we end up committing only some of
> >> the patches in this cycle, we already have all the infrastructure etc.
> >> We can reorder that later, though.
> >
> >Hmm, yeah, we can do it this way as well: first create a new catalog
> >table and add entries for these two built-in methods, and let
> >attcompression store the oid vector.  But if we only commit the
> >built-in compression methods part, then does it make sense to create an
> >extra catalog, or should we add these built-in methods to the existing
> >catalog (if we plan to use pg_am)?  Then in attcompression, instead of
> >using one byte for each preserved compression method, we need to use an
> >OID.  So from Robert's mail[1], it appeared to me that he wants the
> >built-in compression methods part to be independently committable,
> >and if we think from that perspective then adding 

Re: Parallel Inserts in CREATE TABLE AS

2020-10-05 Thread Amit Kapila
On Mon, Sep 28, 2020 at 3:58 PM Bharath Rupireddy
 wrote:
>
> Thanks Andres for the comments.
>
> On Thu, Sep 24, 2020 at 8:11 AM Andres Freund  wrote:
> >
> > > The design:
> >
> > I think it'd be good if you could explain a bit more why you think this
> > safe to do in the way you have done it.
> >
> > E.g. from a quick scroll through the patch, there's not even a comment
> > explaining that the only reason there doesn't need to be code dealing
> > with xid assignment because we already did the catalog changes to create
> > the table.
> >
>
> Yes, we do a bunch of catalog changes related to the newly created table.
> We will have both the txn id and command id assigned when the catalog
> changes are being made. But, right after the table is created in the
> leader, the command id is incremented (CommandCounterIncrement() is
> called from create_ctas_internal()) whereas the txn id remains the
> same. The new command id is obtained with GetCurrentCommandId(true) in
> intorel_startup, and then parallel mode is entered. The txn id and
> command id are serialized into the parallel DSM, and they are then
> available to all parallel workers. This is discussed in [1].
>
> Few changes I have to make in the parallel worker code: set
> currentCommandIdUsed = true;, may be via a common API
> SetCurrentCommandIdUsedForWorker() proposed in [1] and remove the
> extra command id sharing from the leader to workers.
>
> I will add a few comments in the upcoming patch related to the above info.
>

Yes, that would be good.

> >
> > But how does that work for SELECT INTO? Are you prohibiting
> > that? ...
> >
>
> In the case of SELECT INTO, a new table gets created and I'm not
> prohibiting the parallel inserts; I think we don't need to.
>

So, in this case also, do we ensure that the table is created before we
launch the workers? If so, I think you can explain in the comments how
you ensure that.

While skimming through the patch, a small thing I noticed:
+ /*
+ * SELECT part of the CTAS is parallelizable, so we can make
+ * each parallel worker insert the tuples that are resulted
+ * in it's execution into the target table.
+ */
+ if (!is_matview &&
+ IsA(plan->planTree, Gather))
+ ((DR_intorel *) dest)->is_parallel = true;
+

I am not sure at this stage if this is the best way to make CTAS
parallel, but if so, then probably you can expand the comments a bit to
say why you consider only a Gather node (and only when it is the
top-most node) and not another parallel node like GatherMerge.
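
(Just to illustrate the question: if GatherMerge turned out to be safe
here too, the check might read roughly as below. Whether that is actually
correct is exactly what the expanded comments should explain; this is not
a suggestion taken from the patch itself.)

	if (!is_matview && plan->planTree != NULL &&
		(IsA(plan->planTree, Gather) ||
		 IsA(plan->planTree, GatherMerge)))
		((DR_intorel *) dest)->is_parallel = true;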

> Thoughts?
>
> >
> > > Below things are still pending. Thoughts are most welcome:
> > > 1. How better we can lift the "cannot insert tuples in a parallel worker"
> > > from heap_prepare_insert() for only CTAS cases or for that matter parallel
> > > copy? How about having a variable in any of the worker global contexts and
> > > use that? Of course, we can remove this restriction entirely in case we
> > > fully allow parallelism for INSERT INTO SELECT, CTAS, and COPY.
> >
> > And for the purpose of your question, we could then have a
> > table_insert_allow_parallel(TableInsertScan *);
> > or an additional arg to table_begin_insert().
> >
>
> Removing "cannot insert tuples in a parallel worker" restriction from
> heap_prepare_insert() is a common problem for parallel inserts in
> general, i.e. parallel inserts in CTAS, parallel INSERT INTO
> SELECTs[1] and parallel copy[2]. It will be good if a common solution
> is agreed.
>

Right, for now, I think you can simply remove that check from the code
instead of just commenting it out. We will see if there is a better
check/Assert we can add there.
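
(For context, the restriction being discussed is the parallel-worker check
near the top of heap_prepare_insert(); paraphrased from memory, so treat the
exact shape as approximate rather than a verbatim excerpt:)

	/* in heap_prepare_insert(), paraphrased */
	if (IsParallelWorker())
		ereport(ERROR,
				(errcode(ERRCODE_INVALID_TRANSACTION_STATE),
				 errmsg("cannot insert tuples in a parallel worker")));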

-- 
With Regards,
Amit Kapila.




Re: [HACKERS] logical decoding of two-phase transactions

2020-10-05 Thread Peter Smith
Hello Ajin.

I have done some review of the v6 patches.

I had some difficulty sending my review comments as a reply to the OSS list,
so I am putting them in an attachment here.

Kind Regards,
Peter Smith
Fujitsu Australia
Hello Ajin.

I have gone through the v6 patch changes and have a list of review
comments below.

Apologies for the length of this email - I know that many of the
following comments are trivial, but I figured I should either just
ignore everything cosmetic, or list everything regardless. I chose the
latter.

There may be some duplication where the same review comment is written
for multiple files and/or where the same file appears in multiple of
your patches.

Kind Regards.
Peter Smith
Fujitsu Australia

[BEGIN]

==
Patch V6-0001, File: contrib/test_decoding/expected/prepared.out (so
prepared.sql also)
==

COMMENT
Line 30 - The INSERT INTO test_prepared1 VALUES (2); is kind of
strange because it is really part of neither the prior test nor the
following test. Maybe it would be better to have a comment describing
the purpose of this isolated INSERT and to also consume the data from
the slot so it does not get jumbled with the data of the following
(abort) test.

;

COMMENT
Line 53 - Same comment for this test's INSERT INTO test_prepared1 VALUES
(4); it really has nothing to do with either the prior (abort) test
or the following (ddl) test.

;

COMMENT
Line 60 - Seems to check which locks are held for the test_prepared_1
table while the transaction is in progress. Maybe it would be better
to have more comments describing what is expected here and why.

;

COMMENT
Line 88 - There is a comment in the test saying "-- We should see '7'
before '5' in our results since it commits first." but I did not see
any test code that actually verifies that happens.

;

QUESTION
Line 120 - I did not really understand the SQL checking pg_class.
I expected this would be checking the table 'test_prepared1' instead. Can
you explain it?
SELECT 'pg_class' AS relation, locktype, mode
FROM pg_locks
WHERE locktype = 'relation'
AND relation = 'pg_class'::regclass;
relation | locktype | mode
--+--+--
(0 rows)

;

QUESTION
Line 139 - SET statement_timeout = '1s'; is 1 second short enough
here for this test, or might it be that these statements would be
completed in less than one second anyhow?

;

QUESTION
Line 163 - How is this testing a SAVEPOINT? Or is it only to check
that the SAVEPOINT command is not part of the replicated changes?

;

COMMENT
Line 175 - Missing underscore in the comment. The code requires the
underscore as well:
"nodecode" --> "_nodecode"

==
Patch V6-0001, File: contrib/test_decoding/test_decoding.c
==

COMMENT
Line 43
@@ -36,6 +40,7 @@ typedef struct
bool skip_empty_xacts;
bool xact_wrote_changes;
bool only_local;
+ TransactionId check_xid; /* track abort of this txid */
} TestDecodingData;

The "check_xid" seems a meaningless name. Check what?
IIUC maybe should be something like "check_xid_aborted"

;

COMMENT
Line 105
@ -88,6 +93,19 @@ static void
pg_decode_stream_truncate(LogicalDecodingContext *ctx,
 ReorderBufferTXN *txn,
 int nrelations, Relation relations[],
 ReorderBufferChange *change);
+static bool pg_decode_filter_prepare(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,

Remove extra blank line after these functions

;

COMMENT
Line 149
@@ -116,6 +134,11 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 cb->stream_change_cb = pg_decode_stream_change;
 cb->stream_message_cb = pg_decode_stream_message;
 cb->stream_truncate_cb = pg_decode_stream_truncate;
+ cb->filter_prepare_cb = pg_decode_filter_prepare;
+ cb->prepare_cb = pg_decode_prepare_txn;
+ cb->commit_prepared_cb = pg_decode_commit_prepared_txn;
+ cb->abort_prepared_cb = pg_decode_abort_prepared_txn;
+
 }

There is a confusing mix of terminology where sometimes things are
referred to as ROLLBACK/rollback and other times apparently the same
operation is referred to as ABORT/abort. I do not know the root cause of
this mixture. IIUC maybe the internal functions and protocol generally
use the term "abort", whereas the SQL syntax is "ROLLBACK"... but
where those two terms collide in the middle it gets quite confusing.

At least I thought the names of the "callbacks" which get exposed to
the user (e.g. in the help) might be better if they would match the
SQL.
"abort_prepared_cb" --> "rollback_prepared_db"

There are similar review comments like this below where the
alternating terms caused me some confusion.

~

Also, remove the extra blank line before the end of the function.

;

COMMENT
Line 267
@ -227,6 +252,42 @@ pg_decode_startup(LogicalDecodingContext *ctx,
OutputPluginOptions *opt,
 errmsg("could not parse value \"%s\" for parameter \"%s\"",
 strVal(elem->arg), elem->defname)));
 }
+ else if (strcmp(elem->defname, "two-phase-commit") == 0)
+ {
+ if (elem->arg == NULL)
+ continue;

IMO the "check-xid" code might be better rearranged so the NULL check
is first instead of if/else.
e.g.
if (elem->arg == 

Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?

2020-10-05 Thread Michael Paquier
On Mon, Oct 05, 2020 at 05:24:06PM -0400, Bruce Momjian wrote:
> (Also, did we decide _not_ to make the pg_stat_statements queryid
> always a positive value?)

This specific point has been discussed a couple of years ago, please
see cff440d and its related thread:
https://www.postgresql.org/message-id/ca+tgmobg_kp4cbkfmsznuaam1gww6hhrnizc0kjrmooeynz...@mail.gmail.com
--
Michael




Re: 回复:how to create index concurrently on partitioned table

2020-10-05 Thread Michael Paquier
On Mon, Oct 05, 2020 at 10:07:33PM -0500, Justin Pryzby wrote:
> Honestly, I think you're over-thinking and over-engineering indisclustered.
> 
> If "clusteredness" was something we offered to maintain across DML, I think
> that might be important to provide stronger guarantees.  As it is now, I don't
> think this patch is worth changing the catalog definition.

Well, this use case is new because we are discussing the relationship
of indisclustered across multiple transactions for multiple indexes,
so I'd rather have this discussion than not, and I have learnt
the hard way with REINDEX that we should care a lot about the
consistency of partition trees at any step of the operation.  Let's
imagine a simple example here, take this partition tree: p (parent),
and two partitions p1 and p2.  p has two partitioned indexes i and j,
indexes also present in p1 and p2 as i1, i2, j1 and j2.  Let's assume
that the user has done a CLUSTER on p USING i that completes, meaning
that i, i1 and i2 have indisclustered set.  Now let's assume that the
user does a CLUSTER on p USING j this time, and that this command
fails while processing p2, meaning that indisclustered is set for j1,
i2, and perhaps i or j depending on what the patch does.  Per the
latest arguments, j would be the one set to indisclustered.  From this
inconsistent state comes a couple of interesting things:
- A database-wide CLUSTER would finish by using j1 and i2 for the
operation on the partitions, while the intention was to use j2 for the
second partition as the previous command failed.
- With CLUSTER p, without USING.  Logically, I would assume that we
would rely on the value of indisclustered as of j, meaning that we
would *enforce* p2 to use j2.  But it could also be seen as incorrect
by the user because we would not use the index originally marked as
such.

So keeping this consistent has the advantage of giving us clear rules
here.

> I think it would be strange if we refused "ALTER..CLUSTER ON" for a partition
> just because a different partitioned index was set clustered.  We'd clear 
> that,
> like always, and then (in my proposal) also clear its parents 
> "indisclustered".
> I still don't think that's essential, though.

Why?  Preventing a partition, which may itself be partitioned, from
switching to a different index when its partitioned parent uses something
else sounds logical to me in the end, because the user originally
intended to use CLUSTER with a specific index on this tree.  So I
would say that the partitioned table takes priority, and this should
be released with a WITHOUT CLUSTER on the partitioned table.

> I didn't think it's worth the overhead of closing and opening more CFs.
> But I don't mind.

Thanks, I'll do some cleanup.
--
Michael




Re: Resetting spilled txn statistics in pg_stat_replication

2020-10-05 Thread Masahiko Sawada
On Mon, 5 Oct 2020 at 17:50, Amit Kapila  wrote:
>
> On Mon, Oct 5, 2020 at 1:26 PM Masahiko Sawada
>  wrote:
> >
> > On Sat, 3 Oct 2020 at 16:55, Amit Kapila  wrote:
> > >
> > > On Sat, Oct 3, 2020 at 9:26 AM Masahiko Sawada
> > >  wrote:
> > > >
> > > > When we discussed this before, I was thinking that we could have other
> > > > statistics for physical slots in the same statistics view in the
> > > > future. Having the view show only logical slots also makes sense to me,
> > > > but I’m a bit concerned that we could break backward compatibility:
> > > > monitoring tools etc. will be affected when the view starts to
> > > > show physical slots too.
> > > >
> > >
> > > I think that would happen anyway as we need to add more columns in
> > > view for the physical slots.
> >
> > I think it depends; adding more columns to the view might not break
> > tools if the query used in the tool explicitly specifies columns.
> >
>
> What if it uses Select * ...? It might not be advisable to assume how
> the user might fetch data.
>
> > OTOH
> > if the view starts to show more rows, the tool will need to have the
> > condition to get the same result as before.
> >
> > >
> > > > If the view shows only logical slots, it also
> > > > might be worth considering to have separate views for logical slots
> > > > and physical slots and having pg_stat_logical_replication_slots by
> > > > this change.
> > > >
> > >
> > > I am not sure at this stage but I think we will add the additional
> > > stats for physical slots once we have any in this view itself. I would
> > > like to avoid adding separate views if possible. The only reason to
> > > omit physical slots at this stage is that we don't have any stats for
> > > the same.
> >
> > I also prefer not to have separate views. I'm concerned about the
> > compatibility issue I explained above, but at the same time I agree that
> > it doesn't make sense to show stats that always have nothing. Given that
> > you and Dilip agreed on that, I also agree with it.
> >
>
> Okay.
>
> > >
> > > > Here is my review comment on the v7 patch.
> > > >
> > > > +   /*
> > > > +* Set the same reset timestamp for all replication slots too.
> > > > +*/
> > > > +   for (i = 0; i < max_replication_slots; i++)
> > > > +   replSlotStats[i].stat_reset_timestamp =
> > > > globalStats.stat_reset_timestamp;
> > > > +
> > > >
> > > > You added back the above code but since we clear the timestamps on
> > > > creation of a new slot they are not shown:
> > > >
> > > > +   /* Register new slot */
> > > > +   memset([nReplSlotStats], 0, 
> > > > sizeof(PgStat_ReplSlotStats));
> > > > +   memcpy([nReplSlotStats].slotname, name, NAMEDATALEN);
> > > >
> > > > Looking at other statistics views such as pg_stat_slru,
> > > > pg_stat_bgwriter, and pg_stat_archiver, they have a valid
> > > > reset_timestamp value from the beginning. That's why I removed that
> > > > code and assigned the timestamp when registering a new slot.
> > > >
> > >
> > > Hmm, I don't think it is shown intentionally in those views. I think
> > > what is happening in other views is that it has been initialized with
> > > some value when we read the stats and then while updating and or
> > > writing because we don't change the stat_reset_timestamp, it displays
> > > the same value as initialized at the time of read. Now, because in
> > > pgstat_replslot_index() we always initialize the replSlotStats it
> > > would overwrite any previous value we have set during read and display
> > > the stat_reset as empty for replication slots. If we stop initializing
> > > the replSlotStats in pgstat_replslot_index() then we will see similar
> > > behavior as other views have. So even if we want to change this, then
> > > probably we should stop the initialization in pgstat_replslot_index, but
> > > I don't think that is necessarily better behavior, because the
> > > description of the parameter doesn't indicate any such thing.
> >
> > Understood. I agreed that the newly created slot doesn't have
> > reset_timestamp. Looking at pg_stat_database, a view whose rows are
> > added dynamically unlike other stat views, the newly created database
> > doesn't have reset_timestamp. But given we clear the stats for a slot
> > at pgstat_replslot_index(), why do we need to initialize the
> > reset_timestamp with globalStats.stat_reset_timestamp when reading the
> > stats file? Even if we could not find any slot stats in the stats file
> > the view won’t show anything.
> >
>
> It was mainly from a code consistency point of view. Also, we
> clear the data in pgstat_replslot_index only for new slots, not for
> existing slots. It might be used when we can't load the statsfile, as
> per the comment in the code ("Set the current timestamp (will be kept
> only in case we can't load an existing statsfile)").
>

Understood.
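
(For reference, the registration-time alternative described above would look
roughly like the following; this is an illustrative sketch, not a concrete
proposal:)

	/* Register new slot, stamping its reset time immediately (sketch) */
	memset(&replSlotStats[nReplSlotStats], 0, sizeof(PgStat_ReplSlotStats));
	memcpy(&replSlotStats[nReplSlotStats].slotname, name, NAMEDATALEN);
	replSlotStats[nReplSlotStats].stat_reset_timestamp = GetCurrentTimestamp();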

Looking at pgstat_reset_replslot_counter() in the v8 patch, even if we
pass a physical slot name to pg_stat_reset_replication_slot() a
PgStat_MsgResetreplslotcounter is 

Re: Logical Replication - detail message with names of missing columns

2020-10-05 Thread Amit Kapila
On Mon, Oct 5, 2020 at 8:00 PM Bharath Rupireddy
 wrote:
>
> Thanks Amit for the review comments. I will post an updated patch soon.
>
> On Mon, Oct 5, 2020 at 9:39 AM Amit Kapila  wrote:
> >
> > 6. I think we should add one test case for this in the existing
> > regression test (src/test/subscription/t/008_diff_schema).
> >
>
> This patch logs the missing-column-names message in the subscriber server
> logs. Is there a way to see the logs in these tests and use that as
> the expected result for the test case?
>

I don't think there is any direct way to achieve that. What we can do
is check that the data is not replicated in such a case, but I don't
think it is worth adding a test for that behavior. So, I think we can
skip adding a test for this unless someone else has a better idea.

-- 
With Regards,
Amit Kapila.




Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?

2020-10-05 Thread Bruce Momjian
On Tue, Oct  6, 2020 at 11:11:27AM +0800, Julien Rouhaud wrote:
> > I do think the queryid has to display independent of pg_stat_statements,
> > because I can see people using queryid for log file and pg_stat_activity
> > comparisons.  I also think the ability to have queryid accessible is an
> > important feature outside of pg_stat_statements, so I do think we need a
> > way to move this idea forward.
> 
> For the record, for now any extension can compute a queryid and there
> are at least 2 other published extensions that already do that, one of
> them having different semantics on how to compute the queryid.  I'm
> not sure that we'll ever get a consensus on those semantics due to
> performance tradeoff, so removing the ability to let people put their
> own code for that doesn't seem like the best way forward.
> 
> Maybe we could add a new hook for only queryid computation, and add a
> GUC to let people choose between no queryid computed, core computation
> (current pg_stat_statement) and 3rd party plugin?

That all seems very complicated.  If we go in that direction, I suggest
we just give up getting any of this into core.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?

2020-10-05 Thread Julien Rouhaud
On Tue, Oct 6, 2020 at 10:18 AM Bruce Momjian  wrote:
>
> On Mon, Oct  5, 2020 at 07:58:42PM -0300, Álvaro Herrera wrote:
> > On 2020-Oct-05, Tom Lane wrote:
> >
> > > FWIW, I think this proposal is a mess.  I was willing to hold my nose
> > > and have a queryId field in the internal Query struct without any solid
> > > consensus about what its semantics are and which extensions get to use it.
> > > Exposing it to end users seems like a bridge too far, though.  In
> > > particular, I'm afraid that that will cause people to expect it to have
> > > consistent values across PG versions, or even just across architectures
> > > within one version.
> >
> > I wonder if it would help to purposefully change the computation so that
> > it is not -- for instance, hash the system_identifier as initial value.
> > Then users would be forced to accept that it'll change as soon as it
> > migrates to another server or is upgraded to a new major version.
>
> That seems like a good idea, but it would prevent cross-cluster
> same-major-version comparisons, which seems like a negative.  Perhaps we
> should add the major version into the hash to handle this.  Ideally,
> let's just put a queryid-hash-version into the hash, so if we change
> the computation, we just update the hash version and nothing matches
> anymore.
>
> I do think the queryid has to display independent of pg_stat_statements,
> because I can see people using queryid for log file and pg_stat_activity
> comparisons.  I also think the ability to have queryid accessible is an
> important feature outside of pg_stat_statements, so I do think we need a
> way to move this idea forward.

For the record, for now any extension can compute a queryid and there
are at least 2 other published extensions that already do that, one of
them having different semantics on how to compute the queryid.  I'm
not sure that we'll ever get a consensus on those semantics due to
performance tradeoff, so removing the ability to let people put their
own code for that doesn't seem like the best way forward.

Maybe we could add a new hook for only queryid computation, and add a
GUC to let people choose between no queryid computed, core computation
(current pg_stat_statement) and 3rd party plugin?




Re: 回复:how to create index concurrently on partitioned table

2020-10-05 Thread Justin Pryzby
On Tue, Oct 06, 2020 at 11:42:55AM +0900, Michael Paquier wrote:
> On Mon, Oct 05, 2020 at 03:08:32PM -0500, Justin Pryzby wrote:
> > It means that we might do N catalog updates for a partition hierarchy
> > that's N
> > levels deep.  Normally, N=2, and we'd clear indisclustered for the index as
> > well as its parent.  This is not essential, though.
> 
> Hmm.  I got to think more about this one, and being able to ensure a
> consistent state of indisclustered for partitioned tables and all
> their partitions across multiple transactions at all times would be a
> nice property, as we could easily track down if an operation has
> failed in-flight.  The problem here is that we are limited by
> indisclustered being a boolean, so we may set indisclustered one way
> or another on some partitions, or on some parent partitioned tables,
> but we are not sure if what's down is actually clustered for the
> leaves of a partitioned index, or not.  Or maybe we even have an
> inconsistent state, so this does not provide a good user experience.
> The consistency of a partition tree is a key thing, and we have such 
> guarantees with REINDEX (with or without concurrently), so what I'd
> like to think that what makes sense for indisclustered on a
> partitioned index is to have the following set of guarantees, and
> relying on indisvalid as being true iff an index partition tree is
> complete on a given table partition tree is really important:
> - If indisclustered is set to true for a partitioned index, then we
> have the guarantee that all its partition and partitioned indexes have
> indisclustered set to true, across all the layers down to the leaves.
> - If indisclustered is set to false for a partitioned index, then we
> have the guarantee that all its partition and partitioned indexes have
> indisclustered set to false.
> 
> If we happen to attach a new partition to a partitioned table of such
> a tree, I guess that it is then logic to have indisclustered set to
> the same value as the partitioned index when the partition inherits an
> index.  So, it seems to me that with the current catalogs we are
> limited by indisclustered as being a boolean to maintain some kind of
> consistency across transactions of CLUSTER, as we would use one
> transaction per leaf for the work.  In order to make things better
> here, we could switch indisclustered to a char, able to use three
> states:
> - 'e' or enabled, equivalent to indisclustered == true.  There should
> be only one index on a table with 'e' set at a given time.
> - 'd' or disabled, equivalent to indisclustered == false.
> - a new third state, for an equivalent of work-in-progress, let's call
> it 'w', or whatever.
> 
> Then, when we begin a CLUSTER on a partitioned table, I would imagine
> the following:
> - Switch all the indexes selected to 'w' in a first transaction, and
> do not reset the state of other indexes if there is one 'e'.  For
> CLUSTER without USING, we switch the existing 'e' to 'w' if there is
> such an index.  If there are no indexes select-able, issue an error.
> If we find an index with 'w', we have a candidate from a previous
> failed command, so this gets used.  I don't think that this design
> requires a new DDL either as we could reset all 'w' and 'e' to 'd' if
> using ALTER TABLE SET WITHOUT CLUSTER on a partitioned table.  For
> CLUSTER with USING, the partitioned index selected and its leaves are
> switched to 'w', similarly.
> - Then do the work for all the partitions, one partition per
> transaction, keeping 'w'.
> - In a last transaction, switch all the partitions from 'w' to 'e',
> resetting any existing 'e'.
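
(To make the quoted three-state proposal concrete, it could be spelled out
roughly as below; the macro names are invented for illustration and are not
from any posted patch:)

/* illustrative names only */
#define INDISCLUSTERED_DISABLED		'd'	/* not the clustering index */
#define INDISCLUSTERED_ENABLED		'e'	/* the clustering index; at most one per table */
#define INDISCLUSTERED_IN_PROGRESS	'w'	/* CLUSTER of a partition tree in flight */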

Honestly, I think you're over-thinking and over-engineering indisclustered.

If "clusteredness" were something we offered to maintain across DML, I think
it might be important to provide stronger guarantees.  As it is now, I don't
think this patch is worth changing the catalog definition for.

> ALTER TABLE CLUSTER ON should also be a supported operation, where 'e'
> gets switched for all the partitions in a single transaction.  Of
> course, this should not work for an invalid index.  Even with such a
> design, I got to wonder if there could be cases where a user does
> *not* want to cluster the leaves of a partition tree with the same
> index.  If that were to happen, the partition tree may need a redesign,
> so making things work such that we force partitions to inherit what's
> wanted for the partitioned table makes the most sense to me.

I wondered the same thing: should clustering a parent index cause the child
indexes to be marked as clustered?  Or should it be possible for an
intermediate child to stay (marked as) clustered on some other index?  I don't
have strong feelings, but the patch currently sets indisclustered as a natural
consequence of the implementation, so I tentatively think that's what's right.
After all, CLUSTER sets indisclustered without having to also say
"ALTER..CLUSTER ON".

This is relevant to the question I 

Re: enable_incremental_sort changes query behavior

2020-10-05 Thread James Coleman
On Mon, Oct 5, 2020 at 11:33 AM James Coleman  wrote:
>
> On Sun, Oct 4, 2020 at 9:40 PM James Coleman  wrote:
> >
> > On Sat, Oct 3, 2020 at 10:10 PM James Coleman  wrote:
> > >
> > > On Sat, Oct 3, 2020 at 5:44 PM Tomas Vondra
> > >  wrote:
> > > >
> > > > On Sat, Oct 03, 2020 at 10:50:06AM -0400, James Coleman wrote:
> > > > >On Fri, Oct 2, 2020 at 11:16 PM James Coleman  wrote:
> > > > >>
> > > > >> On Fri, Oct 2, 2020 at 7:07 PM James Coleman  
> > > > >> wrote:
> > > > >> >
> > > > >> > On Fri, Oct 2, 2020 at 6:28 PM Tomas Vondra
> > > > >> >  wrote:
> > > > >> > >
> > > > >> > > On Fri, Oct 02, 2020 at 05:45:52PM -0400, James Coleman wrote:
> > > > >> > > >On Fri, Oct 2, 2020 at 4:56 PM Tomas Vondra
> > > > >> > > > wrote:
> > > > >> > > >>
> > > > >> > > >> ...
> > > > >> > > >>
> > > > >> > > >> More importanly, it does not actually fix the issue - it does 
> > > > >> > > >> fix that
> > > > >> > > >> particular query, but just replacing the DISTINCT with either 
> > > > >> > > >> ORDER BY
> > > > >> > > >> or GROUP BY make it fail again :-(
> > > > >> > > >>
> > > > >> > > >> Attached is a simple script I used, demonstrating these 
> > > > >> > > >> issues for the
> > > > >> > > >> three cases that expect to have ressortgroupref != 0 (per the 
> > > > >> > > >> comment
> > > > >> > > >> before TargetEntry in plannodes.h).
> > > > >> > > >
> > > > >> > > >So does checking for volatile expressions (if you happened to 
> > > > >> > > >test
> > > > >> > > >that) solve all the cases? If you haven't tested that yet, I 
> > > > >> > > >can try
> > > > >> > > >to do that this evening.
> > > > >> > > >
> > > > >> > >
> > > > >> > > Yes, it does fix all the three queries in the SQL script.
> > > > >> > >
> > > > >> > > The question however is whether this is the root issue, and 
> > > > >> > > whether it's
> > > > >> > > the right way to fix it. For example - volatility is not the 
> > > > >> > > only reason
> > > > >> > > that may block qual pushdown. If you look at 
> > > > >> > > qual_is_pushdown_safe, it
> > > > >> > > also blocks pushdown of leaky functions in security_barrier 
> > > > >> > > views. So I
> > > > >> > > wonder if that could cause failures too, somehow. But I haven't 
> > > > >> > > managed
> > > > >> > > to create such example.
> > > > >> >
> > > > >> > I was about to say that the issue here is slightly different from
> > > > >> > qual etc. pushdown, since we're not concerned about quals here, and
> > > > >> > so I wonder where we determine what target list entries to put in a
> > > > >> > given scan path, but then I realized that implies (maybe!) a simpler
> > > > >> > solution. Instead of duplicating checks on whether target list
> > > > >> > entries would be safe, why not check directly in
> > > > >> > get_useful_pathkeys_for_relation() whether or not the pathkey has a
> > > > >> > target list entry?
> > > > >> >
> > > > >> > I haven't been able to try that out yet, and so maybe I'm missing
> > > > >> > something, but I need to step out for a bit, so I'll have to look 
> > > > >> > at
> > > > >> > it later.
> > > > >>
> > > > >> I've started poking at this, but haven't yet found a workable
> > > > >> solution. See the attached patch which "solves" the problem by
> > > > >> breaking putting the sort under the gather merge, but it at least
> > > > >> demonstrates conceptually what I think we're interested in doing.
> > > > >>
> > > > >> The issue currently is that the comparison of expressions fails -- in
> > > > >> our query, the first column selected shows up as a Var (with the same
> > > > >> contents) but is a different pointer in the em_expr and the reltarget
> > > > >> exprs list. I don't yet see a good way to get the equivalence class
> > > > >> for a PathTarget entry.
> > > > >
> > > > >Hmm, I think what I was looking to do is in the attached patch. I didn't
> > > > >realize until I did a lot of reading through the source that we have an
> > > > >equal() function that can compare exprs. That (plus the realization in
> > > > >[1] that the originally reported backtrace wasn't where the error actually
> > > > >came from) convinced me that what we need is to confirm not only that
> > > > >all of the vars in the ec member come from the relids in the rel
> > > > >but also that the expr is actually represented in the target of the
> > > > >rel.
> > > > >
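
(A rough sketch of the kind of membership check being described above --
requiring the equivalence member's expression to actually appear in the
rel's target -- is given below. It is only meant to illustrate the idea,
not to reproduce the attached patch:)

/* sketch: is this EC member's expression available in the rel's target? */
static bool
em_expr_in_target(EquivalenceMember *em, RelOptInfo *rel)
{
	ListCell   *lc;

	/* the member must reference only this rel */
	if (!bms_is_subset(em->em_relids, rel->relids))
		return false;

	/* and its expression must actually be computed by the rel */
	foreach(lc, rel->reltarget->exprs)
	{
		if (equal(em->em_expr, (Expr *) lfirst(lc)))
			return true;
	}

	return false;
}
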
> > > > >With the gucs changed as I mentioned earlier both of the plans (with
> > > > >and without a volatile call in the 2nd select entry) now look like:
> > > > >
> > > > > Unique
> > > > >   ->  Gather Merge
> > > > > Workers Planned: 2
> > > > > ->  Sort
> > > > >   Sort Key: ref_0.i, (md5(ref_0.t))
> > > > >   ->  Nested Loop
> > > > > ->  Parallel Index Scan using ref_0_i_idx on ref_0
> > > > > ->  Function Scan on generate_series ref_1
> > > > >
> > > > >Without the gucs changed the minimal repro case now doesn't error, but
> 

Re: Improving connection scalability: GetSnapshotData()

2020-10-05 Thread Andres Freund
Hi,

On 2020-10-01 19:21:14 -0400, Andrew Dunstan wrote:
> On 10/1/20 4:22 PM, Andres Freund wrote:
> > # Note: on Windows, IPC::Run seems to convert \r\n to \n in program 
> > output
> > # if we're using native Perl, but not if we're using MSys Perl.  So do 
> > it
> > # by hand in the latter case, here and elsewhere.
> > that IPC::Run converts things, but that native windows perl uses
> > https://perldoc.perl.org/perlrun#PERLIO
> > a PERLIO that includes :crlf, whereas msys probably doesn't?
> >
> > Any chance you could run something like
> > perl -mPerlIO -e 'print(PerlIO::get_layers(STDIN), "\n");'
> > on both native and msys perl?
> >
> >
> >> possibly also for stderr, just to make it more futureproof, and at the
> >> top of the file:
> >>
> >> use Config;
> >>
> >>
> >> Do you want me to test that first?
> > That'd be awesome.

> The change I suggested makes jacana happy.

Thanks, pushed. Hopefully that fixes the mingw animals.

I wonder if we instead should do something like

# Have mingw perl treat CRLF the same way as perl on native windows does
ifeq ($(build_os),mingw32)
PROVE="PERLIO=unixcrlf $(PROVE)"
endif

in Makefile.global.in? Then we could remove these rexes from all the
various places?

Greetings,

Andres Freund




Re: "cert" + clientcert=verify-ca in pg_hba.conf?

2020-10-05 Thread Kyotaro Horiguchi
At Mon, 5 Oct 2020 16:07:50 -0400, Bruce Momjian  wrote in 
> > Yes, that is the version I was going to apply.  I will do it today. 
> > Thanks.
> 
> Patch applied to master, and the first paragraph diff was applied to PG
> 12-13 too.

Thanks!

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?

2020-10-05 Thread Bruce Momjian
On Mon, Oct  5, 2020 at 07:58:42PM -0300, Álvaro Herrera wrote:
> On 2020-Oct-05, Tom Lane wrote:
> 
> > FWIW, I think this proposal is a mess.  I was willing to hold my nose
> > and have a queryId field in the internal Query struct without any solid
> > consensus about what its semantics are and which extensions get to use it.
> > Exposing it to end users seems like a bridge too far, though.  In
> > particular, I'm afraid that that will cause people to expect it to have
> > consistent values across PG versions, or even just across architectures
> > within one version.
> 
> I wonder if it would help to purposefully change the computation so that
> it is not -- for instance, hash the system_identifier as initial value.
> Then users would be forced to accept that it'll change as soon as it
> migrates to another server or is upgraded to a new major version.

That seems like a good idea, but it would prevent cross-cluster
same-major-version comparisons, which seems like a negative.  Perhaps we
should add the major version into the hash to handle this.  Ideally,
let's just put a queryid-hash-version into the hash, so if we change
the computation, we just update the hash version and nothing matches
anymore.
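
(To make the idea concrete, something along these lines could serve as the
seed; the helper name and the FNV constants are purely illustrative, not an
existing API. The point is just that the system identifier and a
hash-version constant become part of the initial value:)

#define QUERYID_HASH_VERSION 1		/* bump whenever the computation changes */

/* illustrative seed for the query-ID hash (FNV-1a style) */
static uint64
queryid_initial_hash(uint64 system_identifier)
{
	uint64		h = UINT64CONST(14695981039346656037);	/* FNV offset basis */
	const unsigned char *p = (const unsigned char *) &system_identifier;
	int			i;

	for (i = 0; i < (int) sizeof(system_identifier); i++)
	{
		h ^= p[i];
		h *= UINT64CONST(1099511628211);	/* FNV prime */
	}
	h ^= QUERYID_HASH_VERSION;
	return h;
}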

I do think the queryid has to display independent of pg_stat_statements,
because I can see people using queryid for log file and pg_stat_activity
comparisons.  I also think the ability to have queryid accessible is an
important feature outside of pg_stat_statements, so I do think we need a
way to move this idea forward.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: Retry Cached Remote Connections for postgres_fdw in case remote backend gets killed/goes away

2020-10-05 Thread Fujii Masao




On 2020/10/05 20:32, Bharath Rupireddy wrote:

On Mon, Oct 5, 2020 at 9:45 AM Fujii Masao  wrote:



I see errmsg() with plain text in other places in the code base
as well. Is it that we look at the error message and, if it is plain
text (without database objects or table data), we decide to have no
translation? Or is there some other policy?


I was thinking that elog() should basically be used to report this
debug message instead, but you used ereport() because maybe
you'd like to add a detail message about the connection error. Is this
understanding right? elog() uses errmsg_internal().



Yes that's correct.



So if ereport() is used as an alternative to elog() for some reason,
IMO errmsg_internal() should be used. Thoughts?



Yes, this is apt for our case.



OTOH, the messages you mentioned are not debug ones,
so basically ereport() and errmsg() should be used, I think.



In the connection.c file, yes, they are of ERROR type. It looks like it's
not a standard to use errmsg_internal with ereport for DEBUG messages
that require no translation:

(errmsg("wrote block details for %d blocks", num_blocks)));
(errmsg("MultiXact member stop limit is now %u based on MultiXact %u
(errmsg("oldest MultiXactId member is at offset %u",

However, there are a few other places where errmsg_internal is used for
DEBUG purposes.

(errmsg_internal("finished verifying presence of "
(errmsg_internal("%s(%d) name: %s; blockState:

Having said that, IMHO it's better to keep it the way it is currently in
the code base.


Agreed.
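
(For readers following the convention being discussed, the two spellings of
an untranslated debug message look like this; the message text below is made
up for illustration and is not from the patch:)

	/* plain elog(): debug messages are never translated */
	elog(DEBUG3, "could not reuse cached connection, retrying");

	/* ereport() for the same thing: errmsg_internal() keeps it untranslated */
	ereport(DEBUG3,
			(errmsg_internal("could not reuse cached connection, retrying")));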







Thanks. Attaching v10 patch. Please have a look.


Thanks for updating the patch! I will mark the patch as ready for committer
in the CF.
Barring any objections, I will commit it.



Thanks a lot for the review comments.


I pushed the patch. Thanks!

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Recent failures on buildfarm member hornet

2020-10-05 Thread Tom Lane
hornet has failed its last five runs with

2020-10-05 22:45:42.784 UTC [34734498:40] pg_regress/create_aggregate LOG:  
statement: create aggregate my_percentile_disc(float8 ORDER BY anyelement) (
  stype = internal,
  sfunc = ordered_set_transition,
  finalfunc = percentile_disc_final,
  finalfunc_extra = true,
  finalfunc_modify = read_write
);
TRAP: FailedAssertion("variadicArgType != InvalidOid", File: "pg_aggregate.c", 
Line: 216, PID: 34734498)

After looking at the commits immediately preceding the first failure, and
digging around in the aggregate-related code, it seems like commit 
cc99baa43 (Improve pg_list.h's linitial(), lsecond() and co macros)
must've broken it somehow.  The nearest thing that I can see to a theory
is that where DefineAggregate does
numDirectArgs = intVal(lsecond(args));
it's coming out with the wrong result, leading to a failure of the
numDirectArgs-vs-numArgs sanity check in AggregateCreate.  But how could
that be?  I hesitate to blame the compiler twice in one week.  OTOH, it's
a not-very-mainstream compiler on a not-very-mainstream architecture.

Noah, can you poke into this in a little more detail and try to verify
what is happening?

regards, tom lane




Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?

2020-10-05 Thread Alvaro Herrera
On 2020-Oct-05, Tom Lane wrote:

> FWIW, I think this proposal is a mess.  I was willing to hold my nose
> and have a queryId field in the internal Query struct without any solid
> consensus about what its semantics are and which extensions get to use it.
> Exposing it to end users seems like a bridge too far, though.  In
> particular, I'm afraid that that will cause people to expect it to have
> consistent values across PG versions, or even just across architectures
> within one version.

I wonder if it would help to purposefully change the computation so that
it is not -- for instance, hash the system_identifier as initial value.
Then users would be forced to accept that it'll change as soon as it
migrates to another server or is upgraded to a new major version.




Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?

2020-10-05 Thread Tom Lane
Bruce Momjian  writes:
> I would like to apply this patch (I know it has been in the commitfest
> since July 2019), but I have some questions about the user API.  Does it
> make sense to have a column in pg_stat_actvity and an option in
> log_line_prefix that will be empty unless pg_stat_statements is
> installed?  Is there no clean way to move the query hash computation out
> of pg_stat_statements and into the main code so the query id is always
> visible?  (Also, did we decide _not_ to make the pg_stat_statements
> queryid always a positive value?)

FWIW, I think this proposal is a mess.  I was willing to hold my nose
and have a queryId field in the internal Query struct without any solid
consensus about what its semantics are and which extensions get to use it.
Exposing it to end users seems like a bridge too far, though.  In
particular, I'm afraid that that will cause people to expect it to have
consistent values across PG versions, or even just across architectures
within one version.

The larger picture here is that there's lots of room to doubt whether
pg_stat_statements' decisions about what to ignore or include in the ID
will be satisfactory to everybody.  If that were not so, we'd just move
the computation into core.

regards, tom lane




Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?

2020-10-05 Thread Bruce Momjian
On Wed, Aug 19, 2020 at 04:19:30PM +0200, Julien Rouhaud wrote:
> Similarly to other fields in pg_stat_activity, only the queryid from the top
> level statements are exposed, and if the backends status isn't active then the
> queryid from the last executed statements is displayed.
> 
> Also add a %Q placeholder to include the queryid in the log_line_prefix, which
> will also only expose top level statements.

I would like to apply this patch (I know it has been in the commitfest
since July 2019), but I have some questions about the user API.  Does it
make sense to have a column in pg_stat_activity and an option in
log_line_prefix that will be empty unless pg_stat_statements is
installed?  Is there no clean way to move the query hash computation out
of pg_stat_statements and into the main code so the query id is always
visible?  (Also, did we decide _not_ to make the pg_stat_statements
queryid always a positive value?)

Also, in the doc patch:

By default, query identifiers are not computed, so this field will always
be null, unless an additional module that computes query identifiers, such
as pg_stat_statements, is configured.

why are you saying "such as"?  Isn't pg_stat_statements the only way to
see the queryid?  This command allowed the queryid to be displayed in
pg_stat_activity:

ALTER SYSTEM SET shared_preload_libraries = 'pg_stat_statements';

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: [PATCH] Automatic HASH and LIST partition creation

2020-10-05 Thread Pavel Borisov
Hi, hackers!
I added some extra tests for different cases of use of automatic partition
creation.
v3-0002 can be applied on top of the original v2 patch so that some corner
cases with constraints included in this test work correctly.

As for immediate/deferred, I think the only option available now is
immediate, so using the word IMMEDIATE seems a little redundant to me. We
could introduce this word together with adding a DEFERRED option. However, I
am not strongly opposed to either option. Other opinions are very much
welcome!

-- 
Best regards,
Pavel Borisov

Postgres Professional: http://postgrespro.com 


v3-0002-Minor-fix.patch
Description: Binary data


v3-0001-Tests-for-automatic-hash-list-partitions-creation.patch
Description: Binary data


Re: 回复:how to create index concurrently on partitioned table

2020-10-05 Thread Justin Pryzby
On Mon, Oct 05, 2020 at 05:46:27PM +0900, Michael Paquier wrote:
> On Sat, Sep 26, 2020 at 02:56:55PM -0500, Justin Pryzby wrote:
> > Also, if a partitioned index is clustered, when we clear indisclustered for
> > other indexes, should we also propogate that to their parent indexes, if 
> > any ?
> 
> I am not sure what you mean here.  Each partition's cluster runs in
> its own individual transaction based on the patch you sent.  Are you
> suggesting to update indisclustered for the partitioned index of a
> partitioned table and all its parent partitioned in the same
> transaction, aka a transaction working on the partitioned table?

No, I mean that if a partition is no longer clustered on some index, then its
parent isn't clustered on that index's parent, either.

It means that we might do N catalog updates for a partition hierarchy that's N
levels deep.  Normally, N=2, and we'd clear indisclustered for the index as
well as its parent.  This is not essential, though.
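For illustration only (clear_indisclustered() is a hypothetical helper, and
this is not from the posted patch), the propagation could look roughly like:

/* When a partition's index stops being clustered, clear the flag on each
 * ancestor index too; pg_inherits records index inheritance, so the
 * existing get_partition_ancestors() can walk upwards. */
List	   *ancestors = get_partition_ancestors(indexOid);
ListCell   *lc;

foreach(lc, ancestors)
	clear_indisclustered(lfirst_oid(lc));	/* update pg_index.indisclustered */
list_free(ancestors);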

> >> It would be good also to check if
> >> we have a partition index tree that maps partially with a partition
> >> table tree (aka no all table partitions have a partition index), where
> >> these don't get clustered because there is no index to work on.
> >
> > This should not happen, since a incomplete partitioned index is "invalid".
> 
> Indeed, I did not know this property.

I think that's something we can apply for CIC/DIC, too.
It's not essential to avoid leaving an "invalid" or partial index if
interrupted.  It's only essential that a partial, partitioned index is not
"valid".

For DROP IND CONCURRENTLY, I wrote:

On Mon, Sep 14, 2020 at 09:31:03AM -0500, Justin Pryzby wrote:
> 2. Maybe the easiest way is to mark all indexes invalid and then drop all
> partitions (concurrently) and then the partitioned parent.  If interrupted,
> this would leave a parent index marked "invalid", and some child tables with 
> no
> indexes.  I think this may be "ok".  The same thing is possible if a 
> concurrent
> build is interrupted, right ?


-- 
Justin




Re: "cert" + clientcert=verify-ca in pg_hba.conf?

2020-10-05 Thread Bruce Momjian
On Mon, Oct  5, 2020 at 02:02:34PM -0400, Bruce Momjian wrote:
> On Mon, Oct  5, 2020 at 10:25:08AM +0900, Kyotaro Horiguchi wrote:
> > > > >From 2978479ada887284eae0ed36c8acf29f1a002feb Mon Sep 17 00:00:00 2001
> > > > From: Kyotaro Horiguchi 
> > > > Date: Tue, 21 Jul 2020 23:01:27 +0900
> > > > Subject: [PATCH v2] Allow directory name for GUC ssl_crl_file and 
> > > > connection
> > > >  option sslcrl
> > > > 
> > > > X509_STORE_load_locations accepts a directory, which leads to
> > > > on-demand loading method with which method only relevant CRLs are
> > > > loaded.
> > > 
> > > Uh, I think this CRL patch is the wrong patch.  This thread is about the
> > > clientcert=verify-ca in pg_hba.conf.  I will use the patch I developed
> > > and posted on Tue, 1 Sep 2020 11:47:34 -0400 in this thread.
> > 
> > Mmmm. Sorry for the silly mistake. I'm confused with another one.
> > 
> > FWIW, the cause is a rewording of "cannot" to "can not". This is the
> > right one.
> 
> Yes, that is the version I was going to apply.  I will do it today. 
> Thanks.

Patch applied to master, and the first paragraph diff was applied to PG
12-13 too.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: Probable documentation errors or improvements

2020-10-05 Thread Justin Pryzby
On Fri, Sep 25, 2020 at 09:30:00AM -0500, Justin Pryzby wrote:
> Split one patch about text search, added another one (sequences), added some
> info to commit messages, and added here.
> https://commitfest.postgresql.org/30/2744/

Added an additional patch regarding spaces between function arguments.

-- 
Justin
>From 28c0407bcf044e1a9a8c542ae6a4304a54f3 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Tue, 22 Sep 2020 21:40:17 -0500
Subject: [PATCH v3 01/12] extraneous comma

---
 doc/src/sgml/rules.sgml | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/doc/src/sgml/rules.sgml b/doc/src/sgml/rules.sgml
index bcf860b68b..d6e3463ac2 100644
--- a/doc/src/sgml/rules.sgml
+++ b/doc/src/sgml/rules.sgml
@@ -769,7 +769,7 @@ SELECT t1.a, t2.b, t1.ctid FROM t1, t2 WHERE t1.a = t2.a;
 
 
 
-The benefit of implementing views with the rule system is,
+The benefit of implementing views with the rule system is
 that the planner has all
 the information about which tables have to be scanned plus the
 relationships between these tables plus the restrictive
@@ -781,7 +781,7 @@ SELECT t1.a, t2.b, t1.ctid FROM t1, t2 WHERE t1.a = t2.a;
 the best path to execute the query, and the more information
 the planner has, the better this decision can be. And
 the rule system as implemented in PostgreSQL
-ensures, that this is all information available about the query
+ensures that this is all information available about the query
 up to that point.
 
 
-- 
2.17.0

>From 4a181e8df650e2c93a2a3fded7bfd67081bb18dc Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Wed, 23 Sep 2020 13:09:11 -0500
Subject: [PATCH v3 02/12] semicolons after END in programlisting

---
 doc/src/sgml/func.sgml|  2 +-
 doc/src/sgml/plpgsql.sgml | 12 ++--
 doc/src/sgml/rules.sgml   |  2 +-
 3 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index e7cff980dd..8082fddf8c 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -26597,7 +26597,7 @@ BEGIN
  obj.object_name,
  obj.object_identity;
 END LOOP;
-END
+END;
 $$;
 CREATE EVENT TRIGGER test_event_trigger_for_drops
ON sql_drop
diff --git a/doc/src/sgml/plpgsql.sgml b/doc/src/sgml/plpgsql.sgml
index 74b6b25878..94a3b36458 100644
--- a/doc/src/sgml/plpgsql.sgml
+++ b/doc/src/sgml/plpgsql.sgml
@@ -1181,7 +1181,7 @@ BEGIN
 SELECT users.userid INTO STRICT userid
 FROM users WHERE users.username = get_userid.username;
 RETURN userid;
-END
+END;
 $$ LANGUAGE plpgsql;
 
  On failure, this function might produce an error message such as
@@ -1854,7 +1854,7 @@ BEGIN
 RETURN NEXT r; -- return current row of SELECT
 END LOOP;
 RETURN;
-END
+END;
 $BODY$
 LANGUAGE plpgsql;
 
@@ -1882,7 +1882,7 @@ BEGIN
 END IF;
 
 RETURN;
- END
+ END;
 $BODY$
 LANGUAGE plpgsql;
 
@@ -1956,7 +1956,7 @@ DECLARE myvar int := 5;
 BEGIN
   CALL triple(myvar);
   RAISE NOTICE 'myvar = %', myvar;  -- prints 15
-END
+END;
 $$;
 
 
@@ -3559,7 +3559,7 @@ BEGIN
 ROLLBACK;
 END IF;
 END LOOP;
-END
+END;
 $$;
 
 CALL transaction_test1();
@@ -5213,7 +5213,7 @@ DECLARE
 f1 int;
 BEGIN
 RETURN f1;
-END
+END;
 $$ LANGUAGE plpgsql;
 WARNING:  variable "f1" shadows a previously defined variable
 LINE 3: f1 int;
diff --git a/doc/src/sgml/rules.sgml b/doc/src/sgml/rules.sgml
index d6e3463ac2..e81addcfa9 100644
--- a/doc/src/sgml/rules.sgml
+++ b/doc/src/sgml/rules.sgml
@@ -2087,7 +2087,7 @@ CREATE FUNCTION tricky(text, text) RETURNS bool AS $$
 BEGIN
 RAISE NOTICE '% = %', $1, $2;
 RETURN true;
-END
+END;
 $$ LANGUAGE plpgsql COST 0.01;
 
 SELECT * FROM phone_number WHERE tricky(person, phone);
-- 
2.17.0

>From f3a1c02e297a501b116c079c8cbc41563522320b Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Tue, 22 Sep 2020 22:51:47 -0500
Subject: [PATCH v3 03/12] punctuation

---
 doc/src/sgml/catalogs.sgml | 2 +-
 doc/src/sgml/config.sgml   | 4 ++--
 doc/src/sgml/parallel.sgml | 2 +-
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 3927b1030d..5bd54cb218 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -1525,7 +1525,7 @@
   
   
Role can log in. That is, this role can be given as the initial
-   session authorization identifier
+   session authorization identifier.
   
  
 
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index ee914740cc..8b4a2e440f 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -10084,8 +10084,8 @@ LOG:  CleanUpLock: deleting: lock(0xb7acd844) id(24688,24696,0,0,0,1)
   
   

-If set, do not trace locks for tables below this OID. (use to avoid
-output on system tables)
+If set, do not trace locks for tables below this OID (used 

Re: Partitionwise join fails under GEQO

2020-10-05 Thread Tom Lane
Ashutosh Bapat  writes:
> On Mon, Oct 5, 2020 at 6:47 AM Tom Lane  wrote:
>> Now that I've seen this, I wonder whether add_child_join_rel_equivalences
>> might not be making duplicate EC entries even without GEQO.  Is there any
>> guarantee that it's not called repeatedly on related join-rel sets?

> build_child_join_rel() is the only caller of
> add_child_join_rel_equivalences(). Before the first calls the later,
> the first has an Assert
>  899 Assert(!find_join_rel(root, joinrel->relids));
> This means that for a given child join rel this function is called
> only once implying that add_child_join_rel_equivalences() is called
> only once for a given child join rel. Initially I thought that this is
> enough to not add duplicate child members. But to me use of
> bms_overlap() in that function looks doubtful.

Yeah.  As soon as you get into three-or-more-way joins, this code will
be called again, mentioning some of the same relids as at the lower
join level.

> On a related but different subject, I have been thinking on-off about
> the need for expression translation while planning partitions.The
> translated expressions differ only in the leaf-vars and even for most
> of the partitioned tables really only the varno (assuming that
> partitioned table and partitions will have same layouts when a large
> number of partitions are created automatically.) If we can introduce a
> new type of (polymorphic) Var node which represents not just the
> parent Var but also child Vars as well. The whole need for expression
> translation vanishes, reducing memory requirements by many folds.

Actually, the reason I have been looking at equivclass.c is that I've
been attacking the problem from a different direction.  I'm fairly
unexcited about making the definition of Var looser as you suggest
here --- I actually think that it's dangerously imprecise already,
due to not accounting for the effects of outer joins.  However,
we could avoid the growth in eclass members for large partition sets
if we simply didn't store child eclass members, instead translating
on-the-fly when we need to deal with child rels.  I have a patch
about half done, but it won't be possible to determine the true
performance implications of that idea until it's all done.  More
later.
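For illustration only (the function name is invented; this is not from that
half-done patch), translating a parent member for a child rel on demand could
reuse the existing appendrel translation machinery:

/* Build the child version of a parent EquivalenceMember's expression. */
static Expr *
child_em_expr(PlannerInfo *root, EquivalenceMember *parent_em,
			  RelOptInfo *child_rel)
{
	AppendRelInfo **appinfos;
	int			nappinfos;
	Expr	   *expr;

	appinfos = find_appinfos_by_relids(root, child_rel->relids, &nappinfos);
	expr = (Expr *) adjust_appendrel_attrs(root,
										   (Node *) parent_em->em_expr,
										   nappinfos, appinfos);
	pfree(appinfos);
	return expr;
}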

Either approach would mean that add_child_join_rel_equivalences
goes away entirely, or at least doesn't need to store new em_is_child
entries anymore, causing the memory bloat issue to become moot.
So maybe we should just fix the wrong-context issue for now, and
live with the GEQO bloat in the back branches.

regards, tom lane




Re: "cert" + clientcert=verify-ca in pg_hba.conf?

2020-10-05 Thread Bruce Momjian
On Mon, Oct  5, 2020 at 10:25:08AM +0900, Kyotaro Horiguchi wrote:
> > > >From 2978479ada887284eae0ed36c8acf29f1a002feb Mon Sep 17 00:00:00 2001
> > > From: Kyotaro Horiguchi 
> > > Date: Tue, 21 Jul 2020 23:01:27 +0900
> > > Subject: [PATCH v2] Allow directory name for GUC ssl_crl_file and 
> > > connection
> > >  option sslcrl
> > > 
> > > X509_STORE_load_locations accepts a directory, which leads to
> > > on-demand loading method with which method only relevant CRLs are
> > > loaded.
> > 
> > Uh, I think this CRL patch is the wrong patch.  This thread is about the
> > clientcert=verify-ca in pg_hba.conf.  I will use the patch I developed
> > and posted on Tue, 1 Sep 2020 11:47:34 -0400 in this thread.
> 
> Mmmm. Sorry for the silly mistake. I'm confused with another one.
> 
> FWIW, the cause is a rewording of "cannot" to "can not". This is the
> right one.

Yes, that is the version I was going to apply.  I will do it today. 
Thanks.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: Buglets in equivclass.c

2020-10-05 Thread Tom Lane
David Rowley  writes:
> On Mon, 5 Oct 2020 at 06:29, Tom Lane  wrote:
>> I'm unsure whether to back-patch either of these.  They both seem to be
>> just latent bugs so far as the core code is concerned, but the first one
>> in particular seems like something that could bite extension code.
>> Thoughts?

> That's a good question. I'm leaning towards backpatching both of them.
> The 2nd is new as of v13, so it does not seem unreasonable that
> someone has just not yet stumbled on it with some extension that adds
> extra child ECs. I can imagine a use case for that, by getting rid of
> needless equality quals that duplicate the partition constraint.  The
> fix for the first just seems neater/faster/correct, so I can't really
> see any reason not to backpatch it.

Yeah, that's my conclusion too after sleeping on it.  Pushed,
thanks for reviewing.

regards, tom lane




Re: Use standard SIGHUP and SIGTERM handlers in autoprewarm module

2020-10-05 Thread Bharath Rupireddy
On Mon, Oct 5, 2020 at 8:04 PM Fujii Masao  wrote:
>
> On 2020/10/05 19:45, Bharath Rupireddy wrote:
> > Hi,
> >
> > Autoprewarm module is using its own SIGHUP (apw_sighup_handler, got_sighup) 
> > and SIGTERM (apw_sigterm_handler, got_sigterm) handlers, which are similar to 
> > the standard signal handlers (except for a difference [1]). Isn't it good to 
> > remove them and use the standard SignalHandlerForConfigReload and 
> > SignalHandlerForShutdownRequest?
> >
> > Attaching the patch for the above changes.
> >
> > Looks like the commit[2] replaced custom handlers with standard handlers.
> >
> > Thoughts?
>
> +1
>
> The patch looks good to me.
>

Thanks.

>
> ISTM that we can also replace StartupProcSigHupHandler() in startup.c
> with SignalHandlerForConfigReload() by making the startup process use
> the general shared latch instead of its own one. POC patch attached.
> Thought?
>

I'm not quite sure whether it breaks something or not. I see that
WakeupRecovery(), which sets the startup process's
XLogCtl->recoveryWakeupLatch, is also being used in the walreceiver
process. I may be wrong, but I have some concern about whether the
behaviour is different in the EXEC_BACKEND and Windows cases.

Another concern is that we are always using
XLogCtl->recoveryWakeupLatch in shared mode, which makes sense as this
latch is also being used in the walreceiver. But sometimes, MyLatch is
created in non-shared mode as well (see InitLatch(MyLatch)).

Others may have better thoughts though.

>
> Probably we can also replace sigHupHandler() in syslogger.c with
> SignalHandlerForConfigReload()? This would be separate patch, though.
>

+1 to replace sigHupHandler() with SignalHandlerForConfigReload() as
the latch and the functionality are pretty much the same.

WalReceiverMain(): I think we can also replace WalRcvShutdownHandler()
with SignalHandlerForShutdownRequest() because walrcv->latch points to
MyProc->procLatch, which in turn is what MyLatch points to.

Thoughts? If okay, we can combine these into a single patch. I will
post an updated patch soon.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Custom compression methods

2020-10-05 Thread Tomas Vondra

On Mon, Oct 05, 2020 at 07:57:41PM +0530, Dilip Kumar wrote:

On Mon, Oct 5, 2020 at 5:53 PM Tomas Vondra
 wrote:


On Mon, Oct 05, 2020 at 11:17:28AM +0530, Dilip Kumar wrote:
>On Mon, Oct 5, 2020 at 3:37 AM Tomas Vondra
> wrote:
>
>Thanks, Tomas for your feedback.
>
>> 9) attcompression ...
>>
>> The main issue I see is what the patch does with attcompression. Instead
>> of just using it to store a the compression method, it's also used to
>> store the preserved compression methods. And using NameData to store
>> this seems wrong too - if we really want to store this info, the correct
>> way is either using text[] or inventing charvector or similar.
>
>The reason for using the NameData is the get it in the fixed part of
>the data structure.
>

Why do we need that? It's possible to have varlena fields with direct
access (see pg_index.indkey for example).


I see.  While making it NameData I was thinking whether we have an
option to direct access the varlena. Thanks for pointing me there. I
will change this.

Adding NameData just to make

it fixed-length means we're always adding 64B even if we just need a
single byte, which means ~30% overhead for the FormData_pg_attribute.
That seems a bit unnecessary, and might be an issue with many attributes
(e.g. with many temp tables, etc.).


You are right.  Even I did not like to keep 64B for this, so I will change it.



>> But to me this seems very much like a misuse of attcompression to track
>> dependencies on compression methods, necessary because we don't have a
>> separate catalog listing compression methods. If we had that, I think we
>> could simply add dependencies between attributes and that catalog.
>
>Basically, up to this patch, we are having only built-in compression
>methods and those can not be dropped so we don't need any dependency
>at all.  We just want to know what is the current compression method
>and what is the preserve compression methods supported for this
>attribute.  Maybe we can do it better instead of using the NameData
>but I don't think it makes sense to add a separate catalog?
>

Sure, I understand what the goal was - all I'm saying is that it looks
very much like a workaround needed because we don't have the catalog.

I don't quite understand how could we support custom compression methods
without listing them in some sort of catalog?


Yeah for supporting custom compression we need some catalog.


>> Moreover, having the catalog would allow adding compression methods
>> (from extensions etc) instead of just having a list of hard-coded
>> compression methods. Which seems like a strange limitation, considering
>> this thread is called "custom compression methods".
>
>I think I forgot to mention while submitting the previous patch that
>the next patch I am planning to submit is, Support creating the custom
>compression methods wherein we can use pg_am catalog to insert the new
>compression method.  And for dependency handling, we can create an
>attribute dependency on the pg_am row.  Basically, we will create the
>attribute dependency on the current compression method AM as well as
>on the preserved compression methods AM.   As part of this, we will
>add two build-in AMs for zlib and pglz, and the attcompression field
>will be converted to the oid_vector (first OID will be of the current
>compression method, followed by the preserved compression method's
>oids).
>

Hmmm, ok. Not sure pg_am is the right place - compression methods don't
quite match what I thought AMs are about, but maybe it's just my fault.

FWIW it seems a bit strange to first do the attcompression magic and
then add the catalog later - I think we should start with the catalog
right away. The advantage is that if we end up committing only some of
the patches in this cycle, we already have all the infrastructure etc.
We can reorder that later, though.
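For illustration, if we did go with a dedicated catalog rather than pg_am, a
minimal definition might look like this (all names and the OID are invented,
modelled on pg_am.h):

CATALOG(pg_compression,8888,CompressionRelationId)
{
	Oid			oid;			/* compression method OID */
	NameData	cmname;			/* compression method name */
	regproc		cmhandler BKI_LOOKUP(pg_proc);	/* handler returning the
												 * compression routines */
} FormData_pg_compression;

typedef FormData_pg_compression *Form_pg_compression;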


Hmm,  yeah we can do this way as well that first create a new catalog
table and add entries for these two built-in methods and the
attcompression can store the oid vector.  But if we only commit the
build-in compression methods part then does it make sense to create an
extra catalog or adding these build-in methods to the existing catalog
(if we plan to use pg_am).  Then in attcompression instead of using
one byte for each preserve compression method, we need to use oid.  So
from Robert's mail[1], it appeared to me that he wants that the
build-in compression methods part should be independently committable
and if we think from that perspective then adding a catalog doesn't
make much sense.  But if we are planning to commit the custom method
also then it makes more sense to directly start with the catalog
because that way it will be easy to expand without much refactoring.

[1] 
https://www.postgresql.org/message-id/CA%2BTgmobSDVgUage9qQ5P_%3DF_9jaMkCgyKxUQGtFQU7oN4kX-AA%40mail.gmail.com



Hmmm. Maybe I'm missing something subtle, but I think that plan can be
interpreted in various ways - it does not really say 

Re: partition routing layering in nodeModifyTable.c

2020-10-05 Thread Heikki Linnakangas

On 13/07/2020 08:47, Amit Langote wrote:

It's been over 11 months since there was any significant commentary on
the contents of the patches themselves, so perhaps I should reiterate
what the patches are about and why it might still be a good idea to
consider them.

The thread started with some very valid criticism of the way
executor's partition tuple routing logic looks randomly sprinkled over
in nodeModifyTable.c, execPartition.c.  In the process of making it
look less random, we decided to get rid of the global variable
es_result_relation_info to avoid complex maneuvers of
setting/resetting it correctly when performing partition tuple
routing, causing some other churn beside the partitioning code.  Same
with another global variable TransitionCaptureState.tcs_map.  So, the
patches neither add any new capabilities, nor improve performance, but
they do make the code in this area a bit easier to follow.

Actually, there is a problem that some of the changes here conflict
with patches being discussed on other threads ([1], [2]), so much so
that I decided to absorb some changes here into another "refactoring"
patch that I have posted at [2].


Thanks for the summary. It's been a bit hard to follow what depends on 
what across these threads, and how they work together. It seems that 
this patch set is the best place to start.



Attached rebased patches.

0001 contains preparatory FDW API changes to stop relying on
es_result_relation_info being set correctly.


Makes sense. The only thing I don't like about this is the way the 
ForeignScan->resultRelIndex field is set. make_foreignscan() initializes 
it to -1, and the FDW's PlanDirectModify() function is expected to set 
it, like you did in postgres_fdw:



@@ -2319,6 +2322,11 @@ postgresPlanDirectModify(PlannerInfo *root,
rebuild_fdw_scan_tlist(fscan, returningList);
}
 
+	/*
+	 * Set the index of the subplan result rel.
+	 */
+	fscan->resultRelIndex = subplan_index;
+
table_close(rel, NoLock);
return true;
 }


It has to be set to that value (subplan_index is an argument to 
PlanDirectModify()); the FDW doesn't have any choice there, so this is 
just additional boilerplate code that has to be copied to every FDW that 
implements direct modify. Furthermore, if the FDW doesn't set it 
correctly, you could have some very interesting results, like updating 
the wrong table. It would be better to set it in make_modifytable(), just 
after calling PlanDirectModify().
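For illustration, assuming the patch's new ForeignScan.resultRelIndex field,
the core side could look roughly like this (sketch only, not a tested change):

/* In make_modifytable(), right after asking the FDW about direct modify */
direct_modify = fdwroutine->PlanDirectModify(root, node, rti, i);
if (direct_modify)
{
	ForeignScan *fscan = castNode(ForeignScan, list_nth(node->plans, i));

	fscan->resultRelIndex = i;	/* i is the subplan index; no longer the
								 * FDW's job to remember this */
}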


I'm also a bit unhappy with the way it's updated in set_plan_refs():


--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -904,6 +904,13 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
rc->rti += rtoffset;
rc->prti += rtoffset;
}
+			/*
+			 * Caution: Do not change the relative ordering of this loop
+			 * and the statement below that adds the result relations to
+			 * root->glob->resultRelations, because we need to use the
+			 * current value of list_length(root->glob->resultRelations)
+			 * in some plans.
+			 */
foreach(l, splan->plans)
{
lfirst(l) = set_plan_refs(root,
@@ -1243,6 +1250,14 @@ set_foreignscan_references(PlannerInfo *root,
}
 
 	fscan->fs_relids = offset_relid_set(fscan->fs_relids, rtoffset);

+
+	/*
+	 * Adjust resultRelIndex if it's valid (note that we are called before
+	 * adding the RT indexes of ModifyTable result relations to the global
+	 * list)
+	 */
+	if (fscan->resultRelIndex >= 0)
+		fscan->resultRelIndex += list_length(root->glob->resultRelations);
 }
 
 /*


That "Caution" comment is well deserved, but could we make this more 
robust to begin with? The most straightforward solution would be to pass 
down the "current resultRelIndex" as an extra parameter to 
set_plan_refs(), similar to rtoffset. If we did that, we wouldn't 
actually need to set it before setrefs.c processing at all.


I'm a bit wary of adding another argument to set_plan_refs() because 
that's a lot of code churn, but it does seem like the most natural 
solution to me. Maybe create a new context struct to hold the 
PlannerInfo, rtoffset, and the new "currentResultRelIndex" value, 
similar to fix_scan_expr_context, to avoid passing through so many 
arguments.
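For illustration, such a context could look roughly like this (all names
invented):

/* Bundle what set_plan_refs() needs, in the spirit of fix_scan_expr_context,
 * so only one extra argument has to be threaded through. */
typedef struct set_plan_refs_context
{
	PlannerInfo *root;
	int			rtoffset;
	int			cur_result_rel_index;	/* next ForeignScan.resultRelIndex */
} set_plan_refs_context;

static Plan *set_plan_refs(set_plan_refs_context *context, Plan *plan);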



Another idea is to merge "resultRelIndex" and a "range table index" into 
one value. Range table entries that are updated would have a 
ResultRelInfo, others would not. I'm not sure if that would end up being 
cleaner or messier than what we have now, but 

Re: enable_incremental_sort changes query behavior

2020-10-05 Thread James Coleman
On Sun, Oct 4, 2020 at 9:40 PM James Coleman  wrote:
>
> On Sat, Oct 3, 2020 at 10:10 PM James Coleman  wrote:
> >
> > On Sat, Oct 3, 2020 at 5:44 PM Tomas Vondra
> >  wrote:
> > >
> > > On Sat, Oct 03, 2020 at 10:50:06AM -0400, James Coleman wrote:
> > > >On Fri, Oct 2, 2020 at 11:16 PM James Coleman  wrote:
> > > >>
> > > >> On Fri, Oct 2, 2020 at 7:07 PM James Coleman  wrote:
> > > >> >
> > > >> > On Fri, Oct 2, 2020 at 6:28 PM Tomas Vondra
> > > >> >  wrote:
> > > >> > >
> > > >> > > On Fri, Oct 02, 2020 at 05:45:52PM -0400, James Coleman wrote:
> > > >> > > >On Fri, Oct 2, 2020 at 4:56 PM Tomas Vondra
> > > >> > > > wrote:
> > > >> > > >>
> > > >> > > >> ...
> > > >> > > >>
> > > >> > > >> More importanly, it does not actually fix the issue - it does 
> > > >> > > >> fix that
> > > >> > > >> particular query, but just replacing the DISTINCT with either 
> > > >> > > >> ORDER BY
> > > >> > > >> or GROUP BY make it fail again :-(
> > > >> > > >>
> > > >> > > >> Attached is a simple script I used, demonstrating these issues 
> > > >> > > >> for the
> > > >> > > >> three cases that expect to have ressortgroupref != 0 (per the 
> > > >> > > >> comment
> > > >> > > >> before TargetEntry in plannodes.h).
> > > >> > > >
> > > >> > > >So does checking for volatile expressions (if you happened to test
> > > >> > > >that) solve all the cases? If you haven't tested that yet, I can 
> > > >> > > >try
> > > >> > > >to do that this evening.
> > > >> > > >
> > > >> > >
> > > >> > > Yes, it does fix all the three queries in the SQL script.
> > > >> > >
> > > >> > > The question however is whether this is the root issue, and 
> > > >> > > whether it's
> > > >> > > the right way to fix it. For example - volatility is not the only 
> > > >> > > reason
> > > >> > > that may block qual pushdown. If you look at 
> > > >> > > qual_is_pushdown_safe, it
> > > >> > > also blocks pushdown of leaky functions in security_barrier views. 
> > > >> > > So I
> > > >> > > wonder if that could cause failures too, somehow. But I haven't 
> > > >> > > managed
> > > >> > > to create such example.
> > > >> >
> > > >> > I was about to say that the issue here is slightly different from 
> > > >> > qual
> > > >> > etc. pushdown, since we're not concerned about quals here, and so I
> > > >> > wonder where we determine what target list entries to put in a given
> > > >> > scan path, but then I realized that implies (maybe!) a simpler
> > > >> > solution. Instead of duplicating checks on target list entries would
> > > >> > be safe, why not check directly in get_useful_pathkeys_for_relation()
> > > >> > whether or not the pathkey has a target list entry?
> > > >> >
> > > >> > I haven't been able to try that out yet, and so maybe I'm missing
> > > >> > something, but I need to step out for a bit, so I'll have to look at
> > > >> > it later.
> > > >>
> > > >> I've started poking at this, but haven't yet found a workable
> > > >> solution. See the attached patch which "solves" the problem by
> > > >> breaking putting the sort under the gather merge, but it at least
> > > >> demonstrates conceptually what I think we're interested in doing.
> > > >>
> > > >> The issue currently is that the comparison of expressions fails -- in
> > > >> our query, the first column selected shows up as a Var (with the same
> > > >> contents) but is a different pointer in the em_expr and the reltarget
> > > >> exprs list. I don't yet see a good way to get the equivalence class
> > > >> for a PathTarget entry.
> > > >
> > > >Hmm, I think I was looking to do is the attached patch. I didn't
> > > >realize until I did a lot of reading through source that we have an
> > > >equal() function that can compare exprs. That (plus the realization in
> > > >[1] the originally reported backtrace wasn't where the error actually
> > > >came from) convinced me that what we need is to confirm not only that
> > > >the all of the vars in the ec member come from the relids in the rel
> > > >but also that the expr is actually represented in the target of the
> > > >rel.
> > > >
> > > >With the gucs changed as I mentioned earlier both of the plans (with
> > > >and without a volatile call in the 2nd select entry) now look like:
> > > >
> > > > Unique
> > > >   ->  Gather Merge
> > > > Workers Planned: 2
> > > > ->  Sort
> > > >   Sort Key: ref_0.i, (md5(ref_0.t))
> > > >   ->  Nested Loop
> > > > ->  Parallel Index Scan using ref_0_i_idx on ref_0
> > > > ->  Function Scan on generate_series ref_1
> > > >
> > > >Without the gucs changed the minimal repro case now doesn't error, but
> > > >results in this plan:
> > > >
> > > > HashAggregate
> > > >   Group Key: ref_0.i, CASE WHEN pg_rotate_logfile_old() THEN ref_0.t
> > > >ELSE ref_0.t END
> > > >   ->  Nested Loop
> > > > ->  Seq Scan on ref_0
> > > > ->  Function Scan on generate_series ref_1
> > > >
> > > >Similarly in your six queries I now only 

Re: Online checksums patch - once again

2020-10-05 Thread Heikki Linnakangas

On 05/10/2020 17:25, Álvaro Herrera wrote:

On 2020-Oct-05, Heikki Linnakangas wrote:


The code in sendFile() in basebackup.c seems suspicious in that regard. It
calls DataChecksumsNeedVerify() once before starting to read the file. Isn't
it possible for the checksums flag to change while it's reading the file and
sending it to the client? I hope there are CHECK_FOR_INTERRUPTS() calls
buried somewhere in the loop, because it could take minutes to send the
whole file.

I would feel better if the state transition of the "checksums" flag could
only happen in a few safe places, or there were some other safeguards for
this. I think that's what Andres was trying to say earlier in the thread on
ProcSignalBarriers. I'm not sure what the interface to that should be. It
could be something like HOLD/RESUME_INTERRUPTS(), where normally all
procsignals are handled on CHECK_FOR_INTERRUPTS(), but you could "hold off"
some if needed. Or something else. Or maybe we can just use
HOLD/RESUME_INTERRUPTS() for this. It's more coarse-grained than necessary,
but probably doesn't matter in practice.


I hope you're not suggesting that interrupts would be held for the whole
transmission of a file, which you say could take minutes.  If we do have
an interrupt holdoff, then it has to be pretty short; users (and
systemd) despair if service shutdown is delayed more than a few seconds.


I'm not suggesting that, sorry I wasn't clear. That would indeed be 
horrible.


sendFile() needs a different solution, but all the other places where 
DataChecksumsNeedWrite/Verify() is called need to be inspected to make 
sure that they hold interrupts, or ensure some other way that an 
interrupt doesn't change the local checksums flag between the 
DataChecksumsNeedWrite/Verify() call and the read/write.


I think sendFile() needs to re-check the local checksums state before 
each read. It also needs to ensure that an interrupt doesn't occur and 
change the local checksums state between read and the 
DataChecksumsNeedVerify() calls, but that's a very short period if you 
call DataChecksumsNeedVerify() again for each block.
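For illustration, the per-block pattern could look roughly like this
(verify_page() stands in for the existing per-page verification; error
handling and partial reads are omitted):

while ((cnt = fread(page, 1, BLCKSZ, fp)) == BLCKSZ)
{
	/* keep the local checksum flag stable across the check and the verify */
	HOLD_INTERRUPTS();
	if (DataChecksumsNeedVerify())
		verify_page(page, blkno);
	RESUME_INTERRUPTS();

	/* absorb any pending procsignal barrier between blocks */
	CHECK_FOR_INTERRUPTS();

	/* ... send the block to the client here ... */
	blkno++;
}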


- Heikki




Re: Use standard SIGHUP and SIGTERM handlers in autoprewarm module

2020-10-05 Thread Fujii Masao



On 2020/10/05 19:45, Bharath Rupireddy wrote:

Hi,

Autoprewarm module is using its own SIGHUP (apw_sighup_handler, got_sighup) 
and SIGTERM (apw_sigterm_handler, got_sigterm) handlers, which are similar to 
the standard signal handlers (except for a difference [1]). Isn't it good to remove 
them and use the standard SignalHandlerForConfigReload and 
SignalHandlerForShutdownRequest?

Attaching the patch for the above changes.

Looks like the commit[2] replaced custom handlers with standard handlers.

Thoughts?


+1

The patch looks good to me.

ISTM that we can also replace StartupProcSigHupHandler() in startup.c
with SignalHandlerForConfigReload() by making the startup process use
the general shared latch instead of its own one. POC patch attached.
Thought?

Probably we can also replace sigHupHandler() in syslogger.c with
SignalHandlerForConfigReload()? This would be separate patch, though.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
diff --git a/src/backend/access/transam/xlog.c 
b/src/backend/access/transam/xlog.c
index 8f11b1b9de..8b4434962f 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -682,7 +682,7 @@ typedef struct XLogCtlData
 * WAL replay, if it is waiting for WAL to arrive or failover trigger 
file
 * to appear.
 */
-   Latch   recoveryWakeupLatch;
+   Latch   *recoveryWakeupLatch;
 
/*
 * During recovery, we keep a copy of the latest checkpoint record here.
@@ -5185,7 +5185,6 @@ XLOGShmemInit(void)
SpinLockInit(>Insert.insertpos_lck);
SpinLockInit(>info_lck);
SpinLockInit(>ulsn_lck);
-   InitSharedLatch(>recoveryWakeupLatch);
 }
 
 /*
@@ -6122,7 +6121,7 @@ recoveryApplyDelay(XLogReaderState *record)
 
while (true)
{
-   ResetLatch(>recoveryWakeupLatch);
+   ResetLatch(MyLatch);
 
/* might change the trigger file's location */
HandleStartupProcInterrupts();
@@ -6146,7 +6145,7 @@ recoveryApplyDelay(XLogReaderState *record)
elog(DEBUG2, "recovery apply delay %ld seconds, %d 
milliseconds",
 secs, microsecs / 1000);
 
-   (void) WaitLatch(>recoveryWakeupLatch,
+   (void) WaitLatch(MyLatch,
 WL_LATCH_SET | WL_TIMEOUT | 
WL_EXIT_ON_PM_DEATH,
 secs * 1000L + microsecs / 
1000,
 
WAIT_EVENT_RECOVERY_APPLY_DELAY);
@@ -6474,11 +6473,11 @@ StartupXLOG(void)
}
 
/*
-* Take ownership of the wakeup latch if we're going to sleep during
-* recovery.
+* Advertise our latch that other processes can use to wake us up
+* if we're going to sleep during recovery.
 */
if (ArchiveRecoveryRequested)
-   OwnLatch(>recoveryWakeupLatch);
+   XLogCtl->recoveryWakeupLatch = >procLatch;
 
/* Set up XLOG reader facility */
MemSet(, 0, sizeof(XLogPageReadPrivate));
@@ -7488,13 +7487,6 @@ StartupXLOG(void)
if (InRecovery)
ResetUnloggedRelations(UNLOGGED_RELATION_INIT);
 
-   /*
-* We don't need the latch anymore. It's not strictly necessary to 
disown
-* it, but let's do it for the sake of tidiness.
-*/
-   if (ArchiveRecoveryRequested)
-   DisownLatch(>recoveryWakeupLatch);
-
/*
 * We are now done reading the xlog from stream. Turn off streaming
 * recovery to force fetching the files (which would be required at end 
of
@@ -12244,12 +12236,12 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool 
randAccess,
wait_time = 
wal_retrieve_retry_interval -
(secs * 1000 + usecs / 
1000);
 
-   (void) 
WaitLatch(>recoveryWakeupLatch,
+   (void) WaitLatch(MyLatch,

 WL_LATCH_SET | WL_TIMEOUT |

 WL_EXIT_ON_PM_DEATH,

 wait_time,

 WAIT_EVENT_RECOVERY_RETRIEVE_RETRY_INTERVAL);
-   
ResetLatch(>recoveryWakeupLatch);
+   ResetLatch(MyLatch);
now = GetCurrentTimestamp();
}
last_fail_time = now;
@@ -12500,11 +12492,11 @@ 

Re: Logical Replication - detail message with names of missing columns

2020-10-05 Thread Bharath Rupireddy
Thanks Amit for the review comments. I will post an updated patch soon.

On Mon, Oct 5, 2020 at 9:39 AM Amit Kapila  wrote:
>
> 6. I think we should add one test case for this in the existing
> regression test (src/test/subscription/t/008_diff_schema).
>

This patch logs a message with the missing column names in the subscriber's
server log. Is there a way to see that log in these tests and use it as the
expected result for the test case?

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Custom compression methods

2020-10-05 Thread Dilip Kumar
On Mon, Oct 5, 2020 at 5:53 PM Tomas Vondra
 wrote:
>
> On Mon, Oct 05, 2020 at 11:17:28AM +0530, Dilip Kumar wrote:
> >On Mon, Oct 5, 2020 at 3:37 AM Tomas Vondra
> > wrote:
> >
> >Thanks, Tomas for your feedback.
> >
> >> 9) attcompression ...
> >>
> >> The main issue I see is what the patch does with attcompression. Instead
> >> of just using it to store a the compression method, it's also used to
> >> store the preserved compression methods. And using NameData to store
> >> this seems wrong too - if we really want to store this info, the correct
> >> way is either using text[] or inventing charvector or similar.
> >
> >The reason for using the NameData is the get it in the fixed part of
> >the data structure.
> >
>
> Why do we need that? It's possible to have varlena fields with direct
> access (see pg_index.indkey for example).

I see.  While making it NameData I was thinking whether we have an
option to directly access the varlena. Thanks for pointing me there. I
will change this.

> Adding NameData just to make
> it fixed-length means we're always adding 64B even if we just need a
> single byte, which means ~30% overhead for the FormData_pg_attribute.
> That seems a bit unnecessary, and might be an issue with many attributes
> (e.g. with many temp tables, etc.).

You are right.  Even I did not like to keep 64B for this, so I will change it.

>
> >> But to me this seems very much like a misuse of attcompression to track
> >> dependencies on compression methods, necessary because we don't have a
> >> separate catalog listing compression methods. If we had that, I think we
> >> could simply add dependencies between attributes and that catalog.
> >
> >Basically, up to this patch, we are having only built-in compression
> >methods and those can not be dropped so we don't need any dependency
> >at all.  We just want to know what is the current compression method
> >and what is the preserve compression methods supported for this
> >attribute.  Maybe we can do it better instead of using the NameData
> >but I don't think it makes sense to add a separate catalog?
> >
>
> Sure, I understand what the goal was - all I'm saying is that it looks
> very much like a workaround needed because we don't have the catalog.
>
> I don't quite understand how could we support custom compression methods
> without listing them in some sort of catalog?

Yeah for supporting custom compression we need some catalog.

> >> Moreover, having the catalog would allow adding compression methods
> >> (from extensions etc) instead of just having a list of hard-coded
> >> compression methods. Which seems like a strange limitation, considering
> >> this thread is called "custom compression methods".
> >
> >I think I forgot to mention while submitting the previous patch that
> >the next patch I am planning to submit is, Support creating the custom
> >compression methods wherein we can use pg_am catalog to insert the new
> >compression method.  And for dependency handling, we can create an
> >attribute dependency on the pg_am row.  Basically, we will create the
> >attribute dependency on the current compression method AM as well as
> >on the preserved compression methods AM.   As part of this, we will
> >add two build-in AMs for zlib and pglz, and the attcompression field
> >will be converted to the oid_vector (first OID will be of the current
> >compression method, followed by the preserved compression method's
> >oids).
> >
>
> Hmmm, ok. Not sure pg_am is the right place - compression methods don't
> quite match what I though AMs are about, but maybe it's just my fault.
>
> FWIW it seems a bit strange to first do the attcompression magic and
> then add the catalog later - I think we should start with the catalog
> right away. The advantage is that if we end up committing only some of
> the patches in this cycle, we already have all the infrastructure etc.
> We can reorder that later, though.

Hmm, yeah, we can do it this way as well: first create a new catalog
table and add entries for these two built-in methods, and attcompression
can store the oid vector.  But if we only commit the built-in compression
methods part, does it make sense to create an extra catalog, or should we
add these built-in methods to the existing catalog (if we plan to use
pg_am)?  Then in attcompression, instead of using one byte for each
preserved compression method, we would need to use an oid.  From Robert's
mail[1], it appeared to me that he wants the built-in compression methods
part to be independently committable, and if we think from that
perspective then adding a catalog doesn't make much sense.  But if we are
planning to commit the custom method part as well, then it makes more
sense to start with the catalog directly, because that way it will be
easy to expand without much refactoring.

[1] 
https://www.postgresql.org/message-id/CA%2BTgmobSDVgUage9qQ5P_%3DF_9jaMkCgyKxUQGtFQU7oN4kX-AA%40mail.gmail.com

> >> 10) compression parameters?
> >>
> >> I wonder if we 

Re: Online checksums patch - once again

2020-10-05 Thread Álvaro Herrera
On 2020-Oct-05, Heikki Linnakangas wrote:

> The code in sendFile() in basebackup.c seems suspicious in that regard. It
> calls DataChecksumsNeedVerify() once before starting to read the file. Isn't
> it possible for the checksums flag to change while it's reading the file and
> sending it to the client? I hope there are CHECK_FOR_INTERRUPTS() calls
> buried somewhere in the loop, because it could take minutes to send the
> whole file.
> 
> I would feel better if the state transition of the "checksums" flag could
> only happen in a few safe places, or there were some other safeguards for
> this. I think that's what Andres was trying to say earlier in the thread on
> ProcSignalBarriers. I'm not sure what the interface to that should be. It
> could be something like HOLD/RESUME_INTERRUPTS(), where normally all
> procsignals are handled on CHECK_FOR_INTERRUPTS(), but you could "hold off"
> some if needed. Or something else. Or maybe we can just use
> HOLD/RESUME_INTERRUPTS() for this. It's more coarse-grained than necessary,
> but probably doesn't matter in practice.

I hope you're not suggesting that interrupts would be held for the whole
transmission of a file, which you say could take minutes.  If we do have
an interrupt holdoff, then it has to be pretty short; users (and
systemd) despair if service shutdown is delayed more than a few seconds.





Re: Improve choose_custom_plan for initial partition prune case

2020-10-05 Thread Ashutosh Bapat
On Thu, Oct 1, 2020 at 9:34 PM Andy Fan  wrote:
>
> Given the plan example:
>
> CREATE TABLE measurement (
> city_id int not null,
> logdate date not null,
> peaktempint,
> unitsales   int
> ) PARTITION BY RANGE (logdate);
>
> CREATE TABLE measurement_y2006m02 PARTITION OF measurement
> FOR VALUES FROM ('2006-02-01') TO ('2006-03-01');
>
> CREATE TABLE measurement_y2006m03 PARTITION OF measurement
> FOR VALUES FROM ('2006-03-01') TO ('2006-04-01');
>
> prepare s as select * from measurement where logdate = $1;
> execute s('2006-02-01').
>
> The generic plan will probably not be chosen because it doesn't reduce the 
> cost
> which can be reduced at initial_prune while the custom plan reduces such cost
> at  planning time. which makes the cost comparison not fair.  I'm thinking if 
> we can
> get an estimated cost reduction of initial_prunne for generic plan based on 
> the
> partition pruned at plan time from custom plan and then reducing
> such costs from the generic plan.  I just went through the related code but
> didn't write anything now.  I'd like to see if this is a correct direction to 
> go.

I think the end result will depend upon the values passed to the first
few executions of the prepared plan. If they happen to have estimates
similar to or higher than the generic plan's, the generic plan will be
chosen later. But if they happen to be way lower than the generic
plan's estimates, a custom plan might be chosen. What happens when we
execute plans with values that have estimates similar to the generic
plan later, once we moderate generic plan costs based on the custom
plans?

If the table has good distribution of a partition key, which also
results in good distribution of data across partitions, generic plan
cost will be similar to the custom plan costs. If not that's something
we want to fix. But if there's a large data skew, probably letting the
custom plan always win is better. [1] talks about generic plan being
not chosen just because it has higher cost even though its shape is
similar to a custom plan. Leveraging that idea might be a good option.
If the custom plans produced have shapes similar to the generic plan,
stop producing those.

[1] 
https://www.postgresql.org/message-id/CA%2BTgmoZHYoAL4HYwnGO25B8CxCB%2BvNMdf%2B7rbUzYykR4sU9yUA%40mail.gmail.com

-- 
Best Wishes,
Ashutosh Bapat




Re: Partitionwise join fails under GEQO

2020-10-05 Thread Ashutosh Bapat
On Mon, Oct 5, 2020 at 6:47 AM Tom Lane  wrote:
>
> If you run the core regression tests with geqo_threshold = 2
> (injected, say, via ALTER SYSTEM SET), you will notice the
> earlier tests showing some join ordering differences, which
> is unsurprising ... but when it gets to partition_join, that
> test just dumps core.
>
> Investigation shows that the crash is due to a corrupt EquivalenceClass
> member.  It gets that way because add_child_join_rel_equivalences
> is unaware of the planner's memory allocation policies, and feels
> free to mess with the EC data structures during join planning.
> When GEQO's transient memory context is thrown away after one round
> of GEQO planning, some still-referenced EC data goes with it,
> resulting in a crash during the next round.
>
> I band-aided around that with the attached patch, which is enough
> to prevent the crash, but it's entirely unsatisfactory as a production
> solution.  The problem is that each GEQO round will re-add the same
> child EC members, since add_child_join_rel_equivalences has absolutely
> no concept that the members it wants might be there already.  Since
> GEQO tends to use a lot of rounds, this can run to significant memory
> bloat, not to mention a pretty serious speed hit while EC operations
> thumb through all of those duplicate expressions.
>
> Now that I've seen this, I wonder whether add_child_join_rel_equivalences
> might not be making duplicate EC entries even without GEQO.  Is there any
> guarantee that it's not called repeatedly on related join-rel sets?

build_child_join_rel() is the only caller of
add_child_join_rel_equivalences(). Before the former calls the latter,
the former has an Assert
 899 Assert(!find_join_rel(root, joinrel->relids));
This means that for a given child join rel this function is called
only once, implying that add_child_join_rel_equivalences() is called
only once for a given child join rel. Initially I thought that this is
enough to not add duplicate child members. But to me the use of
bms_overlap() in that function looks doubtful. Thinking aloud, let's
say there's an EC member with em_relids = {1, 2}, 1 and 2 being relids of
two partitioned tables. Let's assume that there's a third partitioned
table with relid = 3. When a child join rel between children, say 6
and 7, of 1 and 2 respectively was produced, the EC member with em_relids =
{1,2} was translated to produce an EC member with em_relids = {6, 7}
and em_is_child = true. But when we add a child join rel between
(6, 7, 8) of a join between (1, 2, 3), bms_overlap({1, 2}, {1, 2, 3})
will return true and we will add one more member with em_relids = {6,
7}. So your suspicion is true. I think using
bms_is_subset(top_parent_relids, em->em_relids) should fix that
problem. In addition, adding an Assert to make sure that we are not
adding multiple instances of the same child EC member should suffice.
Thumbing through all the child EC members to eliminate duplicates
would be much costlier when a huge number of partitions are involved.
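For reference, in add_child_join_rel_equivalences() that would amount to
replacing the bms_overlap() test with something like the following (whether
this is the right test is exactly what is in question here):

/* only translate members fully pertinent to this child joinrel */
if (bms_is_subset(top_parent_relids, cur_em->em_relids))
{
	/* ... generate the transformed child version of cur_em ... */
}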

On a related but different subject, I have been thinking on and off about
the need for expression translation while planning partitions. The
translated expressions differ only in the leaf Vars, and for most
partitioned tables really only in the varno (assuming that a partitioned
table and its partitions will have the same layouts when a large number
of partitions are created automatically). If we could introduce a new
type of (polymorphic) Var node which represents not just the parent Var
but the child Vars as well, the whole need for expression translation
would vanish, reducing memory requirements many-fold. Such a Var node
would indicate which varnos it covers and, for a group of varnos, which
varattno it represents. In most cases, where the parent and the child
share the same varattno, all the varnos form a single set. Whole-row Var
references pose a problem, since a parent's whole-row reference
translates to a RowExpr containing child Var references, but probably
using varattno = 0 for the children will suffice. As far as my thoughts
go, this will work in the planner, but when translating the plan tree
into the execution tree we will have to produce different execution-time
Var nodes. That should be fine since we will be dealing with only a
single plan. If we could fix that somehow, well and good.
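Purely as an illustration of that idea (no such node exists today), such a
node might look like:

typedef struct PolymorphicVar
{
	Expr		xpr;
	Relids		varnos;			/* all the relids this Var stands for */
	AttrNumber	varattno;		/* attribute number shared by the group */
	Oid			vartype;		/* pg_type OID of the Var's datatype */
	int32		vartypmod;		/* typmod, or -1 */
	Oid			varcollid;		/* collation, or InvalidOid */
} PolymorphicVar;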

-- 
Best Wishes,
Ashutosh Bapat




Re: [HACKERS] Custom compression methods

2020-10-05 Thread Tomas Vondra

On Mon, Oct 05, 2020 at 11:17:28AM +0530, Dilip Kumar wrote:

On Mon, Oct 5, 2020 at 3:37 AM Tomas Vondra
 wrote:

Thanks, Tomas for your feedback.


9) attcompression ...

The main issue I see is what the patch does with attcompression. Instead
of just using it to store a the compression method, it's also used to
store the preserved compression methods. And using NameData to store
this seems wrong too - if we really want to store this info, the correct
way is either using text[] or inventing charvector or similar.


The reason for using the NameData is the get it in the fixed part of
the data structure.



Why do we need that? It's possible to have varlena fields with direct
access (see pg_index.indkey for example). Adding NameData just to make
it fixed-length means we're always adding 64B even if we just need a
single byte, which means ~30% overhead for the FormData_pg_attribute.
That seems a bit unnecessary, and might be an issue with many attributes
(e.g. with many temp tables, etc.).


But to me this seems very much like a misuse of attcompression to track
dependencies on compression methods, necessary because we don't have a
separate catalog listing compression methods. If we had that, I think we
could simply add dependencies between attributes and that catalog.


Basically, up to this patch, we are having only built-in compression
methods and those can not be dropped so we don't need any dependency
at all.  We just want to know what is the current compression method
and what is the preserve compression methods supported for this
attribute.  Maybe we can do it better instead of using the NameData
but I don't think it makes sense to add a separate catalog?



Sure, I understand what the goal was - all I'm saying is that it looks
very much like a workaround needed because we don't have the catalog.

I don't quite understand how could we support custom compression methods
without listing them in some sort of catalog?


Moreover, having the catalog would allow adding compression methods
(from extensions etc) instead of just having a list of hard-coded
compression methods. Which seems like a strange limitation, considering
this thread is called "custom compression methods".


I think I forgot to mention while submitting the previous patch that
the next patch I am planning to submit is, Support creating the custom
compression methods wherein we can use pg_am catalog to insert the new
compression method.  And for dependency handling, we can create an
attribute dependency on the pg_am row.  Basically, we will create the
attribute dependency on the current compression method AM as well as
on the preserved compression methods AM.   As part of this, we will
add two build-in AMs for zlib and pglz, and the attcompression field
will be converted to the oid_vector (first OID will be of the current
compression method, followed by the preserved compression method's
oids).



Hmmm, ok. Not sure pg_am is the right place - compression methods don't
quite match what I thought AMs are about, but maybe it's just my fault.

FWIW it seems a bit strange to first do the attcompression magic and
then add the catalog later - I think we should start with the catalog
right away. The advantage is that if we end up committing only some of
the patches in this cycle, we already have all the infrastructure etc.
We can reorder that later, though.


10) compression parameters?

I wonder if we could/should allow parameters, like compression level
(and maybe other stuff, depending on the compression method). PG13
allowed that for opclasses, so perhaps we should allow it here too.


Yes, that is also in the plan.   For doing this we are planning to add
an extra column in the pg_attribute which will store the compression
options for the current compression method. The original patch was
creating an extra catalog pg_column_compression, therein it maintains
the oid of the compression method as well as the compression options.
The advantage of creating an extra catalog is that we can keep the
compression options for the preserved compression methods also so that
we can support the options which can be used for decompressing the
data as well.  Whereas if we want to avoid this extra catalog then we
can not use that compression option for decompressing.  But most of
the options e.g. compression level are just for the compressing so it
is enough to store for the current compression method only.  What's
your thoughts?



Not sure. My assumption was we'd end up with a new catalog, but maybe
stashing it into pg_attribute is fine. I was really thinking about two
kinds of options - compression level, and some sort of column-level
dictionary. Compression level is not necessary for decompression, but
the dictionary ID would be needed. (I think the global dictionary was
one of the use cases, aimed at JSON compression.)

But I don't think stashing it in pg_attribute means we couldn't use it
for decompression - we'd just need to keep an array of options, 

Re: Online checksums patch - once again

2020-10-05 Thread Heikki Linnakangas

Replying to an older message in this thread:


+ /*
+ * If we reach this point with checksums in inprogress state, we notify
+ * the user that they need to manually restart the process to enable
+ * checksums. This is because we cannot launch a dynamic background worker
+ * directly from here, it has to be launched from a regular backend.
+ */
+ if (ControlFile->data_checksum_version == PG_DATA_CHECKSUM_INPROGRESS_VERSION)
+ ereport(WARNING,
+ (errmsg("checksum state is \"inprogress\" with no worker"),
+ errhint("Either disable or enable checksums by calling the
pg_disable_data_checksums() or pg_enable_data_checksums()
functions.")));

This seems pretty half-baked.


I don't disagree with that.  However, given that enabling checksums is a pretty
intensive operation it seems somewhat unfriendly to automatically restart.  As
a DBA I wouldn't want that to kick off without manual intervention, but there
is also the risk of this being missed due to assumptions that it would restart.
Any ideas on how to treat this?

If/when we can restart the processing where it left off, without the need to go
over all data again, things might be different wrt the default action.


The later patch versions do support restarting, so I think we should
revisit this issue. I would expect the checksums worker to be
automatically started at postmaster startup. Can we make that happen?


- Heikki




Re: [Patch] Optimize dropping of relation buffers using dlist

2020-10-05 Thread Amit Kapila
On Mon, Oct 5, 2020 at 3:04 PM k.jami...@fujitsu.com
 wrote:
>
> On Monday, October 5, 2020 3:30 PM, Amit Kapila wrote:
>
> > + for (i = 0; i < nforks; i++)
> > + {
> > + /* Get the total nblocks for a relation's fork */ nForkBlocks =
> > + smgrcachednblocks(smgr_reln, forkNum[i]);
> > +
> > + if (nForkBlocks == InvalidBlockNumber) { nTotalBlocks =
> > + InvalidBlockNumber; break; } nTotalBlocks += nForkBlocks;
> > + nBlocksToInvalidate = nTotalBlocks - firstDelBlock[i]; }
> > +
> > + /*
> > + * Do explicit hashtable probe if the total of nblocks of relation's
> > + forks
> > + * is not invalid and the nblocks to be invalidated is less than the
> > + * full-scan threshold of buffer pool.  Otherwise, full scan is executed.
> > + */
> > + if (nTotalBlocks != InvalidBlockNumber && nBlocksToInvalidate <
> > + BUF_DROP_FULL_SCAN_THRESHOLD) { for (j = 0; j < nforks; j++) {
> > + BlockNumber curBlock;
> > +
> > + nForkBlocks = smgrcachednblocks(smgr_reln, forkNum[j]);
> > +
> > + for (curBlock = firstDelBlock[j]; curBlock < nForkBlocks; curBlock++)
> >
> > What if one or more of the forks doesn't have cached value? I think the 
> > patch
> > will skip such forks and will invalidate/unpin buffers for others.
>
> Not having a cached value is equivalent to InvalidBlockNumber, right?
> Maybe I'm missing something? But in the first loop we are already doing the
> pre-check of whether or not one of the forks doesn't have cached value.
> If it's not cached, then the nTotalBlocks is set to InvalidBlockNumber so we
> won't need to enter the optimization loop and just execute the full scan 
> buffer
> invalidation process.
>

oh, I have missed that, so the existing code will work fine for that case.

> > You probably
> > need a local array of nForkBlocks which will be formed first time and then
> > used in the second loop. You also in some way need to handle the case where
> > that local array doesn't have cached blocks.
>
> Understood. that would be cleaner.
> BlockNumber nForkBlocks[MAX_FORKNUM];
>
> As for handling whether the local array is empty, I think the first loop 
> would cover it,
> and there's no need to pre-check if the array is empty again in the second 
> loop.
> for (i = 0; i < nforks; i++)
> {
> nForkBlocks[i] = smgrcachednblocks(smgr_reln, forkNum[i]);
>
> if (nForkBlocks[i] == InvalidBlockNumber)
> {
> nTotalBlocks = InvalidBlockNumber;
> break;
> }
> nTotalBlocks += nForkBlocks[i];
> nBlocksToInvalidate = nTotalBlocks - firstDelBlock[i];
> }
>

This appears okay.
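To make the intended shape concrete, here is a rough sketch using such a
local array (illustrative only, not the actual patch; the per-fork
accumulation of nBlocksToInvalidate is an assumption of this sketch):

	int			i, j;
	BlockNumber	curBlock;
	BlockNumber	nForkBlocks[MAX_FORKNUM];
	BlockNumber	nTotalBlocks = 0;
	BlockNumber	nBlocksToInvalidate = 0;

	/* First pass: fall back to a full scan if any fork's size is not cached. */
	for (i = 0; i < nforks; i++)
	{
		nForkBlocks[i] = smgrcachednblocks(smgr_reln, forkNum[i]);
		if (nForkBlocks[i] == InvalidBlockNumber)
		{
			nTotalBlocks = InvalidBlockNumber;
			break;
		}
		nTotalBlocks += nForkBlocks[i];
		nBlocksToInvalidate += nForkBlocks[i] - firstDelBlock[i];
	}

	/* Second pass: per-block hash probes only when everything was cached and small. */
	if (nTotalBlocks != InvalidBlockNumber &&
		nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD)
	{
		for (j = 0; j < nforks; j++)
			for (curBlock = firstDelBlock[j]; curBlock < nForkBlocks[j]; curBlock++)
			{
				/* probe the buffer mapping hash for (rnode, forkNum[j], curBlock) */
			}
	}
	else
	{
		/* scan the whole buffer pool as before */
	}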

> > 2. Also, the other thing is I have asked for some testing to avoid the small
> > regression we have for a smaller number of shared buffers which I don't see
> > the results nor any change in the code. I think it is better if you post the
> > pending/open items each time you post a new version of the patch.
>
> Ah. Apologies for forgetting to include updates about that, but since I keep 
> on updating
> the patch I've decided not to post results yet as performance may vary per 
> patch-update
> due to possible bugs.
> But for the performance case of not using recovery check, I just removed it 
> from below.
> Does it meet the intention?
>
> BlockNumber smgrcachednblocks(SMgrRelation reln, ForkNumber forknum) {
> -   if (InRecovery && reln->smgr_cached_nblocks[forknum] != InvalidBlockNumber)
> +   if (reln->smgr_cached_nblocks[forknum] != InvalidBlockNumber)
>         return reln->smgr_cached_nblocks[forknum];
>

Yes, we can do that for the purpose of testing.

-- 
With Regards,
Amit Kapila.




Re: Support for OUT parameters in procedures

2020-10-05 Thread Pavel Stehule
po 5. 10. 2020 v 11:46 odesílatel Peter Eisentraut <
peter.eisentr...@2ndquadrant.com> napsal:

> On 2020-09-29 08:23, Pavel Stehule wrote:
> > This was an important issue if I remember well.  Passing mandatory NULL
> > as OUT arguments solves this issue.
> > I fully agree so OUT arguments are part of the procedure's signature.
> > Unfortunately, there is another difference
> > from functions, but I don't think so there is a better solution, and we
> > should live with it. I think it can work well.
>
> This has been committed.
>
> > I found one issue. The routine for selecting function or procedure based
> > on signature should be fixed.
> >
> > CREATE OR REPLACE PROCEDURE public.procp(OUT integer)
> >   LANGUAGE plpgsql
> > AS $procedure$
> > BEGIN
> >$1 := 10;
> > END;
> > $procedure$
> >
> > DO
> > $$
> > DECLARE n numeric;
> > BEGIN
> >CALL procp(n);
> >RAISE NOTICE '%', n;
> > END;
> > $$;
> > ERROR:  procedure procp(numeric) does not exist
> > LINE 1: CALL procp(n)
> >   ^
> > HINT:  No procedure matches the given name and argument types. You might
> > need to add explicit type casts.
> > QUERY:  CALL procp(n)
> > CONTEXT:  PL/pgSQL function inline_code_block line 4 at CALL
>
> This is normal; there is no implicit cast from numeric to int.  The same
> error happens if you call a function foo(int) with foo(42::numeric).
>

this is an OUT argument - so the direction is reversed - and an implicit
cast from int to numeric does exist.


> > postgres=# create or replace procedure px(anyelement, out anyelement)
> > as $$
> > begin
> >$2 := $1;
> > end;
> > $$ language plpgsql;
> >
> > postgres=# call px(10, null);
> > ERROR:  cannot display a value of type anyelement
> >
> > but inside plpgsql it works
> > do $$
> > declare xx int;
> > begin
> >call px(10, xx);
> >raise notice '%', xx;
> > end;
> > $$;
>
> This might be worth further investigation, but since it happens also
> with INOUT parameters, it seems orthogonal to this patch.
>

yes - this breaks using a varchar variable against a text argument, although
these types are almost identical.



> --
> Peter Eisentraut  http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>


Re: Parallel INSERT (INTO ... SELECT ...)

2020-10-05 Thread Dilip Kumar
On Mon, Oct 5, 2020 at 4:53 PM Amit Kapila  wrote:
>
> On Mon, Oct 5, 2020 at 4:26 PM Bharath Rupireddy
>  wrote:
> >
> > On Wed, Sep 30, 2020 at 7:38 AM Greg Nancarrow  wrote:
> > >
> >
> > I have one question which is common to both this patch and parallel
> > inserts in CTAS[1], do we need to skip creating tuple
> > queues(ExecParallelSetupTupleQueues) as we don't have any tuples
> > that's being shared from workers to leader?
> >
>
> As far as this patch is concerned we might need to return tuples when
> there is a Returning clause. I think for the cases where we don't need
> to return tuples we might want to skip creating these queues if it is
> feasible without too many changes.

+1

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Online checksums patch - once again

2020-10-05 Thread Heikki Linnakangas

I looked at patch v22, and I can see two main issues:

1. The one that Robert talked about earlier: A backend checks the local 
"checksums" state. If it's 'off', it writes a page without checksums. 
How do you guarantee that the local state doesn't change in between? The 
implicit assumption seems to be that there MUST NOT be any 
CHECK_FOR_INTERRUPTS() calls between DataChecksumsNeedWrite() and the 
write (or read and DataChecksumsNeedVerify()).


In most code, the DataChecksumsNeedWrite() call is very close to writing 
out the page, often in the same critical section. But this is an 
undocumented assumption.


The code in sendFile() in basebackup.c seems suspicious in that regard. 
It calls DataChecksumsNeedVerify() once before starting to read the 
file. Isn't it possible for the checksums flag to change while it's 
reading the file and sending it to the client? I hope there are 
CHECK_FOR_INTERRUPTS() calls buried somewhere in the loop, because it 
could take minutes to send the whole file.


I would feel better if the state transition of the "checksums" flag 
could only happen in a few safe places, or there were some other 
safeguards for this. I think that's what Andres was trying to say 
earlier in the thread on ProcSignalBarriers. I'm not sure what the 
interface to that should be. It could be something like 
HOLD/RESUME_INTERRUPTS(), where normally all procsignals are handled on 
CHECK_FOR_INTERRUPTS(), but you could "hold off" some if needed. Or 
something else. Or maybe we can just use HOLD/RESUME_INTERRUPTS() for 
this. It's more coarse-grained than necessary, but probably doesn't 
matter in practice.


At minimum, there needs to be comments in DataChecksumsNeedWrite() and 
DataChecksumsNeedVerify(), instructing how to use them safely. Namely, 
you must ensure that there are no interrupts between the 
DataChecksumsNeedWrite() and writing out the page, or between reading 
the page and the DataChecksumsNeedVerify() call. You can achieve that 
with HOLD_INTERRUPTS() or a critical section, or simply ensuring that 
there is no substantial code in between that could call 
CHECK_FOR_INTERRUPTS(). And sendFile() in basebackup.c needs to be fixed.


Perhaps you could have "Assert(InteruptHoldOffCount > 0)" in 
DataChecksumsNeedWrite() and DataChecksumsNeedVerify()? There could be 
other ways that callers could avoid the TOCTOU issue, but it would 
probably catch most of the unsafe call patterns, and you could always 
wrap the DataChecksumsNeedWrite/verify() call in a dummy 
HOLD_INTERRUPTS() block to work around the assertion if you know what 
you're doing.
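For illustration, a minimal sketch of the write-side convention being
suggested (the page variables and the smgrwrite() call are placeholders for
whatever surrounds the real write, not code from the patch):

	/*
	 * No CHECK_FOR_INTERRUPTS() window between the NeedWrite test and the
	 * actual write; here that is enforced with HOLD/RESUME_INTERRUPTS().
	 */
	HOLD_INTERRUPTS();

	if (DataChecksumsNeedWrite())
		PageSetChecksumInplace(page, blocknum);

	smgrwrite(reln, forknum, blocknum, (char *) page, false);

	RESUME_INTERRUPTS();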



2. The signaling between enable_data_checksums() and the launcher 
process looks funny to me. The general idea seems to be that 
enable_data_checksums() just starts the launcher process, and the 
launcher process figures out what it need to do and makes all the 
changes to the global state. But then there's this violation of the 
idea: enable_data_checksums() checks DataChecksumsOnInProgress(), and 
tells the launcher process whether it should continue a previously 
crashed operation or start from scratch. I think it would be much 
cleaner if the launcher process figured that out itself, and 
enable_data_checksums() would just tell the launcher what the target 
state is.


enable_data_checksums() and disable_data_checksums() seem prone to race 
conditions. If you call enable_data_checksums() in two backends 
concurrently, depending on the timing, there are two possible outcomes:


a) One call returns true, and launches the background process. The other 
call returns false.


b) Both calls return true, but one of them emits a "NOTICE: data 
checksums worker is already running".


In disable_data_checksum() imagine what happens if another backend calls 
enable_data_checksums() in between the 
ShutdownDatachecksumsWorkerIfRunning() and SetDataChecksumsOff() calls.




/*
 * Mark the buffer as dirty and force a full page write.  We have to
 * re-write the page to WAL even if the checksum hasn't changed,
 * because if there is a replica it might have a slightly different
 * version of the page with an invalid checksum, caused by unlogged
 * changes (e.g. hintbits) on the master happening while checksums
 * were off. This can happen if there was a valid checksum on the page
 * at one point in the past, so only when checksums are first on, then
 * off, and then turned on again. Iff wal_level is set to "minimal",
 * this could be avoided iff the checksum is calculated to be correct.
 */
START_CRIT_SECTION();
MarkBufferDirty(buf);
log_newpage_buffer(buf, false);
END_CRIT_SECTION();


It's really unfortunate that we have to dirty the page even if the 
checksum already happens to match. Could we only do the 

Re: deferred primary key and logical replication

2020-10-05 Thread Amit Kapila
On Mon, May 11, 2020 at 2:41 AM Euler Taveira
 wrote:
>
> Hi,
>
> While looking at an old wal2json issue, I stumbled on a scenario that a table
> with a deferred primary key is not updatable in logical replication. AFAICS it
> has been like that since the beginning of logical decoding and seems to be an
> oversight while designing logical decoding.
>

I am not sure if it is an oversight because we document that the index
must be non-deferrable, see "USING INDEX records the old values of the
columns covered by the named index, which must be unique, not partial,
not deferrable, and include only columns marked NOT NULL." in docs
[1].

Now, sure, this constraint applies when we use USING INDEX for REPLICA
IDENTITY, but why does it have to be different for PRIMARY KEY,
especially when a UNIQUE constraint will have similar behavior and the
same is documented?


[1] - https://www.postgresql.org/docs/devel/sql-altertable.html

-- 
With Regards,
Amit Kapila.




Re: Retry Cached Remote Connections for postgres_fdw in case remote backend gets killed/goes away

2020-10-05 Thread Bharath Rupireddy
On Mon, Oct 5, 2020 at 9:45 AM Fujii Masao  wrote:
>
> > I see the errmsg() with plain texts in other places in the code base
> > as well. Is it  that we look at the error message and if it is a plain
> > text(without database objects or table data), we decide to have no
> > translation? Or is there any other policy?
>
> I was thinking that elog() basically should be used to report this
> debug message, instead, but you used ereport() because maybe
> you'd like to add detail message about connection error. Is this
> understanding right? elog() uses errmsg_internal().
>

Yes that's correct.

>
> So if ereport() is used as an alternative to elog() for some reason,
> IMO errmsg_internal() should be used. Thought?
>

Yes, this is apt for our case.

>
> OTOH, the messages you mentioned are not debug ones,
> so basically ereport() and errmsg() should be used, I think.
>

In the connection.c file, yes, they are of ERROR type. It looks like it's not
a standard to use errmsg_internal for DEBUG messages that require no
translation with ereport:

(errmsg("wrote block details for %d blocks", num_blocks)));
(errmsg("MultiXact member stop limit is now %u based on MultiXact %u
(errmsg("oldest MultiXactId member is at offset %u",

However, there are few other places, where errmsg_internal is used for
DEBUG purposes.

(errmsg_internal("finished verifying presence of "
(errmsg_internal("%s(%d) name: %s; blockState:

Having said that, IMHO it's better to keep the way it is currently in
the code base.
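Just to illustrate the pattern being discussed, a debug-level report with an
untranslated message could look roughly like this (the DEBUG level and the
variable names are placeholders borrowed from a postgres_fdw-like context,
not the actual patch):

	ereport(DEBUG3,
			(errmsg_internal("discarding cached connection to server \"%s\"",
							 server->servername),
			 errdetail_internal("%s", pchomp(PQerrorMessage(entry->conn)))));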

>
> > Thanks. Attaching v10 patch. Please have a look.
>
> Thanks for updating the patch! I will mark the patch as ready for committer 
> in CF.
> Barring any objections, I will commit that.
>

Thanks a lot for the review comments.

> >>
> >>> I have another question not related to this patch: though we have
> >>> wait_pid() function, we are not able to use it like
> >>> pg_terminate_backend() in other modules, wouldn't be nice if we can
> >>> make it generic under the name pg_wait_pid() and usable across all pg
> >>> modules?
> >>
> >> I thought that, too. But I could not come up with good idea for *real* use 
> >> case
> >> of that function. At least that's useful for the regression test, though.
> >> Anyway, IMO it's worth proposing that and hearing more opinions about that
> >> from other hackers.
> >>
> >
> > Yes it will be useful for testing when coupled with
> > pg_terminate_backend(). I will post the idea in a separate thread soon
> > for more thoughts.
>
> Sounds good!
> ISTM that the function should at least check that the target process is a
> PostgreSQL one.
>

Thanks. I will take care of this point.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: Parallel INSERT (INTO ... SELECT ...)

2020-10-05 Thread Amit Kapila
On Mon, Oct 5, 2020 at 4:26 PM Bharath Rupireddy
 wrote:
>
> On Wed, Sep 30, 2020 at 7:38 AM Greg Nancarrow  wrote:
> >
>
> I have one question which is common to both this patch and parallel
> inserts in CTAS[1], do we need to skip creating tuple
> queues(ExecParallelSetupTupleQueues) as we don't have any tuples
> that's being shared from workers to leader?
>

As far as this patch is concerned we might need to return tuples when
there is a Returning clause. I think for the cases where we don't need
to return tuples we might want to skip creating these queues if it is
feasible without too many changes.

-- 
With Regards,
Amit Kapila.




Re: Parallel INSERT (INTO ... SELECT ...)

2020-10-05 Thread Dilip Kumar
On Mon, Oct 5, 2020 at 4:26 PM Bharath Rupireddy
 wrote:
>
> On Wed, Sep 30, 2020 at 7:38 AM Greg Nancarrow  wrote:
> >
> > > >
> > > > I think you still need to work on the costing part, basically if we
> > > > are parallelizing whole insert then plan is like below
> > > >
> > > > -> Gather
> > > >   -> Parallel Insert
> > > >   -> Parallel Seq Scan
> > > >
> > > > That means the tuple we are selecting via scan are not sent back to
> > > > the gather node, so in cost_gather we need to see if it is for the
> > > > INSERT then there is no row transferred through the parallel queue
> > > > that mean we need not to pay any parallel tuple cost.
> > >
> > > I just looked into the parallel CTAS[1] patch for the same thing, and
> > > I can see in that patch it is being handled.
> > >
> > > [1] 
> > > https://www.postgresql.org/message-id/CALj2ACWFq6Z4_jd9RPByURB8-Y8wccQWzLf%2B0-Jg%2BKYT7ZO-Ug%40mail.gmail.com
> > >
> >
> > Hi Dilip,
> >
> > You're right, the costing for Parallel Insert is not done and
> > finished, I'm still working on the costing, and haven't posted an
> > updated patch for it yet.
> > As far as cost_gather() method is concerned, for Parallel INSERT, it
> > can probably use the same costing approach as the CTAS patch except in
> > the case of a specified RETURNING clause.
> >
>
> I have one question which is common to both this patch and parallel
> inserts in CTAS[1], do we need to skip creating tuple
> queues(ExecParallelSetupTupleQueues) as we don't have any tuples
> that's being shared from workers to leader? Put it another way, do we
> use the tuple queue for sharing any info other than tuples from
> workers to leader?

Ideally, we don't need the tuple queue unless we want to transfer the
tuple to the gather node.
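A minimal sketch of that costing idea (an assumption for illustration, not
committed behavior; "rows" is the number of tuples actually passed through
the Gather node):

	/*
	 * In cost_gather(), charge the per-tuple transfer cost only when rows
	 * are really sent from workers to the leader; a parallel INSERT
	 * without RETURNING would pass zero here.
	 */
	if (rows > 0)
		run_cost += parallel_tuple_cost * rows;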

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Parallel INSERT (INTO ... SELECT ...)

2020-10-05 Thread Bharath Rupireddy
On Wed, Sep 30, 2020 at 7:38 AM Greg Nancarrow  wrote:
>
> > >
> > > I think you still need to work on the costing part, basically if we
> > > are parallelizing whole insert then plan is like below
> > >
> > > -> Gather
> > >   -> Parallel Insert
> > >   -> Parallel Seq Scan
> > >
> > > That means the tuple we are selecting via scan are not sent back to
> > > the gather node, so in cost_gather we need to see if it is for the
> > > INSERT then there is no row transferred through the parallel queue
> > > that mean we need not to pay any parallel tuple cost.
> >
> > I just looked into the parallel CTAS[1] patch for the same thing, and
> > I can see in that patch it is being handled.
> >
> > [1] 
> > https://www.postgresql.org/message-id/CALj2ACWFq6Z4_jd9RPByURB8-Y8wccQWzLf%2B0-Jg%2BKYT7ZO-Ug%40mail.gmail.com
> >
>
> Hi Dilip,
>
> You're right, the costing for Parallel Insert is not done and
> finished, I'm still working on the costing, and haven't posted an
> updated patch for it yet.
> As far as cost_gather() method is concerned, for Parallel INSERT, it
> can probably use the same costing approach as the CTAS patch except in
> the case of a specified RETURNING clause.
>

I have one question which is common to both this patch and parallel
inserts in CTAS[1], do we need to skip creating tuple
queues(ExecParallelSetupTupleQueues) as we don't have any tuples
that's being shared from workers to leader? Put it another way, do we
use the tuple queue for sharing any info other than tuples from
workers to leader?

[1] 
https://www.postgresql.org/message-id/CALj2ACWFq6Z4_jd9RPByURB8-Y8wccQWzLf%2B0-Jg%2BKYT7ZO-Ug%40mail.gmail.com

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Use standard SIGHUP and SIGTERM handlers in autoprewarm module

2020-10-05 Thread Bharath Rupireddy
Hi,

The autoprewarm module is using its own SIGHUP (apw_sighup_handler,
got_sighup) and SIGTERM (apw_sigterm_handler, got_sigterm) handlers, which are
similar to the standard signal handlers (except for a difference [1]). Isn't it
good to remove them and use the standard SignalHandlerForConfigReload and
SignalHandlerForShutdownRequest?

Attaching the patch for the above changes.

Looks like the commit[2] replaced custom handlers with standard handlers.

Thoughts?

[1] apw_sigterm_handler() and apw_sighup_handler() use MyProc->procLatch:
if (MyProc)
    SetLatch(&MyProc->procLatch);
whereas the standard handlers use MyLatch:
SetLatch(MyLatch);
Both MyProc->procLatch and MyLatch point to the same latch; see the comment in global.c:
/*
 * MyLatch points to the latch that should be used for signal handling by the
 * current process. It will either point to a process local latch if the
 * current process does not have a PGPROC entry in that moment, or to
 * PGPROC->procLatch if it has.  Thus it can always be used in signal handlers,
 * without checking for its existence.
 */
struct Latch *MyLatch;

(gdb) p MyProc->procLatch
$6 = {is_set = 0, is_shared = true, owner_pid = 1448807}
(gdb) p MyLatch
$7 = (struct Latch *) 0x7fcacc6d902c
(gdb) p &MyProc->procLatch
$8 = (Latch *) 0x7fcacc6d902c
(gdb) p *MyLatch
$9 = {is_set = 0, is_shared = true, owner_pid = 1448807}
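For reference, the standard handlers being switched to amount to roughly the
following (a paraphrased sketch of src/backend/postmaster/interrupt.c, not
quoted verbatim):

/*
 * Set a shared "pending" flag and wake the process via MyLatch, which for a
 * background worker with a PGPROC entry is the same as &MyProc->procLatch.
 */
void
SignalHandlerForShutdownRequest(SIGNAL_ARGS)
{
	int			save_errno = errno;

	ShutdownRequestPending = true;
	SetLatch(MyLatch);

	errno = save_errno;
}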

[2] commit 1e53fe0e70f610c34f4c9e770d108cd94151342c
Author: Robert Haas 
Date:   2019-12-17 13:03:57 -0500

Use PostgresSigHupHandler in more places.

There seems to be no reason for every background process to have
its own flag indicating that a config-file reload is needed.
Instead, let's just use ConfigFilePending for that purpose
everywhere.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
From 7659d36c227fb8193bc76c03f8f215f0a3ac3bb5 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Sat, 3 Oct 2020 09:40:05 +0530
Subject: [PATCH v1] Use Standard SIGTERM,SIGHUP Handlers In AutoPrewarm

Replace apw_sigterm_handler and apw_sighup_handler with standard
signal handlers SignalHandlerForShutdownRequest and
SignalHandlerForConfigReload.
---
 contrib/pg_prewarm/autoprewarm.c | 49 
 1 file changed, 6 insertions(+), 43 deletions(-)

diff --git a/contrib/pg_prewarm/autoprewarm.c b/contrib/pg_prewarm/autoprewarm.c
index c32ddc56fd..93e526ef62 100644
--- a/contrib/pg_prewarm/autoprewarm.c
+++ b/contrib/pg_prewarm/autoprewarm.c
@@ -35,6 +35,7 @@
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/bgworker.h"
+#include "postmaster/interrupt.h"
 #include "storage/buf_internals.h"
 #include "storage/dsm.h"
 #include "storage/ipc.h"
@@ -94,12 +95,6 @@ static void apw_start_database_worker(void);
 static bool apw_init_shmem(void);
 static void apw_detach_shmem(int code, Datum arg);
 static int	apw_compare_blockinfo(const void *p, const void *q);
-static void apw_sigterm_handler(SIGNAL_ARGS);
-static void apw_sighup_handler(SIGNAL_ARGS);
-
-/* Flags set by signal handlers */
-static volatile sig_atomic_t got_sigterm = false;
-static volatile sig_atomic_t got_sighup = false;
 
 /* Pointer to shared-memory state. */
 static AutoPrewarmSharedState *apw_state = NULL;
@@ -161,8 +156,8 @@ autoprewarm_main(Datum main_arg)
 	TimestampTz last_dump_time = 0;
 
 	/* Establish signal handlers; once that's done, unblock signals. */
-	pqsignal(SIGTERM, apw_sigterm_handler);
-	pqsignal(SIGHUP, apw_sighup_handler);
+	pqsignal(SIGTERM, SignalHandlerForShutdownRequest);
+	pqsignal(SIGHUP, SignalHandlerForConfigReload);
 	pqsignal(SIGUSR1, procsignal_sigusr1_handler);
 	BackgroundWorkerUnblockSignals();
 
@@ -206,12 +201,12 @@ autoprewarm_main(Datum main_arg)
 	}
 
 	/* Periodically dump buffers until terminated. */
-	while (!got_sigterm)
+	while (!ShutdownRequestPending)
 	{
 		/* In case of a SIGHUP, just reload the configuration. */
-		if (got_sighup)
+		if (ConfigReloadPending)
 		{
-			got_sighup = false;
+			ConfigReloadPending = false;
 			ProcessConfigFile(PGC_SIGHUP);
 		}
 
@@ -895,35 +890,3 @@ apw_compare_blockinfo(const void *p, const void *q)
 
 	return 0;
 }
-
-/*
- * Signal handler for SIGTERM
- */
-static void
-apw_sigterm_handler(SIGNAL_ARGS)
-{
-	int			save_errno = errno;
-
-	got_sigterm = true;
-
-	if (MyProc)
-		SetLatch(&MyProc->procLatch);
-
-	errno = save_errno;
-}
-
-/*
- * Signal handler for SIGHUP
- */
-static void
-apw_sighup_handler(SIGNAL_ARGS)
-{
-	int			save_errno = errno;
-
-	got_sighup = true;
-
-	if (MyProc)
-		SetLatch(&MyProc->procLatch);
-
-	errno = save_errno;
-}
-- 
2.25.1



Re: Support for OUT parameters in procedures

2020-10-05 Thread Peter Eisentraut

On 2020-09-29 08:23, Pavel Stehule wrote:
This was an important issue if I remember well.  Passing mandatory NULL 
as OUT arguments solves this issue.
I fully agree so OUT arguments are part of the procedure's signature. 
Unfortunately, there is another difference
from functions, but I don't think so there is a better solution, and we 
should live with it. I think it can work well.


This has been committed.

I found one issue. The routine for selecting function or procedure based 
on signature should be fixed.


CREATE OR REPLACE PROCEDURE public.procp(OUT integer)
  LANGUAGE plpgsql
AS $procedure$
BEGIN
   $1 := 10;
END;
$procedure$

DO
$$
DECLARE n numeric;
BEGIN
   CALL procp(n);
   RAISE NOTICE '%', n;
END;
$$;
ERROR:  procedure procp(numeric) does not exist
LINE 1: CALL procp(n)
              ^
HINT:  No procedure matches the given name and argument types. You might 
need to add explicit type casts.

QUERY:  CALL procp(n)
CONTEXT:  PL/pgSQL function inline_code_block line 4 at CALL


This is normal; there is no implicit cast from numeric to int.  The same 
error happens if you call a function foo(int) with foo(42::numeric).



postgres=# create or replace procedure px(anyelement, out anyelement)
as $$
begin
   $2 := $1;
end;
$$ language plpgsql;

postgres=# call px(10, null);
ERROR:  cannot display a value of type anyelement

but inside plpgsql it works
do $$
declare xx int;
begin
   call px(10, xx);
   raise notice '%', xx;
end;
$$;


This might be worth further investigation, but since it happens also 
with INOUT parameters, it seems orthogonal to this patch.


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: small cleanup: unify scanstr() functions

2020-10-05 Thread John Naylor
>
>
> On Sun, Oct 4, 2020 at 4:20 PM Tom Lane  wrote:
> > I didn't try very hard to trace the commit history, but I did note that
> > almost all of the removed #includes dated to 1996 or so.  I'm surprised
> > they survived Bruce's occasional attempts at removing unused #includes;
> > maybe his script didn't check .y/.l files.
>
> That is some ancient detritus! Thanks for committing.
>
>


RE: [Patch] Optimize dropping of relation buffers using dlist

2020-10-05 Thread k.jami...@fujitsu.com
On Monday, October 5, 2020 3:30 PM, Amit Kapila wrote:

> + for (i = 0; i < nforks; i++)
> + {
> + /* Get the total nblocks for a relation's fork */ nForkBlocks =
> + smgrcachednblocks(smgr_reln, forkNum[i]);
> +
> + if (nForkBlocks == InvalidBlockNumber) { nTotalBlocks =
> + InvalidBlockNumber; break; } nTotalBlocks += nForkBlocks;
> + nBlocksToInvalidate = nTotalBlocks - firstDelBlock[i]; }
> +
> + /*
> + * Do explicit hashtable probe if the total of nblocks of relation's
> + forks
> + * is not invalid and the nblocks to be invalidated is less than the
> + * full-scan threshold of buffer pool.  Otherwise, full scan is executed.
> + */
> + if (nTotalBlocks != InvalidBlockNumber && nBlocksToInvalidate <
> + BUF_DROP_FULL_SCAN_THRESHOLD) { for (j = 0; j < nforks; j++) {
> + BlockNumber curBlock;
> +
> + nForkBlocks = smgrcachednblocks(smgr_reln, forkNum[j]);
> +
> + for (curBlock = firstDelBlock[j]; curBlock < nForkBlocks; curBlock++)
> 
> What if one or more of the forks doesn't have cached value? I think the patch
> will skip such forks and will invalidate/unpin buffers for others. 

Not having a cached value is equivalent to InvalidBlockNumber, right?
Maybe I'm missing something? But in the first loop we are already doing the
pre-check of whether or not one of the forks doesn't have cached value.
If it's not cached, then the nTotalBlocks is set to InvalidBlockNumber so we
won't need to enter the optimization loop and just execute the full scan buffer
invalidation process.

> You probably
> need a local array of nForkBlocks which will be formed first time and then
> used in the second loop. You also in some way need to handle the case where
> that local array doesn't have cached blocks.

Understood. that would be cleaner.
BlockNumber nForkBlocks[MAX_FORKNUM];

As for handling whether the local array is empty, I think the first loop would 
cover it,
and there's no need to pre-check if the array is empty again in the second loop.
for (i = 0; i < nforks; i++)
{
nForkBlocks[i] = smgrcachednblocks(smgr_reln, forkNum[i]);

if (nForkBlocks[i] == InvalidBlockNumber)
{
nTotalBlocks = InvalidBlockNumber;
break;
}
nTotalBlocks += nForkBlocks[i];
nBlocksToInvalidate = nTotalBlocks - firstDelBlock[i];
}

> 2. Also, the other thing is I have asked for some testing to avoid the small
> regression we have for a smaller number of shared buffers which I don't see
> the results nor any change in the code. I think it is better if you post the
> pending/open items each time you post a new version of the patch.

Ah. Apologies for forgetting to include updates about that, but since I keep on 
updating
the patch I've decided not to post results yet as performance may vary per 
patch-update
due to possible bugs.
But for the performance case of not using recovery check, I just removed it 
from below.
Does it meet the intention?

BlockNumber smgrcachednblocks(SMgrRelation reln, ForkNumber forknum) {
-   if (InRecovery && reln->smgr_cached_nblocks[forknum] != InvalidBlockNumber)
+   if (reln->smgr_cached_nblocks[forknum] != InvalidBlockNumber)
        return reln->smgr_cached_nblocks[forknum];

Regards,
Kirk Jamison


Re: Implementing Incremental View Maintenance

2020-10-05 Thread Yugo NAGATA
Hi,

Attached is the rebased patch (v18) to add support for Incremental
Materialized View Maintenance (IVM). It can be applied to the current
latest master branch.

Also, this now supports simple CTEs (WITH clauses) that do not contain
aggregates or DISTINCT, similarly to simple sub-queries. This feature is
provided as an additional patch segment "0010-Add-CTE-support-in-IVM.patch".

 Example 

cte=# TABLE r;
 i | v  
---+
 1 | 10
 2 | 20
(2 rows)

cte=# TABLE s;
 i |  v  
---+-----
 2 | 200
 3 | 300
(2 rows)

cte=# \d+ mv
  Materialized view "public.mv"
 Column |  Type   | Collation | Nullable | Default | Storage | Stats target | Description 
--------+---------+-----------+----------+---------+---------+--------------+-------------
 r      | integer |           |          |         | plain   |              | 
 x      | integer |           |          |         | plain   |              | 
View definition:
 WITH x AS (
 SELECT s.i,
s.v
   FROM s
)
 SELECT r.v AS r,
x.v AS x
   FROM r,
x
  WHERE r.i = x.i;
Access method: heap
Incremental view maintenance: yes

cte=# SELECT * FROM mv;
 r  |  x  
----+-----
 20 | 200
(1 row)

cte=# INSERT INTO r VALUES (3,30);
INSERT 0 1
cte=# INSERT INTO s VALUES (1,100);
INSERT 0 1
cte=# SELECT * FROM mv;
 r  |  x  
----+-----
 20 | 200
 30 | 300
 10 | 100
(3 rows)

==

Regards,
Yugo Nagata


-- 
Yugo NAGATA 


IVM_patches_v18.tar.gz
Description: application/gzip


Re: Resetting spilled txn statistics in pg_stat_replication

2020-10-05 Thread Amit Kapila
On Mon, Oct 5, 2020 at 1:26 PM Masahiko Sawada
 wrote:
>
> On Sat, 3 Oct 2020 at 16:55, Amit Kapila  wrote:
> >
> > On Sat, Oct 3, 2020 at 9:26 AM Masahiko Sawada
> >  wrote:
> > >
> > > When we discussed this before, I was thinking that we could have other
> > > statistics for physical slots in the same statistics view in the
> > > future. Having the view show only logical slots also makes sense to me
> > > but I’m concerned a bit that we could break backward compatibility
> > > that monitoring tools etc will be affected when the view starts to
> > > show physical slots too.
> > >
> >
> > I think that would happen anyway as we need to add more columns in
> > view for the physical slots.
>
> I think it depends; adding more columns to the view might not break
> tools if the query used in the tool explicitly specifies columns.
>

What if it uses Select * ...? It might not be advisable to assume how
the user might fetch data.

> OTOH
> if the view starts to show more rows, the tool will need to have the
> condition to get the same result as before.
>
> >
> > > If the view shows only logical slots, it also
> > > might be worth considering to have separate views for logical slots
> > > and physical slots and having pg_stat_logical_replication_slots by
> > > this change.
> > >
> >
> > I am not sure at this stage but I think we will add the additional
> > stats for physical slots once we have any in this view itself. I would
> > like to avoid adding separate views if possible. The only reason to
> > omit physical slots at this stage is that we don't have any stats for
> > the same.
>
> I also prefer not to have separate views. I'm concerned about the
> compatibility I explained above but at the same time I agree that it
> doesn't make sense to show the stats always having nothing. Since
> given you and Dilip agreed on that, I also agree with that.
>

Okay.

> >
> > > Here is my review comment on the v7 patch.
> > >
> > > +   /*
> > > +* Set the same reset timestamp for all replication slots too.
> > > +*/
> > > +   for (i = 0; i < max_replication_slots; i++)
> > > +   replSlotStats[i].stat_reset_timestamp =
> > > globalStats.stat_reset_timestamp;
> > > +
> > >
> > > You added back the above code but since we clear the timestamps on
> > > creation of a new slot they are not shown:
> > >
> > > +   /* Register new slot */
> > > +   memset(&replSlotStats[nReplSlotStats], 0, sizeof(PgStat_ReplSlotStats));
> > > +   memcpy(&replSlotStats[nReplSlotStats].slotname, name, NAMEDATALEN);
> > >
> > > Looking at other statistics views such as pg_stat_slru,
> > > pg_stat_bgwriter, and pg_stat_archiver, they have a valid
> > > reset_timestamp value from the beginning. That's why I removed that
> > > code and assigned the timestamp when registering a new slot.
> > >
> >
> > Hmm, I don't think it is shown intentionally in those views. I think
> > what is happening in other views is that it has been initialized with
> > some value when we read the stats and then while updating and or
> > writing because we don't change the stat_reset_timestamp, it displays
> > the same value as initialized at the time of read. Now, because in
> > pgstat_replslot_index() we always initialize the replSlotStats it
> > would overwrite any previous value we have set during read and display
> > the stat_reset as empty for replication slots. If we stop initializing
> > the replSlotStats in pgstat_replslot_index() then we will see similar
> > behavior as other views have. So even if we want to change then
> > probably we should stop initialization in pgstat_replslot_index but I
> > don't think that is necessarily better behavior because the
> > description of the parameter doesn't indicate any such thing.
>
> Understood. I agreed that the newly created slot doesn't have
> reset_timestamp. Looking at pg_stat_database, a view whose rows are
> added dynamically unlike other stat views, the newly created database
> doesn't have reset_timestamp. But given we clear the stats for a slot
> at pgstat_replslot_index(), why do we need to initialize the
> reset_timestamp with globalStats.stat_reset_timestamp when reading the
> stats file? Even if we could not find any slot stats in the stats file
> the view won’t show anything.
>

It was mainly for a code consistency point of view. Also, we will
clear the data in pgstat_replslot_index only for new slots, not for
existing slots. It might be used when we can't load the statsfile as
per comment in code ("Set the current timestamp (will be kept only in
case we can't load an existing statsfile)).

-- 
With Regards,
Amit Kapila.




Re: 回复:how to create index concurrently on partitioned table

2020-10-05 Thread Michael Paquier
On Sat, Sep 26, 2020 at 02:56:55PM -0500, Justin Pryzby wrote:
> Also, if a partitioned index is clustered, when we clear indisclustered for
> other indexes, should we also propogate that to their parent indexes, if any ?

I am not sure what you mean here.  Each partition's cluster runs in
its own individual transaction based on the patch you sent.  Are you
suggesting to update indisclustered for the partitioned index of a
partitioned table and all its parent partitioned indexes in the same
transaction, aka a transaction working on the partitioned table?
Doesn't that mean that if we have a partition tree with multiple
layers then we end up doing the same operation multiple times for
the parents?

>> It would be good also to check if
>> we have a partition index tree that maps partially with a partition
>> table tree (aka no all table partitions have a partition index), where
>> these don't get clustered because there is no index to work on.
>
> This should not happen, since a incomplete partitioned index is "invalid".

Indeed, I did not know this property.  I can see also that you have
added a test for this case, so that's good if we can rely on that.  I
am still in the process of reviewing this patch, all this handling
around indisclustered makes it rather complex.
--
Michael


signature.asc
Description: PGP signature


Re: Resetting spilled txn statistics in pg_stat_replication

2020-10-05 Thread Masahiko Sawada
On Sat, 3 Oct 2020 at 16:55, Amit Kapila  wrote:
>
> On Sat, Oct 3, 2020 at 9:26 AM Masahiko Sawada
>  wrote:
> >
> > When we discussed this before, I was thinking that we could have other
> > statistics for physical slots in the same statistics view in the
> > future. Having the view show only logical slots also makes sense to me
> > but I’m concerned a bit that we could break backward compatibility
> > that monitoring tools etc will be affected when the view starts to
> > show physical slots too.
> >
>
> I think that would happen anyway as we need to add more columns in
> view for the physical slots.

I think it depends; adding more columns to the view might not break
tools if the query used in the tool explicitly specifies columns. OTOH
if the view starts to show more rows, the tool will need to have the
condition to get the same result as before.

>
> > If the view shows only logical slots, it also
> > might be worth considering to have separate views for logical slots
> > and physical slots and having pg_stat_logical_replication_slots by
> > this change.
> >
>
> I am not sure at this stage but I think we will add the additional
> stats for physical slots once we have any in this view itself. I would
> like to avoid adding separate views if possible. The only reason to
> omit physical slots at this stage is that we don't have any stats for
> the same.

I also prefer not to have separate views. I'm concerned about the
compatibility I explained above but at the same time I agree that it
doesn't make sense to show the stats always having nothing. Since
given you and Dilip agreed on that, I also agree with that.

>
> > Here is my review comment on the v7 patch.
> >
> > +   /*
> > +* Set the same reset timestamp for all replication slots too.
> > +*/
> > +   for (i = 0; i < max_replication_slots; i++)
> > +   replSlotStats[i].stat_reset_timestamp =
> > globalStats.stat_reset_timestamp;
> > +
> >
> > You added back the above code but since we clear the timestamps on
> > creation of a new slot they are not shown:
> >
> > +   /* Register new slot */
> > +   memset(&replSlotStats[nReplSlotStats], 0, sizeof(PgStat_ReplSlotStats));
> > +   memcpy(&replSlotStats[nReplSlotStats].slotname, name, NAMEDATALEN);
> >
> > Looking at other statistics views such as pg_stat_slru,
> > pg_stat_bgwriter, and pg_stat_archiver, they have a valid
> > reset_timestamp value from the beginning. That's why I removed that
> > code and assigned the timestamp when registering a new slot.
> >
>
> Hmm, I don't think it is shown intentionally in those views. I think
> what is happening in other views is that it has been initialized with
> some value when we read the stats and then while updating and or
> writing because we don't change the stat_reset_timestamp, it displays
> the same value as initialized at the time of read. Now, because in
> pgstat_replslot_index() we always initialize the replSlotStats it
> would overwrite any previous value we have set during read and display
> the stat_reset as empty for replication slots. If we stop initializing
> the replSlotStats in pgstat_replslot_index() then we will see similar
> behavior as other views have. So even if we want to change then
> probably we should stop initialization in pgstat_replslot_index but I
> don't think that is necessarily better behavior because the
> description of the parameter doesn't indicate any such thing.

Understood. I agreed that the newly created slot doesn't have
reset_timestamp. Looking at pg_stat_database, a view whose rows are
added dynamically unlike other stat views, the newly created database
doesn't have reset_timestamp. But given we clear the stats for a slot
at pgstat_replslot_index(), why do we need to initialize the
reset_timestamp with globalStats.stat_reset_timestamp when reading the
stats file? Even if we could not find any slot stats in the stats file
the view won’t show anything. And the reset_timestamp will be cleared
when receiving a slot stats.

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [PATCH] Automatic HASH and LIST partition creation

2020-10-05 Thread Pavel Borisov
>
> Sorry for not being clear earlier, I mean the partition name
> 'tablename_partnum' can conflict with any existing table name.
> As per the current implementation, if I do the following it results in the table
> name conflict.
>
> postgres=# create table tbl_test_5_1(i int);
> CREATE TABLE
> postgres=# CREATE TABLE tbl_test_5 (i int) PARTITION BY LIST((tbl_test_5))
>
> CONFIGURATION (values in
> ('(1)'::tbl_test_5), ('(3)'::tbl_test_5) default partition tbl_default_5);
> ERROR:  relation "tbl_test_5_1" already exists
>

Basically, it's the same thing as when you try to create two tables with the
same name. It is not specific to partition creation and is common to every
case where defaults are used: they can conflict with something existing.
And in this case the conflict is explicitly processed, as I see from the
output message.

In fact, in PG there are other places where names are generated in a default
way, e.g. in the aggregates regression test it is no surprise to find in PG13:

explain (costs off)
  select min(f1), max(f1) from minmaxtest;
 QUERY PLAN
-
 Result
   InitPlan 1 (returns $0)
 ->  Limit
   ->  Merge Append
 Sort Key: minmaxtest.f1
 ->  Index Only Scan using minmaxtesti on minmaxtest
minmaxtest_1
   Index Cond: (f1 IS NOT NULL)
 ->  Index Only Scan using minmaxtest1i on minmaxtest1
minmaxtest_2
   Index Cond: (f1 IS NOT NULL)
 ->  Index Only Scan Backward using minmaxtest2i on
minmaxtest2 minmaxtest_3
   Index Cond: (f1 IS NOT NULL)
 ->  Index Only Scan using minmaxtest3i on minmaxtest3
minmaxtest_4
   InitPlan 2 (returns $1)
 ->  Limit
   ->  Merge Append
 Sort Key: minmaxtest_5.f1 DESC
 ->  Index Only Scan Backward using minmaxtesti on
minmaxtest minmaxtest_6
   Index Cond: (f1 IS NOT NULL)
 ->  Index Only Scan Backward using minmaxtest1i on
minmaxtest1 minmaxtest_7
   Index Cond: (f1 IS NOT NULL)
 ->  Index Only Scan using minmaxtest2i on minmaxtest2
minmaxtest_8
   Index Cond: (f1 IS NOT NULL)
 ->  Index Only Scan Backward using minmaxtest3i on
minmaxtest3 minmaxtest_9

where the minmaxtest_ names are the temporary relation aliases
and minmaxtest are the real partition names (the latter naming is unrelated
to the former).

Overall I don't see much trouble in any form of automatic naming. But it
may be convenient to provide a fixed user-specified prefix for partition
names.

Thank you,
-- 
Best regards,
Pavel Borisov

Postgres Professional: http://postgrespro.com 


Re: [PATCH] Automatic HASH and LIST partition creation

2020-10-05 Thread Rahila Syed
Hi,

Couple of comments:
> 1. The syntax used omits the { IMMEDIATE | DEFERRED} keywords suggested in
> the earlier discussions. I think it is intuitive to include IMMEDIATE with
> the current implementation
> so that the syntax can be extended with a  DEFERRED clause in future for
> dynamic partitions.
>
>>   CREATE TABLE tbl_lst (i int) PARTITION BY LIST (i)
>>  CONFIGURATION (values in (1, 2), (3, 4) DEFAULT PARTITION tbl_default);
>
>
>
> After some consideration, I decided that we don't actually need to
> introduce IMMEDIATE | DEFERRED keyword. For hash and list partitions it
> will always be immediate, as the number of partitions cannot change after
> we initially set it. For range partitions, on the contrary, it doesn't make
> much sense to make partitions immediately, because in many use-cases one
> bound will be open.
>
>
As per discussions on this thread:
https://www.postgresql.org/message-id/flat/alpine.DEB.2.21.1907150711080.22273%40lancre
DEFERRED clause refers to creating partitions on the fly, while the data is
being inserted.
The number of partitions and partition bounds can be the same as specified
initially
during partitioned table creation, but the actual creation of
partitions can be deferred.
This seems like a potential extension to statically created partitions even
in the case of
hash and list partitions, as it won't involve moving any existing data.

> 2. One suggestion for generation of partition names is to append a
> unique id to avoid conflicts.
>
> Can you please give an example of such a conflict? I agree that current
> naming scheme is far from perfect, but I think that 'tablename'_partnum
> provides unique name for each partition.
>
>
> Sorry for not being clear earlier, I mean the partition name
'tablename_partnum' can conflict with any existing table name.
As per the current implementation, if I do the following it results in the table
name conflict.

postgres=# create table tbl_test_5_1(i int);
CREATE TABLE
postgres=# CREATE TABLE tbl_test_5 (i int) PARTITION BY LIST((tbl_test_5))

CONFIGURATION (values in
('(1)'::tbl_test_5), ('(3)'::tbl_test_5) default partition tbl_default_5);
ERROR:  relation "tbl_test_5_1" already exists

Thank you,
Rahila Syed

>


Re: Asynchronous Append on postgres_fdw nodes.

2020-10-05 Thread Etsuro Fujita
On Mon, Oct 5, 2020 at 1:30 PM Kyotaro Horiguchi
 wrote:
> At Sun, 4 Oct 2020 18:36:05 +0900, Etsuro Fujita  
> wrote in
> > It’s the contract of the ExecProcNode() API: if the result is NULL or
> > an empty slot, there is nothing more to do.  You changed it to
> > something like this: “even if the result is NULL or an empty slot,
> > there might be something more to do if AS_WAITING, so please wait in
> > that case”.  That seems pretty invasive to me.
>
> Yeah, it's "invasive" as I intended. I thought that the async-aware
> and async-capable nodes should interact using a channel defined as a
> part of the ExecProcNode API.  It was aiming at an increased affinity
> with a push-up executor framework.
>
> Since the current direction is committing this feature as an
> intermediate or tentative implementation, it sounds reasonable to avoid
> such a change.

OK  (Actually, I'm wondering if we could probably extend this to the
case where an Append is indirectly on top of a foreign scan node
without changing the ExecProcNode() API.)

> > You made lots of changes to the original patch by Robert, but I don’t
> > think those changes are all good; 1) as for the core part, you changed
> > his patch so that FDWs can interact with the core at execution time,
> > only through the ForeignAsyncConfigureWait() API, but that resulted in
> > an invasive change to the ExecProcNode() API as mentioned above, and
> > 2) as for the postgres_fdw part, you changed it so that postgres_fdw
> > can handle concurrent data fetches from multiple foreign scan nodes
> > using the same connection, but that would cause a performance issue
> > that I mentioned in [1].

> Yeah, I noticed such a possibility of fetch cascading; however, I
> think that the situation the feature is intended for is more
> common than the problem case.

I think a cleaner solution to that would be to support multiple
connections to the remote server...

> > So I think it would be better to use his patch rather as proposed
> > except for the postgres_fdw part and Thomas’ patch as a base patch for
> > that part.  As for your patch, I think we could use some part of it as
> > improvements.  One thing is the planner/executor changes that lead to
> > the improved efficiency discussed in [2][3].  Another would be to have
> > a separate ExecAppend() function for this feature like your patch to
> > avoid a performance penalty in the case of a plain old Append that
> > involves no FDWs with asynchronism optimization, if necessary.  I also
> > think we could probably use the  WaitEventSet-related changes in your
> > patch (i.e., the 0001 patch).
> >
> > Does that answer your question?
>
> Yes, thanks.  My comments about the direction are as above. Are
> you going to continue working on this patch?

Yes, if there are no objections from you or Thomas or Robert or anyone
else, I'll update Robert's patch as such.

Best regards,
Etsuro Fujita




Re: Prepared Statements

2020-10-05 Thread Heikki Linnakangas

On 02/10/2020 23:10, Patrick REED wrote:

Hi,

I am having a hard time pinning down which function creates a prepared
statement. Say in some language I create a prepared statement and send
it off. Before the first time I execute the prepared statement, which
function is the one that 'creates' the prepared statement? In other
words, which function stores it? I know that StorePreparedStatement will
cache it, but is there anything else?


e.g.
In your favorite language:

String statement = "Insert into table_one values 10";
PreparedStatement insert = con.prepareStatement(statement);
insert.execute();


The very first time, does it store this just in the plan cache, or does it
do something different to 'know' it has stored a prepared statement, so
that next time it will invoke it?


Most drivers use what is called the Extended Query Protocol. The client first 
sends a Parse message that contains the SQL text. Next, it sends a Bind 
message that contains the query parameters, and Execute to execute it. 
The Bind+Execute steps can be repeated multiple times, with different 
query parameters. The PREPARE and EXECUTE statements do essentially the 
same thing.


In the server code, there is the plan cache. The plan cache tracks when 
a plan needs to be invalidated and the query replanned. The handle to an 
entry in the plan cache is a CachedPlanSource, which contains the SQL 
original and enough information to (re-)plan the query as needed. The 
plan cache has entries for all the prepared statements, but also for 
statements in PL/pgSQL functions, statements prepared with SPI_prepare() 
etc. The plan cache doesn't know or care where the statements came from, 
they are all treated the same.


A prepared statement has a name and a CachedPlanSource. They are stored 
in a hash table. See StorePreparedStatement() function. If you grep for 
callers of StorePreparedStatement(), you'll see that there are two: one 
in processing an EXECUTE statement, and one in handling the Extended 
Query Protocol.
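For a client-side illustration of the same Parse/Bind/Execute flow, here is a
small self-contained libpq sketch (the connection string and the table name
from the question are assumptions; error handling is minimal):

#include <stdio.h>
#include <stdlib.h>
#include <libpq-fe.h>

int
main(void)
{
	PGconn	   *conn = PQconnectdb("dbname=postgres");	/* assumed conninfo */
	const char *values[1] = {"10"};
	PGresult   *res;

	if (PQstatus(conn) != CONNECTION_OK)
	{
		fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
		exit(1);
	}

	/* Parse: creates the named prepared statement on the server. */
	res = PQprepare(conn, "ins", "INSERT INTO table_one VALUES ($1)", 1, NULL);
	PQclear(res);

	/* Bind + Execute: can be repeated with different parameter values. */
	res = PQexecPrepared(conn, "ins", 1, values, NULL, NULL, 0);
	PQclear(res);

	PQfinish(conn);
	return 0;
}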


- Heikki




Re: [Patch] Optimize dropping of relation buffers using dlist

2020-10-05 Thread Amit Kapila
On Mon, Oct 5, 2020 at 6:59 AM k.jami...@fujitsu.com
 wrote:
>
> On Friday, October 2, 2020 11:45 AM, Horiguchi-san wrote:
>
> > Thaks for the new version.
>
> Thank you for your thoughtful reviews!
> I've attached an updated patch addressing the comments below.
>

Few comments:
===
1.
@@ -2990,10 +3002,80 @@ DropRelFileNodeBuffers(RelFileNodeBackend
rnode, ForkNumber *forkNum,
  return;
  }

+ /*
+ * Get the total number of cached blocks and to-be-invalidated blocks
+ * of the relation.  The cached value returned by smgrcachednblocks
+ * could be smaller than the actual number of existing buffers of the
+ * file.  This is caused by buggy Linux kernels that might not have
+ * accounted the recent write.  If a fork's nblocks is invalid, exit loop.
+ */
+ for (i = 0; i < nforks; i++)
+ {
+ /* Get the total nblocks for a relation's fork */
+ nForkBlocks = smgrcachednblocks(smgr_reln, forkNum[i]);
+
+ if (nForkBlocks == InvalidBlockNumber)
+ {
+ nTotalBlocks = InvalidBlockNumber;
+ break;
+ }
+ nTotalBlocks += nForkBlocks;
+ nBlocksToInvalidate = nTotalBlocks - firstDelBlock[i];
+ }
+
+ /*
+ * Do explicit hashtable probe if the total of nblocks of relation's forks
+ * is not invalid and the nblocks to be invalidated is less than the
+ * full-scan threshold of buffer pool.  Otherwise, full scan is executed.
+ */
+ if (nTotalBlocks != InvalidBlockNumber &&
+ nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD)
+ {
+ for (j = 0; j < nforks; j++)
+ {
+ BlockNumber curBlock;
+
+ nForkBlocks = smgrcachednblocks(smgr_reln, forkNum[j]);
+
+ for (curBlock = firstDelBlock[j]; curBlock < nForkBlocks; curBlock++)

What if one or more of the forks doesn't have cached value? I think
the patch will skip such forks and will invalidate/unpin buffers for
others. You probably need a local array of nForkBlocks which will be
formed first time and then used in the second loop. You also in some
way need to handle the case where that local array doesn't have cached
blocks.

2. Also, the other thing is I have asked for some testing to avoid the
small regression we have for a smaller number of shared buffers which
I don't see the results nor any change in the code. I think it is
better if you post the pending/open items each time you post a new
version of the patch.

-- 
With Regards,
Amit Kapila.