date:20211206

On Tue, Dec 7, 2021 at 10:52 AM Amit Langote  wrote:
>
> On Tue, Dec 7, 2021 at 1:01 PM Amit Kapila  wrote:
> > On Mon, Dec 6, 2021 at 8:59 PM Amit Langote  wrote:
> > > On Mon, Dec 6, 2021 at 6:05 PM Amit Kapila  
> > > wrote:
> > > > Your concern is not very clear to me. Can you be more specific in what
> > > > kind of problem you see with such a design for row filtering?
> > >
> > > I guess my characterization of it is still vague but the problem I see
> > > is that we'll be adding one more feature that allows manipulating
> > > partitions directly, which means more system catalogs than there
> > > already are that know about partitions as objects.  I remember being
> > > deliberate about avoiding pg_publication_rel entries for all
> > > partitions when a partitioned table is added to a publication for that
> > > very reason.
> >
> > The same will be true even after this new feature.
>
> Oh, interesting.
>
> > > Is covering the use case of defining different filters for each
> > > partition important for developing that feature?
> > >
> >
> > Users can only define the filters for a relation only when they want
> > to add a particular partition to the publication. Say, if only root
> > table is added to the publication (and pubviaroot is true) then we
> > don't need row filters to be defined for partitions. Basically, there
> > will be only one additional column in pg_publication_rel for the row
> > filter.
>
> Okay, so you are talking about allowing setting filters on a partition
> that is added *explicitly* to a publication.  I guess that's the only
> way to do such a thing given the syntax of FOR TABLE / ADD TABLE.
> Filters specified on a partitioned table that is added to a
> publication apply equally to all partitions, depending on the
> publish_via_partition_root configuration.
>

Yes.

>  So if a partition is
> explicitly present in a publication with a filter defined on it, I
> suppose following are the possible scenarios:
>
> * publish_via_partition_root=true
>
> 1. If the root table is not present in the publication (perhaps
> unlikely), use the partition's filter because there's nothing else to
> consider.
>
> 2. If the root table is present in the publication (very likely), use
> the root's filter, ignoring/overriding the partition's own if any.
>
> * publish_via_partition_root=false
>
> 1. If the root table is not present in the publication, use the
> partition's filter because there's nothing else to consider.
>
> 2. If the root table is present in the publication, use the
> partition's own filter if present, else root's.
>

I have not tested/checked each of these scenarios individually but I
think it is something like, if publish_via_partition_root is false
then we will always try to use partitions row filter and if it is not
there then we don't use any filter. Similarly, if
publish_via_partition_root is true, then we always try to use a
partitioned table row filter and if it is not there (either because
the partitioned table is not part of the publication or because there
is no filter defined for it) then we don't use any filter.

-- 
With Regards,
Amit Kapila.

Re: Add sub-transaction overflow status in pg_stat_activity

2021-12-06 Thread Justin Pryzby

> +  
> +   subxact_count xid
> +  
> +  
> +   The current backend's active subtransactions count.

subtransaction (no s)

> +   Set to true if current backend's subtransaction cache is overflowed.

Say "has overflowed"

> + if (local_beentry->subxact_count > 0)
> + {
> + values[30] = local_beentry->subxact_count;
> + values[31] = local_beentry->subxact_overflowed;
> + }
> + else
> + {
> + nulls[30] = true;
> + nulls[31] = true;
> + }

Why is the subxact count set to NULL instead of zero ?

You added this to pg_stat_activity, which already has a lot of fields.
We talked a few months ago about not adding more fields that weren't commonly
used.
https://www.postgresql.org/message-id/flat/20210426191811.sp3o77doinphyjhu%40alap3.anarazel.de#d96d0a116f0344301eead2676ea65b2e

Since I think this field is usually not interesting to most users of
pg_stat_activity, maybe this should instead be implemented as a function like
pg_backend_get_subxact_status(pid).

People who want to could use it like:
SELECT * FROM pg_stat_activity psa, pg_backend_get_subxact_status(pid) sub;

-- 
Justin

Re: pg_get_publication_tables() output duplicate relid

On Tue, Dec 7, 2021 at 1:01 PM Amit Kapila  wrote:
> On Mon, Dec 6, 2021 at 8:59 PM Amit Langote  wrote:
> > On Mon, Dec 6, 2021 at 6:05 PM Amit Kapila  wrote:
> > > Your concern is not very clear to me. Can you be more specific in what
> > > kind of problem you see with such a design for row filtering?
> >
> > I guess my characterization of it is still vague but the problem I see
> > is that we'll be adding one more feature that allows manipulating
> > partitions directly, which means more system catalogs than there
> > already are that know about partitions as objects.  I remember being
> > deliberate about avoiding pg_publication_rel entries for all
> > partitions when a partitioned table is added to a publication for that
> > very reason.
>
> The same will be true even after this new feature.

Oh, interesting.

> > Is covering the use case of defining different filters for each
> > partition important for developing that feature?
> >
>
> Users can only define the filters for a relation only when they want
> to add a particular partition to the publication. Say, if only root
> table is added to the publication (and pubviaroot is true) then we
> don't need row filters to be defined for partitions. Basically, there
> will be only one additional column in pg_publication_rel for the row
> filter.

Okay, so you are talking about allowing setting filters on a partition
that is added *explicitly* to a publication.  I guess that's the only
way to do such a thing given the syntax of FOR TABLE / ADD TABLE.
Filters specified on a partitioned table that is added to a
publication apply equally to all partitions, depending on the
publish_via_partition_root configuration.  So if a partition is
explicitly present in a publication with a filter defined on it, I
suppose following are the possible scenarios:

* publish_via_partition_root=true

1. If the root table is not present in the publication (perhaps
unlikely), use the partition's filter because there's nothing else to
consider.

2. If the root table is present in the publication (very likely), use
the root's filter, ignoring/overriding the partition's own if any.

* publish_via_partition_root=false

1. If the root table is not present in the publication, use the
partition's filter because there's nothing else to consider.

2. If the root table is present in the publication, use the
partition's own filter if present, else root's.

> >  I feel that defining
> > one for the root table and just using that for all partitions (albeit
> > translating as needed to account for possibly different attribute
> > numbers) would be enough for most users' use cases.
> >
>
> Sure, but what if the user just wants to publish only one of the
> partitions? I think in that case along with adding the partition to
> publication, we need to allow specifying row filter.

Yeah, I understand now that we're talking about partitions that are
added explicitly.

Thanks for your patience.

-- 
Amit Langote
EDB: http://www.enterprisedb.com

Re: Fix a bug in DecodeAbort() and improve input data check on subscriber.

On Tue, Dec 7, 2021 at 6:06 AM Masahiko Sawada  wrote:
>
> Hi all,
>
> While updating the patch I recently posted[1] to make pg_waldump
> report replication origin ID, LSN, and timestamp, I found a bug that
> replication origin timestamp is not set in ROLLBACK PREPARED case.
> Commit 8bdb1332eb5 (CC'ed Amit) added an argument to
> ReorderBufferFinishPrepared() but in ROLLBACK PREPARED case, the
> caller specified it at the wrong position:
>
> @@ -730,6 +730,7 @@ DecodeCommit(LogicalDecodingContext *ctx,
> XLogRecordBuffer *buf,
> if (two_phase)
> {
> ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr,
> buf->endptr,
> +
> SnapBuildInitialConsistentPoint(ctx->snapshot_builder),
> commit_time, origin_id, origin_lsn,
> parsed->twophase_gid, true);
> }
> @@ -868,6 +869,7 @@ DecodeAbort(LogicalDecodingContext *ctx,
> XLogRecordBuffer *buf,
> {
> ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr,
> buf->endptr,
> abort_time, origin_id, origin_lsn,
> +   InvalidXLogRecPtr,
> parsed->twophase_gid, false);
> }
>
> This affects the value of rollback_data.rollback_time on the
> subscriber, resulting in setting the wrong timestamp to both the
> replication origin timestamp and the error callback argument on the
> subscriber. I've attached the patch to fix it.
>

Thanks for the report and patches. I see this is a problem and the
first patch will fix it. I'll test the same and review another patch
as well.

-- 
With Regards,
Amit Kapila.

Re: pg_get_publication_tables() output duplicate relid

On Tue, Dec 7, 2021 at 12:45 PM Amit Kapila  wrote:
> On Mon, Dec 6, 2021 at 8:04 PM Amit Langote  wrote:
> > So IIUC the scenario of concern is when a table to be attached as a
> > partition is in a schema that's present in pg_publication_namespace.
> > The only way to stop it from being published is to move it to another
> > schema that is not published using that publication.
> >
> > I think I misunderstood how the IN SCHEMA feature works.
> > Specifically, I didn't know that one can add a partitioned table to
> > the same publication (or any table other than those in a particular
> > schema for that matter).  Then the attached partition would still be
> > present in the publication by way of being part of the schema that is
> > present in the publication, along with the partitioned table that is
> > added separately.
>
> Right.
>
> > Yes, my proposal in its current form can't prevent that kind of duplication.
>
> I am not sure how to proceed here. I feel it is better to go ahead
> with the fix Hou-san proposed here and in another email [1] to fix the
> know issues, especially because the issue discussed in [1] needs to be
> back-patched.
>
> We can evaluate your proposal separately for HEAD. What
> do you think?

Yeah, maybe.  Though given the direction that the row-filters patch
set is taking in allowing to definw filters on the individual
partitions, I am not sure if I should be pushing the approach to
disallow partitions from being added to publications explicitly
alongside their parent tables.  I'll try to take a look at that thread
to be sure if that's actually the case.

Also, for the purposes of the problems that Greg and Hou-san have
discovered, I have no objection with applying Hou-san's patches.
Those seem better for the back-patching anyway.

Thank you.

-- 
Amit Langote
EDB: http://www.enterprisedb.com

Re: Add sub-transaction overflow status in pg_stat_activity

2021-12-06 Thread Nikolay Samokhvalov

On Mon, Dec 6, 2021 at 8:16 PM Dilip Kumar  wrote:

> If the subtransaction cache is overflowed in some of the transactions
> then it will affect all the concurrent queries as they need to access
> the SLRU for checking the visibility of each tuple.  But currently
> there is no way to identify whether in any backend subtransaction is
> overflowed or what is the current active subtransaction count.

I think it's a good idea – had the same need when recently researching
various issues with subtransactions [1], needed to patch Postgres in
benchmarking environments. To be fair, there is a way to understand that
the overflowed state is reached for PG 13+ – on standbys, observe reads in
Subtrans in pg_stat_slru. But of course, it's an indirect way.

I see that the patch adds two new columns to pg_stat_activity:
subxact_count and subxact_overflowed. This should be helpful to have.
Additionally, exposing the lastOverflowedXid value would be also good for
troubleshooting of subtransaction edge and corner cases – a bug recently
fixed in all current versions [2] was really tricky to troubleshoot in
production because this value is not visible to DBAs.

[1]
https://postgres.ai/blog/20210831-postgresql-subtransactions-considered-harmful
[2] https://commitfest.postgresql.org/36/3399/

Re: Add sub-transaction overflow status in pg_stat_activity

2021-12-06 Thread Zhihong Yu

On Mon, Dec 6, 2021 at 8:17 PM Dilip Kumar  wrote:

> If the subtransaction cache is overflowed in some of the transactions
> then it will affect all the concurrent queries as they need to access
> the SLRU for checking the visibility of each tuple.  But currently
> there is no way to identify whether in any backend subtransaction is
> overflowed or what is the current active subtransaction count.
> Attached patch adds subtransaction count and subtransaction overflow
> status in pg_stat_activity.  I have implemented this because of the
> recent complain about the same[1]
>
> [1]
> https://www.postgresql.org/message-id/CAFiTN-t5BkwdHm1bV8ez64guWZJB_Jjhb7arsQsftxEwpYwObg%40mail.gmail.com
>
>
> --
> Regards,
> Dilip Kumar
> EnterpriseDB: http://www.enterprisedb.com


Hi,

bq. there is a no way to

Extra 'a' before no.

+* Number of active subtransaction in the current session.

subtransaction -> subtransactions

+* Whether subxid count overflowed in the current session.

It seems 'count' can be dropped from the sentence.

Cheers

Add sub-transaction overflow status in pg_stat_activity

2021-12-06 Thread Dilip Kumar

If the subtransaction cache is overflowed in some of the transactions
then it will affect all the concurrent queries as they need to access
the SLRU for checking the visibility of each tuple.  But currently
there is no way to identify whether in any backend subtransaction is
overflowed or what is the current active subtransaction count.
Attached patch adds subtransaction count and subtransaction overflow
status in pg_stat_activity.  I have implemented this because of the
recent complain about the same[1]

[1] 
https://www.postgresql.org/message-id/CAFiTN-t5BkwdHm1bV8ez64guWZJB_Jjhb7arsQsftxEwpYwObg%40mail.gmail.com


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From 622e1012667c3cfa0c71f27590e2a49833970e22 Mon Sep 17 00:00:00 2001
From: Dilip Kumar 
Date: Sun, 5 Dec 2021 17:56:16 +0530
Subject: [PATCH v1] Add subtransaction count and overflow status in
 pg_stat_activity

If there are some backends having a lot of nested subtransaction
or the subtransaction cache is overflowed there is a no way to
detect that.  So this patch is making that easy by adding those
fields in pg_stat_activity view.
---
 doc/src/sgml/monitoring.sgml| 18 ++
 src/backend/catalog/system_views.sql|  4 +++-
 src/backend/storage/ipc/sinvaladt.c | 13 +
 src/backend/utils/activity/backend_status.c |  4 +++-
 src/backend/utils/adt/pgstatfuncs.c | 13 -
 src/include/catalog/pg_proc.dat |  6 +++---
 src/include/storage/sinvaladt.h |  4 +++-
 src/include/utils/backend_status.h  | 10 ++
 src/test/regress/expected/rules.out | 12 +++-
 9 files changed, 68 insertions(+), 16 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 62f2a33..3eca83a 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -918,6 +918,24 @@ postgres   27093  0.0  0.0  30096  2752 ?Ss   11:34   0:00 postgres: ser
The current backend's xmin horizon.
   
  
+
+ 
+  
+   subxact_count xid
+  
+  
+   The current backend's active subtransactions count.
+  
+ 
+
+ 
+  
+   subxact_overflowed xid
+  
+  
+   Set to true if current backend's subtransaction cache is overflowed.
+  
+ 
 
 
  
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 61b515c..3df23df 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -839,7 +839,9 @@ CREATE VIEW pg_stat_activity AS
 s.backend_xmin,
 S.query_id,
 S.query,
-S.backend_type
+S.backend_type,
+S.subxact_count,
+S.subxact_overflowed
 FROM pg_stat_get_activity(NULL) AS S
 LEFT JOIN pg_database AS D ON (S.datid = D.oid)
 LEFT JOIN pg_authid AS U ON (S.usesysid = U.oid);
diff --git a/src/backend/storage/ipc/sinvaladt.c b/src/backend/storage/ipc/sinvaladt.c
index 946bd8e..876d7fe 100644
--- a/src/backend/storage/ipc/sinvaladt.c
+++ b/src/backend/storage/ipc/sinvaladt.c
@@ -395,17 +395,20 @@ BackendIdGetProc(int backendID)
 
 /*
  * BackendIdGetTransactionIds
- *		Get the xid and xmin of the backend. The result may be out of date
- *		arbitrarily quickly, so the caller must be careful about how this
- *		information is used.
+ *		Get the xid and xmin, nsubxid and overflow status of the backend. The
+ *		result may be out of date arbitrarily quickly, so the caller must be
+ *		careful about how this information is used.
  */
 void
-BackendIdGetTransactionIds(int backendID, TransactionId *xid, TransactionId *xmin)
+BackendIdGetTransactionIds(int backendID, TransactionId *xid,
+		   TransactionId *xmin, int *nsubxid, bool *overflowed)
 {
 	SISeg	   *segP = shmInvalBuffer;
 
 	*xid = InvalidTransactionId;
 	*xmin = InvalidTransactionId;
+	*nsubxid = 0;
+	*overflowed = false;
 
 	/* Need to lock out additions/removals of backends */
 	LWLockAcquire(SInvalWriteLock, LW_SHARED);
@@ -419,6 +422,8 @@ BackendIdGetTransactionIds(int backendID, TransactionId *xid, TransactionId *xmi
 		{
 			*xid = proc->xid;
 			*xmin = proc->xmin;
+			*nsubxid = proc->subxidStatus.count;
+			*overflowed = proc->subxidStatus.overflowed;
 		}
 	}
 
diff --git a/src/backend/utils/activity/backend_status.c b/src/backend/utils/activity/backend_status.c
index 7229598..9c904be 100644
--- a/src/backend/utils/activity/backend_status.c
+++ b/src/backend/utils/activity/backend_status.c
@@ -848,7 +848,9 @@ pgstat_read_current_status(void)
 		{
 			BackendIdGetTransactionIds(i,
 	   &localentry->backend_xid,
-	   &localentry->backend_xmin);
+	   &localentry->backend_xmin,
+	   &localentry->subxact_count,
+	   &localentry->subxact_overflowed);
 
 			localentry++;
 			localappname += NAMEDATALEN;
diff --git a/src/backend/utils/adt/pgstatfuncs.c

Re: pg_get_publication_tables() output duplicate relid

On Mon, Dec 6, 2021 at 8:59 PM Amit Langote  wrote:
>
> On Mon, Dec 6, 2021 at 6:05 PM Amit Kapila  wrote:
> > On Mon, Dec 6, 2021 at 1:00 PM Amit Langote  wrote:
> > > On Mon, Dec 6, 2021 at 3:55 PM houzj.f...@fujitsu.com
> > >  wrote:
> > > > After thinking more on this. I find there might be some usage about 
> > > > adding both
> > > > child and parent to the publication.
> > > >
> > > > For the future feature publication row filter(and maybe column filter), 
> > > > It
> > > > could be useful for user to adding child and parent with different 
> > > > filter
> > > > expression. If pubviaroot=true, user can expect the parent's filter take
> > > > effect, If pubviaroot=false, they can expect the child's filter take 
> > > > effect.
> > > >
> > > > If we disallow adding both child and parent to publication, it could be 
> > > > harder
> > > > for these features to implement.
> > >
> > > It's possible that one may want such a feature, and yes, outright
> > > preventing adding both the parent and a partition to a publication
> > > would perhaps make it harder.  Mind you though that any such feature
> > > would need to fix the current implementation anyway to make the
> > > publication-behavior-related catalog entries for child tables, which
> > > we currently don't, unless children are added explicitly.
> > >
> > > That said, we should also be careful in implementing new features that
> > > offer a finer per-partition granularity, because the fact that we do
> > > currently offer that for many of the existing old features such as
> > > constraints, indexes, etc. limits the scalability of our architecture.
> > > That point may be off-topic for the thread, but something to consider,
> > > or maybe I need to take a look at the other patch to see if my
> > > concerns are addressable / have been addressed.
> > >
> >
> > Your concern is not very clear to me. Can you be more specific in what
> > kind of problem you see with such a design for row filtering?
>
> I guess my characterization of it is still vague but the problem I see
> is that we'll be adding one more feature that allows manipulating
> partitions directly, which means more system catalogs than there
> already are that know about partitions as objects.  I remember being
> deliberate about avoiding pg_publication_rel entries for all
> partitions when a partitioned table is added to a publication for that
> very reason.
>

The same will be true even after this new feature.

> Is covering the use case of defining different filters for each
> partition important for developing that feature?
>

Users can only define the filters for a relation only when they want
to add a particular partition to the publication. Say, if only root
table is added to the publication (and pubviaroot is true) then we
don't need row filters to be defined for partitions. Basically, there
will be only one additional column in pg_publication_rel for the row
filter.

>  I feel that defining
> one for the root table and just using that for all partitions (albeit
> translating as needed to account for possibly different attribute
> numbers) would be enough for most users' use cases.
>

Sure, but what if the user just wants to publish only one of the
partitions? I think in that case along with adding the partition to
publication, we need to allow specifying row filter.

-- 
With Regards,
Amit Kapila.

Re: pg_get_publication_tables() output duplicate relid

On Mon, Dec 6, 2021 at 8:04 PM Amit Langote  wrote:
>
> So IIUC the scenario of concern is when a table to be attached as a
> partition is in a schema that's present in pg_publication_namespace.
> The only way to stop it from being published is to move it to another
> schema that is not published using that publication.
>
> I think I misunderstood how the IN SCHEMA feature works.
> Specifically, I didn't know that one can add a partitioned table to
> the same publication (or any table other than those in a particular
> schema for that matter).  Then the attached partition would still be
> present in the publication by way of being part of the schema that is
> present in the publication, along with the partitioned table that is
> added separately.
>

Right.

> Yes, my proposal in its current form can't prevent that kind of duplication.
>

I am not sure how to proceed here. I feel it is better to go ahead
with the fix Hou-san proposed here and in another email [1] to fix the
know issues, especially because the issue discussed in [1] needs to be
back-patched. We can evaluate your proposal separately for HEAD. What
do you think? I am fine if you think that we should evaluate/analyze
your proposal before making a call on whether to fix the existing
issues with the proposed set of patches.

[1] - 
https://www.postgresql.org/message-id/OS0PR01MB57165D94CB187A97128F75EA946A9%40OS0PR01MB5716.jpnprd01.prod.outlook.com

-- 
With Regards,
Amit Kapila.

Re: Triage for unimplemented geometric operators

2021-12-06 Thread Laurenz Albe

On Mon, 2021-12-06 at 18:18 -0500, Tom Lane wrote:
> Over in [1] it was pointed out that I overenthusiastically
> documented several geometric operators that, in fact, are
> only stubs that throw errors when called.  Specifically
> these are
> 
> dist_lb:<->(line,box)
> dist_bl:<->(box,line)
> close_sl:   lseg ## line
> close_lb:   line ## box
> poly_distance:  polygon <-> polygon
> path_center:@@ path
> (this also underlies point(path), which is not documented anyway)
> 
> There are three reasonable responses:
> 
> 1. Remove the documentation, leave the stubs in place;
> 2. Rip out the stubs and catalog entries too (only possible in HEAD);
> 3. Supply implementations.
> 
> I took a brief look at these, and none of them seem exactly hard
> to implement, with the exception of path_center which seems not to
> have a non-arbitrary definition.  (We could model it on poly_center
> but that one seems rather arbitrary; also, should open paths behave
> any differently than closed ones?)  close_lb would also be rather
> arbitrary for the case of a line that intersects the box, though
> we could follow close_sb's lead and return the line's closest point
> to the box center.
> 
> On the other hand, they've been unimplemented for more than twenty years
> and no one has stepped forward to fill the gap, which sure suggests that
> nobody cares and we shouldn't expend effort and code space on them.
> 
> The only one I feel a bit bad about dropping is poly_distance, mainly
> on symmetry grounds: we have distance operators for all the geometric
> types, so dropping this one would leave a rather obvious hole.  The
> appropriate implementation seems like a trivial copy and paste job:
> distance is zero if the polygons overlap per poly_overlap, otherwise
> it's the same as the closed-path case of path_distance.
> 
> So my inclination for HEAD is to implement poly_distance and nuke
> the others.  I'm a bit less sure about the back branches, but maybe
> just de-document all of them there.
> 
> Thoughts?
>
> [1] 
> https://www.postgresql.org/message-id/flat/5405b243-4523-266e-6139-ad9f80a9d9fc%40postgrespro.ru

I agree with option #2 for HEAD; if you feel motivated to implement
"poly_distance", fine.

About the back branches, removing the documentation is a good choice.

I think the lack of complaints is because everybody who needs serious
geometry processing uses PostGIS.

Yours,
Laurenz Albe

Re: parallel vacuum comments

On Tue, Dec 7, 2021 at 6:54 AM Masahiko Sawada  wrote:
>
> On Mon, Dec 6, 2021 at 1:47 PM Amit Kapila  wrote:
> >
> > On Fri, Dec 3, 2021 at 6:06 PM Masahiko Sawada  
> > wrote:
> > >
> > > >
> > > > 3.  /*
> > > >   * Copy the index bulk-deletion result returned from ambulkdelete and
> > > > @@ -2940,19 +2935,20 @@ parallel_process_one_index(Relation indrel,
> > > >   * Since all vacuum workers write the bulk-deletion result at different
> > > >   * slots we can write them without locking.
> > > >   */
> > > > - if (shared_istat && !shared_istat->updated && istat_res != NULL)
> > > > + if (!pindstats->istat_updated && istat_res != NULL)
> > > >   {
> > > > - memcpy(&shared_istat->istat, istat_res, 
> > > > sizeof(IndexBulkDeleteResult));
> > > > - shared_istat->updated = true;
> > > > + memcpy(&(pindstats->istat), istat_res, sizeof(IndexBulkDeleteResult));
> > > > + pindstats->istat_updated = true;
> > > >
> > > >   /* Free the locally-allocated bulk-deletion result */
> > > >   pfree(istat_res);
> > > > -
> > > > - /* return the pointer to the result from shared memory */
> > > > - return &shared_istat->istat;
> > > >   }
> > > >
> > > > I think here now we copy the results both for local and parallel
> > > > cleanup. Isn't it better to write something about that in comments as
> > > > it is not clear from current comments?
> > >
> > > What do you mean by "local cleanup"?
> > >
> >
> > Clean-up performed by leader backend.
>
> I might be missing your points but I think the patch doesn't change
> the behavior around these codes. With the patch, we allocate
> IndexBulkDeleteResult on DSM for every index but the patch doesn't
> change the fact that in parallel vacuum/cleanup cases, we copy
> IndexBulkDeleteResult returned from ambulkdelete() or amvacuumcleanup
> to DSM space. Non-parallel vacuum doesn't use this function.
>

I was talking about when we call parallel_vacuum_process_one_index()
via parallel_vacuum_process_unsafe_indexes() where the leader
processes the indexes that will be skipped by workers. Isn't that case
slightly different now because previously in that case we would not
have done the copy but now we will copy the stats in that case as
well? Am, I missing something?

-- 
With Regards,
Amit Kapila.

Re: Alter all tables in schema owner fix

2021-12-06 Thread vignesh C

On Fri, Dec 3, 2021 at 10:50 PM Bossart, Nathan  wrote:
>
> On 12/2/21, 11:57 PM, "tanghy.f...@fujitsu.com"  
> wrote:
> > Thanks for your patch.
> > I tested it and it fixed this problem as expected. It also passed "make 
> > check-world".
>
> +1, the patch looks good to me, too.  My only other suggestion would
> be to move IsSchemaPublication() to pg_publication.c

Thanks for your comments, I have made the changes. Additionally I have
renamed IsSchemaPublication to is_schema_publication for keeping the
naming similar around the code. The attached v3 patch has the changes
for the same.

Regards,
Vignesh
From cc4226ced0e535a324dde5b1ff0e48d6e0035e17 Mon Sep 17 00:00:00 2001
From: Vignesh C 
Date: Thu, 2 Dec 2021 19:44:15 +0530
Subject: [PATCH v3] Fix for new owner of ALL TABLES IN SCHEMA publication
 should be superuser.

Currently while changing the owner of ALL TABLES IN SCHEMA publication, it is
not checked if the new owner has superuser permission or not. Added a check to
throw an error if the new owner does not have super user permission.
---
 src/backend/catalog/pg_publication.c  | 31 +++
 src/backend/commands/publicationcmds.c|  7 +
 src/include/catalog/pg_publication.h  |  1 +
 src/test/regress/expected/publication.out | 15 +++
 src/test/regress/sql/publication.sql  | 15 +++
 5 files changed, 69 insertions(+)

diff --git a/src/backend/catalog/pg_publication.c b/src/backend/catalog/pg_publication.c
index 63579b2f82..265bd538c3 100644
--- a/src/backend/catalog/pg_publication.c
+++ b/src/backend/catalog/pg_publication.c
@@ -193,6 +193,37 @@ is_publishable_relation(Relation rel)
 	return is_publishable_class(RelationGetRelid(rel), rel->rd_rel);
 }
 
+/*
+ * Returns true if any schema is associated with the publication, false if no
+ * schema is associated with the publication.
+ */
+bool
+is_schema_publication(Oid pubid)
+{
+	Relation	pubschsrel;
+	ScanKeyData scankey;
+	SysScanDesc scan;
+	HeapTuple	tup;
+	bool 		result = false;
+
+	pubschsrel = table_open(PublicationNamespaceRelationId, AccessShareLock);
+	ScanKeyInit(&scankey,
+Anum_pg_publication_namespace_pnpubid,
+BTEqualStrategyNumber, F_OIDEQ,
+ObjectIdGetDatum(pubid));
+
+	scan = systable_beginscan(pubschsrel,
+			  PublicationNamespacePnnspidPnpubidIndexId,
+			  true, NULL, 1, &scankey);
+	tup = systable_getnext(scan);
+	if (HeapTupleIsValid(tup))
+		result = true;
+
+	systable_endscan(scan);
+	table_close(pubschsrel, AccessShareLock);
+
+	return result;
+}
 
 /*
  * SQL-callable variant of the above
diff --git a/src/backend/commands/publicationcmds.c b/src/backend/commands/publicationcmds.c
index 7d4a0e95f6..404bb5d0c8 100644
--- a/src/backend/commands/publicationcmds.c
+++ b/src/backend/commands/publicationcmds.c
@@ -1192,6 +1192,13 @@ AlterPublicationOwner_internal(Relation rel, HeapTuple tup, Oid newOwnerId)
 	 errmsg("permission denied to change owner of publication \"%s\"",
 			NameStr(form->pubname)),
 	 errhint("The owner of a FOR ALL TABLES publication must be a superuser.")));
+
+		if (!superuser_arg(newOwnerId) && is_schema_publication(form->oid))
+			ereport(ERROR,
+	(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+	 errmsg("permission denied to change owner of publication \"%s\"",
+			NameStr(form->pubname)),
+	 errhint("The owner of a FOR ALL TABLES IN SCHEMA publication must be a superuser.")));
 	}
 
 	form->pubowner = newOwnerId;
diff --git a/src/include/catalog/pg_publication.h b/src/include/catalog/pg_publication.h
index 1ae439e6f3..902f2f2f0d 100644
--- a/src/include/catalog/pg_publication.h
+++ b/src/include/catalog/pg_publication.h
@@ -122,6 +122,7 @@ extern List *GetPubPartitionOptionRelations(List *result,
 			Oid relid);
 
 extern bool is_publishable_relation(Relation rel);
+extern bool is_schema_publication(Oid pubid);
 extern ObjectAddress publication_add_relation(Oid pubid, PublicationRelInfo *targetrel,
 			  bool if_not_exists);
 extern ObjectAddress publication_add_schema(Oid pubid, Oid schemaid,
diff --git a/src/test/regress/expected/publication.out b/src/test/regress/expected/publication.out
index 1feb558968..b7df48e87c 100644
--- a/src/test/regress/expected/publication.out
+++ b/src/test/regress/expected/publication.out
@@ -373,6 +373,21 @@ ALTER PUBLICATION testpub2 ADD TABLE testpub_tbl1;  -- ok
 DROP PUBLICATION testpub2;
 DROP PUBLICATION testpub3;
 SET ROLE regress_publication_user;
+CREATE ROLE regress_publication_user3;
+GRANT regress_publication_user2 TO regress_publication_user3;
+SET client_min_messages = 'ERROR';
+CREATE PUBLICATION testpub4 FOR ALL TABLES IN SCHEMA pub_test;
+RESET client_min_messages;
+ALTER PUBLICATION testpub4 OWNER TO regress_publication_user3;
+SET ROLE regress_publication_user3;
+-- fail - new owner must be superuser
+ALTER PUBLICATION testpub4 owner to regress_publication_user2; -- fail
+ERROR:  permission denied to change owner of publication "testpub4"
+HINT:

Re: Why doesn't pgstat_report_analyze() focus on not-all-visible-page dead tuple counts, specifically?

2021-12-06 Thread Peter Geoghegan

On Mon, Dec 6, 2021 at 6:11 PM Robert Haas  wrote:
> This doesn't seem convincing. Launching autovacuum too soon surely has
> costs that someone might not want to pay. Clearly in the degenerate
> case where we always autovacuum every table every time an autovacuum
> worker is launched, we have gone insane.

Unfortunately we already sometimes behave insanely in exactly the way
that you describe:

https://postgr.es/m/CAH2-Wz=sJm3tm+FpXbyBhEhX5tbz1trQrhG6eOhYk4-+5uL=w...@mail.gmail.com

That is, in addition to the problem that I'm highlighting on this
thread, we also have the opposite problem: autovacuum chases its tail
when it sees dead heap-only tuples that opportunistic pruning can take
care of on its own. I bet that both effects sometimes cancel each
other out, in weird and unpredictable ways. This effect might be
protective at first, and then less protective.

> So arbitrarily large moves in
> that direction can't be viewed as unproblematic.

I certainly wouldn't argue that they are. Just that the current
approach of simply counting dead tuples in the table (or trying to,
using sampling) and later launching autovacuum (when dead tuples
crosses a pretty arbitrary threshold) has many problems -- problems
that make us either run autovacuum too aggressively, and not
aggressively enough (relative to what the docs suggest is supposed to
happen).

> Now, on the other hand, I *most definitely* think
> autovacuum_vacuum_scale_factor is hogwash. Everything I've seen
> indicates that, while you do want to wait for a larger number of dead
> tuples in a large table than in a small one, it's sublinear. I don't
> know whether it's proportional to sqrt(table_size) or table_size^(1/3)
> or lg(table_size) or table_size^(0.729837166538), but just plain old
> table_size is definitely not it.

I think that it'll vary considerably, based on many factors. Making
the precise threshold "open to interpretation" to some degree (by
autovacuum.c) seems like it might help us with future optimizations.

It's hard to define a break-even point for launching an autovacuum
worker. I think it would be more productive to come up with a design
that at least doesn't go completely off the rails in various specific
ways. I also think that our problem is not so much that we don't have
accurate statistics about dead tuples (though we surely don't have
much accuracy). The main problem seems to be that there are various
specific, important ways in which the distribution of dead tuples may
matter (leading to various harmful biases). And so it seems reasonable
to fudge how we interpret dead tuples with the intention of capturing
some of that, as a medium term solution. Something that considers the
concentration of dead tuples in heap pages seems promising.

--
Peter Geoghegan

Re: row filtering for logical replication

On Mon, Dec 6, 2021 at 6:18 PM Euler Taveira  wrote:
>
> On Mon, Dec 6, 2021, at 3:44 AM, Amit Kapila wrote:
>
> True, but that is the main reason the review and development are being
> done as separate sub-features. I suggest still keeping the similar
> separation till some of the reviews of each of the patches are done,
> otherwise, we need to rethink how to divide for easier review. We need
> to retain the 0005 patch because that handles many problems without
> which the main patch is incomplete and buggy w.r.t replica identity.
>
> IMO we should merge sub-features as soon as we reach consensus. Every new
> sub-feature breaks comments, tests and documentation if you want to remove or
> rearrange patches.
>

I agree that there is some effort but OTOH, it gives the flexibility
to do a focussed review and as soon as some patch is ready or close to
ready we can merge in the main patch. This was just a humble
suggestion based on how this patch was making progress and how it has
helped to keep some parts separate by allowing different people to
work on different parts of the problem.

> It seems I misread 0005. I agree that it is important. I'll
> check it.
>

Okay, thanks!

-- 
With Regards,
Amit Kapila.

Re: Why doesn't pgstat_report_analyze() focus on not-all-visible-page dead tuple counts, specifically?

On Mon, Dec 6, 2021 at 8:14 PM Peter Geoghegan  wrote:
> Suppose we believe that not-all-visible pages have 20 LP_DEAD items on
> average, and they turn out to only have 3 or 5. Theoretically we've
> done the wrong thing by launching autovacuum workers sooner -- we
> introduce bias. But we also have lower variance over time, which might
> make it worth it. I also think that it might not really matter at all.
> It's no great tragedy if we clean up and set pages all-visible in the
> visibility map a little earlier on average. It might even be a
> positive thing.

This doesn't seem convincing. Launching autovacuum too soon surely has
costs that someone might not want to pay. Clearly in the degenerate
case where we always autovacuum every table every time an autovacuum
worker is launched, we have gone insane. So arbitrarily large moves in
that direction can't be viewed as unproblematic.

> The fact that the user expresses the dead-tuple-wise threshold using
> autovacuum_vacuum_scale_factor is already somewhat arbitrary -- it is
> based on some pretty iffy assumptions. Even if we greatly overestimate
> dead tuples with the new algorithm, we're only doing so under
> circumstances that might have caused
> autovacuum_vacuum_insert_scale_factor to launch an autovacuum worker
> anyway. Just setting the visibility map bit has considerable value.

Now, on the other hand, I *most definitely* think
autovacuum_vacuum_scale_factor is hogwash. Everything I've seen
indicates that, while you do want to wait for a larger number of dead
tuples in a large table than in a small one, it's sublinear. I don't
know whether it's proportional to sqrt(table_size) or table_size^(1/3)
or lg(table_size) or table_size^(0.729837166538), but just plain old
table_size is definitely not it.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: parallel vacuum comments

On Mon, Dec 6, 2021 at 1:47 PM Amit Kapila  wrote:
>
> On Fri, Dec 3, 2021 at 6:06 PM Masahiko Sawada  wrote:
> >
> > On Fri, Dec 3, 2021 at 6:03 PM Amit Kapila  wrote:
> > >
> > > On Thu, Dec 2, 2021 at 6:01 PM Masahiko Sawada  
> > > wrote:
> > > >
> > > > I've attached updated patches.
> > > >
> > >
> > > I have a few comments on v4-0001.
> >
> > Thank you for the comments!
> >
> > > 1.
> > > In parallel_vacuum_process_all_indexes(), we can combine the two
> > > checks for vacuum/cleanup at the beginning of the function
> >
> > Agreed.
> >
> > > and I think
> > > it is better to keep the variable name as bulkdel or cleanup instead
> > > of vacuum as that is more specific and clear.
> >
> > I was thinking to use the terms "bulkdel" and "cleanup" instead of
> > "vacuum" and "cleanup" for the same reason. That way, probably we can
> > use “bulkdel" and “cleanup" when doing index bulk-deletion (i.g.,
> > calling to ambulkdelete) and index cleanup (calling to
> > amvacuumcleanup), respectively, and use "vacuum" when processing an
> > index, i.g., doing either bulk-delete or cleanup, instead of using
> > just "processing" . But we already use “vacuum” and “cleanup” in many
> > places, e.g., lazy_vacuum_index() and lazy_cleanup_index(). If we want
> > to use “bulkdel” instead of “vacuum”, I think it would be better to
> > change the terminology in vacuumlazy.c thoroughly, probably in a
> > separate patch.
> >
>
> Okay.
>
> > > 2. The patch seems to be calling parallel_vacuum_should_skip_index
> > > thrice even before starting parallel vacuum. It has a call to find the
> > > number of blocks which has to be performed for each index. I
> > > understand it might not be too costly to call this but it seems better
> > > to remember this info like we are doing in the current code.
> >
> > Yes, we can bring will_vacuum_parallel array back to the code. That
> > way, we can remove the call to parallel_vacuum_should_skip_index() in
> > parallel_vacuum_begin().
> >
> > > We can
> > > probably set parallel_workers_can_process in parallel_vacuum_begin and
> > > then again update in parallel_vacuum_process_all_indexes. Won't doing
> > > something like that be better?
> >
> > parallel_workers_can_process can vary depending on bulk-deletion, the
> > first time cleanup, or the second time (or more) cleanup. What can we
> > set parallel_workers_can_process based on in parallel_vacuum_begin()?
> >
>
> I was thinking to set the results of will_vacuum_parallel in
> parallel_vacuum_begin().
>
> > >
> > > 3.  /*
> > >   * Copy the index bulk-deletion result returned from ambulkdelete and
> > > @@ -2940,19 +2935,20 @@ parallel_process_one_index(Relation indrel,
> > >   * Since all vacuum workers write the bulk-deletion result at different
> > >   * slots we can write them without locking.
> > >   */
> > > - if (shared_istat && !shared_istat->updated && istat_res != NULL)
> > > + if (!pindstats->istat_updated && istat_res != NULL)
> > >   {
> > > - memcpy(&shared_istat->istat, istat_res, sizeof(IndexBulkDeleteResult));
> > > - shared_istat->updated = true;
> > > + memcpy(&(pindstats->istat), istat_res, sizeof(IndexBulkDeleteResult));
> > > + pindstats->istat_updated = true;
> > >
> > >   /* Free the locally-allocated bulk-deletion result */
> > >   pfree(istat_res);
> > > -
> > > - /* return the pointer to the result from shared memory */
> > > - return &shared_istat->istat;
> > >   }
> > >
> > > I think here now we copy the results both for local and parallel
> > > cleanup. Isn't it better to write something about that in comments as
> > > it is not clear from current comments?
> >
> > What do you mean by "local cleanup"?
> >
>
> Clean-up performed by leader backend.

I might be missing your points but I think the patch doesn't change
the behavior around these codes. With the patch, we allocate
IndexBulkDeleteResult on DSM for every index but the patch doesn't
change the fact that in parallel vacuum/cleanup cases, we copy
IndexBulkDeleteResult returned from ambulkdelete() or amvacuumcleanup
to DSM space. Non-parallel vacuum doesn't use this function. Do you
have any suggestions on better comments here?

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/

Re: Why doesn't pgstat_report_analyze() focus on not-all-visible-page dead tuple counts, specifically?

2021-12-06 Thread Peter Geoghegan

On Mon, Dec 6, 2021 at 2:37 PM Peter Geoghegan  wrote:
> On Mon, Dec 6, 2021 at 12:07 PM Robert Haas  wrote:
> > So does this. If some of the table is now all-visible when it wasn't
> > before, it's certainly a good guess that the portions that still
> > aren't have about the same distribution of dead tuples that they did
> > before ... although the other direction is less clear: it seems
> > possible that newly not-all-visible pages have fewer dead tuples than
> > ones which have been not-all-visible for a while. But you have to make
> > some guess.
>
> To me, it seems natural to accept and even embrace the inherent
> uncertainty about the number of dead tuples.

> The number of dead tuples in the table is an inherently dynamic thing,
> which makes it totally dissimilar to the pg_statistics-based stats.
> And so a single snapshot of a point in time is inherently much less
> useful -- we ought to keep a few sets of old statistics within our new
> pgstat_report_analyze() -- maybe 3 or 5.

I just realized that I didn't really get around to explicitly
connecting this to your point about newly not-all-visible pages being
quite different to older ones that ANALYZE has seen -- which is
definitely an important consideration. I'll do so now:

Keeping some history makes the algorithm "less gullible" (a more
useful goal than making it "smarter", at least IMV). Suppose that our
starting point is 2 pieces of authoritative information, which are
current as of the instant we want to estimate the number of dead
tuples for VACUUM: 1. total relation size (relpages), and 2. current
not-all-visible-pages count (interesting/countable pages, calculated
by taking the "complement" of visibilitymap_count() value). Further
suppose we store the same 2 pieces of information in our ANALYZE
stats, reporting using pgstat_report_analyze() -- the same 2 pieces of
information are stored alongside the actual count of dead tuples and
live tuples found on not-all-visible pages.

The algorithm avoids believing silly things about dead tuples by
considering the delta between each piece of information, particularly
the difference between "right now" and "the last time ANALYZE ran and
called pgstat_report_analyze()". For example, if item 1/relpages
increased by exactly the same number of blocks as item
2/not-all-visible pages (or close enough to it), that is recognized as
a pretty strong signal. The algorithm should consider the newly
not-all-visible pages as likely to have very few dead tuples. At the
same time, the algorithm should not change its beliefs about the
concentration of dead tuples in remaining, older not-all-visible
pages.

This kind of thing will still have problems, no doubt. But I'd much
rather err in the direction of over-counting dead tuples like this.
The impact of the problem on the workload/autovacuum is a big part of
the picture here.

Suppose we believe that not-all-visible pages have 20 LP_DEAD items on
average, and they turn out to only have 3 or 5. Theoretically we've
done the wrong thing by launching autovacuum workers sooner -- we
introduce bias. But we also have lower variance over time, which might
make it worth it. I also think that it might not really matter at all.
It's no great tragedy if we clean up and set pages all-visible in the
visibility map a little earlier on average. It might even be a
positive thing.

The fact that the user expresses the dead-tuple-wise threshold using
autovacuum_vacuum_scale_factor is already somewhat arbitrary -- it is
based on some pretty iffy assumptions. Even if we greatly overestimate
dead tuples with the new algorithm, we're only doing so under
circumstances that might have caused
autovacuum_vacuum_insert_scale_factor to launch an autovacuum worker
anyway. Just setting the visibility map bit has considerable value.

--
Peter Geoghegan

Re: Make pg_waldump report replication origin ID, LSN, and timestamp.

On Mon, Dec 6, 2021 at 11:24 PM Masahiko Sawada  wrote:
>
> On Mon, Dec 6, 2021 at 5:09 PM Michael Paquier  wrote:
> >
> > On Mon, Dec 06, 2021 at 04:35:07PM +0900, Masahiko Sawada wrote:
> > > I've attached a patch to add replication origin information to
> > > xact_desc_prepare().
> >
> > Yeah.
> >
> > +   if (origin_id != InvalidRepOriginId)
> > +   appendStringInfo(buf, "; origin: node %u, lsn %X/%X, at %s",
> > +origin_id,
> > +LSN_FORMAT_ARGS(parsed.origin_lsn),
> > +timestamptz_to_str(parsed.origin_timestamp));
> >
> > Shouldn't you check for parsed.origin_lsn instead?  The replication
> > origin is stored there as far as I read EndPrepare().
>
> Yeah, I was thinking check origin_lsn instead. That way, we don't show
> invalid origin_timestamp and origin_lsn even if origin_id is set. But
> as far as I read, the same is true for xact_desc_commit() (and
> xact_desc_rollback()). That is, since apply workers always its origin
> id and could do commit transactions that are not replicated from the
> publisher, it's possible that xact_desc_commit() reports like:
>
> rmgr: Transaction len (rec/tot):117/   117, tx:725, lsn:
> 0/014BE938, prev 0/014BE908, desc: COMMIT 2021-12-06 22:04:44.462200
> JST; inval msgs: catcache 55 catcache 54 catcache 64; origin: node 1,
> lsn 0/0, at 2000-01-01 09:00:00.00 JST

Hmm, is that okay in the first place? This happens since the apply
worker updates twophaesstate value of pg_subscription after setting
the origin id and before entering the apply loop. No changes in this
transaction will be replicated but an empty transaction that has
origin id and doesn't have origin lsn and time will be replicated.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/

Re: Is it correct to update db state in control file as "shutting down" during end-of-recovery checkpoint?

2021-12-06 Thread Kyotaro Horiguchi

At Mon, 6 Dec 2021 19:28:03 +, "Bossart, Nathan"  
wrote in 
> On 12/6/21, 4:34 AM, "Bharath Rupireddy" 
>  wrote:
> > While the database is performing end-of-recovery checkpoint, the
> > control file gets updated with db state as "shutting down" in
> > CreateCheckPoint (see the code snippet at [1]) and at the end it sets
> > it back to "shut down" for a brief moment and then finally to "in
> > production". If the end-of-recovery checkpoint takes a lot of time or
> > the db goes down during the end-of-recovery checkpoint for whatever
> > reasons, the control file ends up having the wrong db state.
> >
> > Should we add a new db state something like
> > DB_IN_END_OF_RECOVERY_CHECKPOINT/"in end-of-recovery checkpoint" or
> > something else to represent the correct state?
> 
> This seems like a reasonable change to me.  From a quick glance, it
> looks like it should be a simple fix that wouldn't add too much
> divergence between the shutdown and end-of-recovery checkpoint code
> paths.

Technically end-of-crash-recovery checkpointis actually a kind of
shutdown checkpoint. In other words, a server that needs to run a
crash recovery actually is once shut down then enters normal operation
mode internally.  So if the server crashed after the end-of-recovery
checkpoint finished and before it enters DB_IN_PRODUCTION state, the
server would start with a clean startup next time.  We could treat
DB_IN_END_OF_RECOVERY_CHECKPOINT as safe state to skip recovery but I
don't think we need to preserve that behavior.

In other places, server log and ps display specifically, we already
make distinction between end-of-recovery checkopint and shutdown
checkpoint.

Finally, I agree to Nathan that it should be simple enough.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2021-12-06 Thread Ashutosh Sharma

Thanks Robert for sharing your thoughts.

On Mon, Dec 6, 2021 at 11:16 PM Robert Haas  wrote:

> On Mon, Dec 6, 2021 at 9:23 AM Ashutosh Sharma 
> wrote:
> > One last point - If we try to clone a huge database, as expected CREATE
> DATABASE emits a lot of WALs, causing a lot of intermediate checkpoints
> which seems to be affecting the performance slightly.
>
> Yes, I think this needs to be characterized better. If you have a big
> shared buffers setting and a lot of those buffers are dirty and the
> template database is small, all of which is fairly normal, then this
> new approach should be much quicker. On the other hand, what if the
> situation is reversed? Perhaps you have a small shared buffers and not
> much of it is dirty and the template database is gigantic. Then maybe
> this new approach will be slower. But right now I think we don't know
> where the crossover point is, and I think we should try to figure that
> out.
>

Yes I think so too.


>
> So for example, imagine tests with 1GB of shard_buffers, 8GB, and
> 64GB. And template databases with sizes of whatever the default is,
> 1GB, 10GB, 100GB. Repeatedly make 75% of the pages dirty and then
> create a new database from one of the templates. And then just measure
> the performance. Maybe for large databases this approach is just
> really the pits -- and if your max_wal_size is too small, it
> definitely will be. But, I don't know, maybe with reasonable settings
> it's not that bad. Writing everything to disk twice - once to WAL and
> once to the target directory - has to be more expensive than doing it
> once. But on the other hand, it's all sequential I/O and the data
> pages don't need to be fsync'd, so perhaps the overhead is relatively
> mild. I don't know.
>

So far, I haven't found much performance overhead with a few gb of data in
the template database. It's just a bit with the default settings, perhaps
setting a higher value of max_wal_size would reduce this overhead.

--
With Regards,
Ashutosh Sharma.

Fix a bug in DecodeAbort() and improve input data check on subscriber.

Hi all,

While updating the patch I recently posted[1] to make pg_waldump
report replication origin ID, LSN, and timestamp, I found a bug that
replication origin timestamp is not set in ROLLBACK PREPARED case.
Commit 8bdb1332eb5 (CC'ed Amit) added an argument to
ReorderBufferFinishPrepared() but in ROLLBACK PREPARED case, the
caller specified it at the wrong position:

@@ -730,6 +730,7 @@ DecodeCommit(LogicalDecodingContext *ctx,
XLogRecordBuffer *buf,
if (two_phase)
{
ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr,
buf->endptr,
+
SnapBuildInitialConsistentPoint(ctx->snapshot_builder),
commit_time, origin_id, origin_lsn,
parsed->twophase_gid, true);
}
@@ -868,6 +869,7 @@ DecodeAbort(LogicalDecodingContext *ctx,
XLogRecordBuffer *buf,
{
ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr,
buf->endptr,
abort_time, origin_id, origin_lsn,
+   InvalidXLogRecPtr,
parsed->twophase_gid, false);
}

This affects the value of rollback_data.rollback_time on the
subscriber, resulting in setting the wrong timestamp to both the
replication origin timestamp and the error callback argument on the
subscriber. I've attached the patch to fix it.

Besides, I think we can improve checking input data on subscribers.
This bug was not detected by compilers but it could have been detected
if we checked the input data. Looking at logicalrep_read_xxx functions
in proto.c, there are some inconsistencies; we check the value of
prepare_data->xid in logicalrep_read_prepare_common() but we don't in
both logicalrep_read_commit_prepared() and
logicalrep_read_rollback_prepared(), and we don't check anything in
stream_start/stream_stop. Also, IIUC that since timestamps are always
set in prepare/commit prepared/rollback prepared cases we can check
them too. I've attached a PoC patch that introduces macros for
checking input data and adds some new checks. Since it could be
frequently called, I used unlikely() but probably we can also consider
replacing elog(ERROR) with assertions.

Regards,

[1] 
https://www.postgresql.org/message-id/CAD21AoD2dJfgsdxk4_KciAZMZQoUiCvmV9sDpp8ZuKLtKCNXaA%40mail.gmail.com

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/
From 922aff7d1f7cf75b7672c794e7960a26cf07c8f6 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada 
Date: Mon, 6 Dec 2021 18:23:21 +0900
Subject: [PATCH 1/2] Fix a bug of passing incorrect arguments to
 ReorderBufferFinishPrepared().

This has been introduced by commit 8bdb1332eb5.
---
 src/backend/replication/logical/decode.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index a2b69511b4..59aed6cee6 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -875,8 +875,8 @@ DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 	if (two_phase && !skip_xact)
 	{
 		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
-	abort_time, origin_id, origin_lsn,
 	InvalidXLogRecPtr,
+	abort_time, origin_id, origin_lsn,
 	parsed->twophase_gid, false);
 	}
 	else
-- 
2.24.3 (Apple Git-128)

From 97a21e11ff5df940a35f1cad62c56c320e92dd78 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada 
Date: Tue, 7 Dec 2021 00:25:49 +0900
Subject: [PATCH 2/2] Improve input data check of logical replication.

---
 src/backend/replication/logical/proto.c | 47 +++--
 1 file changed, 37 insertions(+), 10 deletions(-)

diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 9f5bf4b639..c85cad7859 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -20,6 +20,11 @@
 #include "utils/lsyscache.h"
 #include "utils/syscache.h"
 
+/* Macros for checking input data */
+#define LOGICALREP_CHECK_INVALID_LSN(lsn) unlikely(XLogRecPtrIsInvalid(lsn))
+#define LOGICALREP_CHECK_INVALID_TIMESTAMP(ts) unlikely((ts) == 0)
+#define LOGICALREP_CHECK_INVALID_XID(xid) unlikely(!TransactionIdIsValid(xid))
+
 /*
  * Protocol message flags.
  */
@@ -61,7 +66,7 @@ logicalrep_read_begin(StringInfo in, LogicalRepBeginData *begin_data)
 {
 	/* read fields */
 	begin_data->final_lsn = pq_getmsgint64(in);
-	if (begin_data->final_lsn == InvalidXLogRecPtr)
+	if (LOGICALREP_CHECK_INVALID_LSN(begin_data->final_lsn))
 		elog(ERROR, "final_lsn not set in begin message");
 	begin_data->committime = pq_getmsgint64(in);
 	begin_data->xid = pq_getmsgint(in, 4);
@@ -132,10 +137,10 @@ logicalrep_read_begin_prepare(StringInfo in, LogicalRepPreparedTxnData *begin_da
 {
 	/* read fields */
 	begin_data->prepare_lsn = pq_getmsgint64(in);
-	if (begin_data->prepare_lsn == InvalidXLogRecPtr)
+	if (LOGICALREP_CHECK_INVALID_LSN(begin_

Re: Optionally automatically disable logical replication subscriptions on error

2021-12-06 Thread Greg Nancarrow

On Tue, Dec 7, 2021 at 3:06 AM Mark Dilger  wrote:
>
> My concern about disabling a subscription in response to *any* error is that 
> people may find the feature does more harm than good.  Disabling the 
> subscription in response to an occasional deadlock against other database 
> users, or occasional resource pressure, might annoy people and lead to the 
> feature simply not being used.
>
I can understand this point of view.
It kind of suggests to me the possibility of something like a
configurable timeout (e.g. disable the subscription if the same error
has occurred for more than X minutes) or, similarly, perhaps if some
threshold has been reached (e.g. same error has occurred more than X
times), but I think that this was previously suggested by Peter Smith
and the idea wasn't looked upon all that favorably?

Regards,
Greg Nancarrow
Fujitsu Australia

Triage for unimplemented geometric operators

Over in [1] it was pointed out that I overenthusiastically
documented several geometric operators that, in fact, are
only stubs that throw errors when called. Specifically
these are

dist_lb:<->(line,box)
dist_bl:<->(box,line)
close_sl: lseg ## line
close_lb: line ## box
poly_distance: polygon <-> polygon
path_center:@@ path
(this also underlies point(path), which is not documented anyway)

There are three reasonable responses:

1. Remove the documentation, leave the stubs in place;
2. Rip out the stubs and catalog entries too (only possible in HEAD);
3. Supply implementations.

I took a brief look at these, and none of them seem exactly hard
to implement, with the exception of path_center which seems not to
have a non-arbitrary definition. (We could model it on poly_center
but that one seems rather arbitrary; also, should open paths behave
any differently than closed ones?) close_lb would also be rather
arbitrary for the case of a line that intersects the box, though
we could follow close_sb's lead and return the line's closest point
to the box center.

On the other hand, they've been unimplemented for more than twenty years
and no one has stepped forward to fill the gap, which sure suggests that
nobody cares and we shouldn't expend effort and code space on them.

The only one I feel a bit bad about dropping is poly_distance, mainly
on symmetry grounds: we have distance operators for all the geometric
types, so dropping this one would leave a rather obvious hole. The
appropriate implementation seems like a trivial copy and paste job:
distance is zero if the polygons overlap per poly_overlap, otherwise
it's the same as the closed-path case of path_distance.

So my inclination for HEAD is to implement poly_distance and nuke
the others. I'm a bit less sure about the back branches, but maybe
just de-document all of them there.

Thoughts?

regards, tom lane

[1]
https://www.postgresql.org/message-id/flat/5405b243-4523-266e-6139-ad9f80a9d9fc%40postgrespro.ru

Re: Why doesn't pgstat_report_analyze() focus on not-all-visible-page dead tuple counts, specifically?

2021-12-06 Thread Peter Geoghegan

On Mon, Dec 6, 2021 at 12:07 PM Robert Haas  wrote:
> So does this. If some of the table is now all-visible when it wasn't
> before, it's certainly a good guess that the portions that still
> aren't have about the same distribution of dead tuples that they did
> before ... although the other direction is less clear: it seems
> possible that newly not-all-visible pages have fewer dead tuples than
> ones which have been not-all-visible for a while. But you have to make
> some guess.

To me, it seems natural to accept and even embrace the inherent
uncertainty about the number of dead tuples. We should model our
current belief about how many dead tuples are in the table as a
probability density function (or something along the same lines).
There is a true "sample space" here. Once we focus on not-all-visible
pages, using authoritative VM info, many kinds of misestimations are
clearly impossible. For example, there are only so many
not-all-visible heap pages, and they can only hold so many dead tuples
(up to MaxHeapTuplesPerPage). This is a certainty.

The number of dead tuples in the table is an inherently dynamic thing,
which makes it totally dissimilar to the pg_statistics-based stats.
And so a single snapshot of a point in time is inherently much less
useful -- we ought to keep a few sets of old statistics within our new
pgstat_report_analyze() -- maybe 3 or 5. Each set of statistics
includes the total number of relpages at the time, the total number of
not-all-visible pages (i.e. interesting pages) at the time, and the
average number of live and dead tuples encountered. This is
interpreted (along with a current visibilitymap_count()) to get our
so-called probability density function (probably not really a PDF,
probably something simpler and only vaguely similar) within
autovacuum.c.

It just occurred to me that it makes zero sense that
pgstat_report_vacuum() does approximately the same thing as
pgstat_report_analyze() -- we make no attempt to compensate for the
fact that the report is made by VACUUM specifically, and so reflects
the state of each page in the table immediately after it was processed
by VACUUM. ISTM that this makes it much more likely to appear as an
underestimate later on --  pgstat_report_vacuum() gets the furthest
possible thing from a random sample. Whereas if we had more context
(specifically that there are very few or even 0 all-visible pages), it
wouldn't hurt us at all, and we wouldn't need to have special
pgstat_report_vacuum()-only heuristics.

-- 
Peter Geoghegan

Re: GiST operator class for bool

2021-12-06 Thread Pavel Luzanov


Hello,

I don't see any changes in the documentation.[1]

Should bool appear in the looong list of supported operator classes?

[1] https://www.postgresql.org/docs/devel/btree-gist.html

--
Pavel Luzanov
Postgres Professional: https://postgrespro.com
The Russian Postgres Company

Re: pg_dump versus ancient server versions

Robert Haas  writes:
> On Sun, Dec 5, 2021 at 7:41 PM Tom Lane  wrote:
>> Based on these results, I think maybe we should raise our ambitions
>> a bit compared to Peter's original proposal.  Specifically,
>> I wonder if it wouldn't be wise to try to silence compile warnings
>> in these branches.

> Yep. I have long been of the view, and have said before, that there is
> very little harm in doing some maintenance of EOL branches. Making it
> easy to test against them is a great way to improve our chances of
> actually having the amount of backward-compatibility that we say we
> want to have.

Right.  The question that's on the table is how much is the right
amount of maintenance.  I think that back-patching user-visible bug
fixes, for example, is taking things too far.  What we want is to
be able to replicate the behavior of the branch's last released
version, using whatever build tools we are currently using.  So
back-patching something like that is counterproductive, because
now the behavior is not what was released.

A minimal amount of maintenance would be "only back-patch fixes
for issues that cause failure-to-build".  The next step up is "fix
issues that cause failure-to-pass-regression-tests", and then above
that is "fix developer-facing annoyances, such as compiler warnings
or unwanted test output, as long as you aren't changing user-facing
behavior".  I now think that it'd be reasonable to include this
last group, although I'm pretty sure Peter didn't have that in mind
in his policy sketch.

regards, tom lane

Re: parse_subscription_options - suggested improvements

2021-12-06 Thread Peter Smith

On Tue, Dec 7, 2021 at 6:07 AM Bossart, Nathan  wrote:
>
> On 12/5/21, 9:21 PM, "Michael Paquier"  wrote:
> > On Mon, Dec 06, 2021 at 11:28:12AM +1100, Peter Smith wrote:
> >> For the initialization of opts I put memset within the function to
> >> make it explicit that the bit-masks will work as intended without
> >> having to look back at calling code for the initial values. In any
> >> case, I think the caller declarations of SubOpts are trivial, (e.g.
> >> SubOpts opts = {0};) so I felt caller initializations don't need to be
> >> changed regardless of the memset.
> >
> > It seems to me that not initializing these may cause some compilation
> > warnings.  memset(0) at the beginning of parse_subscription_options()
> > is an improvement.
>
> I'll admit I was surprised that my compiler didn't complain about
> this, but I wouldn't be surprised at all if others did.  I agree that
> there is no strong need to remove the initializations from the calling
> functions.
>
> >> My patch was meant only to remove all the redundant conditions of the
> >> HEAD code, so I did not rearrange any of the logic at all. Personally,
> >> I also think your v13 is better and easier to read, but those subtle
> >> behaviour differences were something I'd deliberately avoided in v12.
> >> However, if the committer thinks it does not matter then your v13 is
> >> fine by me.
> >
> > Well, there is always the argument that it could be confusing as a
> > different combination of options generates a slightly-different error,
> > but the user would get warned about each one of his/her mistakes at
> > the end, so the result is the same.
> >
> > -   if (opts->enabled &&
> > -   IsSet(supported_opts, SUBOPT_ENABLED) &&
> > -   !IsSet(opts->specified_opts, SUBOPT_ENABLED))
> > +   if (opts->enabled)
> >
> > I see.   The last condition on the specified options in the last two
> > checks is removed thanks to the first two checks.  As a matter of
> > consistency with those error strings, keeping each !IsSet() would be
> > cleaner.  But I agree that v13 is better than that, without removing
> > the two initializations.
>
> Attached a v14 with the initializations added back.
>

LGTM.

--
Kind Regards,
Peter Smith.
Fujitsu Australia

Re: ExecTypeSetColNames is fundamentally broken

Robert Haas  writes:
> I don't understand the code so I can't comment on the code, but I find
> the regression test changes pretty suspect. Attaching any alias list
> to the RTE ought to rename the output columns for all purposes, not
> just the ones we as implementers find convenient.

Well, that was what I thought when I wrote bf7ca1587, but it leads
to logical contradictions.  Consider

create table t (a int, b int);

create function f(t) returns ... ;

select f(t) from t;

select f(t) from t(x,y);

If we adopt the "rename for all purposes" interpretation, then
the second SELECT must fail, because what f() is being passed is
no longer of type t.  If you ask me, that'll be a bigger problem
for users than the change I'm proposing (quite independently of
how hard it might be to implement).  It certainly will break
a behavior that goes back much further than bf7ca1587.

regards, tom lane

Re: MSVC SSL test failure

2021-12-06 Thread Andrew Dunstan



On 12/6/21 10:30, Alexander Lakhin wrote:
> Hello Andrew,
> 06.12.2021 17:56, Andrew Dunstan wrote:
>> Yeah, quite annoying, especially because only some combinations of MSVC
>> runtime / openssl version seem to trigger the problem.
>>
>>
>> Adding a shutdown() before the closesocket() also fixes the issue.
>>
> Can you confirm that adding shutdown(MyProcPort->sock, SD_BOTH) fixes
> the issue for many test runs?
> I don't see the stable test passing here.
> Without shutdown() the test failed on iterations 1, 5, 4, but with
> shutdown() it failed too, on iterations 3, 1, 4.
> Without close() and shutdown() the test passes 20 iterations.
> (I'm still researching how openssl affects the shutdown sequence.)




I have been getting 100% failures on the SSL tests with closesocket()
alone, and 100% success over 10 tests with this:


diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index 96ab37c7d0..5998c089b0 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -295,6 +295,7 @@ socket_close(int code, Datum arg)
 * Windows too.  But it's a lot more fragile than the other way.
 */
 #ifdef WIN32
+   shutdown(MyProcPort->sock, SD_SEND);
    closesocket(MyProcPort->sock);
 #endif


That said, your results are quite worrying.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

Re: Transparent column encryption

2021-12-06 Thread Jacob Champion

On Fri, 2021-12-03 at 22:32 +0100, Peter Eisentraut wrote:
> This feature does support deterministic
> encryption as an alternative to the default randomized encryption, so
> in that mode you can do equality lookups, at the cost of some
> security.

> + if (enc_det)
> + memset(iv, ivlen, 0);

I think reusing a zero IV will potentially leak more information than
just equality, depending on the cipher in use. You may be interested in
synthetic IVs and nonce-misuse resistance (e.g. [1]), since they seem
like they would match this use case exactly. (But I'm not a
cryptographer.)

> The encryption algorithms are mostly hardcoded right now, but there
> are facilities for picking algorithms and adding new ones that will be
> expanded.  The CMK process uses RSA-OAEP.  The CEK process uses
> AES-128-CBC right now; a more complete solution should probably
> involve some HMAC thrown in.

Have you given any thought to AEAD? As a client I'd like to be able to
tie an encrypted value to other column (or external) data. For example,
AEAD could be used to prevent a DBA from copying the (encrypted) value
of my credit card column into their account's row to use it.

> This is not targeting PostgreSQL 15.  But I'd appreciate some feedback
> on the direction.

What kinds of attacks are you hoping to prevent (and not prevent)?

--Jacob

[1] https://datatracker.ietf.org/doc/html/rfc8452

Re: pg_dump versus ancient server versions

On Sun, Dec 5, 2021 at 7:41 PM Tom Lane  wrote:
> Based on these results, I think maybe we should raise our ambitions
> a bit compared to Peter's original proposal.  Specifically,
> I wonder if it wouldn't be wise to try to silence compile warnings
> in these branches.  The argument for this is basically that if we
> don't, then every time someone builds one of these branches, they
> have to tediously go through the warnings and verify that
> they're not important.  It won't take long for the accumulated
> time-wastage from that to exceed the cost of back-patching whatever
> we did to silence the warning in later branches.

Yep. I have long been of the view, and have said before, that there is
very little harm in doing some maintenance of EOL branches. Making it
easy to test against them is a great way to improve our chances of
actually having the amount of backward-compatibility that we say we
want to have.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: ExecTypeSetColNames is fundamentally broken

On Sun, Dec 5, 2021 at 1:46 PM Tom Lane  wrote:
> So 0001 attached fixes this by revoking the decision to apply
> ExecTypeSetColNames in cases where a Var or RowExpr is declared
> to return a named composite type.  This causes a couple of regression
> test results to change, but they are ones that were specifically
> added to exercise this behavior that we now see to be invalid.

I don't understand the code so I can't comment on the code, but I find
the regression test changes pretty suspect. Attaching any alias list
to the RTE ought to rename the output columns for all purposes, not
just the ones we as implementers find convenient. I understand that we
have to do *something* here and that the present behavior is buggy and
unacceptable ... but I'm not sure I accept that the only possible fix
is the one you propose here.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: Why doesn't pgstat_report_analyze() focus on not-all-visible-page dead tuple counts, specifically?

On Sun, Dec 5, 2021 at 12:28 AM Peter Geoghegan  wrote:
> I wonder why we're counting the number of dead tuples (or LP_DEAD stub
> items) in the relation as a whole in ANALYZE's acquire_sample_rows()
> function. Wouldn't it make more sense to focus on the "live vs dead
> tuple properties" of heap pages that are not known to be all-visible
> when we generate statistics for our pgstat_report_analyze() report?
> These statistic collector stats are only for the benefit of autovacuum
> scheduling -- and so they're *consumed* in a way that is totally
> different to the nearby pg_statistic stats.

I think this could be the right idea. I'm not certain that it is, but
it does sound believable.

> This new approach also buys us the ability to extrapolate a new
> estimated number of dead tuples using old, stale stats. The stats can
> be combined with the authoritative/known number of not-all-visible
> pages right this second, since it's cheap enough to *accurately*
> determine the total number of not-all-visible pages for a heap
> relation by calling visibilitymap_count(). My guess is that this would
> be much more accurate in practice: provided the original average
> number of dead/live tuples (tuples per not-all-visible block) was
> still reasonably accurate, the extrapolated "total dead tuples right
> now" values would also be accurate.

So does this. If some of the table is now all-visible when it wasn't
before, it's certainly a good guess that the portions that still
aren't have about the same distribution of dead tuples that they did
before ... although the other direction is less clear: it seems
possible that newly not-all-visible pages have fewer dead tuples than
ones which have been not-all-visible for a while. But you have to make
some guess.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: Do we need pre-allocate WAL files during end-of-recovery checkpoint?

On 12/6/21, 4:54 AM, "Bharath Rupireddy" 
 wrote:
> The function PreallocXlogFiles doesn't get called during
> end-of-recovery checkpoint in CreateCheckPoint, see [1]. The server
> becomes operational after the end-of-recovery checkpoint and may need
> WAL files. However, I'm not sure how beneficial it is going to be if
> the WAL is pre-allocated (as PreallocXlogFiles just allocates only 1
> extra WAL file).

There is another thread for adding more effective WAL pre-allocation
[0] that you might be interested in.

Nathan

[0] 
https://www.postgresql.org/message-id/flat/20201225200953.jjkrytlrzojbndh5%40alap3.anarazel.de

Re: Is it correct to update db state in control file as "shutting down" during end-of-recovery checkpoint?

On 12/6/21, 4:34 AM, "Bharath Rupireddy" 
 wrote:
> While the database is performing end-of-recovery checkpoint, the
> control file gets updated with db state as "shutting down" in
> CreateCheckPoint (see the code snippet at [1]) and at the end it sets
> it back to "shut down" for a brief moment and then finally to "in
> production". If the end-of-recovery checkpoint takes a lot of time or
> the db goes down during the end-of-recovery checkpoint for whatever
> reasons, the control file ends up having the wrong db state.
>
> Should we add a new db state something like
> DB_IN_END_OF_RECOVERY_CHECKPOINT/"in end-of-recovery checkpoint" or
> something else to represent the correct state?

This seems like a reasonable change to me.  From a quick glance, it
looks like it should be a simple fix that wouldn't add too much
divergence between the shutdown and end-of-recovery checkpoint code
paths.

Nathan

Re: O(n) tasks cause lengthy startups and checkpoints

On 12/6/21, 3:44 AM, "Bharath Rupireddy" 
 wrote:
> On Fri, Dec 3, 2021 at 11:50 PM Bossart, Nathan  wrote:
>> I might hack something together for the separate worker approach, if
>> for no other reason than to make sure I really understand how these
>> functions work.  If/when a better idea emerges, we can alter course.
>
> Thanks. As I said upthread we've been discussing the approach of
> offloading some of the checkpoint tasks like (deleting snapshot files)
> internally for quite some time and I would like to share a patch that
> adds a new background cleaner process (currently able to delete the
> logical replication snapshot files, if required can be extended to do
> other tasks as well). I don't mind if it gets rejected. Please have a
> look.

Thanks for sharing!  I've also spent some time on a patch set, which I
intend to share once I have handling for all four tasks (so far I have
handling for CheckPointSnapBuild() and RemovePgTempFiles()).  I'll
take a look at your patch as well.

Nathan

Re: pg_replslotdata - a tool for displaying replication slot information

On 12/5/21, 11:10 PM, "Michael Paquier"  wrote:
> On Thu, Dec 02, 2021 at 08:32:08AM +0530, Bharath Rupireddy wrote:
>> On Thu, Dec 2, 2021 at 4:22 AM Andres Freund  wrote:
 I don't have any other compelling use- cases at the moment, but I will say
 that it is typically nice from an administrative standpoint to be able to
 inspect things like this without logging into a running server.
>>>
>>> Weighed against the cost of maintaining (including documentation) a separate
>>> tool this doesn't seem sufficient reason.
>> 
>> IMHO, this shouldn't be a reason to not get something useful (useful
>> IMO and few others in this thread) into the core. The maintenance of
>> the tools generally is low compared to the core server features once
>> they get reviewed and committed.
>
> Well, a bit less maintenance is always better than more maintenance.
> An extra cost that you may be missing is related to the translation of
> the documentation, as well as the translation of any new strings this
> would require.  FWIW, I don't directly see a use for this tool that
> could not be solved with an online server.

Bharath, perhaps you should maintain this outside of core PostgreSQL
for now.  If some compelling use-cases ever surface that make it seem
worth the added maintenance burden, this thread could probably be
revisited.

Nathan

Re: Do we need pre-allocate WAL files during end-of-recovery checkpoint?

2021-12-06 Thread SATYANARAYANA NARLAPURAM

If the segment size is 16MB it shouldn't take much time but higher segment
values this can be a problem. But again, the current segment has to be
filled 75% to precreate new one. I am not sure how much we gain. Do you
have some numbers with different segment sizes?

On Mon, Dec 6, 2021 at 4:51 AM Bharath Rupireddy <
bharath.rupireddyforpostg...@gmail.com> wrote:

> Hi,
>
> The function PreallocXlogFiles doesn't get called during
> end-of-recovery checkpoint in CreateCheckPoint, see [1]. The server
> becomes operational after the end-of-recovery checkpoint and may need
> WAL files. However, I'm not sure how beneficial it is going to be if
> the WAL is pre-allocated (as PreallocXlogFiles just allocates only 1
> extra WAL file).
>
> Thoughts?
>
> [1]
> /*
>  * An end-of-recovery checkpoint is really a shutdown checkpoint, just
>  * issued at a different time.
>  */
> if (flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY))
> shutdown = true;
> else
> shutdown = false;
>
> /*
>  * Make more log segments if needed.  (Do this after recycling old log
>  * segments, since that may supply some of the needed files.)
>  */
> if (!shutdown)
> PreallocXlogFiles(recptr, checkPoint.ThisTimeLineID);
>
> Regards,
> Bharath Rupireddy.
>
>
>

Re: PostgreSQL server: authentication method 10 not supported

Ram Pratap Maurya  writes:
> "PHP Warning:  pg_connect(): Unable to connect to PostgreSQL server: 
> authentication method 10 not supported in
> /var/www/html/myLavaUat/app/webroot/myLavaCronDirect/cron/mylava_stg_arch.php 
> on line 9 "

You need to update your client's libpq to a version that knows about
SCRAM authentication, or else not use a SCRAM-encrypted password.

regards, tom lane

Re: Transparent column encryption

On Fri, Dec 3, 2021 at 4:32 PM Peter Eisentraut
 wrote:
> But it's missing the remaining 90% of the work,
> including additional DDL support, error handling, robust memory
> management, protocol versioning, forward and backward compatibility,
> pg_dump support, psql \d support, refinement of the cryptography, and
> so on.  But I think obvious solutions exist to all of those things, so
> it isn't that interesting to focus on them for now.

Right, we wouldn't want to get bogged down at this stage in little
details like, uh, everything.

> Some protocol extensions are required.  These should be guarded by
> some _pq_... setting, but this is not done in this patch yet.  As
> mentioned above, extra messages are added for sending the CMKs and
> CEKs.  In the RowDescription message, I have commandeered the format
> field to add a bit that indicates that the field is encrypted.  This
> could be made a separate field, and there should probably be
> additional fields to indicate the algorithm and CEK name, but this was
> easiest for now.  The ParameterDescription message is extended to
> contain format fields for each parameter, for the same purpose.
> Again, this could be done differently.

I think this is reasonable. I would choose to use an additional bit in
the format field as opposed to a separate field. It is worth
considering whether it makes more sense to extend the existing
ParameterDescription message conditionally on some protocol-level
option, or whether we should instead, say, add ParameterDescription2
or the moral equivalent. As I see it, the latter feels conceptually
simpler, but on the other hand, our wire protocol supposes that we
will never run out of 1-byte codes for messages, so perhaps some
prudence is needed.

> Speaking of parameter descriptions, the trickiest part of this whole
> thing appears to be how to get transparently encrypted data into the
> database (as opposed to reading it out).  It is required to use
> protocol-level prepared statements (i.e., extended query) for this.

Why? If the client knows the CEK, can't the client choose to send
unprepared insert or update statements with pre-encrypted blobs? That
might be a bad idea from a security perspective because the encrypted
blob might then got logged, but we sometimes log parameters, too.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)

On Sun, Dec 5, 2021 at 11:44 PM Sadhuprasad Patro  wrote:
> 1.
> --- a/doc/src/sgml/ref/create_database.sgml
> +++ b/doc/src/sgml/ref/create_database.sgml
> @@ -31,7 +31,8 @@ CREATE DATABASE  class="parameter">name
> -   [ IS_TEMPLATE [=]  class="parameter">istemplate ] ]
> +   [ IS_TEMPLATE [=]  class="parameter">istemplate ]
> +   [ OID [=]  class="parameter">db_oid ] ]
>
> Replace "db_oid" with 'oid'. Below in the listitem, we have mentioned 'oid'.

I agree that the listitem and the synopsis need to be consistent, but
it could be made consistent either by changing that one to db_oid or
this one to oid.

> 2.
> --- a/src/backend/commands/dbcommands.c
> +++ b/src/backend/commands/dbcommands.c
> + if ((dboid < FirstNormalObjectId) &&
> + (strcmp(dbname, "template0") != 0) &&
> + (!IsBinaryUpgrade))
> + ereport(ERROR,
> + (errcode(ERRCODE_INVALID_PARAMETER_VALUE)),
> + errmsg("Invalid value for option \"%s\"", defel->defname),
> + errhint("The specified OID %u is less than the minimum OID for user
> objects %u.",
> + dboid, FirstNormalObjectId));
> + }
>
> Are we sure that 'IsBinaryUpgrade' will be set properly, before the
> createdb function is called? Can we recheck once ?

How could it be set incorrectly, and how could we recheck this?

> 3.
> @@ -504,11 +525,15 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
>   */
>   pg_database_rel = table_open(DatabaseRelationId, RowExclusiveLock);
>
> - do
> + /* Select an OID for the new database if is not explicitly configured. */
> + if (!OidIsValid(dboid))
>   {
> - dboid = GetNewOidWithIndex(pg_database_rel, DatabaseOidIndexId,
> -Anum_pg_database_oid);
> - } while (check_db_file_conflict(dboid));
>
> I think we need to do 'check_db_file_conflict' for the USER given OID
> also.. right? It may already be present.

Hopefully, if that happens, we straight up fail later on.

> 4.
> --- a/src/bin/initdb/initdb.c
> +++ b/src/bin/initdb/initdb.c
>
> /*
> + * Create template0 database with oid Template0ObjectId i.e, 4
> + */
> +
>
> Better to mention here, why OID 4 is reserved for template0 database?.

I'm not sure how we would give a reason for selecting an arbitrary
constant? We could perhaps explain why we use a fixed OID. But there's
no reason it has to be 4, I think.

> 5.
> + /*
> + * Create template0 database with oid Template0ObjectId i.e, 4
> + */
> + static const char *const template0_setup[] = {
> + "CREATE DATABASE template0 IS_TEMPLATE = true ALLOW_CONNECTIONS = false OID 
> "
> + CppAsString2(Template0ObjectId) ";\n\n",
>
> Can we write something like, 'OID = CppAsString2(Template0ObjectId)'?
> mention "=".

That seems like a good idea, because it would be more consistent.

> 6.
> +
> + /*
> + * We use the OID of postgres to determine datlastsysoid
> + */
> + "UPDATE pg_database SET datlastsysoid = "
> + "(SELECT oid FROM pg_database "
> + "WHERE datname = 'postgres');\n\n",
> +
>
> Make the above comment a single line comment.

I think what Shruthi did is more correct. It doesn't have to be done
as a single-line comment just because it can fit on one line. And
Shruthi didn't write this comment anyway, it's only moved slightly
from where it was before.

> 7.
> There are some spelling mistakes in the comments as below, please
> correct the same
> + /*
> + * Make sure that binary upgrade propogate the database OID to the
> new => correct spelling
> + * cluster
> + */
>
> +/* OID 4 is reserved for Templete0 database */
>  > Correct spelling
> +#define Template0ObjectId 4

Yes, those would be good to fix.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

[PATCH] Document heuristics for parameterized paths

2021-12-06 Thread Steinar H. Gunderson

Hi,

A small patch to the documentation about how to reduce the number of
parameterized paths, because it kept me searching for a long time :-)
(The code this documents is in add_paths_to_joinrel(), the loop
foreach(lc, root->join_info_list).)


diff --git a/src/backend/optimizer/README b/src/backend/optimizer/README
index 41c120e0cd..79e270188d 100644
--- a/src/backend/optimizer/README
+++ b/src/backend/optimizer/README
@@ -863,6 +863,19 @@ that.  An exception occurs for parameterized paths for the 
RHS relation of
 a SEMI or ANTI join: in those cases, we can stop the inner scan after the
 first match, so it's primarily startup not total cost that we care about.
 
+Furthermore, join trees involving parameterized paths are kept as left-deep
+as possible; for nested loops consisting of inner joins only, bushy plans
+are equivalent to left-deep ones, so keeping bushy plans is only a waste.
+This is enforced by refusing to join parameterized paths together unless
+the parameterization is resolved, *or* the remaining parameterization is
+one that must cannot be delayed right away (because of outer join
+restrictions).  This ensures that we do not keep around large subplans that
+are parameterized on a whole host of external relations, without losing
+any plans.  An exception is that we are allowed to keep a parameterization
+around if we *partially* resolve it, i.e., we had a multi-part index and
+resolved only one table from it.  This is known as the "star-schema"
+exception.
+
 
 LATERAL subqueries
 --

/* Steinar */
-- 
Homepage: https://www.sesse.net/

PostgreSQL server: authentication method 10 not supported

2021-12-06 Thread Ram Pratap Maurya

Hi Team,



I am facing connectivity  issue with PostgreSQL -13  , can you please suggest .



Error coming:



"PHP Warning:  pg_connect(): Unable to connect to PostgreSQL server: 
authentication method 10 not supported in

/var/www/html/myLavaUat/app/webroot/myLavaCronDirect/cron/mylava_stg_arch.php 
on line 9 "





DB "pg_hba.conf" Conf file setting is as below (172.16.40.32 is my application 
IP address).





# TYPE  DATABASEUSERADDRESS METHOD



# "local" is for Unix domain socket connections only

local   all all peer

#local   all postgrespeer

local   all postgresident

# IPv4 local connections:

hostall all 127.0.0.1/32md5

# IPv6 local connections:

hostall all ::1/128 md5

# Allow replication connections from localhost, by a user with the

# replication privilege.

local   replication all peer

hostreplication all 127.0.0.1/32md5

hostreplication all ::1/128 md5

hostreplication replication 192.168.1.133/32md5

hostreplication postgres192.168.1.133/32trust

hostreplication replication 192.168.1.138/32md5

hostreplication postgres192.168.1.138/32trust

hostreplication replication 172.16.40.30/32 md5

hostreplication postgres172.16.40.30/32 trust

hostreplication postgres127.0.0.1/32trust

hostall all 0.0.0.0/0   md5

hostreplication replication 172.16.40.32/32md5

hostreplication postgres172.16.40.32/32trust





Regards,

Ram Pratap.

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

On Mon, Dec 6, 2021 at 9:23 AM Ashutosh Sharma  wrote:
> One last point - If we try to clone a huge database, as expected CREATE 
> DATABASE emits a lot of WALs, causing a lot of intermediate checkpoints which 
> seems to be affecting the performance slightly.

Yes, I think this needs to be characterized better. If you have a big
shared buffers setting and a lot of those buffers are dirty and the
template database is small, all of which is fairly normal, then this
new approach should be much quicker. On the other hand, what if the
situation is reversed? Perhaps you have a small shared buffers and not
much of it is dirty and the template database is gigantic. Then maybe
this new approach will be slower. But right now I think we don't know
where the crossover point is, and I think we should try to figure that
out.

So for example, imagine tests with 1GB of shard_buffers, 8GB, and
64GB. And template databases with sizes of whatever the default is,
1GB, 10GB, 100GB. Repeatedly make 75% of the pages dirty and then
create a new database from one of the templates. And then just measure
the performance. Maybe for large databases this approach is just
really the pits -- and if your max_wal_size is too small, it
definitely will be. But, I don't know, maybe with reasonable settings
it's not that bad. Writing everything to disk twice - once to WAL and
once to the target directory - has to be more expensive than doing it
once. But on the other hand, it's all sequential I/O and the data
pages don't need to be fsync'd, so perhaps the overhead is relatively
mild. I don't know.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: Optionally automatically disable logical replication subscriptions on error

2021-12-06 Thread Mark Dilger

> On Dec 5, 2021, at 10:56 PM, osumi.takami...@fujitsu.com wrote:
> 
> In my humble opinion, I felt the original purpose of the patch was to 
> partially remedy
> the situation that during the failure of apply, the apply process keeps going
> into the infinite error loop.

I agree.

> I'd say that in this sense, if we include such resource errors, we fail to 
> achieve
> the purpose in some cases, because of some left possibilities of infinite 
> loop.
> Disabling the subscription with even one any error excludes this irregular 
> possibility,
> since there's no room to continue the infinite loop.

I don't think there is any right answer here.  It's a question of policy 
preferences.

My concern about disabling a subscription in response to *any* error is that 
people may find the feature does more harm than good.  Disabling the 
subscription in response to an occasional deadlock against other database 
users, or occasional resource pressure, might annoy people and lead to the 
feature simply not being used.

I am happy to defer to your policy preference.  Thanks for your work on the 
patch!

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: should we document an example to set multiple libraries in shared_preload_libraries?

Bharath Rupireddy  writes:
> Am I missing something here? Or is there a distinction between parsing
> of postgresql.conf and ALTER SYSTEM SET command for GUC_LIST_QUOTE
> values? If so, what is it?

One context is SQL, the other is not.  The quoting rules are
really quite different.

regards, tom lane

Re: Non-superuser subscription owners

2021-12-06 Thread Mark Dilger

> On Dec 6, 2021, at 2:19 AM, Amit Kapila  wrote:
> 
>>> If we want to maintain the property that subscriptions can only be
>>> owned by superuser

We don't want to maintain such a property, or at least, that's not what I want. 
 I don't think that's what Jeff wants, either.

To clarify, I'm not entirely sure how to interpret the verb "maintain" in your 
question, since before the patch the property does not exist, and after the 
patch, it continues to not exist.  We could *add* such a property, of course, 
though this patch does not attempt any such thing.

> I understand that but won't that get verified when we look up the
> information in pg_authid as part of superuser() check?

If we added a superuser() check, then yes, but that would take things in a 
direction I do not want to go.

As I perceive the roadmap:

1) Fix the current bug wherein subscription changes are applied with superuser 
force after the subscription owner has superuser privileges revoked.
2) Allow the transfer of subscriptions to non-superuser owners.
3) Allow the creation of subscriptions by non-superusers who are members of 
some as yet to be created predefined role, say "pg_create_subscriptions"

I may be wrong, but it sounds like you interpret the intent of this patch as 
enforcing superuserness.  That's not so.  This patch intends to correctly 
handle the situation where a subscription is owned by a non-superuser (task 1, 
above) without going so far as creating new paths by which that situation could 
arise (tasks 2 and 3, above).

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: should we document an example to set multiple libraries in shared_preload_libraries?

On Fri, Dec 3, 2021 at 11:25 PM Bossart, Nathan  wrote:
>
> On 12/3/21, 6:21 AM, "Bharath Rupireddy" 
>  wrote:
> > +1 to add here in the "Parameter Names and Values section", but do we
> > want to backlink every string parameter to this section? I think it
> > needs more effort. IMO, we can just backlink for
> > shared_preload_libraries alone. Thoughts?
>
> IMO this is most important for GUC_LIST_QUOTE parameters, of which
> there are only a handful.  I don't think adding a link to every string
> parameter is necessary.

Agree.

Should we specify something like below in the "Parameter Names and
Values" section's "String:" para? Do we use generic terminology like
'name' and val1, val2, val3 and so on?

ALTER SYSTEM SET name = val1,val2,val3;
ALTER SYSTEM SET name = 'val1', 'val2', 'val3';
ALTER SYSTEM SET name = '"val 1"', '"val,2"', 'val3';

Another thing I observed is the difference between how the
postgresql.conf file and ALTER SYSTEM SET command is parsed for
GUC_LIST_QUOTE values.

For instance, in postgresql.conf file, by default search_path is
specified as follows:
search_path = '"$user", public',
postgres=# show search_path ;
   search_path
-
 "$user", public
(1 row)

When I use the same style with ALTER SYSTEM SET command, the value is
treated as single string value:
postgres=# ALTER SYSTEM SET search_path = '"$user", public';
ALTER SYSTEM
postgres=# show search_path ;
   search_path
-
 "$user", public
(1 row)

postgres=# select pg_reload_conf();
 pg_reload_conf

 t
(1 row)

postgres=# show search_path ;
 search_path
-
 """$user"", public"
(1 row)

Am I missing something here? Or is there a distinction between parsing
of postgresql.conf and ALTER SYSTEM SET command for GUC_LIST_QUOTE
values? If so, what is it?

Regards,
Bharath Rupireddy.

Re: MSVC SSL test failure

2021-12-06 Thread Alexander Lakhin

Hello Andrew,
06.12.2021 17:56, Andrew Dunstan wrote:
> Yeah, quite annoying, especially because only some combinations of MSVC
> runtime / openssl version seem to trigger the problem.
>
>
> Adding a shutdown() before the closesocket() also fixes the issue.
>
Can you confirm that adding shutdown(MyProcPort->sock, SD_BOTH) fixes
the issue for many test runs?
I don't see the stable test passing here.
Without shutdown() the test failed on iterations 1, 5, 4, but with
shutdown() it failed too, on iterations 3, 1, 4.
Without close() and shutdown() the test passes 20 iterations.
(I'm still researching how openssl affects the shutdown sequence.)

Best regards,
Alexander

Re: pg_get_publication_tables() output duplicate relid

On Mon, Dec 6, 2021 at 6:05 PM Amit Kapila  wrote:
> On Mon, Dec 6, 2021 at 1:00 PM Amit Langote  wrote:
> > On Mon, Dec 6, 2021 at 3:55 PM houzj.f...@fujitsu.com
> >  wrote:
> > > After thinking more on this. I find there might be some usage about 
> > > adding both
> > > child and parent to the publication.
> > >
> > > For the future feature publication row filter(and maybe column filter), It
> > > could be useful for user to adding child and parent with different filter
> > > expression. If pubviaroot=true, user can expect the parent's filter take
> > > effect, If pubviaroot=false, they can expect the child's filter take 
> > > effect.
> > >
> > > If we disallow adding both child and parent to publication, it could be 
> > > harder
> > > for these features to implement.
> >
> > It's possible that one may want such a feature, and yes, outright
> > preventing adding both the parent and a partition to a publication
> > would perhaps make it harder.  Mind you though that any such feature
> > would need to fix the current implementation anyway to make the
> > publication-behavior-related catalog entries for child tables, which
> > we currently don't, unless children are added explicitly.
> >
> > That said, we should also be careful in implementing new features that
> > offer a finer per-partition granularity, because the fact that we do
> > currently offer that for many of the existing old features such as
> > constraints, indexes, etc. limits the scalability of our architecture.
> > That point may be off-topic for the thread, but something to consider,
> > or maybe I need to take a look at the other patch to see if my
> > concerns are addressable / have been addressed.
> >
>
> Your concern is not very clear to me. Can you be more specific in what
> kind of problem you see with such a design for row filtering?

I guess my characterization of it is still vague but the problem I see
is that we'll be adding one more feature that allows manipulating
partitions directly, which means more system catalogs than there
already are that know about partitions as objects.  I remember being
deliberate about avoiding pg_publication_rel entries for all
partitions when a partitioned table is added to a publication for that
very reason.

Is covering the use case of defining different filters for each
partition important for developing that feature?  I feel that defining
one for the root table and just using that for all partitions (albeit
translating as needed to account for possibly different attribute
numbers) would be enough for most users' use cases.  Maybe I'm wrong
on that.

> You
> might want to once look at that thread. You can find a summary of that
> patch in the email [1].

Okay, I'll try to do that.


--
Amit Langote
EDB: http://www.enterprisedb.com

Re: MSVC SSL test failure

2021-12-06 Thread Daniel Gustafsson

> On 6 Dec 2021, at 15:56, Andrew Dunstan  wrote:

> Yeah, quite annoying, especially because only some combinations of MSVC
> runtime / openssl version seem to trigger the problem.
> 
> Adding a shutdown() before the closesocket() also fixes the issue.

If you have a patch you're testing I'm happy to run it through Cirrus in as
many combinations of MSVC and OpenSSL as I can muster there.

--
Daniel Gustafsson   https://vmware.com/

psql: exit status with multiple -c and -f

2021-12-06 Thread Justin Pryzby

I raised this issue a few years ago.
https://www.postgresql.org/message-id/20181217175841.GS13019%40telsasoft.com

|[pryzbyj@database ~]$ psql -v VERBOSITY=terse ts -xtc 'ONE' -c "SELECT 'TWO'"; 
echo "exit status $?"
|ERROR:  syntax error at or near "ONE" at character 1
|?column? | TWO
|
|exit status 0

The documentation doen't say what the exit status should be in this case:
| psql returns 0 to the shell if it finished normally, 1 if a fatal error of 
its own occurs (e.g., out of memory, file not found), 2 if the connection to 
the server went bad and the session was not interactive, and 3 if an error 
occurred in a script and the variable ON_ERROR_STOP was set.

It returns 1 if the final command fails, even though it's not a "fatal error"
(it would've happily kept running more commands).

| x=`some_command_that_fails`
| rm -fr "$x/$y # removes all your data

| psql -c "begin; C REATE TABLE newtable(LIKE oldtable) INSERT INTO newtable 
SELECT * FROM oldtable; commit" -c "DROP TABLE oldtable
| psql -v VERBOSITY=terse ts -xtc '0CREATE TABLE newtbl(i int)' -c 'INSERT INTO 
newtbl SELECT * FROM tbl' -c 'DROP TABLE IF EXISTS tbl' -c 'ALTER TABLE newtbl 
RENAME TO tbl'; echo ret=$?

David J suggested to change the default value of ON_ERROR_STOP.  The exit
status in the non-default case would have to be documented.  That's one
solution, and allows the old behavior if anybody wants it.  That probably does
what most people want, too.  This is more likely to expose a real problem that
someone would have missed than to break a legitimate use.  That doesn't appear
to break regression tests at all.

The alternative to David's suggestion is to define some non-zero exit status to
mean "an error occurred and ON_ERROR_STOP was not set".

If the new definition said that the new exit status was only used for errors
which occur in the final command (-c or -f), and if the exit status in that
case were "1", then this would be a back-patchable documentation change,
removing the word "fatal" and removing or updating the parenthetical examples.
That would make psql behave exactly like /bin/sh does when used without
"set -e" - which is usually the opposite of the desirable behavior...

I think it's not viable to change the exit status in the case of a single
-c/-f, nor to change the exit status in back branches.  David's suggestion is
more radical than the minimal change to a nonzero exit status, but maybe that's
okay ?

-- 
Justin

Re: MSVC SSL test failure

2021-12-06 Thread Andrew Dunstan

On 12/6/21 01:02, Tom Lane wrote:
> Andrew Dunstan  writes:
>> On 12/5/21 12:50, Tom Lane wrote:
>>> This looks quite a bit like the sort of failure that commit
>>> 6051857fc was meant to forestall.  I wonder whether reverting
>>> that commit changes the results?  You might also try inserting
>>> a shutdown() call, as we'd decided not to do [1].
>> Commenting out the closesocket() worked.
> Man, that's annoying.  Apparently OpenSSL is doing something to
> screw up the shutdown sequence.  According to [1], the graceful
> shutdown sequence will happen by default, but changing SO_LINGER
> or SO_DONTLINGER can get you into abortive shutdown anyway.
> Maybe they change one of those settings (why?)
>
>   regards, tom lane
>
> [1] 
> https://docs.microsoft.com/en-us/windows/win32/winsock/graceful-shutdown-linger-options-and-socket-closure-2

Yeah, quite annoying, especially because only some combinations of MSVC
runtime / openssl version seem to trigger the problem.

Adding a shutdown() before the closesocket() also fixes the issue.

says:

To assure that all data is sent and received on a connected socket
before it is closed, an application should use shutdown to close
connection before calling closesocket.
...

Note  The shutdown function does not block regardless of the
SO_LINGER setting on the socket.

Since we're not expecting anything back from the client I don't think we
need any of the recv calls the recipes there suggest.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

Re: Multi-Column List Partitioning

2021-12-06 Thread Amul Sul

On Mon, Dec 6, 2021 at 7:27 PM Nitin Jadhav
 wrote:
>
> Thank you for reviewing the patch.
>
> > partbounds.c: In function ‘get_qual_for_list.isra.18’:
> > partbounds.c:4284:29: warning: ‘boundinfo’ may be used uninitialized
> > in this function [-Wmaybe-uninitialized]
> > datumCopy(bound_info->datums[i][j],
> >   ~~^~~~
> > partbounds.c:4335:21: note: ‘boundinfo’ was declared here
> >  PartitionBoundInfo boundinfo;
> > ^
> > partbounds.c: In function ‘partition_bounds_merge’:
> > partbounds.c:1305:12: warning: ‘inner_isnull’ may be used
> > uninitialized in this function [-Wmaybe-uninitialized]
> >   bool*inner_isnull;
> >^~~~
> > partbounds.c:1304:12: warning: ‘outer_isnull’ may be used
> > uninitialized in this function [-Wmaybe-uninitialized]
> >   bool*outer_isnull;
> >   ^~~~
>
> Fixed.
>
> > This function is unnecessarily complicated, I think you can avoid
> > inner for loops; simply replace for-loop-block with  "if
> > (equal(lfirst(cell), new_bound)) return true".
>
> Thank you for the suggestion. Fixed.
>
> > + char   **colname = (char **) palloc0(partnatts * sizeof(char *));
> > + Oid*coltype = palloc0(partnatts * sizeof(Oid));
> > + int32*coltypmod = palloc0(partnatts * sizeof(int));
> > + Oid*partcollation = palloc0(partnatts * sizeof(Oid));
> > +
> > This allocation seems to be worthless, read ahead.
> >
> > I think there is no need for this separate loop inside
> > transformPartitionListBounds, you can do that same in the next loop as
> > well. And instead of  get_partition_col_* calling and storing, simply
> > use that directly as an argument to transformPartitionBoundValue().
>
> Yes. The loop can be avoided and content of the above loop can be
> included in the next loop but the next loop iterates over a list of
> multi column datums. For each iteration, we need the information of
> all the columns. The above data (colname, coltype, coltypmod and
> partcollation) remains same for each iteration of the loop, If we
> modify as suggested, then the function to fetch these information has
> to be called every-time. To avoid this situation I have made a
> separate loop outside which only runs as many number of columns and
> stores in a variable which can be reused later. Please let me correct
> if I am wrong.
>

Ok, colname can be fetched in advance but I don't think it worth it to
fetch coltype, coltypmod & partcollation;  and, store in the
explicitly allocated memory, instead, you can directly call
get_partition_col_* inline functions.

> > I think this should be inside the "else" block after "!IsA(rowexpr,
> > RowExpr)" error and you can avoid IsA() check too.
>
> This is required to handle the situation when one partition key is
> mentioned and multiple values are provided in the partition bound
> specification.
>
> > Looks difficult to understand at first glance, how about the following:
> >
> > if (b1->isnulls != b2->isnulls)
> >return false;
> >
> > if (b1->isnulls)
> > {
> >if (b1->isnulls[i][j] != b2->isnulls[i][j])
> >return false;
> >if (b1->isnulls[i][j])
> >continue;
> > }
> >
> > See how range partitioning infinite values are handled. Also, place
> > this before the comment block that was added for the "!datumIsEqual()"
> > case.
>
> Fixed. I feel the 'continue' block is not required and hence removed it.
>
> > Nothing wrong with this but if we could have checked "dest->isnulls"
> > instead of "src->isnulls" would be much better.
>
> Here we are copying the data from 'src' to 'dest'. If there is no data
> in 'src', it is unnecessary to copy. Hence checking 'src'.
>

I am not sure how that makes a difference since you do allocate 'dest'
based on 'src'; anyway, I leave that choice to you.

> > Condition "key->strategy != PARTITION_STRATEGY_LIST" seems to be 
> > unnecessary.
>
> Fixed.
>
> > Can't be a single loop?
>
> Yes. Fixed.
>

Thanks, will have a look.

Regards,
Amul

Re: pg_get_publication_tables() output duplicate relid

On Mon, Dec 6, 2021 at 1:59 PM Amit Kapila  wrote:
> On Fri, Dec 3, 2021 at 6:04 PM Amit Langote  wrote:
> > On Fri, Dec 3, 2021 at 12:37 PM Amit Kapila  wrote:
> > > On Thu, Dec 2, 2021 at 7:18 PM Amit Langote  
> > > wrote:
> > > > Okay, I did write a PoC patch this morning after sending out my
> > > > earlier email.  I polished it a bit, which is attached.
> > >
> > > I see multiple problems with this patch and idea.
> >
> > Thanks for looking at it.  Yeah, I have not looked very closely at ALL
> > TABLES [IN SCHEMA], though only because I suspected that those cases
> > deal with partitioning in such a way that the partition duplication
> > issue doesn't arise.  That is, only the FOR TABLE list_of_tables and
> > ADD TABLE syntax allow for the duplication issue to occur.
> >
> > > (a) I think you
> > > forgot to deal with "All Tables In Schema" Publication which will be
> > > quite tricky to deal with during attach operation. How will you remove
> > > a particular relation from such a publication if there is a need to do
> > > so?
> >
> > Hmm, my understanding of how FOR ALL TABLES... features work is that
> > one cannot remove a particular relation from such publications?
> >
> > create schema sch;
> > create table sch.p (a int primary key) partition by list (a);
> > create table sch.p1 partition of sch.p for values in (1);
> > create table sch.p2 partition of sch.p for values in (2);
> > create table p (a int primary key) partition by list (a);
> > create table p1 partition of p for values in (1);
> > create table p2 partition of p for values in (2);
> > create publication puball for all tables;
> > create publication pubsch for all tables in schema sch;
> >
> > alter publication puball drop table p;
> > ERROR:  publication "puball" is defined as FOR ALL TABLES
> > DETAIL:  Tables cannot be added to or dropped from FOR ALL TABLES 
> > publications.
> >
> > alter publication pubsch drop table sch.p;
> > ERROR:  relation "p" is not part of the publication
> >
> > What am I missing?
>
> Currently, in your patch, you are trying to remove a particular
> relation/partition during attach but how will you do that if such a
> relation is part of All Tables In Schema publication?

So IIUC the scenario of concern is when a table to be attached as a
partition is in a schema that's present in pg_publication_namespace.
The only way to stop it from being published is to move it to another
schema that is not published using that publication.

I think I misunderstood how the IN SCHEMA feature works.
Specifically, I didn't know that one can add a partitioned table to
the same publication (or any table other than those in a particular
schema for that matter).  Then the attached partition would still be
present in the publication by way of being part of the schema that is
present in the publication, along with the partitioned table that is
added separately.

Yes, my proposal in its current form can't prevent that kind of duplication.

--
Amit Langote
EDB: http://www.enterprisedb.com

Re: Failed transaction statistics to measure the logical replication progress

2021-12-06 Thread vignesh C

On Sat, Dec 4, 2021 at 6:32 PM osumi.takami...@fujitsu.com
 wrote:
>
> On Friday, December 3, 2021 3:12 PM vignesh C  wrote:
> > Thanks for the updated patch.
> > Currently we are storing the commit count, error_count and abort_count for
> > each table of the table sync operation. If we have thousands of tables, we 
> > will
> > be storing the information for each of the tables.
> > Shouldn't we be storing the consolidated information in this case.
> > diff --git a/src/backend/replication/logical/tablesync.c
> > b/src/backend/replication/logical/tablesync.c
> > index f07983a..02e9486 100644
> > --- a/src/backend/replication/logical/tablesync.c
> > +++ b/src/backend/replication/logical/tablesync.c
> > @@ -1149,6 +1149,11 @@ copy_table_done:
> > MyLogicalRepWorker->relstate_lsn = *origin_startpos;
> > SpinLockRelease(&MyLogicalRepWorker->relmutex);
> >
> > +   /* Report the success of table sync. */
> > +   pgstat_report_subworker_xact_end(MyLogicalRepWorker->subid,
> > +
> >   MyLogicalRepWorker->relid,
> > +
> >   0 /* no logical message type */ );
> Okay.
>
> I united all stats into that of apply worker.
> In line with this change, I fixed the TAP tests as well
> to cover the updates of stats done by table sync workers.
>
> Also, during my self-review, I noticed that
> I should call pgstat_report_subworker_xact_end() before
> process_syncing_tables() because it can lead to process
> exit, which results in missing one increment of the stats columns.
> I noted this point in a comment as well.

Thanks for the updated patch, few comments:
1) We can keep the documentation similar to mention the count includes
both table sync worker / main apply worker in case of
commit_count/error_count and abort_count to keep it consistent.
+   commit_count bigint
+  
+  
+   Number of transactions successfully applied in this subscription.
+   COMMIT and COMMIT PREPARED increments this counter.
+  
+ 
+
+ 
+  
+   error_count bigint
+  
+  
+   Number of transactions that failed to be applied by the table
+   sync worker or main apply worker in this subscription.
+   
+ 
+
+ 
+  
+   abort_count bigint
+  
+  
+   Number of transactions aborted in this subscription.
+   ROLLBACK PREPARED increments this counter.
+  
+ 

2) Can this be changed:
+   /*
+* If this is a new error reported by table sync worker,
consolidate this
+* error count into the entry of apply worker.
+*/
+   if (OidIsValid(msg->m_subrelid))
+   {
+   /* Gain the apply worker stats */
+   subwentry = pgstat_get_subworker_entry(dbentry, msg->m_subid,
+
InvalidOid, true);
+   subwentry->error_count++;
+   }
+   else
+   subwentry->error_count++;   /* increment the apply
worker's counter. */
To:
+   /*
+* If this is a new error reported by table sync worker,
consolidate this
+* error count into the entry of apply worker.
+*/
+   if (OidIsValid(msg->m_subrelid))
+   /* Gain the apply worker stats */
+   subwentry = pgstat_get_subworker_entry(dbentry, msg->m_subid,
+
InvalidOid, true);
+
+   subwentry->error_count++;   /* increment the apply
worker's counter. */

3) Since both 026_worker_stats and 027_worker_xact_stats.pl are
testing pg_stat_subscription_workers, can we move the tests to
026_worker_stats.pl. If possible the error_count validation can be
combined with the existing tests.
diff --git a/src/test/subscription/t/027_worker_xact_stats.pl
b/src/test/subscription/t/027_worker_xact_stats.pl
new file mode 100644
index 000..31dbea1
--- /dev/null
+++ b/src/test/subscription/t/027_worker_xact_stats.pl
@@ -0,0 +1,162 @@
+
+# Copyright (c) 2021, PostgreSQL Global Development Group
+
+# Tests for subscription worker statistics during apply.
+use strict;
+use warnings;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More tests => 1;
+
+# Create publisher node

Regards,
Vignesh

Re: Make pg_waldump report replication origin ID, LSN, and timestamp.

On Mon, Dec 6, 2021 at 5:09 PM Michael Paquier  wrote:
>
> On Mon, Dec 06, 2021 at 04:35:07PM +0900, Masahiko Sawada wrote:
> > I've attached a patch to add replication origin information to
> > xact_desc_prepare().
>
> Yeah.
>
> +   if (origin_id != InvalidRepOriginId)
> +   appendStringInfo(buf, "; origin: node %u, lsn %X/%X, at %s",
> +origin_id,
> +LSN_FORMAT_ARGS(parsed.origin_lsn),
> +timestamptz_to_str(parsed.origin_timestamp));
>
> Shouldn't you check for parsed.origin_lsn instead?  The replication
> origin is stored there as far as I read EndPrepare().

Yeah, I was thinking check origin_lsn instead. That way, we don't show
invalid origin_timestamp and origin_lsn even if origin_id is set. But
as far as I read, the same is true for xact_desc_commit() (and
xact_desc_rollback()). That is, since apply workers always its origin
id and could do commit transactions that are not replicated from the
publisher, it's possible that xact_desc_commit() reports like:

rmgr: Transaction len (rec/tot):117/   117, tx:725, lsn:
0/014BE938, prev 0/014BE908, desc: COMMIT 2021-12-06 22:04:44.462200
JST; inval msgs: catcache 55 catcache 54 catcache 64; origin: node 1,
lsn 0/0, at 2000-01-01 09:00:00.00 JST

Also, looking at PrepareRedoAdd(), we check the replication origin id.
So I think that it'd be better to check origin_id for consistency.

>
> Commit records check after XACT_XINFO_HAS_ORIGIN, but
> xact_desc_abort() may include this information for ROLLBACK PREPARED
> transactions so we could use the same logic as xact_desc_commit() for
> the abort case, no?

Good catch! I'll submit an updated patch.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2021-12-06 Thread Ashutosh Sharma

Thank you, Dilip for the quick response. I am okay with the changes done in
the v7 patch.

One last point - If we try to clone a huge database, as expected CREATE
DATABASE emits a lot of WALs, causing a lot of intermediate checkpoints
which seems to be affecting the performance slightly.

--
With Regards,
Ashutosh Sharma.

On Mon, Dec 6, 2021 at 9:59 AM Dilip Kumar  wrote:

> On Mon, Dec 6, 2021 at 9:17 AM Ashutosh Sharma 
> wrote:
> >
> > Here are few more review comments:
>
> Thanks for reviewing it.
>
> > 1) It seems that we are not freeing the memory allocated for buf.data in
> CreateDirAndVersionFile().
>
> Yeah this was a problem in v6 but I have fixed in v7, can you check that.
> >
> > + */
> > +static void
> > +CreateDirAndVersionFile(char *dbpath, Oid dbid, Oid tsid, bool isRedo)
> > +{
> >
> > 2) Do we need to pass dbpath here? I mean why not reconstruct it from
> dbid and tsid.
>
> Yeah we can do that but I thought computing dbpath has some cost and
> since the caller already has it why not to pass it.
>
> >
> > 3) Not sure if this point has already been discussed, Will we be able to
> recover the data when wal_level is set to minimal because the following
> condition would be false with this wal level.
> >
> > +   use_wal = XLogIsNeeded() &&
> > +   (relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork);
> >
>
> Since we are creating new relfilenode this is fine, refer "Skipping
> WAL for New RelFileNode" in src/backend/access/transam/README
>
> --
> Regards,
> Dilip Kumar
> EnterpriseDB: http://www.enterprisedb.com
>

Re: MSVC SSL test failure

2021-12-06 Thread Daniel Gustafsson

> On 6 Dec 2021, at 07:02, Tom Lane  wrote:
> 
> Andrew Dunstan  writes:
>> On 12/5/21 12:50, Tom Lane wrote:
>>> This looks quite a bit like the sort of failure that commit
>>> 6051857fc was meant to forestall.  I wonder whether reverting
>>> that commit changes the results?  You might also try inserting
>>> a shutdown() call, as we'd decided not to do [1].
> 
>> Commenting out the closesocket() worked.
> 
> Man, that's annoying.  Apparently OpenSSL is doing something to
> screw up the shutdown sequence.  According to [1], the graceful
> shutdown sequence will happen by default, but changing SO_LINGER
> or SO_DONTLINGER can get you into abortive shutdown anyway.
> Maybe they change one of those settings (why?)

AFAICT they don't touch either, and nothing really sticks out looking at
setsockopt calls in either 1.1.1 or 3.0.0.

--
Daniel Gustafsson   https://vmware.com/

Re: GUC flags

2021-12-06 Thread Justin Pryzby

On Mon, Dec 06, 2021 at 03:58:39PM +0900, Michael Paquier wrote:
> On Sun, Dec 05, 2021 at 11:38:05PM -0600, Justin Pryzby wrote:
> > Thanks.  One more item.  The check_guc script currently outputs 68 false
> > positives - even though it includes a list of 20 exceptions.  This is not
> > useful.
> 
> Indeed.  Hmm.  This script does a couple of things:
> 1) Check the format of the options defined in the various lists of
> guc.c, which is something people format well, and pgindent also does 
> a part of this job.
> 2) Check that options in the hardcoded list of GUCs in
> INTENTIONALLY_NOT_INCLUDED are not included in
> postgresql.conf.sample
> 3) Check that nothing considered as a parameter in
> postgresql.conf.sample is listed in guc.c.
> 
> Your patch removes 1) and 2), but keeps 3) to check for dead
> parameter references in postgresql.conf.sample.

The script checks that guc.c and sample config are consistent.

I think your undertanding of INTENTIONALLY_NOT_INCLUDED is not right.
That's a list of stuff it "avoids reporting" as an suspected error, not an
additional list of stuff to checks.  INTENTIONALLY_NOT_INCLUDED is a list of
stuff like NOT_IN_SAMPLE, which is better done by parsing /NOT_IN_SAMPLE/.

> Is check_guc actually run on a periodic basis by somebody?  Based on
> the amount of false positives that has accumulated over the years, and
> what `git grep` can already do for 3), it seems to me that we have
> more arguments in favor of just removing it entirely.

I saw that Tom updated it within the last 12 months, which I took to mean that
it was still being maintained. But I'm okay with removing it.

-- 
Justin

Do we need pre-allocate WAL files during end-of-recovery checkpoint?

Hi,

The function PreallocXlogFiles doesn't get called during
end-of-recovery checkpoint in CreateCheckPoint, see [1]. The server
becomes operational after the end-of-recovery checkpoint and may need
WAL files. However, I'm not sure how beneficial it is going to be if
the WAL is pre-allocated (as PreallocXlogFiles just allocates only 1
extra WAL file).

Thoughts?

[1]
/*
 * An end-of-recovery checkpoint is really a shutdown checkpoint, just
 * issued at a different time.
 */
if (flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY))
shutdown = true;
else
shutdown = false;

/*
 * Make more log segments if needed.  (Do this after recycling old log
 * segments, since that may supply some of the needed files.)
 */
if (!shutdown)
PreallocXlogFiles(recptr, checkPoint.ThisTimeLineID);

Regards,
Bharath Rupireddy.

Re: row filtering for logical replication

2021-12-06 Thread Euler Taveira

On Mon, Dec 6, 2021, at 3:44 AM, Amit Kapila wrote:
> I think what you said as (b) is wrong because we want to allow builtin
> immutable functions. See discussion [1].
It was a typo. I mean "non-immutable" function.

> True, but that is the main reason the review and development are being
> done as separate sub-features. I suggest still keeping the similar
> separation till some of the reviews of each of the patches are done,
> otherwise, we need to rethink how to divide for easier review. We need
> to retain the 0005 patch because that handles many problems without
> which the main patch is incomplete and buggy w.r.t replica identity.
IMO we should merge sub-features as soon as we reach consensus. Every new
sub-feature breaks comments, tests and documentation if you want to remove or
rearrange patches. It seems I misread 0005. I agree that it is important. I'll
check it.


--
Euler Taveira
EDB   https://www.enterprisedb.com/

Re: Fix inappropriate uses of PG_GETARG_UINT32()

2021-12-06 Thread Peter Eisentraut


On 01.12.21 22:59, Bossart, Nathan wrote:

On 12/1/21, 10:29 AM, "Peter Eisentraut"  
wrote:

The attached patch fixes this by accepting the argument using
PG_GETARG_INT32(), doing some checks, and then casting it to unsigned
for the rest of the code.

The patch also fixes another inappropriate use in an example in the
documentation.  These two were the only inappropriate uses I found,
after we had fixed a few recently.


LGTM


committed

Re: row filtering for logical replication

2021-12-06 Thread Euler Taveira

On Mon, Dec 6, 2021, at 3:35 AM, Dilip Kumar wrote:
> On Mon, Dec 6, 2021 at 6:49 AM Euler Taveira  wrote:
> >
> > On Fri, Dec 3, 2021, at 8:12 PM, Euler Taveira wrote:
> >
> > PS> I will update the commit message in the next version. I barely changed 
> > the
> > documentation to reflect the current behavior. I probably missed some 
> > changes
> > but I will fix in the next version.
> >
> > I realized that I forgot to mention a few things about the UPDATE behavior.
> > Regardless of 0003, we need to define which tuple will be used to evaluate 
> > the
> > row filter for UPDATEs. We already discussed it circa [1]. This current 
> > version
> > chooses *new* tuple. Is it the best choice?
> 
> But with 0003, we are using both the tuple for evaluating the row
> filter, so instead of fixing 0001, why we don't just merge 0003 with
> 0001?  I mean eventually, 0003 is doing what is the agreed behavior,
> i.e. if just OLD is matching the filter then convert the UPDATE to
> DELETE OTOH if only new is matching the filter then convert the UPDATE
> to INSERT.  Do you think that even we merge 0001 and 0003 then also
> there is an open issue regarding which row to select for the filter?
Maybe I was not clear. IIUC we are still discussing 0003 and I would like to
propose a different default based on the conclusion I came up. If we merged
0003, that's fine; this change will be useless. If we don't or it is optional,
it still has its merit.

Do we want to pay the overhead to evaluating both tuple for UPDATEs? I'm still
processing if it is worth it. If you think that in general the row filter
contains the primary key and it is rare to change it, it will waste cycles
evaluating the same expression twice. It seems this behavior could be
controlled by a parameter.


--
Euler Taveira
EDB   https://www.enterprisedb.com/

Is it correct to update db state in control file as "shutting down" during end-of-recovery checkpoint?

Hi,

While the database is performing end-of-recovery checkpoint, the
control file gets updated with db state as "shutting down" in
CreateCheckPoint (see the code snippet at [1]) and at the end it sets
it back to "shut down" for a brief moment and then finally to "in
production". If the end-of-recovery checkpoint takes a lot of time or
the db goes down during the end-of-recovery checkpoint for whatever
reasons, the control file ends up having the wrong db state.

Should we add a new db state something like
DB_IN_END_OF_RECOVERY_CHECKPOINT/"in end-of-recovery checkpoint" or
something else to represent the correct state?

Thoughts?

[1]
void
CreateCheckPoint(int flags)
{

/*
 * An end-of-recovery checkpoint is really a shutdown checkpoint, just
 * issued at a different time.
 */
if (flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY))
shutdown = true;
else
shutdown = false;

if (shutdown)
{
LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
ControlFile->state = DB_SHUTDOWNING;
UpdateControlFile();
LWLockRelease(ControlFileLock);
}

if (shutdown)
ControlFile->state = DB_SHUTDOWNED;

Regards,
Bharath Rupireddy.

Re: Shouldn't postgres_fdw report warning when it gives up getting result from foreign server?

On Mon, Dec 6, 2021 at 1:47 PM Fujii Masao  wrote:
> Yeah, I agree that's not elegant..
>
> So I'd like to propose new patch with different design from
> what I proposed before. Patch attached.
>
> This patch changes pgfdw_exec_cleanup_query() so that it tells
> its callers the information about whether the timeout expired
> or not. Then the callers (pgfdw_exec_cleanup_query and
> pgfdw_cancel_query) report the warning messages based on
> the results from pgfdw_exec_cleanup_query().

+1 for adding a new timed_out param to pgfdw_get_cleanup_result.
pgfdw_get_cleanup_result_v2 patch looks good to me.

Regards,
Bharath Rupireddy.

Re: O(n) tasks cause lengthy startups and checkpoints

On Fri, Dec 3, 2021 at 11:50 PM Bossart, Nathan  wrote:
>
> On 12/3/21, 5:57 AM, "Bharath Rupireddy" 
>  wrote:
> > On Fri, Dec 3, 2021 at 3:01 AM Bossart, Nathan  wrote:
> >>
> >> On 12/1/21, 6:48 PM, "Bharath Rupireddy" 
> >>  wrote:
> >> > +1 for the overall idea of making the checkpoint faster. In fact, we
> >> > here at our team have been thinking about this problem for a while. If
> >> > there are a lot of files that checkpoint has to loop over and remove,
> >> > IMO, that task can be delegated to someone else (maybe a background
> >> > worker called background cleaner or bg cleaner, of course, we can have
> >> > a GUC to enable or disable it). The checkpoint can just write some
> >>
> >> Right.  IMO it isn't optimal to have critical things like startup and
> >> checkpointing depend on somewhat-unrelated tasks.  I understand the
> >> desire to avoid adding additional processes, and maybe it is a bigger
> >> hammer than what is necessary to reduce the impact, but it seemed like
> >> a natural solution for this problem.  That being said, I'm all for
> >> exploring other ways to handle this.
> >
> > Having a generic background cleaner process (controllable via a few
> > GUCs), which can delete a bunch of files (snapshot, mapping, old WAL,
> > temp files etc.) or some other task on behalf of the checkpointer,
> > seems to be the easiest solution.
> >
> > I'm too open for other ideas.
>
> I might hack something together for the separate worker approach, if
> for no other reason than to make sure I really understand how these
> functions work.  If/when a better idea emerges, we can alter course.

Thanks. As I said upthread we've been discussing the approach of
offloading some of the checkpoint tasks like (deleting snapshot files)
internally for quite some time and I would like to share a patch that
adds a new background cleaner process (currently able to delete the
logical replication snapshot files, if required can be extended to do
other tasks as well). I don't mind if it gets rejected. Please have a
look.

Regards,
Bharath Rupireddy.


v1-0001-background-cleaner-to-offload-checkpoint-tasks.patch
Description: Binary data

Re: preserve timestamps when installing headers

2021-12-06 Thread Peter Eisentraut


On 06.12.21 12:15, Michael Paquier wrote:

FWIW, I am not on board with changing build semantics or any
assumptions the header installation relies on either, but I could see
a point in switching back to INSTALL_DATA instead of cp to be
consistent with the rest of the build, iff the argument made back in
2005 about the performance of this code path does not hold anymore.


I think you will find that it is still very slow.

Re: preserve timestamps when installing headers

2021-12-06 Thread Peter Eisentraut




On 06.12.21 07:51, Tom Lane wrote:

TBH, I am not convinced that the complained-of case is enough of a
problem to justify any change in our build rules, even if there
weren't any semantic issues.  If you are worried about build times,
you should be using ccache, and IME builds using ccache are not
terribly impacted by file timestamp changes.


I have never heard of a dependency-based build system taking into 
account the timestamps of files outside of the source (or build) tree. 
It does make sense to some degree, but it seems very unusual, and 
basically nothing works like that.  I'm also not sure how packaging 
systems preserve file timestamps.  Maybe it's a thing now, but I'd like 
to see a more comprehensive analysis before we commit to this.

Re: preserve timestamps when installing headers

2021-12-06 Thread Michael Paquier

On Mon, Dec 06, 2021 at 01:51:39AM -0500, Tom Lane wrote:
> TBH, I am not convinced that the complained-of case is enough of a
> problem to justify any change in our build rules, even if there
> weren't any semantic issues.  If you are worried about build times,
> you should be using ccache, and IME builds using ccache are not
> terribly impacted by file timestamp changes.

FWIW, I am not on board with changing build semantics or any
assumptions the header installation relies on either, but I could see
a point in switching back to INSTALL_DATA instead of cp to be
consistent with the rest of the build, iff the argument made back in
2005 about the performance of this code path does not hold anymore.
If we do that, it would then be possible to feed a custom INSTALL
command to ./configure.
--
Michael


signature.asc
Description: PGP signature

Re: [Proposal] Add foreign-server health checks infrastructure

2021-12-06 Thread Shinya Kato


On 2021-11-29 21:36, Zhihong Yu wrote:

On Mon, Nov 29, 2021 at 12:51 AM kuroda.hay...@fujitsu.com
 wrote:


Dear Zhihong,

Thank you for giving comments! I'll post new patches later.


+#define HOLD_CHECKING_REMOTE_SERVERS_INTERRUPTS()

(CheckingRemoteServersHoldoffCount++)


The macro contains only one operation. Can the macro be removed

(with `CheckingRemoteServersHoldoffCount++` inlined) ?

Hmm, these HOLD/RESUME macros are followed HOLD_INTERRUPUST() and
HOLD_CANCEL_INTERRUPTS():

```
#define HOLD_INTERRUPTS()  (InterruptHoldoffCount++)

#define RESUME_INTERRUPTS() \
do { \
Assert(InterruptHoldoffCount > 0); \
InterruptHoldoffCount--; \
} while(0)

#define HOLD_CANCEL_INTERRUPTS()  (QueryCancelHoldoffCount++)

#define RESUME_CANCEL_INTERRUPTS() \
do { \
Assert(QueryCancelHoldoffCount > 0); \
QueryCancelHoldoffCount--; \
} while(0)

#define START_CRIT_SECTION()  (CritSectionCount++)

#define END_CRIT_SECTION() \
do { \
Assert(CritSectionCount > 0); \
CritSectionCount--; \
} while(0)
```

So I want to keep the current style. Could you tell me if you have
any other reasons?


+   if (CheckingRemoteServersTimeoutPending &&

CheckingRemoteServersHoldoffCount != 0)

+   {
+   /*
+* Skip checking foreign servers while reading messages.
+*/
+   InterruptPending = true;
+   }
+   else if (CheckingRemoteServersTimeoutPending)

Would the code be more readable if the above two if blocks be

moved inside one enclosing if block (factoring the common
condition)?


+   if (CheckingRemoteServersTimeoutPending)


+1. Will fix.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


Hi,

It is Okay to keep the macros.

Thanks


Hi, Kuroda-san. Sorry for late reply.

Even for local-only transaction, I thought it useless to execute 
CallCheckingRemoteServersCallbacks() and enable_timeout_after(). Can I 
make it so that it determines outside whether it contains SQL to the 
remote or not?


The following points are minor.
1. In foreign.c, some of the lines are too long and should be broken.
2. In pqcomm.c, the newline have been removed, what is the intention of 
this?

3. In postgres.c,
3-1. how about inserting a comment between lines 2713 and 2714, similar 
to line 2707?

3-2. the line breaks in line 3242 and line 3243 should be aligned.
3-3. you should change
/*
* Skip checking foreign servers while reading messages.
*/
to
/*
 * Skip checking foreign servers while reading messages.
 */
4. In connection.c, There is a typo in line 1684, so "fucntion" should 
be changed to "function".



--
Regards,

--
Shinya Kato
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Re: Optionally automatically disable logical replication subscriptions on error

On Mon, Dec 6, 2021 at 10:07 AM Mark Dilger
 wrote:
>
> > On Dec 1, 2021, at 8:48 PM, Amit Kapila  wrote:
> >
> > The patch disables the subscription for non-transient errors. I am not
> > sure if we can easily make the call to decide whether any particular
> > error is transient or not. For example, DISK_FULL or OUT_OF_MEMORY
> > might not rectify itself. Why not just allow to disable the
> > subscription on any error? And then let the user check the error
> > either in view or logs and decide whether it would like to enable the
> > subscription or do something before it (like making space in disk, or
> > fixing the network).
>
> The original idea of the patch, back when I first wrote and proposed it, was 
> to remove the *absurdity* of retrying a transaction which, in the absence of 
> human intervention, was guaranteed to simply fail again ad infinitum.  
> Retrying in the face of resource errors is not *absurd* even though it might 
> fail again ad infinitum.  The reason is that there is at least a chance that 
> the situation will clear up without human intervention.
>
> > The other problem I see with this transient error stuff is maintaining
> > the list of error codes that we think are transient. I think we need a
> > discussion for each of the error_codes we are listing now and whatever
> > new error_code we add in the future which doesn't seem like a good
> > idea.
>
> A reasonable rule might be:  "the subscription will be disabled if the server 
> can determine that retries cannot possibly succeed without human 
> intervention."  We shouldn't need to categorize all error codes perfectly, as 
> long as we're conservative.  What I propose is similar to how we determine 
> whether to mark a function leakproof; we don't have to mark all leakproof 
> functions as such, we just can't mark one as such if it is not.
>
> If we're going to debate the error codes, I think we would start with an 
> empty list, and add to the list on sufficient analysis.
>

Yeah, an empty list is a sort of what I thought was a good start
point. I feel we should learn from real-world use cases to see if
people really want to continue retrying even after using this option.


-- 
With Regards,
Amit Kapila.

RE: Optionally automatically disable logical replication subscriptions on error

2021-12-06 Thread osumi.takami...@fujitsu.com

On Monday, December 6, 2021 1:16 PM Greg Nancarrow  wrote:
> On Sat, Dec 4, 2021 at 12:20 AM osumi.takami...@fujitsu.com
>  wrote:
> >
> > Hi, I've made a new patch v11 that incorporated suggestions described
> above.
> >
> 
> Some review comments for the v11 patch:
Thank you for your reviews !
 
> doc/src/sgml/ref/create_subscription.sgml
> (1) Possible wording improvement?
> 
> BEFORE:
> +  Specifies whether the subscription should be automatically disabled
> + if replicating data from the publisher triggers errors. The default
> + is false.
> AFTER:
> +  Specifies whether the subscription should be automatically disabled
> + if any errors are detected by subscription workers during data
> + replication from the publisher. The default is false.
Fixed.

> src/backend/replication/logical/worker.c
> (2) WorkerErrorRecovery comments
> Instead of:
> 
> + * As a preparation for disabling the subscription, emit the error,
> + * handle the transaction and clean up the memory context of
> + * error. ErrorContext is reset by FlushErrorState.
> 
> why not just say:
> 
> + Worker error recovery processing, in preparation for disabling the
> + subscription.
> 
> And then comment the function's code lines:
> 
> e.g.
> 
> /* Emit the error */
> ...
> /* Abort any active transaction */
> ...
> /* Reset the ErrorContext */
> ...
Agreed. Fixed.
 
> (3) DisableSubscriptionOnError return
> 
> The "if (!subform->subdisableonerr)" block should probably first:
>heap_freetuple(tup);
> 
> (regardless of the fact the only current caller will proc_exit anyway)
Fixed.
 
> (4) did_error flag
> 
> I think perhaps the previously-used flag name "disable_subscription"
> is better, or maybe "error_recovery_done".
> Also, I think it would look better if it was set AFTER
> WorkerErrorRecovery() was called.
Adopted error_recovery_done
and changed its places accordingly.
 
> (5) DisableSubscriptionOnError LOG message
> 
> This version of the patch removes the LOG message:
> 
> + ereport(LOG,
> + errmsg("logical replication subscription \"%s\" will be disabled due
> to error: %s",
> +MySubscription->name, edata->message));
> 
> Perhaps a similar error message could be logged prior to EmitErrorReport()?
> 
> e.g.
>  "logical replication subscription \"%s\" will be disabled due to an error"
Added.

I've attached the new version v12.

Best Regards,
Takamichi Osumi



v12-0001-Optionally-disable-subscriptions-on-error.patch
Description: v12-0001-Optionally-disable-subscriptions-on-error.patch

Re: row filtering for logical replication

2021-12-06 Thread Peter Smith

On Sat, Dec 4, 2021 at 10:13 AM Euler Taveira  wrote:
>
> On Thu, Dec 2, 2021, at 4:18 AM, Peter Smith wrote:
>
> PSA a new v44* patch set.
>
...

> I used the last patch series (v44) posted by Peter Smith [1]. I did a lot of
> improvements in this new version (v45). I merged 0001 (it is basically the 
> main
> patch I wrote) and 0004 (autocomplete). As I explained in [2], I implemented a
> patch (that is incorporated in the v45-0001) to fix this issue. I saw that
> Peter already proposed a slightly different patch (0006). I read this patch 
> and
> concludes that it  would be better to keep the version I have. It fixes a few
> things and also includes more comments.
> [1] 
> https://postgr.es/m/CAHut%2BPtJnnM8MYQDf7xCyFAp13U_0Ya2dv-UQeFD%3DghixFLZiw%40mail.gmail.com
> [2] 
> https://postgr.es/m/ca8d270d-f930-4d15-9f24-60f95b364173%40www.fastmail.com

>> As I explained in [2], I implemented a
patch (that is incorporated in the v45-0001) to fix this issue. I saw that
Peter already proposed a slightly different patch (0006). I read this patch and
concludes that it  would be better to keep the version I have. It fixes a few
things and also includes more comments.

Your ExprState exprstate array code is essentially exactly the same
logic that was int patch v44-0006 isn't it?

The main difference I saw was
1. I pass the cache index (e.g. IDX_PUBACTION_DELETE etc) to the
pgoutput_filter, but
2. You are passing in the ReorderBufferChangeType value.

IMO the ability to directly access the cache array is more efficient.

The function is called for every row operation (e.g. consider x 1
million rows) so I felt the overhead to have unnecessary if/else
should be avoided.
e.g.
--
if (action == REORDER_BUFFER_CHANGE_INSERT)
result = pgoutput_row_filter_exec_expr(entry->exprstate[0], ecxt);
else if (action == REORDER_BUFFER_CHANGE_UPDATE)
result = pgoutput_row_filter_exec_expr(entry->exprstate[1], ecxt);
else if (action == REORDER_BUFFER_CHANGE_DELETE)
result = pgoutput_row_filter_exec_expr(entry->exprstate[2], ecxt);
else
Assert(false);
--

Why not just use a direct index like was in patch v44-0006 in the first place?
e.g.
--
result = pgoutput_row_filter_exec_expr(entry->exprstate[idx_pubaction], ecxt);
--

Conveniently, those ReorderBufferChangeType first 3 enums are the ones
you want so you can still pass them if you want.
REORDER_BUFFER_CHANGE_INSERT,
REORDER_BUFFER_CHANGE_UPDATE,
REORDER_BUFFER_CHANGE_DELETE,

Just use them to directly index into entry->exprstate[action] and so
remove the excessive if/else.

What do you think?

--
Kind Regards,
Peter Smith.
Fujitsu Australia

Re: Non-superuser subscription owners

On Fri, Dec 3, 2021 at 10:37 PM Mark Dilger
 wrote:
>
> > On Dec 2, 2021, at 1:29 AM, Amit Kapila  wrote:
> >
> > If we want to maintain the property that subscriptions can only be
> > owned by superuser for your first version then isn't a simple check
> > like ((!superuser()) for each of the operations is sufficient?
>
> As things stand today, nothing prevents a superuser subscription owner from 
> having superuser revoked.  The patch does nothing to change this.
>

I understand that but won't that get verified when we look up the
information in pg_authid as part of superuser() check?

-- 
With Regards,
Amit Kapila.

Re: row filtering for logical replication

On Mon, Dec 6, 2021 at 6:49 AM Euler Taveira  wrote:
>
> On Fri, Dec 3, 2021, at 8:12 PM, Euler Taveira wrote:
>
> PS> I will update the commit message in the next version. I barely changed the
> documentation to reflect the current behavior. I probably missed some changes
> but I will fix in the next version.
>
> I realized that I forgot to mention a few things about the UPDATE behavior.
> Regardless of 0003, we need to define which tuple will be used to evaluate the
> row filter for UPDATEs. We already discussed it circa [1]. This current 
> version
> chooses *new* tuple. Is it the best choice?
>

Apart from the data inconsistency problems you outlined below, I think
there is a major design problem with that w.r.t toast tuples as
unchanged key values won't be part of *new* tuple.

> Let's check all cases. There are 2 rows on the provider. One row satisfies the
> row filter and the other one doesn't. For each case, I expect the initial rows
> to be there (no modifications). The DDLs are:
>
> CREATE TABLE foo (a integer, b text, PRIMARY KEY(a));
> INSERT INTO foo (a, b) VALUES(10, 'abc'),(30, 'abc');
> CREATE PUBLICATION bar FOR TABLE foo WHERE (a > 20);
>
> The table describes what happen on the subscriber. BEFORE is the current row 
> on
> subscriber. OLD, NEW and OLD & NEW are action/row if we consider different 
> ways
> to evaluate the row filter.
>
> -- case 1: old tuple (10, abc) ; new tuple (10, def)
> UPDATE foo SET b = 'def' WHERE a = 10;
>
> +---++--+--+
> |   BEFORE  |   OLD  |NEW   |OLD & NEW |
> +---++--+--+
> |NA |   NA   |   NA |   NA |
> +---++--+--+
>
> If the old and new tuple don't satisfy the row filter, there is no issue.
>
> -- case 2: old tuple (30, abc) ; new tuple (30, def)
> UPDATE foo SET b = 'def' WHERE a = 30;
>
> +---++--+--+
> |   BEFORE  |   OLD  |NEW   |OLD & NEW |
> +---++--+--+
> | (30, abc) | UPDATE (30, def)   | UPDATE (30, def) | UPDATE (30, def) |
> +---++--+--+
>
> If the old and new tuple satisfy the row filter, there is no issue.
>
> -- case 3: old tuple (30, abc) ; new tuple (10, def)
> UPDATE foo SET a = 10, b = 'def' WHERE a = 30;
>
> +---++--+--+
> |   BEFORE  |   OLD  |NEW   |OLD & NEW |
> +---++--+--+
> | (30, abc) | UPDATE (10, def) * | KEEP (30, abc) * | KEEP (30, abc) * |
> +---++--+--+
>
> If the old tuple satisfies the row filter but the new tuple doesn't,  we have 
> a
> data consistency issue. Since the old tuple satisfies the row filter, the
> initial table synchronization copies this row. However, after the UPDATE the
> new tuple doesn't satisfy the row filter then, from the data consistency
> perspective, that row should be removed on the subscriber.
>

This is the reason we decide to make such cases to transform UPDATE to DELETE.

> The OLD sends the UPDATE because it satisfies the row filter (if it is a
> sharding solution this new row should be moved to another node). The new row
> would likely not be modified by replication again. That's a data inconsistency
> according to the row filter.
>
> The NEW and OLD & NEW don't send the UPDATE because it doesn't satisfy the row
> filter. Keep the old row is undesirable because it doesn't reflect what we 
> have
> on the source. This row on the subscriber would likely not be modified by
> replication again. If someone inserted a new row with a = 30, replication will
> stop because there is already a row with that value.
>

This shouldn't be a problem with the v44 patch version (0003 handles it).

> -- case 4: old tuple (10, abc) ; new tuple (30, def)
> UPDATE foo SET a = 30, b = 'def' WHERE a = 10;
>
> +---++--+--+
> |   BEFORE  |   OLD  |NEW   |OLD & NEW |
> +---++--+--+
> |NA |   NA ! |   NA !   |   NA |
> +---++--+--+
>
> The OLD and OLD & NEW don't send the UPDATE because it doesn't satisfy the row
> filter. The NEW sends the UPDATE because it satisfies the row filter but there
> is no row to modify. The current behavior does nothing. However, it should
> INSERT the new tuple. Subsequent UPDATE or DELETE have no effect. It could be 
> a
> surprise for an application that expects the sa

Re: pg_get_publication_tables() output duplicate relid

On Mon, Dec 6, 2021 at 1:00 PM Amit Langote  wrote:
>
> On Mon, Dec 6, 2021 at 3:55 PM houzj.f...@fujitsu.com
>  wrote:
> > On Thursday, December 2, 2021 9:48 PM Amit Langote 
> >  wrote:
> > > On Thu, Dec 2, 2021 at 2:27 PM Amit Kapila  
> > > wrote:
> > > > On Thu, Dec 2, 2021 at 8:42 AM Amit Langote 
> > > wrote:
> > > > > Reading Alvaro's email at the top again gave me a pause to
> > > > > reconsider what I had said in reply.  It might indeed have been nice
> > > > > if the publication DDL itself had prevented the cases where a
> > > > > partition and its ancestor are added to a publication, though as
> > > > > Hou-san mentioned, partition ATTACHes make that a bit tricky to
> > > > > enforce at all times, though of course not impossible.  Maybe it's
> > > > > okay to just de-duplicate pg_publication_tables output as the patch
> > > > > does, even though it may appear to be a band-aid solution if we one
> > > > > considers Alvaro's point about consistency.
> > > >
> > > > True, I think even if we consider that idea it will be a much bigger
> > > > change, and also as it will be a behavioral change so we might want to
> > > > keep it just for HEAD and some of these fixes need to be backpatched.
> > > > Having said that, if you or someone want to pursue that idea and come
> > > > up with a better solution than what we have currently it is certainly
> > > > welcome.
> > >
> > > Okay, I did write a PoC patch this morning after sending out my earlier 
> > > email.  I
> > > polished it a bit, which is attached.
> >
> > After thinking more on this. I find there might be some usage about adding 
> > both
> > child and parent to the publication.
> >
> > For the future feature publication row filter(and maybe column filter), It
> > could be useful for user to adding child and parent with different filter
> > expression. If pubviaroot=true, user can expect the parent's filter take
> > effect, If pubviaroot=false, they can expect the child's filter take effect.
> >
> > If we disallow adding both child and parent to publication, it could be 
> > harder
> > for these features to implement.
>
> It's possible that one may want such a feature, and yes, outright
> preventing adding both the parent and a partition to a publication
> would perhaps make it harder.  Mind you though that any such feature
> would need to fix the current implementation anyway to make the
> publication-behavior-related catalog entries for child tables, which
> we currently don't, unless children are added explicitly.
>
> That said, we should also be careful in implementing new features that
> offer a finer per-partition granularity, because the fact that we do
> currently offer that for many of the existing old features such as
> constraints, indexes, etc. limits the scalability of our architecture.
> That point may be off-topic for the thread, but something to consider,
> or maybe I need to take a look at the other patch to see if my
> concerns are addressable / have been addressed.
>

Your concern is not very clear to me. Can you be more specific in what
kind of problem you see with such a design for row filtering? You
might want to once look at that thread. You can find a summary of that
patch in the email [1].

[1] - 
https://www.postgresql.org/message-id/49ba49f1-8bdb-40b7-ae9e-f17d88b3afcd%40www.fastmail.com

-- 
With Regards,
Amit Kapila.

Re: Alter all tables in schema owner fix