Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-23 Thread Amit Kapila
On Sat, Mar 23, 2024 at 1:12 PM Bharath Rupireddy
 wrote:
>
> On Sat, Mar 23, 2024 at 11:27 AM Amit Kapila  wrote:
> >
>
> > 2.
> > +# Get last_inactive_time value after slot's creation. Note that the
> > slot is still
> > +# inactive unless it's used by the standby below.
> > +my $last_inactive_time_1 = $primary->safe_psql('postgres',
> > + qq(SELECT last_inactive_time FROM pg_replication_slots WHERE
> > slot_name = '$sb_slot' AND last_inactive_time IS NOT NULL;)
> > +);
> >
> > We should check $last_inactive_time_1 to be a valid value and add a
> > similar check for logical slots.
>
> That's taken care of by the type cast we do, right? Isn't that enough?
>
> is( $primary->safe_psql(
> 'postgres',
> qq[SELECT last_inactive_time >
> '$last_inactive_time'::timestamptz FROM pg_replication_slots WHERE
> slot_name = '$sb_slot' AND last_inactive_time IS NOT NULL;]
> ),
> 't',
> 'last inactive time for an inactive physical slot is updated correctly');
>
> For instance, setting last_inactive_time_1 to an invalid value fails
> with the following error:
>
> error running SQL: 'psql::1: ERROR:  invalid input syntax for
> type timestamp with time zone: "foo"
> LINE 1: SELECT last_inactive_time > 'foo'::timestamptz FROM pg_repli...
>

It would be found at a later point, but it would probably be better to
verify it immediately after the test that fetches the last_inactive_time
value.
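
For example, something like this right where the value is captured could
assert that a valid timestamp was returned (a sketch, with $sb_slot as in
the quoted test; expected to return 't'):

SELECT last_inactive_time IS NOT NULL
FROM pg_replication_slots
WHERE slot_name = '$sb_slot';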

-- 
With Regards,
Amit Kapila.




Proposal for Resumable Vacuum (again ...)

2024-03-23 Thread Jay
Hi,

I am aware of a few previous attempts at and discussions on this topic
(eventually shelved or never materialized):

- https://www.postgresql.org/message-id/45e2a6ae.1080...@oss.ntt.co.jp
-
https://www.postgresql.org/message-id/CA%2BTgmoZgapzekbTqdBrcH8O8Yifi10_nB7uWLB8ajAhGL21M6A%40mail.gmail.com

-
https://www.postgresql.org/message-id/flat/cad21aobqfmvwdk7odh4a4opf-m5gytrjxme5e8cegxvhsjb...@mail.gmail.com


Still, I want to revisit this topic for its obvious benefits.

I do not have any patch or code changes ready. The changes could be tricky
and might need significant effort, possibly some refactoring. Hence, before
starting the effort, I would like to get the proposal reviewed and consensus
built, to avoid duplicated work.

*Why do we need it?*

Since more and more large businesses are onboarding PostgreSQL, it is only
fair that we make essential utilities like vacuum more manageable and
scalable. Data sizes are only going to increase, and maintenance windows
will shrink as businesses operate across time zones, 24x7. Making the
database more manageable with the least overhead is definitely going to be a
pressing need.

To avoid repetition and duplicated effort, I have picked up the snippet
below from a previous email conversation in the community (Ref:
https://www.postgresql.org/message-id/45e2a6ae.1080...@oss.ntt.co.jp)


*A large table can be vacuumed with cost-based vacuum delay enabled, but the
processing may last for hours or even days, which definitely has a negative
effect on system performance. So, for systems that have maintenance windows,
it is preferable to vacuum within those windows. Vacuum tasks can be split
into small subtasks, and these can be scheduled into maintenance-window time
slots. This can reduce the impact of vacuum on system service.*

*But currently vacuum tasks cannot be split: if an interrupt or error
occurs during vacuum processing, vacuum totally forgets what it has done and
terminates. The next vacuum on the same table has to scan from the beginning
of the heap again. This proposal gives vacuum the capability to stop and
resume.*


*External Interface*

This feature is especially useful when tables are quite large, their bloat
is quite high, and vacuum runs are expected to take a long time.

Ref: https://www.postgresql.org/docs/current/sql-vacuum.html
vacuum [ ( option [, ...], [ { for time = hh:mm } | { resume [ for time =
hh:mm ] } ] ) ] [ table_and_columns [, ...] ]

The additional options give the flexibility to run vacuum for a limited
time and stop, or to resume it from the point where it was last stopped,
again optionally for a given time.

When vacuum is invoked with the ‘for time ...’ option, it will periodically
store on disk the intermediate state of the dead tuples accumulated as it
progresses. It will run for the specified time and stop after that duration.

When a vacuum invoked with the ‘for time ...’ option stops automatically
after the specified time, or is interrupted manually, and is next invoked
with the ‘resume’ option, it will check the stored state of the last run,
start as closely as possible to where it left off, and avoid repeating work.

When resumed, it can either run for a specified time again (if the duration
is specified) or run till completion if the duration is not specified.

When vacuum is invoked with the ‘resume for’ option but there was no
earlier incomplete run with the ‘for time’ option, the ‘resume for’ option
will be ignored, with a message in the error log.

When vacuum is invoked without the ‘for time’ or ‘resume for’ options after
preceding incomplete runs with those options, the persisted data from the
previous runs is discarded and deleted. This is important because successive
runs with ‘for time’ or ‘resume for’ assume that the persisted data is
valid, that no run in between has invalidated it, and that the
vacuum-relevant state of the heap pages is unchanged for the given saved
vacuum horizon.

In the rest of this proposal, we will refer to a vacuum invoked with the
‘for time’ or ‘resume for’ options described above as a ‘resumable vacuum’;
a sketch of the intended workflow follows.
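
To make the proposed interface concrete, a typical maintenance-window
workflow could look like this (a sketch of the proposed syntax only; none
of this exists yet):

-- night 1: run for at most two hours, persisting progress to disk
vacuum (for time = 02:00) big_table;

-- night 2: resume from the saved state, again bounded to two hours
vacuum (resume for time = 02:00) big_table;

-- final run: resume and run to completion
vacuum (resume) big_table;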

*Internal Changes (High Level)*

For each table, vacuum progresses in the following steps or phases (taken
from the documentation,
https://www.postgresql.org/docs/current/progress-reporting.html#VACUUM-PHASES):

1. Initializing - VACUUM is preparing to begin scanning the heap. This
phase is expected to be very brief.

2. Scanning heap - VACUUM is currently scanning the heap. It will prune and
defragment each page if required, and possibly perform freezing activity.
The heap_blks_scanned column can be used to monitor the progress of the
scan.

3. Vacuuming Indexes - VACUUM is currently vacuuming the indexes. If a
table has any indexes, this will happen at least once per vacuum, after the
heap has been completely scanned. It 

Re: [HACKERS] make async slave to wait for lsn to be replayed

2024-03-23 Thread Bharath Rupireddy
On Fri, Mar 22, 2024 at 4:28 AM Peter Eisentraut  wrote:
>
> I had written in [0] about my questions related to using this with
> connection poolers.  I don't think this was addressed at all.  I haven't
> seen any discussion about how to make this kind of facility usable in a
> full system.  You have to manually query and send LSNs; that seems
> pretty cumbersome.  Sure, this is part of something that could be
> useful, but how would an actual user with actual application code get to
> use this?
>
> [0]:
> https://www.postgresql.org/message-id/8b5b172f-0ae7-d644-8358-e2851dded43b%40enterprisedb.com

I share your concern and had proposed something upthread [1]. The idea is
that, just as each query takes a snapshot at the beginning of the txn/query
(depending on isolation level), the standby could wait for the primary's
current LSN as of that moment (the time the snapshot is taken). And the
primary keeps sending its current LSN as part of regular WAL to standbys,
so that the standbys don't have to connect to the primary to learn its
current LSN every time. Perhaps even this may not fully guarantee
read-after-write consistency on standbys unless there's a way for the
application to tell the standby which LSN to wait for.
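
For illustration, the flow being discussed would look roughly like this
(pg_wal_replay_wait is a hypothetical name for the facility, not a
committed API):

-- on the primary, after the application's write commits:
SELECT pg_current_wal_insert_lsn();   -- e.g. '0/3000628'

-- on the standby, before the application reads:
CALL pg_wal_replay_wait('0/3000628');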

Thoughts?

[1] 
https://www.postgresql.org/message-id/CALj2ACUfS7LH1PaWmSZ5KwH4BpQxO9izeMw4qC3a1DAwi6nfbQ%40mail.gmail.com

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-23 Thread Bharath Rupireddy
On Sat, Mar 23, 2024 at 2:34 PM Bertrand Drouvot
 wrote:
>
> > > How about adding the test in 019_replslot_limit? It is not a direct
> > > fit but I feel later we can even add 'invalid_timeout' related tests
> > > in this file which will use last_inactive_time feature.
> >
> > I'm thinking the other way. Now, the new TAP file 043_replslot_misc.pl
> > can have last_inactive_time tests, and later invalid_timeout ones too.
> > This way 019_replslot_limit.pl is not cluttered.
>
> I share the same opinion as Amit: I think 019_replslot_limit would be a better
> place, because I see the timeout as another kind of limit.

Hm. Done that way.

Please see the attached v17 patch.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From a1210ae2dd86afdfdfea9b95861ffed9c7ff2d3a Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Sat, 23 Mar 2024 23:03:55 +
Subject: [PATCH v17] Track last_inactive_time in pg_replication_slots.

Until now, the time at which a replication slot became inactive
was not tracked directly in pg_replication_slots. This commit adds
a new column 'last_inactive_time' for this. It is set to 0 whenever
a slot is made active/acquired, and set to the current timestamp
whenever the slot becomes inactive/released or is restored from disk.

The new column will be useful on production servers to debug and
analyze inactive replication slots. It will also help to know the
lifetime of a replication slot - one can know how long a streaming
standby, logical subscriber, or replication slot consumer is down.

The new column will be useful to implement inactive timeout based
replication slot invalidation in a future commit.

Author: Bharath Rupireddy
Reviewed-by: Bertrand Drouvot, Amit Kapila
Discussion: https://www.postgresql.org/message-id/CALj2ACW4aUe-_uFQOjdWCEN-xXoLGhmvRFnL8SNw_TZ5nJe+aw@mail.gmail.com
---
 doc/src/sgml/system-views.sgml|  11 ++
 src/backend/catalog/system_views.sql  |   1 +
 src/backend/replication/slot.c|  27 +
 src/backend/replication/slotfuncs.c   |   7 +-
 src/include/catalog/pg_proc.dat   |   6 +-
 src/include/replication/slot.h|   3 +
 src/test/recovery/t/019_replslot_limit.pl | 122 ++
 src/test/regress/expected/rules.out   |   3 +-
 8 files changed, 175 insertions(+), 5 deletions(-)

diff --git a/doc/src/sgml/system-views.sgml b/doc/src/sgml/system-views.sgml
index b5da476c20..2b36b5fef1 100644
--- a/doc/src/sgml/system-views.sgml
+++ b/doc/src/sgml/system-views.sgml
@@ -2523,6 +2523,17 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx
   </para></entry>
  </row>

+ <row>
+  <entry role="catalog_table_entry"><para role="column_definition">
+   <structfield>last_inactive_time</structfield> <type>timestamptz</type>
+  </para></entry>
+  <entry>
+    The time at which the slot became inactive.
+    <literal>NULL</literal> if the slot is currently actively being
+    used.
+  </entry>
+ </row>
+
  <row>
   <entry role="catalog_table_entry"><para role="column_definition">
    <structfield>conflicting</structfield> <type>bool</type>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index f69b7f5580..bc70ff193e 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1023,6 +1023,7 @@ CREATE VIEW pg_replication_slots AS
 L.wal_status,
 L.safe_wal_size,
 L.two_phase,
+L.last_inactive_time,
 L.conflicting,
 L.invalidation_reason,
 L.failover,
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index cdf0c450c5..0f48d6dc7c 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -409,6 +409,7 @@ ReplicationSlotCreate(const char *name, bool db_specific,
 	slot->candidate_restart_valid = InvalidXLogRecPtr;
 	slot->candidate_restart_lsn = InvalidXLogRecPtr;
 	slot->last_saved_confirmed_flush = InvalidXLogRecPtr;
+	slot->last_inactive_time = 0;
 
 	/*
 	 * Create the slot on disk.  We haven't actually marked the slot allocated
@@ -622,6 +623,11 @@ retry:
 	if (SlotIsLogical(s))
 		pgstat_acquire_replslot(s);
 
+	/* The slot is active by now, so reset the last inactive time */
+	SpinLockAcquire(&s->mutex);
+	s->last_inactive_time = 0;
+	SpinLockRelease(&s->mutex);
+
 	if (am_walsender)
 	{
 		ereport(log_replication_commands ? LOG : DEBUG1,
@@ -645,6 +651,7 @@ ReplicationSlotRelease(void)
 	ReplicationSlot *slot = MyReplicationSlot;
 	char	   *slotname = NULL;	/* keep compiler quiet */
 	bool		is_logical = false; /* keep compiler quiet */
+	TimestampTz now;
 
 	Assert(slot != NULL && slot->active_pid != 0);
 
@@ -679,6 +686,12 @@ ReplicationSlotRelease(void)
 		ReplicationSlotsComputeRequiredXmin(false);
 	}
 
+	/*
+	 * Set the last inactive time after marking slot inactive. We get current
+	 * time beforehand to avoid system call while holding the lock.
+	 */
+	now = GetCurrentTimestamp();
+
 	if (slot->data.persistency == RS_PERSISTENT)
 	{
 		/*
@@ -687,9 +700,16 @@ ReplicationSlotRelease(void)
 		 */
		SpinLockAcquire(&slot->mutex);
 		slot->active_pid = 0;
+		

Re: POC: Lock updated tuples in tuple_update() and tuple_delete()

2024-03-23 Thread Alexander Korotkov
On Tue, Mar 19, 2024 at 5:20 PM Alexander Korotkov  wrote:
> On Tue, Nov 28, 2023 at 11:00 AM Pavel Borisov  wrote:
> > > You're designing new APIs, days before the feature freeze.
> > On Wed, 5 Apr 2023 at 06:54, Michael Paquier  wrote:
> > >
> > > On Tue, Apr 04, 2023 at 01:25:46AM +0300, Alexander Korotkov wrote:
> > > > Pavel, thank you for your review, revisions and rebase.
> > > > We'll reconsider this once v17 is branched.
> >
> > I've looked through patches v16 once more and think they're good
> > enough, and previous issues are all addressed. I see that there is
> > nothing that blocks it from being committed except the last iteration
> > was days before v16 feature freeze.
> >
> > Recently in another thread [1] Alexander posted a new version of
> > patches v16 (as 0001 and 0002). In 0001 only indentation, comments, and
> > commit messages changed from v16 in this thread. In 0002 new test
> > eval-plan-qual-2 was integrated into the existing eval-plan-qual test.
> > For maintaining the most recent versions in this thread I'm attaching
> > them under v17. I suppose that we can commit these patches to v17 if
> > there are no objections or additional reviews.
> >
> > [1] 
> > https://www.postgresql.org/message-id/flat/CAPpHfdurb9ycV8udYqM%3Do0sPS66PJ4RCBM1g-bBpvzUfogY0EA%40mail.gmail.com
>
> The new revision of patches is attached.
>
> It has updated commit messages, new comments, and some variables were
> renamed to be more consistent with surroundings.
>
> I also think that all the design issues spoken before are resolved.
> It would be nice to hear from Andres about this.
>
> I'll continue rechecking these patches myself.

I've re-read this thread.  It still seems to me that the issues raised
before are addressed now.  Fingers crossed, I'm going to push this if
there are no objections.

--
Regards,
Alexander Korotkov




Re: pg_dump versus enum types, round N+1

2024-03-23 Thread Andrew Dunstan
On Sat, Mar 23, 2024 at 3:00 PM Tom Lane  wrote:

> I have a patch in the queue [1] that among other things tries to
> reduce the number of XIDs consumed during pg_upgrade by making
> pg_restore group its commands into batches of a thousand or so
> per transaction.  This had been passing tests, so I was quite
> surprised when the cfbot started to show it as falling over.
> Investigation showed that it is now failing because 6185c9737
> added these objects to the regression tests and didn't drop them:
>
> CREATE TYPE rainbow AS ENUM ('red', 'orange', 'yellow', 'green', 'blue',
> 'purple');
> CREATE DOMAIN rgb AS rainbow CHECK (VALUE IN ('red', 'green', 'blue'));
>
> In binary-upgrade mode, pg_dump dumps the enum type like this:
>
> CREATE TYPE public.rainbow AS ENUM (
> );
> ALTER TYPE public.rainbow ADD VALUE 'red';
> ALTER TYPE public.rainbow ADD VALUE 'orange';
> ALTER TYPE public.rainbow ADD VALUE 'yellow';
> ALTER TYPE public.rainbow ADD VALUE 'green';
> ALTER TYPE public.rainbow ADD VALUE 'blue';
> ALTER TYPE public.rainbow ADD VALUE 'purple';
>
> and then, if we're within the same transaction, creation of the domain
> falls over with
>
> ERROR:  unsafe use of new value "red" of enum type rainbow
> HINT:  New enum values must be committed before they can be used.
>
> So I'm glad we found that sooner not later, but something needs
> to be done about it if [1] is to get committed.  It doesn't seem
> particularly hard to fix though: we just have to track the enum
> type OIDs made in the current transaction, using largely the same
> approach as is already used in pg_enum.c to track enum value OIDs.
> enum.sql contains a comment opining that this is too expensive,
> but I don't see why it is as long as we don't bother to track
> commands that are nested within subtransactions.  That would be a bit
> complicated to do correctly, but pg_dump/pg_restore doesn't need it.
>
> Hence, I propose the attached.
>
>
>


Makes sense. Nice, clear comments.

cheers

andrew


Re: [DOCS] HOT - correct claim about indexes not referencing old line pointers

2024-03-23 Thread James Coleman
On Thu, Mar 21, 2024 at 1:09 PM Robert Haas  wrote:
>
> On Thu, Mar 14, 2024 at 9:07 PM James Coleman  wrote:
> > If the goal here is the most minimal patch possible, then please
> > commit what you proposed. I am interested in improving the document
> > further, but I don't know how to do that easily if the requirement is
> > effectively "must only change one specific detail at a time".
>
> So, yesterday I wrote a long email on how I saw the goals here.
> Despite our disagreements, I believe we agree that the text I proposed
> is better than what's there, so I've committed that change now. I've
> also marked the CF entry as committed. Please propose the other
> changes you want separately.

Thanks for committing the fix.

Regards,
James Coleman




Re: [DOCS] HOT - correct claim about indexes not referencing old line pointers

2024-03-23 Thread James Coleman
On Wed, Mar 20, 2024 at 2:15 PM Robert Haas  wrote:
>
> On Thu, Mar 14, 2024 at 9:07 PM James Coleman  wrote:
> > Obviously I have reasons for the other changes I made: for example,
> > "no longer visible" improves the correctness, since being an old
> > version isn't sufficient. I removed the "In summary" sentence because
> > it simply doesn't follow from the prose before it. That sentence
> > simply restates information already appearing earlier in almost as
> > simple a form, so it's redundant. But more importantly it's just not
> > actually a summary of the text before it, so removing it improves the
> > documentation.
> >
> > I can explain my reasoning further if desired, but I fear it would
> > simply frustrate you further, so I'll stop here.
> >
> > If the goal here is the most minimal patch possible, then please
> > commit what you proposed. I am interested in improving the document
> > further, but I don't know how to do that easily if the requirement is
> > effectively "must only change one specific detail at a time". So, that
> > leaves me feeling a bit frustrated also.
>
> I don't think the goal is to have the most minimal patch possible, but
> at the same time, I'm responsible for what I commit. If I commit a
> patch that changes something and somebody, especially another
> committer, writes an email and says "hey, why the heck did you change
> this, you dummy?" then I need to be able to answer that question, and
> saying "uh, well, James Coleman had it in his patch and he didn't
> explain why it was there and I didn't know either but I just kind of
> committed it anyway" is not where I want to be. Almost without
> exception, it's not the patch author who gets asked why they included
> a certain thing in the patch; it's the committer who gets asked why it
> got committed that way -- and at least as I see it, if there's not a
> good answer to that question, it's the committer who gets judged, not
> the patch author. In some cases, such questions and judgements arrive
> without warning YEARS after the commit.
>
> So, to protect myself against questions from other hackers that end,
> at least implicitly, in "you dummy," I try to make sure there's an
> adequate paper trail for everything I commit. Between the commit
> message, the comments, and the email discussion, it needs to be
> crystal clear why something was thought to be a good idea at the time.
> Hopefully, it will be clear enough that I never even get a question,
> because the reader will be able to understand ON THEIR OWN why
> something was done and either go "oh, that makes sense" or "well, I get
> why they did that, but I disagree because $REASON" ... but if that
> doesn't quite work out, then I hope that at the very least the paper
> trail will be good enough that I can reconstruct the reasoning if
> needed. For a recent example of a case where I clearly failed do a
> good enough job to keep someone from asking a "you dummy" question,
> see the http://postgr.es/m/6030bdec-0de1-4da2-b0b3-335007eae...@iki.fi
> and in particular the paragraph that starts with "However".
>
> Therefore, if I realize when reading the patch that it contains
> odd-looking changes which I can't relate to the stated goal, it's
> getting bounced, pretty much 100% of the time. If it's a small patch
> and I really care about it and it's a new submitter who doesn't know
> any better, I might choose to remove those changes myself and commit
> the rest; and if I like those changes for other reasons, I might
> choose to commit them separately. In any other situation, I'm just
> going to tell the submitter that they need to take out the unrelated
> changes and move on to the next email thread.
>
> In this particular situation, I see that this whole discussion is
> slightly moot by virtue of 7a9328e8e40534fb4de41b4ac152e3c6989d84e7
> (Jeff Davis, 2024-03-05) which removed the "In summary, heap-only
> tuple updates can only be created if columns used by indexes are not
> updated" sentence that you also removed. But there is an important
> difference between that patch and yours, at least IMHO. You argue that
> the sentence in question didn't flow from what precedes it, but I
> don't agree with that as a rationale; I think that sentence flows
> perfectly well from what precedes it and is a generally adequate
> summary of the point of the section. That we disagree on the merits of
> that sentence is fine; there's no hope that we're all going to agree
> on everything. What makes me frustrated is that you didn't provide
> that rationale, which greatly complicates the whole discussion,
> because now I have to spend time getting you to either take it out or
> provide your justification before we can even reach the point of
> arguing about the substance.
>
> At least in my judgment, the patch Jeff committed doesn't have that
> problem, because the rationale that he gives in the commit message
> (partly by reference to another commit) is that heap-only tuples now
> can, in 

pg_dump versus enum types, round N+1

2024-03-23 Thread Tom Lane
I have a patch in the queue [1] that among other things tries to
reduce the number of XIDs consumed during pg_upgrade by making
pg_restore group its commands into batches of a thousand or so
per transaction.  This had been passing tests, so I was quite
surprised when the cfbot started to show it as falling over.
Investigation showed that it is now failing because 6185c9737
added these objects to the regression tests and didn't drop them:

CREATE TYPE rainbow AS ENUM ('red', 'orange', 'yellow', 'green', 'blue', 
'purple');
CREATE DOMAIN rgb AS rainbow CHECK (VALUE IN ('red', 'green', 'blue'));

In binary-upgrade mode, pg_dump dumps the enum type like this:

CREATE TYPE public.rainbow AS ENUM (
);
ALTER TYPE public.rainbow ADD VALUE 'red';
ALTER TYPE public.rainbow ADD VALUE 'orange';
ALTER TYPE public.rainbow ADD VALUE 'yellow';
ALTER TYPE public.rainbow ADD VALUE 'green';
ALTER TYPE public.rainbow ADD VALUE 'blue';
ALTER TYPE public.rainbow ADD VALUE 'purple';

and then, if we're within the same transaction, creation of the domain
falls over with

ERROR:  unsafe use of new value "red" of enum type rainbow
HINT:  New enum values must be committed before they can be used.
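
For reference, the failure can be reproduced in isolation with a single
transaction, which is effectively what batching the restore commands does:

BEGIN;
CREATE TYPE rainbow AS ENUM ();
ALTER TYPE rainbow ADD VALUE 'red';
CREATE DOMAIN rgb AS rainbow CHECK (VALUE IN ('red'));  -- fails as above
ROLLBACK;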

So I'm glad we found that sooner not later, but something needs
to be done about it if [1] is to get committed.  It doesn't seem
particularly hard to fix though: we just have to track the enum
type OIDs made in the current transaction, using largely the same
approach as is already used in pg_enum.c to track enum value OIDs.
enum.sql contains a comment opining that this is too expensive,
but I don't see why it is as long as we don't bother to track
commands that are nested within subtransactions.  That would be a bit
complicated to do correctly, but pg_dump/pg_restore doesn't need it.

Hence, I propose the attached.

regards, tom lane

[1] https://commitfest.postgresql.org/47/4713/

diff --git a/src/backend/catalog/pg_enum.c b/src/backend/catalog/pg_enum.c
index d2fe5ca2c8..7768bc97c3 100644
--- a/src/backend/catalog/pg_enum.c
+++ b/src/backend/catalog/pg_enum.c
@@ -36,17 +36,35 @@
 Oid			binary_upgrade_next_pg_enum_oid = InvalidOid;
 
 /*
- * Hash table of enum value OIDs created during the current transaction by
- * AddEnumLabel.  We disallow using these values until the transaction is
+ * We keep two transaction-lifespan hash tables, one containing the OIDs
+ * of enum types made in the current transaction, and one containing the
+ * OIDs of enum values created during the current transaction by
+ * AddEnumLabel (but only if their enum type is not in the first hash).
+ *
+ * We disallow using enum values in the second hash until the transaction is
  * committed; otherwise, they might get into indexes where we can't clean
  * them up, and then if the transaction rolls back we have a broken index.
  * (See comments for check_safe_enum_use() in enum.c.)  Values created by
  * EnumValuesCreate are *not* entered into the table; we assume those are
  * created during CREATE TYPE, so they can't go away unless the enum type
  * itself does.
+ *
+ * The motivation for treating enum values as safe if their type OID is
+ * in the first hash is to allow CREATE TYPE AS ENUM; ALTER TYPE ADD VALUE;
+ * followed by a use of the value in the same transaction.  This pattern
+ * is really just as safe as creating the value during CREATE TYPE.
+ * We need to support this because pg_dump in binary upgrade mode produces
+ * commands like that.  But currently we only support it when the commands
+ * are at the outermost transaction level, which is as much as we need for
+ * pg_dump.  We could track subtransaction nesting of the commands to
+ * analyze things more precisely, but for now we don't bother.
  */
-static HTAB *uncommitted_enums = NULL;
+static HTAB *uncommitted_enum_types = NULL;
+static HTAB *uncommitted_enum_values = NULL;
 
+static void init_uncommitted_enum_types(void);
+static void init_uncommitted_enum_values(void);
+static bool EnumTypeUncommitted(Oid typ_id);
 static void RenumberEnumType(Relation pg_enum, HeapTuple *existing, int nelems);
 static int	sort_order_cmp(const void *p1, const void *p2);
 
@@ -56,6 +74,11 @@ static int	sort_order_cmp(const void *p1, const void *p2);
  *		Create an entry in pg_enum for each of the supplied enum values.
  *
  * vals is a list of String values.
+ *
+ * We assume that this is called only by CREATE TYPE AS ENUM, and that it
+ * will be called even if the vals list is empty.  So we can enter the
+ * enum type's OID into uncommitted_enum_types here, rather than needing
+ * another entry point to do it.
  */
 void
 EnumValuesCreate(Oid enumTypeOid, List *vals)
@@ -70,6 +93,18 @@ EnumValuesCreate(Oid enumTypeOid, List *vals)
 	CatalogIndexState indstate;
 	TupleTableSlot **slot;
 
+	/*
+	 * Remember the type OID as being made in the current transaction, but not
+	 * if we're in a subtransaction.
+	 */
+	if (GetCurrentTransactionNestLevel() == 1)
+	{
+		if (uncommitted_enum_types == 

Re: pg_stat_statements and "IN" conditions

2024-03-23 Thread Dmitry Dolgov
> On Sat, Mar 23, 2024 at 04:13:44PM +0900, Yasuo Honda wrote:
> Hi, I'm interested in this feature. It looks like these patches have
> some conflicts.
>
> http://cfbot.cputube.org/patch_47_2837.log
>
> Would you rebase these patches?

Sure, I can rebase, give me a moment. If you don't want to wait, there
is a base commit in the patch, against which it should be applied
without issues, 0eb23285a2.




Re: pg_upgrade --copy-file-range

2024-03-23 Thread Tomas Vondra


On 3/23/24 14:47, Tomas Vondra wrote:
> On 3/23/24 13:38, Robert Haas wrote:
>> On Fri, Mar 22, 2024 at 8:26 PM Thomas Munro  wrote:
>>> Hmm, this discussion seems to assume that we only use
>>> copy_file_range() to copy/clone whole segment files, right?  That's
>>> great and may even get most of the available benefit given typical
>>> databases with many segments of old data that never changes, but... I
>>> think copy_write_range() allows us to go further than the other
>>> whole-file clone techniques: we can stitch together parts of an old
>>> backup segment file and an incremental backup to create a new file.
>>> If you're interested in minimising disk use while also removing
>>> dependencies on the preceding chain of backups, then it might make
>>> sense to do that even if you *also* have to read the data to compute
>>> the checksums, I think?  That's why I mentioned it: if
>>> copy_file_range() (ie sub-file-level block sharing) is a solution in
>>> search of a problem, has the world ever seen a better problem than
>>> pg_combinebackup?
>>
>> That makes sense; it's just a different part of the code than I
>> thought we were talking about.
>>
> 
> Yeah, that's in write_reconstructed_file() and the patch does not touch
> that at all. I agree it would be nice to use copy_file_range() in this
> part too, and it doesn't seem it'd be that hard to do, I think.
> 
> It seems we'd just need a "fork" that either calls pread/pwrite or
> copy_file_range, depending on checksums and what was requested.
> 

Here's a patch to use copy_file_range in write_reconstructed_file too, when
requested/possible. One thing that I'm not sure about is whether to do
pg_fatal() if --copy-file-range is requested but the platform does not
support it. That is more like what pg_upgrade does, but maybe we should just
ignore what the user requested and fall back to the regular copy (a bit like
when having to compute a checksum for some files). Or maybe the check should
just happen earlier ...

I've been thinking about what Thomas wrote - that maybe it'd be good to
do copy_file_range() even when calculating the checksum, and I think he
may be right. But the current patch does not do that, and while it
doesn't seem very difficult to do (at least when reconstructing the file
from incremental backups) I don't have a very good intuition whether
it'd be a win or not in typical cases.

I have a naive question about the checksumming - if we used a
Merkle-tree-like scheme, i.e. hashing the blocks and then hashing the block
hashes, wouldn't that allow calculating the file hashes even without having
to read the blocks, making copy_file_range more efficient? Sure, it's more
complex, but it's a well-known scheme. (OK, now I realize it'd mean we can't
use tools like sha224sum to hash the files and compare to the manifest. I
guess that's why we don't do this ...)


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From d2b717d14638632c25d0e6919f5cd40333e9ebd0 Mon Sep 17 00:00:00 2001
From: Tomas Vondra 
Date: Sat, 23 Mar 2024 18:26:21 +0100
Subject: [PATCH v20240323-2 2/2] write_reconstructed_file

---
 src/bin/pg_combinebackup/reconstruct.c | 32 +++---
 1 file changed, 29 insertions(+), 3 deletions(-)

diff --git a/src/bin/pg_combinebackup/reconstruct.c b/src/bin/pg_combinebackup/reconstruct.c
index f5c7af8a23c..4de92894bed 100644
--- a/src/bin/pg_combinebackup/reconstruct.c
+++ b/src/bin/pg_combinebackup/reconstruct.c
@@ -59,7 +59,8 @@ static void write_reconstructed_file(char *input_filename,
 	 off_t *offsetmap,
 	 pg_checksum_context *checksum_ctx,
 	 bool debug,
-	 bool dry_run);
+	 bool dry_run,
+	 CopyMode copy_mode);
 static void read_bytes(rfile *rf, void *buffer, unsigned length);
 
 /*
@@ -325,7 +326,8 @@ reconstruct_from_incremental_file(char *input_filename,
 	{
 		write_reconstructed_file(input_filename, output_filename,
  block_length, sourcemap, offsetmap,
- &checksum_ctx, debug, dry_run);
+ &checksum_ctx, debug, dry_run,
+ copy_method);
 		debug_reconstruction(n_prior_backups + 1, source, dry_run);
 	}
 
@@ -528,7 +530,8 @@ write_reconstructed_file(char *input_filename,
 		 off_t *offsetmap,
 		 pg_checksum_context *checksum_ctx,
 		 bool debug,
-		 bool dry_run)
+		 bool dry_run,
+		 CopyMode copy_mode)
 {
 	int			wfd = -1;
 	unsigned	i;
@@ -630,6 +633,29 @@ write_reconstructed_file(char *input_filename,
 		if (dry_run)
 			continue;
 
+		/*
+		 * If requested, copy the block using copy_file_range.
+		 *
+		 * We can't do this if the block needs to be zero-filled or when we
+		 * need to update the checksum.
+		 */
+		if ((copy_mode == COPY_MODE_COPY_FILE_RANGE) &&
+			(s != NULL) && (checksum_ctx->type == CHECKSUM_TYPE_NONE))
+		{
+#if defined(HAVE_COPY_FILE_RANGE)
+			do
+			{
+wb = copy_file_range(s->fd, &offsetmap[i], wfd, NULL, BLCKSZ, 0);
+if (wb < 0)
+	pg_fatal("error while copying file range from \"%s\" to \"%s\": 

altering a column's collation leaves an invalid foreign key

2024-03-23 Thread Paul Jungwirth

Dear hackers,

I was looking at how foreign keys deal with collations, and I came across this comment about not 
re-checking a foreign key if the column type changes in a compatible way:


  * Since we require that all collations share the same notion of
  * equality (which they do, because texteq reduces to bitwise
  * equality), we don't compare collation here.

But now that we have nondeterministic collations, isn't that out of date?

For instance I could make this foreign key:

paul=# create collation itext (provider = 'icu', locale = 'und-u-ks-level1', 
deterministic = false);
CREATE COLLATION
paul=# create table t1 (id text collate itext primary key);
CREATE TABLE
paul=# create table t2 (id text, parent_id text references t1);
CREATE TABLE

And then:

paul=# insert into t1 values ('a');
INSERT 0 1
paul=# insert into t2 values ('.', 'A');
INSERT 0 1

So far that behavior seems correct, because the user told us 'a' and 'A' were 
equivalent,
but now I can change the collation on the referenced table and the FK doesn't 
complain:

paul=# alter table t1 alter column id type text collate "C";
ALTER TABLE

The constraint claims to be valid, but I can't drop & add it:

paul=# alter table t2 drop constraint t2_parent_id_fkey;
ALTER TABLE
paul=# alter table t2 add constraint t2_parent_id_fkey foreign key (parent_id) 
references t1;
ERROR:  insert or update on table "t2" violates foreign key constraint 
"t2_parent_id_fkey"
DETAIL:  Key (parent_id)=(A) is not present in table "t1".

Isn't that a problem?
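
For what it's worth, the dangling reference is easy to see directly with
the tables above (forcing the comparison to the new collation):

paul=# select t2.parent_id from t2
         left join t1 on t2.parent_id = t1.id collate "C"
         where t1.id is null;
 parent_id
-----------
 A
(1 row)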

Perhaps if the previous collation was nondeterministic we should force a 
re-check.

(Tested on 17devel 697f8d266c and also 16.)

Yours,

--
Paul  ~{:-)
p...@illuminatedcomputing.com




Re: DRAFT: Pass sk_attno to consistent function

2024-03-23 Thread Michał Kłeczek


> On 22 Mar 2024, at 10:11, Michał Kłeczek  wrote:
> 
>> 
>> On 22 Mar 2024, at 01:29, Michał Kłeczek  wrote:
>> 
>> 
>> 
>>> On 21 Mar 2024, at 23:42, Matthias van de Meent 
>>>  wrote:
>>> 
>>> 
>>> You seem to already be using your own operator class, so you may want
>>> to look into CREATE FUNCTION's support_function parameter; which would
>>> handle SupportRequestIndexCondition and/or SupportRequestSimplify.
>>> I suspect a support function that emits multiple clauses that each
>>> apply to only a single partition at a time should get you quite far if
>>> combined with per-partition constraints that filter all but one of
>>> those. Also, at plan-time you can get away with much more than at
>>> runtime.
> 
> Thinking about it some more - the suggestion goes backwards - i.e. there must 
> have been some deep misunderstanding: 
> 
> If I were able to write my query such that the planner generates an optimal 
> plan, I would not have implemented the custom operator in the first place.

To make it more concrete:

create extension pg_trgm;
create table t ( pk text not null primary key, data text not null ) partition 
by hash (pk);
create table t_2_0 partition of t for values with ( modulus 2, remainder 0 );
create table t_2_1 partition of t for values with ( modulus 2, remainder 1 );
create index ti on t using gist ( pk, data gist_trgm_ops);

explain select * from t where pk = any (array['1', '2', '4', '5']) and data % 
'what' order by data <-> 'what' limit 5;

Limit  (cost=41.32..41.33 rows=2 width=21)
   ->  Sort  (cost=41.32..41.33 rows=2 width=21)
 Sort Key: ((t.data <-> 'what'::text))
 ->  Append  (cost=16.63..41.31 rows=2 width=21)
   ->  Bitmap Heap Scan on t_2_0 t_1  (cost=16.63..20.65 rows=1 
width=21)
 Recheck Cond: ((pk = ANY ('{1,2,4,5}'::text[])) AND (data 
% 'what'::text))
 ->  Bitmap Index Scan on t_2_0_pk_data_idx  
(cost=0.00..16.63 rows=1 width=0)
   Index Cond: ((pk = ANY ('{1,2,4,5}'::text[])) AND 
(data % 'what'::text))
   ->  Bitmap Heap Scan on t_2_1 t_2  (cost=16.63..20.65 rows=1 
width=21)
 Recheck Cond: ((pk = ANY ('{1,2,4,5}'::text[])) AND (data 
% 'what'::text))
 ->  Bitmap Index Scan on t_2_1_pk_data_idx  
(cost=0.00..16.63 rows=1 width=0)
   Index Cond: ((pk = ANY ('{1,2,4,5}'::text[])) AND 
(data % 'what'::text))


That’s bad as the number of records to sort might be huge. So:

set enable_bitmapscan to off;

Limit  (cost=0.30..242.85 rows=2 width=21)
   ->  Merge Append  (cost=0.30..242.85 rows=2 width=21)
 Sort Key: ((t.data <-> 'what'::text))
 ->  Index Scan using t_2_0_pk_data_idx on t_2_0 t_1  
(cost=0.15..121.40 rows=1 width=21)
   Index Cond: (data % 'what'::text)
   Order By: (data <-> 'what'::text)
   Filter: (pk = ANY ('{1,2,4,5}'::text[]))
 ->  Index Scan using t_2_1_pk_data_idx on t_2_1 t_2  
(cost=0.15..121.42 rows=1 width=21)
   Index Cond: (data % 'what'::text)
   Order By: (data <-> 'what'::text)
   Filter: (pk = ANY ('{1,2,4,5}'::text[]))

That’s bad as well since pk = ANY([…]) is not in Index Cond.

Lets use ||= operator then:

drop index ti;
create index ti on t using gist ( pk gist_extra_text_ops, data gist_trgm_ops);

explain select * from t where pk = any (array['1', '2', '4', '5']) and pk ||= 
array['1', '2', '4', '5'] and data % 'what' order by data <-> 'what' limit 5;

Limit  (cost=0.30..153.70 rows=2 width=21)
   ->  Merge Append  (cost=0.30..153.70 rows=2 width=21)
 Sort Key: ((t.data <-> 'what'::text))
 ->  Index Scan using t_2_0_pk_data_idx on t_2_0 t_1  (cost=0.15..76.84 
rows=1 width=21)
   Index Cond: ((pk ||= '{1,2,4,5}'::text[]) AND (data % 
'what'::text))
   Order By: (data <-> 'what'::text)
   Filter: (pk = ANY ('{1,2,4,5}'::text[]))
 ->  Index Scan using t_2_1_pk_data_idx on t_2_1 t_2  (cost=0.15..76.84 
rows=1 width=21)
   Index Cond: ((pk ||= '{1,2,4,5}'::text[]) AND (data % 
'what'::text))
   Order By: (data <-> 'what'::text)
   Filter: (pk = ANY ('{1,2,4,5}'::text[]))


That’s much better. There is still a Filter on the SAOP, but it is almost
harmless: it is always true and does not require heap access.


A few observations:
1. I am not sure why a SAOP can be an Index Cond in the case of a Bitmap
Index Scan but not in the case of an Index Scan - I don't know what the
interplay between the planner and amsearcharray is.
2. In all cases, the array passed to the (Bitmap) Index Scan is NOT filtered
by partition pruning.

The second point means useless index page reads (amplified by the number of
elements in the input array and the width of the index).

It is the _second_ point I would like to address with this patch - for us it
makes a big difference, as it means almost an order of magnitude (around
5-10 times based on preliminary tests) 

Re: [PATCH] plpython function causes server panic

2024-03-23 Thread Tom Lane
Robert Haas  writes:
> On Fri, Mar 22, 2024 at 4:37 PM Tom Lane  wrote:
>> I think these things are already dealt with.  However, one thing
>> worth questioning is that CommitSubTransaction() will just silently
>> kill any workers started during the current subxact, and likewise
>> CommitTransaction() zaps workers without complaint.  Shouldn't these
>> instead throw an error about how you didn't close parallel mode,
>> and then the corresponding Abort function does the cleanup?
>> I did not change that behavior here, but it seems dubious.

> I'm not sure. I definitely knew when I wrote this code that we often
> emit warnings about resources that aren't cleaned up at (sub)commit
> time rather than just silently releasing them, and I feel like the
> fact that I didn't implement that behavior here was probably a
> deliberate choice to avoid some problem.

Ah, right, it's reasonable to consider this an end-of-xact resource
leak, which we generally handle with WARNING not ERROR.  And I see
that AtEOXact_Parallel and AtEOSubXact_Parallel already do

if (isCommit)
elog(WARNING, "leaked parallel context");

However, the calling logic seems a bit shy of a load, in that it
trusts IsInParallelMode() completely to decide whether to check for
leaked parallel contexts.  So we'd miss the case where somebody did
ExitParallelMode without having cleaned up workers.  It's not like
AtEOXact_Parallel and AtEOSubXact_Parallel cost a lot when they have
nothing to do, so I think we should call them unconditionally, and
separately from that issue a warning if parallelModeLevel isn't zero
(and we're committing).

regards, tom lane

diff --git a/doc/src/sgml/parallel.sgml b/doc/src/sgml/parallel.sgml
index 5acc9537d6..dae9dd7f2f 100644
--- a/doc/src/sgml/parallel.sgml
+++ b/doc/src/sgml/parallel.sgml
@@ -545,10 +545,10 @@ EXPLAIN SELECT * FROM pgbench_accounts WHERE filler LIKE '%x%';
   
 
   
-Functions and aggregates must be marked PARALLEL UNSAFE if
-they write to the database, access sequences, change the transaction state
-even temporarily (e.g., a PL/pgSQL function that establishes an
-EXCEPTION block to catch errors), or make persistent changes to
+Functions and aggregates must be marked PARALLEL UNSAFE
+if they write to the database, change the transaction state (other than by
+using a subtransaction for error recovery), access sequences, or make
+persistent changes to
 settings.  Similarly, functions must be marked PARALLEL
 RESTRICTED if they access temporary tables, client connection state,
 cursors, prepared statements, or miscellaneous backend-local state that
diff --git a/doc/src/sgml/ref/create_function.sgml b/doc/src/sgml/ref/create_function.sgml
index 863d99d1fc..0d240484cd 100644
--- a/doc/src/sgml/ref/create_function.sgml
+++ b/doc/src/sgml/ref/create_function.sgml
@@ -428,22 +428,24 @@ CREATE [ OR REPLACE ] FUNCTION
 PARALLEL
 
 
- PARALLEL UNSAFE indicates that the function
-  can't be executed in parallel mode and the presence of such a
+ 
+  PARALLEL UNSAFE indicates that the function
+  can't be executed in parallel mode; the presence of such a
   function in an SQL statement forces a serial execution plan.  This is
   the default.  PARALLEL RESTRICTED indicates that
-  the function can be executed in parallel mode, but the execution is
-  restricted to parallel group leader.  PARALLEL SAFE
+  the function can be executed in parallel mode, but only in the parallel
+  group leader process.  PARALLEL SAFE
   indicates that the function is safe to run in parallel mode without
-  restriction.
+  restriction, including in parallel worker processes.
  
 
  
   Functions should be labeled parallel unsafe if they modify any database
-  state, or if they make changes to the transaction such as using
-  sub-transactions, or if they access sequences or attempt to make
-  persistent changes to settings (e.g., setval).  They should
-  be labeled as parallel restricted if they access temporary tables,
+  state, change the transaction state (other than by using a
+  subtransaction for error recovery), access sequences (e.g., by
+  calling currval) or make persistent changes to
+  settings.  They should
+  be labeled parallel restricted if they access temporary tables,
   client connection state, cursors, prepared statements, or miscellaneous
   backend-local state which the system cannot synchronize in parallel mode
   (e.g.,  setseed cannot be executed other than by the group
diff --git a/src/backend/access/transam/README.parallel b/src/backend/access/transam/README.parallel
index e486bffc47..9df3da91b0 100644
--- a/src/backend/access/transam/README.parallel
+++ b/src/backend/access/transam/README.parallel
@@ -137,7 +137,7 @@ Transaction Integration
 ===
 
 Regardless of what the 

Re: MIN/MAX functions for a record

2024-03-23 Thread Tom Lane
Aleksander Alekseev  writes:
> One thing I'm not 100% sure of is whether record_larger() should make
> a copy of its arguments or whether the current implementation is safe.

I don't see any copying happening in, say, text_larger or
numeric_larger, so this shouldn't need to either.

Personally I'd write "record_cmp(fcinfo) > 0" rather than indirecting
through record_gt.  The way you have it is not strictly correct anyhow:
you're cheating by not using DirectFunctionCall.

Also, given that you're passing the fcinfo, there's no need
to extract the arguments from it before that call.  So it
seems to me that code like

if (record_cmp(fcinfo) > 0)
PG_RETURN_HEAPTUPLEHEADER(PG_GETARG_HEAPTUPLEHEADER(0));
else
PG_RETURN_HEAPTUPLEHEADER(PG_GETARG_HEAPTUPLEHEADER(1));

should do, and possibly save one useless detoast step.  Or you could
do

if (record_cmp(fcinfo) > 0)
PG_RETURN_DATUM(PG_GETARG_DATUM(0));
else
PG_RETURN_DATUM(PG_GETARG_DATUM(1));

because really there's no point in detoasting at all.

regards, tom lane




Re: pg_upgrade --copy-file-range

2024-03-23 Thread Tomas Vondra
On 3/23/24 13:38, Robert Haas wrote:
> On Fri, Mar 22, 2024 at 8:26 PM Thomas Munro  wrote:
>> Hmm, this discussion seems to assume that we only use
>> copy_file_range() to copy/clone whole segment files, right?  That's
>> great and may even get most of the available benefit given typical
>> databases with many segments of old data that never changes, but... I
>> think copy_write_range() allows us to go further than the other
>> whole-file clone techniques: we can stitch together parts of an old
>> backup segment file and an incremental backup to create a new file.
>> If you're interested in minimising disk use while also removing
>> dependencies on the preceding chain of backups, then it might make
>> sense to do that even if you *also* have to read the data to compute
>> the checksums, I think?  That's why I mentioned it: if
>> copy_file_range() (ie sub-file-level block sharing) is a solution in
>> search of a problem, has the world ever seen a better problem than
>> pg_combinebackup?
> 
> That makes sense; it's just a different part of the code than I
> thought we were talking about.
> 

Yeah, that's in write_reconstructed_file() and the patch does not touch
that at all. I agree it would be nice to use copy_file_range() in this
part too, and it doesn't seem it'd be that hard to do, I think.

It seems we'd just need a "fork" that either calls pread/pwrite or
copy_file_range, depending on checksums and what was requested.

BTW is there a reason why the code calls "write" and not "pg_pwrite"?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: pg_upgrade --copy-file-range

2024-03-23 Thread Tomas Vondra


On 3/22/24 19:40, Robert Haas wrote:
> On Fri, Mar 22, 2024 at 1:22 PM Tomas Vondra
>  wrote:
>> Right, this will happen:
>>
>>   pg_combinebackup: error: unable to use accelerated copy when manifest
>>   checksums are to be calculated. Use --no-manifest
>>
>> Are you saying we should just silently override the copy method and do
>> the copy block by block?
> 
> Yes.
> 
>> I'm not strongly opposed to that, but it feels
>> wrong to just ignore that the user explicitly requested cloning, and I'm
>> not sure why should this be different from any other case when the user
>> requests incompatible combination of options and/or options that are not
>> supported on the current configuration.
> 
> I don't feel like copying block-by-block when that's needed to compute
> a checksum is ignoring what the user requested. I mean, if we'd had to
> perform reconstruction rather than copying an entire file, we would
> have done that regardless of whether --clone had been there, and I
> don't see why the need-to-compute-a-checksum case is any different. I
> think we should view a flag like --clone as specifying how to copy a
> file when we don't need to do anything but copy it. I don't think it
> should dictate that we're not allowed to perform other processing when
> that other processing is required.
> 
> From my point of view, this is not a case of incompatible options
> having been specified. If you specify run pg_basebackup with both
> --format=p and --format=t, those are incompatible options; the backup
> can be done one way or the other, but not both at once. But here
> there's no such conflict. Block-by-block copying and fast-copying can
> happen as part of the same operation, as in the example that I showed,
> where some files need the block-by-block copying and some can be
> fast-copied. The user is entitled to specify which fast-copying method
> they would like to have used for the files where fast-copying is
> possible without getting a failure just because it isn't possible for
> every single file.
> 
> Or to say it the other way around, if there's 1 file that needs to be
> copied block by block and 99 files that can be fast-copied, you want
> to force the user to the block-by-block method for all 100 files. I
> want to force the use of the block-by-block method for the 1 file
> where that's the only valid method, and let them choose what they want
> to do for the other 99.
> 

OK, that makes sense. Here's a patch that should work like this - in
copy_file we check if we need to calculate checksums, and either use the
requested copy method, or fall back to the block-by-block copy.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL CompanyFrom 558321b7ee10fa3902aaed1d1a08857865a232bb Mon Sep 17 00:00:00 2001
From: Tomas Vondra 
Date: Tue, 19 Mar 2024 15:16:29 +0100
Subject: [PATCH v20240323] pg_combinebackup - allow using
 clone/copy_file_range

---
 doc/src/sgml/ref/pg_combinebackup.sgml  |  34 +
 src/bin/pg_combinebackup/copy_file.c| 156 +++-
 src/bin/pg_combinebackup/copy_file.h|  18 ++-
 src/bin/pg_combinebackup/pg_combinebackup.c |  18 ++-
 src/bin/pg_combinebackup/reconstruct.c  |   5 +-
 src/bin/pg_combinebackup/reconstruct.h  |   5 +-
 6 files changed, 192 insertions(+), 44 deletions(-)

diff --git a/doc/src/sgml/ref/pg_combinebackup.sgml b/doc/src/sgml/ref/pg_combinebackup.sgml
index 8a0a600c2b2..60a60e3fae6 100644
--- a/doc/src/sgml/ref/pg_combinebackup.sgml
+++ b/doc/src/sgml/ref/pg_combinebackup.sgml
@@ -185,6 +185,40 @@ PostgreSQL documentation
   
  
 
+ 
+  --clone
+  
+   
+Use efficient file cloning (also known as reflinks on
+some systems) instead of copying files to the new cluster.  This can
+result in near-instantaneous copying of the data files, giving the
+speed advantages of -k/--link while
+leaving the old cluster untouched.
+   
+
+   
+File cloning is only supported on some operating systems and file
+systems.  If it is selected but not supported, the
+pg_combinebackup run will error.  At present,
+it is supported on Linux (kernel 4.5 or later) with Btrfs and XFS (on
+file systems created with reflink support), and on macOS with APFS.
+   
+  
+ 
+
+ 
+  --copy-file-range
+  
+   
+Use the copy_file_range system call for efficient
+copying.  On some file systems this gives results similar to
+--clone, sharing physical disk blocks, while on others
+it may still copy blocks, but do so via an optimized path.  At present,
+it is supported on Linux and FreeBSD.
+   
+  
+ 
+
  
-V
--version
diff --git a/src/bin/pg_combinebackup/copy_file.c b/src/bin/pg_combinebackup/copy_file.c
index e6d2423278a..0874679e53a 100644
--- a/src/bin/pg_combinebackup/copy_file.c
+++ 

Re: [PATCH] plpython function causes server panic

2024-03-23 Thread Robert Haas
On Fri, Mar 22, 2024 at 4:37 PM Tom Lane  wrote:
> Fair enough.  In the attached v2, I wrote "change the transaction
> state (other than by using a subtransaction for error recovery)";
> what do you think of that?

I think that's pretty good. I wonder if there are some bizarre cases
where the patch would allow slightly more than that ... who is to say
that you must pop the subtransaction you pushed? But that sort of
pedantry is probably not worth worrying about for purposes of the
documentation, especially because such a thing might not be a very
good idea anyway.

> I dug around in the docs and couldn't really find anything about
> parallel-query transaction limitations other than this bit in
> parallel.sgml and the more or less copy-pasted text in
> create_function.sgml; did you have any other spots in mind?
> (I did find the commentary in README.parallel, but that's not
> exactly user-facing.)

I don't have anything else in mind at the moment.

> I think these things are already dealt with.  However, one thing
> worth questioning is that CommitSubTransaction() will just silently
> kill any workers started during the current subxact, and likewise
> CommitTransaction() zaps workers without complaint.  Shouldn't these
> instead throw an error about how you didn't close parallel mode,
> and then the corresponding Abort function does the cleanup?
> I did not change that behavior here, but it seems dubious.

I'm not sure. I definitely knew when I wrote this code that we often
emit warnings about resources that aren't cleaned up at (sub)commit
time rather than just silently releasing them, and I feel like the
fact that I didn't implement that behavior here was probably a
deliberate choice to avoid some problem. But I have no memory of what
that problem was, and it is entirely possible that it was eliminated
at some later phase of development. I think that decision was made
quite early, before much of anything was working.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: pg_upgrade --copy-file-range

2024-03-23 Thread Robert Haas
On Fri, Mar 22, 2024 at 8:26 PM Thomas Munro  wrote:
> Hmm, this discussion seems to assume that we only use
> copy_file_range() to copy/clone whole segment files, right?  That's
> great and may even get most of the available benefit given typical
> databases with many segments of old data that never changes, but... I
> think copy_write_range() allows us to go further than the other
> whole-file clone techniques: we can stitch together parts of an old
> backup segment file and an incremental backup to create a new file.
> If you're interested in minimising disk use while also removing
> dependencies on the preceding chain of backups, then it might make
> sense to do that even if you *also* have to read the data to compute
> the checksums, I think?  That's why I mentioned it: if
> copy_file_range() (ie sub-file-level block sharing) is a solution in
> search of a problem, has the world ever seen a better problem than
> pg_combinebackup?

That makes sense; it's just a different part of the code than I
thought we were talking about.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: SET ROLE documentation improvement

2024-03-23 Thread Robert Haas
On Fri, Mar 22, 2024 at 4:51 PM Nathan Bossart  wrote:
> Actually, shouldn't this one be back-patched to v16?  If so, I'd do that
> one separately from the other changes we are discussing.

Sure, that seems fine.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: MIN/MAX functions for a record

2024-03-23 Thread Aleksander Alekseev
Hi,

> Exactly Tom, I see no fundamental problem with implementing it, since the 
> comparison operator is already implemented. In fact, MIN/MAX should work for 
> all types for which a comparison operator is defined.

On second thought, this should work reasonably well.

PFA a WIP patch. At this point it implements only MAX(record), no MIN, no tests:

```
=# SELECT MAX(row(year, month)) FROM (VALUES(2025, 1), (2024,2)) x(year, month);
   max
--
 (2025,1)
```
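
And for symmetry, once MIN(record) is added, the analogous call should
presumably return the row-wise smallest value:

```
=# SELECT MIN(row(year, month)) FROM (VALUES(2025, 1), (2024,2)) x(year, month);
   min
----------
 (2024,2)
```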

One thing I'm not 100% sure of is whether record_larger() should make
a copy of its arguments or whether the current implementation is safe.

-- 
Best regards,
Aleksander Alekseev


v1-0001-Support-MIN-MAX-record-aggregates-WIP.patch
Description: Binary data


Re: Switching XLog source from archive to streaming when primary available

2024-03-23 Thread Bharath Rupireddy
On Mon, Mar 18, 2024 at 11:38 AM Michael Paquier  wrote:
>
> On Sun, Mar 17, 2024 at 11:37:58AM +0530, Bharath Rupireddy wrote:
> > Rebase needed after 071e3ad59d6fd2d6d1277b2bd9579397d10ded28 due to a
> > conflict in meson.build. Please see the attached v23 patch.
>
> I've been reading this patch, and this is a very tricky one.  Please
> be *very* cautious.

Thanks for looking into this.

> +#streaming_replication_retry_interval = 0	# time after which standby
> +					# attempts to switch WAL source from archive to
> +					# streaming replication in seconds; 0 disables
>
> This stuff allows a minimal retry interval of 1s.  Could it be useful
> to have more responsiveness here and allow lower values than that?
> Why not switching the units to be milliseconds?

Nathan had a different view on this, preferring it on the order of seconds
- https://www.postgresql.org/message-id/20240305020452.GA3373526%40nathanxps13.
If set too low, the standby tries to connect to the primary more
frequently. IMO, the order of seconds seems fine.

> +    if (streaming_replication_retry_interval <= 0 ||
> +        !StandbyMode ||
> +        currentSource != XLOG_FROM_ARCHIVE)
> +        return SWITCH_TO_STREAMING_NONE;
>
> Hmm.  Perhaps this should mention why we don't care about the
> consistent point.

Are you asking why we don't care whether the standby has reached a
consistent point when switching to streaming mode because of this new
parameter? If so, the same already applies when a standby switches to
streaming replication (getting WAL from the primary) today, that is,
when receiving from the WAL archive finishes (no more WAL left there)
or fails for any reason. The standby doesn't check for a consistent
point even today; it just trusts the WAL source and makes the switch.

> +    /* See if we can switch WAL source to streaming */
> +    if (wal_source_switch_state == SWITCH_TO_STREAMING_NONE)
> +        wal_source_switch_state = SwitchWALSourceToPrimary();
>
> Rather than a routine that returns the value to use for the
> GUC, I'd suggest letting this routine set the GUC, as there is only one
> caller of SwitchWALSourceToPrimary().  This can also include a check
> on SWITCH_TO_STREAMING_NONE, based on what I'm reading there.

Firstly, wal_source_switch_state is not a GUC; it's a static variable
used across WaitForWALToBecomeAvailable() calls. And if you are
suggesting changing SwitchWALSourceToPrimary() so that it sets
wal_source_switch_state directly, I'd rather not do that, because the
function would then be called unnecessarily when
wal_source_switch_state is not SWITCH_TO_STREAMING_NONE.
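For readers without the patch applied, the decision being discussed has
roughly this shape — a sketch reconstructed from the quoted fragments;
last_switch_time and its bookkeeping are assumptions, not the patch's
exact code:

```c
/* Sketch only: reconstructed from the quoted patch fragments. */
static WALSourceSwitchState
SwitchWALSourceToPrimary(void)
{
	/* Feature disabled, not in standby mode, or not reading from archive. */
	if (streaming_replication_retry_interval <= 0 ||
		!StandbyMode ||
		currentSource != XLOG_FROM_ARCHIVE)
		return SWITCH_TO_STREAMING_NONE;

	/* Switch once the retry interval (a GUC in seconds) has elapsed. */
	if (TimestampDifferenceExceeds(last_switch_time,
								   GetCurrentTimestamp(),
								   streaming_replication_retry_interval * 1000))
		return SWITCH_TO_STREAMING;

	return SWITCH_TO_STREAMING_NONE;
}
```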

> -   if (lastSourceFailed)
> +   if (lastSourceFailed ||
> +   wal_source_switch_state == SWITCH_TO_STREAMING)
>
> Hmm.  This one may be tricky.  I'd recommend a separation between the
> failure in reading from a source and the switch to a new "forced"
> source.

Separation would just add duplicate code. Moreover, the code wrapped
within if (lastSourceFailed) doesn't do any error handling as such; it
just resets a few things from the previous source and sets the next
source.

FWIW, please check [1] (and the discussion thereon) for how the
lastSourceFailed flag is already used to consume all the streamed WAL
in pg_wal directly upon detecting the promotion trigger file.
Therefore, I see no problem with the way it is right now for this new
feature.

[1]
/*
 * Data not here yet. Check for trigger, then wait for
 * walreceiver to wake us up when new WAL arrives.
 */
if (CheckForStandbyTrigger())
{
/*
 * Note that we don't return XLREAD_FAIL immediately
 * here. After being triggered, we still want to
 * replay all the WAL that was already streamed. It's
 * in pg_wal now, so we just treat this as a failure,
 * and the state machine will move on to replay the
 * streamed WAL from pg_wal, and then recheck the
 * trigger and exit replay.
 */
lastSourceFailed = true;

> +    if (wal_source_switch_state == SWITCH_TO_STREAMING_PENDING)
> +        readFrom = XLOG_FROM_PG_WAL;
> +    else
> +        readFrom = currentSource == XLOG_FROM_ARCHIVE ?
> +            XLOG_FROM_ANY : currentSource;
>
> WALSourceSwitchState looks confusing here, and are you sure that this
> is actually correct?  Shouldn't we still try a READ_FROM_ANY or a read
> from the archives even with a streaming pending?

Please see the discussion starting from
https://www.postgresql.org/message-id/20221008215221.GA894639%40nathanxps13.
We wanted to keep the existing behaviour the same when we

Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-23 Thread Bertrand Drouvot
Hi,

On Sat, Mar 23, 2024 at 01:11:50PM +0530, Bharath Rupireddy wrote:
> On Sat, Mar 23, 2024 at 11:27 AM Amit Kapila  wrote:
> >
> > How about adding the test in 019_replslot_limit? It is not a direct
> > fit but I feel later we can even add 'invalid_timeout' related tests
> > in this file which will use last_inactive_time feature.
> 
> I'm thinking the other way. Now, the new TAP file 043_replslot_misc.pl
> can have last_inactive_time tests, and later invalid_timeout ones too.
> This way 019_replslot_limit.pl is not cluttered.

I share the same opinion as Amit: I think 019_replslot_limit would be a better
place, because I see the timeout as another kind of limit.

> 
> > It is also
> > possible that some of the tests added by the 'invalid_timeout' feature
> > will obviate the need for some of these tests.
> 
> Might be. But I prefer to keep both these tests separate, though in the
> same file 043_replslot_misc.pl, because we cover some corner cases where
> last_inactive_time is set upon loading the slot from disk.

Right, but I think that this test does not necessarily have to be in the
same .pl as the one testing the timeout. It could be added to one of the
existing .pl files, 001_stream_rep.pl for example.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-23 Thread Bharath Rupireddy
On Sat, Mar 23, 2024 at 11:27 AM Amit Kapila  wrote:
>
> How about adding the test in 019_replslot_limit? It is not a direct
> fit but I feel later we can even add 'invalid_timeout' related tests
> in this file which will use last_inactive_time feature.

I'm thinking the other way. Now, the new TAP file 043_replslot_misc.pl
can have last_inactive_time tests, and later invalid_timeout ones too.
This way 019_replslot_limit.pl is not cluttered.

> It is also
> possible that some of the tests added by the 'invalid_timeout' feature
> will obviate the need for some of these tests.

Might be. But I prefer to keep both these tests separate, though in the
same file 043_replslot_misc.pl, because we cover some corner cases where
last_inactive_time is set upon loading the slot from disk.

> Review of v15
> ==
> 1.
> @@ -1026,7 +1026,8 @@ CREATE VIEW pg_replication_slots AS
>  L.conflicting,
>  L.invalidation_reason,
>  L.failover,
> -L.synced
> +L.synced,
> +L.last_inactive_time
>  FROM pg_get_replication_slots() AS L
>
> As mentioned previously, let's keep these new fields before
> conflicting and after two_phase.

Sorry, I failed to notice that comment (out of a flood of comments,
really :)). Now done that way.

> 2.
> +# Get last_inactive_time value after slot's creation. Note that the
> slot is still
> +# inactive unless it's used by the standby below.
> +my $last_inactive_time_1 = $primary->safe_psql('postgres',
> + qq(SELECT last_inactive_time FROM pg_replication_slots WHERE
> slot_name = '$sb_slot' AND last_inactive_time IS NOT NULL;)
> +);
>
> We should check $last_inactive_time_1 to be a valid value and add a
> similar check for logical slots.

That's taken care of by the type cast we do, right? Isn't that enough?

is( $primary->safe_psql(
'postgres',
qq[SELECT last_inactive_time >
'$last_inactive_time'::timestamptz FROM pg_replication_slots WHERE
slot_name = '$sb_slot' AND last_inactive_time IS NOT NULL;]
),
't',
'last inactive time for an inactive physical slot is updated correctly');

For instance, setting last_inactive_time_1 to an invalid value fails
with the following error:

error running SQL: 'psql::1: ERROR:  invalid input syntax for
type timestamp with time zone: "foo"
LINE 1: SELECT last_inactive_time > 'foo'::timestamptz FROM pg_repli...

> 3. BTW, why don't we set last_inactive_time for temporary slots
> (RS_TEMPORARY) as well? Don't we even invalidate temporary slots? If
> so, then I think we should set last_inactive_time for those as well
> and later allow them to be invalidated based on timeout parameter.

WFM. Done that way.
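(For context, the mechanism itself is small. Roughly — a sketch based on
the commit message below, not the exact patch hunks:)

```c
/* On release (and on restore from disk): stamp the slot as inactive now. */
SpinLockAcquire(&slot->mutex);
slot->last_inactive_time = GetCurrentTimestamp();
SpinLockRelease(&slot->mutex);

/* On acquire: the slot is in use again, so clear the stamp. */
SpinLockAcquire(&slot->mutex);
slot->last_inactive_time = 0;
SpinLockRelease(&slot->mutex);
```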

Please see the attached v16 patch.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From ce85a48bbbd9de5d6ca0ce849993707cc01d1211 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Sat, 23 Mar 2024 07:27:48 +
Subject: [PATCH v16] Track last_inactive_time in pg_replication_slots.

Until now, the time at which a replication slot became inactive
was not tracked directly in pg_replication_slots. This commit adds
a new column 'last_inactive_time' for this. It is set to 0 whenever
a slot is made active/acquired, and set to the current timestamp
whenever the slot is released/becomes inactive or is restored from
disk.

The new column will be useful on production servers to debug and
analyze inactive replication slots. It will also help to know the
lifetime of a replication slot - one can know how long a streaming
standby, logical subscriber, or replication slot consumer is down.

The new column will be useful to implement inactive timeout based
replication slot invalidation in a future commit.

Author: Bharath Rupireddy
Reviewed-by: Bertrand Drouvot, Amit Kapila
Discussion: https://www.postgresql.org/message-id/CALj2ACW4aUe-_uFQOjdWCEN-xXoLGhmvRFnL8SNw_TZ5nJe+aw@mail.gmail.com
---
 doc/src/sgml/system-views.sgml   |  11 ++
 src/backend/catalog/system_views.sql |   1 +
 src/backend/replication/slot.c   |  27 +
 src/backend/replication/slotfuncs.c  |   7 +-
 src/include/catalog/pg_proc.dat  |   6 +-
 src/include/replication/slot.h   |   3 +
 src/test/recovery/meson.build|   1 +
 src/test/recovery/t/043_replslot_misc.pl | 127 +++
 src/test/regress/expected/rules.out  |   3 +-
 9 files changed, 181 insertions(+), 5 deletions(-)
 create mode 100644 src/test/recovery/t/043_replslot_misc.pl

diff --git a/doc/src/sgml/system-views.sgml b/doc/src/sgml/system-views.sgml
index b5da476c20..2b36b5fef1 100644
--- a/doc/src/sgml/system-views.sgml
+++ b/doc/src/sgml/system-views.sgml
@@ -2523,6 +2523,17 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx
   
  
 
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>last_inactive_time</structfield> <type>timestamptz</type>
+      </para>
+      <para>
+        The time at which the slot became inactive.
+        <literal>NULL</literal> if the slot is currently 

Re: pg_stat_statements and "IN" conditions

2024-03-23 Thread Yasuo Honda
Hi, I'm interested in this feature. It looks like these patches have
some conflicts.

http://cfbot.cputube.org/patch_47_2837.log

Would you rebase these patches?

Thanks,
--
Yasuo Honda

On Sat, Mar 23, 2024 at 4:11 PM Dmitry Dolgov <9erthali...@gmail.com> wrote:

> > Oh, I see, thanks. Give me a moment, will fix this.
>
> Here it is.