Re: Track the amount of time waiting due to cost_delay

2024-06-24 Thread Bertrand Drouvot
Hi,

On Sat, Jun 22, 2024 at 12:48:33PM +0000, Bertrand Drouvot wrote:
> 1. vacuuming indexes time has been longer on master because with v2, the
> leader has been interrupted 342605 times while waiting, thus making v2
> "faster".
> 
> 2. the leader being interrupted while waiting is also already happening on
> master due to the pgstat_progress_parallel_incr_param() calls in
> parallel_vacuum_process_one_index() (that have been added in 46ebdfe164).
> It has been the case "only" 36 times during my test case.
> 
> I think that 2. is less of a concern, but 1. is something that needs to be
> addressed because the leader process is not honouring its cost delay wait
> time in a noticeable way (at least during my test case).
> 
> I have not thought of a proposal yet; I'm just sharing my investigation as
> to why v2 has been faster than master during the vacuuming indexes phase.

I think that a reasonable approach is to make the reporting from the parallel
workers to the leader less aggressive (that is, have it occur less frequently).

Please find attached v3, which:

- ensures that there is at least 1 second between 2 reports to the leader, per
parallel worker.

- ensures that the reported delayed time is still correct (it keeps track of the
delayed time between 2 reports).

- does not add any extra pg_clock_gettime_ns() calls (as compared to v2).
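
To illustrate the idea, here is a minimal sketch (illustrative only, this is not
the attached v3; the helper name, the 1 second constant and the per-worker static
variables are made up, while pgstat_progress_parallel_incr_param() and the
instr_time macros are the existing ones). "now" is assumed to be the timestamp
already captured around the sleep, so the throttling itself needs no extra
clock call:

#define WORKER_REPORT_DELAY_INTERVAL_MS 1000    /* assumed 1 second threshold */

static double      delay_since_last_report = 0; /* ms accumulated locally */
static instr_time  last_report_time;            /* zero on first use */

static void
report_delayed_time_throttled(double msec_slept, instr_time now)
{
    instr_time  elapsed = now;

    delay_since_last_report += msec_slept;

    INSTR_TIME_SUBTRACT(elapsed, last_report_time);
    if (INSTR_TIME_GET_MILLISEC(elapsed) >= WORKER_REPORT_DELAY_INTERVAL_MS)
    {
        /* interrupt the leader at most once per second per worker */
        pgstat_progress_parallel_incr_param(PROGRESS_VACUUM_TIME_DELAYED,
                                            (int64) delay_since_last_report);
        delay_since_last_report = 0;
        last_report_time = now;
    }
}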

Remarks:

1. Having a purely time-based approach to throttle the reporting of the parallel
workers sounds reasonable. I don't think that the number of parallel workers has
to come into play, as:

 1.1) the more parallel workers are used, the smaller the leader's impact on
 the vacuuming indexes phase duration/workload (because the work is distributed
 across more processes).

 1.2) the fewer parallel workers there are, the less the leader will be
 interrupted (fewer parallel workers would report their delayed time).

2. The throttling is not based on the cost limit as that value is distributed
proportionally among the parallel workers (so we're back to the previous point).

3. The throttling is not based on the actual cost delay value because the leader
could be interrupted at the beginning, the middle or any other part of the wait,
and we are more interested in the frequency of the interrupts.

4. A 1 second reporting "throttling" looks like a reasonable threshold as:

 4.1 the idea is to have a significant impact when the leader could otherwise
have been interrupted, say, hundreds or thousands of times per second.

 4.2 it does not make that much sense for tools to sample pg_stat_progress_vacuum
multiple times per second (so a one second reporting granularity seems ok).

With this approach in place (v3 attached applied), during my test case:

- the leader has been interrupted about 2500 times (instead of about 345000
times with v2)

- the vacuuming indexes phase duration is very close to the one on master (it
has been 4 seconds faster (over an 8 minutes 40 seconds run), instead of 3
minutes faster with v2).

Thoughts?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
>From 99f417c0bcd7c29e126fdccdd6214ea37db67379 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot 
Date: Mon, 24 Jun 2024 08:43:26 +0000
Subject: [PATCH v3] Report the total amount of time that vacuum has been
 delayed due to cost delay

This commit adds one column: time_delayed to the pg_stat_progress_vacuum system
view to show the total amount of time in milliseconds that vacuum has been
delayed.

This uses the new parallel message type for progress reporting added
by f1889729dd.

For parallel workers, to avoid interrupting the leader too frequently (while it
might be sleeping for cost delay), the report is done only if the last report
has been made more than 1 second ago.

A purely time-based approach to throttle the reporting of the parallel workers
sounds reasonable.

Indeed, when deciding about the throttling:

1. The number of parallel workers should not come into play:

 1.1) the more parallel workers are used, the smaller the leader's impact on
 the vacuuming indexes phase duration/workload (because the work is distributed
 across more processes).

 1.2) the fewer parallel workers there are, the less the leader will be
 interrupted (fewer parallel workers would report their delayed time).

2. The cost limit should not come into play as that value is distributed
proportionally among the parallel workers (so we're back to the previous point).

3. The cost delay does not come into play as the leader could be interrupted at
the beginning, the middle or any other part of the wait, and we are more
interested in the frequency of the interrupts.

4. A 1 second reporting "throttling" looks like a reasonable threshold as:

 4.1 the idea is to have a significant impact when the leader could otherwise
have been interrupted, say, hundreds or thousands of times per second.

 4.2 it does not make that much sense for tools to sample pg_stat_progress_vacuum
multiple times per second (so a one second reporting granularity seems ok).

Re: Track the amount of time waiting due to cost_delay

2024-06-22 Thread Bertrand Drouvot
Hi,

On Thu, Jun 13, 2024 at 11:56:26AM +0000, Bertrand Drouvot wrote:
> == Observations ==
> 
> As compared to v2:
> 
> 1. scanning heap time is about the same
> 2. vacuuming indexes time is about 3 minutes longer on master
> 3. vacuuming heap time is about the same

I had a closer look to understand why the vacuuming indexes time has been about
3 minutes longer on master.

During the vacuuming indexes phase, the leader helps vacuum the indexes until it
reaches WaitForParallelWorkersToFinish() (meaning that once all the remaining
indexes are being handled by the parallel workers, the leader has nothing more
to do and simply waits for the parallel workers to finish).

While the leader process is involved in index vacuuming, it is also subject to
waits due to cost delay.

But with v2 applied, the leader may be interrupted by the parallel workers while
it is waiting (due to the new pgstat_progress_parallel_incr_param() calls that
the patch adds).

So, with v2 applied, the leader waits less (as it is interrupted while waiting)
when involved in index vacuuming, and that's why v2 is "faster" than master.

To put some numbers on this, I counted the number of times the leader's
pg_usleep() has been interrupted (by counting the number of times nanosleep()
returned a value < 0 in pg_usleep()). Here they are:

v2: the leader has been interrupted about 342605 times
master: the leader has been interrupted about 36 times

The ones on master are mainly coming from the 
pgstat_progress_parallel_incr_param() 
calls in parallel_vacuum_process_one_index().

The additional ones on v2 are coming from the 
pgstat_progress_parallel_incr_param()
calls added in vacuum_delay_point().
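
For reference, the counting was done with a simple debugging counter (not part
of any posted patch). Roughly, the non-Windows path of pg_usleep() in
src/port/pgsleep.c relies on nanosleep(), which returns -1 when the sleep is
interrupted by a signal, so it boils down to something like:

static uint64 interrupted_sleeps = 0;   /* debugging aid, made-up name */

void
pg_usleep(long microsec)
{
    if (microsec > 0)
    {
        struct timespec delay;

        delay.tv_sec = microsec / 1000000L;
        delay.tv_nsec = (microsec % 1000000L) * 1000;
        if (nanosleep(&delay, NULL) < 0)
            interrupted_sleeps++;   /* interrupted by a signal (EINTR) */
    }
}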

== Conclusion ==

1. vacuuming indexes time has been longer on master because with v2, the leader
has been interrupted 342605 times while waiting, thus making v2 "faster".

2. the leader being interrupted while waiting is also already happening on
master due to the pgstat_progress_parallel_incr_param() calls in
parallel_vacuum_process_one_index() (that have been added in 46ebdfe164). It has
been the case "only" 36 times during my test case.

I think that 2. is less of a concern, but 1. is something that needs to be
addressed because the leader process is not honouring its cost delay wait time
in a noticeable way (at least during my test case).

I have not thought of a proposal yet; I'm just sharing my investigation as to
why v2 has been faster than master during the vacuuming indexes phase.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Avoid orphaned objects dependencies, take 3

2024-06-21 Thread Bertrand Drouvot
Hi,

On Wed, Jun 19, 2024 at 02:11:50PM +0000, Bertrand Drouvot wrote:
> To sum up, I did not see any cases that did not lead to 1. or 2., so I think
> it's safe to not add an extra lock for the RelationRelationId case. If, for any
> reason, there are still cases that are outside 1. or 2. then they may lead to
> orphaned dependencies linked to the RelationRelationId class. I think that's
> fine to take that "risk" given that a. that would not be worse than currently
> and b. I did not see any of those in our fleet currently (while I have seen a
> non-negligible amount outside of the RelationRelationId case).

Another thought for the RelationRelationId class case: we could check if there
is a lock first, and if there is no lock, then acquire one. That would ensure
the relation is always locked (so no "risk" anymore), but OTOH it may add
"unnecessary" locking (see 2. mentioned previously).

I think I do prefer this approach, to be on the safe side of things. What do
you think?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Pluggable cumulative statistics

2024-06-20 Thread Bertrand Drouvot
Hi,

On Thu, Jun 20, 2024 at 09:46:30AM +0900, Michael Paquier wrote:
> On Thu, Jun 13, 2024 at 04:59:50PM +0900, Michael Paquier wrote:
> > - How should the persistence of the custom stats be achieved?
> > Callbacks to give custom stats kinds a way to write/read their data,
> > push everything into a single file, or support both?
> > - Should this do like custom RMGRs and assign to each stats kinds ID
> > that are set in stone rather than dynamic ones?

> These two questions have been itching me in terms of how it would work
> for extension developers, after noticing that custom RMGRs are used
> more than I thought:
> https://wiki.postgresql.org/wiki/CustomWALResourceManagers
> 
> The result is proving to be nicer, shorter by 300 lines in total and
> much simpler when it comes to think about the way stats are flushed
> because it is possible to achieve the same result as the first patch
> set without manipulating any of the code paths doing the read and
> write of the pgstats file.

I think it makes sense to follow the same "behavior" as the custom
WAL resource managers. That, indeed, looks much simpler than v1.

> In terms of implementation, pgstat.c's KindInfo data is divided into
> two parts, for efficiency:
> - The existing in-core stats with designated initializers, renamed as
> built-in stats kinds.
> - The custom stats kinds are saved in TopMemoryContext,

Agree that a backend lifetime memory area is fine for that purpose.

> and can only
> be registered with shared_preload_libraries.  The patch reserves a set
> of 128 hardcoded slots for all the custom kinds making the lookups for
> the KindInfos quite cheap.

+   MemoryContextAllocZero(TopMemoryContext,
+                          sizeof(PgStat_KindInfo *) * PGSTAT_KIND_CUSTOM_SIZE);

and that's only 8 * PGSTAT_KIND_CUSTOM_SIZE bytes in total.

I had a quick look at the patches (I have in mind to do more):

> With that in mind, the patch set is more pleasant to the eye, and the
> attached v2 consists of:
> - 0001 and 0002 are some cleanups, same as previously to prepare for
> the backend-side APIs.

0001 and 0002 look pretty straightforward at a quick look.

> - 0003 adds the backend support to plug-in custom stats.

1 ===

It looks to me that there is a mix of "in core" and "built-in" to name the
non-custom stats. Maybe it's worth using just one?

From what I can see (and as you said above), this is mainly inspired by the
custom resource managers, and 2 === and 3 === are probably copy/paste consequences.

2 ===

+   if (pgstat_kind_custom_infos[idx] != NULL &&
+       pgstat_kind_custom_infos[idx]->name != NULL)
+       ereport(ERROR,
+               (errmsg("failed to register custom cumulative statistics \"%s\" with ID %u",
+                       kind_info->name, kind),
+                errdetail("Custom resource manager \"%s\" already registered with the same ID.",
+                          pgstat_kind_custom_infos[idx]->name)));

s/Custom resource manager/Custom cumulative statistics/

3 ===

+   ereport(LOG,
+           (errmsg("registered custom resource manager \"%s\" with ID %u",
+                   kind_info->name, kind)));

s/custom resource manager/custom cumulative statistics/

> - 0004 includes documentation.

Did not look yet.

> - 0005 is an example of how to use them, with a TAP test providing
> coverage.

Did not look yet.

As I said, I have in mind to do a more in-depth review. I've noted the above
while doing a quick read of the patches, so I thought it makes sense to share
them now while at it.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Pluggable cumulative statistics

2024-06-20 Thread Bertrand Drouvot
Hi,

On Thu, Jun 13, 2024 at 04:59:50PM +0900, Michael Paquier wrote:
> Hi all,
> 
> 2) Make the shmem pgstats pluggable so as it is possible for extensions
> to register their own stats kinds.

Thanks for the patch! I like the idea of having custom stats (it has also been
somehow mentioned in [1]).

> 2) is actually something that can be used for more things than
> just pg_stat_statements, because people love extensions and
> statistics (spoiler: I do).

+1

> * Making custom stats data persistent is an interesting problem, and
> there are a couple of approaches I've considered:
> ** Allow custom kinds to define callbacks to read and write data from
> a source they'd want, like their own file through a fd.  This has the
> disadvantage to remove the benefit of c) above.
> ** Store everything in the existing stats file, adding one type of
> entry like 'S' and 'N' with a "custom" type, where the *name* of the
> custom stats kind is stored instead of its ID computed from shared
> memory.

What about having 2 files?

- One is the existing stats file
- One "predefined" for all the custom stats (so what you've done minus the
in-core stats). This one would not be configurable and the extensions would
not need to know about it.

Would that remove the benefit from c) that you mentioned up-thread?

[1]: 
https://www.postgresql.org/message-id/20220818195124.c7ipzf6c5v7vxymc%40awork3.anarazel.de

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Avoid orphaned objects dependencies, take 3

2024-06-19 Thread Bertrand Drouvot
Hi,

On Mon, Jun 17, 2024 at 05:57:05PM +0000, Bertrand Drouvot wrote:
> Hi,
> 
> On Mon, Jun 17, 2024 at 12:24:46PM -0400, Robert Haas wrote:
> > So to try to sum up here: I'm not sure I agree with this design. But I
> > also feel like the design is not as clear and consistently implemented
> > as it could be. So even if I just ignored the question of whether it's
> > the right design, it feels like we're a ways from having something
> > potentially committable here, because of issues like the ones I
> > mentioned in the last paragraph.
> > 
> 
> Agree. I'll now move on with the "XXX Do we need a lock for 
> RelationRelationId?"
> comments that I put in v10 (see [1]) and study all the cases around them.

A. I went through all of them, did some testing for all, and reached the
conclusion that we must be in one of the two following cases that would already
prevent the relation from being dropped:

1. The relation is already locked (could be an existing relation or a relation
that we are creating).

2. The relation is protected indirectly (i.e. an index protected by a lock on
its table, a table protected by a lock on a function that depends on
the table...).

So we don't need to add a lock for the RelationRelationId object class in the
cases above.

As a consequence, I replaced the "XXX" related comments that were in v10 with
another set of comments in v11 (attached) like "No need to call LockRelationOid()
(through LockNotPinnedObject())". The reason is to make it clear in the code
and also to ease the review.

B. I explained in [1] (while sharing v10) that the object locking is now outside
of the dependency code except for (and I explained why):

recordDependencyOnExpr() 
recordDependencyOnSingleRelExpr()
makeConfigurationDependencies()

So I also did some testing, on the RelationRelationId case, for those and I
reached the same conclusion as the one shared above.

For A. and B., the testing has been done by adding an "ereport(WARNING..." at
those places when a RelationRelationId is involved. Then I ran "make check",
went to the failed tests (outputs were not the expected ones due to the extra
"WARNING"), reproduced them with gdb, and checked for the lock on the relation
producing the "WARNING". All of those were linked to 1. or 2.
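
For clarity, that temporary instrumentation was roughly of the following shape
(debugging aid only, removed after the analysis; the message wording here is
illustrative):

if (object->classId == RelationRelationId)
    ereport(WARNING,
            (errmsg("recording dependency on relation %u without an explicit lock request",
                    object->objectId)));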

Note that adding an assertion on an existing lock would not work for the cases
described in 2.

So, I'm now confident that we must be in 1. or 2. but it's also possible
that I've missed some cases (though I think the way the testing has been done is
not that weak).

To sum up, I did not see any cases that did not lead to 1. or 2., so I think
it's safe to not add an extra lock for the RelationRelationId case. If, for any
reason, there are still cases that are outside 1. or 2. then they may lead to
orphaned dependencies linked to the RelationRelationId class. I think that's
fine to take that "risk" given that a. that would not be worse than currently
and b. I did not see any of those in our fleet currently (while I have seen a
non-negligible amount outside of the RelationRelationId case).

> Once done, I think that it will be easier to 1. remove ambiguity, 2. document
> and 3. do the "right" thing regarding the RelationRelationId object class.
> 

Please find attached v11, where I added more detailed comments in the commit
message and also in the code (I also removed the useless check on
AuthMemRelationId).

[1]: 
https://www.postgresql.org/message-id/ZnAVEBhlGvpDDVOD%40ip-10-97-1-34.eu-west-3.compute.internal

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
>From 50e11432960e0ad5d940d2e7d9557fc4770d8262 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot 
Date: Fri, 29 Mar 2024 15:43:26 +0000
Subject: [PATCH v11] Avoid orphaned objects dependencies

It's currently possible to create orphaned objects dependencies, for example:

Scenario 1:

session 1: begin; drop schema schem;
session 2: create a function in the schema schem
session 1: commit;

With the above, the function created in session 2 would be linked to a non
existing schema.

Scenario 2:

session 1: begin; create a function in the schema schem
session 2: drop schema schem;
session 1: commit;

With the above, the function created in session 1 would be linked to a non
existing schema.

To avoid those scenarios, a new lock (that conflicts with a lock taken by DROP)
has been put in place before the dependencies are recorded. With this in
place, the drop schema in scenario 2 would be blocked.

Also, after the new lock attempt, the patch checks that the object still exists:
with this in place, session 2 in scenario 1 would be blocked and would report an
error once session 1 commits (that would not be the case should session 1 abort
the transaction).

If the object is dropped before the new lock attempt is triggered then the patch
would also report an error (but with less details).

Re: Avoid orphaned objects dependencies, take 3

2024-06-17 Thread Bertrand Drouvot
Hi,

On Mon, Jun 17, 2024 at 12:24:46PM -0400, Robert Haas wrote:
> On Fri, Jun 14, 2024 at 3:54 AM Bertrand Drouvot
>  wrote:
> > > Ah, right. So, I was assuming that, with either this version of your
> > > patch or the earlier version, we'd end up locking the constraint
> > > itself. Was I wrong about that?
> >
> > The child constraint itself is not locked when going through
> > ConstraintSetParentConstraint().
> >
> > While at it, let's look at a full example and focus on your concern.
> 
> I'm not at the point of having a concern yet, honestly. I'm trying to
> understand the design ideas. The commit message just says that we take
> a conflicting lock, but it doesn't mention which object types that
> principle does or doesn't apply to. I think the idea of skipping it
> for cases where it's redundant with the relation lock could be the
> right idea, but if that's what we're doing, don't we need to explain
> the principle somewhere? And shouldn't we also apply it across all
> object types that have the same property?

Yeah, I still need to deeply study this area and document it. 

> Along the same lines:
> 
> + /*
> + * Those don't rely on LockDatabaseObject() when being dropped (see
> + * AcquireDeletionLock()). Also it looks like they can not produce
> + * orphaned dependent objects when being dropped.
> + */
> + if (object->classId == RelationRelationId || object->classId ==
> AuthMemRelationId)
> + return;
> 
> "It looks like X cannot happen" is not confidence-inspiring.

Yeah, it is not. It is just a "feeling" that I need to work on to remove
any ambiguity and/or adjust the code as needed.

> At the
> very least, a better comment is needed here. But also, that relation
> has no exception for AuthMemRelationId, only for RelationRelationId.
> And also, the exception for RelationRelationId doesn't imply that we
> don't need a conflicting lock here: the special case for
> RelationRelationId in AcquireDeletionLock() is necessary because the
> lock tag format is different for relations than for other database
> objects, not because we don't need a lock at all. If the handling here
> were really symmetric with what AcquireDeletionLock(), the coding
> would be to call either LockRelationOid() or LockDatabaseObject()
> depending on whether classid == RelationRelationId.

Agree.

> Now, that isn't
> actually necessary, because we already have relation-locking calls
> elsewhere, but my point is that the rationale this commit gives for
> WHY it isn't necessary seems to me to be wrong in multiple ways.

Agree. I'm not done with that part yet (should have made it more clear).

> So to try to sum up here: I'm not sure I agree with this design. But I
> also feel like the design is not as clear and consistently implemented
> as it could be. So even if I just ignored the question of whether it's
> the right design, it feels like we're a ways from having something
> potentially committable here, because of issues like the ones I
> mentioned in the last paragraph.
> 

Agree. I'll now move on with the "XXX Do we need a lock for RelationRelationId?"
comments that I put in v10 (see [1]) and study all the cases around them.

Once done, I think that it will be easier to 1. remove ambiguity, 2. document
and 3. do the "right" thing regarding the RelationRelationId object class.

[1]: 
https://www.postgresql.org/message-id/ZnAVEBhlGvpDDVOD%40ip-10-97-1-34.eu-west-3.compute.internal

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Allow logical failover slots to wait on synchronous replication

2024-06-17 Thread Bertrand Drouvot
Hi,

On Tue, Jun 11, 2024 at 10:00:46AM +0000, Bertrand Drouvot wrote:
> Hi,
> 
> On Mon, Jun 10, 2024 at 09:25:10PM -0500, Nathan Bossart wrote:
> > On Mon, Jun 10, 2024 at 03:51:05PM -0700, John H wrote:
> > > The existing 'standby_slot_names' isn't great for users who are running
> > > clusters with quorum-based synchronous replicas. For instance, if
> > > the user has  synchronous_standby_names = 'ANY 3 (A,B,C,D,E)' it's a
> > > bit tedious to have to reconfigure the standby_slot_names to set it to
> > > the most updated 3 sync replicas whenever different sync replicas start
> > > lagging. In the event that both GUCs are set, 'standby_slot_names' takes
> > > precedence.
> > 
> > Hm.  IIUC you'd essentially need to set standby_slot_names to "A,B,C,D,E"
> > to get the desired behavior today.  That might ordinarily be okay, but it
> > could cause logical replication to be held back unnecessarily if one of the
> > replicas falls behind for whatever reason.  A way to tie standby_slot_names
> > to synchronous replication instead does seem like it would be useful in
> > this case.
> 
> FWIW, I have the same understanding and also think your proposal would be
> useful in this case.

A few random comments about v1:

1 

+   int mode = SyncRepWaitMode;

It's set to SyncRepWaitMode and then never changed. Is it worth getting rid of "mode"?

2 

+   static XLogRecPtr   lsn[NUM_SYNC_REP_WAIT_MODE] = {InvalidXLogRecPtr};

I did some testing and saw that the lsn[] values were not always set to
InvalidXLogRecPtr right after. It looks like, in that case, we should avoid
setting the lsn[] values at compile time. Then, what about:

1. remove the "static". 

or

2. keep the static but set the lsn[] values after its declaration.

3 

-   /*
-* Return false if not all the standbys have caught up to the specified
-* WAL location.
-*/
-   if (caught_up_slot_num != standby_slot_names_config->nslotnames)
-   return false;
+   if (!XLogRecPtrIsInvalid(lsn[mode]) && lsn[mode] >= wait_for_lsn)
+   return true;

The lsn[] values have (or should have, see 2 above) just been initialized to
InvalidXLogRecPtr, so XLogRecPtrIsInvalid(lsn[mode]) will always return
true. I think this check is not needed then.

4 

> > > I did some very brief pgbench runs to compare the latency. Client instance
> > > was running pgbench and 10 logical clients while the Postgres box hosted
> > > the writer and 5 synchronous replicas.

> > > There's a hit to TPS

Out of curiosity, did you compare with standby_slot_names_from_syncrep set to 
off
and standby_slot_names not empty?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Avoid orphaned objects dependencies, take 3

2024-06-17 Thread Bertrand Drouvot
Hi,

On Thu, Jun 13, 2024 at 04:52:09PM +0000, Bertrand Drouvot wrote:
> Hi,
> 
> On Thu, Jun 13, 2024 at 10:49:34AM -0400, Robert Haas wrote:
> > On Fri, Jun 7, 2024 at 4:41 AM Bertrand Drouvot
> >  wrote:
> > > Do you still find the code hard to maintain with v9?
> > 
> > I don't think it substantially changes my concerns as compared with
> > the earlier version.
> 
> Thanks for the feedback, I'll give it more thoughts.

Please find attached v10 that puts the object locking outside of the dependency
code.

It's done that way except for:

recordDependencyOnExpr() 
recordDependencyOnSingleRelExpr()
makeConfigurationDependencies()

The reason is that I think part of the logic that is inside the above functions
would need to be duplicated, and I'm not sure that's worth it.

For example, we would probably need to:

- make additional calls to find_expr_references_walker() 
- make additional scan on the config map

It's also not done outside of recordDependencyOnCurrentExtension() as:

1. I think it is clear enough that way (as it is clear that the lock is taken on
an ExtensionRelationId object class).
2. Why include "commands/extension.h" in more places (locking would
depend on "creating_extension" and "CurrentExtensionObject"), given 1.?

Remarks:

1. depLockAndCheckObject() and friends in v9 have been renamed to
LockNotPinnedObject() and friends (as the vast majority of their calls are now
done outside of the dependency code).

2. regarding the concern around RelationRelationId (discussed in [1]), v10 adds
a comment "XXX Do we need a lock for RelationRelationId?" at the places we
may want to lock this object class. I have not thought about it yet (but will
do); I only added this comment at multiple places.

I think that v10 is easier to follow (as compared to v9) as we can easily see
which object class we'll put a lock on.

Thoughts?

[1]: 
https://www.postgresql.org/message-id/Zmv3TPfJAyQXhIdu%40ip-10-97-1-34.eu-west-3.compute.internal

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
>From dc75e3255803617d21c55d77dc218904bd729d81 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot 
Date: Fri, 29 Mar 2024 15:43:26 +0000
Subject: [PATCH v10] Avoid orphaned objects dependencies

It's currently possible to create orphaned objects dependencies, for example:

Scenario 1:

session 1: begin; drop schema schem;
session 2: create a function in the schema schem
session 1: commit;

With the above, the function created in session 2 would be linked to a non
existing schema.

Scenario 2:

session 1: begin; create a function in the schema schem
session 2: drop schema schem;
session 1: commit;

With the above, the function created in session 1 would be linked to a non
existing schema.

To avoid those scenarios, a new lock (that conflicts with a lock taken by DROP)
has been put in place before the dependencies are recorded. With this in
place, the drop schema in scenario 2 would be blocked.

Also, after the new lock attempt, the patch checks that the object still exists:
with this in place, session 2 in scenario 1 would be blocked and would report an
error once session 1 commits (that would not be the case should session 1 abort
the transaction).

If the object is dropped before the new lock attempt is triggered then the patch
would also report an error (but with less details).

The patch adds a few tests for some dependency cases (that would currently produce
orphaned objects):

- schema and function (as the above scenarios)
- alter a dependency (function and schema)
- function and arg type
- function and return type
- function and function
- domain and domain
- table and type
- server and foreign data wrapper
---
 src/backend/catalog/aclchk.c  |   1 +
 src/backend/catalog/dependency.c  | 109 ++-
 src/backend/catalog/heap.c|   8 ++
 src/backend/catalog/index.c   |  16 +++
 src/backend/catalog/objectaddress.c   |  57 
 src/backend/catalog/pg_aggregate.c|   9 ++
 src/backend/catalog/pg_attrdef.c  |   1 +
 src/backend/catalog/pg_cast.c |   5 +
 src/backend/catalog/pg_collation.c|   1 +
 src/backend/catalog/pg_constraint.c   |  11 ++
 src/backend/catalog/pg_conversion.c   |   2 +
 src/backend/catalog/pg_depend.c   |  27 +++-
 src/backend/catalog/pg_operator.c |  19 +++
 src/backend/catalog/pg_proc.c |  17 ++-
 src/backend/catalog/pg_publication.c  |   5 +
 src/backend/catalog/pg_range.c|   6 +
 src/backend/catalog/pg_type.c |  33 +
 src/backend/catalog/toasting.c|   1 +
 src/backend/commands/alter.c  |   4 +
 src/backend/commands/amcmds.c |  

Re: Avoid orphaned objects dependencies, take 3

2024-06-14 Thread Bertrand Drouvot
Hi,

On Thu, Jun 13, 2024 at 02:27:45PM -0400, Robert Haas wrote:
> On Thu, Jun 13, 2024 at 12:52 PM Bertrand Drouvot
>  wrote:
> > > > table_open(childRelId, ...) would lock any "ALTER TABLE  
> > > > DROP CONSTRAINT"
> > > > already. Not sure I understand your concern here.
> > >
> > > I believe this is not true. This would take a lock on the table, not
> > > the constraint itself.
> >
> > I agree that it would not lock the constraint itself. What I meant to say 
> > is that
> > , nevertheless, the constraint can not be dropped. Indeed, the "ALTER TABLE"
> > necessary to drop the constraint (ALTER TABLE  DROP CONSTRAINT) 
> > would
> > be locked by the table_open(childRelId, ...).
> 
> Ah, right. So, I was assuming that, with either this version of your
> patch or the earlier version, we'd end up locking the constraint
> itself. Was I wrong about that?

The child constraint itself is not locked when going through
ConstraintSetParentConstraint().

While at it, let's look at a full example and focus on your concern.

Let's do that with this gdb file:

"
$ cat gdb.txt
b dependency.c:1542
command 1
printf "Will return for: classId %d and objectId %d\n", 
object->classId, object->objectId
c
end
b dependency.c:1547 if object->classId == 2606
command 2
printf "Will lock constraint: classId %d and objectId %d\n", 
object->classId, object->objectId
c
end
"

knowing that:

"
Line 1542 is the return here in depLockAndCheckObject() (your concern):

if (object->classId == RelationRelationId || object->classId == AuthMemRelationId)
return;

Line 1547 is the lock here in depLockAndCheckObject():

/* assume we should lock the whole object not a sub-object */
LockDatabaseObject(object->classId, object->objectId, 0, AccessShareLock);
"

So, with gdb attached to a session let's:

1. Create the parent relation

CREATE TABLE upsert_test (
a   INT,
b   TEXT
) PARTITION BY LIST (a);

gdb produces:

---
Will return for: classId 1259 and objectId 16384
Will return for: classId 1259 and objectId 16384
---

Oid 16384 is upsert_test, so I think the return (dependency.c:1542) is fine as
we are creating the object (it can't be dropped as not visible to anyone else).

2. Create another relation (will be the child)

CREATE TABLE upsert_test_2 (b TEXT, a int);

gdb produces:

---
Will return for: classId 1259 and objectId 16391
Will return for: classId 1259 and objectId 16394
Will return for: classId 1259 and objectId 16394
Will return for: classId 1259 and objectId 16391
---

Oid 16391 is upsert_test_2
Oid 16394 is pg_toast_16391

so I think the return (dependency.c:1542) is fine as we are creating those
objects (can't be dropped as not visible to anyone else).

3. Attach the partition

ALTER TABLE upsert_test ATTACH PARTITION upsert_test_2 FOR VALUES IN (2);

gdb produces:

---
Will return for: classId 1259 and objectId 16384
---

That's fine because we had already locked the relation 16384 through
AlterTableLookupRelation()->RangeVarGetRelidExtended()->LockRelationOid().

4. Add a constraint on the child relation

ALTER TABLE upsert_test_2 add constraint bdtc2 UNIQUE (a);

gdb produces:

---
Will return for: classId 1259 and objectId 16391
Will lock constraint: classId 2606 and objectId 16397
---

That's fine because we had already locked the relation 16391 through
AlterTableLookupRelation()->RangeVarGetRelidExtended()->LockRelationOid().

Oid 16397 is the constraint we're creating (bdtc2).

5. Add a constraint on the parent relation (this goes through
ConstraintSetParentConstraint())

ALTER TABLE upsert_test add constraint bdtc1 UNIQUE (a);

gdb produces:

---
Will return for: classId 1259 and objectId 16384
Will lock constraint: classId 2606 and objectId 16399
Will return for: classId 1259 and objectId 16398
Will return for: classId 1259 and objectId 16391
Will lock constraint: classId 2606 and objectId 16399
Will return for: classId 1259 and objectId 16391
---

Regarding "Will return for: classId 1259 and objectId 16384":
That's fine because we had already locked the relation 16384 through
AlterTableLookupRelation()->RangeVarGetRelidExtended()->LockRelationOid().

Regarding "Will lock constraint: classId 2606 and objectId 16399":
Oid 16399 is the constraint that we're creating.

Regarding "Will return for: classId 1259 and objectId 16398":
That's fine because Oid 16398 is an index that we're creating while creating
the constraint (so the index can't be dropped as not visible to anyone else).

Regarding "Will return for: classId 1259 and objectId 16391":
That's fine because 16384 is already locked (as mentioned above). And
I think that's fine because trying to drop "upsert_test_2" (ak

Re: Avoid orphaned objects dependencies, take 3

2024-06-13 Thread Bertrand Drouvot
Hi,

On Thu, Jun 13, 2024 at 10:49:34AM -0400, Robert Haas wrote:
> On Fri, Jun 7, 2024 at 4:41 AM Bertrand Drouvot
>  wrote:
> > Do you still find the code hard to maintain with v9?
> 
> I don't think it substantially changes my concerns as compared with
> the earlier version.

Thanks for the feedback, I'll give it more thoughts.

> 
> > > but we're not similarly careful about other operations e.g.
> > > ConstraintSetParentConstraint is called by DefineIndex which calls
> > > table_open(childRelId, ...) first, but there's no logic in DefineIndex
> > > to lock the constraint.
> >
> > table_open(childRelId, ...) would lock any "ALTER TABLE  DROP 
> > CONSTRAINT"
> > already. Not sure I understand your concern here.
> 
> I believe this is not true. This would take a lock on the table, not
> the constraint itself.

I agree that it would not lock the constraint itself. What I meant to say is
that, nevertheless, the constraint cannot be dropped. Indeed, the "ALTER TABLE"
necessary to drop the constraint (ALTER TABLE  DROP CONSTRAINT) would
be blocked by the table_open(childRelId, ...).

That's why I don't understand your concern with this particular example. But
anyway, I'll double check your related concern:

+ if (object->classId == RelationRelationId || object->classId ==
AuthMemRelationId)
+ return;

in depLockAndCheckObject(). 

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Track the amount of time waiting due to cost_delay

2024-06-13 Thread Bertrand Drouvot
Hi,

On Tue, Jun 11, 2024 at 08:26:23AM +0000, Bertrand Drouvot wrote:
> Hi,
> 
> On Tue, Jun 11, 2024 at 04:07:05PM +0900, Masahiko Sawada wrote:
> 
> > Thank you for the proposal and the patch. I understand the motivation
> > of this patch.
> 
> Thanks for looking at it!
> 
> > Beside the point Nathan mentioned, I'm slightly worried
> > that massive parallel messages could be sent to the leader process
> > when the cost_limit value is low.
> 
> I see, I can/will do some testing in this area and share the numbers.

Here is the result of the test. It has been launched several times and it
produced the same (surprising) result each time.

== Context 

The testing has been done with this relation (large_tbl) and its indexes:

postgres=# SELECT n.nspname, c.relname, count(*) AS buffers
 FROM pg_buffercache b JOIN pg_class c
 ON b.relfilenode = pg_relation_filenode(c.oid) AND
b.reldatabase IN (0, (SELECT oid FROM pg_database
  WHERE datname = current_database()))
 JOIN pg_namespace n ON n.oid = c.relnamespace
 GROUP BY n.nspname, c.relname
 ORDER BY 3 DESC
 LIMIT 22;

 nspname |  relname   | buffers
-++-
 public  | large_tbl  |  80
 public  | large_tbl_filler13 |  125000
 public  | large_tbl_filler6  |  125000
 public  | large_tbl_filler5  |  125000
 public  | large_tbl_filler3  |  125000
 public  | large_tbl_filler15 |  125000
 public  | large_tbl_filler4  |  125000
 public  | large_tbl_filler20 |  125000
 public  | large_tbl_filler18 |  125000
 public  | large_tbl_filler14 |  125000
 public  | large_tbl_filler8  |  125000
 public  | large_tbl_filler11 |  125000
 public  | large_tbl_filler19 |  125000
 public  | large_tbl_filler7  |  125000
 public  | large_tbl_filler1  |  125000
 public  | large_tbl_filler12 |  125000
 public  | large_tbl_filler9  |  125000
 public  | large_tbl_filler17 |  125000
 public  | large_tbl_filler16 |  125000
 public  | large_tbl_filler10 |  125000
 public  | large_tbl_filler2  |  125000
 public  | large_tbl_pkey |5486
(22 rows)

All of them completely fit in memory (to avoid I/O read latency during the
vacuum).

The config, outside of the defaults, is:

max_wal_size = 4GB
shared_buffers = 30GB
vacuum_cost_delay = 1
autovacuum_vacuum_cost_delay = 1
max_parallel_maintenance_workers = 8
max_parallel_workers = 10
vacuum_cost_limit = 10
autovacuum_vacuum_cost_limit = 10

My system is not overloaded, has enough resources to run this test and only this
test is running.

== Results 

== With v2 (attached) applied on master

postgres=# VACUUM (PARALLEL 8) large_tbl;
VACUUM
Time: 1146873.016 ms (19:06.873)

The duration is split this way:

Vacuum phase: cumulative time (cumulative time delayed)
===
scanning heap: 00:08:16.414628 (time_delayed: 444370)
vacuuming indexes: 00:14:55.314699 (time_delayed: 2545293)
vacuuming heap: 00:19:06.814617 (time_delayed: 2767540)

I sampled active sessions from pg_stat_activity (one second interval), here is
the summary during the vacuuming indexes phase (ordered by count):

 leader_pid |  pid   |   wait_event   | count
+++---
 452996 | 453225 | VacuumDelay|   366
 452996 | 453223 | VacuumDelay|   363
 452996 | 453226 | VacuumDelay|   362
 452996 | 453224 | VacuumDelay|   361
 452996 | 453222 | VacuumDelay|   359
 452996 | 453221 | VacuumDelay|   359
| 452996 | VacuumDelay|   331
| 452996 | CPU|30
 452996 | 453224 | WALWriteLock   |23
 452996 | 453222 | WALWriteLock   |20
 452996 | 453226 | WALWriteLock   |20
 452996 | 453221 | WALWriteLock   |19
| 452996 | WalSync|18
 452996 | 453225 | WALWriteLock   |18
 452996 | 453223 | WALWriteLock   |16
| 452996 | WALWriteLock   |15
 452996 | 453221 | CPU|14
 452996 | 453222 | CPU|14
 452996 | 453223 | CPU|12
 452996 | 453224 | CPU|10
 452996 | 453226 | CPU|10
 452996 | 453225 | CPU| 8
 452996 | 453223 | WalSync| 4
 452996 | 453221 | WalSync| 2
 452996 | 453226 | WalWrite   | 2
 452996 | 453221 | WalWrite   | 1
| 452996 | ParallelFinish | 1
 452996 | 453224 | WalSync| 1
 452996 | 453225 | WalSync| 1
 452996 | 453222 | WalWrite   | 1
 452996 | 453225 | WalWrite   | 1
 452996 | 453222 | WalSync| 1
 452996 | 4

Re: relfilenode statistics

2024-06-12 Thread Bertrand Drouvot
Hi,

On Tue, Jun 11, 2024 at 03:35:23PM +0900, Kyotaro Horiguchi wrote:
> At Mon, 10 Jun 2024 08:09:56 +0000, Bertrand Drouvot 
>  wrote in 
> > 
> > My idea was to move all that is in pg_statio_all_tables to relfilenode stats
> > and 1) add new stats to pg_statio_all_tables (like heap_blks_written), 2) 
> > ensure
> > the user can still retrieve the stats from pg_statio_all_tables in such a 
> > way
> > that it survives a rewrite, 3) provide dedicated APIs to retrieve
> > relfilenode stats but only for the current relfilenode, 4) document this
> > behavior. This is what the POC patch is doing for heap_blks_written (would
> > need to do the same for heap_blks_read and friends) except for the 
> > documentation
> > part. What do you think?
> 
> In my opinion,

Thanks for looking at it!

> it is certainly strange that bufmgr is aware of
> relation kinds, but introducing relfilenode stats to avoid this skew
> doesn't seem to be the best way, as it invites inconclusive arguments
> like the one raised above. The fact that we transfer counters from old
> relfilenodes to new ones indicates that we are not really interested
> in counts by relfilenode. If that's the case, wouldn't it be simpler
> to call pgstat_count_relation_buffer_read() from bufmgr.c and then
> branch according to relkind within that function? If you're concerned
> about the additional branch, some ingenuity may be needed.

That may be doable for "read" activities but what about write activities?
Do you mean not relying on relfilenode stats for reads but relying on 
relfilenode
stats for writes?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Track the amount of time waiting due to cost_delay

2024-06-12 Thread Bertrand Drouvot
Hi,

On Tue, Jun 11, 2024 at 01:13:48PM -0400, Robert Haas wrote:
> On Tue, Jun 11, 2024 at 5:49 AM Bertrand Drouvot
>  wrote:
> > As we can see the actual wait time is 30ms less than the intended wait time 
> > with
> > this simple test. So I still think we should go with 1) actual wait time 
> > and 2)
> > report the number of waits (as mentioned in [1]). Does that make sense to 
> > you?
> 
> I like the idea of reporting the actual wait time better,

+1

> provided
> that we verify that doing so isn't too expensive. I think it probably
> isn't, because in a long-running VACUUM there is likely to be disk
> I/O, so the CPU overhead of a few extra gettimeofday() calls should be
> fairly low by comparison.

Agree.

> I wonder if there's a noticeable hit when
> everything is in-memory. I guess probably not, because with any sort
> of normal configuration, we shouldn't be delaying after every block we
> process, so the cost of those gettimeofday() calls should still be
> getting spread across quite a bit of real work.

I did some testing, with:

shared_buffers = 12GB
vacuum_cost_delay = 1
autovacuum_vacuum_cost_delay = 1
max_parallel_maintenance_workers = 0
max_parallel_workers = 0

added to a default config file.

A table and all its indexes were fully in memory, the numbers are:

postgres=# SELECT n.nspname, c.relname, count(*) AS buffers
 FROM pg_buffercache b JOIN pg_class c
 ON b.relfilenode = pg_relation_filenode(c.oid) AND
b.reldatabase IN (0, (SELECT oid FROM pg_database
  WHERE datname = current_database()))
 JOIN pg_namespace n ON n.oid = c.relnamespace
 GROUP BY n.nspname, c.relname
 ORDER BY 3 DESC
 LIMIT 11;

 nspname |  relname  | buffers
-+---+-
 public  | large_tbl |  80
 public  | large_tbl_pkey|5486
 public  | large_tbl_filler7 |1859
 public  | large_tbl_filler4 |1859
 public  | large_tbl_filler1 |1859
 public  | large_tbl_filler6 |1859
 public  | large_tbl_filler3 |1859
 public  | large_tbl_filler2 |1859
 public  | large_tbl_filler5 |1859
 public  | large_tbl_filler8 |1859
 public  | large_tbl_version |1576
(11 rows)


The observed timings when vacuuming this table are:

On master:

vacuum phase: cumulative duration
-

scanning heap: 00:00:37.808184
vacuuming indexes: 00:00:41.808176
vacuuming heap: 00:00:54.808156

On master patched with actual time delayed:

vacuum phase: cumulative duration
-

scanning heap: 00:00:36.502104 (time_delayed: 22202)
vacuuming indexes: 00:00:41.002103 (time_delayed: 23769)
vacuuming heap: 00:00:54.302096 (time_delayed: 34886)

As we can see, there is no noticeable degradation, while the vacuum entered this
instrumentation code path about 34886 times (cost_delay was set to 1).

> That said, I'm not sure this experiment shows a real problem with the
> idea of showing intended wait time. It does establish the concept that
> repeated signals can throw our numbers off, but 30ms isn't much of a
> discrepancy.

Yeah, the idea was just to show how easy it is to create a 30ms discrepancy.

> I'm worried about being off by a factor of two, or an
> order of magnitude. I think we still don't know if that can happen,
> but if we're going to show actual wait time anyway, then we don't need
> to explore the problems with other hypothetical systems too much.

Agree.

> I'm not convinced that reporting the number of waits is useful. If we
> were going to report a possibly-inaccurate amount of actual waiting,
> then also reporting the number of waits might make it easier to figure
> out when the possibly-inaccurate number was in fact inaccurate. But I
> think it's way better to report an accurate amount of actual waiting,
> and then I'm not sure what we gain by also reporting the number of
> waits.

Sami shared his thoughts in [1] and [2] and so did I in [3]. If some of us still
don't think that reporting the number of waits is useful then we can probably
start without it.

[1]: 
https://www.postgresql.org/message-id/0EA474B6-BF88-49AE-82CA-C1A9A3C17727%40amazon.com
[2]: 
https://www.postgresql.org/message-id/E12435E2-5FCA-49B0-9ADB-0E7153F95E2D%40amazon.com
[3]: 
https://www.postgresql.org/message-id/ZmmGG4e%2BqTBD2kfn%40ip-10-97-1-34.eu-west-3.compute.internal

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Track the amount of time waiting due to cost_delay

2024-06-12 Thread Bertrand Drouvot
Hi,

On Tue, Jun 11, 2024 at 02:48:30PM -0400, Robert Haas wrote:
> On Tue, Jun 11, 2024 at 2:47 PM Nathan Bossart  
> wrote:
> > I'm struggling to think of a scenario in which the number of waits would be
> > useful, assuming you already know the amount of time spent waiting.

If we provide the actual time spent waiting, providing the number of waits would
allow one to see whether there is a difference between the actual time and the
intended time (i.e. number of waits * cost_delay, should the cost_delay remain
the same during the vacuum duration). That should trigger some thoughts if the
difference is large enough.

I think that what we are doing here is somehow adding instrumentation around the
"WAIT_EVENT_VACUUM_DELAY" wait event. If we were to add instrumentation for wait
events (generally speaking) we'd probably also expose the number of waits per
wait event (in addition to the time waited).
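
Purely as a hypothetical sketch, such generic per-wait-event instrumentation
could expose something like (nothing like this exists in core today; the names
are made up):

typedef struct WaitEventCounters
{
    uint64      waits;          /* number of times the wait event was hit */
    uint64      wait_time_us;   /* cumulative time spent waiting, in microseconds */
} WaitEventCounters;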

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Track the amount of time waiting due to cost_delay

2024-06-12 Thread Bertrand Drouvot
Hi,

On Tue, Jun 11, 2024 at 11:40:36AM -0500, Nathan Bossart wrote:
> On Tue, Jun 11, 2024 at 07:25:11AM +0000, Bertrand Drouvot wrote:
> > So I think that in v2 we could: 1) measure the actual wait time instead, 2)
> > count the number of times the vacuum slept. We could also 3) reports the
> > effective cost limit (as proposed by Nathan up-thread) (I think that 3) 
> > could
> > be misleading but I'll yield to majority opinion if people think it's not).
> 
> I still think the effective cost limit would be useful, if for no other
> reason than to help reinforce that it is distributed among the autovacuum
> workers.

I also think it can be useful; my concern is more about putting this information
in pg_stat_progress_vacuum. What about Sawada-san's proposal in [1]? (We could
create a new view that would contain that data: per-worker dobalance, cost_limit,
cost_delay, active, and failsafe.)

[1]: 
https://www.postgresql.org/message-id/CAD21AoDOu%3DDZcC%2BPemYmCNGSwbgL1s-5OZkZ1Spd5pSxofWNCw%40mail.gmail.com

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Allow logical failover slots to wait on synchronous replication

2024-06-11 Thread Bertrand Drouvot
Hi,

On Mon, Jun 10, 2024 at 09:25:10PM -0500, Nathan Bossart wrote:
> On Mon, Jun 10, 2024 at 03:51:05PM -0700, John H wrote:
> > The existing 'standby_slot_names' isn't great for users who are running
> > clusters with quorum-based synchronous replicas. For instance, if
> > the user has  synchronous_standby_names = 'ANY 3 (A,B,C,D,E)' it's a
> > bit tedious to have to reconfigure the standby_slot_names to set it to
> > the most updated 3 sync replicas whenever different sync replicas start
> > lagging. In the event that both GUCs are set, 'standby_slot_names' takes
> > precedence.
> 
> Hm.  IIUC you'd essentially need to set standby_slot_names to "A,B,C,D,E"
> to get the desired behavior today.  That might ordinarily be okay, but it
> could cause logical replication to be held back unnecessarily if one of the
> replicas falls behind for whatever reason.  A way to tie standby_slot_names
> to synchronous replication instead does seem like it would be useful in
> this case.

FWIW, I have the same understanding and also think your proposal would be
useful in this case.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Track the amount of time waiting due to cost_delay

2024-06-11 Thread Bertrand Drouvot
Hi,

On Mon, Jun 10, 2024 at 05:58:13PM -0400, Robert Haas wrote:
> I'm always suspicious of this sort of thing. I tend to find nothing
> gives me the right answer unless I assume that the actual sleep times
> are randomly and systematically different from the intended sleep
> times but arbitrarily large amounts. I think we should at least do
> some testing: if we measure both the intended sleep time and the
> actual sleep time, how close are they? Does it change if the system is
> under crushing load (which might elongate sleeps) or if we spam
> SIGUSR1 against the vacuum process (which might shorten them)?

Though I (now) think that it would make sense to record the actual delay time 
instead (see [1]), I think it's interesting to do some testing as you suggested.

With record_actual_time.txt (attached) applied on top of v1, we can see the
intended and actual wait time.

On my system, with "no load at all" except the vacuum running, I see no difference:

 Tue Jun 11 09:22:06 2024 (every 1s)

  pid  | relid | phase | time_delayed | actual_time_delayed |
duration
---+---+---+--+-+-
 54754 | 16385 | scanning heap |41107 |   41107 | 
00:00:42.301851
(1 row)

 Tue Jun 11 09:22:07 2024 (every 1s)

  pid  | relid | phase | time_delayed | actual_time_delayed |
duration
---+---+---+--+-+-
 54754 | 16385 | scanning heap |42076 |   42076 | 
00:00:43.301848
(1 row)

 Tue Jun 11 09:22:08 2024 (every 1s)

  pid  | relid | phase | time_delayed | actual_time_delayed |
duration
---+---+---+--+-+-
 54754 | 16385 | scanning heap |43045 |   43045 | 
00:00:44.301854
(1 row)

But if I launch pg_reload_conf() 10 times in a row, I can see:

 Tue Jun 11 09:22:09 2024 (every 1s)

  pid  | relid | phase | time_delayed | actual_time_delayed |
duration
---+---+---+--+-+-
 54754 | 16385 | scanning heap |44064 |   44034 | 
00:00:45.302965
(1 row)

 Tue Jun 11 09:22:10 2024 (every 1s)

  pid  | relid | phase | time_delayed | actual_time_delayed |
duration
---+---+---+--+-+-
 54754 | 16385 | scanning heap |45033 |   45003 | 
00:00:46.301858


As we can see the actual wait time is 30ms less than the intended wait time with
this simple test. So I still think we should go with 1) actual wait time and 2)
report the number of waits (as mentioned in [1]). Does that make sense to you?


[1]: 
https://www.postgresql.org/message-id/Zmf712A5xcOM9Hlg%40ip-10-97-1-34.eu-west-3.compute.internal

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 1345e99dcb..e4ba8de00a 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1222,7 +1222,7 @@ CREATE VIEW pg_stat_progress_vacuum AS
 S.param4 AS heap_blks_vacuumed, S.param5 AS index_vacuum_count,
 S.param6 AS max_dead_tuple_bytes, S.param7 AS dead_tuple_bytes,
 S.param8 AS indexes_total, S.param9 AS indexes_processed,
-S.param10 AS time_delayed
+S.param10 AS time_delayed, S.param11 AS actual_time_delayed
 FROM pg_stat_get_progress_info('VACUUM') AS S
 LEFT JOIN pg_database D ON S.datid = D.oid;
 
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 2551408a86..bbb5002efe 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -2381,18 +2381,29 @@ vacuum_delay_point(void)
/* Nap if appropriate */
if (msec > 0)
{
+   instr_time  delay_start;
+   instr_time  delay_time;
+
if (msec > vacuum_cost_delay * 4)
msec = vacuum_cost_delay * 4;
 
pgstat_report_wait_start(WAIT_EVENT_VACUUM_DELAY);
+   INSTR_TIME_SET_CURRENT(delay_start);
pg_usleep(msec * 1000);
+   INSTR_TIME_SET_CURRENT(delay_time);
pgstat_report_wait_end();
/* Report the amount of time we slept */
+   INSTR_TIME_SUBTRACT(delay_time, delay_start);
if (VacuumSharedCostBalance != NULL)
+   {

pgstat_progress_parallel_incr_param(PROGRESS_VACUUM_TIME_DELAYED, msec);
+   
pgstat_progress_parallel_incr_param(PROGRESS_VACUUM_ACTUAL_TIME_DELA

Re: Track the amount of time waiting due to cost_delay

2024-06-11 Thread Bertrand Drouvot
Hi,

On Tue, Jun 11, 2024 at 04:07:05PM +0900, Masahiko Sawada wrote:

> Thank you for the proposal and the patch. I understand the motivation
> of this patch.

Thanks for looking at it!

> Beside the point Nathan mentioned, I'm slightly worried
> that massive parallel messages could be sent to the leader process
> when the cost_limit value is low.

I see, I can/will do some testing in this area and share the numbers.

> 
> FWIW when I want to confirm the vacuum delay effect, I often use the
> information from the DEBUG2 log message in VacuumUpdateCosts()
> function. Exposing these data (per-worker dobalance, cost_lmit,
> cost_delay, active, and failsafe) somewhere in a view might also be
> helpful for users for checking vacuum delay effects.

Do you mean adding time_delayed to pg_stat_progress_vacuum, and cost_limit plus
the other data you mentioned above in another dedicated view?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Track the amount of time waiting due to cost_delay

2024-06-11 Thread Bertrand Drouvot
Hi,

On Mon, Jun 10, 2024 at 05:58:13PM -0400, Robert Haas wrote:
> On Mon, Jun 10, 2024 at 11:36 AM Nathan Bossart
>  wrote:
> > Hm.  Should we measure the actual time spent sleeping, or is a rough
> > estimate good enough?  I believe pg_usleep() might return early (e.g., if
> > the process is signaled) or late, so this field could end up being
> > inaccurate, although probably not by much.  If we're okay with millisecond
> > granularity, my first instinct is that what you've proposed is fine, but I
> > figured I'd bring it up anyway.
> 
> I bet you could also sleep for longer than planned, throwing the
> numbers off in the other direction.

Thanks for looking at it! Agree, that's how I read "or late" from Nathan's
comment above.

> I'm always suspicious of this sort of thing. I tend to find nothing
> gives me the right answer unless I assume that the actual sleep times
> are randomly and systematically different from the intended sleep
> times but arbitrarily large amounts. I think we should at least do
> some testing: if we measure both the intended sleep time and the
> actual sleep time, how close are they? Does it change if the system is
> under crushing load (which might elongate sleeps) or if we spam
> SIGUSR1 against the vacuum process (which might shorten them)?

OTOH Sami proposed in [1] to count the number of times the vacuum went into
delay. I think that's a good idea. His idea makes me think that, in addition to
the number of waits, it would also make sense to measure the "actual" sleep time
rather than the intended one: one could then compare the intended wait time
(number of waits * cost delay, assuming the cost delay does not change during
the vacuum) with the actual measured wait time.

So I think that in v2 we could: 1) measure the actual wait time instead, and
2) count the number of times the vacuum slept. We could also 3) report the
effective cost limit (as proposed by Nathan up-thread), though I think 3) could
be misleading; I'll yield to the majority opinion if people think it's not.
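
For what it's worth, a rough sketch of what 1) could look like inside
vacuum_delay_point() (illustration only, not the actual v2 change; it reuses the
PROGRESS_VACUUM_TIME_DELAYED counter from v1 and the parallel reporting function
added by f1889729dd):

    /* sketch only: measure the real sleep time instead of the intended one */
    instr_time  delay_start;
    instr_time  delay_end;

    INSTR_TIME_SET_CURRENT(delay_start);
    pg_usleep(msec * 1000);             /* may return early or late */
    INSTR_TIME_SET_CURRENT(delay_end);
    INSTR_TIME_SUBTRACT(delay_end, delay_start);

    if (VacuumSharedCostBalance != NULL)
        pgstat_progress_parallel_incr_param(PROGRESS_VACUUM_TIME_DELAYED,
                                            (int64) INSTR_TIME_GET_MILLISEC(delay_end));
    else
        pgstat_progress_incr_param(PROGRESS_VACUUM_TIME_DELAYED,
                                   (int64) INSTR_TIME_GET_MILLISEC(delay_end));
    /* a second counter could be incremented here to implement 2) */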

Thoughts?


[1]: 
https://www.postgresql.org/message-id/A0935130-7C4B-4094-B6E4-C7D5086D50EF%40amazon.com

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Track the amount of time waiting due to cost_delay

2024-06-11 Thread Bertrand Drouvot
Hi,

On Mon, Jun 10, 2024 at 08:12:46PM +, Imseih (AWS), Sami wrote:
> >> This sounds like useful information to me.
> 
> > Thanks for looking at it!
> 
> The  VacuumDelay is the only visibility available to
> gauge the cost_delay. Having this information
> advertised by pg_stat_progress_vacuum as is being proposed
> is much better.

Thanks for looking at it!

> However, I also think that the
> "number of times"  the vacuum went into delay will be needed
> as well. Both values will be useful to tune cost_delay and cost_limit. 

Yeah, I think that's a good idea. With v1 one could figure out how many times
the delay has been triggered, but that does not work anymore if: 1) cost_delay
changed during the vacuum, or 2) the patch changes the way time_delayed is
measured/reported (meaning it reports the actual wait time and not the
theoretical time as v1 does).

> 
> It may also make sense to accumulate the total_time in delay
> and the number of times delayed in a cumulative statistics [0]
> view to allow a user to trend this information overtime.
> I don't think this info fits in any of the existing views, i.e.
> pg_stat_database, so maybe a new view for cumulative
> vacuum stats may be needed. This is likely a separate
> discussion, but calling it out here.

+1

> >> IIUC you'd need to get information from both pg_stat_progress_vacuum and
> >> pg_stat_activity in order to know what percentage of time was being spent
> >> in cost delay.  Is that how you'd expect for this to be used in practice?
> 
> > Yeah, one could use a query such as:
> 
> > select p.*, now() - a.xact_start as duration from pg_stat_progress_vacuum p 
> > JOIN pg_stat_activity a using (pid)
> 
> Maybe all  progress views should just expose the 
> "beentry->st_activity_start_timestamp " 
> to let the user know when the current operation began.

Yeah maybe, I think this is likely a separate discussion too, thoughts?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Track the amount of time waiting due to cost_delay

2024-06-11 Thread Bertrand Drouvot
On Mon, Jun 10, 2024 at 02:20:16PM -0500, Nathan Bossart wrote:
> On Mon, Jun 10, 2024 at 05:48:22PM +0000, Bertrand Drouvot wrote:
> > On Mon, Jun 10, 2024 at 10:36:42AM -0500, Nathan Bossart wrote:
> >> I wonder if we should also
> >> surface the effective cost limit for each autovacuum worker.
> > 
> > I'm not sure about it as I think that it could be misleading: one could 
> > query
> > pg_stat_progress_vacuum and conclude that the time_delayed he is seeing is
> > due to _this_ cost_limit. But that's not necessarily true as the cost_limit 
> > could
> > have changed multiple times since the vacuum started. So, unless there is
> > frequent sampling on pg_stat_progress_vacuum, displaying the time_delayed 
> > and
> > the cost_limit could be misleading IMHO.
> 
> Well, that's true for the delay, too, right (at least as of commit
> 7d71d3d)?

Yeah, right, but the patch exposes the total amount of time the vacuum has
been delayed (not the cost_delay per se), which does not sound misleading to me.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Track the amount of time waiting due to cost_delay

2024-06-10 Thread Bertrand Drouvot
Hi,

On Mon, Jun 10, 2024 at 10:36:42AM -0500, Nathan Bossart wrote:
> On Mon, Jun 10, 2024 at 06:05:13AM +0000, Bertrand Drouvot wrote:
> > During the last pgconf.dev I attended Robert's presentation about
> > autovacuum and
> > it reminded me of an idea I had some time ago: $SUBJECT
> 
> This sounds like useful information to me.

Thanks for looking at it!

> I wonder if we should also
> surface the effective cost limit for each autovacuum worker.

I'm not sure about it as I think that it could be misleading: one could query
pg_stat_progress_vacuum and conclude that the time_delayed he is seeing is
due to _this_ cost_limit. But that's not necessarily true, as the cost_limit
could have changed multiple times since the vacuum started. So, unless there is
frequent sampling of pg_stat_progress_vacuum, displaying the time_delayed and
the cost_limit together could be misleading IMHO.

> > Currently one can change [autovacuum_]vacuum_cost_delay and
> > [auto vacuum]vacuum_cost_limit but has no reliable way to measure the 
> > impact of
> > the changes on the vacuum duration: one could observe the vacuum duration
> > variation but the correlation to the changes is not accurate (as many other
> > factors could impact the vacuum duration (load on the system, i/o 
> > latency,...)).
> 
> IIUC you'd need to get information from both pg_stat_progress_vacuum and
> pg_stat_activity in order to know what percentage of time was being spent
> in cost delay.  Is that how you'd expect for this to be used in practice?

Yeah, one could use a query such as:

select p.*, now() - a.xact_start as duration from pg_stat_progress_vacuum p 
JOIN pg_stat_activity a using (pid)

for example. Would it be worth providing an example somewhere in the doc?

> > pgstat_report_wait_start(WAIT_EVENT_VACUUM_DELAY);
> > pg_usleep(msec * 1000);
> > pgstat_report_wait_end();
> > +   /* Report the amount of time we slept */
> > +   if (VacuumSharedCostBalance != NULL)
> > +   pgstat_progress_parallel_incr_param(PROGRESS_VACUUM_TIME_DELAYED, msec);
> > +   else
> > +   pgstat_progress_incr_param(PROGRESS_VACUUM_TIME_DELAYED, msec);
> 
> Hm.  Should we measure the actual time spent sleeping, or is a rough
> estimate good enough?  I believe pg_usleep() might return early (e.g., if
> the process is signaled) or late, so this field could end up being
> inaccurate, although probably not by much.  If we're okay with millisecond
> granularity, my first instinct is that what you've proposed is fine, but I
> figured I'd bring it up anyway.

Thanks for bringing that up! I had the same thought when writing the code and
came to the same conclusion. I think that's a good enough estimation, especially
during a long-running vacuum (which is probably the case users care about the
most).

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: relfilenode statistics

2024-06-10 Thread Bertrand Drouvot
Hi,

On Fri, Jun 07, 2024 at 09:24:41AM -0400, Robert Haas wrote:
> On Thu, Jun 6, 2024 at 11:17 PM Andres Freund  wrote:
> > If we just want to keep prior stats upon arelation rewrite, we can just copy
> > the stats from the old relfilenode.  Or we can decide that those stats don't
> > really make sense anymore, and start from scratch.
> 
> I think we need to think carefully about what we want the user
> experience to be here. "Per-relfilenode stats" could mean "sometimes I
> don't know the relation OID so I want to use the relfilenumber
> instead, without changing the user experience" or it could mean "some
> of these stats actually properly pertain to the relfilenode rather
> than the relation so I want to associate them with the right object
> and that will affect how the user sees things." We need to decide
> which it is. If it's the former, then we need to examine whether the
> goal of hiding the distinction between relfilenode stats and relation
> stats from the user is in fact feasible. If it's the latter, then we
> need to make sure the whole patch reflects that design, which would
> include e.g. NOT copying stats from the old to the new relfilenode,
> and which would also include documenting the behavior in a way that
> will be understandable to users.

Thanks for sharing your thoughts!

Let's take the current heap_blks_read as an example: it currently survives
a relation rewrite and I guess we don't want to change the existing user
experience for it.

Now say we want to add "heap_blks_written" (like in this POC patch): then I
think it makes sense for the user to 1) query this new stat from the same place
as the existing heap_blks_read (pg_statio_all_tables), and 2) have the same
experience as far as the relation rewrite is concerned (keep the previous stats).

To achieve the rewrite behavior we could:

1) copy the stats from the OLD relfilenode to the relation (like in the POC 
patch)
2) copy the stats from the OLD relfilenode to the NEW one (could be in a 
dedicated
field)

The pros of 1) are that the behavior is consistent with the current
heap_blks_read and that the user could still see the current relfilenode stats
(through a new API) if they want to.

> In my experience, the worst thing you can do in cases like this is be
> somewhere in the middle. Then you tend to end up with stuff like: the
> difference isn't supposed to be something that the user knows or cares
> about, except that they do have to know and care because you haven't
> thoroughly covered up the deception, and often they have to reverse
> engineer the behavior because you didn't document what was really
> happening because you imagined that they wouldn't notice.

My idea was to move all that is in pg_statio_all_tables to relfilenode stats
and 1) add new stats to pg_statio_all_tables (like heap_blks_written), 2) ensure
the user can still retrieve the stats from pg_statio_all_tables in such a way
that it survives a rewrite, 3) provide dedicated APIs to retrieve
relfilenode stats but only for the current relfilenode, 4) document this
behavior. This is what the POC patch is doing for heap_blks_written (would
need to do the same for heap_blks_read and friends) except for the documentation
part. What do you think?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Proposal to add page headers to SLRU pages

2024-06-10 Thread Bertrand Drouvot
Hi,

On Tue, Mar 19, 2024 at 06:48:33AM +, Li, Yong wrote:
> 
> Unfortunately, the test requires a setup of two different versions of PG. I 
> am not
> aware of an existing test infrastructure which can run automated tests using 
> two
> PGs. I did manually verify the output of pg_upgrade. 

I think there is something in t/002_pg_upgrade.pl (see 
src/bin/pg_upgrade/TESTING),
that could be used to run automated tests using an old and a current version.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed

2024-06-10 Thread Bertrand Drouvot
Hi,

On Mon, Jun 10, 2024 at 03:39:34PM +0900, Michael Paquier wrote:
> On Mon, Jun 10, 2024 at 06:29:17AM +0000, Bertrand Drouvot wrote:
> > Thanks for the report! I think it makes sense to add it to the list of known
> > failures.
> > 
> > One way to deal with those corner cases could be to make use of injection 
> > points
> > around places where RUNNING_XACTS is emitted, thoughts?
> 
> Ah.  You mean to force a wait in the code path generating the standby
> snapshots for the sake of this test?

Yeah.

>  That's interesting, I like it.

Great, will look at it.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed

2024-06-10 Thread Bertrand Drouvot
Hi Alexander,

On Sat, Jun 08, 2024 at 07:00:00AM +0300, Alexander Lakhin wrote:
> Hello Bertrand and Michael,
> 
> 23.01.2024 11:07, Bertrand Drouvot wrote:
> > On Tue, Jan 23, 2024 at 02:50:06PM +0900, Michael Paquier wrote:
> > 
> > > Anyway, that's not the end of it.  What should we do for snapshot
> > > snapshot records coming from the bgwriter?
> > What about?
> > 
> > 3) depending on how stabilized this test (and others that suffer from 
> > "random"
> > xl_running_xacts) is, then think about the bgwriter.
> 
> A recent buildfarm failure [1] reminds me of that remaining question.
> 
> It's hard to determine from this info, why row_removal_activeslot was not
> invalidated, but running this test on a slowed down Windows VM, I (still)
> get the same looking failures caused by RUNNING_XACTS appeared just before
> `invalidating obsolete replication slot "row_removal_inactiveslot"`.
> So I would consider this failure as yet another result of bgwriter activity
> and add it to the list of known failures as such...

Thanks for the report! I think it makes sense to add it to the list of known
failures.

One way to deal with those corner cases could be to make use of injection points
around places where RUNNING_XACTS is emitted, thoughts?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Track the amount of time waiting due to cost_delay

2024-06-10 Thread Bertrand Drouvot
Hi hackers,

During the last pgconf.dev I attended Robert’s presentation about autovacuum [1]
and it reminded me of an idea I had some time ago: $SUBJECT

Please find attached a patch doing so by adding a new field (aka "time_delayed")
to the pg_stat_progress_vacuum view. 

Currently one can change [autovacuum_]vacuum_cost_delay and
[autovacuum_]vacuum_cost_limit but has no reliable way to measure the impact of
the changes on the vacuum duration: one could observe the vacuum duration
variation, but the correlation to the changes is not accurate, as many other
factors could impact the vacuum duration (load on the system, I/O latency, ...).

This new field reports the time that the vacuum has to sleep due to cost delay:
it could be useful to 1) measure the impact of the current cost_delay and
cost_limit settings, and 2) evaluate new values when experimenting (and so help
with decision making for those parameters).

The patch is relatively small thanks to the work that has been done in
f1889729dd (to allow parallel workers to report to the leader).

[1]: 
https://www.pgevents.ca/events/pgconfdev2024/schedule/session/29-how-autovacuum-goes-wrong-and-can-we-please-make-it-stop-doing-that/

Looking forward to your feedback,

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
>From 750dfc26cd6fcf5a5618c3fe35fc42d5b5c66f00 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot 
Date: Thu, 6 Jun 2024 12:35:57 +
Subject: [PATCH v1] Report the total amount of time that vacuum has been
 delayed due to cost delay

This commit adds one column: time_delayed to the pg_stat_progress_vacuum system
view to show the total amount of time in milliseconds that vacuum has been
delayed.

This uses the new parallel message type for progress reporting added
by f1889729dd.

Bump catversion because this changes the definition of pg_stat_progress_vacuum.
---
 doc/src/sgml/monitoring.sgml | 11 +++
 src/backend/catalog/system_views.sql |  3 ++-
 src/backend/commands/vacuum.c|  6 ++
 src/include/catalog/catversion.h |  2 +-
 src/include/commands/progress.h  |  1 +
 src/test/regress/expected/rules.out  |  3 ++-
 6 files changed, 23 insertions(+), 3 deletions(-)
  47.6% doc/src/sgml/
   3.7% src/backend/catalog/
  26.8% src/backend/commands/
   7.5% src/include/catalog/
   4.1% src/include/commands/
  10.0% src/test/regress/expected/

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 053da8d6e4..cdd0f0e533 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -6290,6 +6290,17 @@ FROM pg_stat_get_backend_idset() AS backendid;
cleaning up indexes.
   
  
+
+ 
+  
+   time_delayed bigint
+  
+  
+   Total amount of time spent in milliseconds waiting due to vacuum_cost_delay
+   or autovacuum_vacuum_cost_delay. In case of parallel
+   vacuum the reported time is across all the workers and the leader.
+  
+ 
 

   
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 53047cab5f..1345e99dcb 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1221,7 +1221,8 @@ CREATE VIEW pg_stat_progress_vacuum AS
 S.param2 AS heap_blks_total, S.param3 AS heap_blks_scanned,
 S.param4 AS heap_blks_vacuumed, S.param5 AS index_vacuum_count,
 S.param6 AS max_dead_tuple_bytes, S.param7 AS dead_tuple_bytes,
-S.param8 AS indexes_total, S.param9 AS indexes_processed
+S.param8 AS indexes_total, S.param9 AS indexes_processed,
+S.param10 AS time_delayed
 FROM pg_stat_get_progress_info('VACUUM') AS S
 LEFT JOIN pg_database D ON S.datid = D.oid;
 
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 48f8eab202..2551408a86 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -40,6 +40,7 @@
 #include "catalog/pg_inherits.h"
 #include "commands/cluster.h"
 #include "commands/defrem.h"
+#include "commands/progress.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
 #include "nodes/makefuncs.h"
@@ -2386,6 +2387,11 @@ vacuum_delay_point(void)
 		pgstat_report_wait_start(WAIT_EVENT_VACUUM_DELAY);
 		pg_usleep(msec * 1000);
 		pgstat_report_wait_end();
+		/* Report the amount of time we slept */
+		if (VacuumSharedCostBalance != NULL)
+			pgstat_progress_parallel_incr_param(PROGRESS_VACUUM_TIME_DELAYED, msec);
+		else
+			pgstat_progress_incr_param(PROGRESS_VACUUM_TIME_DELAYED, msec);
 
 		/*
 		 * We don't want to ignore postmaster death during very long vacuums
diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h
index f0809c0e58..40b4f1d1e4 100644
--- a/src/include/catalog/catversion.h
+++ b/src/include/catalog/catversion.h
@@ -57,6 +5

Re: relfilenode statistics

2024-06-07 Thread Bertrand Drouvot
Hi,

On Thu, Jun 06, 2024 at 08:17:36PM -0700, Andres Freund wrote:
> Hi,
> 
> On 2024-06-06 12:27:49 -0400, Robert Haas wrote:
> > On Wed, Jun 5, 2024 at 1:52 AM Bertrand Drouvot
> >  wrote:
> > > I think we should keep the stats in the relation during relfilenode 
> > > changes.
> > > As a POC, v1 implemented a way to do so during TRUNCATE (see the changes 
> > > in
> > > table_relation_set_new_filelocator() and in pg_statio_all_tables): as you 
> > > can
> > > see in the example provided up-thread the new heap_blks_written statistic 
> > > has
> > > been preserved during the TRUNCATE.
> >
> > Yeah, I think there's something weird about this design. Somehow we're
> > ending up with both per-relation and per-relfilenode counters:
> >
> > +   pg_stat_get_blocks_written(C.oid) +
> > pg_stat_get_relfilenode_blocks_written(d.oid, CASE WHEN
> > C.reltablespace <> 0 THEN C.reltablespace ELSE d.dattablespace END,
> > C.relfilenode) AS heap_blks_written,
> >
> > I'll defer to Andres if he thinks that's awesome, but to me it does
> > not seem right to track some blocks written in a per-relation counter
> > and others in a per-relfilenode counter.
> 
> It doesn't immediately sound awesome. Nor really necessary?
> 
> If we just want to keep prior stats upon arelation rewrite, we can just copy
> the stats from the old relfilenode.

Agreed, that's another option. But I think that would be better in another
field like "cumulative_XXX", to ensure one could still retrieve the stats that
are "dedicated" to this particular "new" relfilenode. Thoughts?

> Or we can decide that those stats don't
> really make sense anymore, and start from scratch.
> 
> 
> I *guess* I could see an occasional benefit in having both counter for "prior
> relfilenodes" and "current relfilenode" - except that stats get reset manually
> and upon crash anyway, making this less useful than if it were really
> "lifetime" stats.

Right, but currently they are not lost during a relation rewrite. If we decide
not to keep the relfilenode stats during a rewrite, then things like
heap_blks_read would stop surviving a rewrite (if we move them to relfilenode
stats) while they currently do.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: relfilenode statistics

2024-06-07 Thread Bertrand Drouvot
Hi,

On Thu, Jun 06, 2024 at 08:38:06PM -0700, Andres Freund wrote:
> Hi,
> 
> On 2024-06-03 11:11:46 +, Bertrand Drouvot wrote:
> > The main argument is that we currently don’t have writes counters for 
> > relations.
> > The reason is that we don’t have the relation OID when writing buffers out.
> > Tracking writes per relfilenode would allow us to track/consolidate writes 
> > per
> > relation (example in the v1 patch and in the message up-thread).
> > 
> > I think that adding instrumentation in this area (writes counters) could be
> > beneficial (like it is for the ones we currently have for reads).
> > 
> > Second argument is that this is also beneficial for the "Split index and
> > table statistics into different types of stats" thread (mentioned in the 
> > previous
> > message). It would allow us to avoid additional branches in some situations 
> > (like
> > the one mentioned by Andres in the link I provided up-thread).
> 
> I think there's another *very* significant benefit:
> 
> Right now physical replication doesn't populate statistics fields like
> n_dead_tup, which can be a huge issue after failovers, because there's little
> information about what autovacuum needs to do.
> 
> Auto-analyze *partially* can fix it at times, if it's lucky enough to see
> enough dead tuples - but that's not a given and even if it works, is often
> wildly inaccurate.
> 
> 
> Once we put things like n_dead_tup into per-relfilenode stats,

Hm - I had in mind to populate relfilenode stats only with stats that are
somehow related to I/O activities. Which ones do you have in mind to put in 
relfilenode stats?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Avoid orphaned objects dependencies, take 3

2024-06-07 Thread Bertrand Drouvot
Hi,

On Thu, Jun 06, 2024 at 04:00:23PM -0400, Robert Haas wrote:
> On Thu, Jun 6, 2024 at 1:56 AM Bertrand Drouvot
>  wrote:
> > v9 is more invasive (as it changes code in much more places) than v8 but it 
> > is
> > easier to follow (as it is now clear where the new lock is acquired).
> 
> Hmm, this definitely isn't what I had in mind. Possibly that's a sign
> that what I had in mind was dumb, but for sure it's not what I
> imagined. What I thought you were going to do was add calls like
> LockDatabaseObject(NamespaceRelationId, schemaid, 0, AccessShareLock)
> in various places, or perhaps LockRelationOid(reloid,
> AccessShareLock), or whatever the case may be.

I see what you’re saying, doing things like:

LockDatabaseObject(TypeRelationId, returnType, 0, AccessShareLock);
in ProcedureCreate() for example.
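
For illustration, a minimal sketch of that call-site style (not what v9 does;
the existence check shown is just one way it could be written):

    /* hypothetical sketch: lock the referenced type, then verify it still exists */
    LockDatabaseObject(TypeRelationId, returnType, 0, AccessShareLock);

    if (!SearchSysCacheExists1(TYPEOID, ObjectIdGetDatum(returnType)))
        ereport(ERROR,
                (errcode(ERRCODE_UNDEFINED_OBJECT),
                 errmsg("type with OID %u does not exist", returnType)));

    /* ... then record the dependency on returnType as today ... */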

> Here you've got stuff
> like this:
> 
> - record_object_address_dependencies(, addrs_auto,
> -DEPENDENCY_AUTO);
> + lock_record_object_address_dependencies(, addrs_auto,
> + DEPENDENCY_AUTO);
> 
> ...which to me looks like the locking is still pushed down inside the
> dependency code.

Yes, but it’s now located in places where, I think, it’s easier to understand
what’s going on (as compared to v8), except maybe for:

recordDependencyOnExpr()
makeOperatorDependencies()
GenerateTypeDependencies()
makeParserDependencies()
makeDictionaryDependencies()
makeTSTemplateDependencies()
makeConfigurationDependencies()

but probably for:

heap_create_with_catalog()
StorePartitionKey()
index_create()
AggregateCreate()
CastCreate()
CreateConstraintEntry()
ProcedureCreate()
RangeCreate()
InsertExtensionTuple()
CreateTransform()
CreateProceduralLanguage()

The reasons I keep it linked to the dependency code are:

- To ensure we don’t miss anything (well, with the new Assert in place that’s
probably a tangential argument)

- It’s not only about locking the object: it’s also about 1) verifying the
object is pinned, 2) checking it still exists, and 3) providing a description in
the error message if we can (in case the object does not exist anymore). Relying
on an already-built object (in the dependency code) avoids 1) defining the
object(s) one more time and 2) creating new functions that would do the same as
isObjectPinned() and getObjectDescription() with a different set of arguments.

Those may sound like weak arguments, but that has been my reasoning.

Do you still find the code hard to maintain with v9?

> 
> And you also have stuff like this:
> 
>   ObjectAddressSet(referenced, RelationRelationId, childTableId);
> + depLockAndCheckObject();
>   recordDependencyOn(, , DEPENDENCY_PARTITION_SEC);
> 
> But in depLockAndCheckObject you have:
> 
> + if (object->classId == RelationRelationId || object->classId ==
> AuthMemRelationId)
> + return;
> 
> That doesn't seem right, because then it seems like the call isn't
> doing anything, but there isn't really any reason for it to not be
> doing anything. If we're dropping a dependency on a table, then it
> seems like we need to have a lock on that table. Presumably the reason
> why we don't end up with dangling dependencies in such cases now is
> because we're careful about doing LockRelation() in the right places,

Yeah, that's what I think: we're already careful when we deal with relations.

> but we're not similarly careful about other operations e.g.
> ConstraintSetParentConstraint is called by DefineIndex which calls
> table_open(childRelId, ...) first, but there's no logic in DefineIndex
> to lock the constraint.

table_open(childRelId, ...) would lock any "ALTER TABLE  DROP 
CONSTRAINT"
already. Not sure I understand your concern here.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: How about using dirty snapshots to locate dependent objects?

2024-06-06 Thread Bertrand Drouvot
On Thu, Jun 06, 2024 at 05:59:00PM +0530, Ashutosh Sharma wrote:
> Hello everyone,
> 
> At present, we use MVCC snapshots to identify dependent objects. This
> implies that if a new dependent object is inserted within a transaction
> that is still ongoing, our search for dependent objects won't include this
> recently added one. Consequently, if someone attempts to drop the
> referenced object, it will be dropped, and when the ongoing transaction
> completes, we will end up having an entry for a referenced object that has
> already been dropped. This situation can lead to an inconsistent state.
> Below is an example illustrating this scenario:
> 
> Session 1:
> - create table t1(a int);
> - insert into t1 select i from generate_series(1, 1000) i;
> - create extension btree_gist;
> - create index i1 on t1 using gist( a );
> 
> Session 2: (While the index creation in session 1 is in progress, drop the
> btree_gist extension)
> - drop extension btree_gist;
> 
> Above command succeeds and so does the create index command running in
> session 1, post this, if we try running anything on table t1, i1, it fails
> with an error: "cache lookup failed for opclass ..."
> 
> Attached is the patch that I have tried, which seems to be working for me.
> It's not polished and thoroughly tested, but just sharing here to clarify
> what I am trying to suggest. Please have a look and let me know your
> thoughts.

Thanks for the patch proposal!

The patch does not fix the other way around:

- session 1: BEGIN; DROP extension btree_gist;
- session 2: create index i1 on t1 using gist( a );
- session 1: commits while session 2 is creating the index

and does not address all the possible orphaned dependencies cases.

There is an ongoing thread (see [1]) to fix the orphaned dependencies issue.

v9 attached in [1] fixes the case you describe here.

[1]: 
https://www.postgresql.org/message-id/flat/ZiYjn0eVc7pxVY45%40ip-10-97-1-34.eu-west-3.compute.internal

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: relfilenode statistics

2024-06-06 Thread Bertrand Drouvot
Hi,

On Mon, Jun 03, 2024 at 11:11:46AM +, Bertrand Drouvot wrote:
> Yeah, I’ll update the commit message in V2 with better explanations once I get
> feedback on V1 (should we decide to move on with the relfilenode stats idea).
> 

Please find attached v2, a mandatory rebase due to cd312adc56. In passing it
provides a more detailed commit message (also making clear that the goal of this
patch is to start the discussion and agree on the design before moving forward).

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
>From 81d25e077c9f4eafa5304c257d1b39ee8a811628 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot 
Date: Thu, 16 Nov 2023 02:30:01 +
Subject: [PATCH v2] Provide relfilenode statistics
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

We currently don’t have writes counters for relations.
The reason is that we don’t have the relation OID when writing buffers out.
Tracking writes per relfilenode would allow us to track/consolidate writes per
relation.

relfilenode stats is also beneficial for the "Split index and table statistics
into different types of stats" work in progress: it would allow us to avoid
additional branches in some situations.

=== Remarks ===

This is a POC patch. There is still work to do: there are more places where we
should add relfilenode counters, more APIs to create to retrieve the relfilenode
stats, and while the patch takes care of the rewrite generated by TRUNCATE,
there is more to take care of, like CLUSTER and VACUUM FULL.

The new logic to retrieve stats in pg_statio_all_tables has been implemented
only for the new blocks_written stat (we'd need to do the same for the existing
buffer read / buffer hit if we agree on the approach implemented here).

The goal of this patch is to start the discussion and agree on the design before
moving forward.
---
 src/backend/access/rmgrdesc/xactdesc.c|   5 +-
 src/backend/catalog/storage.c |   8 ++
 src/backend/catalog/system_functions.sql  |   2 +-
 src/backend/catalog/system_views.sql  |   5 +-
 src/backend/postmaster/checkpointer.c |   5 +
 src/backend/storage/buffer/bufmgr.c   |   6 +-
 src/backend/storage/smgr/md.c |   7 ++
 src/backend/utils/activity/pgstat.c   |  39 --
 src/backend/utils/activity/pgstat_database.c  |  12 +-
 src/backend/utils/activity/pgstat_function.c  |  13 +-
 src/backend/utils/activity/pgstat_relation.c  | 112 --
 src/backend/utils/activity/pgstat_replslot.c  |  13 +-
 src/backend/utils/activity/pgstat_shmem.c |  19 ++-
 .../utils/activity/pgstat_subscription.c  |  12 +-
 src/backend/utils/activity/pgstat_xact.c  |  60 +++---
 src/backend/utils/adt/pgstatfuncs.c   |  34 +-
 src/include/access/tableam.h  |  19 +++
 src/include/access/xact.h |   1 +
 src/include/catalog/pg_proc.dat   |  14 ++-
 src/include/pgstat.h  |  19 ++-
 src/include/utils/pgstat_internal.h   |  34 --
 src/test/recovery/t/029_stats_restart.pl  |  40 +++
 .../recovery/t/030_stats_cleanup_replica.pl   |   6 +-
 src/test/regress/expected/rules.out   |  12 +-
 src/test/regress/expected/stats.out   |  30 ++---
 src/test/regress/sql/stats.sql|  30 ++---
 src/test/subscription/t/026_stats.pl  |   4 +-
 src/tools/pgindent/typedefs.list  |   1 +
 28 files changed, 415 insertions(+), 147 deletions(-)
   4.6% src/backend/catalog/
  47.8% src/backend/utils/activity/
   6.5% src/backend/utils/adt/
   3.7% src/backend/
   3.3% src/include/access/
   3.3% src/include/catalog/
   6.2% src/include/utils/
   3.3% src/include/
  12.1% src/test/recovery/t/
   5.5% src/test/regress/expected/
   3.0% src/test/

diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c
index dccca201e0..c02b079645 100644
--- a/src/backend/access/rmgrdesc/xactdesc.c
+++ b/src/backend/access/rmgrdesc/xactdesc.c
@@ -319,10 +319,11 @@ xact_desc_stats(StringInfo buf, const char *label,
 		appendStringInfo(buf, "; %sdropped stats:", label);
 		for (i = 0; i < ndropped; i++)
 		{
-			appendStringInfo(buf, " %d/%u/%u",
+			appendStringInfo(buf, " %d/%u/%u/%u",
 			 dropped_stats[i].kind,
 			 dropped_stats[i].dboid,
-			 dropped_stats[i].objoid);
+			 dropped_stats[i].objoid,
+			 dropped_stats[i].relfile);
 		}
 	}
 }
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index f56b3cc0f2..db6107cd90 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -33,6 +33,7 @@
 #include "storage/smgr.h"
 #include "utils/hsearch.h"
 #include "utils/memutils.h"
+#include "utils/pgstat_internal.h"
 #include "utils/rel.h"
 
 /

Re: Avoid orphaned objects dependencies, take 3

2024-06-05 Thread Bertrand Drouvot
Hi,

On Thu, May 23, 2024 at 02:10:54PM -0400, Robert Haas wrote:
> On Thu, May 23, 2024 at 12:19 AM Bertrand Drouvot
>  wrote:
> > The reason why we are using a dirty snapshot here is for the cases where we 
> > are
> > recording a dependency on a referenced object that we are creating at the 
> > same
> > time behind the scene (for example, creating a composite type while creating
> > a relation). Without the dirty snapshot, then the object we are creating 
> > behind
> > the scene (the composite type) would not be visible and we would wrongly 
> > assume
> > that it has been dropped.
> 
> The usual reason for using a dirty snapshot is that you want to see
> uncommitted work by other transactions. It sounds like you're saying
> you just need to see uncommitted work by the same transaction. If
> that's true, I think using HeapTupleSatisfiesSelf would be clearer. Or
> maybe we just need to put CommandCounterIncrement() calls in the right
> places to avoid having the problem in the first place. Or maybe this
> is another sign that we're doing the work at the wrong level.

Thanks for having discussed your concern with Tom last week during pgconf.dev
and for sharing the outcome with me. I understand your concern regarding code
maintainability with the current approach.

Please find attached v9 that:

- Acquire the lock and check for object existence at an upper level, meaning
before calling recordDependencyOn() and recordMultipleDependencies().

- Get rid of the SNAPSHOT_SELF snapshot usage and rely on
CommandCounterIncrement() instead to ensure new entries are visible when
we check for object existence (for the cases where we create additional objects
behind the scenes, like a composite type while creating a relation).

- Add an assertion in recordMultipleDependencies() to ensure that we locked the
object before recording the dependency (to ensure we don't miss any cases now 
that
the lock is acquired at an upper level).

A few remarks:

My first attempt was to move eliminate_duplicate_dependencies() out of
record_object_address_dependencies() so that we get the calls in this order:

eliminate_duplicate_dependencies()
depLockAndCheckObjects()
record_object_address_dependencies()

What I'm doing instead in v9 is to rename record_object_address_dependencies()
to lock_record_object_address_dependencies() and add depLockAndCheckObjects()
in it at the right place. That way the caller of
[lock_]record_object_address_dependencies() is not responsible for calling
eliminate_duplicate_dependencies() (which would have been the case with my first
attempt).

We need to set up the LOCKTAG before calling the new Assert in
recordMultipleDependencies(). So, "#ifdef USE_ASSERT_CHECKING" is used here to
avoid setting up the LOCKTAG on a non-assert-enabled build.

v9 is more invasive (as it changes code in much more places) than v8 but it is
easier to follow (as it is now clear where the new lock is acquired).

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
>From 6b6ee40fab0b011e9858a0a25624935c59732bfd Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot 
Date: Fri, 29 Mar 2024 15:43:26 +
Subject: [PATCH v9] Avoid orphaned objects dependencies

It's currently possible to create orphaned objects dependencies, for example:

Scenario 1:

session 1: begin; drop schema schem;
session 2: create a function in the schema schem
session 1: commit;

With the above, the function created in session 2 would be linked to a non
existing schema.

Scenario 2:

session 1: begin; create a function in the schema schem
session 2: drop schema schem;
session 1: commit;

With the above, the function created in session 1 would be linked to a non
existing schema.

To avoid those scenarios, a new lock (that conflicts with a lock taken by DROP)
has been put in place when the dependencies are being recorded. With this in
place, the drop schema in scenario 2 would be locked.

Also, after the new lock attempt, the patch checks that the object still exists:
with this in place session 2 in scenario 1 would be locked and would report an
error once session 1 commits (that would not be the case should session 1 abort
the transaction).

If the object is dropped before the new lock attempt is triggered then the patch
would also report an error (but with less details).

The patch adds a few tests for some dependency cases (that would currently produce
orphaned objects):

- schema and function (as the above scenarios)
- alter a dependency (function and schema)
- function and arg type
- function and return type
- function and function
- domain and domain
- table and type
- server and foreign data wrapper
---
 src/backend/catalog/aclchk.c  |   1 +
 src/backend/catalog/dependency.c  |  83 +--
 src/backend/catalog/heap.c|   7 +-
 src/b

Re: relfilenode statistics

2024-06-04 Thread Bertrand Drouvot
On Tue, Jun 04, 2024 at 09:26:27AM -0400, Robert Haas wrote:
> On Mon, Jun 3, 2024 at 7:11 AM Bertrand Drouvot
>  wrote:
> > We’d move the current buffer read and buffer hit counters from the relation 
> > stats
> > to the relfilenode stats (while still being able to retrieve them from the
> > pg_statio_all_tables/indexes views: see the example for the new 
> > heap_blks_written
> > stat added in the patch). Generally speaking, I think that tracking 
> > counters at
> > a common level (i.e relfilenode level instead of table or index level) is
> > beneficial (avoid storing/allocating space for the same counters in multiple
> > structs) and sounds more intuitive to me.
> 
> Hmm. So if I CLUSTER or VACUUM FULL the relation, the relfilenode
> changes. Does that mean I lose all of those stats? Is that a problem?
> Or is it good? Or what?

I think we should keep the stats in the relation during relfilenode changes.
As a POC, v1 implemented a way to do so during TRUNCATE (see the changes in
table_relation_set_new_filelocator() and in pg_statio_all_tables): as you can
see in the example provided up-thread the new heap_blks_written statistic has
been preserved during the TRUNCATE. 

Please note that the v1 POC only takes care of the new heap_blks_written stat 
and
that the logic used in table_relation_set_new_filelocator() would probably need
to be applied in rebuild_relation() or such to deal with CLUSTER or VACUUM FULL.

For the relation, the new counter "blocks_written" has been added to the
PgStat_StatTabEntry struct (it's not needed in the PgStat_TableCounts one as the
relfilenode stat takes care of it). It's added in PgStat_StatTabEntry only
to copy/preserve the relfilenode stats during rewrite operations and to retrieve
the stats in pg_statio_all_tables.
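
To make the copy/preserve step concrete, here is a hand-wavy sketch of the idea
(the helper shape and its arguments are made up for illustration; they are not
the POC's actual functions):

    /*
     * Hypothetical sketch: at rewrite time, fold the counters accumulated for
     * the old relfilenode into the relation-level entry, so the value shown in
     * pg_statio_all_tables survives the rewrite while the new relfilenode
     * entry restarts from zero.
     */
    static void
    preserve_relfilenode_stats(PgStat_StatTabEntry *tabentry,
                               PgStat_Counter old_relfilenode_blocks_written)
    {
        if (tabentry != NULL)
            tabentry->blocks_written += old_relfilenode_blocks_written;
    }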

Then, if later we split the relation stats to index/table stats, we'd have
blocks_written defined in less structs (as compare to doing the split without
relfilenode stat in place).

As mentioned up-thread, the new logic has been implemented in v1 only for the
new blocks_written stat (we'd need to do the same for the existing buffer read /
buffer hit if we agree on the approach implemented in v1).

> I also thought about the other direction. Suppose I drop the a
> relation and create a new one that gets a different relation OID but
> the same relfilenode. But I don't think that's a problem: dropping the
> relation should forcibly remove the old stats, so there won't be any
> conflict in this case.

Yeah.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: relfilenode statistics

2024-06-03 Thread Bertrand Drouvot
Hi Robert,

On Mon, May 27, 2024 at 09:10:13AM -0400, Robert Haas wrote:
> Hi Bertrand,
> 
> It would be helpful to me if the reasons why we're splitting out
> relfilenodestats could be more clearly spelled out. I see Andres's
> comment in the thread to which you linked, but it's pretty vague about
> why we should do this ("it's not nice") and whether we should do this
> ("I wonder if this is an argument for") and maybe that's all fine if
> Andres is going to be the one to review and commit this, but even if
> then it would be nice if the rest of us could follow along from home,
> and right now I can't.

Thanks for the feedback! 

You’re completely right, my previous message was missing a clear explanation as
to why I think that relfilenode stats could be useful. Let me try to fix this.

The main argument is that we currently don’t have writes counters for relations.
The reason is that we don’t have the relation OID when writing buffers out.
Tracking writes per relfilenode would allow us to track/consolidate writes per
relation (example in the v1 patch and in the message up-thread).

I think that adding instrumentation in this area (writes counters) could be
beneficial (like it is for the ones we currently have for reads).

Second argument is that this is also beneficial for the "Split index and
table statistics into different types of stats" thread (mentioned in the 
previous
message). It would allow us to avoid additional branches in some situations 
(like
the one mentioned by Andres in the link I provided up-thread).

If we agree that the main argument is a good enough reason to think about
having relfilenode stats, then I think using them as proposed in the second
argument makes sense too:

We’d move the current buffer read and buffer hit counters from the relation
stats to the relfilenode stats (while still being able to retrieve them from the
pg_statio_all_tables/indexes views: see the example for the new
heap_blks_written stat added in the patch). Generally speaking, I think that
tracking counters at a common level (i.e. the relfilenode level instead of the
table or index level) is beneficial (it avoids storing/allocating space for the
same counters in multiple structs) and sounds more intuitive to me.
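
As a purely illustrative sketch of what "one common struct" could mean (the
names below are hypothetical, not the POC's actual definitions):

    /* hypothetical: counters shared by tables and indexes at the relfilenode level */
    typedef struct PgStat_RelfilenodeCounts
    {
        PgStat_Counter blocks_read;     /* today tracked per relation */
        PgStat_Counter blocks_hit;      /* today tracked per relation */
        PgStat_Counter blocks_written;  /* new, counted when buffers are written out */
    } PgStat_RelfilenodeCounts;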

Also, I think this opens the door to new ideas: for example, with relfilenode
statistics in place, we could probably also start thinking about tracking
checksum errors per relfilenode.

> The commit message is often a good place to spell this kind of thing
> out, because then it's included with every version of the patch you
> post, and may be of some use to the eventual committer in writing
> their commit message. The body of the email where you post the patch
> set can be fine, too.
> 

Yeah, I’ll update the commit message in V2 with better explanations once I get
feedback on V1 (should we decide to move on with the relfilenode stats idea).

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




relfilenode statistics

2024-05-25 Thread Bertrand Drouvot
Hi hackers,

Please find attached a POC patch to implement $SUBJECT.

Adding relfilenode statistics has been proposed in [1]. The idea is to allow
tracking dirtied blocks, written blocks,... on a per relation basis.

The attached patch is not in a fully "polished" state yet: there are more
places where we should add relfilenode counters, more APIs to create to
retrieve the relfilenode stats, ...

But I think that it is in a state that can be used to discuss the approach it
is implementing (so that we can agree or not on it) before moving forward.

The approach that is implemented in this patch is the following:

- A new PGSTAT_KIND_RELFILENODE is added
- A new attribute (aka relfile) has been added to PgStat_HashKey so that we
can record (dboid, spcOid and relfile) to identify a relfilenode entry (a rough
sketch of such a key is shown after this list)
- pgstat_create_transactional() is used in RelationCreateStorage()
- pgstat_drop_transactional() is used in RelationDropStorage()
- RelationPreserveStorage() will remove the entry from the list of dropped stats
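
For readers not looking at the attached patch, the extended hash key could look
roughly like this; this is only a sketch inferred from the description above,
and the actual field names and layout in the POC may differ:

    typedef struct PgStat_HashKey
    {
        PgStat_Kind kind;           /* e.g. the new PGSTAT_KIND_RELFILENODE */
        Oid         dboid;          /* database OID */
        Oid         objoid;         /* object OID (existing field) */
        Oid         spcOid;         /* tablespace OID (assumption: could also be
                                     * carried in an existing field) */
        RelFileNumber relfile;      /* new: relfilenumber */
    } PgStat_HashKey;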

The current approach to deal with table rewrite is to:

- copy the relfilenode stats in table_relation_set_new_filelocator() from
the relfilenode stats entry to the shared table stats entry
- in the pg_statio_all_tables view: add the table stats entry (which contains
the "previous" relfilenode stats, due to the above, that were linked to this
relation) to the current relfilenode stats linked to the relation

An example is done in the attached patch for the new heap_blks_written field
in pg_statio_all_tables. Outcome is:

"
postgres=# create table bdt (a int);
CREATE TABLE
postgres=# select heap_blks_written from pg_statio_all_tables where relname = 'bdt';
 heap_blks_written
---
 0
(1 row)

postgres=# insert into bdt select generate_series(1,1);
INSERT 0 1
postgres=# select heap_blks_written from pg_statio_all_tables where relname = 'bdt';
 heap_blks_written
---
 0
(1 row)

postgres=# checkpoint;
CHECKPOINT
postgres=# select heap_blks_written from pg_statio_all_tables where relname = 'bdt';
 heap_blks_written
---
45
(1 row)

postgres=# truncate table bdt;
TRUNCATE TABLE
postgres=# select heap_blks_written from pg_statio_all_tables where relname = 'bdt';
 heap_blks_written
---
45
(1 row)

postgres=# insert into bdt select generate_series(1,1);
INSERT 0 1
postgres=# select heap_blks_written from pg_statio_all_tables where relname = 'bdt';
 heap_blks_written
---
45
(1 row)

postgres=# checkpoint;
CHECKPOINT
postgres=# select heap_blks_written from pg_statio_all_tables where relname = 'bdt';
 heap_blks_written
---
90
(1 row)
"

Some remarks:

- My first attempt was to call pgstat_create_transactional() and
pgstat_drop_transactional() at the same places as is done for relations, but
that did not work well (mainly due to corner cases in case of rewrite).

- Please disregard the pgstat_count_buffer_read() and
pgstat_count_buffer_hit() calls in pgstat_report_relfilenode_buffer_read()
and pgstat_report_relfilenode_buffer_hit(). Those stats will follow the same
flow as the one described above for the new heap_blks_written one (should we
agree on it).

Looking forward to your comments, feedback.

Regards,

[1]: 
https://www.postgresql.org/message-id/20231113204439.r4lmys73tessqmak%40awork3.anarazel.de

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
>From e102c9d15c08c638879ece26008faee58cf4a07e Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot 
Date: Thu, 16 Nov 2023 02:30:01 +
Subject: [PATCH v1] Provide relfilenode statistics

---
 src/backend/access/rmgrdesc/xactdesc.c|   5 +-
 src/backend/catalog/storage.c |   8 ++
 src/backend/catalog/system_functions.sql  |   2 +-
 src/backend/catalog/system_views.sql  |   5 +-
 src/backend/postmaster/checkpointer.c |   5 +
 src/backend/storage/buffer/bufmgr.c   |   6 +-
 src/backend/storage/smgr/md.c |   7 ++
 src/backend/utils/activity/pgstat.c   |  39 --
 src/backend/utils/activity/pgstat_database.c  |  12 +-
 src/backend/utils/activity/pgstat_function.c  |  13 +-
 src/backend/utils/activity/pgstat_relation.c  | 112 --
 src/backend/utils/activity/pgstat_replslot.c  |  13 +-
 src/backend/utils/activity/pgstat_shmem.c |  19 ++-
 .../utils/activity/pgstat_subscription.c  |  12 +-
 src/backend/utils/activity/pgstat_xact.c  |  60 +++---
 src/backend/utils/adt/pgstatfuncs.c   |  34 +-
 src/include/access/tableam.h  |  19 +++
 src/include/access/xact.h |   1 +
 src/include/catalog/pg_proc.dat   |  14 ++-
 src/include/pgstat.h  |  19 ++-
 src/include/utils/

Re: Avoid orphaned objects dependencies, take 3

2024-05-23 Thread Bertrand Drouvot
Hi,

On Thu, May 23, 2024 at 02:10:54PM -0400, Robert Haas wrote:
> On Thu, May 23, 2024 at 12:19 AM Bertrand Drouvot
>  wrote:
> > The reason why we are using a dirty snapshot here is for the cases where we 
> > are
> > recording a dependency on a referenced object that we are creating at the 
> > same
> > time behind the scene (for example, creating a composite type while creating
> > a relation). Without the dirty snapshot, then the object we are creating 
> > behind
> > the scene (the composite type) would not be visible and we would wrongly 
> > assume
> > that it has been dropped.
> 
> The usual reason for using a dirty snapshot is that you want to see
> uncommitted work by other transactions. It sounds like you're saying
> you just need to see uncommitted work by the same transaction.

Right.

> If that's true, I think using HeapTupleSatisfiesSelf would be clearer.

Oh thanks! I did not know about the SNAPSHOT_SELF snapshot type (I should have
checked all the snapshot types first though) and that's exactly what is needed
here.

Please find attached v8 making use of SnapshotSelf instead of a dirty snapshot.
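
For the archives, a minimal sketch of the kind of existence check SnapshotSelf
enables (illustration only, not the v8 code; pg_type and typeOid are just an
example):

    Relation    rel = table_open(TypeRelationId, AccessShareLock);
    ScanKeyData key;
    SysScanDesc scan;
    bool        exists;

    ScanKeyInit(&key,
                Anum_pg_type_oid,
                BTEqualStrategyNumber, F_OIDEQ,
                ObjectIdGetDatum(typeOid));     /* typeOid: example OID */

    /* SnapshotSelf also sees rows inserted earlier in our own transaction */
    scan = systable_beginscan(rel, TypeOidIndexId, true, SnapshotSelf, 1, &key);
    exists = HeapTupleIsValid(systable_getnext(scan));
    systable_endscan(scan);
    table_close(rel, AccessShareLock);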

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
>From 2611fb874a476c1a8d665f97d973193beb70292a Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot 
Date: Fri, 29 Mar 2024 15:43:26 +
Subject: [PATCH v8] Avoid orphaned objects dependencies

It's currently possible to create orphaned objects dependencies, for example:

Scenario 1:

session 1: begin; drop schema schem;
session 2: create a function in the schema schem
session 1: commit;

With the above, the function created in session 2 would be linked to a non
existing schema.

Scenario 2:

session 1: begin; create a function in the schema schem
session 2: drop schema schem;
session 1: commit;

With the above, the function created in session 1 would be linked to a non
existing schema.

To avoid those scenarios, a new lock (that conflicts with a lock taken by DROP)
has been put in place when the dependencies are being recorded. With this in
place, the drop schema in scenario 2 would be locked.

Also, after the new lock attempt, the patch checks that the object still exists:
with this in place session 2 in scenario 1 would be locked and would report an
error once session 1 commits (that would not be the case should session 1 abort
the transaction).

If the object is dropped before the new lock attempt is triggered then the patch
would also report an error (but with less details).

The patch adds a few tests for some dependency cases (that would currently produce
orphaned objects):

- schema and function (as the above scenarios)
- alter a dependency (function and schema)
- function and arg type
- function and return type
- function and function
- domain and domain
- table and type
- server and foreign data wrapper
---
 src/backend/catalog/dependency.c  |  59 
 src/backend/catalog/objectaddress.c   |  66 +
 src/backend/catalog/pg_depend.c   |  12 ++
 src/backend/utils/errcodes.txt|   1 +
 src/include/catalog/dependency.h  |   1 +
 src/include/catalog/objectaddress.h   |   1 +
 .../expected/test_dependencies_locks.out  | 129 ++
 src/test/isolation/isolation_schedule |   1 +
 .../specs/test_dependencies_locks.spec|  89 
 9 files changed, 359 insertions(+)
  24.3% src/backend/catalog/
  45.3% src/test/isolation/expected/
  28.2% src/test/isolation/specs/

diff --git a/src/backend/catalog/dependency.c b/src/backend/catalog/dependency.c
index d4b5b2ade1..4271ed7c5b 100644
--- a/src/backend/catalog/dependency.c
+++ b/src/backend/catalog/dependency.c
@@ -1517,6 +1517,65 @@ AcquireDeletionLock(const ObjectAddress *object, int flags)
 	}
 }
 
+/*
+ * depLockAndCheckObject
+ *
+ * Lock the object that we are about to record a dependency on.
+ * After it's locked, verify that it hasn't been dropped while we
+ * weren't looking.  If the object has been dropped, this function
+ * does not return!
+ */
+void
+depLockAndCheckObject(const ObjectAddress *object)
+{
+	char	   *object_description;
+
+	/*
+	 * Those don't rely on LockDatabaseObject() when being dropped (see
+	 * AcquireDeletionLock()). Also it looks like they can not produce
+	 * orphaned dependent objects when being dropped.
+	 */
+	if (object->classId == RelationRelationId || object->classId == AuthMemRelationId)
+		return;
+
+	object_description = getObjectDescription(object, true);
+
+	/* assume we should lock the whole object not a sub-object */
+	LockDatabaseObject(object->classId, object->objectId, 0, AccessShareLock);
+
+	/* check if object still exists */
+	if (!ObjectByIdExist(object, false))
+	{
+		/*
+		 * It might be possible that we are creating it (for example creating
+		 * a compo

Re: PostgreSQL 17 Beta 1 release announcement draft

2024-05-22 Thread Bertrand Drouvot
Hi,

On Wed, May 22, 2024 at 07:01:54PM -0400, Jonathan S. Katz wrote:
> Thanks. As such I made it:
> 
> "which provides descriptions about wait events and can be combined with
> `pg_stat_activity` to give more insight into why an active session is
> waiting."
> 

Thanks! Works for me.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Avoid orphaned objects dependencies, take 3

2024-05-22 Thread Bertrand Drouvot
Hi,

On Wed, May 22, 2024 at 10:48:12AM -0400, Robert Haas wrote:
> On Wed, May 22, 2024 at 6:21 AM Bertrand Drouvot
>  wrote:
> > I started initially with [1] but it was focusing on function-schema only.
> 
> Yeah, that's what I thought we would want to do. And then just extend
> that to the other cases.
> 
> > Then I proposed [2] making use of a dirty snapshot when recording the 
> > dependency.
> > But this approach appeared to be "scary" and it was still failing to close
> > some race conditions.
> 
> The current patch still seems to be using dirty snapshots for some
> reason, which struck me as a bit odd. My intuition is that if we're
> relying on dirty snapshots to solve problems, we likely haven't solved
> the problems correctly, which seems consistent with your statement
> about "failing to close some race conditions". But I don't think I
> understand the situation well enough to be sure just yet.

The reason why we are using a dirty snapshot here is for the cases where we are
recording a dependency on a referenced object that we are creating at the same
time behind the scenes (for example, creating a composite type while creating
a relation). Without the dirty snapshot, the object we are creating behind
the scenes (the composite type) would not be visible and we would wrongly assume
that it has been dropped.

Note that the dirty snapshot is only used when the object is first reported as
"non existing" by the new ObjectByIdExist() function.

> > I think there is 2 cases here:
> >
> > First case: the "conflicting" lock mode is for one of our own lock then 
> > LockCheckConflicts()
> > would report this as a NON conflict.
> >
> > Second case: the "conflicting" lock mode is NOT for one of our own lock 
> > then LockCheckConflicts()
> > would report a conflict. But I've the feeling that the existing code would
> > already lock those sessions.
> >
> > Was your concern about "First case" or "Second case" or both?
> 
> The second case, I guess. It's bad to take a weaker lock first and
> then a stronger lock on the same object later, because it can lead to
> deadlocks that would have been avoided if the stronger lock had been
> taken at the outset.

In the example I shared up-thread it would be the opposite: Session 1 would
take an AccessExclusiveLock on the object before taking an AccessShareLock
during changeDependencyFor().

> Here, it seems like things would happen in the
> other order: if we took two locks, we'd probably take the stronger
> lock in the higher-level code and then the weaker lock in the
> dependency code.

Yeah, I agree.

> That shouldn't break anything; it's just a bit
> inefficient.

Yeah, the second lock is useless in that case (like in the example up-thread).

> My concern was really more about the maintainability of
> the code. I fear that if we add code that takes heavyweight locks in
> surprising places, we might later find the behavior difficult to
> reason about.
>

I think I understand your concern about code maintainability but I'm not sure
that adding locks while recording a dependency is that surprising.

> Tom, what is your thought about that concern?

+1

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Avoid orphaned objects dependencies, take 3

2024-05-22 Thread Bertrand Drouvot
Hi,

On Tue, May 21, 2024 at 08:53:06AM -0400, Robert Haas wrote:
> On Wed, May 15, 2024 at 6:31 AM Bertrand Drouvot
>  wrote:
> > Please find attached v6 (only diff with v5 is moving the tests as suggested
> > above).
> 
> I don't immediately know what to think about this patch.

Thanks for looking at it!

> I've known about this issue for a long time, but I didn't think this was how 
> we
> would fix it.

I initially started with [1] but it was focusing on the function-schema case only.

Then I proposed [2] making use of a dirty snapshot when recording the 
dependency.
But this approach appeared to be "scary" and it was still failing to close
some race conditions.

Then, Tom proposed another approach in [2] which is that "creation DDL will have
to take a lock on each referenced object that'd conflict with a lock taken by 
DROP".
This is the one the current patch is trying to implement.

> What you've done here instead is add locking at a much
> lower level - whenever we are adding a dependency on an object, we
> lock the object. The advantage of that approach is that we definitely
> won't miss anything.

Right, as there are many more than the ones related to schemas, for example:

- function and arg type
- function and return type
- function and function
- domain and domain
- table and type
- server and foreign data wrapper

to name a few.
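
For instance, for the domain-and-domain case (one of the ones exercised by the
patch's tests), a small sketch of what records such a dependency:

CREATE DOMAIN d1 AS int;
CREATE DOMAIN d2 AS d1;  -- d2 records a dependency on d1
-- a DROP DOMAIN d1 running concurrently with the creation of d2 is the kind of
-- race that can leave d2 pointing to a dropped domain on master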

> The disadvantage of that approach is that it
> means we have some very low-level code taking locks, which means it's
> not obvious which operations are taking what locks.

Right, but the new operations are "only" the ones leading to recording or altering
a dependency.

> Maybe it could
> even result in some redundancy, like the higher-level code taking a
> lock also (possibly in a different mode) and then this code taking
> another one.

The one that is added here is in AccessShareLock mode. It could conflict with
the ones taken in AccessExclusiveLock mode by (if I'm not missing any):

- AcquireDeletionLock(): which is exactly what we want
- get_object_address()
- get_object_address_rv()
- ExecAlterObjectDependsStmt()
- ExecRenameStmt()
- ExecAlterObjectDependsStmt()
- ExecAlterOwnerStmt()
- RemoveObjects()
- AlterPublication()

I think there are 2 cases here:

First case: the "conflicting" lock is one of our own locks; then LockCheckConflicts()
would report this as a NON conflict.

Second case: the "conflicting" lock is NOT one of our own locks; then LockCheckConflicts()
would report a conflict. But I have the feeling that the existing code would
already block those sessions.

One example where that would be the case:

Session 1: doing "BEGIN; ALTER FUNCTION noschemas.foo2() SET SCHEMA alterschema" would
acquire the lock in AccessExclusiveLock mode during
ExecAlterObjectSchemaStmt()->get_object_address()->LockDatabaseObject()
(in the existing code, and before the new call that would occur through
changeDependencyFor()->depLockAndCheckObject() with the patch in place).

Then, session 2: doing "alter function noschemas.foo2() owner to newrole;"
would be blocked in the existing code while doing
ExecAlterOwnerStmt()->get_object_address()->LockDatabaseObject().

This means that, in this example, the new lock this patch is introducing would not be
responsible for session 2 being blocked.
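
To spell out the ordering (a sketch using the objects from my test setup, for
illustration only):

-- session 1
BEGIN;
ALTER FUNCTION noschemas.foo2() SET SCHEMA alterschema;  -- AccessExclusiveLock on the function (existing code)

-- session 2, while session 1's transaction is still open
ALTER FUNCTION noschemas.foo2() OWNER TO newrole;        -- already blocks today, in get_object_address()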

Was your concern about "First case" or "Second case" or both?

[1]: 
https://www.postgresql.org/message-id/flat/5a9daaae-5538-b209-6279-e903c3ea2157%40amazon.com
[2]: 
https://www.postgresql.org/message-id/flat/8369ff70-0e31-f194-2954-787f4d9e21dd%40amazon.com

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Avoid orphaned objects dependencies, take 3

2024-05-21 Thread Bertrand Drouvot
Hi,

On Sun, May 19, 2024 at 11:00:00AM +0300, Alexander Lakhin wrote:
> Hello Bertrand,
> 
> Probably, it's worth to avoid ERRCODE_INTERNAL_ERROR here in light of [1]
> and [2], as these errors are not that abnormal (not Assert-like).
> 
> [1] https://www.postgresql.org/message-id/Zic_GNgos5sMxKoa%40paquier.xyz
> [2] https://commitfest.postgresql.org/48/4735/
> 

Thanks for mentioning the above examples, I agree that it's worth avoiding
ERRCODE_INTERNAL_ERROR here: please find attached v7 that makes use of a new
ERRCODE: ERRCODE_DEPENDENT_OBJECTS_DOES_NOT_EXIST

I thought about this name as it is close enough to the already existing
"ERRCODE_DEPENDENT_OBJECTS_STILL_EXIST", but I'm open to suggestions too.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
>From 79809f88311ef1cf4e8f3250e1508fe0b7d86602 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot 
Date: Fri, 29 Mar 2024 15:43:26 +
Subject: [PATCH v7] Avoid orphaned objects dependencies

It's currently possible to create orphaned objects dependencies, for example:

Scenario 1:

session 1: begin; drop schema schem;
session 2: create a function in the schema schem
session 1: commit;

With the above, the function created in session 2 would be linked to a non
existing schema.

Scenario 2:

session 1: begin; create a function in the schema schem
session 2: drop schema schem;
session 1: commit;

With the above, the function created in session 1 would be linked to a non
existing schema.

To avoid those scenarios, a new lock (that conflicts with a lock taken by DROP)
has been put in place when the dependencies are being recorded. With this in
place, the drop schema in scenario 2 would be locked.

Also, after the new lock attempt, the patch checks that the object still exists:
with this in place session 2 in scenario 1 would be locked and would report an
error once session 1 commits (that would not be the case should session 1 abort
the transaction).

If the object is dropped before the new lock attempt is triggered then the patch
would also report an error (but with less details).

The patch adds a few tests for some dependency cases (that would currently produce
orphaned objects):

- schema and function (as the above scenarios)
- alter a dependency (function and schema)
- function and arg type
- function and return type
- function and function
- domain and domain
- table and type
- server and foreign data wrapper
---
 src/backend/catalog/dependency.c  |  58 
 src/backend/catalog/objectaddress.c   |  70 ++
 src/backend/catalog/pg_depend.c   |  12 ++
 src/backend/utils/errcodes.txt|   1 +
 src/include/catalog/dependency.h  |   1 +
 src/include/catalog/objectaddress.h   |   1 +
 .../expected/test_dependencies_locks.out  | 129 ++
 src/test/isolation/isolation_schedule |   1 +
 .../specs/test_dependencies_locks.spec|  89 
 9 files changed, 362 insertions(+)
  24.6% src/backend/catalog/
  45.1% src/test/isolation/expected/
  28.1% src/test/isolation/specs/

diff --git a/src/backend/catalog/dependency.c b/src/backend/catalog/dependency.c
index d4b5b2ade1..89b2f18f98 100644
--- a/src/backend/catalog/dependency.c
+++ b/src/backend/catalog/dependency.c
@@ -1517,6 +1517,64 @@ AcquireDeletionLock(const ObjectAddress *object, int flags)
 	}
 }
 
+/*
+ * depLockAndCheckObject
+ *
+ * Lock the object that we are about to record a dependency on.
+ * After it's locked, verify that it hasn't been dropped while we
+ * weren't looking.  If the object has been dropped, this function
+ * does not return!
+ */
+void
+depLockAndCheckObject(const ObjectAddress *object)
+{
+	char	   *object_description;
+
+	/*
+	 * Those don't rely on LockDatabaseObject() when being dropped (see
+	 * AcquireDeletionLock()). Also it looks like they can not produce
+	 * orphaned dependent objects when being dropped.
+	 */
+	if (object->classId == RelationRelationId || object->classId == AuthMemRelationId)
+		return;
+
+	object_description = getObjectDescription(object, true);
+
+	/* assume we should lock the whole object not a sub-object */
+	LockDatabaseObject(object->classId, object->objectId, 0, AccessShareLock);
+
+	/* check if object still exists */
+	if (!ObjectByIdExist(object, false))
+	{
+		/*
+		 * It might be possible that we are creating it (for example creating
+		 * a composite type while creating a relation), so bypass the syscache
+		 * lookup and use a dirty snaphot instead to cover this scenario.
+		 * lookup and use a dirty snapshot instead to cover this scenario.
+		if (!ObjectByIdExist(object, true))
+		{
+			/*
+			 * If the object has been dropped before we get a chance to get
+			 * its description, then emit a generic error message. That looks
+			 * like a good compromise over extra complexity.
+			 */
+			if (object_description)
+ereport(ERROR,
+		(errcode(ERRCODE_DEPE

Re: PostgreSQL 17 Beta 1 release announcement draft

2024-05-20 Thread Bertrand Drouvot
Hi,

On Sun, May 19, 2024 at 05:10:10PM -0400, Jonathan S. Katz wrote:
> On 5/16/24 1:15 AM, Bertrand Drouvot wrote:
> > Hi,
> > 
> > On Wed, May 15, 2024 at 09:45:35PM -0400, Jonathan S. Katz wrote:
> > > Hi,
> > > 
> > > Attached is a copy of the PostgreSQL 17 Beta 1 release announcement draft.
> > 
> > Thanks for working on it!
> > 
> > I've one comment:
> > 
> > > PostgreSQL 17 also introduces a new view, 
> > > [`pg_wait_events`](https://www.postgresql.org/docs/17/view-pg-wait-events.html),
> > >  which provides descriptions about wait events and can be combined with 
> > > `pg_stat_activity` to give more insight into an operation.
> > 
> > Instead of "to give more insight into an operation", what about "to give 
> > more
> > insight about what a session is waiting for (should it be active)"?
> 
> I put:
> 
> "to give more in insight into why a session is blocked."

Thanks!

> 
> Does that work?
> 

I think using "waiting" is better (as the view is "pg_wait_events" and the
join with pg_stat_activity would be on the "wait_event_type" and "wait_event"
columns).

The reason I mentioned "should it be active" is because wait_event and wait_event_type
could be non-empty in pg_stat_activity while the session is no longer in an active
state (and thus not waiting).

A right query would be like the one in [1]:

"
SELECT a.pid, a.wait_event, w.description
  FROM pg_stat_activity a JOIN
   pg_wait_events w ON (a.wait_event_type = w.type AND
a.wait_event = w.name)
  WHERE a.wait_event is NOT NULL and a.state = 'active';
"

means filtering on the "active" state too, and that's what the description
proposal I made was trying to highlight.

[1]: https://www.postgresql.org/docs/devel/monitoring-stats.html

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: PostgreSQL 17 Beta 1 release announcement draft

2024-05-15 Thread Bertrand Drouvot
Hi,

On Wed, May 15, 2024 at 09:45:35PM -0400, Jonathan S. Katz wrote:
> Hi,
> 
> Attached is a copy of the PostgreSQL 17 Beta 1 release announcement draft.

Thanks for working on it!

I've one comment:

> PostgreSQL 17 also introduces a new view, 
> [`pg_wait_events`](https://www.postgresql.org/docs/17/view-pg-wait-events.html),
>  which provides descriptions about wait events and can be combined with 
> `pg_stat_activity` to give more insight into an operation.

Instead of "to give more insight into an operation", what about "to give more
insight about what a session is waiting for (should it be active)"?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Avoid orphaned objects dependencies, take 3

2024-05-15 Thread Bertrand Drouvot
Hi,

On Wed, May 15, 2024 at 08:31:43AM +, Bertrand Drouvot wrote:
> Hi,
> 
> On Wed, May 15, 2024 at 10:14:09AM +0900, Michael Paquier wrote:
> > +++ 
> > b/src/test/modules/test_dependencies_locks/specs/test_dependencies_locks.spec
> >  
> > 
> > This introduces a module with only one single spec.  I could get
> > behind an extra module if splitting the tests into more specs makes
> > sense or if there is a restriction on installcheck.  However, for 
> > one spec file filed with a bunch of objects, and note that I'm OK to
> > let this single spec grow more for this range of tests, it seems to me
> > that this would be better under src/test/isolation/.
> 
> Yeah, I was not sure about this one (the location is from take 2 mentioned
> up-thread). I'll look at moving the tests to src/test/isolation/.

Please find attached v6 (only diff with v5 is moving the tests as suggested
above).

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
>From 59f7befd34b8aa4ba0483a100e85bacbc1a76707 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot 
Date: Fri, 29 Mar 2024 15:43:26 +
Subject: [PATCH v6] Avoid orphaned objects dependencies

It's currently possible to create orphaned objects dependencies, for example:

Scenario 1:

session 1: begin; drop schema schem;
session 2: create a function in the schema schem
session 1: commit;

With the above, the function created in session 2 would be linked to a non
existing schema.

Scenario 2:

session 1: begin; create a function in the schema schem
session 2: drop schema schem;
session 1: commit;

With the above, the function created in session 1 would be linked to a non
existing schema.

To avoid those scenarios, a new lock (that conflicts with a lock taken by DROP)
has been put in place when the dependencies are being recorded. With this in
place, the drop schema in scenario 2 would be locked.

Also, after the new lock attempt, the patch checks that the object still exists:
with this in place session 2 in scenario 1 would be locked and would report an
error once session 1 commits (that would not be the case should session 1 abort
the transaction).

If the object is dropped before the new lock attempt is triggered then the patch
would also report an error (but with less details).

The patch adds a few tests for some dependency cases (that would currently produce
orphaned objects):

- schema and function (as the above scenarios)
- alter a dependency (function and schema)
- function and arg type
- function and return type
- function and function
- domain and domain
- table and type
- server and foreign data wrapper
---
 src/backend/catalog/dependency.c  |  54 
 src/backend/catalog/objectaddress.c   |  70 ++
 src/backend/catalog/pg_depend.c   |  12 ++
 src/include/catalog/dependency.h  |   1 +
 src/include/catalog/objectaddress.h   |   1 +
 .../expected/test_dependencies_locks.out  | 129 ++
 src/test/isolation/isolation_schedule |   1 +
 .../specs/test_dependencies_locks.spec|  89 
 8 files changed, 357 insertions(+)
  24.1% src/backend/catalog/
  45.9% src/test/isolation/expected/
  28.6% src/test/isolation/specs/

diff --git a/src/backend/catalog/dependency.c b/src/backend/catalog/dependency.c
index d4b5b2ade1..a49357bbe2 100644
--- a/src/backend/catalog/dependency.c
+++ b/src/backend/catalog/dependency.c
@@ -1517,6 +1517,60 @@ AcquireDeletionLock(const ObjectAddress *object, int flags)
 	}
 }
 
+/*
+ * depLockAndCheckObject
+ *
+ * Lock the object that we are about to record a dependency on.
+ * After it's locked, verify that it hasn't been dropped while we
+ * weren't looking.  If the object has been dropped, this function
+ * does not return!
+ */
+void
+depLockAndCheckObject(const ObjectAddress *object)
+{
+	char	   *object_description;
+
+	/*
+	 * Those don't rely on LockDatabaseObject() when being dropped (see
+	 * AcquireDeletionLock()). Also it looks like they can not produce
+	 * orphaned dependent objects when being dropped.
+	 */
+	if (object->classId == RelationRelationId || object->classId == AuthMemRelationId)
+		return;
+
+	object_description = getObjectDescription(object, true);
+
+	/* assume we should lock the whole object not a sub-object */
+	LockDatabaseObject(object->classId, object->objectId, 0, AccessShareLock);
+
+	/* check if object still exists */
+	if (!ObjectByIdExist(object, false))
+	{
+		/*
+		 * It might be possible that we are creating it (for example creating
+		 * a composite type while creating a relation), so bypass the syscache
+		 * lookup and use a dirty snapshot instead to cover this scenario.
+		 */
+		if (!ObjectByIdExist(object, true))
+		{
+			/*
+			 * If the object has been dropped before we get a chance to get
+			 * its description, then

Re: Avoid orphaned objects dependencies, take 3

2024-05-15 Thread Bertrand Drouvot
Hi,

On Tue, May 14, 2024 at 03:00:00PM +0300, Alexander Lakhin wrote:
> Hi Bertrand,
> 
> 09.05.2024 15:20, Bertrand Drouvot wrote:
> > Oh I see, your test updates an existing dependency. v4 took care about 
> > brand new
> > dependencies creation (recordMultipleDependencies()) but forgot to take care
> > about changing an existing dependency (which is done in another code path:
> > changeDependencyFor()).
> > 
> > Please find attached v5 that adds:
> > 
> > - a call to the new depLockAndCheckObject() function in 
> > changeDependencyFor().
> > - a test when altering an existing dependency.
> > 
> > With v5 applied, I don't see the issue anymore.
> 
> Me too. Thank you for the improved version!
> I will test the patch in the background, but for now I see no other
> issues with it.

Thanks for confirming!

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Avoid orphaned objects dependencies, take 3

2024-05-15 Thread Bertrand Drouvot
Hi,

On Wed, May 15, 2024 at 10:14:09AM +0900, Michael Paquier wrote:
> On Thu, May 09, 2024 at 12:20:51PM +0000, Bertrand Drouvot wrote:
> > Oh I see, your test updates an existing dependency. v4 took care about 
> > brand new 
> > dependencies creation (recordMultipleDependencies()) but forgot to take care
> > about changing an existing dependency (which is done in another code path:
> > changeDependencyFor()).
> > 
> > Please find attached v5 that adds:
> > 
> > - a call to the new depLockAndCheckObject() function in 
> > changeDependencyFor().
> > - a test when altering an existing dependency.
> > 
> > With v5 applied, I don't see the issue anymore.
> 
> +if (object_description)
> +ereport(ERROR, errmsg("%s does not exist", 
> object_description));
> +else
> +ereport(ERROR, errmsg("a dependent object does not ex
> 
> This generates an internal error code.  Is that intended?

Thanks for looking at it!

Yes, it's similar to when, say, you want to create an object in a schema that does not
exist (see get_namespace_oid()).
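
For reference, that's the same flavor of error message as, for example:

postgres=# CREATE FUNCTION nosuchschema.f() RETURNS int LANGUAGE SQL AS 'SELECT 1';
ERROR:  schema "nosuchschema" does not exist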

> --- /dev/null
> +++ 
> b/src/test/modules/test_dependencies_locks/specs/test_dependencies_locks.spec 
> 
> This introduces a module with only one single spec.  I could get
> behind an extra module if splitting the tests into more specs makes
> sense or if there is a restriction on installcheck.  However, for 
> one spec file filed with a bunch of objects, and note that I'm OK to
> let this single spec grow more for this range of tests, it seems to me
> that this would be better under src/test/isolation/.

Yeah, I was not sure about this one (the location is from take 2 mentioned
up-thread). I'll look at moving the tests to src/test/isolation/.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Log details for stats dropped more than once

2024-05-15 Thread Bertrand Drouvot
Hi,

On Wed, May 15, 2024 at 02:47:29PM +0900, Michael Paquier wrote:
> On Tue, May 14, 2024 at 10:07:14AM +0000, Bertrand Drouvot wrote:
> > While resuming the work on refilenode stats (mentioned in [1] but did not 
> > share
> > the patch yet), I realized that my current POC patch is buggy enough to 
> > produce
> > things like:
> > 
> > 024-05-14 09:51:14.783 UTC [1788714] FATAL:  can only drop stats once
> > 
> > While the CONTEXT provides the list of dropped stats:
> > 
> > 2024-05-14 09:51:14.783 UTC [1788714] CONTEXT:  WAL redo at 0/D75F478 for 
> > Transaction/ABORT: 2024-05-14 09:51:14.782223+00; dropped stats: 
> > 2/16384/27512/0 2/16384/27515/0 2/16384/27516/0
> 
> Can refcount be useful to know in this errcontext?

Thanks for looking at it!

Do you mean as part of the added errdetail_internal()? If so, yeah I think it's
a good idea (done that way in v2 attached).

> > Attached a tiny patch to report the stat that generates the error. The 
> > patch uses
> > errdetail_internal() as the extra details don't seem to be useful to average
> > users.
> 
> I think that's fine.  Overall that looks like useful information for
> debugging, so no objections from here.

Thanks! BTW, I just realized that adding more details for this error has already
been mentioned in [1] (and Andres did propose a slightly different version).

[1]: 
https://www.postgresql.org/message-id/20240505160915.6boysum4f34siqct%40awork3.anarazel.de

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
>From 0866e8252f2038f3fdbd9cfabb214461689210df Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot 
Date: Tue, 14 May 2024 09:10:50 +
Subject: [PATCH v2] Log details for stats dropped more than once

Adding errdetail_internal() in pgstat_drop_entry_internal() to display the
PgStat_HashKey and the entry's refcount when the "can only drop stats once"
error is reported.
---
 src/backend/utils/activity/pgstat_shmem.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)
 100.0% src/backend/utils/activity/

diff --git a/src/backend/utils/activity/pgstat_shmem.c b/src/backend/utils/activity/pgstat_shmem.c
index 91591da395..0595c08d6e 100644
--- a/src/backend/utils/activity/pgstat_shmem.c
+++ b/src/backend/utils/activity/pgstat_shmem.c
@@ -785,7 +785,12 @@ pgstat_drop_entry_internal(PgStatShared_HashEntry *shent,
 	 * backends to release their references.
 	 */
 	if (shent->dropped)
-		elog(ERROR, "can only drop stats once");
+		ereport(ERROR,
+errmsg("can only drop stats once"),
+errdetail_internal("Stats kind=%s dboid=%u objoid=%u refcount=%u.",
+   pgstat_get_kind_info(shent->key.kind)->name,
+   shent->key.dboid, shent->key.objoid,
+   pg_atomic_read_u32(&shent->refcount)));
 	shent->dropped = true;
 
 	/* release refcount marking entry as not dropped */
-- 
2.34.1



Log details for stats dropped more than once

2024-05-14 Thread Bertrand Drouvot
Hi hackers,

While resuming the work on refilenode stats (mentioned in [1], though I did not share
the patch yet), I realized that my current POC patch is buggy enough to produce
things like:

024-05-14 09:51:14.783 UTC [1788714] FATAL:  can only drop stats once

While the CONTEXT provides the list of dropped stats:

2024-05-14 09:51:14.783 UTC [1788714] CONTEXT:  WAL redo at 0/D75F478 for 
Transaction/ABORT: 2024-05-14 09:51:14.782223+00; dropped stats: 
2/16384/27512/0 2/16384/27515/0 2/16384/27516/0

It's not clear which one generates the error (don't pay attention to the actual
values, the issue comes from the new refilenode stats that I removed from the
output).

Attached is a tiny patch to report the stat that generates the error. The patch uses
errdetail_internal() as the extra details don't seem to be useful to average
users.

[1]: 
https://www.postgresql.org/message-id/ZbIdgTjR2QcFJ2mE%40ip-10-97-1-34.eu-west-3.compute.internal

Looking forward to your feedback,

Regards

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
>From f296d4092f5e87ed63a323db3f1bc9cfa807e2e8 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot 
Date: Tue, 14 May 2024 09:10:50 +
Subject: [PATCH v1] Log details for stats dropped more than once

Adding errdetail_internal() in pgstat_drop_entry_internal() to display the
PgStat_HashKey when the "can only drop stats once" error is reported.
---
 src/backend/utils/activity/pgstat_shmem.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)
 100.0% src/backend/utils/activity/

diff --git a/src/backend/utils/activity/pgstat_shmem.c b/src/backend/utils/activity/pgstat_shmem.c
index 91591da395..1308c4439a 100644
--- a/src/backend/utils/activity/pgstat_shmem.c
+++ b/src/backend/utils/activity/pgstat_shmem.c
@@ -785,7 +785,11 @@ pgstat_drop_entry_internal(PgStatShared_HashEntry *shent,
 	 * backends to release their references.
 	 */
 	if (shent->dropped)
-		elog(ERROR, "can only drop stats once");
+		ereport(ERROR,
+errmsg("can only drop stats once"),
+errdetail_internal("Stats kind=%s dboid=%u objoid=%u.",
+   pgstat_get_kind_info(shent->key.kind)->name,
+   shent->key.dboid, shent->key.objoid));
 	shent->dropped = true;
 
 	/* release refcount marking entry as not dropped */
-- 
2.34.1



Re: Avoid orphaned objects dependencies, take 3

2024-05-09 Thread Bertrand Drouvot
Hi,

On Tue, Apr 30, 2024 at 08:00:00PM +0300, Alexander Lakhin wrote:
> Hi Bertrand,
> 
> But I've discovered yet another possibility to get a broken dependency.

Thanks for the testing and the report!

> Please try this script:
> res=0
> numclients=20
> for ((i=1;i<=100;i++)); do
> for ((c=1;c<=numclients;c++)); do
>   echo "
> CREATE SCHEMA s_$c;
> CREATE CONVERSION myconv_$c FOR 'LATIN1' TO 'UTF8' FROM iso8859_1_to_utf8;
> ALTER CONVERSION myconv_$c SET SCHEMA s_$c;
>   " | psql >psql1-$c.log 2>&1 &
>   echo "DROP SCHEMA s_$c RESTRICT;" | psql >psql2-$c.log 2>&1 &
> done
> wait
> pg_dump -f db.dump || { echo "on iteration $i"; res=1; break; }
> for ((c=1;c<=numclients;c++)); do
>   echo "DROP SCHEMA s_$c CASCADE;" | psql >psql3-$c.log 2>&1
> done
> done
> psql -c "SELECT * FROM pg_conversion WHERE connamespace NOT IN (SELECT oid 
> FROM pg_namespace);"
> 
> It fails for me (with the v4 patch applied) as follows:
> pg_dump: error: schema with OID 16392 does not exist
> on iteration 1
>   oid  | conname  | connamespace | conowner | conforencoding | contoencoding 
> |  conproc  | condefault
> ---+--+--+--++---+---+
>  16396 | myconv_6 |    16392 |   10 |  8 | 6 
> | iso8859_1_to_utf8 | f
> 

Thanks for sharing the test, I'm able to reproduce the issue with v4.

Oh I see, your test updates an existing dependency. v4 took care of brand new
dependency creation (recordMultipleDependencies()) but forgot to take care
of changing an existing dependency (which is done in another code path:
changeDependencyFor()).
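
For reference, the kind of statement in your script that exercises that second
code path is the SET SCHEMA one, e.g.:

ALTER CONVERSION myconv_1 SET SCHEMA s_1;  -- updates the existing namespace dependency via changeDependencyFor()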

Please find attached v5 that adds:

- a call to the new depLockAndCheckObject() function in changeDependencyFor().
- a test when altering an existing dependency.

With v5 applied, I don't see the issue anymore.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
>From 6fc24a798210f730ab04833fa58074f142be968e Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot 
Date: Fri, 29 Mar 2024 15:43:26 +
Subject: [PATCH v5] Avoid orphaned objects dependencies

It's currently possible to create orphaned objects dependencies, for example:

Scenario 1:

session 1: begin; drop schema schem;
session 2: create a function in the schema schem
session 1: commit;

With the above, the function created in session 2 would be linked to a non
existing schema.

Scenario 2:

session 1: begin; create a function in the schema schem
session 2: drop schema schem;
session 1: commit;

With the above, the function created in session 1 would be linked to a non
existing schema.

To avoid those scenarios, a new lock (that conflicts with a lock taken by DROP)
has been put in place when the dependencies are being recorded. With this in
place, the drop schema in scenario 2 would be locked.

Also, after the new lock attempt, the patch checks that the object still exists:
with this in place session 2 in scenario 1 would be locked and would report an
error once session 1 commits (that would not be the case should session 1 abort
the transaction).

If the object is dropped before the new lock attempt is triggered then the patch
would also report an error (but with less details).

The patch adds a few tests for some dependency cases (that would currently produce
orphaned objects):

- schema and function (as the above scenarios)
- alter a dependency (function and schema)
- function and arg type
- function and return type
- function and function
- domain and domain
- table and type
- server and foreign data wrapper
---
 src/backend/catalog/dependency.c  |  54 
 src/backend/catalog/objectaddress.c   |  70 ++
 src/backend/catalog/pg_depend.c   |  12 ++
 src/include/catalog/dependency.h  |   1 +
 src/include/catalog/objectaddress.h   |   1 +
 src/test/modules/Makefile |   1 +
 src/test/modules/meson.build  |   1 +
 .../test_dependencies_locks/.gitignore|   3 +
 .../modules/test_dependencies_locks/Makefile  |  14 ++
 .../expected/test_dependencies_locks.out  | 129 ++
 .../test_dependencies_locks/meson.build   |  12 ++
 .../specs/test_dependencies_locks.spec|  89 
 12 files changed, 387 insertions(+)
  22.9% src/backend/catalog/
  43.7% src/test/modules/test_dependencies_locks/expected/
  27.2% src/test/modules/test_dependencies_locks/specs/
   4.5% src/test/modules/test_dependencies_locks/

diff --git a/src/backend/catalog/dependency.c b/src/backend/catalog/dependency.c
index d4b5b2ade1..a49357bbe2 100644
--- a/src/backend/catalog/dependency.c
+++ b/src/backend/catalog/dependency.c
@@ -15

Re: First draft of PG 17 release notes

2024-05-08 Thread Bertrand Drouvot
Hi,

On Thu, May 09, 2024 at 12:03:50AM -0400, Bruce Momjian wrote:
> I have committed the first draft of the PG 17 release notes;  you can
> see the results here:
> 
>   https://momjian.us/pgsql_docs/release-17.html

Thanks for working on that!
 
> I welcome feedback.

> Add system view pg_wait_events that reports wait event types (Michael 
> Paquier) 

Michael is the committer for 1e68e43d3f; the author is me.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Synchronizing slots from primary to standby

2024-05-08 Thread Bertrand Drouvot
Hi,

On Mon, Apr 29, 2024 at 11:58:09AM +, Zhijie Hou (Fujitsu) wrote:
> On Monday, April 29, 2024 5:11 PM shveta malik  wrote:
> > 
> > On Mon, Apr 29, 2024 at 11:38 AM shveta malik 
> > wrote:
> > >
> > > On Mon, Apr 29, 2024 at 10:57 AM Zhijie Hou (Fujitsu)
> > >  wrote:
> > > >
> > > > On Friday, March 15, 2024 10:45 PM Bertrand Drouvot
> >  wrote:
> > > > >
> > > > > Hi,
> > > > >
> > > > > On Thu, Mar 14, 2024 at 02:22:44AM +, Zhijie Hou (Fujitsu) wrote:
> > > > > > Hi,
> > > > > >
> > > > > > Since the standby_slot_names patch has been committed, I am
> > > > > > attaching the last doc patch for review.
> > > > > >
> > > > >
> > > > > Thanks!
> > > > >
> > > > > 1 ===
> > > > >
> > > > > +   continue subscribing to publications now on the new primary
> > > > > + server
> > > > > without
> > > > > +   any data loss.
> > > > >
> > > > > I think "without any data loss" should be re-worded in this
> > > > > context. Data loss in the sense "data committed on the primary and
> > > > > not visible on the subscriber in case of failover" can still occurs 
> > > > > (in case
> > synchronous replication is not used).
> > > > >
> > > > > 2 ===
> > > > >
> > > > > +   If the result (failover_ready) of both above 
> > > > > steps is
> > > > > +   true, existing subscriptions will be able to continue without data
> > loss.
> > > > > +  
> > > > >
> > > > > I don't think that's true if synchronous replication is not used.
> > > > > Say,
> > > > >
> > > > > - synchronous replication is not used
> > > > > - primary is not able to reach the standby anymore and
> > > > > standby_slot_names is set
> > > > > - new data is inserted into the primary
> > > > > - then not replicated to subscriber (due to standby_slot_names)
> > > > >
> > > > > Then I think the both above steps will return true but data would
> > > > > be lost in case of failover.
> > > >
> > > > Thanks for the comments, attach the new version patch which reworded
> > > > the above places.

Thanks!

> Here is the V3 doc patch.

Thanks! A few comments:

1 ===

+   losing any data that has been flushed to the new primary server.

It might be worth adding a few words about possible data loss, something like:

Please note that in case synchronous replication is not used and standby_slot_names
is set correctly, it might still be possible to lose data that had been committed
on the old primary server (in case the standby was not reachable during that time,
for example).

2 ===

+test_sub=# SELECT
+   array_agg(slotname) AS slots
+   FROM
+   ((
+   SELECT r.srsubid AS subid, CONCAT('pg_', srsubid, '_sync_', 
srrelid, '_', ctl.system_identifier) AS slotname
+   FROM pg_control_system() ctl, pg_subscription_rel r, 
pg_subscription s
+   WHERE r.srsubstate = 'f' AND s.oid = r.srsubid AND s.subfailover
+   ) UNION (

I guess this format comes from ReplicationSlotNameForTablesync(). What about
creating a SQL-callable function on top of it and making use of it in the query
above? (That would ensure the doc stays up to date even if the format changes
in ReplicationSlotNameForTablesync().)
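
Something along these lines, where pg_tablesync_slot_name() is a purely
hypothetical wrapper used only to illustrate the idea (it does not exist today):

SELECT pg_tablesync_slot_name(r.srsubid, r.srrelid) AS slotname
FROM pg_subscription_rel r
     JOIN pg_subscription s ON s.oid = r.srsubid
WHERE r.srsubstate = 'f' AND s.subfailover;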

3 ===

+test_sub=# SELECT
+   MAX(remote_lsn) AS remote_lsn_on_subscriber
+   FROM
+   ((
+   SELECT (CASE WHEN r.srsubstate = 'f' THEN 
pg_replication_origin_progress(CONCAT('pg_', r.srsubid, '_', r.srrelid), false)
+   WHEN r.srsubstate IN ('s', 'r') THEN r.srsublsn 
END) AS remote_lsn
+   FROM pg_subscription_rel r, pg_subscription s
+   WHERE r.srsubstate IN ('f', 's', 'r') AND s.oid = r.srsubid AND 
s.subfailover
+   ) UNION (
+   SELECT pg_replication_origin_progress(CONCAT('pg_', s.oid), 
false) AS remote_lsn
+   FROM pg_subscription s
+   WHERE s.subfailover
+   ));

What about adding a join to pg_replication_origin to get rid of the "hardcoded"
format "CONCAT('pg_', r.srsubid, '_', r.srrelid)" and "CONCAT('pg_', s.oid)"?

The idea behind 2 === and 3 === is to have the queries as generic as possible and
not rely on a hardcoded format (which would be more difficult to maintain should
those formats change in the future).

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Avoid orphaned objects dependencies, take 3

2024-04-25 Thread Bertrand Drouvot
Hi,

On Thu, Apr 25, 2024 at 09:00:00AM +0300, Alexander Lakhin wrote:
> Hi,
> 
> 25.04.2024 08:00, Bertrand Drouvot wrote:
> > 
> > > though concurrent create/drop operations can still produce
> > > the "cache lookup failed" error, which is probably okay, except that it is
> > > an INTERNAL_ERROR, which assumed to be not easily triggered by users.
> > I did not see any of those "cache lookup failed" during my testing 
> > with/without
> > your script. During which test(s) did you see those with v3 applied?
> 
> You can try, for example, table-trigger, or other tests that check for
> "cache lookup failed" psql output only (maybe you'll need to increase the
> iteration count). For instance, I've got (with v4 applied):
> 2024-04-25 05:48:08.102 UTC [3638763] ERROR:  cache lookup failed for 
> function 20893
> 2024-04-25 05:48:08.102 UTC [3638763] STATEMENT:  CREATE TRIGGER modified_c1 
> BEFORE UPDATE OF c ON t
>     FOR EACH ROW WHEN (OLD.c <> NEW.c) EXECUTE PROCEDURE 
> trigger_func('modified_c');
> 
> Or with function-function:
> 2024-04-25 05:52:31.390 UTC [3711897] ERROR:  cache lookup failed for 
> function 32190 at character 54
> 2024-04-25 05:52:31.390 UTC [3711897] STATEMENT:  CREATE FUNCTION f1() 
> RETURNS int LANGUAGE SQL RETURN f() + 1;
> --
> 2024-04-25 05:52:37.639 UTC [3720011] ERROR:  cache lookup failed for 
> function 34465 at character 54
> 2024-04-25 05:52:37.639 UTC [3720011] STATEMENT:  CREATE FUNCTION f1() 
> RETURNS int LANGUAGE SQL RETURN f() + 1;

I see, so this is during object creation.

It's easy to reproduce this kind of issue with gdb. For example, set a breakpoint
on SearchSysCache1() and, during the CREATE FUNCTION f1(), once it breaks on:

#0  SearchSysCache1 (cacheId=45, key1=16400) at syscache.c:221
#1  0x5ad305beacd6 in func_get_detail (funcname=0x5ad308204d50, fargs=0x0, 
fargnames=0x0, nargs=0, argtypes=0x72ff9cc0, expand_variadic=true, 
expand_defaults=true, include_out_arguments=false, funcid=0x72ff9ba0, 
rettype=0x72ff9b9c, retset=0x72ff9b94, nvargs=0x72ff9ba4,
vatype=0x72ff9ba8, true_typeids=0x72ff9bd8, 
argdefaults=0x72ff9be0) at parse_func.c:1622
#2  0x5ad305be7dd0 in ParseFuncOrColumn (pstate=0x5ad30823be98, 
funcname=0x5ad308204d50, fargs=0x0, last_srf=0x0, fn=0x5ad308204da0, 
proc_call=false, location=55) at parse_func.c:266
#3  0x5ad305bdffb0 in transformFuncCall (pstate=0x5ad30823be98, 
fn=0x5ad308204da0) at parse_expr.c:1474
#4  0x5ad305bdd2ee in transformExprRecurse (pstate=0x5ad30823be98, 
expr=0x5ad308204da0) at parse_expr.c:230
#5  0x5ad305bdec34 in transformAExprOp (pstate=0x5ad30823be98, 
a=0x5ad308204e20) at parse_expr.c:990
#6  0x5ad305bdd1a0 in transformExprRecurse (pstate=0x5ad30823be98, 
expr=0x5ad308204e20) at parse_expr.c:187
#7  0x5ad305bdd00b in transformExpr (pstate=0x5ad30823be98, 
expr=0x5ad308204e20, exprKind=EXPR_KIND_SELECT_TARGET) at parse_expr.c:131
#8  0x5ad305b96b7e in transformReturnStmt (pstate=0x5ad30823be98, 
stmt=0x5ad308204ee0) at analyze.c:2395
#9  0x5ad305b92213 in transformStmt (pstate=0x5ad30823be98, 
parseTree=0x5ad308204ee0) at analyze.c:375
#10 0x5ad305c6321a in interpret_AS_clause (languageOid=14, 
languageName=0x5ad308204c40 "sql", funcname=0x5ad308204ad8 "f100", as=0x0, 
sql_body_in=0x5ad308204ee0, parameterTypes=0x0, inParameterNames=0x0, 
prosrc_str_p=0x72ffa208, probin_str_p=0x72ffa200, 
sql_body_out=0x72ffa210,
queryString=0x5ad3082040b0 "CREATE FUNCTION f100() RETURNS int LANGUAGE SQL 
RETURN f() + 1;") at functioncmds.c:953
#11 0x5ad305c63c93 in CreateFunction (pstate=0x5ad308186310, 
stmt=0x5ad308204f00) at functioncmds.c:1221

then drop function f() in another session. The CREATE FUNCTION would then fail with:

postgres=# CREATE FUNCTION f() RETURNS int LANGUAGE SQL RETURN f() + 1;
ERROR:  cache lookup failed for function 16400

This error occurs before we get a chance to call the new depLockAndCheckObject()
function.

I think this is what Tom was referring to in [1]:

"
So the only real fix for this would be to make every object lookup in the entire
system do the sort of dance that's done in RangeVarGetRelidExtended.
"

The fact that those kinds of errors appear also somehow ensures that no orphaned
dependencies can be created.

The patch does ensure that no orphaned dependencies can occur after those "initial"
lookups are done (should the dependent object be dropped).

I'm tempted not to add extra complexity to avoid those kinds of errors and to keep the
patch as it is. All of those serve the same goal: no orphaned dependencies are
created.

[1]: https://www.postgresql.org/message-id/2872252.1630851337%40sss.pgh.pa.us

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Avoid orphaned objects dependencies, take 3

2024-04-24 Thread Bertrand Drouvot
Hi,

On Wed, Apr 24, 2024 at 03:00:00PM +0300, Alexander Lakhin wrote:
> 24.04.2024 11:38, Bertrand Drouvot wrote:
> > I gave more thought to v2 and the approach seems reasonable to me. 
> > Basically what
> > it does is that in case the object is already dropped before we take the 
> > new lock
> > (while recording the dependency) then the error message is a "generic" one 
> > (means
> > it does not provide the description of the "already" dropped object). I 
> > think it
> > makes sense to write the patch that way by abandoning the patch's ambition 
> > to
> > tell the description of the dropped object in all the cases.
> > 
> > Of course, I would be happy to hear others thought about it.
> > 
> > Please find v3 attached (which is v2 with more comments).
> 
> Thank you for the improved version!
> 
> I can confirm that it fixes that case.

Great, thanks for the test!

> I've also tested other cases that fail on master (most of them also fail
> with v1), please try/look at the attached script.

Thanks for all those tests!

> (There could be other broken-dependency cases, of course, but I think I've
> covered all the representative ones.)

Agree. Furthermore, the way the patch is written should be agnostic to the kinds
of objects that are part of the dependency. Having said that, it does not
hurt to add more tests in this patch, so the attached v4 adds some of your tests
(that would fail on the master branch without the patch applied).

The way the tests are written in the patch is less "racy" than when triggered
with your script. Indeed, I think that in the patch the dependent object can not
be removed before the new lock is taken when recording the dependency. We may
want to bring injection points into the game if we feel the need.

> All tested cases work correctly with v3 applied —

Yeah, same on my side, I did run them too and did not find any issues.

> I couldn't get broken
> dependencies,

Same here.

> though concurrent create/drop operations can still produce
> the "cache lookup failed" error, which is probably okay, except that it is
> an INTERNAL_ERROR, which assumed to be not easily triggered by users.

I did not see any of those "cache lookup failed" during my testing with/without
your script. During which test(s) did you see those with v3 applied?

Attached v4, simply adding more tests to v3.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
>From a9b34955fab0351b7b5a7ba6eb36f199f5a5822c Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot 
Date: Fri, 29 Mar 2024 15:43:26 +
Subject: [PATCH v4] Avoid orphaned objects dependencies

It's currently possible to create orphaned objects dependencies, for example:

Scenario 1:

session 1: begin; drop schema schem;
session 2: create a function in the schema schem
session 1: commit;

With the above, the function created in session 2 would be linked to a non
existing schema.

Scenario 2:

session 1: begin; create a function in the schema schem
session 2: drop schema schem;
session 1: commit;

With the above, the function created in session 1 would be linked to a non
existing schema.

To avoid those scenarios, a new lock (that conflicts with a lock taken by DROP)
has been put in place when the dependencies are being recorded. With this in
place, the drop schema in scenario 2 would be locked.

Also, after the new lock attempt, the patch checks that the object still exists:
with this in place session 2 in scenario 1 would be locked and would report an
error once session 1 commits (that would not be the case should session 1 abort
the transaction).

If the object is dropped before the new lock attempt is triggered then the patch
would also report an error (but with less details).

The patch adds a few tests for some dependency cases (that would currently produce
orphaned objects):

- schema and function (as the above scenarios)
- function and arg type
- function and return type
- function and function
- domain and domain
- table and type
- server and foreign data wrapper
---
 src/backend/catalog/dependency.c  |  54 +
 src/backend/catalog/objectaddress.c   |  70 +++
 src/backend/catalog/pg_depend.c   |   6 +
 src/include/catalog/dependency.h  |   1 +
 src/include/catalog/objectaddress.h   |   1 +
 src/test/modules/Makefile |   1 +
 src/test/modules/meson.build  |   1 +
 .../test_dependencies_locks/.gitignore|   3 +
 .../modules/test_dependencies_locks/Makefile  |  14 +++
 .../expected/test_dependencies_locks.out  | 113 ++
 .../test_dependencies_locks/meson.build   |  12 ++
 .../specs/test_dependencies_locks.spec|  78 
 12 files changed

Re: Avoid orphaned objects dependencies, take 3

2024-04-24 Thread Bertrand Drouvot
Hi,

On Tue, Apr 23, 2024 at 04:20:46PM +, Bertrand Drouvot wrote:
> Please find attached v2 that should not produce the issue anymore (I launched 
> a
> lot of attempts without any issues). v1 was not strong enough as it was not
> always checking for the dependent object existence. v2 now always checks if 
> the
> object still exists after the additional lock acquisition attempt while 
> recording
> the dependency.
> 
> I still need to think about v2 but in the meantime could you please also give
> v2 a try on you side?

I gave more thought to v2 and the approach seems reasonable to me. Basically what
it does is that, in case the object is already dropped before we take the new lock
(while recording the dependency), then the error message is a "generic" one (meaning
it does not provide the description of the "already" dropped object). I think it
makes sense to write the patch that way, abandoning the patch's ambition to
report the description of the dropped object in all cases.

Of course, I would be happy to hear others' thoughts about it.

Please find v3 attached (which is v2 with more comments).

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
>From 32945912ceddad6b171fd8b345adefa4432af357 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot 
Date: Fri, 29 Mar 2024 15:43:26 +
Subject: [PATCH v3] Avoid orphaned objects dependencies

It's currently possible to create orphaned objects dependencies, for example:

Scenario 1:

session 1: begin; drop schema schem;
session 2: create a function in the schema schem
session 1: commit;

With the above, the function created in session 2 would be linked to a non
existing schema.

Scenario 2:

session 1: begin; create a function in the schema schem
session 2: drop schema schem;
session 1: commit;

With the above, the function created in session 1 would be linked to a non
existing schema.

To avoid those scenarios, a new lock (that conflicts with a lock taken by DROP)
has been put in place when the dependencies are being recorded. With this in
place, the drop schema in scenario 2 would be locked.

Also, after locking the object, the patch checks that the object still exists:
with this in place session 2 in scenario 1 would be locked and would report an
error once session 1 commits (that would not be the case should session 1 abort
the transaction).

The patch adds a few tests for some dependency cases (that would currently produce
orphaned objects):

- schema and function (as the above scenarios)
- function and type
- table and type
---
 src/backend/catalog/dependency.c  | 54 ++
 src/backend/catalog/objectaddress.c   | 70 +++
 src/backend/catalog/pg_depend.c   |  6 ++
 src/include/catalog/dependency.h  |  1 +
 src/include/catalog/objectaddress.h   |  1 +
 src/test/modules/Makefile |  1 +
 src/test/modules/meson.build  |  1 +
 .../test_dependencies_locks/.gitignore|  3 +
 .../modules/test_dependencies_locks/Makefile  | 14 
 .../expected/test_dependencies_locks.out  | 49 +
 .../test_dependencies_locks/meson.build   | 12 
 .../specs/test_dependencies_locks.spec| 39 +++
 12 files changed, 251 insertions(+)
  40.5% src/backend/catalog/
  29.6% src/test/modules/test_dependencies_locks/expected/
  18.7% src/test/modules/test_dependencies_locks/specs/
   8.3% src/test/modules/test_dependencies_locks/

diff --git a/src/backend/catalog/dependency.c b/src/backend/catalog/dependency.c
index d4b5b2ade1..a49357bbe2 100644
--- a/src/backend/catalog/dependency.c
+++ b/src/backend/catalog/dependency.c
@@ -1517,6 +1517,60 @@ AcquireDeletionLock(const ObjectAddress *object, int flags)
 	}
 }
 
+/*
+ * depLockAndCheckObject
+ *
+ * Lock the object that we are about to record a dependency on.
+ * After it's locked, verify that it hasn't been dropped while we
+ * weren't looking.  If the object has been dropped, this function
+ * does not return!
+ */
+void
+depLockAndCheckObject(const ObjectAddress *object)
+{
+	char	   *object_description;
+
+	/*
+	 * Those don't rely on LockDatabaseObject() when being dropped (see
+	 * AcquireDeletionLock()). Also it looks like they can not produce
+	 * orphaned dependent objects when being dropped.
+	 */
+	if (object->classId == RelationRelationId || object->classId == AuthMemRelationId)
+		return;
+
+	object_description = getObjectDescription(object, true);
+
+	/* assume we should lock the whole object not a sub-object */
+	LockDatabaseObject(object->classId, object->objectId, 0, AccessShareLock);
+
+	/* check if object still exists */
+	if (!ObjectByIdExist(object, false))
+	{
+		/*
+		 * It might be possible that we are creating it (for example creating
+		 * a composite type while creating a relation), so bypass the syscache

Re: Avoid orphaned objects dependencies, take 3

2024-04-23 Thread Bertrand Drouvot
Hi,

On Tue, Apr 23, 2024 at 04:59:09AM +, Bertrand Drouvot wrote:
> Hi,
> 
> On Mon, Apr 22, 2024 at 03:00:00PM +0300, Alexander Lakhin wrote:
> > 22.04.2024 13:52, Bertrand Drouvot wrote:
> > > 
> > > That's weird, I just launched it several times with the patch applied and 
> > > I'm not
> > > able to see the seg fault (while I can see it constently failing on the 
> > > master
> > > branch).
> > > 
> > > Are you 100% sure you tested it against a binary with the patch applied?
> > > 
> > 
> > Yes, at least I can't see what I'm doing wrong. Please try my
> > self-contained script attached.
> 
> Thanks for sharing your script!
> 
> Yeah your script ensures the patch is applied before the repro is executed.
> 
> I do confirm that I can also see the issue with the patch applied (but I had 
> to
> launch multiple attempts, while on master one attempt is enough).
> 
> I'll have a look.

Please find attached v2, which should not produce the issue anymore (I launched a
lot of attempts without any issues). v1 was not strong enough as it was not
always checking for the dependent object's existence. v2 now always checks if the
object still exists after the additional lock acquisition attempt while recording
the dependency.

I still need to think about v2, but in the meantime could you please also give
v2 a try on your side?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
>From f80fdfc32e791463555aa318f26ff5e7363ac3ac Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot 
Date: Fri, 29 Mar 2024 15:43:26 +
Subject: [PATCH v2] Avoid orphaned objects dependencies

It's currently possible to create orphaned objects dependencies, for example:

Scenario 1:

session 1: begin; drop schema schem;
session 2: create a function in the schema schem
session 1: commit;

With the above, the function created in session 2 would be linked to a non
existing schema.

Scenario 2:

session 1: begin; create a function in the schema schem
session 2: drop schema schem;
session 1: commit;

With the above, the function created in session 1 would be linked to a non
existing schema.

To avoid those scenarios, a new lock (that conflicts with a lock taken by DROP)
has been put in place when the dependencies are being recorded. With this in
place, the drop schema in scenario 2 would be locked.

Also, after locking the object, the patch checks that the object still exists:
with this in place session 2 in scenario 1 would be locked and would report an
error once session 1 commits (that would not be the case should session 1 abort
the transaction).

The patch adds a few tests for some dependency cases (that would currently produce
orphaned objects):

- schema and function (as the above scenarios)
- function and type
- table and type
---
 src/backend/catalog/dependency.c  | 48 +
 src/backend/catalog/objectaddress.c   | 70 +++
 src/backend/catalog/pg_depend.c   |  6 ++
 src/include/catalog/dependency.h  |  1 +
 src/include/catalog/objectaddress.h   |  1 +
 src/test/modules/Makefile |  1 +
 src/test/modules/meson.build  |  1 +
 .../test_dependencies_locks/.gitignore|  3 +
 .../modules/test_dependencies_locks/Makefile  | 14 
 .../expected/test_dependencies_locks.out  | 49 +
 .../test_dependencies_locks/meson.build   | 12 
 .../specs/test_dependencies_locks.spec| 39 +++
 12 files changed, 245 insertions(+)
  38.1% src/backend/catalog/
  30.7% src/test/modules/test_dependencies_locks/expected/
  19.5% src/test/modules/test_dependencies_locks/specs/
   8.7% src/test/modules/test_dependencies_locks/

diff --git a/src/backend/catalog/dependency.c b/src/backend/catalog/dependency.c
index d4b5b2ade1..e3b66025dd 100644
--- a/src/backend/catalog/dependency.c
+++ b/src/backend/catalog/dependency.c
@@ -1517,6 +1517,54 @@ AcquireDeletionLock(const ObjectAddress *object, int flags)
 	}
 }
 
+/*
+ * depLockAndCheckObject
+ *
+ * Lock the object that we are about to record a dependency on.
+ * After it's locked, verify that it hasn't been dropped while we
+ * weren't looking.  If the object has been dropped, this function
+ * does not return!
+ */
+void
+depLockAndCheckObject(const ObjectAddress *object)
+{
+	char	   *object_description;
+
+	/*
+	 * Those don't rely on LockDatabaseObject() when being dropped (see
+	 * AcquireDeletionLock()). Also it looks like they can not produce
+	 * orphaned dependent objects when being dropped.
+	 */
+	if (object->classId == RelationRelationId || object->classId == AuthMemRelationId)
+		return;
+
+	object_description = getObjectDescription(object, true);
+
+	/* assume we should lock the whole object not a sub-object */

Re: Avoid orphaned objects dependencies, take 3

2024-04-22 Thread Bertrand Drouvot
Hi,

On Mon, Apr 22, 2024 at 03:00:00PM +0300, Alexander Lakhin wrote:
> 22.04.2024 13:52, Bertrand Drouvot wrote:
> > 
> > That's weird, I just launched it several times with the patch applied and 
> > I'm not
> > able to see the seg fault (while I can see it constently failing on the 
> > master
> > branch).
> > 
> > Are you 100% sure you tested it against a binary with the patch applied?
> > 
> 
> Yes, at least I can't see what I'm doing wrong. Please try my
> self-contained script attached.

Thanks for sharing your script!

Yeah your script ensures the patch is applied before the repro is executed.

I do confirm that I can also see the issue with the patch applied (but I had to
launch multiple attempts, while on master one attempt is enough).

I'll have a look.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Avoid orphaned objects dependencies, take 3

2024-04-22 Thread Bertrand Drouvot
Hi,

On Mon, Apr 22, 2024 at 01:00:00PM +0300, Alexander Lakhin wrote:
> Hi Bertrand,
> 
> 22.04.2024 11:45, Bertrand Drouvot wrote:
> > Hi,
> > 
> > This new thread is a follow-up of [1] and [2].
> > 
> > Problem description:
> > 
> > We have occasionally observed objects having an orphaned dependency, the
> > most common case we have seen is functions not linked to any namespaces.
> > 
> > ...
> > 
> > Looking forward to your feedback,
> 
> This have reminded me of bug #17182 [1].

Thanks for the link to the bug!

> Unfortunately, with the patch applied, the following script:
> 
> for ((i=1;i<=100;i++)); do
>   ( { for ((n=1;n<=20;n++)); do echo "DROP SCHEMA s;"; done } | psql ) 
> >psql1.log 2>&1 &
>   echo "
> CREATE SCHEMA s;
> CREATE FUNCTION s.func1() RETURNS int LANGUAGE SQL AS 'SELECT 1;';
> CREATE FUNCTION s.func2() RETURNS int LANGUAGE SQL AS 'SELECT 2;';
> CREATE FUNCTION s.func3() RETURNS int LANGUAGE SQL AS 'SELECT 3;';
> CREATE FUNCTION s.func4() RETURNS int LANGUAGE SQL AS 'SELECT 4;';
> CREATE FUNCTION s.func5() RETURNS int LANGUAGE SQL AS 'SELECT 5;';
>   "  | psql >psql2.log 2>&1 &
>   wait
>   psql -c "DROP SCHEMA s CASCADE" >psql3.log
> done
> echo "
> SELECT pg_identify_object('pg_proc'::regclass, pp.oid, 0), pp.oid FROM 
> pg_proc pp
>   LEFT JOIN pg_namespace pn ON pp.pronamespace = pn.oid WHERE pn.oid IS NULL" 
> | psql
> 
> still ends with:
> server closed the connection unexpectedly
>     This probably means the server terminated abnormally
>     before or while processing the request.
> 
> 2024-04-22 09:54:39.171 UTC|||662633dc.152bbc|LOG:  server process (PID
> 1388378) was terminated by signal 11: Segmentation fault
> 2024-04-22 09:54:39.171 UTC|||662633dc.152bbc|DETAIL:  Failed process was
> running: SELECT pg_identify_object('pg_proc'::regclass, pp.oid, 0), pp.oid
> FROM pg_proc pp
>   LEFT JOIN pg_namespace pn ON pp.pronamespace = pn.oid WHERE pn.oid IS 
> NULL
> 

Thanks for sharing the script.

That's weird, I just launched it several times with the patch applied and I'm not
able to see the seg fault (while I can see it consistently failing on the master
branch).

Are you 100% sure you tested it against a binary with the patch applied?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Disallow changing slot's failover option in transaction block

2024-04-22 Thread Bertrand Drouvot
Hi,

On Mon, Apr 22, 2024 at 11:32:14AM +0530, shveta malik wrote:
> On Mon, Apr 22, 2024 at 5:57 AM Zhijie Hou (Fujitsu)
>  wrote:
> > Attach the V3 patch which also addressed Shveta[1] and Bertrand[2]'s 
> > comments.

Thanks!

>  Tested the patch, works well.

Same here.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Patch to avoid orphaned dependencies

2024-04-22 Thread Bertrand Drouvot
Hi,

On Wed, Mar 23, 2022 at 12:49:06PM -0400, Tom Lane wrote:
> Realistically, if we want to prevent this type of problem, then all
> creation DDL will have to take a lock on each referenced object that'd
> conflict with a lock taken by DROP.  This might not be out of reach:
> I think we do already take such locks while dropping objects.  The
> reference-side lock could be taken by the recordDependency mechanism
> itself, ensuring that we don't miss anything; and that would also
> allow us to not bother taking such a lock on pinned objects, which'd
> greatly cut the cost (though not to zero).

Thanks for the idea (and sorry for the delay replying to it)! I had a look at it
and just created a new thread [1] based on your proposal.

[1]: 
https://www.postgresql.org/message-id/flat/ZiYjn0eVc7pxVY45%40ip-10-97-1-34.eu-west-3.compute.internal

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Avoid orphaned objects dependencies, take 3

2024-04-22 Thread Bertrand Drouvot
Hi,

This new thread is a follow-up of [1] and [2].

Problem description:

We have occasionally observed objects having an orphaned dependency, the 
most common case we have seen is functions not linked to any namespaces.

Examples to produce such orphaned dependencies:

Scenario 1:

session 1: begin; drop schema schem;
session 2: create a function in the schema schem
session 1: commit;

With the above, the function created in session 2 would be linked to a
non-existing schema.

Scenario 2:

session 1: begin; create a function in the schema schem
session 2: drop schema schem;
session 1: commit;

With the above, the function created in session 1 would be linked to a
non-existing schema.
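
To make the race concrete, here is a minimal SQL sketch of scenario 1 (the
schema and function names are illustrative, and the interleaving has to be
driven manually from two psql sessions):

-- session 1
BEGIN;
DROP SCHEMA schem;                 -- schema assumed empty at this point

-- session 2 (before session 1 commits; it does not block, because recording
-- the function's dependency on the schema takes no lock on the schema)
CREATE FUNCTION schem.f() RETURNS int LANGUAGE SQL AS 'SELECT 1';

-- session 1
COMMIT;

-- schem.f() is now orphaned: its pg_proc row points to a dropped namespace
SELECT pp.oid, pp.proname
FROM pg_proc pp
LEFT JOIN pg_namespace pn ON pp.pronamespace = pn.oid
WHERE pn.oid IS NULL;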

A patch was initially proposed to fix this particular
(function-to-namespace) dependency (see [1]), but there could be many
more scenarios (like the function-to-datatype one highlighted by Gilles
in [1] that could lead to a function having an invalid parameter datatype).

As Tom said, there are dozens more cases that would need to be handled,
so a global approach to avoid those race conditions should be considered
instead.

A first global approach attempt has been proposed in [2] making use of a dirty
snapshot when recording the dependency. But this approach appeared to be "scary"
and it was still failing to close some race conditions (see [2] for details).

Then, Tom proposed another approach in [2] which is that "creation DDL will have
to take a lock on each referenced object that'd conflict with a lock taken by
DROP".

This is what the attached patch is trying to achieve.

It does the following:

1) A new lock (that conflicts with a lock taken by DROP) has been put in place
when the dependencies are being recorded.

Thanks to it, the drop schema in scenario 2 would be locked (resulting in an
error should session 1 commit).

2) After locking the object while recording the dependency, the patch checks
that the object still exists.

Thanks to it, session 2 in scenario 1 would be locked and would report an error
once session 1 commits (that would not be the case should session 1 abort the
transaction).

The patch also adds a few tests for some dependency cases (that would currently
produce orphaned objects):

- schema and function (as the above scenarios)
- function and type
- table and type (which I think is problematic enough, since it brings a table
into the game, to justify fixing this as a whole).

[1]: 
https://www.postgresql.org/message-id/flat/a4f55089-7cbd-fe5d-a9bb-19adc6418...@darold.net#9af5cdaa9e80879beb1def3604c976e8
[2]: 
https://www.postgresql.org/message-id/flat/8369ff70-0e31-f194-2954-787f4d9e21dd%40amazon.com

Please note that I'm not used to this area of the code, so the patch
might not take the option proposed by Tom the "right" way.

Adding the patch to the July CF.

Looking forward to your feedback,

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
>From 6798e85b0679dfccfa665007b06d9f1a0e39e0c6 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot 
Date: Fri, 29 Mar 2024 15:43:26 +
Subject: [PATCH v1] Avoid orphaned objects dependencies

It's currently possible to create orphaned objects dependencies, for example:

Scenario 1:

session 1: begin; drop schema schem;
session 2: create a function in the schema schem
session 1: commit;

With the above, the function created in session 2 would be linked to a non
existing schema.

Scenario 2:

session 1: begin; create a function in the schema schem
session 2: drop schema schem;
session 1: commit;

With the above, the function created in session 1 would be linked to a non
existing schema.

To avoid those scenarios, a new lock (that conflicts with a lock taken by DROP)
has been put in place when the dependencies are being recorded. With this in
place, the drop schema in scenario 2 would be locked.

Also, after locking the object, the patch checks that the object still exists:
with this in place session 2 in scenario 1 would be locked and would report an
error once session 1 commits (that would not be the case should session 1 abort
the transaction).

The patch adds a few tests for some dependency cases (that would currently produce
orphaned objects):

- schema and function (as the above scenarios)
- function and type
- table and type
---
 src/backend/catalog/dependency.c  | 42 ++
 src/backend/catalog/objectaddress.c   | 57 +++
 src/backend/catalog/pg_depend.c   |  6 ++
 src/include/catalog/dependency.h  |  1 +
 src/include/catalog/objectaddress.h   |  1 +
 src/test/modules/Makefile |  1 +
 src/test/modules/meson.build  |  1 +
 .../test_dependencies_locks/.gitignore|  3 +
 .../modules/test_dependencies_locks/Makefile  | 14 +
 .../expected/test_dependencies_locks.out  | 49 
 .../te

Re: Disallow changing slot's failover option in transaction block

2024-04-19 Thread Bertrand Drouvot
Hi,

On Fri, Apr 19, 2024 at 12:39:40AM +, Zhijie Hou (Fujitsu) wrote:
> Here is V2 patch which includes the changes for CREATE SUBSCRIPTION as
> suggested. Since we don't connect pub to alter slot when (create_slot=false)
> anymore, the restriction that disallows failover=true when connect=false is
> also removed.

Thanks!

+  specified in the subscription. When creating the slot, ensure the 
slot
+  property failover matches the counterpart 
parameter
+  of the subscription.

The slot could be created before or after the subscription is created, so I think
it needs a bit of rewording (as here it sounds like the sub is already created),
something like:

"Always ensure the slot property failover matches the
counterpart parameter of the subscription and vice versa."
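
As an illustration of keeping the two in sync when the slot is created first
(the slot, subscription, publication names and the connection string below are
all made up; this is just a sketch):

-- on the publisher: create the slot with failover enabled
SELECT pg_create_logical_replication_slot('mysub_slot', 'pgoutput',
                                          false,   -- temporary
                                          false,   -- twophase
                                          true);   -- failover

-- on the subscriber: use that slot and keep failover consistent
CREATE SUBSCRIPTION mysub
    CONNECTION 'host=primary dbname=postgres'
    PUBLICATION mypub
    WITH (create_slot = false, slot_name = 'mysub_slot', failover = true);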

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: promotion related handling in pg_sync_replication_slots()

2024-04-18 Thread Bertrand Drouvot
Hi,

On Thu, Apr 18, 2024 at 05:36:05PM +0530, shveta malik wrote:
> Please find v8 attached. Changes are:

Thanks!

A few comments:

1 ===

@@ -1440,7 +1461,7 @@ ReplSlotSyncWorkerMain(char *startup_data, size_t 
startup_data_len)
 * slotsync_worker_onexit() but that will need the connection to be made
 * global and we want to avoid introducing global for this purpose.
 */
-   before_shmem_exit(slotsync_failure_callback, PointerGetDatum(wrconn));
+   before_shmem_exit(slotsync_worker_disconnect, PointerGetDatum(wrconn));

The comment above this change still states "Register the failure callback once
we have the connection"; I think it has to be reworded a bit now that v8 is
making use of slotsync_worker_disconnect().

2 ===

+* Register slotsync_worker_onexit() before we register
+* ReplicationSlotShmemExit() in BaseInit(), to ensure that during exit 
of
+* slot sync worker, ReplicationSlotShmemExit() is called first, 
followed
+* by slotsync_worker_onexit(). Startup process during promotion waits 
for

Worth mentioning in shmem_exit() (where it does "while (--before_shmem_exit_index >= 0)"),
or before the shmem_exit() definition, that ReplSlotSyncWorkerMain() relies on
this LIFO behavior? (Not sure if there are other "strong" LIFO requirements in
other parts of the code.)

3 ===

+* Startup process during promotion waits for slot sync to finish and it
+* does that by checking the 'syncing' flag.

Worth mentioning ShutDownSlotSync()?

4 ===

I did a few tests manually (launching ShutDownSlotSync() through gdb / with and
without sync worker and with / without pg_sync_replication_slots() running
concurrently) and it looks like it works as designed.

Having said that, the logic that is in place to take care of the corner cases
described up-thread seems reasonable to me.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: promotion related handling in pg_sync_replication_slots()

2024-04-16 Thread Bertrand Drouvot
Hi,

On Tue, Apr 16, 2024 at 02:06:45PM +0530, shveta malik wrote:
> On Tue, Apr 16, 2024 at 1:55 PM Zhijie Hou (Fujitsu)
>  wrote:
> > I personally feel adding the additional check for sync_replication_slots may
> > not improve the situation here. Because the GUC sync_replication_slots can
> > change at any point, the GUC could be false when performing this addition 
> > check
> > and is set to true immediately after the check, so It could not simplify 
> > the logic
> > anyway.
> 
> +1.
> I feel doc and "cannot synchronize replication slots concurrently"
> check should suffice.
> 
> In the scenario which Hou-San pointed out,  if after performing the
> GUC check in SQL function, this GUC is enabled immediately and say
> worker is started sooner than the function could get chance to sync,
> in that case as well, SQL function will ultimately get error "cannot
> synchronize replication slots concurrently", even though GUC is
> enabled.  Thus, I feel we should stick with samer error in all
> scenarios.

Okay, fine by me, let's forget about checking sync_replication_slots then.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Disallow changing slot's failover option in transaction block

2024-04-16 Thread Bertrand Drouvot
Hi,

On Tue, Apr 16, 2024 at 06:32:11AM +, Zhijie Hou (Fujitsu) wrote:
> Hi,
> 
> Kuroda-San reported an issue off-list that:
> 
> If user execute ALTER SUBSCRIPTION SET (failover) command inside a txn block
> and rollback, only the subscription option change can be rolled back, while 
> the
> replication slot's failover change is preserved.

Nice catch, thanks!

> To fix it, I think we can prevent user from executing ALTER SUBSCRIPTION set
> (failover) inside a txn block, which is also consistent to the ALTER
> SUBSCRIPTION REFRESH/DROP SUBSCRIPTION command. Attach a small
> patch to address this.

Agree. The patch looks pretty straightforward to me. Worth adding this
case to the doc? (where we already mention that "Commands ALTER SUBSCRIPTION ...
REFRESH PUBLICATION and ALTER SUBSCRIPTION ... {SET|ADD|DROP} PUBLICATION ...
with refresh option as true cannot be executed inside a transaction block").
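
To make the reported problem concrete, a sketch of the sequence (the
subscription name is made up):

BEGIN;
ALTER SUBSCRIPTION mysub SET (failover = true);  -- also alters the remote slot
ROLLBACK;

-- the catalog change is rolled back...
SELECT subname, subfailover FROM pg_subscription WHERE subname = 'mysub';
-- ...but the replication slot's failover property on the publisher has already
-- been changed, hence the proposal to disallow the command in a transaction
-- block.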

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: promotion related handling in pg_sync_replication_slots()

2024-04-16 Thread Bertrand Drouvot
Hi,

On Tue, Apr 16, 2024 at 10:00:04AM +0530, shveta malik wrote:
> Please find v5 addressing above comments.

Thanks!

@@ -1634,9 +1677,14 @@ SyncReplicationSlots(WalReceiverConn *wrconn)
 {
PG_ENSURE_ERROR_CLEANUP(slotsync_failure_callback, 
PointerGetDatum(wrconn));
{
+   check_flags_and_set_sync_info(InvalidPid);
+

Given the fact that if the sync worker is running it won't be possible to trigger
a manual sync with pg_sync_replication_slots(), what about also checking the
"sync_replication_slots" value at the start of SyncReplicationSlots() and
emitting an error if sync_replication_slots is set to on? (The message could
explicitly state that it's not possible to use the function if
sync_replication_slots is set to on.)
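
Something along those lines from a user perspective (the exact error wording
below is only illustrative, not something the patch currently emits):

postgres=# SHOW sync_replication_slots;
 sync_replication_slots
------------------------
 on
(1 row)

postgres=# SELECT pg_sync_replication_slots();
ERROR:  cannot use pg_sync_replication_slots() while sync_replication_slots is enabled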

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: promotion related handling in pg_sync_replication_slots()

2024-04-16 Thread Bertrand Drouvot
Hi,

On Tue, Apr 16, 2024 at 08:21:04AM +0530, Amit Kapila wrote:
> On Mon, Apr 15, 2024 at 7:47 PM Bertrand Drouvot
>  wrote:
> >
> > On Mon, Apr 15, 2024 at 06:29:49PM +0530, Amit Kapila wrote:
> > > On Mon, Apr 15, 2024 at 6:21 PM Bertrand Drouvot
> > >  wrote:
> > > >
> > > > On Mon, Apr 15, 2024 at 03:38:53PM +0530, shveta malik wrote:
> > > > > Now both worker and SQL
> > > > > function set 'syncing' when they start and reset it when they exit.
> > > >
> > > > It means that it's not possible anymore to trigger a manual sync if
> > > > sync_replication_slots is on. Indeed that would trigger:
> > > >
> > > > postgres=# select pg_sync_replication_slots();
> > > > ERROR:  cannot synchronize replication slots concurrently
> > > >
> > > > That looks like an issue to me, thoughts?
> > > >
> > >
> > > This is intentional as of now for the sake of keeping
> > > implementation/code simple. It is not difficult to allow them but I am
> > > not sure whether we want to add another set of conditions allowing
> > > them in parallel.
> >
> > I think that the ability to launch a manual sync before a switchover would 
> > be
> > missed. Except for this case I don't think that's an issue to prevent them 
> > to
> > run in parallel.
> >
> 
> I think if the slotsync worker is available, it can do that as well.

Right, but one has no control as to when the sync is triggered. 

> There is no clear use case for allowing them in parallel and I feel it
> would add more confusion when it can work sometimes but not other
> times. However, if we receive some report from the field where there
> is a real demand for such a thing, it should be easy to achieve. For
> example, I can imagine that we can have sync_state that has values
> 'started', 'in_progress' , and 'finished'. This should allow us to
> achieve what the current proposed patch is doing along with allowing
> the API to work in parallel when the sync_state is not 'in_progress'.
> 
> I think for now let's restrict their usage in parallel and make the
> promotion behavior consistent both for worker and API.

Okay, let's do it that way. Is it worth adding a few words in the doc related to
pg_sync_replication_slots() though? (To mention it cannot be used if the slot
sync worker is running.)

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: promotion related handling in pg_sync_replication_slots()

2024-04-15 Thread Bertrand Drouvot
Hi,

On Mon, Apr 15, 2024 at 06:29:49PM +0530, Amit Kapila wrote:
> On Mon, Apr 15, 2024 at 6:21 PM Bertrand Drouvot
>  wrote:
> >
> > On Mon, Apr 15, 2024 at 03:38:53PM +0530, shveta malik wrote:
> > > Now both worker and SQL
> > > function set 'syncing' when they start and reset it when they exit.
> >
> > It means that it's not possible anymore to trigger a manual sync if
> > sync_replication_slots is on. Indeed that would trigger:
> >
> > postgres=# select pg_sync_replication_slots();
> > ERROR:  cannot synchronize replication slots concurrently
> >
> > That looks like an issue to me, thoughts?
> >
> 
> This is intentional as of now for the sake of keeping
> implementation/code simple. It is not difficult to allow them but I am
> not sure whether we want to add another set of conditions allowing
> them in parallel.

I think that the ability to launch a manual sync before a switchover would be
missed. Except for this case I don't think that's an issue to prevent them from
running in parallel.
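
For instance, the sequence one would typically run on the standby just before a
planned switchover (a sketch only):

-- make sure the failover slots are up to date one last time
SELECT pg_sync_replication_slots();

-- then promote the standby
SELECT pg_promote();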

> And that too in an unpredictable way as the API will
> work only for the time slot sync worker is not performing the sync.

Yeah but then at least you would know that there is "really" a sync in progress
(which is not the case currently with v4, as the sync worker being started is
enough to prevent a manual sync even if a sync is not in progress).

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: promotion related handling in pg_sync_replication_slots()

2024-04-15 Thread Bertrand Drouvot
Hi,

On Mon, Apr 15, 2024 at 03:38:53PM +0530, shveta malik wrote:
> On Mon, Apr 15, 2024 at 2:29 PM Amit Kapila  wrote:
> >
> > On Fri, Apr 12, 2024 at 5:25 PM shveta malik  wrote:
> > >
> > > Please find v3 addressing race-condition and one other comment.
> > >
> > > Up-thread it was suggested that,  probably, checking
> > > SlotSyncCtx->syncing should be sufficient in ShutDownSlotSync(). On
> > > re-thinking, it might not be. Slot sync worker sets and resets
> > > 'syncing' with each sync-cycle, and thus we need to rely on worker's
> > > pid in ShutDownSlotSync(), as there could be a window where promotion
> > > is triggered and 'syncing' is not set for worker, while the worker is
> > > still running. This implementation of setting and resetting syncing
> > > with each sync-cycle looks better as compared to setting syncing
> > > during the entire life-cycle of the worker. So, I did not change it.
> > >
> >
> > To retain this we need to have different handling for 'syncing' for
> > workers and function which seems like more maintenance burden than the
> > value it provides. Moreover, in SyncReplicationSlots(), we are calling
> > a function after acquiring spinlock which is not our usual coding
> > practice.
> 
> Okay.  Changed it to consistent handling.

Thanks for the patch!

> Now both worker and SQL
> function set 'syncing' when they start and reset it when they exit.

It means that it's not possible anymore to trigger a manual sync if 
sync_replication_slots is on. Indeed that would trigger:

postgres=# select pg_sync_replication_slots();
ERROR:  cannot synchronize replication slots concurrently

That looks like an issue to me, thoughts?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Synchronizing slots from primary to standby

2024-04-06 Thread Bertrand Drouvot
Hi,

On Sat, Apr 06, 2024 at 10:13:00AM +0530, Amit Kapila wrote:
> On Fri, Apr 5, 2024 at 8:05 PM Bertrand Drouvot
>  wrote:
> 
> I think the new LSN can be visible only when the corresponding WAL is
> written by XLogWrite(). I don't know what in XLogSetAsyncXactLSN() can
> make it visible. In your experiment below, isn't it possible that in
> the meantime WAL writer has written that WAL due to which you are
> seeing an updated location?

What I did is:

session 1:  select pg_current_wal_lsn();\watch 1
session 2:  select pg_backend_pid();

terminal 1: tail -f logfile | grep -i snap 
terminal 2 : gdb -p <pid from session 2>, breakpoint hit at standby.c:1346
1346    {
(gdb) n
1350
 
Then next, next until the DEBUG message is emitted (confirmed in terminal 1).

At this stage the DEBUG message shows the new LSN while session 1 still displays
the previous LSN.

Then once XLogSetAsyncXactLSN() is done in the gdb session (terminal 2) then
session 1 displays the new LSN.

This is reproducible as desired.

With more debugging I can see that when the spinlock is released in 
XLogSetAsyncXactLSN()
then XLogWrite() is doing its job and then session 1 does see the new value 
(that
happens in this order, and as you said that's expected).

My point is that while the DEBUG message is emitted session 1 still
sees the old LSN (until the new LSN is visible). I think that we should emit the
DEBUG message once session 1 can see the new value (if not, I think the timestamp
of the DEBUG message can be misleading for debugging purposes).

> I think I am missing how exactly moving DEBUG2 can confirm the above theory.

I meant to say that instead of seeing:

2024-04-05 04:37:05.074 UTC [3854278][background writer][:0] DEBUG:  snapshot 
of 0+0 running transaction ids (lsn 0/398 oldest xid 740 latest complete 
739 next xid 740)
2024-04-05 04:37:05.197 UTC [3866475][client backend][2/4:0] LOG:  statement: 
SELECT '0/360' <= replay_lsn AND state = 'streaming'

We would probably see something like:

2024-04-05 04:37:05. UTC [3866475][client backend][2/4:0] LOG:  
statement: SELECT '0/360' <= replay_lsn AND state = 'streaming'
2024-04-05 04:37:05.+xx UTC [3854278][background writer][:0] DEBUG:  
snapshot of 0+0 running transaction ids (lsn 0/398 oldest xid 740 latest 
complete 739 next xid 740)

And then it would be clear that the query ran before the new LSN is visible.

> > If the theory is proven then I'm not sure we need the extra complexity of
> > injection point here, maybe just relying on the slots created via SQL API 
> > could
> > be enough.
> >
> 
> Yeah, that could be the first step. We can probably add an injection
> point to control the bgwrite behavior and then add tests involving
> walsender performing the decoding. But I think it is important to have
> sufficient tests in this area as I see they are quite helpful in
> uncovering the issues.
>

Yeah agree. 

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Synchronizing slots from primary to standby

2024-04-05 Thread Bertrand Drouvot
Hi,

On Fri, Apr 05, 2024 at 02:35:42PM +, Bertrand Drouvot wrote:
> I think that maybe as a first step we should move the "elog(DEBUG2," message 
> as
> proposed above to help debugging (that could help to confirm the above 
> theory).

If you agree and think that makes sense, please find attached a tiny patch doing
so.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
>From a414e81a4c5c88963b2412d1641f3de1262c15c0 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot 
Date: Fri, 5 Apr 2024 14:49:51 +
Subject: [PATCH v1] Move DEBUG message in LogCurrentRunningXacts()

Indeed the new LSN is visible to other sessions (say through pg_current_wal_lsn())
only after the spinlock (XLogCtl->info_lck) in XLogSetAsyncXactLSN() is released.

So moving the DEBUG message after the XLogSetAsyncXactLSN() call seems more
appropriate for debugging purposes.
---
 src/backend/storage/ipc/standby.c | 22 +++---
 1 file changed, 11 insertions(+), 11 deletions(-)
 100.0% src/backend/storage/ipc/

diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 87b04e51b3..ba549acf50 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -1366,6 +1366,17 @@ LogCurrentRunningXacts(RunningTransactions CurrRunningXacts)
 
 	recptr = XLogInsert(RM_STANDBY_ID, XLOG_RUNNING_XACTS);
 
+	/*
+	 * Ensure running_xacts information is synced to disk not too far in the
+	 * future. We don't want to stall anything though (i.e. use XLogFlush()),
+	 * so we let the wal writer do it during normal operation.
+	 * XLogSetAsyncXactLSN() conveniently will mark the LSN as to-be-synced
+	 * and nudge the WALWriter into action if sleeping. Check
+	 * XLogBackgroundFlush() for details why a record might not be flushed
+	 * without it.
+	 */
+	XLogSetAsyncXactLSN(recptr);
+
 	if (CurrRunningXacts->subxid_overflow)
 		elog(DEBUG2,
 			 "snapshot of %d running transactions overflowed (lsn %X/%X oldest xid %u latest complete %u next xid %u)",
@@ -1383,17 +1394,6 @@ LogCurrentRunningXacts(RunningTransactions CurrRunningXacts)
 			 CurrRunningXacts->latestCompletedXid,
 			 CurrRunningXacts->nextXid);
 
-	/*
-	 * Ensure running_xacts information is synced to disk not too far in the
-	 * future. We don't want to stall anything though (i.e. use XLogFlush()),
-	 * so we let the wal writer do it during normal operation.
-	 * XLogSetAsyncXactLSN() conveniently will mark the LSN as to-be-synced
-	 * and nudge the WALWriter into action if sleeping. Check
-	 * XLogBackgroundFlush() for details why a record might not be flushed
-	 * without it.
-	 */
-	XLogSetAsyncXactLSN(recptr);
-
 	return recptr;
 }
 
-- 
2.34.1



Re: Synchronizing slots from primary to standby

2024-04-05 Thread Bertrand Drouvot
Hi,

On Fri, Apr 05, 2024 at 06:23:10PM +0530, Amit Kapila wrote:
> On Fri, Apr 5, 2024 at 5:17 PM Amit Kapila  wrote:
> Thinking more on this, it doesn't seem related to
> c9920a9068eac2e6c8fb34988d18c0b42b9bf811 as that commit doesn't change
> any locking or something like that which impacts write positions.

Agree.

> I think what has happened here is that running_xact record written by
> the background writer [1] is not written to the kernel or disk (see
> LogStandbySnapshot()), before pg_current_wal_lsn() checks the
> current_lsn to be compared with replayed LSN.

Agree, I think it's not visible through pg_current_wal_lsn() yet.

Also I think that the DEBUG message in LogCurrentRunningXacts() 

"
elog(DEBUG2,
 "snapshot of %d+%d running transaction ids (lsn %X/%X oldest xid 
%u latest complete %u next xid %u)",
 CurrRunningXacts->xcnt, CurrRunningXacts->subxcnt,
 LSN_FORMAT_ARGS(recptr),
 CurrRunningXacts->oldestRunningXid,
 CurrRunningXacts->latestCompletedXid,
 CurrRunningXacts->nextXid);
"

should be located after the XLogSetAsyncXactLSN() call. Indeed, the new LSN is
visible after the spinlock (XLogCtl->info_lck) in XLogSetAsyncXactLSN() is
released, see:

\watch on Session 1 provides: 

 pg_current_wal_lsn

 0/87D110
(1 row)

Until:

Breakpoint 2, XLogSetAsyncXactLSN (asyncXactLSN=8900936) at xlog.c:2579
2579        XLogRecPtr  WriteRqstPtr = asyncXactLSN;
(gdb) n
2581        bool        wakeup = false;
(gdb)
2584        SpinLockAcquire(&XLogCtl->info_lck);
(gdb)
2585        RefreshXLogWriteResult(LogwrtResult);
(gdb)
2586        sleeping = XLogCtl->WalWriterSleeping;
(gdb)
2587        prevAsyncXactLSN = XLogCtl->asyncXactLSN;
(gdb)
2588        if (XLogCtl->asyncXactLSN < asyncXactLSN)
(gdb)
2589        XLogCtl->asyncXactLSN = asyncXactLSN;
(gdb)
2590        SpinLockRelease(&XLogCtl->info_lck);
(gdb) p/x (uint32) XLogCtl->asyncXactLSN
$1 = 0x87d148

Then session 1 provides:

 pg_current_wal_lsn

 0/87D148
(1 row)

So, when we see in the log:

2024-04-05 04:37:05.074 UTC [3854278][background writer][:0] DEBUG:  snapshot 
of 0+0 running transaction ids (lsn 0/398 oldest xid 740 latest complete 
739 next xid 740)
2024-04-05 04:37:05.197 UTC [3866475][client backend][2/4:0] LOG:  statement: 
SELECT '0/360' <= replay_lsn AND state = 'streaming'

It's indeed possible that the new LSN was not visible yet (spinlock not released?)
before the query began (because we cannot rely on the time the DEBUG message has
been emitted).

> Note that the reason why
> walsender has picked the running_xact written by background writer is
> because it has checked after pg_current_wal_lsn() query, see LOGs [2].
> I think we can probably try to reproduce manually via debugger.
> 
> If this theory is correct

I think it is.

> then I think we will need to use injection
> points to control the behavior of bgwriter or use the slots created
> via SQL API for syncing in tests.
> 
> Thoughts?

I think that maybe as a first step we should move the "elog(DEBUG2," message as
proposed above to help debugging (that could help to confirm the above theory).

If the theory is proven then I'm not sure we need the extra complexity of an
injection point here, maybe just relying on the slots created via the SQL API
could be enough.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Synchronizing slots from primary to standby

2024-04-05 Thread Bertrand Drouvot
Hi,

On Fri, Apr 05, 2024 at 04:09:01PM +0530, shveta malik wrote:
> On Fri, Apr 5, 2024 at 10:09 AM Bertrand Drouvot
>  wrote:
> >
> > What about something like?
> >
> > ereport(LOG,
> > errmsg("synchronized confirmed_flush_lsn for slot \"%s\" differs 
> > from remote slot",
> > remote_slot->name),
> > errdetail("Remote slot has LSN %X/%X but local slot has LSN %X/%X.",
> > LSN_FORMAT_ARGS(remote_slot->restart_lsn),
> > LSN_FORMAT_ARGS(slot->data.restart_lsn));
> >
> > Regards,
> 
> +1. Better than earlier. I will update and post the patch.
> 

Thanks!

BTW, I just realized that the LSNs I used in my example in the LSN_FORMAT_ARGS()
are not the right ones.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-04-05 Thread Bertrand Drouvot
Hi,

On Fri, Apr 05, 2024 at 11:21:43AM +0530, Bharath Rupireddy wrote:
> On Wed, Apr 3, 2024 at 9:57 PM Bertrand Drouvot
>  wrote:
> Please find the attached v36 patch.

Thanks!

A few comments:

1 ===

+   
+The timeout is measured from the time since the slot has become
+inactive (known from its
+inactive_since value) until it gets
+used (i.e., its active is set to true).
+   

That's right except when it's invalidated during a checkpoint (as the slot
is not acquired in CheckPointReplicationSlots()).

So, what about adding: "or a checkpoint occurs"? That would also explain that
the invalidation could occur during a checkpoint.

2 ===

+   /* If the slot has been invalidated, recalculate the resource limits */
+   if (invalidated)
+   {

s/If the slot/If a slot/?

3 ===

+ * NB - this function also runs as part of checkpoint, so avoid raising errors

s/NB - this/NB: This function/? (that looks more consistent with other comments
in the code)

4 ===

+ * Note that having a new function for RS_INVAL_INACTIVE_TIMEOUT cause instead

I understand it's "the RS_INVAL_INACTIVE_TIMEOUT cause" but reading "cause 
instead"
looks weird to me. Maybe it would make sense to reword this a bit.

5 ===

+* considered not active as they don't actually perform logical 
decoding.

Not sure that's 100% accurate as we switched to fast forward logical decoding
in 2ec005b4e2.

"as they perform only fast forward logical decoding (or not at all)", maybe?

6 ===

+   if (RecoveryInProgress() && slot->data.synced)
+   return false;
+
+   if (replication_slot_inactive_timeout == 0)
+   return false;

What about just using one if? It's more a matter of taste but it also probably
reduces the object file size a bit for non-optimized builds.

7 ===

+   /*
+* Do not invalidate the slots which are currently being synced 
from
+* the primary to the standby.
+*/
+   if (RecoveryInProgress() && slot->data.synced)
+   return false;

I think we don't need this check as the exact same one is done just before.

8 ===

+sub check_for_slot_invalidation_in_server_log
+{
+   my ($node, $slot_name, $offset) = @_;
+   my $invalidated = 0;
+
+   for (my $i = 0; $i < 10 * $PostgreSQL::Test::Utils::timeout_default; 
$i++)
+   {
+   $node->safe_psql('postgres', "CHECKPOINT");

Wouldn't it be better to wait for the replication_slot_inactive_timeout time
instead of triggering all those checkpoints? (It could be passed as an extra arg
to wait_for_slot_invalidation().)

9 ===

# Synced slot mustn't get invalidated on the standby, it must sync invalidation
# from the primary. So, we must not see the slot's invalidation message in 
server
# log.
ok( !$standby1->log_contains(
"invalidating obsolete replication slot \"lsub1_sync_slot\"",
$standby1_logstart),
'check that syned slot has not been invalidated on the standby');

Would it make sense to trigger a checkpoint on the standby before this test?
I mean, without a checkpoint on the standby we should not see the
invalidation in the log anyway.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Synchronizing slots from primary to standby

2024-04-04 Thread Bertrand Drouvot
Hi,

On Fri, Apr 05, 2024 at 09:43:35AM +0530, shveta malik wrote:
> On Fri, Apr 5, 2024 at 9:22 AM Bertrand Drouvot
>  wrote:
> >
> > Hi,
> >
> > On Thu, Apr 04, 2024 at 05:31:45PM +0530, shveta malik wrote:
> > > On Thu, Apr 4, 2024 at 2:59 PM shveta malik  
> > > wrote:
> > 2 ===
> >
> > +   if (slot->data.confirmed_flush != 
> > remote_slot->confirmed_lsn)
> > +   elog(LOG,
> > +"could not synchronize local slot 
> > \"%s\" LSN(%X/%X)"
> > +" to remote slot's LSN(%X/%X) ",
> > +remote_slot->name,
> > +
> > LSN_FORMAT_ARGS(slot->data.confirmed_flush),
> > +
> > LSN_FORMAT_ARGS(remote_slot->confirmed_lsn));
> >
> > I don't think that the message is correct here. Unless I am missing 
> > something
> > there is nothing in the following code path that would prevent the slot to 
> > be
> > sync during this cycle.
> 
> This is a sanity check,  I will put a comment to indicate the same.

Thanks!

> We
> want to ensure if anything changes in future, we get correct logs to
> indicate that.

Right, understood that way.

> If required, the LOG msg can be changed. Kindly suggest if you have
> anything better in mind.
> 

What about something like?

ereport(LOG,
errmsg("synchronized confirmed_flush_lsn for slot \"%s\" differs from 
remote slot",
remote_slot->name),
errdetail("Remote slot has LSN %X/%X but local slot has LSN %X/%X.",
LSN_FORMAT_ARGS(remote_slot->restart_lsn),
LSN_FORMAT_ARGS(slot->data.restart_lsn));

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Synchronizing slots from primary to standby

2024-04-04 Thread Bertrand Drouvot
Hi,

On Thu, Apr 04, 2024 at 05:31:45PM +0530, shveta malik wrote:
> On Thu, Apr 4, 2024 at 2:59 PM shveta malik  wrote:
> >
> >
> > Prior to commit 2ec005b, this check was okay, as we did not expect
> > restart_lsn of the synced slot to be ahead of remote since we were
> > directly copying the lsns. But now when we use 'advance' to do logical
> > decoding on standby, there is a possibility that restart lsn of the
> > synced slot is ahead of remote slot, if there are running txns records
> > found after reaching consistent-point while consuming WALs from
> > restart_lsn till confirmed_lsn. In such a case, slot-sync's advance
> > may end up serializing snapshots and setting restart_lsn to the
> > serialized snapshot point, ahead of remote one.
> >
> > Fix:
> > The sanity check needs to be corrected. Attached a patch to address the 
> > issue.
> 

Thanks for reporting, explaining the issue and providing a patch.

Regarding the patch:

1 ===

+* Attempt to sync lsns and xmins only if remote slot is ahead of local

s/lsns/LSNs/?

2 ===

+   if (slot->data.confirmed_flush != 
remote_slot->confirmed_lsn)
+   elog(LOG,
+"could not synchronize local slot 
\"%s\" LSN(%X/%X)"
+" to remote slot's LSN(%X/%X) ",
+remote_slot->name,
+
LSN_FORMAT_ARGS(slot->data.confirmed_flush),
+
LSN_FORMAT_ARGS(remote_slot->confirmed_lsn));

I don't think that the message is correct here. Unless I am missing something
there is nothing in the following code path that would prevent the slot from
being synced during this cycle.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Autogenerate some wait events code and documentation

2024-04-04 Thread Bertrand Drouvot
Hi,

On Thu, Apr 04, 2024 at 07:14:47PM +0900, Michael Paquier wrote:
> On Thu, Apr 04, 2024 at 09:28:36AM +0000, Bertrand Drouvot wrote:
> > +# No "Backpatch" region here as code is generated automatically.
> > 
> > What about "region here as has its own C code" (that would be more 
> > consistent
> > with the comment in the "header" for the file). Done that way in v4.
> 
> I'd add a "as -this section- has its own C code", for clarity.  This
> just looked a bit strange here.

Sounds good, done that way in v5 attached.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
>From 1e24cdb02360147363495122fbfe79c2758ae4e8 Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Thu, 4 Apr 2024 15:34:37 +0900
Subject: [PATCH v5] Add "ABI_compatibility" regions in wait_event_names.txt

When backpatching, adding an event will renumber others, which can make an
extension report the wrong event until recompiled. This is due to the fact that
generate-wait_event_types.pl sorts events to position them. The "ABI_compatibility"
region added here ensures no ordering is done for the wait events found after
this delimiter.
---
 .../activity/generate-wait_event_types.pl | 27 ++-
 .../utils/activity/wait_event_names.txt   | 17 
 2 files changed, 43 insertions(+), 1 deletion(-)
 100.0% src/backend/utils/activity/

diff --git a/src/backend/utils/activity/generate-wait_event_types.pl b/src/backend/utils/activity/generate-wait_event_types.pl
index f1adf0e8e7..9768406db5 100644
--- a/src/backend/utils/activity/generate-wait_event_types.pl
+++ b/src/backend/utils/activity/generate-wait_event_types.pl
@@ -38,7 +38,9 @@ die "Not possible to specify --docs and --code simultaneously"
 
 open my $wait_event_names, '<', $ARGV[0] or die;
 
+my @abi_compatibility_lines;
 my @lines;
+my $abi_compatibility = 0;
 my $section_name;
 my $note;
 my $note_name;
@@ -59,10 +61,27 @@ while (<$wait_event_names>)
 	{
 		$section_name = $_;
 		$section_name =~ s/^.*- //;
+		$abi_compatibility = 0;
 		next;
 	}
 
-	push(@lines, $section_name . "\t" . $_);
+	# ABI_compatibility region, preserving ABI compatibility of the code
+	# generated.  Any wait events listed in this part of the file
+	# will not be sorted during the code generation.
+	if (/^ABI_compatibility:$/)
+	{
+		$abi_compatibility = 1;
+		next;
+	}
+
+	if ($gen_code && $abi_compatibility)
+	{
+		push(@abi_compatibility_lines, $section_name . "\t" . $_);
+	}
+	else
+	{
+		push(@lines, $section_name . "\t" . $_);
+	}
 }
 
 # Sort the lines based on the second column.
@@ -70,6 +89,12 @@ while (<$wait_event_names>)
 my @lines_sorted =
   sort { uc((split(/\t/, $a))[1]) cmp uc((split(/\t/, $b))[1]) } @lines;
 
+# If we are generating code, concat @lines_sorted and then @abi_compatibility_lines.
+if ($gen_code)
+{
+	push(@lines_sorted, @abi_compatibility_lines);
+}
+
 # Read the sorted lines and populate the hash table
 foreach my $line (@lines_sorted)
 {
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 0d288d6b3d..5169be238c 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -26,6 +26,12 @@
 # When adding a new wait event, make sure it is placed in the appropriate
 # ClassName section.
 #
+# Wait events added in stable branches should be appended to the lists in
+# the "ABI_compatibility:" region of their related ClassName section to preserve
+# ABI compatibility of the C code generated from this file's data, respecting
+# the order of any wait event already listed there.  The "ABI_compatibility:"
+# regions should remain empty on the master branch and on unreleased branches.
+#
 # WaitEventLWLock and WaitEventLock have their own C code for their wait event
 # enums and function names.  Hence, for these, only the SGML documentation is
 # generated.
@@ -61,6 +67,7 @@ WAL_SENDER_MAIN	"Waiting in main loop of WAL sender process."
 WAL_SUMMARIZER_WAL	"Waiting in WAL summarizer for more WAL to be generated."
 WAL_WRITER_MAIN	"Waiting in main loop of WAL writer process."
 
+ABI_compatibility:
 
 #
 # Wait Events - Client
@@ -83,6 +90,7 @@ WAIT_FOR_WAL_REPLAY	"Waiting for a replay of the particular WAL position on the
 WAL_SENDER_WAIT_FOR_WAL	"Waiting for WAL to be flushed in WAL sender process."
 WAL_SENDER_WRITE_DATA	"Waiting for any activity when processing replies from WAL receiver in WAL sender process."
 
+ABI_compatibility:
 
 #
 # Wait Events - IPC
@@ -150,6 +158,7 @@ WAL_RECEIVER_WAIT_START	"Waiting for startup process to send initial data for st
 WAL_SUMMARY_READY	"Waiting for a new WAL summary to b

Re: Autogenerate some wait events code and documentation

2024-04-04 Thread Bertrand Drouvot
Hi,

On Thu, Apr 04, 2024 at 03:50:21PM +0900, Michael Paquier wrote:
> On Tue, Mar 19, 2024 at 07:34:09AM +0000, Bertrand Drouvot wrote:
> > I'm not sure as v2 used the "version >= 17" wording which I think would not 
> > need
> > manual refresh each time a new stable branch is forked.
> > 
> > But to avoid any doubt, I'm following your recommendation in v3 attached 
> > (then
> > only mentioning the "master branch" and "any other branch").
> 
> I don't see why we could not be more generic, TBH.  Note that the
> Backpatch region should be empty not only the master branch but also
> on stable and unreleased branches (aka REL_XX_STABLE branches from
> their fork from master to their .0 release).  I have reworded the
> whole, mentioning ABI compatibility, as well.

Yeah, agree. I do prefer your wording.

> The position of the Backpatch regions were a bit incorrect (extra one
> in LWLock, and the one in Lock was not needed).

oops, thanks for the fixes!

> We could be stricter with the order of the elements in
> pgstat_wait_event.c and wait_event_funcs_data.c, but there's no
> consequence feature-wise and I cannot get excited about the extra
> complexity this creates in generate-wait_event_types.pl between the
> enum generation and the rest.

Yeah, and I think generate-wait_event_types.pl is already complex enough.
So better to add only what is strictly necessary to it, IMHO.

> Is "Backpatch" the best choice we have, though?  It speaks by itself
> but I was thinking about something different, like "Stable".  Other
> ideas or objections are welcome.  My naming sense is usually not that
> good, so there's that.

I think "Stable" is more confusing because the section should also be empty 
until
the .0 is released.

That said, what about "ABI_compatibility"? (that would also match the comment 
added in wait_event_names.txt). Attached v4 making use of the ABI_compatibility
proposal.

> 0001 is the patch with my tweaks.

Thanks! 

+# No "Backpatch" region here as code is generated automatically.

What about "region here as has its own C code" (that would be more 
consistent
with the comment in the "header" for the file). Done that way in v4.

It looks like WAL_SENDER_WRITE_ZZZ was also added in it (I guess for testing
purpose, so I removed it in v4).

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
>From 4a69b8adb9bef2a5c7f7bec061d0bdda95ad9e24 Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Thu, 4 Apr 2024 15:34:37 +0900
Subject: [PATCH v4] Add "ABI_compatibility" regions in wait_event_names.txt

When backpatching, adding an event will renumber others, which can make an
extension report the wrong event until recompiled. This is due to the fact that
generate-wait_event_types.pl sorts events to position them. The "ABI_compatibility"
region added here ensures no ordering is done for the wait events found after
this delimiter.
---
 .../activity/generate-wait_event_types.pl | 27 ++-
 .../utils/activity/wait_event_names.txt   | 17 
 2 files changed, 43 insertions(+), 1 deletion(-)
 100.0% src/backend/utils/activity/

diff --git a/src/backend/utils/activity/generate-wait_event_types.pl b/src/backend/utils/activity/generate-wait_event_types.pl
index f1adf0e8e7..9768406db5 100644
--- a/src/backend/utils/activity/generate-wait_event_types.pl
+++ b/src/backend/utils/activity/generate-wait_event_types.pl
@@ -38,7 +38,9 @@ die "Not possible to specify --docs and --code simultaneously"
 
 open my $wait_event_names, '<', $ARGV[0] or die;
 
+my @abi_compatibility_lines;
 my @lines;
+my $abi_compatibility = 0;
 my $section_name;
 my $note;
 my $note_name;
@@ -59,10 +61,27 @@ while (<$wait_event_names>)
 	{
 		$section_name = $_;
 		$section_name =~ s/^.*- //;
+		$abi_compatibility = 0;
 		next;
 	}
 
-	push(@lines, $section_name . "\t" . $_);
+	# ABI_compatibility region, preserving ABI compatibility of the code
+	# generated.  Any wait events listed in this part of the file
+	# will not be sorted during the code generation.
+	if (/^ABI_compatibility:$/)
+	{
+		$abi_compatibility = 1;
+		next;
+	}
+
+	if ($gen_code && $abi_compatibility)
+	{
+		push(@abi_compatibility_lines, $section_name . "\t" . $_);
+	}
+	else
+	{
+		push(@lines, $section_name . "\t" . $_);
+	}
 }
 
 # Sort the lines based on the second column.
@@ -70,6 +89,12 @@ while (<$wait_event_names>)
 my @lines_sorted =
   sort { uc((split(/\t/, $a))[1]) cmp uc((split(/\t/, $b))[1]) } @lines;
 
+# If we are generating code, concat @lines_sorted and then @abi_compatibility_lines.
+if ($gen_code)
+{
+	push(@lines_sorted, @abi_compatibility_lines);
+}
+
 # Read the 

Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-04-03 Thread Bertrand Drouvot
Hi,

On Wed, Apr 03, 2024 at 08:28:04PM +0530, Bharath Rupireddy wrote:
> On Wed, Apr 3, 2024 at 6:46 PM Bertrand Drouvot
>  wrote:
> >
> > Just one comment on v32-0001:
> >
> > +# Synced slot on the standby must get its own inactive_since.
> > +is( $standby1->safe_psql(
> > +   'postgres',
> > +   "SELECT '$inactive_since_on_primary'::timestamptz <= 
> > '$inactive_since_on_standby'::timestamptz AND
> > +   '$inactive_since_on_standby'::timestamptz <= 
> > '$slot_sync_time'::timestamptz;"
> > +   ),
> > +   "t",
> > +   'synchronized slot has got its own inactive_since');
> > +
> >
> > By using <= we are not testing that it must get its own inactive_since (as 
> > we
> > allow them to be equal in the test). I think we should just add some 
> > usleep()
> > where appropriate and deny equality during the tests on inactive_since.
> 
> > Except for the above, v32-0001 LGTM.
> 
> Thanks. Please see the attached v33-0001 patch after removing equality
> on inactive_since TAP tests.

Thanks! v33-0001 LGTM.

> On Wed, Apr 3, 2024 at 1:47 PM Bertrand Drouvot
>  wrote:
> > Some comments regarding v31-0002:
> >
> > T2 ===
> >
> > In case the slot is invalidated on the primary,
> >
> > primary:
> >
> > postgres=# select slot_name, inactive_since, invalidation_reason from 
> > pg_replication_slots where slot_name = 's1';
> >  slot_name |inactive_since | invalidation_reason
> > ---+---+-
> >  s1| 2024-04-03 06:56:28.075637+00 | inactive_timeout
> >
> > then on the standby we get:
> >
> > standby:
> >
> > postgres=# select slot_name, inactive_since, invalidation_reason from 
> > pg_replication_slots where slot_name = 's1';
> >  slot_name |inactive_since| invalidation_reason
> > ---+--+-
> >  s1| 2024-04-03 07:06:43.37486+00 | inactive_timeout
> >
> > shouldn't the slot be dropped/recreated instead of updating inactive_since?
> 
> The sync slots that are invalidated on the primary aren't dropped and
> recreated on the standby.

Yeah, right (I was confused with synced slots that are invalidated locally).

> However, I
> found that the synced slot is acquired and released unnecessarily
> after the invalidation_reason is synced from the primary. I added a
> skip check in synchronize_one_slot to skip acquiring and releasing the
> slot if it's locally found inactive. With this, inactive_since won't
> get updated for invalidated sync slots on the standby as we don't
> acquire and release the slot.

CR1 ===

Yeah, I can see:

@@ -575,6 +575,13 @@ synchronize_one_slot(RemoteSlot *remote_slot, Oid 
remote_dbid)
   " name slot \"%s\" already 
exists on the standby",
   remote_slot->name));

+   /*
+* Skip the sync if the local slot is already invalidated. We 
do this
+* beforehand to save on slot acquire and release.
+*/
+   if (slot->data.invalidated != RS_INVAL_NONE)
+   return false;

Thanks to the drop_local_obsolete_slots() call I think we are not missing the case
where the slot has been invalidated on the primary, the invalidation reason has
been synced on the standby, and later the slot is dropped/recreated manually on
the primary (then it should be dropped/recreated on the standby too).

Also it seems we are not missing the case where a sync slot is invalidated
locally due to wal removal (it should be dropped/recreated).

> 
> > CR5 ===
> >
> > +   /*
> > +* This function isn't expected to be called for inactive timeout 
> > based
> > +* invalidation. A separate function 
> > InvalidateInactiveReplicationSlot is
> > +    * to be used for that.
> >
> > Do you think it's worth to explain why?
> 
> Hm, I just wanted to point out the actual function here. I modified it
> to something like the following, if others feel we don't need that, I
> can remove it.

Sorry if I was not clear, but I meant to say "Do you think it's worth to explain
why we decided to create a dedicated function?" (currently we "just" explain that
we created one).

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-04-03 Thread Bertrand Drouvot
Hi,

On Wed, Apr 03, 2024 at 05:12:12PM +0530, Bharath Rupireddy wrote:
> On Wed, Apr 3, 2024 at 4:19 PM Amit Kapila  wrote:
> >
> > + 'postgres',
> > + "SELECT '$inactive_since_on_primary'::timestamptz <
> > '$inactive_since_on_standby'::timestamptz AND
> > + '$inactive_since_on_standby'::timestamptz < 
> > '$slot_sync_time'::timestamptz;"
> >
> > Shall we do <= check as we are doing in the main function
> > get_slot_inactive_since_value as the time duration is less so it can
> > be the same as well? Similarly, please check other tests.
> 
> I get you. If the tests are so fast that losing a bit of precision
> might cause tests to fail. So, I'v added equality check for all the
> tests.

> Please find the attached v32-0001 patch with the above review comments
> addressed.

Thanks!

Just one comment on v32-0001:

+# Synced slot on the standby must get its own inactive_since.
+is( $standby1->safe_psql(
+   'postgres',
+   "SELECT '$inactive_since_on_primary'::timestamptz <= 
'$inactive_since_on_standby'::timestamptz AND
+   '$inactive_since_on_standby'::timestamptz <= 
'$slot_sync_time'::timestamptz;"
+   ),
+   "t",
+   'synchronized slot has got its own inactive_since');
+

By using <= we are not testing that it must get its own inactive_since (as we
allow them to be equal in the test). I think we should just add some usleep()
where appropriate and deny equality during the tests on inactive_since.

Except for the above, v32-0001 LGTM.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-04-03 Thread Bertrand Drouvot
Hi,

On Wed, Apr 03, 2024 at 11:17:41AM +0530, Bharath Rupireddy wrote:
> On Wed, Apr 3, 2024 at 8:38 AM shveta malik  wrote:
> >
> > > > Or a simple solution is that the slotsync worker updates
> > > > inactive_since as it does for non-synced slots, and disables
> > > > timeout-based slot invalidation for synced slots.
> >
> > I like this idea better, it takes care of such a case too when the
> > user is relying on sync-function rather than worker and does not want
> > to get the slots invalidated in between 2 sync function calls.
> 
> Please find the attached v31 patches implementing the above idea:

Thanks!

Some comments regarding v31-0002:

=== testing the behavior

T1 ===

> - synced slots don't get invalidated due to inactive timeout because
> such slots not considered active at all as they don't perform logical
> decoding (of course, they will perform in fast_forward mode to fix the
> other data loss issue, but they don't generate changes for them to be
> called as *active* slots)

It behaves as described. OTOH non-synced logical slots on the standby and
physical slots on the standby are invalidated, which is what is expected.

T2 ===

In case the slot is invalidated on the primary,

primary:

postgres=# select slot_name, inactive_since, invalidation_reason from 
pg_replication_slots where slot_name = 's1';
 slot_name |inactive_since | invalidation_reason
---+---+-
 s1| 2024-04-03 06:56:28.075637+00 | inactive_timeout

then on the standby we get:

standby:

postgres=# select slot_name, inactive_since, invalidation_reason from 
pg_replication_slots where slot_name = 's1';
 slot_name |inactive_since| invalidation_reason
---+--+-
 s1| 2024-04-03 07:06:43.37486+00 | inactive_timeout

shouldn't the slot be dropped/recreated instead of updating inactive_since?

=== code

CR1 ===

+Invalidates replication slots that are inactive for longer the
+specified amount of time

s/for longer the/for longer than/?

CR2 ===

+true) as such synced slots don't actually perform
+logical decoding.

We're switching to fast forward logical decoding due to [1], so I'm not sure
that's 100% accurate here. I'm not sure we need to specify a reason.

CR3 ===

+ errdetail("This slot has been invalidated because it was inactive for more 
than the time specified by replication_slot_inactive_timeout parameter.")));

I think we can remove "parameter" (see for example the error message in
validate_remote_info()) and reduce it a bit, something like?

"This slot has been invalidated because it was inactive for more than 
replication_slot_inactive_timeout"?

CR4 ===

+ appendStringInfoString(_detail, _("The slot has been inactive for more 
than the time specified by replication_slot_inactive_timeout parameter."));

Same.

CR5 ===

+   /*
+* This function isn't expected to be called for inactive timeout based
+* invalidation. A separate function InvalidateInactiveReplicationSlot 
is
+* to be used for that.

Do you think it's worth to explain why?

CR6 ===

+   if (replication_slot_inactive_timeout == 0)
+   return false;
+   else if (slot->inactive_since > 0)

"else" is not needed here.

CR7 ===

+   SpinLockAcquire(&slot->mutex);
+
+   /*
+* Check if the slot needs to be invalidated due to
+* replication_slot_inactive_timeout GUC. We do this with the 
spinlock
+* held to avoid race conditions -- for example the 
inactive_since
+* could change, or the slot could be dropped.
+*/
+   now = GetCurrentTimestamp();

We should not call GetCurrentTimestamp() while holding a spinlock.

CR8 ===

+# Testcase start: Invalidate streaming standby's slot as well as logical
+# failover slot on primary due to inactive timeout GUC. Also, check the logical

s/inactive timeout GUC/replication_slot_inactive_timeout/?

CR9 ===

+# Start: Helper functions used for this test file
+# End: Helper functions used for this test file

I think that's the first TAP test with this comment. Not saying we should not,
but why did you feel the need to add those?

[1]: 
https://www.postgresql.org/message-id/os0pr01mb5716b3942ae49f3f725aca9294...@os0pr01mb5716.jpnprd01.prod.outlook.com

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-04-03 Thread Bertrand Drouvot
Hi,

On Wed, Apr 03, 2024 at 11:17:41AM +0530, Bharath Rupireddy wrote:
> On Wed, Apr 3, 2024 at 8:38 AM shveta malik  wrote:
> >
> > > > Or a simple solution is that the slotsync worker updates
> > > > inactive_since as it does for non-synced slots, and disables
> > > > timeout-based slot invalidation for synced slots.
> >
> > I like this idea better, it takes care of such a case too when the
> > user is relying on sync-function rather than worker and does not want
> > to get the slots invalidated in between 2 sync function calls.
> 
> Please find the attached v31 patches implementing the above idea:

Thanks!

Some comments related to v31-0001:

=== testing the behavior

T1 ===

> - synced slots get their on inactive_since just like any other slot

It behaves as described.

T2 ===

> - synced slots inactive_since is set to current timestamp after the
> standby gets promoted to help inactive_since interpret correctly just
> like any other slot.
 
It behaves as described.

CR1 ===

+inactive_since value will get updated
+after every synchronization

What about "indicates the last synchronization time"? (I think that "after
every synchronization" could lead to confusion.)

CR2 ===

+   /*
+* Set the time since the slot has become inactive 
after shutting
+* down slot sync machinery. This helps correctly 
interpret the
+* time if the standby gets promoted without a restart.
+*/

It looks to me like this comment is not at the right place because there is
nothing after the comment that indicates that we shut down the "slot sync
machinery".

Maybe a better place would be before the function definition, mentioning that
this is currently called when we shut down the "slot sync machinery"?

CR3 ===

+* We get the current time beforehand and only once to 
avoid
+* system calls overhead while holding the lock.

s/avoid system calls overhead while holding the lock/avoid system calls while 
holding the spinlock/?

CR4 ===

+* Set the time since the slot has become inactive. We get the current
+* time beforehand to avoid system call overhead while holding the lock

Same.

CR5 ===

+   # Check that the captured time is sane
+   if (defined $reference_time)
+   {

s/Check that the captured time is sane/Check that the inactive_since is sane/?

Sorry, some of those comments could have been made while I was reviewing
v29-0001.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Synchronizing slots from primary to standby

2024-04-02 Thread Bertrand Drouvot
Hi,

On Tue, Apr 02, 2024 at 02:19:30PM +0530, Amit Kapila wrote:
> On Tue, Apr 2, 2024 at 1:54 PM Bertrand Drouvot
>  wrote:
> > What about adding a "wait" injection point in LogStandbySnapshot() to 
> > prevent
> > checkpointer/bgwriter to log a standby snapshot? Something among those 
> > lines:
> >
> >if (AmCheckpointerProcess() || AmBackgroundWriterProcess())
> >INJECTION_POINT("bgw-log-standby-snapshot");
> >
> > And make use of it in the test, something like:
> >
> >$node_primary->safe_psql('postgres',
> >"SELECT injection_points_attach('bgw-log-standby-snapshot', 
> > 'wait');");
> >
> 
> Sometimes we want the checkpoint to log the standby snapshot as we
> need it at a predictable time, maybe one can use
> pg_log_standby_snapshot() instead of that. Can we add an injection
> point as a separate patch/commit after a bit more discussion?

Sure, let's come back to this injection point discussion after the feature
freeze. BTW, I think it could also be useful to make use of an injection point
for the test that has been added in 7f13ac8123.

I'll open a new thread for this at that time.

> One other
> idea to make such tests predictable is to add a developer-specific GUC
> say debug_bg_log_standby_snapshot or something like that but injection
> point sounds like a better idea.

Agree.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-04-02 Thread Bertrand Drouvot
Hi,

On Tue, Apr 02, 2024 at 12:41:35PM +0530, Bharath Rupireddy wrote:
> On Tue, Apr 2, 2024 at 11:58 AM Bertrand Drouvot
>  wrote:
> >
> > > Or a simple solution is that the slotsync worker updates
> > > inactive_since as it does for non-synced slots, and disables
> > > timeout-based slot invalidation for synced slots.
> >
> > Yeah, I think the main question to help us decide is: do we want to 
> > invalidate
> > "inactive" synced slots locally (in addition to synchronizing the 
> > invalidation
> > from the primary)?
> 
> I think this approach looks way simpler than the other one. The other
> approach of linking inactive_since on the standby for synced slots to
> the actual LSNs (or other slot parameters) being updated or not looks
> more complicated, and might not go well with the end user.  However,
> we need to be able to say why we don't invalidate synced slots due to
> inactive timeout unlike the wal_removed invalidation that can happen
> right now on the standby for synced slots. This leads us to define
> actually what a slot being active means. Is syncing the data from the
> remote slot considered as the slot being active?
> 
> On the other hand, it may not sound great if we don't invalidate
> synced slots due to inactive timeout even though they hold resources
> such as WAL and XIDs.

Right, and the "only" benefit then would be to give an idea as to when the last
sync occurred on the local slot.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Synchronizing slots from primary to standby

2024-04-02 Thread Bertrand Drouvot
Hi,

On Tue, Apr 02, 2024 at 07:20:46AM +, Zhijie Hou (Fujitsu) wrote:
> I added one test in 040_standby_failover_slots_sync.pl in 0002 patch, which 
> can
> reproduce the data loss issue consistently on my machine.

Thanks!

> It may not reproduce
> in some rare cases if concurrent xl_running_xacts are written by bgwriter, but
> I think it's still valuable if it can verify the fix in most cases.

What about adding a "wait" injection point in LogStandbySnapshot() to prevent
checkpointer/bgwriter to log a standby snapshot? Something among those lines:

   if (AmCheckpointerProcess() || AmBackgroundWriterProcess())
   INJECTION_POINT("bgw-log-standby-snapshot");

And make use of it in the test, something like:

   $node_primary->safe_psql('postgres',
   "SELECT injection_points_attach('bgw-log-standby-snapshot', 
'wait');");

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-04-02 Thread Bertrand Drouvot
Hi,

On Tue, Apr 02, 2024 at 12:07:54PM +0900, Masahiko Sawada wrote:
> On Mon, Apr 1, 2024 at 12:18 PM Bharath Rupireddy
> 
> FWIW, coming to this thread late, I think that the inactive_since
> should not be synchronized from the primary. The wall clocks are
> different on the primary and the standby so having the primary's
> timestamp on the standby can confuse users, especially when there is a
> big clock drift. Also, as Amit mentioned, inactive_since seems not to
> be consistent with other parameters we copy. The
> replication_slot_inactive_timeout feature should work on the standby
> independent from the primary, like other slot invalidation mechanisms,
> and it should be based on its own local clock.

Thanks for sharing your thoughts! So, it looks like most of us agree not to
sync inactive_since from the primary; I'm fine with that.

> If we want to invalidate the synced slots due to the timeout, I think
> we need to define what is "inactive" for synced slots.
> 
> Suppose that the slotsync worker updates the local (synced) slot's
> inactive_since whenever releasing the slot, irrespective of the actual
> LSNs (or other slot parameters) having been updated. I think that this
> idea cannot handle a slot that is not acquired on the primary. In this
> case, the remote slot is inactive but the local slot is regarded as
> active.  WAL files are piled up on the standby (and on the primary) as
> the slot's LSNs don't move forward. I think we want to regard such a
> slot as "inactive" also on the standby and invalidate it because of
> the timeout.

I think it makes sense to somehow link inactive_since on the standby to whether
the actual LSNs (or other slot parameters) are being updated or not.

> > > Now, the other concern is that calling GetCurrentTimestamp()
> > > could be costly when the values for the slot are not going to be
> > > updated but if that happens we can optimize such that before acquiring
> > > the slot we can have some minimal pre-checks to ensure whether we need
> > > to update the slot or not.
> 
> If we use such pre-checks, another problem might happen; it cannot
> handle a case where the slot is acquired on the primary but its LSNs
> don't move forward. Imagine a logical replication conflict happened on
> the subscriber, and the logical replication enters the retry loop. In
> this case, the remote slot's inactive_since gets updated for every
> retry, but it looks inactive from the standby since the slot LSNs
> don't change. Therefore, only the local slot could be invalidated due
> to the timeout but probably we don't want to regard such a slot as
> "inactive".
> 
> Another idea I came up with is that the slotsync worker updates the
> local slot's inactive_since to the local timestamp only when the
> remote slot might have got inactive. If the remote slot is acquired by
> someone, the local slot's inactive_since is also NULL. If the remote
> slot gets inactive, the slotsync worker sets the local timestamp to
> the local slot's inactive_since. Since the remote slot could be
> acquired and released before the slotsync worker gets the remote slot
> data again, if the remote slot's inactive_since > the local slot's
> inactive_since, the slotsync worker updates the local one.

Then I think we would need to be careful about time zone comparison.

> IOW, we
> detect whether the remote slot was acquired and released since the
> last synchronization, by checking the remote slot's inactive_since.
> This idea seems to handle these cases I mentioned unless I'm missing
> something, but it requires for the slotsync worker to update
> inactive_since in a different way than other parameters.
> 
> Or a simple solution is that the slotsync worker updates
> inactive_since as it does for non-synced slots, and disables
> timeout-based slot invalidation for synced slots.

Yeah, I think the main question to help us decide is: do we want to invalidate
"inactive" synced slots locally (in addition to synchronizing the invalidation
from the primary)? 

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Synchronizing slots from primary to standby

2024-04-01 Thread Bertrand Drouvot
Hi,

On Tue, Apr 02, 2024 at 04:24:49AM +, Zhijie Hou (Fujitsu) wrote:
> On Monday, April 1, 2024 9:28 PM Bertrand Drouvot 
>  wrote:
> > 
> > On Mon, Apr 01, 2024 at 05:04:53PM +0530, Amit Kapila wrote:
> > > On Mon, Apr 1, 2024 at 2:51 PM Bertrand Drouvot
> > 
> > >
> > > > 2 ===
> > > >
> > > > +   {
> > > > +   if (SnapBuildSnapshotExists(remote_slot->restart_lsn))
> > > > +   {
> > > >
> > > > That could call SnapBuildSnapshotExists() multiple times for the
> > > > same "restart_lsn" (for example in case of multiple remote slots to 
> > > > sync).
> > > >
> > > > What if the sync worker records the last lsn it asks for
> > > > serialization (and serialized ? Then we could check that value first
> > > > before deciding to call (or not)
> > > > SnapBuildSnapshotExists() on it?
> > > >
> > > > It's not ideal because it would record "only the last one" but that
> > > > would be simple enough for now (currently there is only one sync
> > > > worker so that scenario is likely to happen).
> > > >
> > >
> > > Yeah, we could do that but I am not sure how much it can help. I guess
> > > we could do some tests to see if it helps.
> > 
> > Yeah not sure either. I just think it can only help and shouldn't make 
> > things
> > worst (but could avoid extra SnapBuildSnapshotExists() calls).
> 
> Thanks for the idea. I tried some tests based on Nisha's setup[1].

Thank you and Nisha and Shveta for the testing!

> I tried to
> advance the slots on the primary to the same restart_lsn before calling
> sync_replication_slots(), and reduced the data generated by pgbench.

Agree that this scenario makes sense to try to see the impact of
SnapBuildSnapshotExists().

> The SnapBuildSnapshotExists is still not noticeable in the profile.

The number of SnapBuildSnapshotExists() calls is probably negligible when
compared to the I/O calls generated by the fast forward logical decoding in this
scenario.

> So, I feel we
> could leave this as a further improvement once we encounter scenarios where
> the duplicate SnapBuildSnapshotExists call becomes noticeable.

Sounds reasonable to me.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Synchronizing slots from primary to standby

2024-04-01 Thread Bertrand Drouvot
Hi,

On Mon, Apr 01, 2024 at 05:04:53PM +0530, Amit Kapila wrote:
> On Mon, Apr 1, 2024 at 2:51 PM Bertrand Drouvot
>  wrote:
> > Then there is no need to call WaitForStandbyConfirmation() as it could go 
> > until
> > the RecoveryInProgress() in StandbySlotsHaveCaughtup() for nothing (as we 
> > already
> > know it).
> >
> 
> Won't it will normally return from the first check in
> WaitForStandbyConfirmation() because standby_slot_names_config is not
> set on standby?

I think standby_slot_names can be set on a standby. One could want to set it in
a cascading standby environment (though it won't have any real effect until the
standby is promoted).

> 
> > 2 ===
> >
> > +   {
> > +   if (SnapBuildSnapshotExists(remote_slot->restart_lsn))
> > +   {
> >
> > That could call SnapBuildSnapshotExists() multiple times for the same
> > "restart_lsn" (for example in case of multiple remote slots to sync).
> >
> > What if the sync worker records the last lsn it asks for serialization (and
> > serialized ? Then we could check that value first before deciding to call 
> > (or not)
> > SnapBuildSnapshotExists() on it?
> >
> > It's not ideal because it would record "only the last one" but that would be
> > simple enough for now (currently there is only one sync worker so that 
> > scenario
> > is likely to happen).
> >
> 
> Yeah, we could do that but I am not sure how much it can help. I guess
> we could do some tests to see if it helps.

Yeah, not sure either. I just think it can only help and shouldn't make things
worse (but could avoid extra SnapBuildSnapshotExists() calls).

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-04-01 Thread Bertrand Drouvot
Hi,

On Sun, Mar 31, 2024 at 10:25:46AM +0530, Bharath Rupireddy wrote:
> On Thu, Mar 28, 2024 at 3:13 PM Bertrand Drouvot
>  wrote:
> > I think in this case it should always reflect the value from the primary (so
> > that one can understand why it is invalidated).
> 
> I'll come back to this as soon as we all agree on inactive_since
> behavior for synced slots.

Makes sense. Also, if the majority of us think it's not needed for
inactive_since to be an exact copy of the primary's, then let's go that way.

> > I think when it is invalidated it should always reflect the value from the
> > primary (so that one can understand why it is invalidated).
> 
> I'll come back to this as soon as we all agree on inactive_since
> behavior for synced slots.

Yeah.

> > T4 ===
> >
> > Also, it looks like querying pg_replication_slots() does not trigger an
> > invalidation: I think it should if the slot is not invalidated yet (and 
> > matches
> > the invalidation criteria).
> 
> There's a different opinion on this, check comment #3 from
> https://www.postgresql.org/message-id/CAA4eK1LLj%2BeaMN-K8oeOjfG%2BUuzTY%3DL5PXbcMJURZbFm%2B_aJSA%40mail.gmail.com.

Oh right, I can see Amit's point too. Let's put pg_replication_slots() out of
the game then.

> > CR6 ===
> >
> > +static bool
> > +InvalidateSlotForInactiveTimeout(ReplicationSlot *slot, bool need_locks)
> > +{
> >
> > InvalidatePossiblyInactiveSlot() maybe?
> 
> I think we will lose the essence i.e. timeout from the suggested
> function name, otherwise just the inactive doesn't give a clearer
> meaning. I kept it that way unless anyone suggests otherwise.

Right. OTOH I think that "Possibly" adds some nuance (as
InvalidatePossiblyObsoleteSlot() already does).

> Please see the attached v30 patch. 0002 is where all of the above
> review comments have been addressed.

Thanks! FYI, I did not look at the content yet, just replied to the above
comments.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Synchronizing slots from primary to standby

2024-04-01 Thread Bertrand Drouvot
Hi,

On Mon, Apr 01, 2024 at 06:05:34AM +, Zhijie Hou (Fujitsu) wrote:
> On Monday, April 1, 2024 8:56 AM Zhijie Hou (Fujitsu) 
>  wrote:
> Attach the V4 patch which includes the optimization to skip the decoding if
> the snapshot at the syncing restart_lsn is already serialized. It can avoid 
> most
> of the duplicate decoding in my test, and I am doing some more tests locally.
> 

Thanks!

1 ===

Same comment as in [1].

In LogicalSlotAdvanceAndCheckReadynessForDecoding(), if we are synchronizing 
slots
then I think that we can skip:

+   /*
+* Wait for specified streaming replication standby servers (if 
any)
+* to confirm receipt of WAL up to moveto lsn.
+*/
+   WaitForStandbyConfirmation(moveto);

Indeed, if we are dealing with a synced slot then we know we're in
RecoveryInProgress().

Then there is no need to call WaitForStandbyConfirmation(), as it would go all
the way to the RecoveryInProgress() check in StandbySlotsHaveCaughtup() for
nothing (as we already know the answer).
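
Something among those lines maybe (sketch only, assuming a helper like
IsSyncingReplicationSlots() can be used at that place):

   /*
    * Synced slots are advanced during recovery, so there is no point in
    * waiting for the standbys listed in standby_slot_names to confirm
    * the WAL receipt here.
    */
   if (!IsSyncingReplicationSlots())
       WaitForStandbyConfirmation(moveto);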

2 ===

+   {
+   if (SnapBuildSnapshotExists(remote_slot->restart_lsn))
+   {

That could call SnapBuildSnapshotExists() multiple times for the same
"restart_lsn" (for example in case of multiple remote slots to sync).

What if the sync worker records the last lsn it asked a serialization for (and
that has been serialized)? Then we could check that value first before deciding
to call (or not) SnapBuildSnapshotExists() on it?

It's not ideal because it would record "only the last one" but that would be
simple enough for now (currently there is only one sync worker so that scenario
is likely to happen).
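
To illustrate, a rough sketch (hypothetical variable name, ignoring the "more
than one sync worker" case):

   /* last restart_lsn we serialized a snapshot for (sketch only) */
   static XLogRecPtr last_serialized_lsn = InvalidXLogRecPtr;

   bool    snapshot_on_disk;

   if (remote_slot->restart_lsn == last_serialized_lsn)
       snapshot_on_disk = true;    /* no need to probe the disk again */
   else
   {
       snapshot_on_disk = SnapBuildSnapshotExists(remote_slot->restart_lsn);
       if (snapshot_on_disk)
           last_serialized_lsn = remote_slot->restart_lsn;
   }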

Maybe an idea for future improvement (not for now) could be that
SnapBuildSerialize() maintains a "small list" of "already serialized" snapshots.

[1]: 
https://www.postgresql.org/message-id/ZgayTFIhLfzhpHci%40ip-10-97-1-34.eu-west-3.compute.internal

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-04-01 Thread Bertrand Drouvot
> > the slot we can have some minimal pre-checks to ensure whether we need
> > to update the slot or not.

Also maybe we could accept a bit less accuracy and use
GetCurrentTransactionStopTimestamp() instead?

> If we are too much concerned about the cost of GetCurrentTimestamp(),
> a possible approach is just don't set inactive_since for slots being
> synced on the standby.
> Just let the first acquisition and release
> after the promotion do that job. We can always call this out in the
> docs saying "replication slots on the streaming standbys which are
> being synced from the primary are not inactive in practice, so the
> inactive_since is always NULL for them unless the standby is
> promoted".

I think that was the initial behavior that led to Robert's remark (see [2]):

"
And I'm suspicious that having an exception for slots being synced is
a bad idea. That makes too much of a judgement about how the user will
use this field. It's usually better to just expose the data, and if
the user needs helps to make sense of that data, then give them that
help separately.
"

[1]: 
https://www.postgresql.org/message-id/CAA4eK1JtKieWMivbswYg5FVVB5FugCftLvQKVsxh%3Dm_8nk04vw%40mail.gmail.com
[2]: 
https://www.postgresql.org/message-id/CA%2BTgmob_Ta-t2ty8QrKHBGnNLrf4ZYcwhGHGFsuUoFrAEDw4sA%40mail.gmail.com

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-04-01 Thread Bertrand Drouvot
Hi,

On Mon, Apr 01, 2024 at 09:04:43AM +0530, Amit Kapila wrote:
> On Fri, Mar 29, 2024 at 6:17 PM Bertrand Drouvot
>  wrote:
> >
> > On Fri, Mar 29, 2024 at 03:03:01PM +0530, Amit Kapila wrote:
> > > On Fri, Mar 29, 2024 at 11:49 AM Bertrand Drouvot
> > >  wrote:
> > > >
> > > > On Fri, Mar 29, 2024 at 09:39:31AM +0530, Amit Kapila wrote:
> > > > >
> > > > > Commit message states: "why we can't just update inactive_since for
> > > > > synced slots on the standby with the value received from remote slot
> > > > > on the primary. This is consistent with any other slot parameter i.e.
> > > > > all of them are synced from the primary."
> > > > >
> > > > > The inactive_since is not consistent with other slot parameters which
> > > > > we copy. We don't perform anything related to those other parameters
> > > > > like say two_phase phase which can change that property. However, we
> > > > > do acquire the slot, advance the slot (as per recent discussion [1]),
> > > > > and release it. Since these operations can impact inactive_since, it
> > > > > seems to me that inactive_since is not the same as other parameters.
> > > > > It can have a different value than the primary. Why would anyone want
> > > > > to know the value of inactive_since from primary after the standby is
> > > > > promoted?
> > > >
> > > > I think it can be useful "before" it is promoted and in case the 
> > > > primary is down.
> > > >
> > >
> > > It is not clear to me what is user going to do by checking the
> > > inactivity time for slots when the corresponding server is down.
> >
> > Say a failover needs to be done, then it could be useful to know for which
> > slots the activity needs to be resumed (thinking about external logical 
> > decoding
> > plugin, not about pub/sub here). If one see an inactive slot (since long 
> > "enough")
> > then he can start to reasonate about what to do with it.
> >
> > > I thought the idea was to check such slots and see if they need to be
> > > dropped or enabled again to avoid excessive disk usage, etc.
> >
> > Yeah that's the case but it does not mean inactive_since can't be useful in 
> > other
> > ways.
> >
> > Also, say the slot has been invalidated on the primary (due to inactivity 
> > timeout),
> > primary is down and there is a failover. By keeping the inactive_since from
> > the primary, one could know when the inactivity that lead to the timeout 
> > started.
> >
> 
> So, this means at promotion, we won't set the current_time for
> inactive_since which is not what the currently proposed patch is
> doing.

Yeah, that's why I made the comment T2 in [1].

> Moreover, doing the invalidation on promoted standby based on
> inactive_since of the primary node sounds debatable because the
> inactive_timeout could be different on the new node (promoted
> standby).

I think that if the slot is not invalidated before the promotion then we should
erase the value from the primary and use the promotion time.
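
i.e. at promotion time, something like (rough sketch only):

   /* slot not invalidated yet: forget the primary's value */
   if (slot->data.invalidated == RS_INVAL_NONE)
       slot->inactive_since = GetCurrentTimestamp();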

> > Again, more concerned about external logical decoding plugin than pub/sub 
> > here.
> >
> > > > I agree that tracking the activity time of a synced slot can be useful, 
> > > > why
> > > > not creating a dedicated field for that purpose (and keep 
> > > > inactive_since a
> > > > perfect "copy" of the primary)?
> > > >
> > >
> > > We can have a separate field for this but not sure if it is worth it.
> >
> > OTOH I'm not sure that erasing this information from the primary is useful. 
> > I
> > think that 2 fields would be the best option and would be less subject of
> > misinterpretation.
> >
> > > > > Now, the other concern is that calling GetCurrentTimestamp()
> > > > > could be costly when the values for the slot are not going to be
> > > > > updated but if that happens we can optimize such that before acquiring
> > > > > the slot we can have some minimal pre-checks to ensure whether we need
> > > > > to update the slot or not.
> > > >
> > > > Right, but for a very active slot it is likely that we call 
> > > > GetCurrentTimestamp()
> > > > during almost each sync cycle.
> > > >
> > >
> > > True, but if we have to save a slot to disk each time to pe

Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-29 Thread Bertrand Drouvot
Hi,

On Fri, Mar 29, 2024 at 03:03:01PM +0530, Amit Kapila wrote:
> On Fri, Mar 29, 2024 at 11:49 AM Bertrand Drouvot
>  wrote:
> >
> > On Fri, Mar 29, 2024 at 09:39:31AM +0530, Amit Kapila wrote:
> > >
> > > Commit message states: "why we can't just update inactive_since for
> > > synced slots on the standby with the value received from remote slot
> > > on the primary. This is consistent with any other slot parameter i.e.
> > > all of them are synced from the primary."
> > >
> > > The inactive_since is not consistent with other slot parameters which
> > > we copy. We don't perform anything related to those other parameters
> > > like say two_phase phase which can change that property. However, we
> > > do acquire the slot, advance the slot (as per recent discussion [1]),
> > > and release it. Since these operations can impact inactive_since, it
> > > seems to me that inactive_since is not the same as other parameters.
> > > It can have a different value than the primary. Why would anyone want
> > > to know the value of inactive_since from primary after the standby is
> > > promoted?
> >
> > I think it can be useful "before" it is promoted and in case the primary is 
> > down.
> >
> 
> It is not clear to me what is user going to do by checking the
> inactivity time for slots when the corresponding server is down.

Say a failover needs to be done; then it could be useful to know for which
slots the activity needs to be resumed (thinking about an external logical
decoding plugin, not about pub/sub here). If one sees a slot that has been
inactive for long "enough", then one can start to reason about what to do with it.

> I thought the idea was to check such slots and see if they need to be
> dropped or enabled again to avoid excessive disk usage, etc.

Yeah that's the case but it does not mean inactive_since can't be useful in 
other
ways.

Also, say the slot has been invalidated on the primary (due to the inactivity
timeout), the primary is down, and there is a failover. By keeping the
inactive_since from the primary, one could know when the inactivity that led to
the timeout started.

Again, more concerned about external logical decoding plugin than pub/sub here.

> > I agree that tracking the activity time of a synced slot can be useful, why
> > not creating a dedicated field for that purpose (and keep inactive_since a
> > perfect "copy" of the primary)?
> >
> 
> We can have a separate field for this but not sure if it is worth it.

OTOH I'm not sure that erasing this information coming from the primary is
useful. I think that 2 fields would be the best option and would be less
subject to misinterpretation.

> > > Now, the other concern is that calling GetCurrentTimestamp()
> > > could be costly when the values for the slot are not going to be
> > > updated but if that happens we can optimize such that before acquiring
> > > the slot we can have some minimal pre-checks to ensure whether we need
> > > to update the slot or not.
> >
> > Right, but for a very active slot it is likely that we call 
> > GetCurrentTimestamp()
> > during almost each sync cycle.
> >
> 
> True, but if we have to save a slot to disk each time to persist the
> changes (for an active slot) then probably GetCurrentTimestamp()
> shouldn't be costly enough to matter.

Right, persisting the changes to disk would be even more costly.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Synchronizing slots from primary to standby

2024-03-29 Thread Bertrand Drouvot
Hi,

On Fri, Mar 29, 2024 at 02:35:22PM +0530, Amit Kapila wrote:
> On Fri, Mar 29, 2024 at 1:08 PM Bertrand Drouvot
>  wrote:
> >
> > On Fri, Mar 29, 2024 at 07:23:11AM +, Zhijie Hou (Fujitsu) wrote:
> > > On Friday, March 29, 2024 2:48 PM Bertrand Drouvot 
> > >  wrote:
> > > >
> > > > Hi,
> > > >
> > > > On Fri, Mar 29, 2024 at 01:06:15AM +, Zhijie Hou (Fujitsu) wrote:
> > > > > Attach a new version patch which fixed an un-initialized variable
> > > > > issue and added some comments. Also, temporarily enable DEBUG2 for the
> > > > > 040 tap-test so that we can analyze the possible CFbot failures 
> > > > > easily.
> > > > >
> > > >
> > > > Thanks!
> > > >
> > > > +   if (remote_slot->confirmed_lsn != slot->data.confirmed_flush)
> > > > +   {
> > > > +   /*
> > > > +* By advancing the restart_lsn, confirmed_lsn, and 
> > > > xmin using
> > > > +* fast-forward logical decoding, we ensure that the 
> > > > required
> > > > snapshots
> > > > +* are saved to disk. This enables logical decoding to 
> > > > quickly
> > > > reach a
> > > > +* consistent point at the restart_lsn, eliminating the 
> > > > risk of
> > > > missing
> > > > +* data during snapshot creation.
> > > > +*/
> > > > +
> > > > pg_logical_replication_slot_advance(remote_slot->confirmed_lsn,
> > > > +
> > > > found_consistent_point);
> > > > +   ReplicationSlotsComputeRequiredLSN();
> > > > +   updated_lsn = true;
> > > > +   }
> > > >
> > > > Instead of using pg_logical_replication_slot_advance() for each synced 
> > > > slot and
> > > > during sync cycles what about?:
> > > >
> > > > - keep sync slot synchronization as it is currently (not using
> > > > pg_logical_replication_slot_advance())
> > > > - create "an hidden" logical slot if sync slot feature is on
> > > > - at the time of promotion use pg_logical_replication_slot_advance() on 
> > > > this
> > > > hidden slot only to advance to the max lsn of the synced slots
> > > >
> > > > I'm not sure that would be enough, just asking your thoughts on this 
> > > > (benefits
> > > > would be to avoid calling pg_logical_replication_slot_advance() on each 
> > > > sync
> > > > slots and during the sync cycles).
> > >
> > > Thanks for the idea !
> > >
> > > I considered about this. I think advancing the "hidden" slot on promotion 
> > > may be a
> > > bit late, because if we cannot reach the consistent point after advancing 
> > > the
> > > "hidden" slot, then it means we may need to remove all the synced slots 
> > > as we
> > > are not sure if they are usable(will not loss data) after promotion.
> >
> > What about advancing the hidden slot during the sync cycles then?
> >
> > > The current approach is to mark such un-consistent slot as temp and 
> > > persist
> > > them once it reaches consistent point, so that user can ensure the slot 
> > > can be
> > > used after promotion once persisted.
> >
> > Right, but do we need to do so for all the sync slots? Would a single hidden
> > slot be enough?
> >
> 
> Even if we mark one of the synced slots as persistent without reaching
> a consistent state, it could create a problem after promotion. And,
> how a single hidden slot would serve the purpose, different synced
> slots will have different restart/confirmed_flush LSN and we won't be
> able to perform advancing for those using a single slot. For example,
> say for first synced slot, it has not reached a consistent state and
> then how can it try for the second slot? This sounds quite tricky to
> make work. We should go with something simple where the chances of
> introducing bugs are lesser.

Yeah, better to go with something simple.

+   if (remote_slot->confirmed_lsn != slot->data.confirmed_flush)
+   {
+   /*
+* By advancing the restart_lsn, confirmed_lsn, and xmin using
+* fast-forward logical decoding, we ensure that the required 
snapshots
+* are saved to disk. This enables logical decoding to quickly 
reach a
+* consistent point at the restart_lsn, eliminating the risk of 
missing
+* data during snapshot creation.
+*/
+   pg_logical_replication_slot_advance(remote_slot->confirmed_lsn,
+   
found_consistent_point);

In our case, what about skipping WaitForStandbyConfirmation() in
pg_logical_replication_slot_advance()? (It could go until the
RecoveryInProgress() check in StandbySlotsHaveCaughtup() if we don't skip it).

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Synchronizing slots from primary to standby

2024-03-29 Thread Bertrand Drouvot
Hi,

On Fri, Mar 29, 2024 at 07:23:11AM +, Zhijie Hou (Fujitsu) wrote:
> On Friday, March 29, 2024 2:48 PM Bertrand Drouvot 
>  wrote:
> > 
> > Hi,
> > 
> > On Fri, Mar 29, 2024 at 01:06:15AM +, Zhijie Hou (Fujitsu) wrote:
> > > Attach a new version patch which fixed an un-initialized variable
> > > issue and added some comments. Also, temporarily enable DEBUG2 for the
> > > 040 tap-test so that we can analyze the possible CFbot failures easily.
> > >
> > 
> > Thanks!
> > 
> > +   if (remote_slot->confirmed_lsn != slot->data.confirmed_flush)
> > +   {
> > +   /*
> > +* By advancing the restart_lsn, confirmed_lsn, and xmin 
> > using
> > +* fast-forward logical decoding, we ensure that the 
> > required
> > snapshots
> > +* are saved to disk. This enables logical decoding to 
> > quickly
> > reach a
> > +* consistent point at the restart_lsn, eliminating the 
> > risk of
> > missing
> > +* data during snapshot creation.
> > +*/
> > +
> > pg_logical_replication_slot_advance(remote_slot->confirmed_lsn,
> > +
> > found_consistent_point);
> > +   ReplicationSlotsComputeRequiredLSN();
> > +   updated_lsn = true;
> > +   }
> > 
> > Instead of using pg_logical_replication_slot_advance() for each synced slot 
> > and
> > during sync cycles what about?:
> > 
> > - keep sync slot synchronization as it is currently (not using
> > pg_logical_replication_slot_advance())
> > - create "an hidden" logical slot if sync slot feature is on
> > - at the time of promotion use pg_logical_replication_slot_advance() on this
> > hidden slot only to advance to the max lsn of the synced slots
> > 
> > I'm not sure that would be enough, just asking your thoughts on this 
> > (benefits
> > would be to avoid calling pg_logical_replication_slot_advance() on each sync
> > slots and during the sync cycles).
> 
> Thanks for the idea !
> 
> I considered about this. I think advancing the "hidden" slot on promotion may 
> be a
> bit late, because if we cannot reach the consistent point after advancing the
> "hidden" slot, then it means we may need to remove all the synced slots as we
> are not sure if they are usable(will not loss data) after promotion.

What about advancing the hidden slot during the sync cycles then?

> The current approach is to mark such un-consistent slot as temp and persist
> them once it reaches consistent point, so that user can ensure the slot can be
> used after promotion once persisted.

Right, but do we need to do so for all the sync slots? Would a single hidden
slot be enough?

> Another optimization idea is to check the snapshot file existence before 
> calling the
> slot_advance(). If the file already exists, we skip the decoding and directly
> update the restart_lsn. This way, we could also avoid some duplicate decoding
> work.

Yeah, I think it's a good idea (even better if we can do this check without
performing any I/O).
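
Something along those lines maybe (rough sketch only, assuming a helper like
SnapBuildSnapshotExists() that checks whether a serialized snapshot exists on
disk for a given LSN; the advance call is the one from the patch):

   if (SnapBuildSnapshotExists(remote_slot->restart_lsn))
   {
       /* snapshot already serialized: directly update the slot's lsn/xmin */
   }
   else
   {
       /* otherwise do the fast forward decoding so it gets serialized */
       pg_logical_replication_slot_advance(remote_slot->confirmed_lsn,
                                           found_consistent_point);
   }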

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Synchronizing slots from primary to standby

2024-03-29 Thread Bertrand Drouvot
Hi,

On Fri, Mar 29, 2024 at 01:06:15AM +, Zhijie Hou (Fujitsu) wrote:
> Attach a new version patch which fixed an un-initialized variable issue and
> added some comments. Also, temporarily enable DEBUG2 for the 040 tap-test so 
> that
> we can analyze the possible CFbot failures easily.
> 

Thanks!

+   if (remote_slot->confirmed_lsn != slot->data.confirmed_flush)
+   {
+   /*
+* By advancing the restart_lsn, confirmed_lsn, and xmin using
+* fast-forward logical decoding, we ensure that the required 
snapshots
+* are saved to disk. This enables logical decoding to quickly 
reach a
+* consistent point at the restart_lsn, eliminating the risk of 
missing
+* data during snapshot creation.
+*/
+   pg_logical_replication_slot_advance(remote_slot->confirmed_lsn,
+   
found_consistent_point);
+   ReplicationSlotsComputeRequiredLSN();
+   updated_lsn = true;
+   }

Instead of using pg_logical_replication_slot_advance() for each synced slot
and during sync cycles, what about the following?

- keep sync slot synchronization as it is currently (not using
pg_logical_replication_slot_advance())
- create "a hidden" logical slot if the sync slot feature is on
- at the time of promotion, use pg_logical_replication_slot_advance() on this
hidden slot only to advance to the max lsn of the synced slots

I'm not sure that would be enough, just asking your thoughts on this (the
benefit would be to avoid calling pg_logical_replication_slot_advance() on each
synced slot and during the sync cycles).

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-29 Thread Bertrand Drouvot
Hi,

On Fri, Mar 29, 2024 at 09:39:31AM +0530, Amit Kapila wrote:
> On Wed, Mar 27, 2024 at 9:00 PM Bharath Rupireddy
>  wrote:
> >
> >
> > Thanks. I'm attaching v29 patches. 0001 managing inactive_since on the
> > standby for sync slots.
> >
> 
> Commit message states: "why we can't just update inactive_since for
> synced slots on the standby with the value received from remote slot
> on the primary. This is consistent with any other slot parameter i.e.
> all of them are synced from the primary."
> 
> The inactive_since is not consistent with other slot parameters which
> we copy. We don't perform anything related to those other parameters
> like say two_phase phase which can change that property. However, we
> do acquire the slot, advance the slot (as per recent discussion [1]),
> and release it. Since these operations can impact inactive_since, it
> seems to me that inactive_since is not the same as other parameters.
> It can have a different value than the primary. Why would anyone want
> to know the value of inactive_since from primary after the standby is
> promoted?

I think it can be useful "before" it is promoted and in case the primary is down.
I agree that tracking the activity time of a synced slot can be useful, so why
not create a dedicated field for that purpose (and keep inactive_since a
perfect "copy" of the primary)?

> Now, the other concern is that calling GetCurrentTimestamp()
> could be costly when the values for the slot are not going to be
> updated but if that happens we can optimize such that before acquiring
> the slot we can have some minimal pre-checks to ensure whether we need
> to update the slot or not.

Right, but for a very active slot it is likely that we call
GetCurrentTimestamp() during almost every sync cycle.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Synchronizing slots from primary to standby

2024-03-28 Thread Bertrand Drouvot
Hi,

On Thu, Mar 28, 2024 at 05:05:35PM +0530, Amit Kapila wrote:
> On Thu, Mar 28, 2024 at 3:34 PM Bertrand Drouvot
>  wrote:
> >
> > On Thu, Mar 28, 2024 at 04:38:19AM +, Zhijie Hou (Fujitsu) wrote:
> >
> > > To fix this, we could use the fast forward logical decoding to advance 
> > > the synced
> > > slot's lsn/xmin when syncing these values instead of directly updating the
> > > slot's info. This way, the snapshot will be serialized to disk when 
> > > decoding.
> > > If we could not reach to the consistent point at the remote restart_lsn, 
> > > the
> > > slot is marked as temp and will be persisted once it reaches the 
> > > consistent
> > > point. I am still analyzing the fix and will share once ready.
> >
> > Thanks! I'm wondering about the performance impact (even in fast_forward 
> > mode),
> > might be worth to keep an eye on it.
> >
> 
> True, we can consider performance but correctness should be a
> priority,

Yeah of course.

> and can we think of a better way to fix this issue?

I'll keep you posted if there is one that I can think of.

> > Should we create a 17 open item [1]?
> >
> > [1]: https://wiki.postgresql.org/wiki/PostgreSQL_17_Open_Items
> >
> 
> Yes, we can do that.

done.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Synchronizing slots from primary to standby

2024-03-28 Thread Bertrand Drouvot
Hi,

On Thu, Mar 28, 2024 at 04:38:19AM +, Zhijie Hou (Fujitsu) wrote:
> Hi,
> 
> When analyzing one BF error[1], we find an issue of slotsync: Since we don't
> perform logical decoding for the synced slots when syncing the lsn/xmin of
> slot, no logical snapshots will be serialized to disk. So, when user starts to
> use these synced slots after promotion, it needs to re-build the consistent
> snapshot from the restart_lsn if the WAL(xl_running_xacts) at restart_lsn
> position indicates that there are running transactions. This however could
> cause the data that before the consistent point to be missed[2].

I see, nice catch and explanation, thanks!

> This issue doesn't exist on the primary because the snapshot at restart_lsn
> should have been serialized to disk (SnapBuildProcessRunningXacts ->
> SnapBuildSerialize), so even if the logical decoding restarts, it can find
> consistent snapshot immediately at restart_lsn.

Right.

> To fix this, we could use the fast forward logical decoding to advance the 
> synced
> slot's lsn/xmin when syncing these values instead of directly updating the
> slot's info. This way, the snapshot will be serialized to disk when decoding.
> If we could not reach to the consistent point at the remote restart_lsn, the
> slot is marked as temp and will be persisted once it reaches the consistent
> point. I am still analyzing the fix and will share once ready.

Thanks! I'm wondering about the performance impact (even in fast_forward mode);
it might be worth keeping an eye on it.

Should we create a 17 open item [1]?

[1]: https://wiki.postgresql.org/wiki/PostgreSQL_17_Open_Items

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-28 Thread Bertrand Drouvot
+   SaveSlotToPath(slot, path, ERROR);
+   }

Maybe we could create a new function, say GivenReplicationSlotSave(), taking a
slot as parameter, that ReplicationSlotSave() could call too?
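
Something along those lines maybe (rough sketch, the name is just a suggestion):

   static void
   GivenReplicationSlotSave(ReplicationSlot *slot)
   {
       char    path[MAXPGPATH];

       sprintf(path, "pg_replslot/%s", NameStr(slot->data.name));
       SaveSlotToPath(slot, path, ERROR);
   }

   void
   ReplicationSlotSave(void)
   {
       Assert(MyReplicationSlot != NULL);

       GivenReplicationSlotSave(MyReplicationSlot);
   }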

CR9 ===

+   if (check_for_invalidation)
+   {
+   /* The slot is ours by now */
+   Assert(s->active_pid == MyProcPid);
+
+   /*
+* Well, the slot is not yet ours really unless we check for the
+* invalidation below.
+*/
+   s->active_pid = 0;
+   if (InvalidateReplicationSlotForInactiveTimeout(s, true, true))
+   {
+   /*
+* If the slot has been invalidated, recalculate the 
resource
+* limits.
+*/
+   ReplicationSlotsComputeRequiredXmin(false);
+   ReplicationSlotsComputeRequiredLSN();
+
+   /* Might need it for slot clean up on error, so restore 
it */
+   s->active_pid = MyProcPid;
+   ereport(ERROR,
+   
(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+errmsg("cannot acquire invalidated 
replication slot \"%s\"",
+   
NameStr(MyReplicationSlot->data.name;
+   }
+   s->active_pid = MyProcPid;

Are we not missing some SpinLockAcquire/Release on the slot's mutex here? (the
places where we set the active_pid).
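
Something like (sketch only):

   SpinLockAcquire(&s->mutex);
   s->active_pid = 0;
   SpinLockRelease(&s->mutex);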

CR10 ===

@@ -1628,6 +1674,10 @@ 
InvalidatePossiblyObsoleteSlot(ReplicationSlotInvalidationCause cause,
if (SlotIsLogical(s))
invalidation_cause = cause;
break;
+   case RS_INVAL_INACTIVE_TIMEOUT:
+   if 
(InvalidateReplicationSlotForInactiveTimeout(s, false, false))
+   invalidation_cause = cause;
+   break;

InvalidatePossiblyObsoleteSlot() is not called with such a cause, so wouldn't it
be better to use an Assert here and in the caller too?
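
Something like (sketch only):

   /* inactive timeout based invalidation is handled by a separate code path */
   Assert(cause != RS_INVAL_INACTIVE_TIMEOUT);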

CR11 ===

+++ b/src/test/recovery/t/050_invalidate_slots.pl

Why not use 019_replslot_limit.pl instead?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



