Re: document the need to analyze partitioned tables

2023-09-29 Thread Bruce Momjian
On Sat, Sep 30, 2023 at 12:39:43AM +0200, Laurenz Albe wrote:
> On Fri, 2023-09-29 at 18:08 -0400, Bruce Momjian wrote:
> > On Wed, Sep  6, 2023 at 05:53:56AM +0200, Laurenz Albe wrote:
> > > > We may have different mental models here. This relates to the part
> > > > that I wasn't keen on in your patch, i.e:
> > > > 
> > > > +    The partitions of a partitioned table are normal tables and get 
> > > > processed
> > > > +    by autovacuum
> > > > 
> > > > While I agree that the majority of partitions are likely to be
> > > > relkind='r', which you might ordinarily consider a "normal table", you
> > > > just might change your mind when you try to INSERT or UPDATE records
> > > > that would violate the partition constraint. Some partitions might
> > > > also be themselves partitioned tables and others might be foreign
> > > > tables. That does not really matter much when it comes to what
> > > > autovacuum does or does not do, but I'm not really keen to imply in
> > > > our documents that partitions are "normal tables".
> > > 
> > > Agreed, there are differences between partitions and normal tables.
> > > And this is not the place in the documentation where we would like to
> > > get into detail about the differences.
> > > 
> > > Attached is the next version of my patch.
> > 
> > I adjusted your patch to be shorter and clearer, patch attached.  I am
> > planning to apply this back to PG 11.
> 
> Thanks for looking at this.
> 
> I am mostly fine with your version, but it does not directly state that
> autovacuum does not process partitioned tables.  I think this should be
> clarified in the beginning.

Very good point!   Updated patch attached.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.
diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
index 9cf9d030a8..be1c522575 100644
--- a/doc/src/sgml/maintenance.sgml
+++ b/doc/src/sgml/maintenance.sgml
@@ -861,10 +861,16 @@ analyze threshold = analyze base threshold + analyze scale factor * number of tu

 

-Partitioned tables are not processed by autovacuum.  Statistics
-should be collected by running a manual ANALYZE when it is
-first populated, and again whenever the distribution of data in its
-partitions changes significantly.
+Partitioned tables do not directly store tuples and consequently
+autovacuum does not VACUUM them.  (Autovacuum does
+perform VACUUM on table partitions just like other
+tables.)  Unfortunately, this also means that autovacuum doesn't
+run ANALYZE on partitioned tables, and this
+can cause suboptimal plans for queries that reference partitioned
+table statistics.  You can work around this problem by manually
+running ANALYZE on partitioned tables when they
+are first populated, and again whenever the distribution of data in
+their partitions changes significantly.
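
To illustrate the workaround this paragraph describes (a rough sketch with
invented table names, not part of the patch itself):

    -- Autovacuum will VACUUM/ANALYZE the leaf partitions, but it never
    -- ANALYZEs the partitioned parent itself.
    CREATE TABLE measurement (
        city_id  int  NOT NULL,
        logdate  date NOT NULL,
        peaktemp int
    ) PARTITION BY RANGE (logdate);

    CREATE TABLE measurement_2023 PARTITION OF measurement
        FOR VALUES FROM ('2023-01-01') TO ('2024-01-01');

    -- ... bulk load ...

    -- Collect statistics on the parent after the initial load, and again
    -- whenever the data distribution changes significantly.
    ANALYZE measurement;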

 



Re: [DOCS] HOT - correct claim about indexes not referencing old line pointers

2023-09-29 Thread Peter Geoghegan
On Fri, Sep 29, 2023 at 6:27 PM James Coleman  wrote:
> On Fri, Sep 29, 2023 at 4:06 PM Peter Geoghegan  wrote:
> > I think that it's talking about what happens during opportunistic
> > pruning, in particular what happens to HOT chains. (Though pruning
> > does almost the same amount of useful work with non-heap-only tuples,
> > so it's a bit unfortunate that the name "HOT pruning" seems to have
> > stuck.)
>
> That's very likely what the intention was. I read it again, and the
> same confusion still sticks out to me: it doesn't say anything
> explicitly about opportunistic pruning (I'm not sure if that term is
> "public docs" level, so that's probably fine), and it doesn't scope
> the claim to intermediate tuples in a HOT chain -- indeed the context
> is the HOT feature generally.

It doesn't mention opportunistic pruning by name, but it does say:

"Old versions of updated rows can be completely removed during normal
operation, including SELECTs, instead of requiring periodic vacuum
operations."

There is a strong association between HOT and pruning (particularly
opportunistic pruning) in the minds of some hackers (and perhaps some
users), because both features appeared together in 8.3, and both are
closely related at the implementation level. It's nevertheless not
quite accurate to say that HOT "provides two optimizations" -- since
pruning (the second of the two bullet points) isn't fundamentally
different for pages that don't have any HOT chains. Not at the level
of the heap pages, at least (indexes are another matter).

Explaining these sorts of distinctions through prose is very
difficult. You really need diagrams for something like this IMV.
Without that, the only way to make all of this less confusing is to
avoid all discussion of pruning...but then you can't really make the
point about breaking the dependency on VACUUM, which is a relatively
important point -- one with real practical relevance.

> This is why I discovered it: it says "indexes do not reference their
> page item identifiers", which is manifestly not true when talking
> about the root item, and in fact would defeat the whole purpose of HOT
> (at least in a old-to-new chain like Postgres uses).

Yeah, but...that's not what was intended. Obviously, the index hasn't
changed, and we expect index scans to continue to give correct
answers. So it is pretty strongly implied that it continues to point
to something valid.

> Assuming people can be convinced this is confusing (I realize you may
> not be yet), I see two basic options:
>
> 1. Update this to discuss both intermediate tuples and root items
> separately. This could entail either one larger paragraph or splitting
> such that instead of "two optimizations" we say "three" optimizations.
>
> 2. Change "old versions" to something like "intermediate versions in a
> series of updates".
>
> I prefer some form of (1) since it more fully describes the behavior,
> but we could tweak further for concision.

Bruce authored these docs. I was mostly just glad to have anything at
all about HOT in the user-facing docs, quite honestly.

-- 
Peter Geoghegan




Re: [DOCS] HOT - correct claim about indexes not referencing old line pointers

2023-09-29 Thread James Coleman
On Fri, Sep 29, 2023 at 4:06 PM Peter Geoghegan  wrote:
>
> On Fri, Sep 29, 2023 at 11:45 AM James Coleman  wrote:
> > Hmm, to my reading the issue is that "old versions" doesn't say
> > anything about "old HOT versions"; it seems to be describing what
> > happens generally when a heap-only tuple is written -- which would
> > include the first time a heap-only tuple is written.
>
> I think that it's talking about what happens during opportunistic
> pruning, in particular what happens to HOT chains. (Though pruning
> does almost the same amount of useful work with non-heap-only tuples,
> so it's a bit unfortunate that the name "HOT pruning" seems to have
> stuck.)

That's very likely what the intention was. I read it again, and the
same confusion still sticks out to me: it doesn't say anything
explicitly about opportunistic pruning (I'm not sure if that term is
"public docs" level, so that's probably fine), and it doesn't scope
the claim to intermediate tuples in a HOT chain -- indeed the context
is the HOT feature generally.

This is why I discovered it: it says "indexes do not reference their
page item identifiers", which is manifestly not true when talking
about the root item, and in fact would defeat the whole purpose of HOT
(at least in a old-to-new chain like Postgres uses).

Assuming people can be convinced this is confusing (I realize you may
not be yet), I see two basic options:

1. Update this to discuss both intermediate tuples and root items
separately. This could entail either one larger paragraph or splitting
such that instead of "two optimizations" we say "three" optimizations.

2. Change "old versions" to something like "intermediate versions in a
series of updates".

I prefer some form of (1) since it more fully describes the behavior,
but we could tweak further for concision.

> > And when it's the
> > first heap-only tuple the "old version" would be the original version,
> > which would not be a heap-only tuple.
>
> The docs say "Old versions of updated rows can be completely removed
> during normal operation". Opportunistic pruning removes dead heap-only
> tuples completely, and makes their line pointers LP_UNUSED right away.
> But it can also entail removing storage for the original root item
> heap tuple, and making its line pointer LP_REDIRECT right away (not
> LP_DEAD or LP_UNUSED) at most once in the life of each HOT chain. So
> yeah, we're not quite limited to removing storage for heap-only tuples
> when pruning a HOT chain. Does that distinction really matter, though?

Given pageinspect can show you the original tuple still exists and
that the index still references it...I think it does.

I suppose very few people go checking that out, of course, but I'd
like to be precise.
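
For example, a rough pageinspect session (invented table name; lp_flags
1 = LP_NORMAL, 2 = LP_REDIRECT) that should show the root line pointer
surviving as a redirect after the HOT chain is pruned:

    CREATE EXTENSION IF NOT EXISTS pageinspect;

    CREATE TABLE hot_demo (id int PRIMARY KEY, val int);
    INSERT INTO hot_demo VALUES (1, 0);

    -- The indexed column is unchanged, so this should be a HOT update.
    UPDATE hot_demo SET val = 1 WHERE id = 1;

    -- Prune the page (VACUUM certainly does).
    VACUUM hot_demo;

    -- Expect lp 1 (the TID the index still points at) to show lp_flags = 2
    -- (LP_REDIRECT) with lp_off holding the line pointer it redirects to,
    -- while the live heap-only tuple sits at lp 2.
    SELECT lp, lp_off, lp_flags, t_ctid
    FROM heap_page_items(get_raw_page('hot_demo', 0));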

Regards,
James Coleman




Re: Eager page freeze criteria clarification

2023-09-29 Thread Peter Geoghegan
On Fri, Sep 29, 2023 at 11:27 AM Robert Haas  wrote:
> I think that's true. For me, the issue is what a user is practically
> likely to notice and care about. I submit that on a
> not-particularly-busy system, it would probably be fine to freeze
> aggressively in almost every situation, because you're only incurring
> costs you can afford to pay. On a busy system, it's more important to
> be right, or at least not too badly wrong. But even on a busy system,
> I think that when the time between data being written and being frozen
> is more than a few tens of minutes, it's very doubtful that anyone is
> going to notice the contribution that freezing makes to the overall
> workload. They're much more likely to notice an annoying autovacuum
> than they are to notice a bit of excess freezing that ends up getting
> reversed. But when you start cranking the time between writing data
> and freezing it down into the single-digit numbers of minutes, and
> even more if you push down to tens of seconds or less, now I think
> people are going to care more about useless freezing work than about
> long-term autovacuum risks. Because now their database is really busy
> so they care a lot about performance, and seemingly most of the data
> involved is ephemeral anyway.

I think that that's all true.

As I believe I pointed out at least once in the past, my thinking
about the practical requirements from users was influenced by this
paper/talk:

https://www.microsoft.com/en-us/research/video/cost-performance-in-modern-data-stores-how-data-cashing-systems-succeed/

For the types of applications that Postgres is a natural fit for, most
of the data is cold -- very cold ("freezing" cold, even). If you have
a 50GB pgbench_accounts table, Postgres will always perform
suboptimally. But so will SQL Server, Oracle, and MySQL.

While pgbench makes a fine stress-test, for the most part its workload
is highly unrealistic. And yet we seem to think that it's just about
the most important benchmark of all. If we're not willing to get over
even small regressions in pgbench, I fear we'll never make significant
progress in this area. It's at least partly a cultural problem IMV.

The optimal freezing strategy for pgbench_accounts is to never freeze,
even once. It doesn't matter if you vary the distributions of the
accounts table updates -- it's still inevitable that each and every
row will get updated before too long. In fact, it's not just freezing
that should always be avoided here -- same applies with pruning by
VACUUM.

As I suggested in my last email, the lesson offered by the
pgbench_accounts table seems to be "never VACUUM at all, except
perhaps to advance relfrozenxid" (which shouldn't actually require
freezing even one page). If you haven't tuned heap fill factor, then
you might want to VACUUM a bit, at first. But, overall, vacuuming is
bad. That is the logical though absurd conclusion. It completely flies
in the face of practical experience.

-- 
Peter Geoghegan




Re: dikkop seems unhappy because of openssl stuff (FreeBSD 14-BETA1)

2023-09-29 Thread Tom Lane
Thomas Munro  writes:
> This came up with the CI image:
> https://www.postgresql.org/message-id/flat/20230731191510.pebqeiuo2sbmlcfh%40awork3.anarazel.de

BTW, after re-reading that thread, I think the significant
difference is that these FreeBSD images don't force you to
select a timezone during setup, unlike what I recall seeing
when installing x86_64 FreeBSD.  You're not forced to run
bsdconfig at all, and even if you do it doesn't make you
enter the sub-menu where you can pick a timezone.  I recall
that I did do that while setting mine up, but I'll bet
Tomas skipped it.  I'm not sure at this point whether FreeBSD
changed behavior since 13.x, or this is a difference between
their preferred installation processes for x86 vs. ARM.
But in any case, it's clearly easier to get into the
no-/etc/localtime state with these systems than I thought
before.

regards, tom lane




Re: dikkop seems unhappy because of openssl stuff (FreeBSD 14-BETA1)

2023-09-29 Thread Tom Lane
Thomas Munro  writes:
> Does the image lack a /etc/localtime file/link, but perhaps one of you
> did something to create it?

Hah!  I thought it had to be some sort of locale effect, but I failed
to think of that as a contributor :-(.  My installation does have
/etc/localtime, and removing it duplicates Tomas' syndrome.

I also find that if I add "-gmt 1" to the clock invocation, it's happy
with or without /etc/localtime.  So I think we should modify the test
case to use that to reduce its environmental sensitivity.  Will
go make it so.

regards, tom lane




Re: dikkop seems unhappy because of openssl stuff (FreeBSD 14-BETA1)

2023-09-29 Thread Thomas Munro
Does the image lack a /etc/localtime file/link, but perhaps one of you
did something to create it?

This came up with the CI image:
https://www.postgresql.org/message-id/flat/20230731191510.pebqeiuo2sbmlcfh%40awork3.anarazel.de
Also mentioned at: https://wiki.tcl-lang.org/page/clock+scan




Re: False "pg_serial": apparent wraparound” in logs

2023-09-29 Thread Imseih (AWS), Sami
> I don't really understand what exactly the problem is, or how this fixes
> it. But this doesn't feel right:

As the repro shows, false reports of "pg_serial": apparent wraparound
messages are possible. For a very busy system which checkpoints frequently
and makes heavy use of serializable isolation, this will flood the error logs
and falsely alarm the user. It also prevents the SLRU from being
truncated.

In my repro I end up seeing the following, even though the SLRU does not
wrap around:
"LOG: could not truncate directory "pg_serial": apparent wraparound"

> Firstly, isn't headPage == 0 also a valid value? We initialize headPage
> to -1 when it's not in use.

Yes. You are correct. This is wrong.

> Secondly, shouldn't we set it to the page corresponding to headXid
> rather than tailXid.

> Thirdly, I don't think this code should have any business setting
> latest_page_number directly. latest_page_number is set in
> SimpleLruZeroPage(). 

Correct, after checking again, I do realize the patch is wrong.

> Are we missing a call to SimpleLruZeroPage() somewhere?

That is a good point.

The initial idea was to advance the latest_page_number 
during SerialSetActiveSerXmin, but the initial approach is 
obviously wrong.

When SerialSetActiveSerXmin is called for a new active
serializable xmin, and at that point we don't need to keep any
earlier transactions, should SimpleLruZeroPage be called
to ensure there is a target page for the xid?

I tried something like below, which fixes my repro, by calling
SimpleLruZeroPage at the end of SerialSetActiveSerXmin.

@@ -953,6 +953,8 @@ SerialGetMinConflictCommitSeqNo(TransactionId xid)
 static void
 SerialSetActiveSerXmin(TransactionId xid)
 {
+   int targetPage = SerialPage(xid);
+
LWLockAcquire(SerialSLRULock, LW_EXCLUSIVE);
 
/*
@@ -992,6 +994,9 @@ SerialSetActiveSerXmin(TransactionId xid)
 
serialControl->tailXid = xid;
 
+   if (serialControl->headPage != targetPage)
+   SimpleLruZeroPage(SerialSlruCtl, targetPage);
+
LWLockRelease(SerialSLRULock);
 }

Regards,

Sami









Re: how to manage Cirrus on personal repository

2023-09-29 Thread David Steele




On 9/29/23 18:02, Thomas Munro wrote:

> On Sat, Sep 30, 2023 at 3:35 AM Nazir Bilal Yavuz  wrote:
> > On Fri, 29 Sept 2023 at 13:24, Peter Eisentraut  wrote:
> > > I have a private repository on GitHub where I park PostgreSQL
> > > development work.  I also have Cirrus activated on that repository, to
> > > build those branches, using the existing Cirrus configuration.
> > >
> > > Now under the new system of limited credits that started in September,
> > > this blew through the credits about half-way through the month.
>
> [Replying to wrong person because I never saw Peter's original email,
> something must be jammed in the intertubes...]
>
> It's annoying, but on the bright side... if you're making it halfway
> through the month, I think that means there's a chance you'd have
> enough credit if we can depessimise the known problems with TAP query
> execution[1], and there are some more obviously stupid things too
> (sleep/poll for replication progress where PostgreSQL should offer an
> event-based wait-for-replay, running all the tests when only docs
> changed, running the regression tests fewer times, ...).


I also have Cirrus set up for my cloned Postgres repo, and it is annoying
that it will run CI whenever I push up new commits that I pulled from
git.p.o.


My strategy for this (on other projects) is to require branches that 
need CI to end in -ci. Since I use multiple CI services, I further allow 
-cic (Cirrus), -cig (Github Actions), etc. PRs are also included. That 
way, nothing runs through CI unless I want it to.


Here's an example of how this works on Cirrus:

# Build the branch if it is a pull request, or ends in -ci/-cic
# (-cic targets only Cirrus CI)
only_if: $CIRRUS_PR != '' || $CIRRUS_BRANCH =~ '.*-ci$' || $CIRRUS_BRANCH =~ '.*-cic$'


Regards,
-David




Re: SHARED locks barging behaviour

2023-09-29 Thread Laurenz Albe
On Fri, 2023-09-29 at 17:45 -0400, Bruce Momjian wrote:
> On Tue, Jan 17, 2023 at 12:18:28PM -0500, Arul Ajmani wrote:
> > I'm trying to better understand the following barging behaviour with SHARED
> > locks.
> ...
> > Given there is a transaction waiting to acquire a FOR UPDATE lock, I was
> > surprised to see the second FOR SHARE transaction return immediately 
> > instead of
> > waiting. I have two questions:
> > 
> > 1) Could this barging behaviour potentially starve out the transaction 
> > waiting
> > to acquire the FOR UPDATE lock, if there is a continuous queue of 
> > transactions
> > that acquire a FOR SHARE lock briefly?
> 
> Yes, see below.
> 
> > 2) Assuming this is by design, I couldn't find (in code) where this explicit
> > policy choice is made. I was looking around LockAcquireExtended, but it 
> > seems
> > like the decision is made above this layer. Could someone more familiar with
> > this code point me at the right place? 
> 
> I know this is from January, but I do have an answer.  [...]

You answer the question of where this is implemented.  But the more important
question is whether this is intentional.  This code was added by 0ac5ad5134f
(introducing FOR KEY SHARE and FOR NO KEY UPDATE).  My feeling is that it is
not intentional that a continuous stream of share row locks can starve out an
exclusive row lock, since PostgreSQL behaves differently with other locks.

On the other hand, if nobody has complained about it in these ten years,
perhaps it is just fine the way it is, whether by design or not.

Yours,
Laurenz Albe




Re: document the need to analyze partitioned tables

2023-09-29 Thread Laurenz Albe
On Fri, 2023-09-29 at 18:08 -0400, Bruce Momjian wrote:
> On Wed, Sep  6, 2023 at 05:53:56AM +0200, Laurenz Albe wrote:
> > > We may have different mental models here. This relates to the part
> > > that I wasn't keen on in your patch, i.e:
> > > 
> > > +    The partitions of a partitioned table are normal tables and get 
> > > processed
> > > +    by autovacuum
> > > 
> > > While I agree that the majority of partitions are likely to be
> > > relkind='r', which you might ordinarily consider a "normal table", you
> > > just might change your mind when you try to INSERT or UPDATE records
> > > that would violate the partition constraint. Some partitions might
> > > also be themselves partitioned tables and others might be foreign
> > > tables. That does not really matter much when it comes to what
> > > autovacuum does or does not do, but I'm not really keen to imply in
> > > our documents that partitions are "normal tables".
> > 
> > Agreed, there are differences between partitions and normal tables.
> > And this is not the place in the documentation where we would like to
> > get into detail about the differences.
> > 
> > Attached is the next version of my patch.
> 
> I adjusted your patch to be shorter and clearer, patch attached.  I am
> planning to apply this back to PG 11.

Thanks for looking at this.

I am mostly fine with your version, but it does not directly state that
autovacuum does not process partitioned tables.  I think this should be
clarified in the beginning.

Yours,
Laurenz Albe




Re: document the need to analyze partitioned tables

2023-09-29 Thread Bruce Momjian
On Wed, Sep  6, 2023 at 05:53:56AM +0200, Laurenz Albe wrote:
> > We may have different mental models here. This relates to the part
> > that I wasn't keen on in your patch, i.e:
> > 
> > +    The partitions of a partitioned table are normal tables and get 
> > processed
> > +    by autovacuum
> > 
> > While I agree that the majority of partitions are likely to be
> > relkind='r', which you might ordinarily consider a "normal table", you
> > just might change your mind when you try to INSERT or UPDATE records
> > that would violate the partition constraint. Some partitions might
> > also be themselves partitioned tables and others might be foreign
> > tables. That does not really matter much when it comes to what
> > autovacuum does or does not do, but I'm not really keen to imply in
> > our documents that partitions are "normal tables".
> 
> Agreed, there are differences between partitions and normal tables.
> And this is not the place in the documentation where we would like to
> get into detail about the differences.
> 
> Attached is the next version of my patch.

I adjusted your patch to be shorter and clearer, patch attached.  I am
planning to apply this back to PG 11.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.
diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
index 9cf9d030a8..70b5576037 100644
--- a/doc/src/sgml/maintenance.sgml
+++ b/doc/src/sgml/maintenance.sgml
@@ -861,10 +861,16 @@ analyze threshold = analyze base threshold + analyze scale factor * number of tu

 

-Partitioned tables are not processed by autovacuum.  Statistics
-should be collected by running a manual ANALYZE when it is
-first populated, and again whenever the distribution of data in its
-partitions changes significantly.
+Partitioned tables do not directly store tuples and consequently do not
+require autovacuum to perform automated VACUUM.
+(Autovacuum does perform VACUUM on table
+partitions just like other tables.)  Unfortunately, this also means
+that autovacuum doesn't run ANALYZE on partitioned
+tables, and this can cause suboptimal plans for queries that reference
+partitioned table statistics.  You can work around this problem by
+manually running ANALYZE on partitioned tables
+when they are first populated, and again whenever the distribution
+of data in their partitions changes significantly.

 



Re: dikkop seems unhappy because of openssl stuff (FreeBSD 14-BETA1)

2023-09-29 Thread Tom Lane
Tomas Vondra  writes:
> On 9/27/23 15:38, Tom Lane wrote:
>> Tomas Vondra  writes:
>>> Hmm, yeah. Which FreeBSD image did you install? armv7 or aarch64?

>> https://download.freebsd.org/releases/arm64/aarch64/ISO-IMAGES/14.0/FreeBSD-14.0-BETA3-arm64-aarch64-RPI.img.xz

> Thanks, that's the image I've used. This is really strange ...

I've now laid my hands on a Pi 4B, and with that exact same SD card
plugged in, I get the same results I did with the 3B+: pltcl
regression tests pass, and so does the manual check with tclsh8.[67].
So it seems like the "different CPU" theory doesn't survive contact
with reality either.

I'm completely baffled, but I do notice that "clock scan" without
a -format option is deprecated according to the Tcl man page.
Maybe we should stop relying on deprecated behavior and put in
a -format option?

regards, tom lane




Re: how to manage Cirrus on personal repository

2023-09-29 Thread Thomas Munro
On Sat, Sep 30, 2023 at 3:35 AM Nazir Bilal Yavuz  wrote:
> On Fri, 29 Sept 2023 at 13:24, Peter Eisentraut  wrote:
> > I have a private repository on GitHub where I park PostgreSQL
> > development work.  I also have Cirrus activated on that repository, to
> > build those branches, using the existing Cirrus configuration.
> >
> > Now under the new system of limited credits that started in September,
> > this blew through the credits about half-way through the month.

[Replying to wrong person because I never saw Peter's original email,
something must be jammed in the intertubes...]

It's annoying, but on the bright side... if you're making it halfway
through the month, I think that means there's a chance you'd have
enough credit if we can depessimise the known problems with TAP query
execution[1], and there are some more obviously stupid things too
(sleep/poll for replication progress where PostgreSQL should offer an
event-based wait-for-replay, running all the tests when only docs
changed, running the regression tests fewer times, ...).

[1] 
https://www.postgresql.org/message-id/flat/CA%2BhUKGJoEO33K%3DZynsH%3DxkiEyfBMZjOoqBK%2BgouBdTGW2-woig%40mail.gmail.com




Re: Eager page freeze criteria clarification

2023-09-29 Thread Peter Geoghegan
On Fri, Sep 29, 2023 at 11:27 AM Robert Haas  wrote:
> > Even if you're willing to assume that vacuum_freeze_min_age isn't just
> > an arbitrary threshold, this still seems wrong. vacuum_freeze_min_age
> > is applied by VACUUM, at the point that it scans pages. If VACUUM were
> > infinitely fast, and new VACUUMs were launched constantly, then
> > vacuum_freeze_min_age (and this bucketing scheme) might make more
> > sense. But, you know, they're not. So whether or not VACUUM (with
> > Andres' algorithm) deems a page that it has frozen to have been
> > opportunistically frozen or not is greatly influenced by factors that
> > couldn't possibly be relevant.
>
> I'm not totally sure that I'm understanding what you're concerned
> about here, but I *think* that the issue you're worried about here is:
> if we have various rules that can cause freezing, let's say X Y and Z,
> and we adjust the aggressiveness of rule X based on the performance of
> rule Y, that would be stupid and might suck.

Your summary is pretty close. There are a couple of specific nuances
to it, though:

1. Anything that uses XID age or even LSN age necessarily depends on
when VACUUM shows up, which itself depends on many other random
things.

With small to medium sized tables that don't really grow, it's perhaps
reasonable to expect this to not matter. But with tables like the
TPC-C order/order lines table, or even pgbench_history, the next
VACUUM operation will reliably be significantly longer and more
expensive than the last one, forever (ignoring the influence of
aggressive mode, and assuming typical autovacuum settings). So VACUUMs
get bigger and less frequent as the table grows.

As the table continues to grow, at some point we reach a stage where
many XIDs encountered by VACUUM will be significantly older than
vacuum_freeze_min_age, while others will be significantly younger. And
so whether we apply the vacuum_freeze_min_age rule (or some other age
based rule) is increasingly a matter of random happenstance (i.e. is
more and more due to when VACUUM happens to show up), and has less and
less to do with what the workload signals we should do. This is a
moving target, but (if I'm not mistaken) under the scheme described by
Andres we're not even trying to compensate for that.

Separately, I have a practical concern:

2. It'll be very hard to independently track the effectiveness of
rules X, Y, and Z as a practical matter, because the application of
each rule quite naturally influences the application of every other
rule over time. They simply aren't independent things in any practical
sense.

Even if this wasn't an issue, I can't think of a reasonable cost
model. Is it good or bad if "opportunistic freezing" results in
unfreezing 50% of the time? AFAICT that's an *extremely* complicated
question. You cannot just interpolate from the 0% case (definitely
good) and the 100% case (definitely bad) and expect to get a sensible
answer. You can't split the difference -- even if we allow ourselves
to ignore tricky value judgement type questions.

> Assuming that the previous sentence is a correct framing, let's take X
> to be "freezing based on the page LSN age" and Y to be "freezing based
> on vacuum_freeze_min_age". I think the problem scenario here would be
> if it turns out that, under some set of circumstances, Y freezes more
> aggressively than X. For example, suppose the user runs VACUUM FREEZE,
> effectively setting vacuum_freeze_min_age=0 for that operation. If the
> table is being modified at all, it's likely to suffer a bunch of
> unfreezing right afterward, which could cause us to decide to make
> future vacuums freeze less aggressively. That's not necessarily what
> we want, because evidently the user, at least at that moment in time,
> thought that previous freezing hadn't been aggressive enough. They
> might be surprised to find that flash-freezing the table inhibited
> future automatic freezing.

I didn't think of that one myself, but it's a great example.

> Or suppose that they just have a very high XID consumption rate
> compared to the rate of modifications to this particular table, such
> that criteria related to vacuum_freeze_min_age tend to be satisfied a
> lot, and thus vacuums tend to freeze a lot no matter what the page LSN
> age is. This scenario actually doesn't seem like a problem, though. In
> this case the freezing criterion based on page LSN age is already not
> getting used, so it doesn't really matter whether we tune it up or
> down or whatever.

It would have to be a smaller table, which I'm relatively unconcerned about.

> > Okay then. I guess it's more accurate to say that we'll have a strong
> > bias in the direction of freezing when an FPI won't result, though not
> > an infinitely strong bias. We'll at least have something that can be
> > thought of as an improved version of the FPI thing for 17, I think --
> > which is definitely significant progress.
>
> I do kind of wonder whether we're going to care about 

Re: SHARED locks barging behaviour

2023-09-29 Thread Bruce Momjian
On Tue, Jan 17, 2023 at 12:18:28PM -0500, Arul Ajmani wrote:
> I'm trying to better understand the following barging behaviour with SHARED
> locks.
...
> Given there is a transaction waiting to acquire a FOR UPDATE lock, I was
> surprised to see the second FOR SHARE transaction return immediately instead 
> of
> waiting. I have two questions:
> 
> 1) Could this barging behaviour potentially starve out the transaction waiting
> to acquire the FOR UPDATE lock, if there is a continuous queue of transactions
> that acquire a FOR SHARE lock briefly?

Yes, see below.

> 2) Assuming this is by design, I couldn't find (in code) where this explicit
> policy choice is made. I was looking around LockAcquireExtended, but it seems
> like the decision is made above this layer. Could someone more familiar with
> this code point me at the right place? 

I know this is from January, but I do have an answer.  First, looking at
parser/gram.y, I see:

| FOR SHARE { $$ = LCS_FORSHARE; }

Looking for LCS_FORSHARE, I see in optimizer/plan/planner.c:

case LCS_FORSHARE:
return ROW_MARK_SHARE;

Looking for ROW_MARK_SHARE, I see in executor/nodeLockRows.c:

case ROW_MARK_SHARE:
lockmode = LockTupleShare;

Looking for LockTupleShare, I see in access/heap/heapam.c:

else if (mode == LockTupleShare)
{
/*
 * If we're requesting Share, we can similarly avoid sleeping if
 * there's no update and no exclusive lock present.
 */
if (HEAP_XMAX_IS_LOCKED_ONLY(infomask) &&
!HEAP_XMAX_IS_EXCL_LOCKED(infomask))
{
LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);

/*
 * Make sure it's still an appropriate lock, else start over.
 * See above about allowing xmax to change.
 */
if (!HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_data->t_infomask) ||
HEAP_XMAX_IS_EXCL_LOCKED(tuple->t_data->t_infomask))
goto l3;
require_sleep = false;
}
}

and this is basically saying that if the row is locked
(HEAP_XMAX_IS_LOCKED_ONLY), but not exclusively locked
(!HEAP_XMAX_IS_EXCL_LOCKED), then there is no need to sleep waiting for
the lock.
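
To make the behaviour concrete, here is a rough three-session sketch
(invented table; not from the original report):

    -- Session 1: take and hold a share lock on the row.
    BEGIN;
    SELECT * FROM accounts WHERE id = 1 FOR SHARE;

    -- Session 2: queues up behind session 1.
    BEGIN;
    SELECT * FROM accounts WHERE id = 1 FOR UPDATE;   -- blocks

    -- Session 3: returns immediately, barging past the waiting FOR UPDATE,
    -- because the row is locked but not exclusively locked (the
    -- HEAP_XMAX_IS_LOCKED_ONLY / !HEAP_XMAX_IS_EXCL_LOCKED case above).
    BEGIN;
    SELECT * FROM accounts WHERE id = 1 FOR SHARE;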

I hope that helps.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: Allow deleting enumerated values from an existing enumerated data type

2023-09-29 Thread Andrew Dunstan



On 2023-09-28 Th 14:46, Tom Lane wrote:
> Andrew Dunstan  writes:
> > I wonder if we could have a boolean flag in pg_enum, indicating that
> > setting an enum to that value was forbidden.
>
> Yeah, but that still offers no coherent solution to the problem of
> what happens if there's a table that already contains such a value.
> It doesn't seem terribly useful to forbid new entries if you can't
> get rid of old ones.
>
> Admittedly, a DISABLE flag would at least offer a chance at a
> race-condition-free scan to verify that no such values remain
> in tables.  But as somebody already mentioned upthread, that
> wouldn't guarantee that the value doesn't appear in non-leaf
> index pages.  So basically you could never get rid of the pg_enum
> row, short of a full dump and restore.



or a reindex, I think, although getting the timing right would be messy. 
I agree the non-leaf index pages are rather pesky in dealing with this.


I guess the alternative would be to create a new enum with the 
to-be-deleted value missing, and then alter the column type to the new 
enum type. For massive tables that would be painful.
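
A rough sketch of that alternative (type, value, and table names invented;
locking and the rewrite cost on big tables ignored):

    -- Drop the value 'retired' from enum type "status" used by orders.state.
    CREATE TYPE status_new AS ENUM ('active', 'on_hold');  -- 'retired' omitted

    -- Rewrites the table; any row still holding 'retired' makes the cast fail.
    ALTER TABLE orders
        ALTER COLUMN state TYPE status_new
        USING state::text::status_new;

    DROP TYPE status;
    ALTER TYPE status_new RENAME TO status;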





> We went through all these points years ago when the enum feature
> was first developed, as I recall.  Nobody thought that the ability
> to remove an enum value was worth the amount of complexity it'd
> entail.

That's quite true, and I accept my part in this history. But I'm not
sure we were correct back then.



cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: [DOCS] HOT - correct claim about indexes not referencing old line pointers

2023-09-29 Thread Peter Geoghegan
On Fri, Sep 29, 2023 at 11:45 AM James Coleman  wrote:
> Hmm, to my reading the issue is that "old versions" doesn't say
> anything about "old HOT versions"; it seems to be describing what
> happens generally when a heap-only tuple is written -- which would
> include the first time a heap-only tuple is written.

I think that it's talking about what happens during opportunistic
pruning, in particular what happens to HOT chains. (Though pruning
does almost the same amount of useful work with non-heap-only tuples,
so it's a bit unfortunate that the name "HOT pruning" seems to have
stuck.)

> And when it's the
> first heap-only tuple the "old version" would be the original version,
> which would not be a heap-only tuple.

The docs say "Old versions of updated rows can be completely removed
during normal operation". Opportunistic pruning removes dead heap-only
tuples completely, and makes their line pointers LP_UNUSED right away.
But it can also entail removing storage for the original root item
heap tuple, and making its line pointer LP_REDIRECT right away (not
LP_DEAD or LP_UNUSED) at most once in the life of each HOT chain. So
yeah, we're not quite limited to removing storage for heap-only tuples
when pruning a HOT chain. Does that distinction really matter, though?

There isn't even any special case handling for it in pruneheap.c (we
only have assertions that make sure that we're performing "valid
transitions" for each tuple/line pointer). That is, we don't really
care about the difference between calling ItemIdSetRedirect() for an
LP_NORMAL item versus an existing LP_REDIRECT item at the code level
(we just do it and let PageRepairFragmentation() clean things up).

--
Peter Geoghegan




Re: [DOCS] HOT - correct claim about indexes not referencing old line pointers

2023-09-29 Thread James Coleman
On Fri, Sep 29, 2023 at 2:39 PM Peter Geoghegan  wrote:
>
> On Fri, Sep 29, 2023 at 11:04 AM Peter Geoghegan  wrote:
> > > But when a HOT update happens the entry in an (logically unchanged)
> > > index still points to the original heap tid, and that line item is
> > > updated with a pointer to the new line pointer in the same page.
> >
> > It's true that the original root heap tuple (which is never a
> > heap-only tuple) must have its line pointer changed from LP_NORMAL to
> > LP_REDIRECT the first time pruning takes place that affects its HOT
> > chain. But I don't think that referring to the root item as something
> > along the lines of "an obsolescent/old tuple's line pointer" is
> > particularly helpful.
>
> To be clear, the existing wording seems correct to me. Even heap-only
> tuples require line pointers. These line pointers are strictly
> guaranteed to never be *directly* referenced from indexes (if they are
> then we'll actually detect it and report data corruption on recent
> versions). The name "heap-only tuple" quite literally means that the
> tuple and its line pointer are only represented in the heap, and never
> in indexes.

Hmm, to my reading the issue is that "old versions" doesn't say
anything about "old HOT versions; it seems to be describing what
happens generally when a heap-only tuple is written -- which would
include the first time a heap-only tuple is written. And when it's the
first heap-only tuple the "old version" would be the original version,
which would not be a heap-only tuple.

I can work up a tweaked version of the patch that shows there are two
paths here (original tuple is being updated versus an intermediate
heap-only tuple is being updated); would you be willing to consider
that?

Thanks,
James Coleman




Re: [DOCS] HOT - correct claim about indexes not referencing old line pointers

2023-09-29 Thread Peter Geoghegan
On Fri, Sep 29, 2023 at 11:04 AM Peter Geoghegan  wrote:
> > But when a HOT update happens the entry in an (logically unchanged)
> > index still points to the original heap tid, and that line item is
> > updated with a pointer to the new line pointer in the same page.
>
> It's true that the original root heap tuple (which is never a
> heap-only tuple) must have its line pointer changed from LP_NORMAL to
> LP_REDIRECT the first time pruning takes place that affects its HOT
> chain. But I don't think that referring to the root item as something
> along the lines of "an obsolescent/old tuple's line pointer" is
> particularly helpful.

To be clear, the existing wording seems correct to me. Even heap-only
tuples require line pointers. These line pointers are strictly
guaranteed to never be *directly* referenced from indexes (if they are
then we'll actually detect it and report data corruption on recent
versions). The name "heap-only tuple" quite literally means that the
tuple and its line pointer are only represented in the heap, and never
in indexes.

There is a related rule about what is allowed to happen to any
heap-only tuple's line pointer: it can only change from LP_NORMAL to
LP_UNUSED (never LP_DEAD or LP_REDIRECT). You can think of a heap-only
tuple as "skipping the LP_DEAD step" that regular heap tuples must go
through. We don't need LP_DEAD tombstones precisely because there
cannot possibly be any references to heap-only tuples in indexes -- so
we can't break index scans by going straight from LP_NORMAL to
LP_UNUSED for heap-only tuple line pointers.

-- 
Peter Geoghegan




Re: Fix incorrect comment reference

2023-09-29 Thread James Coleman
On Fri, Sep 29, 2023 at 2:26 PM Bruce Momjian  wrote:
>
> On Mon, Jan 23, 2023 at 06:42:45PM -0500, James Coleman wrote:
> > On Mon, Jan 23, 2023 at 4:07 PM James Coleman  wrote:
> > >
> > > On Mon, Jan 23, 2023 at 3:41 PM Robert Haas  wrote:
> > > >
> > > > On Mon, Jan 23, 2023 at 3:19 PM James Coleman  wrote:
> > > > > On Mon, Jan 23, 2023 at 1:26 PM Robert Haas  
> > > > > wrote:
> > > > > > On Mon, Jan 23, 2023 at 8:31 AM James Coleman  
> > > > > > wrote:
> > > > > > > See the attached for a simple comment fix -- the referenced
> > > > > > > generate_useful_gather_paths call isn't in grouping_planner it's 
> > > > > > > in
> > > > > > > apply_scanjoin_target_to_paths.
> > > > > >
> > > > > > The intended reading of the comment is not clear. Is it telling you 
> > > > > > to
> > > > > > look at grouping_planner because that's where we
> > > > > > generate_useful_gather_paths, or is it telling you to look there to
> > > > > > see how we get the final target list together? If it's the former,
> > > > > > then your fix is correct. If the latter, it's fine as it is.
> > > > > >
> > > > > > The real answer is probably that some years ago both things happened
> > > > > > in that function. We've moved on from there, but I'm still not sure
> > > > > > what the most useful phrasing of the comment is.
> > > > >
> > > > > Yeah, almost certainly, and the comments just didn't keep up.
> > > > >
> > > > > Would you prefer something that notes both that the broader concern is
> > > > > happening via the grouping_planner() stage but still points to the
> > > > > proper callsite (so that people don't go looking for that confused)?
> > > >
> > > > I don't really have a strong view on what the best thing to do is. I
> > > > was just pointing out that the comment might not be quite so obviously
> > > > wrong as you were supposing.
> > >
> > > "Wrong" is certainly too strong; my apologies.
> > >
> > > I'm really just hoping to improve it for future readers to save them
> > > some confusion I had initially reading it.
> >
> > Updated patch attached.
>
> Patch applied.

Thanks!

Regards,
James Coleman




Re: Eager page freeze criteria clarification

2023-09-29 Thread Robert Haas
On Fri, Sep 29, 2023 at 11:57 AM Peter Geoghegan  wrote:
> Assuming your concern is more or less limited to those cases where the
> same page could be frozen an unbounded number of times (once or almost
> once per VACUUM), then I think we fully agree. We ought to converge on
> the right behavior over time, but it's particularly important that we
> never converge on the wrong behavior instead.

I think that more or less matches my current thinking on the subject.
A caveat might be: If it were once per two vacuums rather than once
per vacuum, that might still be an issue. But I agree with the idea
that the case that matters is *repeated* wasteful freezing. I don't
think freezing is expensive enough that individual instances of
mistaken freezing are worth getting too stressed about, but as you
say, the overall pattern does matter.

> The TPC-C scenario is partly interesting because it isn't actually
> obvious what the most desirable behavior is, even assuming that you
> had perfect information, and were not subject to practical
> considerations about the complexity of your algorithm. There doesn't
> seem to be perfect clarity on what the goal should actually be in such
> scenarios -- it's not like the problem is just that we can't agree on
> the best way to accomplish those goals with this specific workload.
>
> If performance/efficiency and performance stability are directly in
> tension (as they sometimes are), how much do you want to prioritize
> one or the other? It's not an easy question to answer. It's a value
> judgement as much as anything else.

I think that's true. For me, the issue is what a user is practically
likely to notice and care about. I submit that on a
not-particularly-busy system, it would probably be fine to freeze
aggressively in almost every situation, because you're only incurring
costs you can afford to pay. On a busy system, it's more important to
be right, or at least not too badly wrong. But even on a busy system,
I think that when the time between data being written and being frozen
is more than a few tens of minutes, it's very doubtful that anyone is
going to notice the contribution that freezing makes to the overall
workload. They're much more likely to notice an annoying autovacuum
than they are to notice a bit of excess freezing that ends up getting
reversed. But when you start cranking the time between writing data
and freezing it down into the single-digit numbers of minutes, and
even more if you push down to tens of seconds or less, now I think
people are going to care more about useless freezing work than about
long-term autovacuum risks. Because now their database is really busy
so they care a lot about performance, and seemingly most of the data
involved is ephemeral anyway.

> Even if you're willing to assume that vacuum_freeze_min_age isn't just
> an arbitrary threshold, this still seems wrong. vacuum_freeze_min_age
> is applied by VACUUM, at the point that it scans pages. If VACUUM were
> infinitely fast, and new VACUUMs were launched constantly, then
> vacuum_freeze_min_age (and this bucketing scheme) might make more
> sense. But, you know, they're not. So whether or not VACUUM (with
> Andres' algorithm) deems a page that it has frozen to have been
> opportunistically frozen or not is greatly influenced by factors that
> couldn't possibly be relevant.

I'm not totally sure that I'm understanding what you're concerned
about here, but I *think* that the issue you're worried about here is:
if we have various rules that can cause freezing, let's say X Y and Z,
and we adjust the aggressiveness of rule X based on the performance of
rule Y, that would be stupid and might suck.

Assuming that the previous sentence is a correct framing, let's take X
to be "freezing based on the page LSN age" and Y to be "freezing based
on vacuum_freeze_min_age". I think the problem scenario here would be
if it turns out that, under some set of circumstances, Y freezes more
aggressively than X. For example, suppose the user runs VACUUM FREEZE,
effectively setting vacuum_freeze_min_age=0 for that operation. If the
table is being modified at all, it's likely to suffer a bunch of
unfreezing right afterward, which could cause us to decide to make
future vacuums freeze less aggressively. That's not necessarily what
we want, because evidently the user, at least at that moment in time,
thought that previous freezing hadn't been aggressive enough. They
might be surprised to find that flash-freezing the table inhibited
future automatic freezing.

Or suppose that they just have a very high XID consumption rate
compared to the rate of modifications to this particular table, such
that criteria related to vacuum_freeze_min_age tend to be satisfied a
lot, and thus vacuums tend to freeze a lot no matter what the page LSN
age is. This scenario actually doesn't seem like a problem, though. In
this case the freezing criterion based on page LSN age is already not
getting used, so it doesn't really matter 

Re: Fix incorrect comment reference

2023-09-29 Thread Bruce Momjian
On Mon, Jan 23, 2023 at 06:42:45PM -0500, James Coleman wrote:
> On Mon, Jan 23, 2023 at 4:07 PM James Coleman  wrote:
> >
> > On Mon, Jan 23, 2023 at 3:41 PM Robert Haas  wrote:
> > >
> > > On Mon, Jan 23, 2023 at 3:19 PM James Coleman  wrote:
> > > > On Mon, Jan 23, 2023 at 1:26 PM Robert Haas  
> > > > wrote:
> > > > > On Mon, Jan 23, 2023 at 8:31 AM James Coleman  
> > > > > wrote:
> > > > > > See the attached for a simple comment fix -- the referenced
> > > > > > generate_useful_gather_paths call isn't in grouping_planner it's in
> > > > > > apply_scanjoin_target_to_paths.
> > > > >
> > > > > The intended reading of the comment is not clear. Is it telling you to
> > > > > look at grouping_planner because that's where we
> > > > > generate_useful_gather_paths, or is it telling you to look there to
> > > > > see how we get the final target list together? If it's the former,
> > > > > then your fix is correct. If the latter, it's fine as it is.
> > > > >
> > > > > The real answer is probably that some years ago both things happened
> > > > > in that function. We've moved on from there, but I'm still not sure
> > > > > what the most useful phrasing of the comment is.
> > > >
> > > > Yeah, almost certainly, and the comments just didn't keep up.
> > > >
> > > > Would you prefer something that notes both that the broader concern is
> > > > happening via the grouping_planner() stage but still points to the
> > > > proper callsite (so that people don't go looking for that confused)?
> > >
> > > I don't really have a strong view on what the best thing to do is. I
> > > was just pointing out that the comment might not be quite so obviously
> > > wrong as you were supposing.
> >
> > "Wrong" is certainly too strong; my apologies.
> >
> > I'm really just hoping to improve it for future readers to save them
> > some confusion I had initially reading it.
> 
> Updated patch attached.

Patch applied.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: CREATE EXTENSION forces an library initialization - is it bug?

2023-09-29 Thread Pavel Stehule
On Fri, Sep 29, 2023 at 20:14, Tom Lane  wrote:

> Pavel Stehule  writes:
> > I had to fix plpgsql_check issue
> > https://github.com/okbob/plpgsql_check/issues/155
>
> > The problem is in execution of _PG_init() in CREATE EXTENSION time.
>
> > It is a problem for any extension that uses plpgsql debug API, because it
> > is quietly activated.
>
> > Is it necessary?
>
> Yes, I think so.  If the extension has any C functions, then when its
> script executes those CREATE FUNCTION commands then the underlying
> library will be loaded (so we can check that the library is loadable
> and the functions really exist).  That's always happened and I do not
> think it is negotiable.
>

ok

thank you for info



>
> regards, tom lane
>


Re: CREATE EXTENSION forces an library initialization - is it bug?

2023-09-29 Thread Tom Lane
Pavel Stehule  writes:
> I had to fix plpgsql_check issue
> https://github.com/okbob/plpgsql_check/issues/155

> The problem is in execution of _PG_init() in CREATE EXTENSION time.

> It is a problem for any extension that uses plpgsql debug API, because it
> is quietly activated.

> Is it necessary?

Yes, I think so.  If the extension has any C functions, then when its
script executes those CREATE FUNCTION commands then the underlying
library will be loaded (so we can check that the library is loadable
and the functions really exist).  That's always happened and I do not
think it is negotiable.
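
For instance (a generic sketch, not plpgsql_check's actual script), as soon
as CREATE EXTENSION executes a line like this from the extension's SQL
script, the library is loaded so the symbol can be verified, and its
_PG_init() runs at that moment:

    CREATE FUNCTION my_hook_func(regprocedure) RETURNS void
        AS 'MODULE_PATHNAME', 'my_hook_func'
        LANGUAGE C STRICT;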

regards, tom lane




CREATE EXTENSION forces an library initialization - is it bug?

2023-09-29 Thread Pavel Stehule
Hi

I had to fix plpgsql_check issue
https://github.com/okbob/plpgsql_check/issues/155

The problem is in execution of _PG_init() in CREATE EXTENSION time.

It is a problem for any extension that uses plpgsql debug API, because it
is quietly activated.

Is it necessary?

Regards

Pavel


Re: [DOCS] HOT - correct claim about indexes not referencing old line pointers

2023-09-29 Thread Peter Geoghegan
On Fri, Sep 29, 2023 at 10:39 AM James Coleman  wrote:
> > Old versions of updated rows can be completely removed during normal
> > operation, including SELECTs, instead of requiring periodic vacuum
> > operations. (This is possible because indexes do not reference their page
> > item identifiers.)
>
> But when a HOT update happens the entry in an (logically unchanged)
> index still points to the original heap tid, and that line item is
> updated with a pointer to the new line pointer in the same page.

It's true that the original root heap tuple (which is never a
heap-only tuple) must have its line pointer changed from LP_NORMAL to
LP_REDIRECT the first time pruning takes place that affects its HOT
chain. But I don't think that referring to the root item as something
along the lines of "an obsolescent/old tuple's line pointer" is
particularly helpful. Changing from LP_NORMAL to LP_REDIRECT during
the initial prune isn't terribly different from changing an existing
LP_REDIRECT's redirect-TID link during every subsequent prune. In both
cases you're just updating where the first heap-only tuple begins.

The really important point is that the TID (which maps to the root
item of the HOT chain) has a decent chance of being stable over time,
no matter how many versions the HOT chain churns through. And that
that can break (or at least weaken) our dependence on VACUUM with some
workloads.

-- 
Peter Geoghegan




Re: [DOCS] HOT - correct claim about indexes not referencing old line pointers

2023-09-29 Thread David G. Johnston
On Fri, Sep 29, 2023 at 10:45 AM James Coleman  wrote:

> Hello,
>
> While working on my talk for PGConf.NYC next week I came across this
> bullet in the docs on heap only tuples:
>
> > Old versions of updated rows can be completely removed during normal
> > operation, including SELECTs, instead of requiring periodic vacuum
> > operations. (This is possible because indexes do not reference their page
> > item identifiers.)
>
> But when a HOT update happens the entry in an (logically unchanged)
> index still points to the original heap tid, and that line item is
> updated with a pointer to the new line pointer in the same page.
>
> Assuming I'm understanding this correctly, attached is a patch
> correcting the description.
>
>
I think we want to somehow distinguish between the old tuple that is the
root of the chain and old tuples that are not.  This comment refers to
pruning the chain and removing intermediate links in the chain that are no
longer relevant because the root has been updated to point to the live
tuple.  In README.HOT, tuple 2 in the example after 1 points to 3.

https://github.com/postgres/postgres/blob/ec99d6e9c87a8ff0f4805cc0c6c12cbb89c48e06/src/backend/access/heap/README.HOT#L109

David J.


[DOCS] HOT - correct claim about indexes not referencing old line pointers

2023-09-29 Thread James Coleman
Hello,

While working on my talk for PGConf.NYC next week I came across this
bullet in the docs on heap only tuples:

> Old versions of updated rows can be completely removed during normal
> operation, including SELECTs, instead of requiring periodic vacuum
> operations. (This is possible because indexes do not reference their page
> item identifiers.)

But when a HOT update happens the entry in an (logically unchanged)
index still points to the original heap tid, and that line item is
updated with a pointer to the new line pointer in the same page.

Assuming I'm understanding this correctly, attached is a patch
correcting the description.

Thanks,
James Coleman


v1-0001-Correct-HOT-docs-to-account-for-LP_REDIRECT.patch
Description: Binary data


Re: POC, WIP: OR-clause support for indexes

2023-09-29 Thread a.rybakina
I'm sorry I didn't write for a long time, but I really had a very
difficult month; now I'm fully back to work.


*I was able to implement the patches to the end and moved the
transformation of "OR" expressions to ANY.* I haven't seen a big
difference between them yet: one does the transformation before
calculating selectivity (v7.1-Replace-OR-clause-to-ANY.patch), the
other after (v7.2-Replace-OR-clause-to-ANY.patch). Regression tests
are passing, I don't see any problems with selectivity, and nothing has
fallen into a coredump, but I found some incorrect transformations.
I have not found the reason for these inaccuracies, but, to be honest,
they look unusual. The error is given below.


In the patch, I don't like that I had to drag in three libraries from
the parser until I found a way around it.  The advantage of this approach
compared to the other one ([1]) is that at this stage all possible OR
transformations are performed, unlike the earlier patch, where the
transformation was done at the parsing stage.  For example, the
transformation now includes optimizations such as this:



I took the common element out of the brackets and converted the rest to
ANY, so that, as Peter Geoghegan noted, we do not get several bitmap
scans but only one scan through the array.


postgres=# explain analyze SELECT p1.oid, p1.proname
FROM pg_proc as p1
WHERE prolang = 13 AND prolang=1 OR prolang = 13 AND prolang = 2 OR 
prolang = 13 AND prolang = 3;

  QUERY PLAN
---
 Seq Scan on pg_proc p1  (cost=0.00..151.66 rows=1 width=68) (actual 
time=1.167..1.168 rows=0 loops=1)
   Filter: ((prolang = '13'::oid) AND (prolang = ANY (ARRAY['1'::oid, 
'2'::oid, '3'::oid])))

   Rows Removed by Filter: 3302
 Planning Time: 0.146 ms
 Execution Time: 1.191 ms
(5 rows)
*This one crashes with a core dump for me:*
explain analyze SELECT p1.oid, p1.proname
FROM pg_proc as p1
WHERE prolang = 13 OR prolang = 2 AND prolang = 2 OR prolang = 13;

I am continuing to try to move the transformation of "OR" expressions to 
the optimization stage. Unfortunately, I have not been able to figure out 
the core dump yet, but I noticed an important thing: at this point it is 
already necessary to process RestrictInfo expressions. I corrected that.


To be honest, despite the significant advantage that we are already 
processing pre-converted "OR" expressions (the logical transformations 
have been performed and duplicates have been removed), I have big doubts 
about this approach. At this stage we already have quite a lot of objects 
that can refer to the RestrictInfo in RelOptInfo, and updating those 
references could be costly for us. In fact, right now I suspect that the 
current core dump appears precisely because there is a reference somewhere 
to a RestrictInfo that was not updated, but so far I can't find the place. 
The core dump occurs at query execution time and looks like this:


Core was generated by `postgres: alena regression [local] 
SELECT '.

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x5565f3ec4947 in ExecInitExprRec (node=0x5565f530b290, 
state=0x5565f53383d8, resv=0x5565f53383e0, resnull=0x5565f53383dd) at 
execExpr.c:1331
1331    Expr   *arg = (Expr 
*) lfirst(lc);

(gdb) bt
#0  0x5565f3ec4947 in ExecInitExprRec (node=0x5565f530b290, 
state=0x5565f53383d8, resv=0x5565f53383e0, resnull=0x5565f53383dd) at 
execExpr.c:1331
#1  0x5565f3ec2708 in ExecInitQual (qual=0x5565f531d950, 
parent=0x5565f5337948) at execExpr.c:258
#2  0x5565f3f2f080 in ExecInitSeqScan (node=0x5565f5309700, 
estate=0x5565f5337700, eflags=32) at nodeSeqscan.c:172
#3  0x5565f3ee70c9 in ExecInitNode (node=0x5565f5309700, 
estate=0x5565f5337700, eflags=32) at execProcnode.c:210
#4  0x5565f3edbe3a in InitPlan (queryDesc=0x5565f53372f0, eflags=32) 
at execMain.c:968
#5  0x5565f3edabe3 in standard_ExecutorStart 
(queryDesc=0x5565f53372f0, eflags=32) at execMain.c:266
#6  0x5565f3eda927 in ExecutorStart (queryDesc=0x5565f53372f0, 
eflags=0) at execMain.c:145
#7  0x5565f419921e in PortalStart (portal=0x5565f52ace90, 
params=0x0, eflags=0, snapshot=0x0) at pquery.c:517

#8  0x5565f4192635 in exec_simple_query (
    query_string=0x5565f5233af0 "SELECT p1.oid, p1.proname\nFROM 
pg_proc as p1\nWHERE prolang = 13 AND (probin IS NULL OR probin = '' OR 
probin = '-');") at postgres.c:1233
#9  0x5565f41976ef in PostgresMain (dbname=0x5565f526ad10 
"regression", username=0x5565f526acf8 "alena") at postgres.c:4652
#10 0x5565f40b8417 in BackendRun (port=0x5565f525f830) at 
postmaster.c:4439
#11 0x5565f40b7ca3 in BackendStartup (port=0x5565f525f830) at 
postmaster.c:4167

#12 0x5565f40b40f1 in ServerLoop () at postmaster.c:1781
#13 

Re: [HACKERS] pg_upgrade failed with error - ERROR: column "a" in child table must be marked NOT NULL

2023-09-29 Thread Justin Pryzby
On Fri, Sep 29, 2023 at 09:16:35AM +0900, Michael Paquier wrote:
> You mean when upgrading from an instance of 9.6 or older as c30f177 is
> not there, right?

No - while upgrading from v15 to v16.  I'm not clear on how we upgraded
*to* v15 without hitting the issue, nor how the "not null" got
dropped...

> Anyway, it seems like the patch from [1] has no need to run this check
> when the old cluster's version is 10 or newer.  And perhaps it should
> mention that this check could be removed from pg_upgrade once v10
> support is out of scope, in the shape of a comment.

You're still thinking of PRIMARY KEY as the only way to hit this, right?
But Ali Akbar already pointed out how to reproduce the problem with DROP
NOT NULL - which still applies to both v16 and v17.
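
For reference, the scenario is roughly this (a sketch, not Ali Akbar's exact
script): the NOT NULL that a child acquires from its parent's PRIMARY KEY can
be dropped on the child alone, and pg_upgrade's schema restore then fails when
it re-attaches the child to the parent:

CREATE TABLE parent (a int PRIMARY KEY);
CREATE TABLE child () INHERITS (parent);
ALTER TABLE child ALTER COLUMN a DROP NOT NULL;
-- restoring the binary-upgrade dump then stops with
-- ERROR:  column "a" in child table must be marked NOT NULL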




Re: Annoying build warnings from latest Apple toolchain

2023-09-29 Thread Tom Lane
I wrote:
> Andres Freund  writes:
>> On 2023-09-28 16:46:08 -0400, Tom Lane wrote:
>>> Well, it's only important on platforms where we can't restrict
>>> libpq.so from exporting all symbols.  I don't know how close we are
>>> to deciding that such cases are no longer interesting to worry about.
>>> Makefile.shlib seems to know how to do it everywhere except Windows,
>>> and I imagine we know how to do it over in the MSVC scripts.

>> Hm, then I'd argue that we don't need to care about it anymore. The meson
>> build does the necessary magic on windows, as do the current msvc scripts.

> If we take that argument seriously, then I'm inclined to adjust my
> upthread patch for Makefile.global.in so that it removes the extra
> inclusions of libpgport/libpgcommon everywhere, not only macOS.
> The rationale would be that it's not worth worrying about ABI
> stability details on any straggler platforms.

Looking closer, it's only since v16 that we have export list support
on all officially-supported platforms.  Therefore, I think the prudent
thing to do in the back branches is use the patch I posted before,
to suppress the duplicate -l switches only on macOS.  In HEAD,
I propose we simplify life by doing it everywhere, as attached.

regards, tom lane

diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index 18240b5fef..7b66590801 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -589,19 +589,27 @@ endif
 libpq = -L$(libpq_builddir) -lpq
 
 # libpq_pgport is for use by client executables (not libraries) that use libpq.
-# We force clients to pull symbols from the non-shared libraries libpgport
-# and libpgcommon rather than pulling some libpgport symbols from libpq just
-# because libpq uses those functions too.  This makes applications less
-# dependent on changes in libpq's usage of pgport (on platforms where we
-# don't have symbol export control for libpq).  To do this we link to
-# pgport before libpq.  This does cause duplicate -lpgport's to appear
-# on client link lines, since that also appears in $(LIBS).
+# We used to use this to force libpgport and libpgcommon to be linked before
+# libpq, ensuring that clients would pull symbols from those libraries rather
+# than possibly getting them from libpq (and thereby having an unwanted
+# dependency on which symbols libpq uses).  However, now that we can prevent
+# libpq from exporting those symbols on all platforms of interest, we don't
+# worry about that anymore.  The previous technique resulted in duplicate
+# libraries in link commands, since those libraries also appear in $(LIBS).
+# Some platforms warn about that, so avoiding those warnings is now more
+# important.  Hence, $(libpq_pgport) is now equivalent to $(libpq), but we
+# still recommend using it for client executables in case some other reason
+# appears to handle them differently.
+libpq_pgport = $(libpq)
+
 # libpq_pgport_shlib is the same idea, but for use in client shared libraries.
+# We need those clients to use the shlib variants.  (Ideally, users of this
+# macro would strip libpgport and libpgcommon from $(LIBS), but no harm is
+# done if they don't, since they will have satisfied all their references
+# from these libraries.)
 ifdef PGXS
-libpq_pgport = -L$(libdir) -lpgcommon -lpgport $(libpq)
 libpq_pgport_shlib = -L$(libdir) -lpgcommon_shlib -lpgport_shlib $(libpq)
 else
-libpq_pgport = -L$(top_builddir)/src/common -lpgcommon -L$(top_builddir)/src/port -lpgport $(libpq)
 libpq_pgport_shlib = -L$(top_builddir)/src/common -lpgcommon_shlib -L$(top_builddir)/src/port -lpgport_shlib $(libpq)
 endif
 


Re: Remove distprep

2023-09-29 Thread Andres Freund
Hi,

On 2023-08-23 12:46:45 +0200, Peter Eisentraut wrote:
> Subject: [PATCH v4] Remove distprep
> 
> A PostgreSQL release tarball contains a number of prebuilt files, in
> particular files produced by bison, flex, and perl, as well as html and
> man documentation.  We have done this consistent with established
> practice at the time to not require these tools for building from a
> tarball.  Some of these tools were hard to get, or get the right
> version of, from time to time, and shipping the prebuilt output was a
> convenience to users.
> 
> Now this has at least two problems:
> 
> One, we have to make the build system(s) work in two modes: Building
> from a git checkout and building from a tarball.  This is pretty
> complicated, but it works so far for autoconf/make.  It does not
> currently work for meson; you can currently only build with meson from
> a git checkout.  Making meson builds work from a tarball seems very
> difficult or impossible.  One particular problem is that since meson
> requires a separate build directory, we cannot make the build update
> files like gram.h in the source tree.  So if you were to build from a
> tarball and update gram.y, you will have a gram.h in the source tree
> and one in the build tree, but the way things work is that the
> compiler will always use the one in the source tree.  So you cannot,
> for example, make any gram.y changes when building from a tarball.
> This seems impossible to fix in a non-horrible way.

I think it might be possible to fix in a non-horrible way, just that the
effort doing so could be much better spent on other things.

It's maybe also worth mentioning that this does *not* work reliably with vpath
make builds either...


> The make maintainer-clean target, whose job it is to remove the
> prebuilt files in addition to what make distclean does, is now just an
> alias to make distprep.  (In practice, it is probably obsolete given
> that git clean is available.)

FWIW, I find a "full clean" target useful to be sure that we don't produce
"untracked" build products. Do a full build, then run "full clean", then see
what remains.


>  88 files changed, 169 insertions(+), 409 deletions(-)

It might be worthwhile to split this into a bit smaller chunks, e.g. depending
on perl, bison, flex, and then separately the various makefile bits that are
all over the tree.

Greetings,

Andres Freund




Re: Eager page freeze criteria clarification

2023-09-29 Thread Peter Geoghegan
On Fri, Sep 29, 2023 at 7:55 AM Robert Haas  wrote:
> On Thu, Sep 28, 2023 at 12:03 AM Peter Geoghegan  wrote:
> > But isn't the main problem *not* freezing when we could and
> > should have? (Of course the cost of freezing is very relevant, but
> > it's still secondary.)
>
> Perhaps this is all in how you look at it, but I don't see it this
> way. It's easy to see how to solve the "not freezing" problem: just
> freeze everything as often as possible. If that were cheap enough that
> we could just do it, then we'd just do it and be done here. The
> problem is that, at least in my opinion, that seems too expensive in
> some cases. I'm starting to believe that those cases are narrower than
> I once thought, but I think they do exist. So now, I'm thinking that
> maybe the main problem is identifying when you've got such a case, so
> that you know when you need to be less aggressive.

Assuming your concern is more or less limited to those cases where the
same page could be frozen an unbounded number of times (once or almost
once per VACUUM), then I think we fully agree. We ought to converge on
the right behavior over time, but it's particularly important that we
never converge on the wrong behavior instead.

> > Won't the algorithm that you've sketched always think that
> > "unfreezing" pages doesn't affect recently frozen pages with such a
> > workload? Isn't the definition of "recently frozen" that emerges from
> > this algorithm not in any way related to the order delivery time, or
> > anything like that? You know, rather like vacuum_freeze_min_age.
>
> FWIW, I agree that vacuum_freeze_min_age sucks. I have been reluctant
> to endorse changes in this area mostly because I fear replacing one
> bad idea with another, not because I think that what we have now is
> particularly good. It's better to be wrong in the same way in every
> release than to have every release be equally wrong but in a different
> way.
>
> Also, I think the question of what "recently frozen" means is a good
> one, but I'm not convinced that it ought to relate to the order
> delivery time.

I don't think it should, either.

The TPC-C scenario is partly interesting because it isn't actually
obvious what the most desirable behavior is, even assuming that you
had perfect information, and were not subject to practical
considerations about the complexity of your algorithm. There doesn't
seem to be perfect clarity on what the goal should actually be in such
scenarios -- it's not like the problem is just that we can't agree on
the best way to accomplish those goals with this specific workload.

If performance/efficiency and performance stability are directly in
tension (as they sometimes are), how much do you want to prioritize
one or the other? It's not an easy question to answer. It's a value
judgement as much as anything else.

> If we insert into a table and 12-14 hours go by before
> it's updated, it doesn't seem particularly bad to me if we froze that
> data meanwhile (regardless of what metric drove that freezing). Same
> thing if it's 2-4 hours. What seems bad to me is if we're constantly
> updating the table and vacuum comes sweeping through and freezing
> everything to no purpose over and over again and then it gets
> un-frozen a few seconds or minutes later.

Right -- we agree here.

I even think that it makes sense to freeze pages knowing for sure that
the pages will be unfrozen on that sort of timeline (at least with a
large and ever growing table like this). It may technically be less
efficient, but not when you consider how everything else is affected
by the disruptive impact of freezing a great deal of stuff all at
once. (Of course it's also true that we don't really know what will
happen, which is all the more reason to freeze eagerly.)

> Now maybe that's the wrong idea. After all, as a percentage, the
> overhead is the same either way, regardless of whether we're talking
> about WAL volume or CPU cycles. But somehow it feels worse to make the
> same mistakes every few minutes or potentially even tens of seconds
> than it does to make the same mistakes every few hours. The absolute
> cost is a lot higher.

I agree.

Another problem with the algorithm that Andres sketched is that it
supposes that vacuum_freeze_min_age means something relevant -- that's
how we decide whether or not freezing should count as "opportunistic".
But there really isn't that much difference between (say) an XID age
of 25 million and 50 million. At least not with a table like the TPC-C
tables, where VACUUMs are naturally big operations that take place
relatively infrequently.

Assume a default vacuum_freeze_min_age of 50 million. How can it make
sense to deem freezing a page "opportunistic" when its oldest XID has
only attained an age of 25 million, if the subsequent unfreezing
happens when that same XID would have attained an age of 75 million,
had we not frozen it? And if you agree that it doesn't make sense, how
can we compensate for this effect, as 

Re: Annoying build warnings from latest Apple toolchain

2023-09-29 Thread Tristan Partin

On Thu Sep 28, 2023 at 5:22 PM CDT, Andres Freund wrote:

> Hi,
>
> On 2023-09-28 16:46:08 -0400, Tom Lane wrote:
> > There's still one duplicate warning
> > from the backend link:
> >
> > ld: warning: ignoring duplicate libraries: '-lpam'
> >
> > I'm a bit baffled why that's showing up; there's no obvious
> > double reference to pam.
>
> I think it's because multiple libraries/binaries depend on it. Meson knows how
> to deduplicate libraries found via pkg-config (presumably because that has
> enough information for a topological sort), but apparently not when they're
> found as "raw" libraries.  Until now that was also just pretty harmless, so I
> understand not doing anything about it.
>
> I see a way to avoid the warnings, but perhaps it's better to ask the meson
> folks to put in a generic way of dealing with this.


I wonder if this Meson PR[0] will help.

[0]: https://github.com/mesonbuild/meson/pull/12276

--
Tristan Partin
Neon (https://neon.tech)




Re: Annoying build warnings from latest Apple toolchain

2023-09-29 Thread Tom Lane
Andres Freund  writes:
> On 2023-09-28 22:53:09 -0400, Tom Lane wrote:
>> (Perhaps we should apply the above to HEAD alongside the meson.build fix, to
>> get more test coverage?)

> The macos animals BF seem to run Ventura, so I think it'd not really provide
> additional coverage that CI and your manual testing already has. So probably
> not worth it from that angle?

My thought was that if it's in the tree we'd get testing from
non-buildfarm sources.

FWIW, I'm going to update sifaka/indri to Sonoma before too much longer
(they're already using Xcode 15.0 which is the Sonoma toolchain).
I recently pulled longfin up to Ventura, and plan to leave it on that
for the next year or so.  I don't think anyone else is running macOS
animals.

regards, tom lane




Re: Eager page freeze criteria clarification

2023-09-29 Thread Robert Haas
On Thu, Sep 28, 2023 at 12:03 AM Peter Geoghegan  wrote:
> But isn't the main problem *not* freezing when we could and
> should have? (Of course the cost of freezing is very relevant, but
> it's still secondary.)

Perhaps this is all in how you look at it, but I don't see it this
way. It's easy to see how to solve the "not freezing" problem: just
freeze everything as often as possible. If that were cheap enough that
we could just do it, then we'd just do it and be done here. The
problem is that, at least in my opinion, that seems too expensive in
some cases. I'm starting to believe that those cases are narrower than
I once thought, but I think they do exist. So now, I'm thinking that
maybe the main problem is identifying when you've got such a case, so
that you know when you need to be less aggressive.

> Won't the algorithm that you've sketched always think that
> "unfreezing" pages doesn't affect recently frozen pages with such a
> workload? Isn't the definition of "recently frozen" that emerges from
> this algorithm not in any way related to the order delivery time, or
> anything like that? You know, rather like vacuum_freeze_min_age.

FWIW, I agree that vacuum_freeze_min_age sucks. I have been reluctant
to endorse changes in this area mostly because I fear replacing one
bad idea with another, not because I think that what we have now is
particularly good. It's better to be wrong in the same way in every
release than to have every release be equally wrong but in a different
way.

Also, I think the question of what "recently frozen" means is a good
one, but I'm not convinced that it ought to relate to the order
delivery time. If we insert into a table and 12-14 hours go by before
it's updated, it doesn't seem particularly bad to me if we froze that
data meanwhile (regardless of what metric drove that freezing). Same
thing if it's 2-4 hours. What seems bad to me is if we're constantly
updating the table and vacuum comes sweeping through and freezing
everything to no purpose over and over again and then it gets
un-frozen a few seconds or minutes later.

Now maybe that's the wrong idea. After all, as a percentage, the
overhead is the same either way, regardless of whether we're talking
about WAL volume or CPU cycles. But somehow it feels worse to make the
same mistakes every few minutes or potentially even tens of seconds
than it does to make the same mistakes every few hours. The absolute
cost is a lot higher.

> On a positive note, I like that what you've laid out freezes eagerly
> when an FPI won't result -- this much we can all agree on. I guess
> that that part is becoming uncontroversial.

I don't think that we're going to be able to get away with freezing
rows in a small, frequently-updated table just because no FPI will
result. I think Melanie's results show that the cost is not
negligible. But Andres's pseudocode algorithm, although it is more
aggressive in that case, doesn't necessarily seem bad to me, because
it still has some ability to hold off freezing in such cases if our
statistics show that it isn't working out.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: how to manage Cirrus on personal repository

2023-09-29 Thread Andres Freund
Hi,

On 2023-09-29 11:13:24 +0200, Peter Eisentraut wrote:
> I have a private repository on GitHub where I park PostgreSQL development
> work.  I also have Cirrus activated on that repository, to build those
> branches, using the existing Cirrus configuration.
> 
> Now under the new system of limited credits that started in September, this
> blew through the credits about half-way through the month.

:(


> Does anyone have an idea how to manage this better?  Is there maybe a way to
> globally set "only trigger manually", or could we make one?

One way is to configure to run under a custom compute account. If we get
credits from google, that might be realistic to provide to many hackers, it's
not that expensive to do.

Another thing is to work on our tests - we apparently waste *tremendous*
amounts of time due to tap tests forking psql over and over.


We definitely can provide a way to configure on a repo level which tests run
automatically. I think that would be useful not just for turning things
manual, but also the opposite, enabling tests that we don't want everybody to
run to automatically run as part of cfbot. E.g. running mingw tests during
cfbot while keeping it manual for everyone else.

Maybe we should have two environment variables, which can be overridden at
the repository level: one to make manual tasks run by default, and one the
other way around?


> I suppose one way to deal with this is to make a second repository and only
> activate Cirrus on that one, and push there on demand.

You already can control which platforms are run via commit messages,
fwiw. By adding, e.g.:
ci-os-only: linux, macos

Greetings,

Andres Freund




Re: Annoying build warnings from latest Apple toolchain

2023-09-29 Thread Andres Freund
Hi,

On 2023-09-28 22:53:09 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > On 2023-09-28 19:20:27 -0700, Andres Freund wrote:
> >> Thus the easiest fix looks to be to use this:
> >> -  export_fmt = '-exported_symbols_list=@0@'
> >> +  export_fmt = '-Wl,-exported_symbols_list,@0@'
> >> I don't have anything older than Ventura to check though.
> 
> I don't have meson installed on my surviving Catalina box, but
> I tried the equivalent thing in the Makefile universe:
> 
> diff --git a/src/Makefile.shlib b/src/Makefile.shlib
> index f94d59d1c5..f2ed222cc7 100644
> --- a/src/Makefile.shlib
> +++ b/src/Makefile.shlib
> @@ -136,7 +136,7 @@ ifeq ($(PORTNAME), darwin)
>BUILD.exports= $(AWK) '/^[^\#]/ {printf "_%s\n",$$1}' $< 
> >$@
>exports_file = $(SHLIB_EXPORTS:%.txt=%.list)
>ifneq (,$(exports_file))
> -exported_symbols_list = -exported_symbols_list $(exports_file)
> +exported_symbols_list = -Wl,-exported_symbols_list,$(exports_file)
>endif
>  endif
> 
> That builds and produces correctly-symbol-trimmed shlibs, so I'd
> say it's fine.

Thanks for testing!

I'll go and push that 16/HEAD then.


> (Perhaps we should apply the above to HEAD alongside the meson.build fix, to
> get more test coverage?)

The macos animals BF seem to run Ventura, so I think it'd not really provide
additional coverage that CI and your manual testing already has. So probably
not worth it from that angle?


> > Attached is the above change and a commit to change CI over to Sonoma. Not
> > sure when we should switch, but it seems useful to include for testing
> > purposes at the very least.
> 
> No opinion on when to switch CI.  Sonoma is surely pretty bleeding edge
> yet.

Yea, it does feel like that...

Greetings,

Andres Freund




Re: [PGDOCS] change function linkend to refer to a more relevant target

2023-09-29 Thread Daniel Gustafsson
> On 29 Sep 2023, at 10:54, Daniel Gustafsson  wrote:
> 
>> On 29 Sep 2023, at 06:51, Peter Smith  wrote:
> 
>> I found a link in the docs that referred to the stats "views" section,
>> instead of the more relevant (IMO)  stats "functions" section.
> 
> Agreed.  This link was added in 2007 (in 7d3b7011b), and back in 7.3/7.4 days
> the functions were listed in the same section as the views, so the anchor was
> at the time pointing to the right section.  In 2012 it was given its own
> section (in aebe989477) but the link wasn't updated.
> 
> Thanks for the patch, I'll go ahead with it.

Applied to all supported branches, thanks!

--
Daniel Gustafsson





Re: Testing autovacuum wraparound (including failsafe)

2023-09-29 Thread Noah Misch
On Fri, Sep 29, 2023 at 12:17:04PM +0200, Daniel Gustafsson wrote:
> +# Bump the query timeout to avoid false negatives on slow test systems.
> +my $psql_timeout_secs = 4 * $PostgreSQL::Test::Utils::timeout_default;
> 
> Should we bump the timeout like this for all systems?  I interpreted Noah's
> comment such that it should be possible for slower systems to override, not
> that it should be extended everywhere, but I might have missed something.

This is the conventional way to do it.  For an operation far slower than a
typical timeout_default situation, the patch can and should dilate the default
timeout like this.  The patch version as of my last comment was extending the
timeout but also blocking users from extending it further via
PG_TEST_TIMEOUT_DEFAULT.  The latest version restores PG_TEST_TIMEOUT_DEFAULT
reactivity, resolving my comment.




Re: Eager page freeze criteria clarification

2023-09-29 Thread Robert Haas
On Wed, Sep 27, 2023 at 7:09 PM Melanie Plageman
 wrote:
> At the risk of seeming too execution-focused, I want to try and get more
> specific. Here is a description of an example implementation to test my
> understanding:
>
> In table-level stats, save two numbers: younger_than_cpt/older_than_cpt
> storing the number of instances of unfreezing a page which is either
> younger or older than the start of the most recent checkpoint at the
> time of its unfreezing

When I first read this, I thought this sounded right, except for the
naming which doesn't mention freezing, but on further thought, this
feels like it might not be the right thing. Why do I care how many
super-old pages (regardless of how exactly we define super-old) I
unfroze? I think what I care about is more like: how often do I have
to unfreeze a recently-frozen page in order to complete a DML
operation?

For example, consider a table with a million pages, 10 of which are
frozen. 1 was frozen a long time ago, 9 were frozen recently, and the
other 999,990 are unfrozen. I update every page in the table. Well,
now older_than_cpt = 1 and younger_than_cpt = 9, so the ratio of the
two is terrible: 90% of the frozen pages I encountered were
recently-frozen. But that's irrelevant. What matters is that if I had
frozen less aggressively, I could have saved myself 9 unfreeze
operations across a million page modifications. So actually I'm doing
great: only 0.0009% of my page modifications required an unfreeze that
I maybe should have avoided and didn't.

Likewise, if all of the pages had been frozen, but only 9 were frozen
recently, I'm doing exactly equally great. But if the number of
recently frozen pages were higher, then I'd be doing less great. So I
think comparing the number of young pages unfrozen to the number of
old pages unfrozen is the wrong test, and comparing it instead to the
number of pages modified is a better test.

But I don't think it's completely right, either. Imagine a table with
100 recently frozen pages. I modify the table and unfreeze one of
them. So, 100% of the pages that I unfroze were recently-frozen. Does
this mean that we should back off and freeze less aggressively? Not
really. It seems like I actually got this case 99% right. So another
idea could be to look at what percentage of frozen pages I end up
un-freezing shortly thereafter. That gets this example right. But it
gets the earlier example with a million pages wrong.

Maybe there's something to combining these ideas. If the percentage of
page modifications (of clean pages, say, so that we don't skew the
stats as much when a page is modified many times in a row) that have
to un-freeze a recently-frozen page is low, then that means there's no
real performance penalty for whatever aggressive freezing we're doing.
Even if we un-freeze a zillion pages, if that happens over the course
of modifying 100 zillion pages, it's not material. On the flip side,
if the percentage of frozen pages that get unfrozen soon thereafter is
low, then it feels worth doing even if most page modifications end up
un-freezing a recently-frozen page, because on net we're still ending
up with a lot of extra stuff frozen.

Said differently, we're freezing too aggressively if un-freezing is
both adding a significant expense (i.e. the foreground work we're
doing often requires un-freezing pages) and ineffective (i.e. we don't
end up with frozen pages left over). If it has one of those problems,
it's still OK, but if it has both, we need to back off.

Maybe that's still not quite right, but it's the best I've got this morning.

> You would likely want to combine it with one of the other heuristics we
> discussed.
>
> For example:
> For a table with only 20% younger unfreezings, when vacuuming that page,
>
>   if insert LSN - RedoRecPtr < insert LSN - page LSN
>   page is older than the most recent checkpoint start, so freeze it
>   regardless of whether or not it would emit an FPI
>
> What aggressiveness levels should there be? What should change at each
> level? What criteria should pages have to meet to be subject to the
> aggressiveness level?

My guess is that, when we're being aggressive, we should be aiming to
freeze all pages that can be completely frozen with no additional
criterion. Any additional criterion that we add here is likely to mess
up the insert-only table case, and I'm not really sure what it saves
us from. On the other hand, when we're being non-aggressive, we might
still sometimes want to freeze in situations where we historically
haven't. Consider the following three examples:

1. pgbench_accounts table, standard run
2. pgbench_accounts table, highly skewed run
3. slowly growing table with a hot tail. records are inserted, updated
heavily for a while, thereafter updated only very occasionally.

Aggressive freezing could misfire in all of these cases, because all
of them have some portion of the table where rows are being updated
very frequently. But in the second and third cases, 

Re: how to manage Cirrus on personal repository

2023-09-29 Thread Melanie Plageman
On Fri, Sep 29, 2023 at 7:11 AM Daniel Gustafsson  wrote:
>
> > On 29 Sep 2023, at 11:13, Peter Eisentraut  wrote:
>
> > Does anyone have an idea how to manage this better?  Is there maybe a way 
> > to globally set "only trigger manually", or could we make one?
>
> On my personal repo I only build via doing pull-requests, such that I can
> control when and what is built and rate-limit myself that way.  Using the
> Github CLI client it's quite easy to script "push-and-build".  Not sure if 
> it's
> better, it's just what I got used to from doing personal CI on Github before 
> we
> had Cirrus support in the tree.

It is not a global configuration solution, but I just add an empty
ci-os-only: tag to my commit messages so it doesn't trigger CI.
I'm sure this is not what you are looking for, though

- Melanie




Re: pg_upgrade and logical replication

2023-09-29 Thread Amit Kapila
On Wed, Sep 27, 2023 at 9:14 AM Michael Paquier  wrote:
>
> On Tue, Sep 26, 2023 at 09:40:48AM +0530, Amit Kapila wrote:
> > On Mon, Sep 25, 2023 at 11:43 AM Michael Paquier  
> > wrote:
> >> Sure, that's assuming that the publisher side is upgraded.
> >
> > At some point, user needs to upgrade publisher and subscriber could
> > itself have some publications defined which means the downstream
> > subscribers will have the same problem.
>
> Not always.  I take it as a valid case that one may want to create a
> logical setup only for the sake of an upgrade, and trashes the
> publisher after a failover to an upgraded subscriber node after the
> latter has done a sync up of the data that's been added to the
> relations tracked by the publications while the subscriber was
> pg_upgrade'd.
>

Such a use case is possible to achieve even without this patch.
Sawada-San has already given an alternative to slightly tweak the
steps mentioned by Julien to achieve it. Also, there are other ways to
achieve it by slightly changing the steps. OTOH, it will create a
problem for a normal logical replication setup after upgrade, as
discussed.

-- 
With Regards,
Amit Kapila.




RE: [PoC] pg_upgrade: allow to upgrade publisher node

2023-09-29 Thread Hayato Kuroda (Fujitsu)
Dear Bharath,

Thanks for giving your idea!

> > I think this approach can work, but I am not sure if it's better than other
> > approaches. Mainly because it has almost the same maintenance burden as the
> > current approach, i.e. we need to verify and update the check function each
> > time we add a new WAL record type.
> 
> I think that's not a big problem if we have comments in
> replication/decode.h, access/rmgrlist.h, docs to categorize the new
> WAL records as decodable. Currently, the WAL record types adders will
> have to do certain things based on notes in comments or docs anyways.
> 
> Another idea to enforce categorizing decodability of WAL records is to
> have a new RMGR API rm_is_record_decodable or such, the RMGR
> implementers will then add respective functions returning true/false
> if a given WAL record is decodable or not:
> void    (*rm_decode) (struct LogicalDecodingContext *ctx,
>                       struct XLogRecordBuffer *buf);
> bool    (*rm_is_record_decodable) (uint8 type);
> } RmgrData;
> 
> PG_RMGR(RM_XLOG_ID, "XLOG", xlog_redo, xlog_desc, xlog_identify, NULL,
> NULL, NULL, xlog_is_record_decodable), then the
> xlog_is_record_decodable can look something like [1].
> 
> This approach can also enforce/help custom RMGR implementers to define
> the decodability of the WAL records.

Yeah, the approach forces developers to check decodability.
But the benefit seems smaller than the effort it requires, because the function
would be used only by pg_upgrade. Could you tell me if you have another use case
in mind? We may be able to adopt it if we have one...
Also, this approach cannot be backported.

Anyway, let's see what senior members say.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



Re: Synchronizing slots from primary to standby

2023-09-29 Thread Amit Kapila
On Thu, Sep 28, 2023 at 6:31 PM Drouvot, Bertrand
 wrote:
>
> On 9/25/23 6:10 AM, shveta malik wrote:
> > On Fri, Sep 22, 2023 at 3:48 PM Amit Kapila  wrote:
> >>
> >> On Thu, Sep 21, 2023 at 9:16 AM shveta malik  
> >> wrote:
> >>>
> >>> On Tue, Sep 19, 2023 at 10:29 AM shveta malik  
> >>> wrote:
> >>>
> >>> Currently in patch001, synchronize_slot_names is a GUC on both primary
> >>> and physical standby. This GUC tells which all logical slots need to
> >>> be synced on physical standbys from the primary. Ideally it should be
> >>> a GUC on physical standby alone and each physical standby should be
> >>> able to communicate the value to the primary (considering the value
> >>> may vary for different physical replicas of the same primary). The
> >>> primary on the other hand should be able to take UNION of these values
> >>> and let the logical walsenders (belonging to the slots in UNION
> >>> synchronize_slots_names) wait for physical standbys for confirmation
> >>> before sending those changes to logical subscribers. The intent is
> >>> logical subscribers should never be ahead of physical standbys.
> >>>
> >>
> >> Before getting into the details of 'synchronize_slot_names', I would
> >> like to know whether we really need the second GUC
> >> 'standby_slot_names'. Can't we simply allow all the logical wal
> >> senders corresponding to 'synchronize_slot_names' to wait for just the
> >> physical standby(s) (physical slot corresponding to such physical
> >> standby) that have sent ' synchronize_slot_names'list? We should have
> >> one physical standby slot corresponding to one physical standby.
> >>
> >
> > yes, with the new approach (to be implemented next) where we plan to
> > send synchronize_slot_names from each physical standby to primary, the
> > standby_slot_names GUC should no longer be needed on primary. The
> > physical standbys sending requests should automatically become the
> > ones to be waited for confirmation on the primary.
> >
>
> I think that standby_slot_names could be used to do some filtering (meaning
> for which standby(s) we don't want the logical replication on the primary to go
> ahead and for which standby(s) one would allow it).
>

Isn't it implicit that the physical standby that has requested
'synchronize_slot_names' should be ahead of their corresponding
logical walsenders? Otherwise, after the switchover to the new
physical standby, the logical subscriptions won't work.

> I think that removing the GUC would:
>
> - remove this flexibility
>

I think if required we can add such a GUC later as well. Asking users
to set more parameters also makes the feature less attractive, so I am
trying to see if we can avoid this GUC.

> - probably open corner cases like: what if a standby is down? would that mean
> that synchronize_slot_names not being send to the primary would allow the 
> decoding
> on the primary to go ahead?
>

Good question. BTW, irrespective of whether we have the
'standby_slot_names' parameter or not, how should we behave if a
standby is down? Say, if 'synchronize_slot_names' is only specified on
the standby, then in such a situation the primary won't even be aware that
some of the logical walsenders need to wait. OTOH, one can say that users
should configure 'synchronize_slot_names' on both primary and standby,
but note that this value could be different for different standbys,
so we can't configure it on the primary.

-- 
With Regards,
Amit Kapila.




Re: how to manage Cirrus on personal repository

2023-09-29 Thread Nazir Bilal Yavuz
Hi,

On Fri, 29 Sept 2023 at 13:24, Peter Eisentraut  wrote:
>
> I have a private repository on GitHub where I park PostgreSQL
> development work.  I also have Cirrus activated on that repository, to
> build those branches, using the existing Cirrus configuration.
>
> Now under the new system of limited credits that started in September,
> this blew through the credits about half-way through the month.
>
> Does anyone have an idea how to manage this better?  Is there maybe a
> way to globally set "only trigger manually", or could we make one?

You can create another repository / branch with only a .cirrus.yml
file which will only have the 'trigger_type: manual' line. Then, you
can go to your private repository's settings on Cirrus CI and set
REPO_CI_CONFIG_GIT_URL=github.com/{user}/{repository}/.cirrus.yml@{branch}
environment variable. This will write contents of the newly created
.cirrus.yml file to your private repository's .cirrus.yml
configuration while running the CI. You can look at the .cirrus.star
file for more information. I just tested this on a public repository
and it worked but I am not sure if something is different for private
repositories. I hope these help.

Regards,
Nazir Bilal Yavuz
Microsoft




Re: [PoC] pg_upgrade: allow to upgrade publisher node

2023-09-29 Thread Amit Kapila
On Fri, Sep 29, 2023 at 1:00 PM Bharath Rupireddy
 wrote:
>
> On Thu, Sep 28, 2023 at 6:08 PM Zhijie Hou (Fujitsu)
>  wrote:
>
> IMO, the WAL scanning approach looks better. However, if we were to optimize
> it by not scanning WAL records for every replication slot's
> confirmed_flush_lsn (CFL), start with the lowest CFL (min of all slots'
> CFLs), and scan till the end of WAL.
>

Earlier, I also thought something like that but I guess it won't
matter much as most of the slots will be up-to-date at shutdown time.
That would mean we would read just one or two records. Personally, I
feel it is better to build consensus on the WAL scanning approach,
basically, is it okay to decide as the patch is currently doing or
whether we should expose an API from the decode module as you are
proposing? OTOH, if we want to go with other approach like adding
field in pg_controldata then we don't need to deal with WAL record
types at all.

-- 
With Regards,
Amit Kapila.




Re: how to manage Cirrus on personal repository

2023-09-29 Thread Daniel Gustafsson
> On 29 Sep 2023, at 11:13, Peter Eisentraut  wrote:

> Does anyone have an idea how to manage this better?  Is there maybe a way to 
> globally set "only trigger manually", or could we make one?

On my personal repo I only build via doing pull-requests, such that I can
control when and what is built and rate-limit myself that way.  Using the
Github CLI client it's quite easy to script "push-and-build".  Not sure if it's
better, it's just what I got used to from doing personal CI on Github before we
had Cirrus support in the tree.

--
Daniel Gustafsson





Re: Testing autovacuum wraparound (including failsafe)

2023-09-29 Thread vignesh C
On Thu, 28 Sept 2023 at 03:55, Masahiko Sawada  wrote:
>
> Sorry for the late reply.
>
> On Sun, Sep 3, 2023 at 2:48 PM Noah Misch  wrote:
> >
> > On Wed, Jul 12, 2023 at 01:47:51PM +0200, Daniel Gustafsson wrote:
> > > > On 12 Jul 2023, at 09:52, Masahiko Sawada  wrote:
> > > > Agreed. The timeout can be set by manually setting
> > > > PG_TEST_TIMEOUT_DEFAULT, but I bump it to 10 min by default. And it
> > > > now requires setting PG_TEST_EXTRA to run it.
> > >
> > > +# bump the query timeout to avoid false negatives on slow test syetems.
> > > typo: s/syetems/systems/
> > >
> > >
> > > +# bump the query timeout to avoid false negatives on slow test syetems.
> > > +$ENV{PG_TEST_TIMEOUT_DEFAULT} = 600;
> > > Does this actually work?  Utils.pm read the environment variable at 
> > > compile
> > > time in the BEGIN block so this setting won't be seen?  A quick 
> > > testprogram
> > > seems to confirm this but I might be missing something.
> >
> > The correct way to get a longer timeout is "IPC::Run::timer(4 *
> > $PostgreSQL::Test::Utils::timeout_default);".  Even if changing env worked,
> > that would be removing the ability for even-slower systems to set timeouts
> > greater than 10min.
>
> Agreed.
>
> I've attached new version patches. 0001 patch adds an option to
> background_psql to specify the timeout seconds, and 0002 patch is the
> main regression test patch.

Few comments:
1) Should we have some validation for the inputs given:
+PG_FUNCTION_INFO_V1(consume_xids_until);
+Datum
+consume_xids_until(PG_FUNCTION_ARGS)
+{
+   FullTransactionId targetxid =
FullTransactionIdFromU64((uint64) PG_GETARG_INT64(0));
+   FullTransactionId lastxid;
+
+   if (!FullTransactionIdIsNormal(targetxid))
+   elog(ERROR, "targetxid %llu is not normal", (unsigned
long long) U64FromFullTransactionId(targetxid));

If not, it will accept inputs like -1 and 999.
Also, the NOTICE messages might be confusing for such values, as they
will show a different untilxid value, like below:
postgres=# SELECT consume_xids_until(999);
NOTICE:  consumed up to 0:1809 / 232830:2764472319

2) Should this be added after worker_spi, as we generally add these in
alphabetical order:
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index fcd643f6f1..4054bde84c 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -10,6 +10,7 @@ subdir('libpq_pipeline')
 subdir('plsample')
 subdir('spgist_name_ops')
 subdir('ssl_passphrase_callback')
+subdir('xid_wraparound')
 subdir('test_bloomfilter')

3) Similarly here too:
index e81873cb5a..a4c845ab4a 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -13,6 +13,7 @@ SUBDIRS = \
  libpq_pipeline \
  plsample \
  spgist_name_ops \
+ xid_wraparound \
  test_bloomfilter \

4) The following includes are not required: transam.h, fmgr.h, lwlock.h
+ * src/test/modules/xid_wraparound/xid_wraparound.c
+ *
+ * -
+ */
+#include "postgres.h"
+
+#include "access/transam.h"
+#include "access/xact.h"
+#include "fmgr.h"
+#include "miscadmin.h"
+#include "storage/lwlock.h"
+#include "storage/proc.h"

Regards,
Vignesh




Re: Testing autovacuum wraparound (including failsafe)

2023-09-29 Thread Daniel Gustafsson
> On 27 Sep 2023, at 14:39, Masahiko Sawada  wrote:

> I've attached new version patches. 0001 patch adds an option to
> background_psql to specify the timeout seconds, and 0002 patch is the
> main regression test patch.

-=item PostgreSQL::Test::BackgroundPsql->new(interactive, @params)
+=item PostgreSQL::Test::BackgroundPsql->new(interactive, @params, timeout)

Looking at this I notice that I made a typo in 664d757531e, the =item line
should have "@psql_params" and not "@params".  Perhaps you can fix that minor
thing while in there?


+   $timeout = $params{timeout} if defined $params{timeout};

I think this should be documented in the background_psql POD docs.


+   Not enabled by default it is resource intensive.

This sentence is missing a "because", should read: "..by default *because* it
is.."


+# Bump the query timeout to avoid false negatives on slow test systems.
+my $psql_timeout_secs = 4 * $PostgreSQL::Test::Utils::timeout_default;

Should we bump the timeout like this for all systems?  I interpreted Noah's
comment such that it should be possible for slower systems to override, not
that it should be extended everywhere, but I might have missed something.

--
Daniel Gustafsson





Re: On login trigger: take three

2023-09-29 Thread Daniel Gustafsson
> On 28 Sep 2023, at 23:50, Alexander Korotkov  wrote:

> I don't think I can reproduce the performance regression pointed out
> by Pavel Stehule [1].

> I can't confirm the measurable overhead.


Running the same pgbench command on my laptop, looking at the average connection
times and averaging those over five runs (low/avg/high), I see a ~5% increase
over master with the patched version (compiled without assertions and debug):

Patched event_triggers on:  6.858 ms/7.038 ms/7.434 ms
Patched event_triggers off: 6.601 ms/6.958 ms/7.539 ms
Master: 6.676 ms/6.697 ms/6.760 ms

This is all quite unscientific with a lot of jitter so grains of salt are to be
applied, but I find it odd that you don't see any measurable effect.  Are you
seeing the same/similar connection times between master and with this patch
applied?

A few small comments on the patch:

+ prevent successful login to the system. Such bugs may be fixed by
+ restarting the system in single-user mode (as event triggers are
This paragraph should be reworded to recommend the GUC instead of single-user
mode (while retaining mention of single-user mode, just not as the primary
option).


+ Also, it's recommended to evade long-running queries in
s/evade/avoid/ perhaps?


Thanks for working on this!

--
Daniel Gustafsson





Re: Remove MSVC scripts from the tree

2023-09-29 Thread Peter Eisentraut

On 22.09.23 03:12, Michael Paquier wrote:

> It has been mentioned a few times now that, as Meson has been
> integrated in the tree, the next step would be to get rid of the
> custom scripts in src/tools/msvc/ and moving forward only support
> Meson when building with VS compilers.  As far as I understand, nobody
> has sent a patch to do that yet, so here is one.
>
> Please find attached a patch to move the needle in this sense.


Your patch still leaves various mentions of Mkvcbuild.pm and Project.pm 
in other files, including in


config/perl.m4
meson.build
src/bin/pg_basebackup/Makefile
src/bin/pgevent/meson.build
src/common/Makefile
src/common/meson.build
src/interfaces/libpq/Makefile
src/port/Makefile

A few of these comments are like "see $otherfile for the reason", which 
means that if we delete $otherfile, we should move that information to a 
new site somehow.






how to manage Cirrus on personal repository

2023-09-29 Thread Peter Eisentraut
I have a private repository on GitHub where I park PostgreSQL 
development work.  I also have Cirrus activated on that repository, to 
build those branches, using the existing Cirrus configuration.


Now under the new system of limited credits that started in September, 
this blew through the credits about half-way through the month.


Does anyone have an idea how to manage this better?  Is there maybe a 
way to globally set "only trigger manually", or could we make one?


I suppose one way to deal with this is to make a second repository and 
only activate Cirrus on that one, and push there on demand.


Any ideas?




Re: pg_resetwal regression: could not upgrade after 1d863c2504

2023-09-29 Thread Peter Eisentraut

On 29.09.23 09:39, Hayato Kuroda (Fujitsu) wrote:

> pg_resetwal cannot be executed with a relative path. It worked at 7273945,
> but not at 1d863.


Ok, I have reverted the offending patch.





Re: [PGDOCS] change function linkend to refer to a more relevant target

2023-09-29 Thread Daniel Gustafsson
> On 29 Sep 2023, at 06:51, Peter Smith  wrote:

> I found a link in the docs that referred to the stats "views" section,
> instead of the more relevant (IMO)  stats "functions" section.

Agreed.  This link was added in 2007 (in 7d3b7011b), and back in 7.3/7.4 days
the functions were listed in the same section as the views, so the anchor was
at the time pointing to the right section.  In 2012 it was given its own
section (in aebe989477) but the link wasn't updated.

Thanks for the patch, I'll go ahead with it.

--
Daniel Gustafsson





Re: wal recycling problem

2023-09-29 Thread Fabrice Chapuis
Yes, barman replication can keep up with the primary; the total size of WAL
segments stays under max_wal_size (24GB in our configuration).

Here is the pg_replication_slots view:

barman_ge  physical  f  t  39409  1EE2/4900  reserved  f
barman_be  physical  f  t  39434  1EE2/3D00  reserved  f

On the other hand, there are 2 slots for logical replication which display
status extended. I don't understand why, given that the confirmed_flush_lsn
field is up to date. The restart_lsn remains frozen; for what reason?

pgoutput │ logical │ 2667915 │ db019a00 │ f │ t │ 1880162 │ │ 68512101 │ 1ECA/37C3F1B8 │ 1EE2/4D6BDCF8 │ extended │ │ f │
pgoutput │ logical │ 2668584 │ db038a00 │ f │ t │ 363230 │ │ 68512101 │ 1ECA/37C3F1B8 │ 1EE2/4D6BDCF8 │ extended │ │ f │
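
For comparison, a query along these lines (a sketch) shows how far each slot's
restart_lsn and confirmed_flush_lsn lag behind the current WAL position:

SELECT slot_name, slot_type, active, wal_status,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS restart_lag,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) AS confirmed_flush_lag
FROM pg_replication_slots;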

Regards
Fabrice

On Thu, Sep 28, 2023 at 7:59 PM Christoph Moench-Tegeder 
wrote:

> ## Fabrice Chapuis (fabrice636...@gmail.com):
>
> > We have a cluster of 2 members (1 primary and 1 standby) with Postgres
> > version 14.9 and 2 barman server, slots are only configured for barman,
> > barman is version 3.7.
>
> The obvious question here is: can both of those barmans keep up with
> your database, or are you seeing WAL retention due to exactly these
> replication slots? (Check pg_replication_slots).
>
> Regards,
> Christoph
>
> --
> Spare Space
>


pg_resetwal regression: could not upgrade after 1d863c2504

2023-09-29 Thread Hayato Kuroda (Fujitsu)
Dear hackers,
(CC: Peter Eisentraut - committer of the problematic commit)

While developing a pg_upgrade patch, I found a candidate regression in
pg_resetwal.
It might have been caused by 1d863c2504.

Is it really a regression, or am I missing something?

# Phenomenon

pg_resetwal cannot be executed with a relative path. It worked at 7273945,
but not at 1d863.


At 1d863:

```
$ pg_resetwal -n data_N1/
pg_resetwal: error: could not read permissions of directory "data_N1/": No such 
file or directory
```

At 7273945:

```
$ pg_resetwal -n data_N1/
Current pg_control values:

pg_control version number:1300
Catalog version number:   202309251
...
```

# Environment

The attached script was executed on RHEL 7.9; gcc was 8.3.1.
I used the meson build system with the following options:

meson setup -Dcassert=true -Ddebug=true -Dc_args="-ggdb -O0 -g3 
-fno-omit-frame-pointer"

# My analysis

I found that the part of GetDataDirectoryCreatePerm() below returns false, which
was the cause.

```
/*
 * If an error occurs getting the mode then return false.  The caller is
 * responsible for generating an error, if appropriate, indicating that 
we
 * were unable to access the data directory.
 */
if (stat(dataDir, &statBuf) == -1)
return false;
```

Also, I found that the DataDir value in main() holds a relative path.
Because of that, the subsequent stat() may not be able to find the given
location, because the process has already changed into the data directory.

```
(gdb) break chdir
Breakpoint 1 at 0x4016f0
(gdb) run -n data_N1

...
Breakpoint 1, 0x778e1390 in chdir () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.17-326.el7_9.x86_64
(gdb) print DataDir
$1 = 0x7fffe25c "data_N1"
(gdb) frame 1
#1  0x004028d7 in main (argc=3, argv=0x7fffdf58) at 
../postgres/src/bin/pg_resetwal/pg_resetwal.c:348
348 if (chdir(DataDir) < 0)
(gdb) print DataDir
$2 = 0x7fffe25c "data_N1"
```

# How to fix

One alternative approach is to call chdir() several times. PSA the patch.
(I'm not sure the commit should be reverted)

# Appendix - How did I find it?

Originally, I found the issue when the attached script was executed.
It creates two clusters and runs pg_upgrade, but fails with the following
output.
(I also attached the whole output; please see result_*.out.)

```
Performing Consistency Checks
-
Checking cluster versions ok
pg_resetwal: error: could not read permissions of directory "data_N1": No such 
file or directory
```

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



test.sh
Description: test.sh


result_7273945ca.out
Description: result_7273945ca.out


result_1d863c2504.out
Description: result_1d863c2504.out


fix.patch
Description: fix.patch


Re: pg_resetwal tests, logging, and docs update

2023-09-29 Thread Peter Eisentraut

On 26.09.23 17:19, Aleksander Alekseev wrote:

>> Attached is an updated patch set where I have split the changes into
>> smaller pieces.  The last two patches still have some open questions
>> about what certain constants mean etc.  The other patches should be settled.
>
> The patches 0001..0005 seem to be ready and rather independent. I
> suggest merging them and continue discussing the rest of the patches.


I have committed 0001..0005, and also posted a separate patch to discuss 
and correct the behavior of the -c option.  I expect that we will carry 
over this patch set to the next commit fest.






Re: [PoC] pg_upgrade: allow to upgrade publisher node

2023-09-29 Thread Bharath Rupireddy
On Thu, Sep 28, 2023 at 6:08 PM Zhijie Hou (Fujitsu)
 wrote:
>
> On Thursday, September 28, 2023 5:32 PM Bharath Rupireddy 
>  wrote:
>
> > Perhaps, a function in logical/decode.c returning the WAL record as valid 
> > if the
> > record type is any of the above. A note in replication/decode.h and/or
> > access/rmgrlist.h asking rmgr adders to categorize the WAL record type in 
> > the
> > new function based on its decoding operation might help with future new WAL
> > record type additions.
> >
> > Thoughts?
>
> I think this approach can work, but I am not sure if it's better than other
> approaches. Mainly because it has almost the same maintenance burden as the
> current approach, i.e. we need to verify and update the check function each
> time we add a new WAL record type.

I think that's not a big problem if we have comments in
replication/decode.h, access/rmgrlist.h, docs to categorize the new
WAL records as decodable. Currently, the WAL record types adders will
have to do certain things based on notes in comments or docs anyways.

Another idea to enforce categorizing decodability of WAL records is to
have a new RMGR API rm_is_record_decodable or such, the RMGR
implementers will then add respective functions returning true/false
if a given WAL record is decodable or not:
void    (*rm_decode) (struct LogicalDecodingContext *ctx,
                      struct XLogRecordBuffer *buf);
bool    (*rm_is_record_decodable) (uint8 type);
} RmgrData;

PG_RMGR(RM_XLOG_ID, "XLOG", xlog_redo, xlog_desc, xlog_identify, NULL,
NULL, NULL, xlog_is_record_decodable), then the
xlog_is_record_decodable can look something like [1].

This approach can also enforce/help custom RMGR implementers to define
the decodability of the WAL records.

> Apart from the WAL scan approach, we also considered an alternative approach that
> does not impose an additional maintenance burden and could potentially be less
> complex.  For example, we can add a new field in pg_controldata to record the
> last checkpoint that happens in non-upgrade mode, so that we can compare the
> slot's confirmed_flush_lsn with this value, If they are the same, the WAL
> should have been consumed otherwise we disallow upgrading this slot. I would
> appreciate if you can share your thought about this approach.

I read this 
https://www.postgresql.org/message-id/CAA4eK1JVKZGRHLOEotWi%2Be%2B09jucNedqpkkc-Do4dh5FTAU%2B5w%40mail.gmail.com
and I agree with the concern about adding a new field to pg_controldata
just for this purpose and spreading IsBinaryUpgrade code into the
checkpointer. Another concern for me with the new-field-in-pg_controldata
approach is that it makes it hard for this patch to support back branches.
Therefore, -1 for this approach from me.

> And if we decided to use the WAL scan approach, instead of checking each
> record, we could directly check whether the WAL records can be decoded into
> meaningful results by using test_decoding to decode them. This approach also
> doesn't add a new maintenance burden, as we need to update test_decoding
> anyway if the decoding logic for any record changes. This was also mentioned
> in [1].
>
> What do you think ?
>
> [1] 
> https://www.postgresql.org/message-id/OS0PR01MB5716FC0F814D78E82E4CC3B894C3A%40OS0PR01MB5716.jpnprd01.prod.outlook.com

-1 for decoding the WAL with test_decoding; I don't think it's a great
idea to create temporary slots and launch walsenders during upgrade.

IMO, the WAL scanning approach looks better. However, if we were to
optimize it so that WAL is not scanned once per replication slot
confirmed_flush_lsn (CFL), we could start at the lowest CFL (the minimum
across all slots) and scan to the end of WAL. The
binary_upgrade_validate_wal_logical_end function can return an array
of LSNs at which decodable WAL records are found. Then, using the CFLs of
all the other slots and this array, determine whether the slots have
unconsumed WAL. Following is an illustration of this idea:

1. Slots s1, s2, s3, s4, s5 with CFLs 100, 90, 110, 70, 80 respectively.
2. The minimum of all CFLs is 70, for slot s4.
3. Start scanning WAL from the minimum CFL 70 (slot s4); say decodable
(unconsumed) records are found at LSNs {85, 89}.
4. Now, without scanning WAL again for the rest of the slots, determine
whether they have unconsumed WAL.
5.1. CFL of slot s1 is 100, and there is no unconsumed WAL at or after LSN
100 - per the array of unconsumed WAL LSNs {85, 89}.
5.2. CFL of slot s2 is 90, and there is no unconsumed WAL at or after LSN
90 - per the array of unconsumed WAL LSNs {85, 89}.
5.3. CFL of slot s3 is 110, and there is no unconsumed WAL at or after LSN
110 - per the array of unconsumed WAL LSNs {85, 89}.
5.4. CFL of slot s4 is 70, and there is unconsumed WAL at or after LSN
70 - per the array of unconsumed WAL LSNs {85, 89}.
5.5. CFL of slot s5 is 80, and there is unconsumed WAL at or after LSN
80 - per the array of unconsumed WAL LSNs {85, 89}.

With this approach, the WAL is scanned only once, as opposed to once per
slot with the approach the patch currently implements.
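
A minimal standalone sketch of steps 4-5 (the type and function names are
illustrative, not the patch's actual API):

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t LSN;           /* stand-in for XLogRecPtr */

/*
 * After a single WAL scan starting at the minimum CFL has collected the LSNs
 * of decodable records, each slot can be checked against that array without
 * rescanning WAL: any decodable record at or after the slot's CFL means the
 * slot still has unconsumed WAL.
 */
static bool
slot_has_unconsumed_wal(LSN confirmed_flush_lsn,
                        const LSN *decodable_lsns, int ndecodable)
{
    for (int i = 0; i < ndecodable; i++)
    {
        if (decodable_lsns[i] >= confirmed_flush_lsn)
            return true;
    }
    return false;
}

int
main(void)
{
    LSN decodable[] = {85, 89};             /* found by the single scan */
    LSN cfls[] = {100, 90, 110, 70, 80};    /* slots s1..s5 */

    for (int i = 0; i < 5; i++)
        printf("s%d: %s\n", i + 1,
               slot_has_unconsumed_wal(cfls[i], decodable, 2)
               ? "has unconsumed WAL" : "ok");
    return 0;
}
```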

Thoughts?

[1]
bool

Re: Add support for AT LOCAL

2023-09-29 Thread Michael Paquier
On Sat, Sep 23, 2023 at 12:54:01AM +0200, Vik Fearing wrote:
> On 9/22/23 23:46, cary huang wrote:
>> I think this feature can be a useful addition in dealing with time
>> zones. I have applied and tried out the patch; the feature works as
>> described and seems promising. The compilation failure was probably
>> reported by CirrusCI when compiling on different platforms. I have run
>> the latest patch in my own Cirrus CI environment and everything checked
>> out fine.
> 
> Thank you for reviewing!

+| a_expr AT LOCAL            %prec AT
+    {
+        /* Use the value of the session's time zone */
+        FuncCall   *tz = makeFuncCall(SystemFuncName("current_setting"),
+                                      list_make1(makeStringConst("TimeZone", -1)),
+                                      COERCE_SQL_SYNTAX,
+                                      -1);
+        $$ = (Node *) makeFuncCall(SystemFuncName("timezone"),
+                                   list_make2(tz, $1),
+                                   COERCE_SQL_SYNTAX,
+                                   @2);

As the deparsing code introduced by this patch shows, this leads
to a lot of extra complexity.  And it can actually be quite
expensive as well, with these two layers of function calls.  Note also
that, in comparison to SQLValueFunctions, COERCE_SQL_SYNTAX does less
inlining.  So here is my question: why doesn't this just use
one underlying function to do the job?
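
For instance (purely as an illustration of the question; the one-argument
timezone() assumed here does not exist in the catalogs), the production could
emit a single call and let that function look up the session's TimeZone
itself:

```c
/* hypothetical single-function variant of the AT LOCAL production */
| a_expr AT LOCAL            %prec AT
    {
        $$ = (Node *) makeFuncCall(SystemFuncName("timezone"),
                                   list_make1($1),
                                   COERCE_SQL_SYNTAX,
                                   @2);
    }
```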
--
Michael




Re: Index range search optimization

2023-09-29 Thread Alexander Korotkov
Hi, Peter.

On Fri, Sep 29, 2023 at 4:57 AM Peter Geoghegan  wrote:
> On Fri, Sep 22, 2023 at 7:24 AM Alexander Korotkov  
> wrote:
> > The thing is that NULLs could appear in the middle of matching values.
> >
> > # WITH t (a, b) AS (VALUES ('a', 'b'), ('a', NULL), ('b', 'a'))
> > SELECT a, b, (a, b) > ('a', 'a') FROM t ORDER BY (a, b);
> >  a |  b   | ?column?
> > ---+--+--
> >  a | b| t
> >  a | NULL | NULL
> >  b | a| t
> > (3 rows)
> >
> > So we can't just skip the row comparison operator, because we can meet
> > NULL at any place.
>
> But why would SK_ROW_HEADER be any different? Is it related to this
> existing case inside _bt_check_rowcompare()?:
>
> if (subkey->sk_flags & SK_ISNULL)
> {
> /*
>  * Unlike the simple-scankey case, this isn't a disallowed case.
>  * But it can never match.  If all the earlier row comparison
>  * columns are required for the scan direction, we can stop the
>  * scan, because there can't be another tuple that will succeed.
>  */
> if (subkey != (ScanKey) DatumGetPointer(skey->sk_argument))
> subkey--;
> if ((subkey->sk_flags & SK_BT_REQFWD) &&
> ScanDirectionIsForward(dir))
> *continuescan = false;
> else if ((subkey->sk_flags & SK_BT_REQBKWD) &&
>  ScanDirectionIsBackward(dir))
> *continuescan = false;
> return false;
> }

Yes, exactly. Our row comparison operators don't match if there is a NULL
anywhere inside the row, but such rows can still appear within the matching
range.

> I noticed that you're not initializing so->firstPage correctly for the
> _bt_endpoint() path, which is used when the initial position of the
> scan is either the leftmost or rightmost page. That is, it's possible
> to reach _bt_readpage() without having reached the point in
> _bt_first() where you initialize so->firstPage to "true".

Good catch, thank you!

> It would probably make sense if the flag was initialized to "false" in
> the same way as most other scan state is already, somewhere in
> nbtree.c. Probably in btrescan().

Makes sense; the initialization has been added.
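
i.e. something along these lines in btrescan(), next to the other per-scan
resets (a sketch, assuming the flag lives in the BTScanOpaque):

```c
so->firstPage = false;      /* reset as part of per-scan state setup */
```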

--
Regards,
Alexander Korotkov


0001-Skip-checking-of-scan-keys-required-for-direction-v8.patch
Description: Binary data


Re: Add a new BGWORKER_BYPASS_ROLELOGINCHECK flag

2023-09-29 Thread Michael Paquier
On Thu, Sep 28, 2023 at 02:37:02PM +0200, Drouvot, Bertrand wrote:
> This patch allows the role provided in BackgroundWorkerInitializeConnection()
> and BackgroundWorkerInitializeConnectionByOid() to lack login authorization.

Interesting.  Yes, there would be use cases for that, I suppose.

> +  uint32 flags,
>char *out_dbname)
>  {

This may be better expressed as a bits32 for the flags.

> +# Ask the background workers to connect with this role with the flag in 
> place.
> +$node->append_conf(
> +'postgresql.conf', q{
> +worker_spi.role = 'nologrole'
> +worker_spi.bypass_login_check = true
> +});
> +$node->restart;
> +
> +# An error message should not be issued.
> +ok( !$node->log_contains(
> +"role \"nologrole\" is not permitted to log in", $log_start),
> +"nologrole allowed to connect if BGWORKER_BYPASS_ROLELOGINCHECK is set");
> +
>  done_testing();

It would be cheaper to use a dynamic background worker for such tests.
Something that I've been tempted to do in this module is to extend the
amount of data that's given to bgw_main_arg when launching a worker
with worker_spi_launch().  How about extending the SQL function so that
it is possible to give as input a role name (or a regrole), a database
name (or a database OID), and a text[] for the flags?  This would
require a bit more refactoring, but it would be beneficial to show
how one can pass down a full structure from the registration to the
main() routine.  On top of that, it would make the new GUCs
worker_spi.bypass_login_check and worker_spi.role unnecessary.
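
A hypothetical shape for such an extended SQL function (parameter names and
defaults are illustrative only, not a concrete proposal):

```sql
CREATE OR REPLACE FUNCTION worker_spi_launch(index int4,
                                             dbname text DEFAULT NULL,
                                             rolename text DEFAULT NULL,
                                             flags text[] DEFAULT '{}')
RETURNS int4
AS 'MODULE_PATHNAME'
LANGUAGE C;
```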

> +# return the size of logfile of $node in bytes
> +sub get_log_size
> +{
> +my ($node) = @_;
> +
> +return (stat $node->logfile)[7];
> +}

Just use -s here.  See other tests that want to check the contents of
the logs from an offset.
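
i.e. drop the helper and write something like (sketch):

```perl
# -s gives the file size in bytes, i.e. the current end of the server log
my $log_start = -s $node->logfile;
```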

> - * Allow bypassing datallowconn restrictions when connecting to database
> + * Allow bypassing datallowconn restrictions and login check when connecting
> + * to database
>   */
> -#define BGWORKER_BYPASS_ALLOWCONN 1
> +#define BGWORKER_BYPASS_ALLOWCONN 0x0001
> +#define BGWORKER_BYPASS_ROLELOGINCHECK 0x0002

The structure of the patch is inconsistent.  These flags are defined in
bgworker.h, but they are also used by InitPostgres().  Perhaps a second
boolean argument to InitPostgres() would be OK, rather than a second set of
flags mapping to the bgworker set.
--
Michael

