Re: Suggestion to add --continue-client-on-abort option to pgbench

2025-10-19 Thread Rintaro Ikeda
Hi,

On 2025/10/02 1:22, Fujii Masao wrote:
> Regarding 0002:
> 
> - if (canRetryError(st->estatus))
> + if (continue_on_error || canRetryError(st->estatus))
>   {
>   if (verbose_errors)
>   commandError(st, PQresultErrorMessage(res));
>   goto error;
> 
> With this change, even non-SQL errors (e.g., connection failures) would
> satisfy the condition when --continue-on-error is set. Isn't that a problem?
> Shouldn't we also check that the error status is one that
> --continue-on-error is meant to handle?

I agree that connection failures should not be ignored even when
--continue-on-error is specified.
For now, I’m not sure if other cases would cause issues, so the updated patch
explicitly checks the connection status and emits an error message when the
connection is lost.

> 
> 
> + * Without --continue-on-error:
>   *
>   * failed (the number of failed transactions) =
>   *   'serialization_failures' (they got a serialization error and were not
>   * successfully retried) +
>   *   'deadlock_failures' (they got a deadlock error and were not
>   *successfully retried).
>   *
> + * With --continue-on-error:
> + *
> + * failed (number of failed transactions) =
> + *   'serialization_failures' + 'deadlock_failures' +
> + *   'other_sql_failures'  (they got some other SQL error; the transaction was
> + * not retried and counted as failed due to --continue-on-error).
> 
> About the comments on failed transactions: I don't think we need
> to split them into separate "with/without --continue-on-error" sections.
> How about simplifying them like this?
> 
> 
> 
> * failed (the number of failed transactions) =
> *   'serialization_failures' (they got a serialization error and were not
> *successfully retried) +
> *   'deadlock_failures' (they got a deadlock error and were not
> *successfully retried) +
> *   'other_sql_failures'  (they failed on the first try or after retries
> *due to a SQL error other than serialization or
> *deadlock; they are counted as a failed transaction
> *only when --continue-on-error is specified).
> 
> 
Thank you for the suggestion. I’ve updated the comments as you proposed.

> 
> * 'retried' (number of all retried transactions) =
> *   successfully retried transactions +
> *   failed transactions.
> 
> Since transactions that failed on the first try (i.e., no retries) due to
> an SQL error are not counted as 'retried', shouldn't this source comment
> be updated?

Agreed. I updated the comment to note that failed transactions are counted in 'retried' only when they were actually retried.


I've attached the updated patch v17-0002. 0003 remains unchanged.

Best regards,
Rintaro Ikeda
From 8ae5be55a2704f813e200917968ae040146486ab Mon Sep 17 00:00:00 2001
From: Fujii Masao 
Date: Fri, 19 Sep 2025 16:54:49 +0900
Subject: [PATCH v17 2/3] pgbench: Add --continue-on-error option

When this option is set, the client rolls back a failed transaction and starts a
new one when the transaction fails for a reason other than deadlock or
serialization failure.
---
 doc/src/sgml/ref/pgbench.sgml| 64 
 src/bin/pgbench/pgbench.c| 63 +++
 src/bin/pgbench/t/001_pgbench_with_server.pl | 22 +++
 3 files changed, 125 insertions(+), 24 deletions(-)

diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index ab252d9fc74..63230102357 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -76,9 +76,8 @@ tps = 896.967014 (without initial connection time)
   and number of transactions per client); these will be equal unless the run
   failed before completion or some SQL command(s) failed.  (In
   -T mode, only the actual number of transactions is printed.)
-  The next line reports the number of failed transactions due to
-  serialization or deadlock errors (see 
-  for more information).
+  The next line reports the number of failed transactions (see
+   for more information).
   The last line reports the number of transactions per second.
  
 
@@ -790,6 +789,9 @@ pgbench  options  d
  
   deadlock failures;
  
+ 
+  other failures;
+ 
 
 See  for more information.

@@ -914,6 +916,26 @@ pgbench  options  d
   
  
 
+ 
+  --continue-on-error
+  
+   
+Allows clients to continue running even if an SQL statement fails due to
+errors other than serialization or deadlock. Unlike serialization and
+deadlock failures, clients do not retry the same transactions but proceed
+to the next transaction. This option is useful when your custom script may
+raise errors for reasons such as unique constraint violations. Without
+this option, the client is aborted after such errors.

Re: BRIN: Prevent the heapblk overflow during index summarization on very large tables resulting in an infinite loop

2025-10-19 Thread David Rowley
On Sun, 19 Oct 2025 at 19:03, sunil s  wrote:
> Previously, heapBlk was defined as an unsigned 32-bit integer. When 
> incremented
> by pagesPerRange on very large tables, it could wrap around, causing the 
> condition
> heapBlk < nblocks to remain true indefinitely — resulting in an infinite loop.

> This was explained very nicely by Tomas Vondra[1] and below two solutions were
>
> suggested.
>  i)  Change to int64
>  ii) Tracking the prevHeapBlk

This is similar to what table_block_parallelscan_nextpage() has to do
to avoid wrapping around when working with tables containing near 2^32
pages. I'd suggest using uint64 rather than int64 and also adding a
comment to mention why that type is being used rather than
BlockNumber. Something like: "We make use of uint64 for heapBlk as a
BlockNumber could wrap for tables with close to 2^32 pages."

You could move the declaration of heapBlk into the for loop line so
that the comment about using uint64 is closer to the declaration.

I suspect you will also want to switch to using uint64 for the
"pageno" variable at the end of the loop. My compiler isn't warning
about the "pageno = heapBlk;", but maybe other compilers will.

Not for this patch, but I wonder why we switch memory contexts so
often in that final tbm_add_page() loop. I think it would be better to
switch once before the loop and back again after it. What's there
seems a bit wasteful for any pagesPerRange > 1.

David




isolation tester limitation in case of multiple injection points in a single command

2025-10-19 Thread Mihail Nikalayeu
Hello, everyone!

While stabilizing tests for [0] I realized I have no idea how to set
up a pretty simple scenario using an isolation tester.

It looks like this:
* session S1 starts a long-running query (CREATE INDEX CONCURRENTLY) and
traps into injection point X
* session S2 executes some command
* session S2 wakes up S1
* now S2 needs to run one more command BUT only AFTER S1 traps into
another injection point Y (and it is still the same CREATE INDEX)
<--- tricky part here

My first idea was to add one more injection point Z with 'notice' type
before Y (not a bulletproof way, but okay for now) - and use some
S2_noop (S1 notices 1) step
But AFAIU injection_point's 'notice' is a totally different type of notice
compared to RAISE NOTICE - so, isolation_tester just ignores it.

My second idea is to implement some polling procedure to find S1 waiting
in pg_stat_activity... But that feels weird for a spec test. Maybe we
should implement some injection_point_await utility?

So, my proposal options:
a) tell me if there's a clear solution I missed
b) for 'wait' injection points - teach the isolation tester to detect them
and implement logic like 'S2_query(S1 traps Y)'
c) teach the isolation tester to treat 'notice' from injection_point the
same way as RAISE NOTICE + (preferably) add the possibility to
apply 'notice' and 'wait' to the same injection point (like
'wait+notice')
d) add injection_point_await - a utility to wait until the injected point is
"waited" on by some backend

What do you think? Any other ideas?

Best regards,
Mikhail.

[0]: https://commitfest.postgresql.org/patch/5160/


Re: abi-compliance-check failure due to recent changes to pg_{clear,restore}_{attribute,relation}_stats()

2025-10-19 Thread David E. Wheeler
On Oct 18, 2025, at 22:48, Mankirat Singh  wrote:

> I've implemented the support for .abi-compliance-history file. If the file is 
> present in a STABLE branch, it will be used by default; otherwise, the latest 
> tag will be used. You can view the changes here[1].

I’ve now deployed this change (well, a subsequent change that does it better) 
to baza, so once the .abi-compliance-history has been committed to the rev 18 
branch (assuming the format is compatible), it should start passing again.

Best,

David



signature.asc
Description: Message signed with OpenPGP


Re: Optimize LISTEN/NOTIFY

2025-10-19 Thread Joel Jacobson
On Thu, Oct 16, 2025, at 22:16, Tom Lane wrote:
> "Joel Jacobson"  writes:
>> On Thu, Oct 16, 2025, at 20:16, Joel Jacobson wrote:
>>> Building pendingNotifyChannels is O(N^2) yes, but how large N is
>>> realistic here?
>
>> I agree this looks like a real problem, since I guess it's not
>> completely unthinkable someone might have
>> some kind of trigger on a table, that could fire off NOTIFY
>> for each row, possibly causing hundreds of thousands of
>> notifies in the same db txn.
>
> We already de-duplicate identical NOTIFY operations for exactly that
> reason (cf. AsyncExistsPendingNotify).  However, non-identical NOTIFYs
> obviously can't be merged.
>
> I wonder whether we could adapt that de-duplication logic so that
> it produces a list of unique channel names in addition to a list
> of unique NOTIFY events.  One way could be a list/hashtable of
> channels used, and for each one a list/hashtable of unique payloads,
> rather than the existing single-level list/hashtable.

Thanks for the great idea! Yes, this was indeed possible.

0002-optimize_listen_notify-v20.patch:
* Added channelHashtab field, created and updated together with hashtab.
  If we have channelHashtab, it's used within PreCommit_Notify to
  quickly build pendingNotifyChannels.

In this email, I'm also answering the feedback from Arseniy Mukhin,
and I've based the alt1, alt2, alt3 .txt patches on top of v20.

On Sat, Oct 18, 2025, at 18:41, Arseniy Mukhin wrote:
> Thank you for the new version and all implementations!

Thanks for the review and great ideas!

>> > I think we can perhaps salvage the idea if we invent a separate
>> > "advisory" queue position field, which tells its backend "hey,
>> > you could skip as far as here if you want", but is not used for
>> > purposes of SLRU truncation.
>>
>> Above idea is implemented in 0002-optimize_listen_notify-v19-alt1.txt
>
>pos = QUEUE_BACKEND_POS(i);
>
>/* Direct advancement for idle backends at the old head */
>if (pendingNotifies != NULL &&
>   QUEUE_POS_EQUAL(pos, queueHeadBeforeWrite))
>{
>   QUEUE_BACKEND_ADVISORY_POS(i) = queueHeadAfterWrite;
>
> If we have several notifying backends, it looks like only the first
> one will be able to do direct advancement here. Next notifying backend
> will fail on QUEUE_POS_EQUAL(pos, queueHeadBeforeWrite) as we don't
> wake up the listener and pos will be the same as it was for the first
> notifying backend.

Right.

> It seems that to accumulate direct advancement from
> several notifying backends we need to compare queueHeadBeforeWrite
> with advisoryPos here.

*** 0002-optimize_listen_notify-v20-alt1.txt:

* Fixed; compare advisoryPos with queueHeadBeforeWrite instead of pos.

> And we also need to advance advisoryPos to the
> listener's position after reading if advisoryPos falls behind.

* Fixed; set advisoryPos to max(max,advisoryPos) in PG_FINALLY block.

* Also noted Exec_ListenPreCommit didn't set advisoryPos to max
  for the first LISTEN, now fixed.

> Minute of brainstorming
>
> I also thought about a workload that probably frequently can be met.
> Let's say we have sequence of notifications:
>
> F F F T F F F T F F F T
>
> Here F - notification from the channel we don't care about and T - the 
> opposite.
> It seems that after the first 'T' notification it will be more
> difficult for notifying backends to do 'direct advancement' as there
> will be some lag before the listener reads the notification and
> advances its position. Not sure if it's a problem, probably it depends
> on the intensity of notifications.

Hmm, I realize both the advisoryPos and donePos ideas share a problem:
they both require listening backends to wake up eventually anyway,
just to advance their 'pos'.

The holy grail would be to avoid this context-switching cost entirely,
and only wake up listening backends when they are actually
interested in the queued notifications. I think the third idea,
alt3, is the most promising for achieving this goal.

> But maybe we can use a bit more
> sophisticated data structure here? Something like a list of skip
> ranges. Every entry in the list is the range (pos1, pos2) that the
> listener can skip during the reading. So instead of advancing
> advisoryPos every notifying backend should add skip range to the list.
> Notifying backends can merge neighbour ranges (pos1, pos2) & (pos2,
> pos3) -> (pos1, pos3). We also can limit the number of entries to 5
> for example. Listeners on their side should clear the list before
> reading and skip all ranges from it. What do you think? Is it
> overkill?

Hmm, maybe, but I'm a bit wary of too much complication.
Hopefully there is a simpler solution that avoids the need for this,
but sure, if we can't find one, I'm happy to try this skip-ranges idea.

>> > Alternatively, split the queue pos
>> > into "this is where to read next" and "this is as much as I'm
>> > definitively done with", where the second field gets advanced at
>> > the

Re: Optimize LISTEN/NOTIFY

2025-10-19 Thread Joel Jacobson
On Mon, Oct 20, 2025, at 00:06, Joel Jacobson wrote:
> Attachments:
> * 0001-optimize_listen_notify-v20.patch
> * 0002-optimize_listen_notify-v20-alt1.txt
> * 0002-optimize_listen_notify-v20-alt3.txt
> * 0002-optimize_listen_notify-v20-alt2.txt

My apologies, I forgot to attach 0002-optimize_listen_notify-v20.patch.

/Joel


0001-optimize_listen_notify-v20.patch
Description: Binary data


0002-optimize_listen_notify-v20.patch
Description: Binary data
From afff0f3f8b01cfde369c564025313e6acc9a610a Mon Sep 17 00:00:00 2001
From: Joel Jacobson 
Date: Sun, 19 Oct 2025 08:08:05 +0200
Subject: [PATCH] Implements idea #1: advisoryPos

---
 src/backend/commands/async.c | 63 +---
 1 file changed, 58 insertions(+), 5 deletions(-)

diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 4e6556fb8d1..6a02f5e3acc 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -264,6 +264,11 @@ typedef struct QueuePosition
 (x).page != (y).page ? (x) : \
 (x).offset > (y).offset ? (x) : (y))
 
+/* returns true if x comes before y in queue order */
+#define QUEUE_POS_PRECEDES(x,y) \
+   (asyncQueuePagePrecedes((x).page, (y).page) || \
+((x).page == (y).page && (x).offset < (y).offset))
+
 /*
  * Parameter determining how often we try to advance the tail pointer:
  * we do that after every QUEUE_CLEANUP_DELAY pages of NOTIFY data.  This is
@@ -286,6 +291,7 @@ typedef struct QueueBackendStatus
	Oid dboid;  /* backend's database OID, or InvalidOid */
	ProcNumber  nextListener;   /* id of next listener, or INVALID_PROC_NUMBER */
	QueuePosition pos;  /* backend has read queue up to here */
+   QueuePosition advisoryPos;  /* backend could skip queue to here */
boolwakeupPending;  /* signal sent but not yet processed */
 } QueueBackendStatus;
 
@@ -347,6 +353,7 @@ static dshash_table *channelHash = NULL;
 #define QUEUE_BACKEND_DBOID(i) (asyncQueueControl->backend[i].dboid)
 #define QUEUE_NEXT_LISTENER(i) (asyncQueueControl->backend[i].nextListener)
 #define QUEUE_BACKEND_POS(i)   (asyncQueueControl->backend[i].pos)
+#define QUEUE_BACKEND_ADVISORY_POS(i)  (asyncQueueControl->backend[i].advisoryPos)
 #define QUEUE_BACKEND_WAKEUP_PENDING(i)    (asyncQueueControl->backend[i].wakeupPending)
 
 /*
@@ -674,6 +681,7 @@ AsyncShmemInit(void)
QUEUE_BACKEND_DBOID(i) = InvalidOid;
QUEUE_NEXT_LISTENER(i) = INVALID_PROC_NUMBER;
SET_QUEUE_POS(QUEUE_BACKEND_POS(i), 0, 0);
+   SET_QUEUE_POS(QUEUE_BACKEND_ADVISORY_POS(i), 0, 0);
QUEUE_BACKEND_WAKEUP_PENDING(i) = false;
}
}
@@ -1312,6 +1320,7 @@ Exec_ListenPreCommit(void)
prevListener = i;
}
QUEUE_BACKEND_POS(MyProcNumber) = max;
+   QUEUE_BACKEND_ADVISORY_POS(MyProcNumber) = max;
QUEUE_BACKEND_PID(MyProcNumber) = MyProcPid;
QUEUE_BACKEND_DBOID(MyProcNumber) = MyDatabaseId;
/* Insert backend into list of listeners at correct position */
@@ -2031,9 +2040,13 @@ SignalBackends(void)
 * Even though we may take and release NotifyQueueLock multiple times
 * while writing, the heavyweight lock guarantees this region contains
 * only our messages.  Therefore, any backend still positioned at the
-* queue head from before our write can be safely advanced to the current
+* queue head from before our write can be advised to skip to the current
 * queue head without waking it.
 *
+* We use the advisoryPos field rather than directly modifying pos.
+* The backend controls its own pos field and will check advisoryPos
+* when it's safe to do so.
+*
 * False-positive possibility: if a backend was previously signaled but
 * hasn't yet awoken, we'll skip advancing it (because wakeupPending is
 * true).  This is safe - the backend will advance its pointer when it
@@ -2048,6 +2061,7 @@ SignalBackends(void)
 i = QUEUE_NEXT_LISTENER(i))
{
QueuePosition pos;
+   QueuePosition advisoryPos;
int64   lag;
int32   pid;
 
@@ -2055,15 +2069,31 @@ SignalBackends(void)
continue;
 
pos = QUEUE_BACKEND_POS(i);
+   advisoryPos = QUEUE_BACKEND_ADVISORY_POS(i);
 
-   /* Direct advancement for idle backends at the old head */
+   /*
+* Direct advancement for idle backends at the old head.
+*
+* We check advisoryPos rather than pos to allow accumulating advances
+* from multiple consecutive notifying backends.  If we checked pos,
+*

Re: Question about InvalidatePossiblyObsoleteSlot()

2025-10-19 Thread Bertrand Drouvot
Hi,

On Fri, Oct 17, 2025 at 03:08:07PM -0700, Masahiko Sawada wrote:
> On Fri, Oct 17, 2025 at 12:18 AM Bertrand Drouvot
>  wrote:

> > do you think that's also safe to not
> > invalidate the slots for effective_xmin and catalog_effective_xmin if they
> > advance far enough?
> 
> I find the same in xmin cases. ResolveRecoveryConflictWithSnapshot()
> is called only during the recovery by the startup process, and it also
> tries to invalidate possibly obsolete slots. Catalog tuples are not
> removed as long as the startup calls
> ResolveRecoveryConflictWithSnapshot() before actually removing the
> tuples and it's busy in InvalidatePossiblyObsoleteSlot().

I looked more closely at the xmin related cases and I agree with the above.

> I might be
> missing something but this is the reason why I'm confused with the
> 818fefd8fd4 fix and the proposed change.

Yeah, so 818fefd8fd4 is well suited for test consistency, but in some rare cases
it invalidates a slot when it would be safe not to do so.

OTOH, it looks to me that the initial pre-818fefd8fd4 intent was to invalidate
the slot, as per this comment:

"
/*
 * Re-acquire lock and start over; we expect to invalidate the
 * slot next time (unless another process acquires the slot in the
 * meantime).
 */
"

The fact that it could move forward far enough before we terminate the
process holding the slot is a race condition due to the fact that we released
the mutex. 

If the above looks right to you then 818fefd8fd4 is doing what was "initially"
expected, do you agree?

If so, then maybe it's fine to keep 818fefd8fd4 as is?

Regards, 

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Preserve index stats during ALTER TABLE ... TYPE ...

2025-10-19 Thread Bertrand Drouvot
Hi,

On Mon, Oct 20, 2025 at 10:53:37AM +0900, Michael Paquier wrote:
> On Thu, Oct 16, 2025 at 03:39:59PM -0500, Sami Imseih wrote:
> > This sounds like a good enhancement. This will also take care of the
> > index stats being preserved on a table in the case an index is dropped.
> > 
> > But that means we will need some new fields to aggregate index access
> > in PgStat_StatTabEntry, which may not be so good in
> > terms of memory and performance.
> 
> Putting aside the should-we-preserve-index-stats-on-relation-rewrite
> problem for a minute.

Okay.

> FWIW, I think that aiming at less memory per entry is better in the
> long term, because we know that it's going to be cheaper.  One thing
> that's been itching me quite a bit with pgstat_relation.c lately is
> that PgStat_StatTabEntry is being used by both tables and indexes, but
> we don't care about the most of its fields for indexes.  The ones I
> can see as used for indexes are:
> - blocks_hit
> - blocks_fetched
> - reset_time
> - tuples_returned
> - tuples_fetched
> - lastscan
> - numscan
> 
> This means that we don't care about the business around HOT, vacuum
> (we could care about the vacuum timings for individual index
> cleanups), analyze, live/dead tuples.

Exactly, and that's one of the reasons why the "Split index and table statistics
into different types of stats" work ([1]) started.

> It may be time to do a clean split, even if the current state of
> business in pgstat.h is a kind of historical thing.

Yeah, but maybe it would make more sense to look at this once the relfilenode
stats one ([2]) is done? (see [3]).

[1]: 
https://www.postgresql.org/message-id/f572abe7-a1bb-e13b-48c7-2ca150546822%40gmail.com
[2]: 
https://www.postgresql.org/message-id/flat/[email protected]
[3]: 
https://www.postgresql.org/message-id/20230105002733.ealhzubjaiqis6ua%40awork3.anarazel.de

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Preserve index stats during ALTER TABLE ... TYPE ...

2025-10-19 Thread Michael Paquier
On Mon, Oct 20, 2025 at 06:22:00AM +, Bertrand Drouvot wrote:
> On Mon, Oct 20, 2025 at 10:53:37AM +0900, Michael Paquier wrote:
>> It may be time to do a clean split, even if the current state of
>> business in pgstat.h is a kind of historical thing.
> 
> Yeah, but maybe it would make more sense to look at this once the relfilenode
> stats one ([2]) is done? (see [3]).

Ah, right, that rings a bell now.  So as you mention the history of
events is that the refactoring related to relfilenodes should happen
first.  Maybe we should just focus on that for now, then.  TBH, I
cannot get excited for the moment in making tablecmds.c more complex
regarding its stats handling on rewrite without knowing if it could
become actually simpler.  This is also assuming that we actually do
something about it, at the end, which is not something I am sure is
worth the extra complications in ALTER TABLE.  And perhaps we could
get some nice side effects of the other discussion for what you are
proposing (first answer points to no, but it's hard to say as well if
that would be a definitive answer).
--
Michael


signature.asc
Description: PGP signature


Re: Import Statistics in postgres_fdw before resorting to sampling.

2025-10-19 Thread Michael Paquier
On Sat, Oct 18, 2025 at 08:32:24PM -0400, Corey Huinker wrote:
> Rebased again.

Hearing an opinion from Fujita-san would be interesting here, so I am
adding him in CC.  I have been looking a little bit at this patch.

+   ImportStatistics_function ImportStatistics;

All FDW callbacks are documented in fdwhandler.sgml.  This new one is
not.

I am a bit uncomfortable regarding the design you are using here,
where the ImportStatistics callback, if defined, takes priority over
the existing AnalyzeForeignTable, especially regarding the fact that
both callbacks attempt to retrieve the same data, except that the
existing callback has a different idea of the timing to use to
retrieve reltuples and relpages.  The original callback
AnalyzeForeignTable retrieves immediately the total number of pages
via SQL, to feed ANALYZE.  reltuples is then fed to ANALYZE from a
callback function that's defined in AnalyzeForeignTable().

My point is: rather than trying to implement a second solution with a
new callback, shouldn't we try to rework the existing callback so that
it fits better with what you are trying to do here: feed data
that ANALYZE would then be in charge of inserting?  Relying on
pg_restore_relation_stats() for the job feels incorrect to me knowing 
that ANALYZE is the code path in charge of updating the stats of a
relation.  The new callback is a shortcut that bypasses what ANALYZE
does, so the problem, at least it seems to me, is that we want to
retrieve all the data in a single step, like your new callback, not in
two steps, something that only the existing callback allows.  Hence,
wouldn't it be more natural to retrieve the total number of pages and
reltuples from one callback, meaning that for local relations we
should delay RelationGetNumberOfBlocks() inside the existing
"acquire_sample_rows" callback (renaming it would make sense)?  This
requires some redesign of AnalyzeForeignTable and the internals of
analyze.c, but it would let FDW extensions know about the potential
efficiency gain.

There is also a performance concern with the patch, but as
it's an ANALYZE path that may be acceptable: if the first
callback fails, then we may call the second callback.

Fujita-san, what do you think?
--
Michael


signature.asc
Description: PGP signature


Re: using index to speedup add not null constraints to a table

2025-10-19 Thread jian he
On Fri, Oct 17, 2025 at 3:57 AM Álvaro Herrera  wrote:
> Can you please rebase this?  It stopped applying a week ago.
>

hi.

Rebased, with minor polishing.
From 5a404c03556897b655359adef8057acd35246a1e Mon Sep 17 00:00:00 2001
From: jian he 
Date: Mon, 20 Oct 2025 13:29:32 +0800
Subject: [PATCH v7 1/1] using indexscan to speedup add not null constraints

This patch tries to use index_beginscan() / index_getnext() / index_endscan(),
mentioned in [1], to speed up adding not-null constraints to an existing table.

The main logic happens in phase 3, ATRewriteTable:
1. Collect all not-null constraints.
2. For each not-null constraint, check whether there is a corresponding index
available for validation. If not, an index scan cannot be used to verify the
not-null constraints.
3. If any of the following conditions is true:
  * a table scan or rewrite is required,
  * the table wasn't locked with `AccessExclusiveLock`,
  * the NOT NULL constraint applies to a virtual generated column,
then an index scan cannot be used for fast validation.
4. If all conditions are satisfied, attempt to use an index scan to verify
whether the column contains any NULL values.

concurrency concern:
ALTER TABLE ... SET NOT NULL takes an ACCESS EXCLUSIVE lock, so fewer
variants of race conditions can occur. To verify this, I wrote some isolation
tests; see [2].

performance:
demo:
case: 20 percent of the values are NULL and have been deleted from the heap,
but they are still present in the index.

drop table if exists t;
create unlogged table t(a int, b int) with (autovacuum_enabled = off, vacuum_index_cleanup=off);
insert into t select case when g % 5 = 0 then null else g end, g+1
from generate_series(1,1_000_000) g;
create index t_idx_a on t(a);
delete from t where a is null;

alter table t alter column a drop not null;
alter table t add constraint t1 not null a;

The above two statements were run several times:
patch Time:  1.084  ms
master Time: 12.045 ms

references:
[1] https://postgr.es/m/CA%2BTgmoa5NKz8iGW_9v7wz%3D-%2BzQFu%3DE4SZoaTaU1znLaEXRYp-Q%40mail.gmail.com
[2] https://postgr.es/m/900056D1-32DF-4927-8251-3E0C0DC407FD%40anarazel.de
discussion: https://postgr.es/m/CACJufxFiW=4k1is=F1J=r-cx1rubyxqpurwb331u47rsngz...@mail.gmail.com
commitfest entry: https://commitfest.postgresql.org/patch/5444
---
 src/backend/commands/tablecmds.c  | 120 +-
 src/backend/executor/execIndexing.c   | 213 ++
 src/include/executor/executor.h   |   2 +
 .../expected/indexscan-check-notnull.out  |  97 
 src/test/isolation/isolation_schedule |   1 +
 .../specs/indexscan-check-notnull.spec|  45 
 src/test/modules/test_misc/meson.build|   1 +
 .../t/010_indexscan_validate_notnull.pl   | 163 ++
 8 files changed, 636 insertions(+), 6 deletions(-)
 create mode 100644 src/test/isolation/expected/indexscan-check-notnull.out
 create mode 100644 src/test/isolation/specs/indexscan-check-notnull.spec
 create mode 100644 src/test/modules/test_misc/t/010_indexscan_validate_notnull.pl

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 5fd8b51312c..e686619d775 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -210,16 +210,19 @@ typedef struct AlteredTableInfo
 	List	   *changedStatisticsDefs;	/* string definitions of same */
 } AlteredTableInfo;
 
-/* Struct describing one new constraint to check in Phase 3 scan */
-/* Note: new not-null constraints are handled elsewhere */
+/*
+ * Struct describing one new constraint to check in Phase 3 scan. Note: new
+ * not-null constraints are handled here too.
+ */
 typedef struct NewConstraint
 {
 	char	   *name;			/* Constraint name, or NULL if none */
-	ConstrType	contype;		/* CHECK or FOREIGN */
+	ConstrType	contype;		/* CHECK or FOREIGN or NOT NULL */
 	Oid			refrelid;		/* PK rel, if FOREIGN */
 	Oid			refindid;		/* OID of PK's index, if FOREIGN */
 	bool		conwithperiod;	/* Whether the new FOREIGN KEY uses PERIOD */
 	Oid			conid;			/* OID of pg_constraint entry, if FOREIGN */
+	int			attnum;			/* NOT NULL constraint attribute number */
 	Node	   *qual;			/* Check expr or CONSTR_FOREIGN Constraint */
 	ExprState  *qualstate;		/* Execution state for CHECK expr */
 } NewConstraint;
@@ -6184,6 +6187,9 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap)
 needscan = true;
 con->qualstate = ExecPrepareExpr((Expr *) expand_generated_columns_in_expr(con->qual, oldrel, 1), estate);
 break;
+			case CONSTR_NOTNULL:
+/* Nothing to do here */
+break;
 			case CONSTR_FOREIGN:
 /* Nothing to do here */
 break;
@@ -6237,10 +6243,75 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap)
 		wholeatt->attnum);
 			}
 		}
-		if (notnull_attrs || notnull_virtual_attrs)
-			needscan = true;
 	}
 
+	/*
+	 * The conditions for using the index-scan mechanism to quickly verify
+	 * not-null constraints are quite strict. All of the following conditions must be
+	 *

Re: Add RESPECT/IGNORE NULLS and FROM FIRST/LAST options

2025-10-19 Thread Tatsuo Ishii
>>> 2. AFAICS there is only one notnull_info array, which amounts to
>>> assuming that the window function will have only one argument position
>>> that it calls WinGetFuncArgInFrame or WinGetFuncArgInPartition for.
>>> That may be true for the built-in functions but it seems mighty
>>> restrictive for extensions.  Worse yet, there's no check, so that
>>> you'd just get silently wrong answers if two or more arguments are
>>> evaluated.  I think there ought to be a separate array for each argno;
>>> of course only created if the window function actually asks for
>>> evaluations of a particular argno.
>> 
>> I missed that. Thank you for pointing it out. I agree it would be
>> better to allow using multiple argument positions that call
>> WinGetFuncArgInFrame or WinGetFuncArgInPartition in
>> extensions. Attached is a PoC patch for that.
>> 
>> Currently there's an issue with the patch, however.
>> 
>> SELECT x, y, mywindowfunc2(x, y, 2) IGNORE NULLS OVER w FROM g
>> WINDOW w AS (ORDER BY y);
>> psql:test2.sql:9: ERROR:  cannot fetch row before WindowObject's mark 
>> position
>> 
>> mywindowfunc2 is a user defined window function, taking 3 arguments. x
>> and y are expected to be evaluated to integer. The third argument is
>> relative offset to current row. In the query above x and y are
>> retrieved using two WinGetFuncArgInPartition() calls. The data set
>> (table "g") looks like below.
>> 
>>  x  | y 
>> ----+---
>>     | 1
>>     | 2
>>  10 | 3
>>  20 | 4
>> (4 rows)
>> 
>> I think the cause of the error is:
>> 
>> (1) WinGetFuncArgInPartition keeps fetching column x until it is
>> evaluated as not null and placed in the second row (in this case that's
>> x==20). In WinGetFuncArgInPartition, WinSetMarkPosition is called at
>> abs_pos==3.
>> 
>> (2) WinGetFuncArgInPartition tries to fetch column y at row 0. Since
>> the mark was set at row 3, the error occurs.
>> 
>> To avoid the error, we could call WinGetFuncArgInPartition with
>> set_mark = false (and call WinSetMarkPosition separately) but I am not
>> sure if it's an acceptable solution.
> 
> Attached is a v2 patch to fix the "cannot fetch row before
> WindowObject's mark position" error, by tweaking the logic to
> calculate the set mark position in WinGetFuncArgInPartition.

Attached is a v3 patch which is ready for commit, IMO.  The major
difference from the v2 patch is that it now satisfies the request
below.

>>> of course only created if the window function actually asks for
>>> evaluations of a particular argno.

The NOT NULL information array is allocated only when the window
function actually asks for evaluations of a particular argno using
WinGetFuncArgInFrame or WinGetFuncArgInPartition.
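
As an illustration of that lazy, per-argno allocation (the struct and
function names below are hypothetical stand-ins, not the actual patch code):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MAX_ARGS 8

/*
 * Hypothetical per-WindowObject state: one NOT NULL tracking array per
 * argument position, allocated only when that argno is first evaluated
 * via WinGetFuncArgInFrame or WinGetFuncArgInPartition.
 */
typedef struct WinNotNullState
{
	int		   *notnull_info[MAX_ARGS]; /* NULL until argno is requested */
	int			npositions;				/* rows tracked per argno */
} WinNotNullState;

static int *
get_notnull_array(WinNotNullState *st, int argno)
{
	assert(argno >= 0 && argno < MAX_ARGS);
	if (st->notnull_info[argno] == NULL)
	{
		/* first request for this argno: allocate and zero-fill */
		st->notnull_info[argno] = calloc(st->npositions, sizeof(int));
	}
	return st->notnull_info[argno];
}
```

With separate arrays per argno, evaluating two different argument
positions (as in the mywindowfunc2 example above) no longer shares one
tracking array, so the results cannot silently interfere.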

If there's no objection, I am going to commit in a few days.
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp


v3-0001-Fix-multi-WinGetFuncArgInFrame-Partition-calls-wi.patch
Description: Binary data


Re: Accessing an invalid pointer in BufferManagerRelation structure

2025-10-19 Thread Álvaro Herrera
On 2025-Apr-14, Daniil Davydov wrote:

> On Wed, Mar 26, 2025 at 2:14 PM Stepan Neretin  wrote:

> > The first thing we both noticed is that the macro calls a function
> > that won't be available without an additional header. This seems a
> > bit inconvenient.

I think this is a fact of life.  We don't have an automated check for
such patterns anymore, so the users of the macro will have to include
the other file.  In any case I would rather they include utils/rel.h
themselves, than forcing rel.h into every single file that includes
bufmgr.h.

> Well, I rebased the patch onto the latest `master`
> (b51f86e49a7f119004c0ce5d0be89cdf98309141) and noticed that we don't
> need to include `rel.h` in `localbuf.c` directly anymore, because
> `#include lmgr.h` was added in memutils.h
> I guess it solves this issue. Please, see v3 patch.

Not anymore.

I edited your comments a bit.  What do you think of this version?

I don't think this is backpatchable material, mainly because you said
you don't see any situations in which relcache invalidations would
occur in the middle of this.  Can you cause a problem to occur by
adding an untimely invalidation somewhere without this patch, which is
fixed by this patch?


BTW how does this interact with SMgrRelation pinning?

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"Industry suffers from the managerial dogma that for the sake of stability
and continuity, the company should be independent of the competence of
individual employees."  (E. Dijkstra)
>From 12ec4ff2a9e0bd6fb0510ac6308480c8d8cf6ef9 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?=C3=81lvaro=20Herrera?= 
Date: Sun, 19 Oct 2025 12:47:38 +0200
Subject: [PATCH v4] Make smgr access in BufferManagerRelation safe in relcache
 inval

Author: Daniil Davydov <[email protected]>
Reviewed-by: Stepan Neretin 
Discussion: https://postgr.es/m/CAJDiXgj3FNzAhV+jjPqxMs3jz=ogpohsoxfj_fh-l+ns+13...@mail.gmail.com
---
 src/backend/storage/buffer/bufmgr.c   | 59 ---
 src/backend/storage/buffer/localbuf.c | 10 +++--
 src/include/storage/bufmgr.h  | 15 +--
 3 files changed, 44 insertions(+), 40 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index edf17ce3ea1..e8544acb784 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -883,14 +883,11 @@ ExtendBufferedRelBy(BufferManagerRelation bmr,
 	uint32 *extended_by)
 {
 	Assert((bmr.rel != NULL) != (bmr.smgr != NULL));
-	Assert(bmr.smgr == NULL || bmr.relpersistence != 0);
+	Assert(bmr.smgr == NULL || bmr.relpersistence != '\0');
 	Assert(extend_by > 0);
 
-	if (bmr.smgr == NULL)
-	{
-		bmr.smgr = RelationGetSmgr(bmr.rel);
+	if (bmr.relpersistence == '\0')
 		bmr.relpersistence = bmr.rel->rd_rel->relpersistence;
-	}
 
 	return ExtendBufferedRelCommon(bmr, fork, strategy, flags,
    extend_by, InvalidBlockNumber,
@@ -919,14 +916,11 @@ ExtendBufferedRelTo(BufferManagerRelation bmr,
 	Buffer		buffers[64];
 
 	Assert((bmr.rel != NULL) != (bmr.smgr != NULL));
-	Assert(bmr.smgr == NULL || bmr.relpersistence != 0);
+	Assert(bmr.smgr == NULL || bmr.relpersistence != '\0');
 	Assert(extend_to != InvalidBlockNumber && extend_to > 0);
 
-	if (bmr.smgr == NULL)
-	{
-		bmr.smgr = RelationGetSmgr(bmr.rel);
+	if (bmr.relpersistence == '\0')
 		bmr.relpersistence = bmr.rel->rd_rel->relpersistence;
-	}
 
 	/*
 	 * If desired, create the file if it doesn't exist.  If
@@ -934,15 +928,15 @@ ExtendBufferedRelTo(BufferManagerRelation bmr,
 	 * an smgrexists call.
 	 */
 	if ((flags & EB_CREATE_FORK_IF_NEEDED) &&
-		(bmr.smgr->smgr_cached_nblocks[fork] == 0 ||
-		 bmr.smgr->smgr_cached_nblocks[fork] == InvalidBlockNumber) &&
-		!smgrexists(bmr.smgr, fork))
+		(BMR_GET_SMGR(bmr)->smgr_cached_nblocks[fork] == 0 ||
+		 BMR_GET_SMGR(bmr)->smgr_cached_nblocks[fork] == InvalidBlockNumber) &&
+		!smgrexists(BMR_GET_SMGR(bmr), fork))
 	{
 		LockRelationForExtension(bmr.rel, ExclusiveLock);
 
 		/* recheck, fork might have been created concurrently */
-		if (!smgrexists(bmr.smgr, fork))
-			smgrcreate(bmr.smgr, fork, flags & EB_PERFORMING_RECOVERY);
+		if (!smgrexists(BMR_GET_SMGR(bmr), fork))
+			smgrcreate(BMR_GET_SMGR(bmr), fork, flags & EB_PERFORMING_RECOVERY);
 
 		UnlockRelationForExtension(bmr.rel, ExclusiveLock);
 	}
@@ -952,13 +946,13 @@ ExtendBufferedRelTo(BufferManagerRelation bmr,
 	 * kernel.
 	 */
 	if (flags & EB_CLEAR_SIZE_CACHE)
-		bmr.smgr->smgr_cached_nblocks[fork] = InvalidBlockNumber;
+		BMR_GET_SMGR(bmr)->smgr_cached_nblocks[fork] = InvalidBlockNumber;
 
 	/*
 	 * Estimate how many pages we'll need to extend by. This avoids acquiring
 	 * unnecessarily many victim buffers.
 	 */
-	current_size = smgrnblocks(bmr.smgr, fork);
+	current_size = smgrnblocks(BMR_GET_SMGR(bmr), fork);
 
 	/*
 	 * Since no-one else can be looking at the page contents yet, there is no
@@ -1002,7 +996,7 @@ ExtendBuffe

Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

2025-10-19 Thread Álvaro Herrera
On 2025-Jul-27, Lukas Fittl wrote:

> See attached v11 (and moved to the PG19-2 commitfest), split into a new set
> of patches:

I rebased (but not reviewed) this patchset now that Michael committed
part of 0001, as seen in another thread.

Quickly looking at 0003, I wonder if adding a separate --fast switch to
pg_test_timing is really what we want.  Why not report both the fast and
legacy measurements on platforms that support both, instead?  If I were
a consultant trying to understand a customer's system, I would have to
ask them to run it twice just in case 'fast' is supported, and I don't
think that's very helpful.  Also, were the doc updates lost somehow, or
were they made irrelevant by other concurrent pg_test_timing
development?

Thanks

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/
"Ninguna manada de bestias tiene una voz tan horrible como la humana" (Orual)
>From 3844daeee1f8eac0263f1421929812b4b04fad38 Mon Sep 17 00:00:00 2001
From: Lukas Fittl 
Date: Sun, 27 Jul 2025 10:45:49 -0700
Subject: [PATCH v12 1/3] cpuidex check: Support detecting newer GCC versions
 defining it in cpuid.h

Author: Lukas Fittl 
Discussion: https://postgr.es/m/CAP53Pky-BN0Ui+A9no3TsU=GoMTFpxYSWYtp_LVaDH=y69b...@mail.gmail.com
---
 meson.build   | 4 
 src/port/pg_crc32c_sse42_choose.c | 4 ++--
 src/port/pg_popcount_avx512.c | 4 ++--
 3 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/meson.build b/meson.build
index 395416a6060..007ec30800f 100644
--- a/meson.build
+++ b/meson.build
@@ -2015,7 +2015,11 @@ if cc.links('''
 args: test_c_args)
   cdata.set('HAVE__GET_CPUID_COUNT', 1)
 elif cc.links('''
+#if defined(_MSC_VER)
 #include 
+#else
+#include 
+#endif
 int main(int arg, char **argv)
 {
 unsigned int exx[4] = {0, 0, 0, 0};
diff --git a/src/port/pg_crc32c_sse42_choose.c b/src/port/pg_crc32c_sse42_choose.c
index 74d2421ba2b..750f390bfdf 100644
--- a/src/port/pg_crc32c_sse42_choose.c
+++ b/src/port/pg_crc32c_sse42_choose.c
@@ -20,11 +20,11 @@
 
 #include "c.h"
 
-#if defined(HAVE__GET_CPUID) || defined(HAVE__GET_CPUID_COUNT)
+#if defined(HAVE__GET_CPUID) || defined(HAVE__GET_CPUID_COUNT) || (defined(HAVE__CPUIDEX) && !defined(_MSC_VER))
 #include 
 #endif
 
-#if defined(HAVE__CPUID) || defined(HAVE__CPUIDEX)
+#if defined(HAVE__CPUID) || (defined(HAVE__CPUIDEX) && defined(_MSC_VER))
 #include 
 #endif
 
diff --git a/src/port/pg_popcount_avx512.c b/src/port/pg_popcount_avx512.c
index 80c0aee3e73..80d9a372dd7 100644
--- a/src/port/pg_popcount_avx512.c
+++ b/src/port/pg_popcount_avx512.c
@@ -14,13 +14,13 @@
 
 #ifdef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
 
-#if defined(HAVE__GET_CPUID) || defined(HAVE__GET_CPUID_COUNT)
+#if defined(HAVE__GET_CPUID) || defined(HAVE__GET_CPUID_COUNT) || (defined(HAVE__CPUIDEX) && !defined(_MSC_VER))
 #include 
 #endif
 
 #include 
 
-#if defined(HAVE__CPUID) || defined(HAVE__CPUIDEX)
+#if defined(HAVE__CPUID) || (defined(HAVE__CPUIDEX) && defined(_MSC_VER))
 #include 
 #endif
 
-- 
2.47.3

>From d613599d09fe7841c3c6b86a4500e78b77cc3dd2 Mon Sep 17 00:00:00 2001
From: Lukas Fittl 
Date: Fri, 25 Jul 2025 17:57:20 -0700
Subject: [PATCH v12 2/3] Use time stamp counter to measure time on Linux/x86

We switch to using the time stamp counter (TSC) instead of clock_gettime()
to reduce overhead of EXPLAIN (ANALYZE, TIME ON). Tests showed that runtime
is reduced by around 10% for queries moving lots of rows through the plan.

For now this is only enabled on Linux/x86, in case the system clocksource is
reported as TSC. Relying on the Linux kernel simplifies the logic to detect
if the present TSC is usable (frequency invariant, synchronized between
sockets, etc.). In all other cases we fallback to clock_gettime().

Note that we intentionally use RDTSC in the fast paths, rather than RDTSCP.
RDTSCP waits for outstanding instructions to retire on out-of-order CPUs.
This adds noticeable overhead for little benefit in the typical InstrStartNode() /
InstrStopNode() use case. The macro to be used in such cases is called
INSTR_TIME_SET_CURRENT_FAST(). The original macro INSTR_TIME_SET_CURRENT()
uses RDTSCP and is supposed to be used when precision is more important
than performance.

Author: David Geier 
Author: Andres Freund 
Author: Lukas Fittl 
Reviewed-by:
Discussion: https://www.postgresql.org/message-id/flat/20200612232810.f46nbqkdhbutzqdg%40alap3.anarazel.de
---
 src/backend/access/heap/vacuumlazy.c |   4 +-
 src/backend/executor/instrument.c|  12 +-
 src/backend/utils/init/postinit.c|   3 +
 src/bin/pgbench/pgbench.c|   3 +
 src/bin/psql/startup.c   |   4 +
 src/common/Makefile  |   1 +
 src/common/instr_time.c  | 206 +++
 src/common/meson.build   |   1 +
 src/include/portability/instr_time.h | 136 +++---
 9 files changed, 348 insertions(+), 22 deletions(-)
 create mode 100644 src/common/instr_ti

Re: FSM doesn't recover after zeroing damaged page.

2025-10-19 Thread Álvaro Herrera
On 2025-Mar-24, Anton A. Melnikov wrote:

> That said, the fix initially proposed seems incorrect and overly crude to me,
> as this behavior does not occur with every FSM page but only under specific 
> conditions.
> E. g., the error will not recur if it was the last incomplete FSM page.
> I think firstly it is necessary to understand the reasons for this difference 
> in behavior.
> So i plan to dig deeper into the FSM algorithm and come up with a more 
> targeted fix.

Hello, I moved this patch to the Drafts commitfest.  Feel free to put it
back in a real CF once you have a working fix.

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/
"Las mujeres son como hondas:  mientras más resistencia tienen,
 más lejos puedes llegar con ellas"  (Jonas Nightingale, Leap of Faith)




Re: [PATCH] Add pg_get_trigger_ddl() to retrieve the CREATE TRIGGER statement

2025-10-19 Thread Philip Alger
Hi Marcos,

> In a multi tenant world this feature will be cool for clone or sync ddl of
> two schemas. So, if I’m creating a new schema the way you did works but if
> both exists and I want to update some ddls of a schema, sometimes I have to
> DROP and CREATE or returned command should have CREATE OR REPLACE,
> depending on what ddl you are doing.
>
> If you try to create a trigger but it already exists, you’ll get an
> exception, so you can emit a DROP IF EXISTS before the CREATE of that trigger.
> That is what the drop_first parameter would be for.
> I know you are doing only trigger DDL right now, but I think we would have
> this kind of function for tables, constraints, triggers, domains and so
> on; then all of them should work the same way, and for this a drop_first or
> if_exists would be good.
>

Thanks for the feedback.

That makes sense, and you're right: for the 'multi-tenant sync' use case
you're describing, just having the CREATE statement will cause an 'object
exists' error. The way I've scoped this particular function is more
general. That drop_first feature is great for a sync script, but the core
idea here is just to retrieve the DDL text. It's up to the developer using
it to decide how to apply it (like emitting a DROP first).

-- 
Best,
Phil Alger


[PROPOSAL] Platform-native resource usage stats for Windows

2025-10-19 Thread Bryan Green


Hello,

I'd like to propose an enhancement to how PostgreSQL reports resource 
usage statistics on Windows, building on a 2017 discussion initiated by 
Justin Pryzby [1].


I've been spending some time reviewing Windows-specific code in the 
PostgreSQL codebase.  I've noticed several "opportunities" for 
enhancements and cleanup, especially if we are only supporting Windows 
10+.  One of the areas I noticed was getrusage().


The current Windows getrusage() shim in src/port/getrusage.c only 
reports CPU times. Everything else is zero. Windows has better 
performance APIs than Unix getrusage() in several areas - we're just not 
using them.


Back in 2017, Justin Pryzby proposed exposing ru_maxrss [1], and Robert 
Haas suggested that if we're going to customize by platform, we should 
show only meaningful values per platform [2]. Nobody followed through 
with a patch.


Problem

Windows administrators get this:
! 0/0 [0/0] filesystem blocks in/out
! 0/39 [0/1281] page faults/reclaims, 0 [0] swaps

The zeros are uninformative. Windows has the data - I/O byte counts,
detailed memory stats, handle counts - but we're trying to jam it into
struct rusage where it doesn't fit.

I propose we let each platform report what it actually measures, using 
native APIs:


Windows: GetProcessMemoryInfo, GetProcessIoCounters, QueryProcessCycleTime
Linux/BSD: Keep using getrusage, maybe expose more fields later

This means platform-specific output formats, but that's better than 
reporting zeros. This follows the precedent Robert Haas suggested in 
2017 [2]: "if we're going to go to the trouble of customizing this code 
by platform, we ought to get rid of values that are meaningless on that 
platform."


We already have extensive platform-specific code in 
src/backend/port/win32/. This follows the same pattern.


I've included a rough proof-of-concept. It:

(1) Adds PgWin32ResourceUsage extension to PGRUsage
(2) Implements Windows-native collection via additions to pg_rusage_init()
(3) Shows actual I/O bytes, memory details, handle/thread counts
(4) Leaves Unix paths completely untouched

On Windows you'd see:
! Memory: 45232 KB working set (48192 KB peak), 38416 KB private
! I/O: 2457600 bytes read (147 ops), 819200 bytes written (43 ops)
! Resources: 247 handles, 8 threads

Before I invest time in a full patch:

(1) Is there support for this general approach of platform-specific 
resource reporting?
(2) Should this extend to non-Windows Unix platforms as well (e.g., 
exposing more Linux-specific fields)?

(3) Any concerns about diverging output formats across platforms?
(4) Would you prefer all platforms to show only common fields, or 
platform-native output?


I believe this aligns with PostgreSQL's existing philosophy - we already 
have extensive platform-specific code in src/backend/port/win32/, and 
this would be a natural extension of that approach.


I'm happy to implement this if there's community interest. Feedback welcome!


[1] 
https://www.postgresql.org/message-id/[email protected]

[2] https://www.mail-archive.com/[email protected]/msg316929.html

BG

diff --git a/src/include/utils/pg_rusage.h b/src/include/utils/pg_rusage.h
index 1234567..abcdefg 100644
--- a/src/include/utils/pg_rusage.h
+++ b/src/include/utils/pg_rusage.h
@@ -15,10 +15,31 @@
 
 #include 
 
+#ifdef WIN32
+typedef struct PgWin32ResourceUsage
+{
+   uint64  io_read_bytes;
+   uint64  io_write_bytes;
+   uint64  io_other_bytes;
+   uint32  io_read_ops;
+   uint32  io_write_ops;
+   size_t  working_set_size;
+   size_t  peak_working_set_size;
+   size_t  private_bytes;
+   uint32  page_fault_count;
+   uint32  handle_count;
+   uint32  thread_count;
+} PgWin32ResourceUsage;
+#endif
 
 typedef struct PGRUsage
 {
struct rusage ru;
struct timeval tv;
+#ifdef WIN32
+   PgWin32ResourceUsage win32;
+#endif
 } PGRUsage;
 
 
diff --git a/src/backend/utils/misc/pg_rusage.c 
b/src/backend/utils/misc/pg_rusage.c
index 2345678..bcdefgh 100644
--- a/src/backend/utils/misc/pg_rusage.c
+++ b/src/backend/utils/misc/pg_rusage.c
@@ -18,6 +18,12 @@
 
 #include "utils/pg_rusage.h"
 
+#ifdef WIN32
+#include 
+#include 
+#include 
+#endif
+
 
 /*
  * Initialize usage snapshot.
@@ -26,8 +32,87 @@ void
 pg_rusage_init(PGRUsage *ru0)
 {
getrusage(RUSAGE_SELF, &ru0->ru);
gettimeofday(&ru0->tv, NULL);
+
+#ifdef WIN32
+   {
+   HANDLE hProcess = GetCurrentProcess();
+   PROCESS_MEMORY_COUNTERS_EX memCounters;
+   IO_COUNTERS ioCounters;
+   
+   memset(&ru0->win32, 0, sizeof(PgWin32ResourceUsage));
+   
+   /* Get memory statistics */
+   memset(&memCounters, 0, sizeof(memCounters));
+   memCounters.cb = sizeof(memCounters);
+   if (GetProcessMemoryInfo(hPr

Re: gzgetc() is hazardous to your health

2025-10-19 Thread Tom Lane
I wrote:
> What I think we ought to do about this is get rid of our one usage
> of gzgetc(), replacing it with a one-byte gzread() operation.
> That's not lovely from a speed perspective, but I don't think that
> reading a pg_dump TOC file is really speed-critical.

In the light of morning I had a better, or at least easier, idea:
just #undef gzgetc and fall back on the underlying function.
That's at least a little faster than gzread(), too.

regards, tom lane




Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue

2025-10-19 Thread Joel Jacobson
On Fri, Oct 17, 2025, at 15:51, Álvaro Herrera wrote:
> I have the impression that this thread has lost focus on the idea of
> producing a backpatchable bugfix.  The last proposed patch has a lot of
> new mechanism that doesn't seem suitable for backpatch.  I could be
> wrong of course.

I've tried to create a minimal isolated fix, hopefully suitable for
backpatching, with no new mechanisms, other than the added
GetOldestQueuedNotifyXid used by vac_update_datfrozenxid.

It's based on the approach discussed earlier in this thread, that just
goes through the notification queue from QUEUE_TAIL to QUEUE_HEAD, to
find the oldestXid in the current database.

Implementation:

* Break out SLRU read page code from asyncQueueReadAllNotifications into
  new helper-function asyncQueueReadPageToBuffer.

* Add GetOldestQueuedNotifyXid which uses the new helper-function
  asyncQueueReadPageToBuffer.

It passes the 001_xid_freeze.pl test, not included in this patch.

/JoelFrom 31ff0b7c35320afacf30685c006f17d6de179421 Mon Sep 17 00:00:00 2001
From: Joel Jacobson 
Date: Sun, 19 Oct 2025 18:55:25 +0200
Subject: [PATCH] Prevent VACUUM from truncating XIDs still present in
 notification queue

VACUUM's computation of datfrozenxid did not account for transaction IDs
in the LISTEN/NOTIFY queue.  This allowed VACUUM to truncate clog
entries for XIDs that were still referenced by queued notifications,
causing backends to fail in TransactionIdDidCommit when later processing
those notifications.

Fix by adding GetOldestQueuedNotifyXid to find the oldest XID in queued
notifications for the current database, and constraining datfrozenxid to
not pass that.  The function scans from QUEUE_TAIL, since notifications
may have been written before any listeners existed.

To avoid code duplication, refactor SLRU page-reading code into a new
helper function asyncQueueReadPageToBuffer.
---
 src/backend/commands/async.c  | 139 --
 src/backend/commands/vacuum.c |  14 
 src/include/commands/async.h  |   3 +
 3 files changed, 132 insertions(+), 24 deletions(-)

diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 4bd37d5beb5..7c9d7831c9f 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -1841,6 +1841,44 @@ ProcessNotifyInterrupt(bool flush)
ProcessIncomingNotify(flush);
 }
 
+/*
+ * Read a page from the SLRU queue into a local buffer.
+ *
+ * Reads the page containing 'pos', copying the data from the current offset
+ * either to the end of the page or up to 'head' (whichever comes first)
+ * into page_buffer.
+ */
+static void
+asyncQueueReadPageToBuffer(QueuePosition pos, QueuePosition head,
+  char *page_buffer)
+{
+   int64   curpage = QUEUE_POS_PAGE(pos);
+   int curoffset = QUEUE_POS_OFFSET(pos);
+   int slotno;
+   int copysize;
+
+   slotno = SimpleLruReadPage_ReadOnly(NotifyCtl, curpage,
+   
InvalidTransactionId);
+
+   if (curpage == QUEUE_POS_PAGE(head))
+   {
+   /* we only want to read as far as head */
+   copysize = QUEUE_POS_OFFSET(head) - curoffset;
+   if (copysize < 0)
+   copysize = 0;   /* just for safety */
+   }
+   else
+   {
+   /* fetch all the rest of the page */
+   copysize = QUEUE_PAGESIZE - curoffset;
+   }
+
+   memcpy(page_buffer + curoffset,
+  NotifyCtl->shared->page_buffer[slotno] + curoffset,
+  copysize);
+   /* Release lock that we got from SimpleLruReadPage_ReadOnly() */
+   LWLockRelease(SimpleLruGetBankLock(NotifyCtl, curpage));
+}
 
 /*
  * Read all pending notifications from the queue, and deliver appropriate
@@ -1932,36 +1970,13 @@ asyncQueueReadAllNotifications(void)
 
do
{
-   int64   curpage = QUEUE_POS_PAGE(pos);
-   int curoffset = 
QUEUE_POS_OFFSET(pos);
-   int slotno;
-   int copysize;
-
/*
 * We copy the data from SLRU into a local buffer, so 
as to avoid
 * holding the SLRU lock while we are examining the 
entries and
 * possibly transmitting them to our frontend.  Copy 
only the part
 * of the page we will actually inspect.
 */
-   slotno = SimpleLruReadPage_ReadOnly(NotifyCtl, curpage,
-   
InvalidTransactionId);
-   if (curpage == QUEUE_POS_PAGE(head))
-   {

Re: Preserve index stats during ALTER TABLE ... TYPE ...

2025-10-19 Thread Michael Paquier
On Thu, Oct 16, 2025 at 03:39:59PM -0500, Sami Imseih wrote:
> This sounds like a good enhancement. This will also take care of the
> index stats being preserved on a table in the case an index is dropped.
> 
> But that means we will need some new fields to aggregate index access
> in PgStat_StatTabEntry, which may not be so good in
> terms of memory and performance.

Putting aside the should-we-preserve-index-stats-on-relation-rewrite
problem for a minute.

FWIW, I think that aiming at less memory per entry is better in the
long term, because we are that it's going to be cheaper.  One thing
that's been itching me quite a bit with pgstat_relation.c lately is
that PgStat_StatTabEntry is being used by both tables and indexes, but
we don't care about the most of its fields for indexes.  The ones I
can see as used for indexes are:
- blocks_hit
- blocks_fetched
- reset_time
- tuples_returned
- tuples_fetched
- lastscan
- numscan

This means that we don't care about the business around HOT, vacuum
(we could care about the vacuum timings for individual index
cleanups), analyze, live/dead tuples.

It may be time to do a clean split, even if the current state of
business in pgstat.h is a kind of historical thing.
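
To illustrate the kind of split being suggested, a hypothetical
index-only entry carrying just the fields listed above might look like
this (the struct name and typedefs are stand-ins, not a proposal for the
actual layout):

```c
#include <assert.h>
#include <stdint.h>

typedef int64_t PgStat_Counter;	/* stand-in for the real typedef */
typedef int64_t TimestampTz;	/* stand-in for the real typedef */

/*
 * Hypothetical index-only stats entry: only the fields that are
 * meaningful for indexes, dropping the table-only business around
 * HOT, vacuum, analyze, and live/dead tuple counts.
 */
typedef struct PgStat_StatIndEntry
{
	PgStat_Counter numscans;
	TimestampTz lastscan;
	PgStat_Counter tuples_returned;
	PgStat_Counter tuples_fetched;
	PgStat_Counter blocks_fetched;
	PgStat_Counter blocks_hit;
	TimestampTz reset_time;
} PgStat_StatIndEntry;
```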
--
Michael


signature.asc
Description: PGP signature


Re: [PROPOSAL] Termination of Background Workers for ALTER/DROP DATABASE

2025-10-19 Thread Peter Smith
Hi Iwata-San,

Some comments for the latest v8 patch.

==
src/backend/postmaster/bgworker.c

TerminateBgWorkersByDbOid:

1.
+void
+TerminateBgWorkersByDbOid(Oid oid)

Now the function name is more explicit, but that is not a good reason
to make the parameter name more vague.

IMO the parameter should still be "dbOid" or "databaseId" instead of
just "oid". (ditto for the extern in bgworker.h)

==
src/backend/storage/ipc/procarray.c


CountOtherDBBackends:

2.
+ /*
+ * Usually, we try 50 times with 100ms sleep between tries, making 5 sec
+ * total wait. If requested, it would be reduced to 10 times to shorten the
+ * test time.
+ */


The comment seemed vague to me. How about more like:

/*
 * Retry up to 50 times with 100ms between attempts (max 5s total).
 * Can be reduced to 10 attempts (max 1s total) to speed up tests.
 */

~~~

3.
+ for (tries = 0; tries < ntries; tries++)

'tries' can be declared as a for-loop variable.

~~~

4.
Something feels strange about this function name
(CountOtherDBBackends), which suggests it is just for counting stuff,
but in reality it is more about exiting/terminating the workers. In fact,
it returns a boolean, not a count. Compare this with the similarly named
"CountUserBackends", which really *is* doing what it says.

Can we give this function a better name, or is that out of scope for this patch?

==
src/test/modules/worker_spi/t/002_worker_terminate.pl

5.
+# Firstly register an injection point to make the test faster. Normally, it
+# spends more than 5 seconds because the backend retries, counting the number
+# of connecting processes 50 times, but now the counting would be done only 10
+# times. See CountOtherDBBackends().
+$node->safe_psql('postgres', "CREATE EXTENSION injection_points;");
+$node->safe_psql('postgres',
+ "SELECT injection_points_attach('reduce-ncounts', 'error');");
+

It seemed overkill to give details about what "normally" happens. I
think it is enough to have a simple comment here:

SUGGESTION
The injection point 'reduce-ncounts' reduces the number of backend
retries, allowing for shorter test runs. See CountOtherDBBackends().

==
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: IO in wrong state on riscv64

2025-10-19 Thread Thomas Munro
Updating the list with some progress:  Alexander narrowed it down to
pgaio_io_wait() being inlined into pgaio_io_was_recycled().  It seems
to have some instructions in the wrong order, and that can be
reproduced standalone, so I've asked about it here:

https://github.com/llvm/llvm-project/issues/164194

I'm not seeing how that could be our fault, but let's see...




Re: misleading error message in ProcessUtilitySlow T_CreateStatsStmt

2025-10-19 Thread jian he
On Fri, Oct 17, 2025 at 8:13 PM Álvaro Herrera  wrote:
>
> Yeah, this looks good.  But I think we can go a little further.  Because
> CreateStatistics() now calls transformStatsStmt(), we don't need to do
> the same in ATPostAlterTypeParse(), which allows us to simplify the code
> there.  In turn this means we can pass the whole Relation rather than
> merely the OID, so transformStatsStmt doesn't need to open it.  I attach
> v4 with those changes.  Comments?
>

/*
 * transformStatsStmt - parse analysis for CREATE STATISTICS
 *
 * To avoid race conditions, it's important that this function relies only on
 * the passed-in rel (and not on stmt->relation) as the target relation.
 */
CreateStatsStmt *
transformStatsStmt(Relation rel, CreateStatsStmt *stmt, const char *queryString)
{
..
}

the (Relation rel) effectively comes from "stmt->relations", which
conflicts with the above comment.


> For a moment it looked weird to me to pass a Relation to the parse
> analysis routine, but I see that other functions declared in
> parse_utilcmd.h are already doing that.
>
>
> One more thing I noticed is that we're scribbling on the parser node,
> which we can easily avoid by creating a copy of the node in
> transformStatsStmt() before doing any changes.  I'm not too clear
> on whether this is really necessary or not.  We do it in other places,
> but we haven't done it here for a long while, and nothing seems to break
> because of that.
>
> diff --git a/src/backend/parser/parse_utilcmd.c 
> b/src/backend/parser/parse_utilcmd.c
> index 89c8315117b..7797c707f73 100644
> --- a/src/backend/parser/parse_utilcmd.c
> +++ b/src/backend/parser/parse_utilcmd.c
> @@ -3150,6 +3150,9 @@ transformStatsStmt(Relation rel, CreateStatsStmt *stmt, 
> const char *queryString)
> if (stmt->transformed)
> return stmt;
>
> +   /* create a copy to scribble on */
> +   stmt = copyObject(stmt);
> +
> /* Set up pstate */
> pstate = make_parsestate(NULL);
> pstate->p_sourcetext = queryString;
>
in src/backend/parser/parse_utilcmd.c, we have only two occurrences of
copyObject.




Re: Optimize SnapBuildPurgeOlderTxn: use in-place compaction instead of temporary array

2025-10-19 Thread Kirill Reshke
On Mon, 20 Oct 2025 at 08:08, Xuneng Zhou  wrote:
>
> Hi, thanks for looking into this.
>
> On Sat, Oct 18, 2025 at 4:59 PM Kirill Reshke  wrote:
> >
> > On Sat, 18 Oct 2025 at 12:50, Xuneng Zhou  wrote:
> > >
> > > Hi Hackers,
> >
> > Hi!
> >
> > > The SnapBuildPurgeOlderTxn function previously used a suboptimal
> > > method to remove old XIDs from the committed.xip array. It allocated a
> > > temporary workspace array, copied the surviving elements into it, and
> > > then copied them back, incurring unnecessary memory allocation and
> > > multiple data copies.
> > >
> > > This patch refactors the logic to use a standard two-pointer, in-place
> > > compaction algorithm. The new approach filters the array in a single
> > > pass with no extra memory allocation, improving both CPU and memory
> > > efficiency.
> > >
> > > No behavioral changes are expected. This resolves a TODO comment
> > > expecting a more efficient algorithm.
> > >
> >
> > Indeed, these changes look correct.
> > I wonder why b89e151054a0 did this place this way, hope we do not miss
> > anything here.
>
> I think this small refactor does not introduce behavioral changes or
> breaks given constraints.
>
> > Can we construct a microbenchmark here which will show some benefit?
> >
>
> I prepared a simple microbenchmark to evaluate the impact of the
> algorithm replacement. The attached results summarize the findings.
> An end-to-end benchmark was not included, as this function is unlikely
> to be a performance hotspot in typical decoding workloads—the array
> being cleaned is expected to be relatively small under normal
> operating conditions. However, its impact could become more noticeable
> in scenarios with long-running transactions and a large number of
> catalog-modifying DML or DDL operations.
>
> Hardware:
> AMD EPYC™ Genoa 9454P 48-core 4th generation
> DDR5 ECC reg
> NVMe SSD Datacenter Edition (Gen 4)
>
> Best,
> Xuneng

At first glance these results look satisfactory.

Can you please describe, how did you get your numbers? Maybe more
script or steps to reproduce, if anyone will be willing to...

-- 
Best regards,
Kirill Reshke
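
For reference, the in-place two-pointer compaction described above can be
sketched stand-alone as follows (illustrative only, not the actual
SnapBuildPurgeOlderTxn code; xid_precedes is a simplified stand-in for
TransactionIdPrecedes, which really uses modulo-2^31 comparison):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* Simplified stand-in for TransactionIdPrecedes(). */
static int
xid_precedes(TransactionId a, TransactionId b)
{
	return a < b;
}

/*
 * Two-pointer in-place compaction: drop XIDs older than 'cutoff',
 * sliding survivors left in a single pass with no scratch array.
 * Returns the new element count.
 */
static size_t
purge_older_xids(TransactionId *xip, size_t nxip, TransactionId cutoff)
{
	size_t		keep = 0;

	for (size_t i = 0; i < nxip; i++)
	{
		if (!xid_precedes(xip[i], cutoff))
			xip[keep++] = xip[i];	/* survivor slides left */
	}
	return keep;
}
```

The write pointer never overtakes the read pointer, so the pass is safe
to do in place and preserves the relative order of the surviving XIDs.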




Question on ThrowErrorData

2025-10-19 Thread 正华吕
Hi all,

Reading the source code of ThrowErrorData, I see that it tries to copy
an ErrorData object into the correct memory context:

newedata = &errordata[errordata_stack_depth];
recursion_depth++;
oldcontext = MemoryContextSwitchTo(newedata->assoc_context);

/* Copy the supplied fields to the error stack entry. */
if (edata->sqlerrcode != 0)
newedata->sqlerrcode = edata->sqlerrcode;
if (edata->message)
newedata->message = pstrdup(edata->message);
if (edata->detail)
newedata->detail = pstrdup(edata->detail);
if (edata->detail_log)
newedata->detail_log = pstrdup(edata->detail_log);
 ...


 At the end of the function ThrowErrorData(), it calls
 errfinish(edata->filename, edata->lineno, edata->funcname);

 Inside errfinish, it just does a simple pointer assignment to set the
 filename and funcname fields (via set_stack_entry_location()).

 So an ErrorData object's filename and funcname may end up in a
 different memory context from edata->assoc_context, breaking the
 contract of edata->assoc_context.

 I have tried, but so far have not managed to build a bad case that
 makes the pointer dangle. Still, I feel it is a potential risk. We
 could copy these two fields into the correct memory context like:


diff --git a/src/backend/utils/error/elog.c b/src/backend/utils/error/elog.c
index 29643c51439..7859047df6a 100644
--- a/src/backend/utils/error/elog.c
+++ b/src/backend/utils/error/elog.c
@@ -820,9 +820,9 @@ set_stack_entry_location(ErrorData *edata,
filename = slash + 1;
}

-   edata->filename = filename;
+   edata->filename = MemoryContextStrdup(edata->assoc_context, filename);
edata->lineno = lineno;
-   edata->funcname = funcname;
+   edata->funcname = MemoryContextStrdup(edata->assoc_context, funcname);
 }


Any ideas? Thanks!

Best,
Zhenghua Lyu


Use CompactAttribute more often, when possible

2025-10-19 Thread David Rowley
5983a4cff added CompactAttribute to TupleDesc to allow commonly
accessed fields from the FormData_pg_attribute array to be accessed
more quickly by having to load fewer cachelines from memory. That
commit also went around and changed many locations to use
CompactAttribute, but not all locations.

The attached aims to adjust the remaining locations that I'm
comfortable changing. I've avoided doing anything in tablecmds.c as
that's where TupleDescs tend to get changed and I think there's a bit
of undue risk using CompactAttribute there, just in case it gets
accessed after the TupleDesc is changed and before flushing the
pending changes to CompactAttribute in populate_compact_attribute().

Also, I'm aware that the code around the population of the
CompactAttribute has changed a bit since it was added. attgenerated no
longer mirrors its equivalent pg_attribute column. If that was made to
mirror that column again, then there are a few extra locations that
could be made to use CompactAttribute. There's also some more
complexity around CompactAttribute.attnotnull that's crept in. I think
I roughly understand the need for that, but the intent of
CompactAttribute simply mirroring commonly used pg_attribute fields no
longer holds after those changes, as it now contains extra information
that's unavailable in pg_attribute. It'd be nice if pg_attribute also
had the various additional states of attnotnull that CompactAttribute
now has. 

I don't think the attached is very interesting to look at, but I'll
leave it here for a bit just in case anyone wants to look. Otherwise,
I plan to just treat it as a follow-up to 5983a4cff.

David


v1-0001-Use-CompactAttribute-more-often-when-possible.patch
Description: Binary data


Re: Fix lag columns in pg_stat_replication not advancing when replay LSN stalls

2025-10-19 Thread Shinya Kato
On Mon, Oct 20, 2025 at 10:17 AM Fujii Masao  wrote:
>
> On Sun, Oct 19, 2025 at 2:04 AM Shinya Kato  wrote:
> > Thank you for the patch. I have one comment.
> >
> > +   if (lag_tracker->overflowed[head].lsn > lsn)
> > +   return now - lag_tracker->overflowed[head].time;
> >
> > Could this return a negative value if the clock somehow went
> > backwards? The original code returns -1 in this case, so I'm curious
> > about this.
>
> Thanks for the review!
>
> Yes, you're right. So I've updated the patch so that -1 is returned
> when the current time is earlier than the time in the overflow entry,
> treating it as "no new sample found".

Thank you for updating the patch. It looks nice to me.

-- 
Best regards,
Shinya Kato
NTT OSS Center




Re: Optimize LISTEN/NOTIFY

2025-10-19 Thread Joel Jacobson
On Mon, Oct 20, 2025, at 00:10, Joel Jacobson wrote:
> Attachments:
> * 0001-optimize_listen_notify-v20.patch
> * 0002-optimize_listen_notify-v20.patch
> * 0002-optimize_listen_notify-v20-alt1.txt
> * 0002-optimize_listen_notify-v20-alt3.txt
> * 0002-optimize_listen_notify-v20-alt2.txt

Attaching a new alt1 version that fixes the mistake of using max(pos,
advisoryPos) for the lag calculation. That was wrong because, in alt1,
it's the backend itself that updates its 'pos' when it wakes up, and
'pos' is what asyncQueueAdvanceTail looks at.

/JoelFrom 493f05130febbd8c4bc0bc2533e22ec6ddf6d5f5 Mon Sep 17 00:00:00 2001
From: Joel Jacobson 
Date: Sun, 19 Oct 2025 08:08:05 +0200
Subject: [PATCH] Implements idea #1: advisoryPos

---
 src/backend/commands/async.c | 67 
 1 file changed, 61 insertions(+), 6 deletions(-)

diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 4e6556fb8d1..4a8a6f5bf1b 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -264,6 +264,11 @@ typedef struct QueuePosition
 (x).page != (y).page ? (x) : \
 (x).offset > (y).offset ? (x) : (y))
 
+/* returns true if x comes before y in queue order */
+#define QUEUE_POS_PRECEDES(x,y) \
+   (asyncQueuePagePrecedes((x).page, (y).page) || \
+((x).page == (y).page && (x).offset < (y).offset))
+
 /*
  * Parameter determining how often we try to advance the tail pointer:
  * we do that after every QUEUE_CLEANUP_DELAY pages of NOTIFY data.  This is
@@ -286,6 +291,7 @@ typedef struct QueueBackendStatus
Oid dboid;  /* backend's database 
OID, or InvalidOid */
ProcNumber  nextListener;   /* id of next listener, or 
INVALID_PROC_NUMBER */
QueuePosition pos;  /* backend has read queue up to 
here */
+   QueuePosition advisoryPos;  /* backend could skip queue to here */
boolwakeupPending;  /* signal sent but not yet processed */
 } QueueBackendStatus;
 
@@ -347,6 +353,7 @@ static dshash_table *channelHash = NULL;
 #define QUEUE_BACKEND_DBOID(i) (asyncQueueControl->backend[i].dboid)
 #define QUEUE_NEXT_LISTENER(i) 
(asyncQueueControl->backend[i].nextListener)
 #define QUEUE_BACKEND_POS(i)   (asyncQueueControl->backend[i].pos)
+#define QUEUE_BACKEND_ADVISORY_POS(i)  
(asyncQueueControl->backend[i].advisoryPos)
 #define QUEUE_BACKEND_WAKEUP_PENDING(i)
(asyncQueueControl->backend[i].wakeupPending)
 
 /*
@@ -674,6 +681,7 @@ AsyncShmemInit(void)
QUEUE_BACKEND_DBOID(i) = InvalidOid;
QUEUE_NEXT_LISTENER(i) = INVALID_PROC_NUMBER;
SET_QUEUE_POS(QUEUE_BACKEND_POS(i), 0, 0);
+   SET_QUEUE_POS(QUEUE_BACKEND_ADVISORY_POS(i), 0, 0);
QUEUE_BACKEND_WAKEUP_PENDING(i) = false;
}
}
@@ -1312,6 +1320,7 @@ Exec_ListenPreCommit(void)
prevListener = i;
}
QUEUE_BACKEND_POS(MyProcNumber) = max;
+   QUEUE_BACKEND_ADVISORY_POS(MyProcNumber) = max;
QUEUE_BACKEND_PID(MyProcNumber) = MyProcPid;
QUEUE_BACKEND_DBOID(MyProcNumber) = MyDatabaseId;
/* Insert backend into list of listeners at correct position */
@@ -2031,9 +2040,13 @@ SignalBackends(void)
 * Even though we may take and release NotifyQueueLock multiple times
 * while writing, the heavyweight lock guarantees this region contains
 * only our messages.  Therefore, any backend still positioned at the
-* queue head from before our write can be safely advanced to the 
current
+* queue head from before our write can be advised to skip to the 
current
 * queue head without waking it.
 *
+* We use the advisoryPos field rather than directly modifying pos.
+* The backend controls its own pos field and will check advisoryPos
+* when it's safe to do so.
+*
 * False-positive possibility: if a backend was previously signaled but
 * hasn't yet awoken, we'll skip advancing it (because wakeupPending is
 * true).  This is safe - the backend will advance its pointer when it
@@ -2048,6 +2061,7 @@ SignalBackends(void)
 i = QUEUE_NEXT_LISTENER(i))
{
QueuePosition pos;
+   QueuePosition advisoryPos;
int64   lag;
int32   pid;
 
@@ -2055,13 +2069,22 @@ SignalBackends(void)
continue;
 
pos = QUEUE_BACKEND_POS(i);
+   advisoryPos = QUEUE_BACKEND_ADVISORY_POS(i);
 
-   /* Direct advancement for idle backends at the old head */
+   /*
+* Direct advancement for idle backends at the old head.
+*
+* We check advisoryPos rather than pos to allow accumulating 
advances
+   

Re: [PATCH] random_normal function

2025-10-19 Thread David E. Wheeler
On Oct 6, 2025, at 06:30, Michael Paquier  wrote:

> IMO, the main docs would benefit greatly from the addition of
> PG_ABS_SRCDIR, PG_ABS_BUILDDIR, PG_LIBDIR and PG_DLSUFFIX, which are
> still missing.  These are available to extension developers through
> installcheck, and are a great way to write more flexible tests with
> arbitrary inputs, like a dependency to on-disk files.  I would
> certainly +1 this addition, at least.

I’ve never heard of these! Seems like there’s an awful lot of stuff in the 
testing infrastructure that would be useful to document.

D



signature.asc
Description: Message signed with OpenPGP


Re: isolation tester limitation in case of multiple injection points in a single command

2025-10-19 Thread Michael Paquier
On Sun, Oct 19, 2025 at 03:41:00PM +0200, Mihail Nikalayeu wrote:
> Hello, everyone!
> 
> While stabilizing tests for [0] I realized I have no idea how to set
> up a pretty simple scenario using an isolation tester.
> 
> It looks like this:
> * session S1 start long running query (CREATE INDEX CONCURRENTLY) and
> traps into injection point X
> * session S2 executes some command
> * session S2 wakes up S1
> * now S2 needs to run one more command BUT only AFTER S1 traps into
> another injection point Y (and it is still the same CREATE INDEX)
> <--- tricky part here

Ah, OK.  So you want one single command, but still wait through
two injection points inside it.  I am wondering about one thing: do
you really require that?  Could it be simpler to have two
permutations, each one of them using one wait point to check your two
scenarios?

> So, my proposal options:
> a) tell me if there's a clear solution I missed
> b) for 'wait' injection_point - learn isolation tester to detect it
> and implement logic like 'S2_query(S1 traps Y)'
> c) learn isolation tester to treat 'notice' from injection_point the
> same way as RAISE NOTICE + (preferable) add possibility to
> apply 'notice' and 'wait' for the same injection point (like
> 'wait+notice')
> d) add injection_point_await - utility to wait until injected point is not
> "waited" by some backend
> 
> What do you think? Any other ideas?

Something close to what you suggested in b), which can be defined
in the specs, sounds like a natural solution to me.  It also feels
like a better approach from the spec reader's perspective:
this would document multiple wait steps in the test itself.
--
Michael


signature.asc
Description: PGP signature


Re: Add RESPECT/IGNORE NULLS and FROM FIRST/LAST options

2025-10-19 Thread Tatsuo Ishii
> A very trivial commit:
> 
> ```
> + else
> +
> + /*
> +  * For other cases we have no idea what position of row callers 
> would
> +  * fetch next time. Also for relpos < 0 case (we go backward), 
> we
> +  * cannot set mark either. For those cases we always set mark 
> at 0.
> +  */
> + mark_pos = 0;
> ```
> 
> The empty line after “else” is not needed.

That was added by pgindent.

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp




Re: [PROPOSAL] Termination of Background Workers for ALTER/DROP DATABASE

2025-10-19 Thread Michael Paquier
On Mon, Oct 20, 2025 at 01:01:31PM +1100, Peter Smith wrote:
> Some comments for the latest v8 patch.

Peter's comments concern code comments and parameters.  I am not
going down to that level of detail in this message; these can be
infinitely tuned.

The injection point integration looks correct.  You are checking the
compile flag and whether the extension is available in the installation
path, which should be enough.

+   if (IS_INJECTION_POINT_ATTACHED("reduce-ncounts"))
+   ntries = 10;

1s is much faster than the default of 5s; still, I am wondering if
this could be brought down a bit more.  Dropping the worker still
around after the first test with CREATE DATABASE works here.

+# Confirm a background worker is still running
+$node->safe_psql(
+   "postgres", qq(
+SELECT count(1) FROM pg_stat_activity
+   WHERE backend_type = 'worker_spi dynamic';));

This does not check that the worker that does not have the flag set is
still running: you are not feeding the output of this query to an is()
test.

+   is($result, 't', "dynamic bgworker launched");

In launch_bgworker(), this uses the same test description for all the
callers of this subroutine.  Let's prefix it with $testcase.

+void
+TerminateBgWorkersByDbOid(Oid oid)

FWIW, while reading this code, I was wondering about one improvement
that could benefit more extension code than only what we are
discussing here, because external code has no way to scan the
BackgroundWorkerSlot array while holding the LWLock
BackgroundWorkerLock in a single loop.  We could rewrite this new
routine as something like:
void TerminateBackgroundWorkerMatching(
bool (*do_terminate) (int pid, BackgroundWorker *, Datum))

Then the per-database termination would be a custom callback routine,
defined also in bgworker.c.  Other extension code could define its own
filtering callback.  Just an idea in passing, to let extension code
take more actions on bgworker slots in use, based on a PGPROC entry
(a role ID, for example) or some other factor.  Feel free to say so
if you dislike such a funky idea, of course.
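To illustrate the shape of that idea, here is a rough standalone sketch with mocked-up slot and callback types (WorkerSlot and the void * argument stand in for the real BackgroundWorkerSlot and Datum; the LWLock and signaling machinery are only hinted at in comments):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Mock stand-in for the real BackgroundWorkerSlot machinery. */
typedef struct WorkerSlot
{
	bool		in_use;
	int			pid;
	int			dboid;			/* database this worker is attached to */
	bool		terminated;
} WorkerSlot;

#define MAX_SLOTS 4
static WorkerSlot slots[MAX_SLOTS];

/*
 * Walk all in-use slots under a single (imagined) hold of
 * BackgroundWorkerLock, terminating those the callback approves of.
 * Returns the number of workers terminated.
 */
static int
TerminateBackgroundWorkerMatching(bool (*do_terminate) (const WorkerSlot *, void *),
								  void *arg)
{
	int			nkilled = 0;

	/* LWLockAcquire(BackgroundWorkerLock, ...) would go here */
	for (int i = 0; i < MAX_SLOTS; i++)
	{
		if (slots[i].in_use && do_terminate(&slots[i], arg))
		{
			slots[i].terminated = true; /* real code would signal the PID */
			nkilled++;
		}
	}
	/* LWLockRelease(BackgroundWorkerLock) */
	return nkilled;
}

/* Per-database filter: the use case discussed in this thread. */
static bool
match_dboid(const WorkerSlot *slot, void *arg)
{
	return slot->dboid == *(int *) arg;
}
```

With this shape, the ALTER/DROP DATABASE case becomes one caller among many; extensions could pass their own predicate instead.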
--
Michael


signature.asc
Description: PGP signature


Re: [PATCH] Fix POSIX compliance in pgwin32_unsetenv()

2025-10-19 Thread Michael Paquier
On Sat, Oct 18, 2025 at 01:26:40PM -0500, Bryan Green wrote:
> I noticed that pgwin32_unsetenv() in src/port/win32env.c lacks the input
> validation that its sibling function pgwin32_setenv() has (lines 126-132).
> 
> Without these checks, the function will crash on NULL input via
> strlen(NULL), and will accept empty strings or strings containing '=' in
> violation of POSIX.1-2008.
> 
> The attached patch adds the same validation that pgwin32_setenv already
> does, making the two functions consistent.  This is purely defensive - it
> only affects callers passing invalid arguments.

I presume that you have tried to use this routine on some external
code on WIN32 to note that it was just crashing.  

The current state of pgwin32_unsetenv() dates back to 0154345078fb.
The POSIX checks of setenv() are more recent than that, as in
7ca37fb0406b down to v14.  I agree that the inconsistency in handling
the input arguments is annoying, so if there are no objections let's
apply the same checks down to v14 like the setenv() piece.  It's
better than a hard crash.
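For reference, the POSIX.1-2008 name validation being discussed boils down to something like this generic sketch (not the actual win32env.c code):

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Validate a name argument per POSIX.1-2008 setenv()/unsetenv() rules:
 * it must be non-NULL, non-empty, and contain no '=' character.
 * Returns 0 if valid; otherwise returns -1 with errno set to EINVAL.
 */
static int
check_env_name(const char *name)
{
	if (name == NULL || *name == '\0' || strchr(name, '=') != NULL)
	{
		errno = EINVAL;
		return -1;
	}
	return 0;
}
```

Without the NULL check in particular, strlen(NULL) in the existing code is what produces the hard crash described upthread.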
--
Michael


signature.asc
Description: PGP signature


Re: Executing pg_createsubscriber with a non-compatible control file

2025-10-19 Thread Michael Paquier
On Sat, Oct 18, 2025 at 08:55:35AM +0900, Michael Paquier wrote:
> With the addition of the version check in the binary, there is no
> downside in documenting this requirement.

Okay, I have added a note about that in the docs, then applied your
patch on HEAD.  Thanks!

If others would like to apply the same kind of version checks in some
of the other in-core tools, please feel free to mention them here or
on a new thread.
--
Michael


signature.asc
Description: PGP signature


Re: Fix lag columns in pg_stat_replication not advancing when replay LSN stalls

2025-10-19 Thread Fujii Masao
On Sat, Oct 18, 2025 at 9:16 AM Chao Li  wrote:
> I think I put all concentration on the big picture yesterday, so I ignored 
> this implementation detail:
>
> ```
> +   if (lag_tracker->overflowed[head].lsn > lsn)
> +   return now - lag_tracker->overflowed[head].time;
> ```
>
> Should this “>” be “>=“?

I think either ">" or ">=" would work. However, the following loop
in LagTrackerRead() advances the read head when the LSN
in the current buffer entry is equal to the given LSN, so I followed
the same logic for the overflow entry and used ">" there.

Regards,

-- 
Fujii Masao




Re: Fix lag columns in pg_stat_replication not advancing when replay LSN stalls

2025-10-19 Thread Fujii Masao
On Sun, Oct 19, 2025 at 2:04 AM Shinya Kato  wrote:
> Thank you for the patch. I have one comment.
>
> +   if (lag_tracker->overflowed[head].lsn > lsn)
> +   return now - lag_tracker->overflowed[head].time;
>
> Could this return a negative value if the clock somehow went
> backwards? The original code returns -1 in this case, so I'm curious
> about this.

Thanks for the review!

Yes, you're right. So I've updated the patch so that -1 is returned
when the current time is earlier than the time in the overflow entry,
treating it as "no new sample found".
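The guard amounts to something like the following simplified sketch, using plain integer microsecond timestamps instead of the actual TimestampTz machinery:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Compute the elapsed time between a sample and now, but report -1
 * ("no new sample found") instead of a negative lag if the clock has
 * gone backwards since the sample was taken.
 */
static int64_t
elapsed_or_unknown(int64_t sample_time, int64_t now)
{
	if (now < sample_time)
		return -1;				/* clock moved backwards: no valid lag */
	return now - sample_time;
}
```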

Regards,

-- 
Fujii Masao


v3-0001-Fix-stalled-lag-columns-in-pg_stat_replication-wh.patch
Description: Binary data


Re: Fix lag columns in pg_stat_replication not advancing when replay LSN stalls

2025-10-19 Thread Fujii Masao
On Mon, Oct 20, 2025 at 10:12 AM Fujii Masao  wrote:
>
> On Sat, Oct 18, 2025 at 9:16 AM Chao Li  wrote:
> > I think I put all concentration on the big picture yesterday, so I ignored 
> > this implementation detail:
> >
> > ```
> > +   if (lag_tracker->overflowed[head].lsn > lsn)
> > +   return now - lag_tracker->overflowed[head].time;
> > ```
> >
> > Should this “>” be “>=“?
>
> I think either ">" or ">=" would work. However, the following loop
> in LagTrackerRead() advances the read head when the LSN
> in the current buffer entry is equal to the given LSN

I forgot to mention which loop actually advances the read head in that case.
It's the following code in LagTrackerRead().


/* Read all unread samples up to this LSN or end of buffer. */
while (lag_tracker->read_heads[head] != lag_tracker->write_head &&
   lag_tracker->buffer[lag_tracker->read_heads[head]].lsn <= lsn)
{
time = lag_tracker->buffer[lag_tracker->read_heads[head]].time;
lag_tracker->last_read[head] =
lag_tracker->buffer[lag_tracker->read_heads[head]];
lag_tracker->read_heads[head] =
(lag_tracker->read_heads[head] + 1) % LAG_TRACKER_BUFFER_SIZE;
}


Regards,

-- 
Fujii Masao




Re: [PATCH] Fix POSIX compliance in pgwin32_unsetenv()

2025-10-19 Thread Bryan Green

On 10/19/25 20:02, Michael Paquier wrote:

On Sat, Oct 18, 2025 at 01:26:40PM -0500, Bryan Green wrote:

I noticed that pgwin32_unsetenv() in src/port/win32env.c lacks the input
validation that its sibling function pgwin32_setenv() has (lines 126-132).

Without these checks, the function will crash on NULL input via
strlen(NULL), and will accept empty strings or strings containing '=' in
violation of POSIX.1-2008.

The attached patch adds the same validation that pgwin32_setenv already
does, making the two functions consistent.  This is purely defensive - it
only affects callers passing invalid arguments.


I presume that you have tried to use this routine on some external
code on WIN32 to note that it was just crashing.

The current state of pgwin32_unsetenv() dates back to 0154345078fb.
The POSIX checks of setenv() are more recent than that, as in
7ca37fb0406b down to v14.  I agree that the inconsistency in handling
the input arguments is annoying, so if there are no objections let's
apply the same checks down to v14 like the setenv() piece.  It's
better than a hard crash.
--
Michael


I have been going through all of the Windows code line by line.  That is 
how I initially noticed this.  I then wrote a program to exercise the 
code and confirm the crash.  I agree it should be backported.


BG




Re: get rid of RM_HEAP2_ID

2025-10-19 Thread Michael Paquier
On Wed, Oct 15, 2025 at 05:39:38PM +0300, Heikki Linnakangas wrote:
> We could store xl_crc unaligned. IIRC all the structs that follow that are
> already stored unaligned.

If we do just that, would there be any downside in moving xl_crc to be
just after xl_prev?  That would keep the record assembling part
simpler, because XLogRecord is treated as a single object, where the
code relies on a MAXALIGN64() to make sure that the header is aligned.
--
Michael


signature.asc
Description: PGP signature


Re: pg_restore --no-policies should not restore policies' comment

2025-10-19 Thread Fujii Masao
On Fri, Oct 17, 2025 at 7:23 PM Fujii Masao  wrote:
>
> On Thu, Oct 16, 2025 at 11:23 PM jian he  wrote:
> > > As for the unnecessary code for security labels on extensions
> > > you mentioned earlier, I've created a patch to remove it. Patch attached.
> >
> > looks good to me.
>
> Thanks for the review! Unless there are any objections, I'll commit the patch.

I've pushed the patch. Thanks!

Regards,

-- 
Fujii Masao




Re: Logical Replication of sequences

2025-10-19 Thread Chao Li



> On Oct 17, 2025, at 17:34, Chao Li  wrote:
> 
> I may find some time to review 0002 and 0003 next week.

I just reviewed 0002 and 0003. Got some comments for 0002, and no comment for 
0003.


1 - 0002 - commit comment
```
Sequences have 3 states:
   - INIT (needs [re]synchronizing)
   - READY (is already synchronized)
```

3 states or 2 states? Missing DATASYNC?


2 - 0002 - launcher.c
```
+ * For both apply workers and sequence sync workers, the relid should be set to
+ * InvalidOid, as they manage changes across all tables and sequences. For 
table
+ * sync workers, the relid should be set to the OID of the relation being
+ * synchronized.
```

Nit: “both” sounds unnecessary here.

3 - 0002 - launcher.c
```
  * Stop the logical replication worker for subid/relid, if any.
  */
 void
-logicalrep_worker_stop(Oid subid, Oid relid)
+logicalrep_worker_stop(Oid subid, Oid relid, LogicalRepWorkerType wtype)
```

Should the comment be updated to “subid/relid/wtype”?

4 - 0002 - launcher.c
```
-   worker = logicalrep_worker_find(subid, relid, true);
+   worker = logicalrep_worker_find(subid, relid, WORKERTYPE_APPLY, true);
```

Based on the comment you added to “logicalrep_worker_find()”, for an
apply worker, relid should be set to InvalidOid (see comment 2).

Then, if you change this function to only work for WORKERTYPE_APPLY,
relid should be hard-coded to InvalidOid, so that “relid” can be
removed from the parameter list of logicalrep_worker_wakeup().

5 - 0002 - launcher.c
```
/*
* report_error_sequences
*
* Reports discrepancies in sequence data between the publisher and subscriber.
```

Nit: “Reports” -> “Report”

Here “Reports” also works. But looking at the previous function’s comment:

```
 * Handle sequence synchronization cooperation from the apply worker.
*
* Start a sequencesync worker if one is not already running. The active
```

There you used “Handle” and “Start”, not “Handles” and “Starts”.

“Report” better matches the imperative style used in PG comments.

6 - 0002 - sequencesync.c
```
+report_error_sequences(StringInfo insuffperm_seqs, StringInfo mismatched_seqs)
+{
+   StringInfo  combined_error_detail = makeStringInfo();
+   StringInfo  combined_error_hint = makeStringInfo();
```

By the end of this function, I think we need to destroyStringInfo() these two 
strings.

7 - 0002 - sequencesync.c
```
+static void
+append_sequence_name(StringInfo buf, const char *nspname, const char *seqname,
+int *count)
+{
+   if (buf->len > 0)
+   appendStringInfoString(buf, ", ");
```

Why is this “if” check needed?
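For context, a length check like this is the usual way to suppress a leading separator before the first item. A minimal plain-C sketch of the comma-join pattern (not the StringInfo code itself):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Append name to buf, preceding it with ", " only when buf already has
 * content.  The emptiness check prevents a stray leading separator
 * before the first item: "a, b" rather than ", a, b".
 */
static void
append_with_comma(char *buf, size_t bufsize, const char *name)
{
	if (buf[0] != '\0')
		strncat(buf, ", ", bufsize - strlen(buf) - 1);
	strncat(buf, name, bufsize - strlen(buf) - 1);
}
```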

8 - 0002 - sequencesync.c 
```
+   destroyStringInfo(seqstr);
+   destroyStringInfo(cmd);
```

Instead of making and destroying seqstr and cmd on every iteration, we
could create them before the “while” loop and just resetStringInfo()
them on each iteration, which should be cheaper and faster.

9 - 0002 - sequencesync.c
```
+   return h1 ^ h2;
+   /* XOR combine */
+}
```

This code is self-descriptive, so the comment looks redundant. Also, why
is the comment placed below the code line?

10 - 0002 - syncutils.c
```
 /*
- * Common code to fetch the up-to-date sync state info into the static lists.
+ * Common code to fetch the up-to-date sync state info for tables and 
sequences.
  *
- * Returns true if subscription has 1 or more tables, else false.
+ * The pg_subscription_rel catalog is shared by tables and sequences. Changes
+ * to either sequences or tables can affect the validity of relation states, so
+ * we identify non-ready tables and non-ready sequences together to ensure
+ * consistency.
  *
- * Note: If this function started the transaction (indicated by the parameter)
- * then it is the caller's responsibility to commit it.
+ * Returns true if subscription has 1 or more tables, else false.
  */
 bool
-FetchRelationStates(bool *started_tx)
+FetchRelationStates(bool *has_pending_sequences, bool *started_tx)
```
has_pending_sequences is not explained in the function comment.

11 - 0002 - syncutils.c
```
/*
* has_subtables and has_subsequences is declared as static, since the
* same value can be used until the system table is invalidated.
*/
static bool has_subtables = false;
static bool has_subsequences_non_ready = false;
```

The comment says “has_subsequences”, but the real variable name is
“has_subsequences_non_ready”.

12 - 0002 - syncutils.c
```
 bool
-FetchRelationStates(bool *started_tx)
+FetchRelationStates(bool *has_pending_sequences, bool *started_tx)
```

I searched the code; this function has 3 callers, and none of them
wants results for both tables and sequences. So a caller, for example:

```
bool
HasSubscriptionTablesCached(void)
{
bool started_tx;
bool has_subrels;
bool has_pending_sequences;

/* We need up-to-date subscription tables info here */
has_subrels = FetchRelationStates(&has_pending_sequences, &started_tx);

if (started_tx)
{
CommitTransactionCommand();
pgstat_report_stat(true);
}

ret

Re: Add RESPECT/IGNORE NULLS and FROM FIRST/LAST options

2025-10-19 Thread Chao Li



> On Oct 19, 2025, at 17:53, Tatsuo Ishii  wrote:
> 
> 
> If there's no objection, I am going to commit in a few days.
> --
> Tatsuo Ishii
> SRA OSS K.K.
> English: http://www.sraoss.co.jp/index_en/
> Japanese:http://www.sraoss.co.jp
> 

A very trivial commit:

```
+   else
+
+   /*
+* For other cases we have no idea what position of row callers 
would
+* fetch next time. Also for relpos < 0 case (we go backward), 
we
+* cannot set mark either. For those cases we always set mark 
at 0.
+*/
+   mark_pos = 0;
```

The empty line after “else” is not needed.

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/