Re: POC, WIP: OR-clause support for indexes

2024-03-04 Thread Andrei Lepikhov

On 5/3/2024 12:30, Andrei Lepikhov wrote:

On 4/3/2024 09:26, jian he wrote:

... and the new version of the patchset is attached.

--
regards,
Andrei Lepikhov
Postgres Professional
From 1c3ac3e006cd66ff40f1ddaaa09e3fc0f3a75ca5 Mon Sep 17 00:00:00 2001
From: Alena Rybakina 
Date: Fri, 2 Feb 2024 22:01:09 +0300
Subject: [PATCH 1/2] Transform OR clauses to ANY expression.

Replace (expr op C1) OR (expr op C2) ... with expr op ANY(ARRAY[C1, C2, ...])
at the preliminary stage of optimization, when we are still working with the
expression tree.
Here C is a constant expression, 'expr' is a non-constant expression, and 'op'
is an operator which returns a boolean result and has a commuter (for the case
of reversed order of the constant and non-constant parts of the expression,
like 'CX op expr').
Sometimes it can lead to a suboptimal plan, but we think it is better to have
an array of elements instead of a long list of OR clauses. There is room for
further optimization in decomposing that array into more optimal parts.
Authors: Alena Rybakina , Andrey Lepikhov 

Reviewed-by: Peter Geoghegan , Ranier Vilela 
Reviewed-by: Alexander Korotkov , Robert Haas 

Reviewed-by: jian he 
---
 .../postgres_fdw/expected/postgres_fdw.out|  16 +-
 doc/src/sgml/config.sgml  |  17 +
 src/backend/nodes/queryjumblefuncs.c  |  27 ++
 src/backend/parser/parse_expr.c   | 339 ++
 src/backend/utils/misc/guc_tables.c   |  11 +
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/bin/pg_dump/t/002_pg_dump.pl  |   6 +-
 src/include/nodes/queryjumble.h   |   1 +
 src/include/optimizer/optimizer.h |   1 +
 src/test/regress/expected/create_index.out| 156 +++-
 src/test/regress/expected/inherit.out |   2 +-
 src/test/regress/expected/join.out|  62 +++-
 src/test/regress/expected/partition_prune.out | 219 +--
 src/test/regress/expected/rules.out   |   4 +-
 src/test/regress/expected/stats_ext.out   |  12 +-
 src/test/regress/expected/sysviews.out|   3 +-
 src/test/regress/expected/tidscan.out |  23 +-
 src/test/regress/sql/create_index.sql |  35 ++
 src/test/regress/sql/join.sql |  10 +
 src/test/regress/sql/partition_prune.sql  |  22 ++
 src/test/regress/sql/tidscan.sql  |   6 +
 src/tools/pgindent/typedefs.list  |   2 +
 22 files changed, 906 insertions(+), 69 deletions(-)

diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index c355e8f3f7..0523bbd8f7 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -1349,7 +1349,7 @@ SELECT t1.c1, t1.c2, t2.c1, t2.c2 FROM ft4 t1 LEFT JOIN (SELECT * FROM ft5 WHERE
  Foreign Scan
    Output: t1.c1, t1.c2, ft5.c1, ft5.c2
    Relations: (public.ft4 t1) LEFT JOIN (public.ft5)
-   Remote SQL: SELECT r1.c1, r1.c2, r4.c1, r4.c2 FROM ("S 1"."T 3" r1 LEFT JOIN "S 1"."T 4" r4 ON (((r1.c1 = r4.c1)) AND ((r4.c1 < 10 WHERE (((r4.c1 < 10) OR (r4.c1 IS NULL))) AND ((r1.c1 < 10))
+   Remote SQL: SELECT r1.c1, r1.c2, r4.c1, r4.c2 FROM ("S 1"."T 3" r1 LEFT JOIN "S 1"."T 4" r4 ON (((r1.c1 = r4.c1)) AND ((r4.c1 < 10 WHERE (((r4.c1 IS NULL) OR (r4.c1 < 10))) AND ((r1.c1 < 10))
 (4 rows)
 
 SELECT t1.c1, t1.c2, t2.c1, t2.c2 FROM ft4 t1 LEFT JOIN (SELECT * FROM ft5 WHERE c1 < 10) t2 ON (t1.c1 = t2.c1)
@@ -3105,7 +3105,7 @@ select array_agg(distinct (t1.c1)%5) from ft4 t1 full join ft5 t2 on (t1.c1 = t2
  Foreign Scan
    Output: (array_agg(DISTINCT (t1.c1 % 5))), ((t2.c1 % 3))
    Relations: Aggregate on ((public.ft4 t1) FULL JOIN (public.ft5 t2))
-   Remote SQL: SELECT array_agg(DISTINCT (r1.c1 % 5)), (r2.c1 % 3) FROM ("S 1"."T 3" r1 FULL JOIN "S 1"."T 4" r2 ON (((r1.c1 = r2.c1 WHERE (((r1.c1 < 20) OR ((r1.c1 IS NULL) AND (r2.c1 < 5 GROUP BY 2 ORDER BY array_agg(DISTINCT (r1.c1 % 5)) ASC NULLS LAST
+   Remote SQL: SELECT array_agg(DISTINCT (r1.c1 % 5)), (r2.c1 % 3) FROM ("S 1"."T 3" r1 FULL JOIN "S 1"."T 4" r2 ON (((r1.c1 = r2.c1 WHERE r1.c1 IS NULL) AND (r2.c1 < 5)) OR (r1.c1 < 20))) GROUP BY 2 ORDER BY array_agg(DISTINCT (r1.c1 % 5)) ASC NULLS LAST
 (4 rows)
 
 select array_agg(distinct (t1.c1)%5) from ft4 t1 full join ft5 t2 on (t1.c1 = t2.c1) where t1.c1 < 20 or (t1.c1 is null and t2.c1 < 5) group by (t2.c1)%3 order by 1;
@@ -3123,7 +3123,7 @@ select array_agg(distinct (t1.c1)%5 order by (t1.c1)%5) from ft4 t1 full join ft
  Foreign Scan
    Output: (array_agg(DISTINCT (t1.c1 % 5) ORDER BY (t1.c1 % 5))), ((t2.c1 % 3))
    Relations: Aggregate on ((public.ft4 t1) FULL JOIN (public.ft5 t2))
-   Remote SQL: SELECT array_agg(DISTINCT (r1.c1 % 5) ORDER BY ((r1.c1 % 5)) ASC NULLS LAST), (r2.c1 % 3) FROM ("S 1"."T 3" r1 FULL JOIN "S 1"."T 4" r2 ON (((r1.c1 = r2.c1 WHERE (((r1.c1 < 20) OR ((r1.c1 IS NULL) AND (r2.c1 < 5 GROUP BY 2 ORDER BY 

Re: Missing LWLock protection in pgstat_reset_replslot()

2024-03-04 Thread Heikki Linnakangas

On 01/03/2024 12:15, Bertrand Drouvot wrote:

Hi hackers,

I think that pgstat_reset_replslot() is missing LWLock protection. Indeed, we
don't have any guarantee that the slot is active (which would prevent it from
being dropped/recreated) while this function is executed.


Yes, so it seems at a quick glance. We have a similar issue in 
pgstat_fetch_replslot(); it might return stats for the wrong slot if the 
slot is dropped/recreated concurrently. Do we care?



--- a/src/backend/utils/activity/pgstat_replslot.c
+++ b/src/backend/utils/activity/pgstat_replslot.c
@@ -46,6 +46,8 @@ pgstat_reset_replslot(const char *name)
 
 	Assert(name != NULL);
 
+	LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);

+
	/* Check if the slot exists with the given name. */
slot = SearchNamedReplicationSlot(name, true);


SearchNamedReplicationSlot() will also acquire the lock in LW_SHARED 
mode, when you pass need_lock=true. So that at least should be changed 
to false.
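
To make that concrete, a minimal sketch of the combined fix being discussed 
(a sketch only, not the final patch):

	LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);

	/* Check if the slot exists with the given name; we already hold the lock. */
	slot = SearchNamedReplicationSlot(name, false);

	/* ... validate the slot and reset its stats while the lock is held ... */

	LWLockRelease(ReplicationSlotControlLock);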


--
Heikki Linnakangas
Neon (https://neon.tech)





Re: Some shared memory chunks are allocated even if related processes won't start

2024-03-04 Thread 'Alvaro Herrera'
On 2024-Mar-05, Hayato Kuroda (Fujitsu) wrote:

> Basically sounds good. My concerns are:
> 
> * GetNamedDSMSegment() does not return a raw pointer to dsm_segment. This means
>   that it may be difficult to do dsm_unpin_segment on the caller side.

Maybe we don't need a "named" DSM segment at all, and instead just use
bare dsm segments (dsm_create and friends) or a DSA -- not sure.  But
see commit 31ae1638ce35, which removed use of a DSA in autovacuum/BRIN.
Maybe fixing this is just a matter of reverting that commit.  At the
time, there was a belief that DSA wasn't supported everywhere so we
couldn't use it for autovacuum workitem stuff, but I think our reliance
on DSA is now past the critical point.
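
For illustration, a minimal sketch of the "bare dsm segment" route (segsize
and the placement of the unpin are assumptions; error handling omitted):

	dsm_segment *seg = dsm_create(segsize, 0);
	dsm_handle	handle = dsm_segment_handle(seg);

	dsm_pin_segment(seg);		/* keep the segment alive beyond this backend */
	dsm_detach(seg);

	/* ... later, once the segment is truly no longer needed: */
	dsm_unpin_segment(handle);	/* drop the extra reference so it can go away */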

BTW, we should turn BRIN autosummarization on by default.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Las mujeres son como hondas:  mientras más resistencia tienen,
 más lejos puedes llegar con ellas"  (Jonas Nightingale, Leap of Faith)




Re: Hooking into ExplainOneQuery() complicated by missing standard_ExplainOneQuery

2024-03-04 Thread Mats Kindahl
On Tue, Mar 5, 2024 at 7:31 AM Michael Paquier  wrote:

> On Mon, Mar 04, 2024 at 03:41:16PM +0300, Aleksander Alekseev wrote:
> >> I wanted to hook into the EXPLAIN output for queries and add some
> >> extra information, but since there is no standard_ExplainOneQuery() I
> >> had to copy the code and create my own version.
> >>
> >> Since the pattern with other hooks for a function
> >> WhateverFunction() seems to be that there is a
> >> standard_WhateverFunction() for each WhateverFunction_hook, I
> >> created a patch to follow this pattern for your consideration.
>
> So you've wanted to be able to add some custom information at the end
> or the beginning of ExplainState's output buffer, before falling back
> to the in-core path.  What was the use case, if I may ask?
>

Yes, that was the use-case. We have some caches added by extending
ExecutorStart and other executor-related functions using the hooks there
and want to show cache hits and misses in the plan.

I realize it is possible to create a more advanced system where you can
customize the output even further, but in this case I just wanted to add a
section with some additional information related to plan execution. Also,
the code in explain.c does not seem to be written with extensibility in mind,
so I did not want to make too big a change here before thinking through how
this would work.
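
As a sketch, this is the kind of extension code the proposed
standard_ExplainOneQuery() would enable (my_ExplainOneQuery and the cache
counters are hypothetical names, not part of the patch):

static void
my_ExplainOneQuery(Query *query, int cursorOptions, IntoClause *into,
				   ExplainState *es, const char *queryString,
				   ParamListInfo params, QueryEnvironment *queryEnv)
{
	/* add our own section to the EXPLAIN output ... */
	ExplainPropertyInteger("Cache Hits", NULL, my_cache_hits, es);
	ExplainPropertyInteger("Cache Misses", NULL, my_cache_misses, es);

	/* ... then fall back to the in-core implementation */
	standard_ExplainOneQuery(query, cursorOptions, into, es,
							 queryString, params, queryEnv);
}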


> >> I was also considering adding a callback so that you can annotate
> >> any node with explanatory information that is not a custom scan
> >> node. This could be used to propagate and summarize information
> >> from custom scan nodes, but I had no immediate use for that so did
> >> not add it here. I would still be interested in hearing if you
> >> think this is something that would be useful to the community.
>
> That depends.
>

Just to elaborate: the intention was to allow a section to be added to
every node in the plan containing information from further down, and also
to allow this information to propagate upwards. We happen to have buffer
information right now; the idea is to allow something similar to be added
dynamically by extending ExplainNode and passing down a callback to
standard_ExplainOneQuery.

Best wishes,
Mats Kindahl


RE: Synchronizing slots from primary to standby

2024-03-04 Thread Zhijie Hou (Fujitsu)
On Tuesday, March 5, 2024 2:35 PM shveta malik  wrote:
> 
> On Tue, Mar 5, 2024 at 9:15 AM Amit Kapila  wrote:
> >
> > On Tue, Mar 5, 2024 at 6:10 AM Peter Smith 
> wrote:
> > >
> > > ==
> > > src/backend/replication/walsender.c
> > >
> > > 5. NeedToWaitForWal
> > >
> > > + /*
> > > + * Check if the standby slots have caught up to the flushed
> > > + position. It
> > > + * is good to wait up to the flushed position and then let the
> > > + WalSender
> > > + * send the changes to logical subscribers one by one which are
> > > + already
> > > + * covered by the flushed position without needing to wait on every
> > > + change
> > > + * for standby confirmation.
> > > + */
> > > + if (NeedToWaitForStandbys(flushed_lsn, wait_event)) return true;
> > > +
> > > + *wait_event = 0;
> > > + return false;
> > > +}
> > > +
> > >
> > > 5a.
> > > The comment (or part of it?) seems misplaced because it is talking
> > > WalSender sending changes, but that is not happening in this function.
> > >
> >
> > I don't think so. This is invoked only by walsender and a static
> > function. I don't see any other better place to mention this.
> >
> > > Also, isn't what this is saying already described by the other
> > > comment in the caller? e.g.:
> > >
> >
> > Oh no, here we are explaining the wait order.
> 
> I think there is scope for improvement here. The comment inside
> NeedToWaitForWal() which states that we need to wait for standbys on the
> flush position (and not on each change) should be outside of this function;
> it is too embedded. And the comment which states the order of waiting (first
> flush, then standby confirmation) should be outside the for-loop in
> WalSndWaitForWal(). But yes, we do need both comments. Attached is a
> patch (.txt) with the comment improvements; please merge if appropriate.

Thanks, I have slightly modified the top-up patch and merged it.

Attach the V106 patch which addressed above and Peter's comments[1].

[1] 
https://www.postgresql.org/message-id/CAHut%2BPsATK8z1TEcfFE8zWoS1hagqsvaWYCgom_zYtScfwO7uQ%40mail.gmail.com

Best Regards,
Hou zj


v106-0001-Allow-logical-walsenders-to-wait-for-the-physic.patch
Description:  v106-0001-Allow-logical-walsenders-to-wait-for-the-physic.patch


v106-0002-Document-the-steps-to-check-if-the-standby-is-r.patch
Description:  v106-0002-Document-the-steps-to-check-if-the-standby-is-r.patch


RE: Synchronizing slots from primary to standby

2024-03-04 Thread Zhijie Hou (Fujitsu)
On Tuesday, March 5, 2024 8:40 AM Peter Smith  wrote:
> 
> Here are some review comments for v105-0001
> 
> ==
> doc/src/sgml/config.sgml
> 
> 1.
> +   
> +The standbys corresponding to the physical replication slots in
> +standby_slot_names must configure
> +sync_replication_slots = true so they can receive
> +logical failover slots changes from the primary.
> +   
> 
> /slots changes/slot changes/

Changed.

> 
> ==
> doc/src/sgml/func.sgml
> 
> 2.
> +The function may be waiting if the specified slot is a logical failover
> +slot and standby_slot_names
> +is configured.
> 
> I know this has been through multiple versions already, but this
> latest wording "may be waiting..." doesn't seem very good to me.
> 
> How about one of these?
> 
> * The function may not be able to return immediately if the specified
> slot is a logical failover slot and standby_slot_names is configured.
> 
> * The function return might be blocked if the specified slot is a
> logical failover slot and standby_slot_names is configured.
> 
> * If the specified slot is a logical failover slot then the function
> will block until all physical slots specified in standby_slot_names
> have confirmed WAL receipt.
> 
> * If the specified slot is a logical failover slot then the function
> will not return until all physical slots specified in
> standby_slot_names have confirmed WAL receipt.

I prefer the last one.


> 
> ~~~
> 
> 3.
> +slot may return to an earlier position. The function may be waiting if
> +the specified slot is a logical failover slot and
> +standby_slot_names
> 
> 
> Same as previous review comment #2

Changed.

> 
> ==
> src/backend/replication/slot.c
> 
> 4. WaitForStandbyConfirmation
> 
> + * Used by logical decoding SQL functions that acquired logical failover 
> slot.
> 
> IIUC it doesn't work like that. pg_logical_slot_get_changes_guts()
> calls here unconditionally (i.e. the SQL functions don't even check if
> they are failover slots before calling this) so the comment seems
> misleading/redundant.

I removed the "acquired logical failover slot.".

> 
> ==
> src/backend/replication/walsender.c
> 
> 5. NeedToWaitForWal
> 
> + /*
> + * Check if the standby slots have caught up to the flushed position. It
> + * is good to wait up to the flushed position and then let the WalSender
> + * send the changes to logical subscribers one by one which are already
> + * covered by the flushed position without needing to wait on every change
> + * for standby confirmation.
> + */
> + if (NeedToWaitForStandbys(flushed_lsn, wait_event))
> + return true;
> +
> + *wait_event = 0;
> + return false;
> +}
> +
> 
> 5a.
> The comment (or part of it?) seems misplaced because it is talking
> WalSender sending changes, but that is not happening in this function.
> 
> Also, isn't what this is saying already described by the other comment
> in the caller? e.g.:
> 
> + /*
> + * Within the loop, we wait for the necessary WALs to be flushed to
> + * disk first, followed by waiting for standbys to catch up if there
> + * are enough WALs or upon receiving the shutdown signal. To avoid the
> + * scenario where standbys need to catch up to a newer WAL location in
> + * each iteration, we update our idea of the currently flushed
> + * position only if we are not waiting for standbys to catch up.
> + */
> 

I moved these comments based on Shveta's suggestion.

> ~
> 
> 5b.
> Most of the code is unnecessary. AFAICT all this is exactly same as just 1 
> line:
> 
> return NeedToWaitForStandbys(flushed_lsn, wait_event);

Changed.

> 
> ~~~
> 
> 6. WalSndWaitForWal
> 
> + /*
> + * Within the loop, we wait for the necessary WALs to be flushed to
> + * disk first, followed by waiting for standbys to catch up if there
> + * are enough WALs or upon receiving the shutdown signal. To avoid the
> + * scenario where standbys need to catch up to a newer WAL location in
> + * each iteration, we update our idea of the currently flushed
> + * position only if we are not waiting for standbys to catch up.
> + */
> 
> Regarding that 1st sentence: maybe this logic used to be done
> explicitly "within the loop" but IIUC this logic is now hidden inside
> NeedToWaitForWal() so the comment should mention that.

Changed.

Best Regards,
Hou zj


RE: Synchronizing slots from primary to standby

2024-03-04 Thread Zhijie Hou (Fujitsu)
On Monday, March 4, 2024 11:44 PM Bertrand Drouvot 
 wrote:
> 
> Hi,
> 
> On Mon, Mar 04, 2024 at 01:28:04PM +, Zhijie Hou (Fujitsu) wrote:
> > Attach the V105 patch set
> 
> Thanks!
> 
> Sorry I missed those during the previous review:

No problem, thanks for the comments!

> 
> 1 ===
> 
> Commit message: "these functions will block until"
> 
> s/block/wait/ ?
> 
> 2 ===
> 
> +when used with logical failover slots, will block until all
> 
> s/block/wait/ ?
> 
> It seems those are the 2 remaining "block" that could deserve the proposed
> above change.

I prefer using 'block' here, and it seems others also suggested changing
'wait' to 'block' [1].

> 
> 3 ===
> 
> +   invalidated = slot->data.invalidated != RS_INVAL_NONE;
> +   inactive = slot->active_pid == 0;
> 
> invalidated = (slot->data.invalidated != RS_INVAL_NONE); inactive =
> (slot->active_pid == 0);
> 
> instead?
> 
> I think it's easier to read and it looks like this is the way it's written in 
> other
> places (at least the few I checked).

I think the current code is consistent with other similar code in slot.c.
(grep "data.invalidated != RS_INVAL_NONE").

[1] 
https://www.postgresql.org/message-id/CAHut%2BPsATK8z1TEcfFE8zWoS1hagqsvaWYCgom_zYtScfwO7uQ%40mail.gmail.com

Best Regards,
Hou zj




Re: Synchronizing slots from primary to standby

2024-03-04 Thread Nisha Moond
I did performance tests for the v99 patch w.r.t. wait-time analysis.
Since this patch introduces a wait for the standby before sending changes
to a subscriber, I logged the time at the start and end of the
XLogSendLogical() call (which eventually calls WalSndWaitForWal()) at the
primary node and calculated the total time taken by this function during
the load run.
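
The measurement itself was a simple elapsed-time accumulation around the
call, roughly like this (total_us is a hypothetical counter added to
walsender.c just for the test):

	instr_time	start, duration;

	INSTR_TIME_SET_CURRENT(start);
	XLogSendLogical();
	INSTR_TIME_SET_CURRENT(duration);
	INSTR_TIME_SUBTRACT(duration, start);
	total_us += INSTR_TIME_GET_MICROSEC(duration);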

For load, I ran pgbench for 15 minutes:
Creating tables: pgbench -p 5833 postgres -qis 2
Running benchmark: pgbench postgres -p 5833 -c 10 -j 3 -T 900 -P 20

Machine details:
11th Gen Intel(R) Core(TM) i9-11950H @ 2.60GHz 32GB RAM
OS - Windows 10 Enterprise

Test setup:
Primary node -->
  -> One physical standby node
  -> One subscriber node having only one subscription with failover=true

-- the slot-sync relevant parameters are set to default (OFF) for all
the tests i.e.
hot_standby_feedback = off
sync_replication_slots = false

-- additional configuration on each instance:
shared_buffers = 6GB
max_worker_processes = 32
max_parallel_maintenance_workers = 24
max_parallel_workers = 32
synchronous_commit = off
checkpoint_timeout = 1d
max_wal_size = 24GB
min_wal_size = 15GB
autovacuum = off

To review the wait-time impact with and without the patch, I compared
three cases (two runs for each case):
(1) HEAD code:
time taken in run 1 = 103.935631 seconds
time taken in run 2 = 104.832186 seconds

(2) HEAD code + v99 patch ('standby_slot_names' is not set):
time taken in run 1 = 104.076343 seconds
time taken in run 2 = 103.116226 seconds

(3) HEAD code + v99 patch + a valid 'standby_slot_names' is set:
time taken in run 1 = 103.871012 seconds
time taken in run 2 = 103.793524 seconds

The time consumption of XLogSendLogical() is almost the same in all
cases, and no performance degradation is observed.

--
Thanks,
Nisha




Re: Synchronizing slots from primary to standby

2024-03-04 Thread shveta malik
On Tue, Mar 5, 2024 at 9:15 AM Amit Kapila  wrote:
>
> On Tue, Mar 5, 2024 at 6:10 AM Peter Smith  wrote:
> >
> > ==
> > src/backend/replication/walsender.c
> >
> > 5. NeedToWaitForWal
> >
> > + /*
> > + * Check if the standby slots have caught up to the flushed position. It
> > + * is good to wait up to the flushed position and then let the WalSender
> > + * send the changes to logical subscribers one by one which are already
> > + * covered by the flushed position without needing to wait on every change
> > + * for standby confirmation.
> > + */
> > + if (NeedToWaitForStandbys(flushed_lsn, wait_event))
> > + return true;
> > +
> > + *wait_event = 0;
> > + return false;
> > +}
> > +
> >
> > 5a.
> > The comment (or part of it?) seems misplaced because it is talking
> > WalSender sending changes, but that is not happening in this function.
> >
>
> I don't think so. This is invoked only by walsender and a static
> function. I don't see any other better place to mention this.
>
> > Also, isn't what this is saying already described by the other comment
> > in the caller? e.g.:
> >
>
> Oh no, here we are explaining the wait order.

I think there is scope for improvement here. The comment inside
NeedToWaitForWal() which states that we need to wait for standbys
on the flush position (and not on each change) should be outside of this
function; it is too embedded. And the comment which states the order
of waiting (first flush, then standby confirmation) should be outside
the for-loop in WalSndWaitForWal(). But yes, we do need both
comments. Attached is a patch (.txt) with the comment improvements;
please merge if appropriate.

thanks
Shveta
From e03332839dd804f3bc38937e677ba87ac10f981b Mon Sep 17 00:00:00 2001
From: Shveta Malik 
Date: Tue, 5 Mar 2024 11:29:52 +0530
Subject: [PATCH v2] Comments improvement.

---
 src/backend/replication/walsender.c | 28 ++--
 1 file changed, 14 insertions(+), 14 deletions(-)

diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 96f44ba85d..6a082b42eb 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1799,13 +1799,6 @@ NeedToWaitForWal(XLogRecPtr target_lsn, XLogRecPtr flushed_lsn,
 		return true;
 	}
 
-	/*
-	 * Check if the standby slots have caught up to the flushed position. It
-	 * is good to wait up to the flushed position and then let the WalSender
-	 * send the changes to logical subscribers one by one which are already
-	 * covered by the flushed position without needing to wait on every change
-	 * for standby confirmation.
-	 */
 	if (NeedToWaitForStandbys(flushed_lsn, wait_event))
 		return true;
 
@@ -1818,7 +1811,10 @@ NeedToWaitForWal(XLogRecPtr target_lsn, XLogRecPtr flushed_lsn,
  *
  * If the walsender holds a logical slot that has enabled failover, we also
  * wait for all the specified streaming replication standby servers to
- * confirm receipt of WAL up to RecentFlushPtr.
+ * confirm receipt of WAL up to RecentFlushPtr. It's beneficial to wait
+ * here for the confirmation from standbys up to RecentFlushPtr rather
+ * than waiting in ReorderBufferCommit() before transmitting each
+ * change to logical subscribers, which is already covered in RecentFlushPtr.
  *
  * Returns end LSN of flushed WAL.  Normally this will be >= loc, but if we
  * detect a shutdown request (either from postmaster or client) we will return
@@ -1847,6 +1843,12 @@ WalSndWaitForWal(XLogRecPtr loc)
 	else
 		RecentFlushPtr = GetXLogReplayRecPtr(NULL);
 
+	/*
+	 * In the loop, we wait for the necessary WALs to be flushed to disk
+	 * initially, and then we wait for standbys to catch up. Upon receiving
+	 * the shutdown signal, we immediately transition to waiting for standbys
+	 * to catch up.
+	 */
 	for (;;)
 	{
 		bool		wait_for_standby_at_stop = false;
@@ -1877,12 +1879,10 @@ WalSndWaitForWal(XLogRecPtr loc)
 			XLogBackgroundFlush();
 
 		/*
-		 * Within the loop, we wait for the necessary WALs to be flushed to
-		 * disk first, followed by waiting for standbys to catch up if there
-		 * are enough WALs or upon receiving the shutdown signal. To avoid the
-		 * scenario where standbys need to catch up to a newer WAL location in
-		 * each iteration, we update our idea of the currently flushed
-		 * position only if we are not waiting for standbys to catch up.
+		 * Update our idea of the currently flushed position, but do it only
+		 * if we haven't started waiting for standbys. This is done to prevent
+		 * standbys from needing to catch up to a more recent WAL location
+		 * with each iteration.
 		 */
 		if 

Re: Hooking into ExplainOneQuery() complicated by missing standard_ExplainOneQuery

2024-03-04 Thread Michael Paquier
On Mon, Mar 04, 2024 at 03:41:16PM +0300, Aleksander Alekseev wrote:
>> I wanted to hook into the EXPLAIN output for queries and add some
>> extra information, but since there is no standard_ExplainOneQuery() I
>> had to copy the code and create my own version. 
>>
>> Since the pattern with other hooks for a function
>> WhateverFunction() seems to be that there is a
>> standard_WhateverFunction() for each WhateverFunction_hook, I
>> created a patch to follow this pattern for your consideration.

So you've wanted to be able to add some custom information at the end
or the beginning of ExplainState's output buffer, before falling back
to the in-core path.  What was the use case, if I may ask?

>> I was also considering adding a callback so that you can annotate
>> any node with explanatory information that is not a custom scan
>> node. This could be used to propagate and summarize information
>> from custom scan nodes, but I had no immediate use for that so did
>> not add it here. I would still be interested in hearing if you
>> think this is something that would be useful to the community.

That depends.

> I registered the patch on the nearest open CF [1] and marked it as
> RfC. It is a pretty straightforward refactoring.
> 
> [1]: https://commitfest.postgresql.org/48/4879/

I know that we're in the middle of commit fest 47 while this is in 48,
but I can't really see a reason why we should not do that earlier than
v18.  One point about core is to be flexible for extension code.  So I
have no objections, others are free to comment.
--
Michael




Re: Improve eviction algorithm in ReorderBuffer

2024-03-04 Thread Peter Smith
Hi, here are some review comments for v7-0002

==
Commit Message

1.
This commit adds a hash table to binaryheap in order to track of
positions of each nodes in the binaryheap. That way, by using newly
added functions such as binaryheap_update_up() etc., both updating a
key and removing a node can be done in O(1) on an average and O(log n)
in worst case. This is known as the indexed binary heap. The caller
can specify to use the indexed binaryheap by passing indexed = true.

~

/to track of positions of each nodes/to track the position of each node/

~~~

2.
There is no user of it but it will be used by a upcoming patch.

~

The current code does not use the new indexing logic, but it will be
used by an upcoming patch.
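
To make the API concrete, here is a rough sketch of the intended usage
based on this commit message (binaryheap_update_up is named there; the
comparator, capacity, and node values are assumptions):

binaryheap *heap = binaryheap_allocate(capacity, my_compare,
									   true /* indexed */, NULL);

binaryheap_add(heap, PointerGetDatum(txn));

/* txn's key changed; fix its heap position in O(log n) via the index: */
binaryheap_update_up(heap, PointerGetDatum(txn));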

==
src/common/binaryheap.c

3.
+/*
+ * Define parameters for hash table code generation. The interface is *also*"
+ * declared in binaryheaph.h (to generate the types, which are externally
+ * visible).
+ */

Typo: *also*"

~~~

4.
+#define SH_PREFIX bh_nodeidx
+#define SH_ELEMENT_TYPE bh_nodeidx_entry
+#define SH_KEY_TYPE bh_node_type
+#define SH_KEY key
+#define SH_HASH_KEY(tb, key) \
+ hash_bytes((const unsigned char *) &key, sizeof(bh_node_type))
+#define SH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(bh_node_type)) == 0)
+#define SH_SCOPE extern
+#ifdef FRONTEND
+#define SH_RAW_ALLOCATOR pg_malloc0
+#endif
+#define SH_DEFINE
+#include "lib/simplehash.h"

4a.
The comment in simplehash.h says
 *   The following parameters are only relevant when SH_DEFINE is defined:
 *   - SH_KEY - ...
 *   - SH_EQUAL(table, a, b) - ...
 *   - SH_HASH_KEY(table, key) - ...
 *   - SH_STORE_HASH - ...
 *   - SH_GET_HASH(tb, a) - ...

So maybe it is nicer to reorder the #defines in that same order?

SUGGESTION:
+#define SH_PREFIX bh_nodeidx
+#define SH_ELEMENT_TYPE bh_nodeidx_entry
+#define SH_KEY_TYPE bh_node_type
+#define SH_SCOPE extern
+#ifdef FRONTEND
+#define SH_RAW_ALLOCATOR pg_malloc0
+#endif

+#define SH_DEFINE
+#define SH_KEY key
+#define SH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(bh_node_type)) == 0)
+#define SH_HASH_KEY(tb, key) \
+ hash_bytes((const unsigned char *) &key, sizeof(bh_node_type))
+#include "lib/simplehash.h"

~~

4b.
The comment in simplehash.h says that "it's preferable, if possible,
to store the element's hash in the element's data type", so should
SH_STORE_HASH and SH_GET_HASH also be defined here?

~~~

5.
+ *
+ * If 'indexed' is true, we create a hash table to track of each node's
+ * index in the heap, enabling to perform some operations such as removing
+ * the node from the heap.
  */
 binaryheap *
-binaryheap_allocate(int capacity, binaryheap_comparator compare, void *arg)
+binaryheap_allocate(int capacity, binaryheap_comparator compare,
+ bool indexed, void *arg)

BEFORE
... enabling to perform some operations such as removing the node from the heap.

SUGGESTION
... to help make operations such as removing nodes more efficient.

~~~

6.
+ heap->bh_indexed = indexed;
+ if (heap->bh_indexed)
+ {
+#ifdef FRONTEND
+ heap->bh_nodeidx = bh_nodeidx_create(capacity, NULL);
+#else
+ heap->bh_nodeidx = bh_nodeidx_create(CurrentMemoryContext, capacity,
+ NULL);
+#endif
+ }
+

The heap allocation just uses palloc instead of palloc0 so it might be
better to assign "heap->bh_nodeidx = NULL;" up-front, just so you will
never get a situation where bh_indexed is false but bh_nodeidx has
some (garbage) value.

~~~

7.
+/*
+ * Set the given node at the 'index', and updates its position accordingly.
+ *
+ * Return true if the node's index is already tracked.
+ */
+static bool
+bh_set_node(binaryheap *heap, bh_node_type node, int index)

7a.
I felt the 1st sentence should be more like:

SUGGESTION
Set the given node at the 'index' and track it if required.

~

7b.
IMO the parameters would be better the other way around (e.g. 'index'
before the 'node') because that's what the assignments look like:


heap->bh_nodes[heap->bh_size] = d;

becomes:
bh_set_node(heap, heap->bh_size, d);

~~~

8.
+static bool
+bh_set_node(binaryheap *heap, bh_node_type node, int index)
+{
+ bh_nodeidx_entry *ent;
+ bool found = false;
+
+ /* Set the node to the nodes array */
+ heap->bh_nodes[index] = node;
+
+ if (heap->bh_indexed)
+ {
+ /* Remember its index in the nodes array */
+ ent = bh_nodeidx_insert(heap->bh_nodeidx, node, &found);
+ ent->idx = index;
+ }
+
+ return found;
+}

8a.
That 'ent' declaration can be moved to the inner block scope, so it is
closer to where it is needed.

~

8b.
+ /* Remember its index in the nodes array */

The comment is worded a bit ambiguously. IMO a simpler comment would
be: "/* Keep track of the node index. */"

~~~

9.
+static void
+bh_delete_nodeidx(binaryheap *heap, bh_node_type node)
+{
+ if (!heap->bh_indexed)
+ return;
+
+ (void) bh_nodeidx_delete(heap->bh_nodeidx, node);
+}

Since there is only 1 statement IMO it is simpler to write this
function like below:

if (heap->bh_indexed)
  (void) bh_nodeidx_delete(heap->bh_nodeidx, node);

~~~

10.
+/*
+ * Replace the node at 'idx' with the 

Re: Add comment to specify timeout unit in ConditionVariableTimedSleep()

2024-03-04 Thread Michael Paquier
On Tue, Mar 05, 2024 at 09:39:11AM +0530, shveta malik wrote:
> ConditionVariableTimedSleep() accepts a timeout parameter, but it
> doesn't explicitly state the unit for the timeout anywhere. To
> determine this, one needs to look into the details of the function to
> find it out from the comments of the internally called function
> WaitLatch(). It would be beneficial to include a comment in the header
> of ConditionVariableTimedSleep() specifying that the timeout is in
> milliseconds, similar to what we have for other non-static functions
> like WaitLatch and WaitEventSetWait. Attached the patch for the same.

That sounds like a good idea to me, so I'm OK with your suggestion.
--
Michael




Re: Make COPY format extendable: Extract COPY TO format implementations

2024-03-04 Thread Michael Paquier
On Mon, Mar 04, 2024 at 02:11:08PM +0900, Sutou Kouhei wrote:
> If this is a blocker of making COPY format extendable, can
> we defer moving the existing text/csv/binary format
> implementations to Copy{From,To}Routine for now as Michael
> suggested to proceed making COPY format extendable? (Can we
> add Copy{From,To}Routine without changing the existing
> text/csv/binary format implementations?)

Yeah, I assume that it would be the way to go so as we don't do any
dispatching in default cases.  A different approach that could be done
is to hide some of the parts of binary and text/csv in inline static
functions that are equivalent to the routine callbacks.  That's
similar to the previous versions of the patch set, but if we come back
to the argument that there is a risk of blocking optimizations of more
of the local areas of the per-row processing in NextCopyFrom() and
CopyOneRowTo(), what you have sounds like a good balance.

CopyOneRowTo() could do something like that to avoid the extra
indentation:
if (cstate->routine)
{
cstate->routine->CopyToOneRow(cstate, slot);
MemoryContextSwitchTo(oldcontext);
return;
}

NextCopyFrom() does not need to be concerned by that.
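
For readers following along, this is the rough shape of the callback
struct being discussed (field names here are illustrative; the exact API
is defined by the patch):

typedef struct CopyToRoutine
{
	/* set up format-specific state before any rows are sent */
	void		(*CopyToStart) (CopyToState cstate, TupleDesc tupDesc);
	/* emit one row */
	void		(*CopyToOneRow) (CopyToState cstate, TupleTableSlot *slot);
	/* tear down after the last row */
	void		(*CopyToEnd) (CopyToState cstate);
} CopyToRoutine;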

> I attach a patch for it.

> There is a large hunk for CopyOneRowTo() that is caused by
> indent change. I also attach "...-w.patch" that uses "git
> -w" to remove space only changes. "...-w.patch" is only for
> review. We should use .patch without -w for push.

I didn't know this trick.  That's indeed nice.  I may use that for
other stuff to make patches more presentable to the eyes.  And that's
available as well with `git diff`.

If we basically agree about this part, how would the rest work out
with this set of APIs and the possibility to plug in a custom value
for FORMAT to do a pg_proc lookup, including an example of how these
APIs can be used?
--
Michael




Re: a wrong index choose when statistics is out of date

2024-03-04 Thread Andy Fan


David Rowley  writes:

> On Tue, 5 Mar 2024 at 00:37, Andy Fan  wrote:
>>
>> David Rowley  writes:
>> > I don't think it would be right to fudge the costs in any way, but I
>> > think the risk factor for IndexPaths could take into account the
>> > number of unmatched index clauses and increment the risk factor, or
>> > "certainty_factor" as it is currently in my brain-based design. That
>> > way add_path() would be more likely to prefer the index that matches
>> > the most conditions.
>>
>> This is somehow similar with my proposal at [1]?  What do you think
>> about the treat 'col op const' as 'col op $1' for the marked column?
>> This could just resolve a subset of questions in your mind, but the
>> method looks have a solid reason.
>
> Do you mean this?

Yes; it was not cautious enough of me to say "similar" so quickly.

After reading your opinion again, I think what you are trying to do is
add one more dimension to Path, compared with the existing cost and
pathkey information, and it would take effect at the add_path stage. That
is impressive, and I'm quite willing to do more testing once the v1 is
done.

I just noted you have expressed your idea about my proposal 1,

> We should do anything like add column options in the meantime. Those
> are hard to remove once added.

I will try it very soon. And I'm a bit upset that no one cares about my
proposal 2, the AI method. I see many companies wanting to introduce AI
into the planner, and even though I haven't seen any impressive success,
this use case looks like a candidate.

>> + /*
>> + * To make the planner more robust to handle some inaccurate statistics
>> + * issue, we will add a extra cost to qpquals so that the less qpquals
>> + * the lower cost it has.
>> + */
>> + cpu_run_cost += 0.01 * list_length(qpquals);
>
> I don't think it's a good idea to add cost penalties like you proposed
> there. This is what I meant by "I don't think it would be right to
> fudge the costs in any way".
>
> If you modify the costs to add some small penalty so that the planner
> is more likely to favour some other plan, what happens if we then
> decide the other plan has some issue and we want to penalise that for
> some other reason? Adding the 2nd penalty might result in the original
> plan choice again. Which one should be penalised more? I think the
> uncertainty needs to be tracked separately.
>
> Fudging the costs like this is also unlikely to play nicely with
> add_path's use of STD_FUZZ_FACTOR.  There'd be an incentive to do
> things like total_cost *= STD_FUZZ_FACTOR; to ensure we get a large
> enough penalty.

I agree and I just misunderstood your proposal yesterday. 
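
For reference, the fuzzy comparison mentioned above works roughly like
this (a simplified sketch of compare_path_costs_fuzzily() in
src/backend/optimizer/util/pathnode.c; add_path passes STD_FUZZ_FACTOR,
which is 1.01):

	if (path1->total_cost > path2->total_cost * fuzz_factor)
	{
		/* path1 is fuzzily worse on total cost; compare startup costs next */
	}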

-- 
Best Regards
Andy Fan





Re: POC, WIP: OR-clause support for indexes

2024-03-04 Thread Andrei Lepikhov

On 4/3/2024 09:26, jian he wrote:

On Thu, Feb 29, 2024 at 4:59 PM Andrei Lepikhov

Feel free to add, change or totally rewrite these changes.

On replacement of the static ScalarArrayOpExpr dest with a dynamically
allocated one: after the discussion [1], I agree with that replacement.

Some style (and language) changes in the comments I haven't applied because 
they look debatable to me.



I think it should be something like:
+ gettext_noop("Transform a sequence of OR expressions to an array
expression."),
+ gettext_noop("The planner will replace expression like 'x=c1 OR x=c2 "
+ "to the expression 'x = ANY( ARRAY[c1,c2])''"

Fixed


queryId may not be a good variable name here?
Not sure. QueryId is a concept, part of the query jumbling technique, and 
can be used by other tools. It just tells the developer that it is the same 
thing as query jumbling, but for a separate expression.
At least you don't insist on removing the JumbleState return pointer, which 
looks strange for a simple hash ...


comment `/* Compute query ID */`
seems not correct, here we are just hashing the expression?

The same as above.

+/*
+ * Dynahash match function to use in guc_hashtab
the above comments seem not correct?

Yes, fixed.


` It applies to equality expressions only.` seems not correct?
`select * from tenk1 where unique1 < 1 or unique1 < 2; ` can also do
the transformation.

Yes, I forgot it.

`similarity of variable sides.` seems not correct,
should  it be 'sameness of the variable sides`?

The term 'equivalence' looks better *).


in [2], we can get:
SOME is a synonym for ANY. IN is equivalent to = ANY.

but still transforming OR to ANY is not intuitive.
a normal user may not know what is "transforming OR to ANY".
so maybe adding a simple example at

would be great. which, I did at previous thread.
Not sure. Examples in that section are unusual. What's more, should a 
user who doesn't know what it means be changing this setting at all? 
Let's wait for other opinions.


[1] https://www.postgresql.org/message-id/2157387.1709068...@sss.pgh.pa.us

--
regards,
Andrei Lepikhov
Postgres Professional





Re: speed up a logical replica setup

2024-03-04 Thread Shlok Kyal
Hi,
> I applied the v25 patch and did RUN the 'pg_createsubscriber' command.
> I was unable to execute it and experienced the following error:
>
> $ ./pg_createsubscriber -D node2/ -P "host=localhost port=5432
> dbname=postgres"  -d postgres -d db1 -p 9000 -r
> ./pg_createsubscriber: invalid option -- 'p'
> pg_createsubscriber: hint: Try "pg_createsubscriber --help" for more
> information.
>
> Here, the --p is not accepting the desired port number. Thoughts?

I investigated it and found that:

+ while ((c = getopt_long(argc, argv, "d:D:nP:rS:t:v",
+ long_options, &option_index)) != -1)
+ {

Here 'p', 's' and 'U' options are missing so we are getting the error.
Also, I think the 'S' option should be removed from here.
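
Presumably the fix is just to adjust the optstring, something like the
following (the exact letter set depends on which options
pg_createsubscriber is meant to accept):

+ while ((c = getopt_long(argc, argv, "d:D:np:P:rs:t:U:v",
+ long_options, &option_index)) != -1)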

I also see that specifying long options like --subscriber-port,
--subscriber-username, --socket-directory works fine.
Thoughts?

Thanks and regards,
Shlok Kyal




Re: remaining sql/json patches

2024-03-04 Thread jian he
On Tue, Mar 5, 2024 at 9:22 AM Amit Langote  wrote:
>
> Thanks for the heads up.  Attaching rebased patches.
>

Walking through the v41-0001-Add-SQL-JSON-query-functions.patch documentation.
I found some minor cosmetic issues.

+   
+select json_query(jsonb '{"a": "[1, 2]"}', 'lax $.a'
RETURNING int[] OMIT QUOTES);
+
+   
this example is not so good: it returns NULL, which makes it harder to
render the result.

+   context_item (the document); seen
+for more details on what
+   path_expression can contain.
"seen" should be "see"?

+   
+This function must return a JSON string, so if the path expression
+returns multiple SQL/JSON items, you must wrap the result using the
+WITH WRAPPER clause.  If the wrapper is
"must" may be not correct?
since we have a RETURNING clause.
"generally" may be more accurate, I think.
maybe we can rephrase the sentence:
+This function generally return a JSON string, so if the path expression
+yield multiple SQL/JSON items, you must wrap the result using the
+WITH WRAPPER clause

+is spcified, the returned value will be of type text.
+If no RETURNING is spcified, the returned value will
two typos, and should be "specified".

+Note that if the path_expression
+is strict and ON ERROR behavior
+is ON ERROR, an error is generated if it yields no
+items.

Maybe the following:
+Note that if the path_expression
+is strict and ON ERROR behavior
+is ERROR, an error is generated if it yields no
+items.

In most places you use
 path_expression
but in two places you use different markup for
path_expression
I guess that's OK, but the appearance is different (one rendering is
more prominent). Anyway, it is a minor issue.

+json_query.  Note that scalar strings returned
+by json_value always have their quotes removed,
+equivalent to what one would get with OMIT QUOTES
+when using json_query.

I think we can simplify it like the following:

+json_query.  Note that scalar strings returned
+by json_value always have their quotes removed,
+equivalent to OMIT QUOTES
+when using json_query.




Re: remaining sql/json patches

2024-03-04 Thread Andy Fan


Hi,

> On Tue, Mar 5, 2024 at 12:03 AM Alvaro Herrera  
> wrote:
>> On 2024-Mar-04, Erik Rijkers wrote:
>>
>> > In my hands (applying with patch), the patches, esp. 0001, do not apply.
>> > But I see the cfbot builds without problem so maybe just ignore these 
>> > FAILED
>> > lines.  Better get them merged - so I can test there...
>>
>> It's because of dbbca2cf299b.  It should apply cleanly if you do "git
>> checkout dbbca2cf299b^" first ...  That commit is so recent that
>> evidently the cfbot hasn't had a chance to try this patch again since it
>> went in, which is why it's still green.
>
> Thanks for the heads up.  Attaching rebased patches.

In the commit message of 0001, we have:

"""
Both JSON_VALUE() and JSON_QUERY() functions have options for
handling EMPTY and ERROR conditions, which can be used to specify
the behavior when no values are matched and when an error occurs
during evaluation, respectively.

All of these functions only operate on jsonb values. The workaround
for now is to cast the argument to jsonb.
"""

It is not clear to me why we introduce the JSON_VALUE() function: is it
for handling the EMPTY or ERROR conditions? I think the existing cast
workaround has a similar capability.

Then, if it is introduced as a performance improvement like [1], the test
at [1] might be interesting. If this is the case, the method in [1] can
save the user from modifying their queries to use the new function.

[1] https://www.postgresql.org/message-id/8734t6c5rh@163.com

-- 
Best Regards
Andy Fan





Add comment to specify timeout unit in ConditionVariableTimedSleep()

2024-03-04 Thread shveta malik
Hi hackers,

ConditionVariableTimedSleep() accepts a timeout parameter, but it
doesn't explicitly state the unit for the timeout anywhere. To
determine this, one needs to look into the details of the function to
find it out from the comments of the internally called function
WaitLatch(). It would be beneficial to include a comment in the header
of ConditionVariableTimedSleep() specifying that the timeout is in
milliseconds, similar to what we have for other non-static functions
like WaitLatch and WaitEventSetWait. Attached the patch for the same.
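
For reference, the function in question (its timeout is indeed in
milliseconds, which is what the patch documents):

bool
ConditionVariableTimedSleep(ConditionVariable *cv, long timeout,
							uint32 wait_event_info);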

thanks
Shveta


v1-0001-Specify-timeout-unit-in-ConditionVariableTimedSle.patch
Description: Binary data


Re: initdb's -c option behaves wrong way?

2024-03-04 Thread Kyotaro Horiguchi
At Mon, 4 Mar 2024 09:39:39 +0100, Daniel Gustafsson  wrote in 
> 
> 
> > On 4 Mar 2024, at 02:01, Tom Lane  wrote:
> > 
> > Daniel Gustafsson  writes:
> >> I took the liberty to add this to the upcoming CF to make sure we don't 
> >> lose
> >> track of it.
> > 
> > Thanks for doing that, because the cfbot pointed out a problem:
> > I should have written pg_strncasecmp not strncasecmp.  If this
> > version tests cleanly, I'll push it.
> 
> +1, LGTM.

Thank you for fixing this, Tom!

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: MAINTAIN privilege -- what do we need to un-revert it?

2024-03-04 Thread Jeff Davis
On Wed, 2024-02-28 at 09:29 -0800, Jeff Davis wrote:
> On Wed, 2024-02-28 at 10:55 -0600, Nathan Bossart wrote:
> > I'm afraid I don't have a better idea than adding a short note in
> > each
> > affected command's page.
> 
> OK, that works for now.

Committed.

The only changes are documentation and test updates.

This is a behavior change, so it still carries some risk, though we've
had a lot of discussion and generally it seems to be worth it. If it
turns out worse than expected during beta, of course we can re-revert
it.

I will restate the risks here, which come basically from two places:

(1) Functions called from index expressions which rely on search_path
(and don't have a SET clause).

Such a function would have already been fairly broken before my commit,
because anyone accessing the table without the right search_path would
have seen an error or wrong results. And there is no means to set the
"right" search_path for autoanalyze or logical replication, so those
would not have worked with such a broken function before my commit, no
matter what.

That being said, surely some users did have such broken functions, and
with this commit, they will have to remedy them with a SET clause.
Fortunately, the performance impact of doing so has been greatly
reduced.

(2) Materialized views which call functions that rely on search_path
(and don't have a SET clause).

This is arguably a worse kind of breakage because materialized views
are often refreshed only by the table owner, and it's easier to control
search_path when running REFRESH. Additionally, functions called from
materialized views are more likely to be "interesting" than functions
called from an index expression. However, the remedy is
straightforward: use a SET clause.

Regards,
Jeff Davis





Re: speed up a logical replica setup

2024-03-04 Thread Shubham Khanna
On Sat, Mar 2, 2024 at 2:19 AM Euler Taveira  wrote:
>
> On Thu, Feb 22, 2024, at 12:45 PM, Hayato Kuroda (Fujitsu) wrote:
>
> Based on idea from Euler, I roughly implemented. Thought?
>
> 0001-0013 were not changed from the previous version.
>
> V24-0014: addressed your comment in the replied e-mail.
> V24-0015: Add disconnect_database() again, per [3]
> V24-0016: addressed your comment in [4].
> V24-0017: addressed your comment in [5].
> V24-0018: addressed your comment in [6].
>
>
> Thanks for your review. I'm attaching v25 that hopefully addresses all pending
> points.
>
> Regarding your comments [1] on v21, I included changes for almost all items
> except 2, 20, 23, 24, and 25. I still think item 2 is not required because
> pg_ctl will provide a suitable message. I decided not to rearrange the block 
> of
> SQL commands (item 20) mainly because it would avoid these objects on node_f.
> Do we really need command_checks_all? Depending on the output it uses
> additional cycles than command_ok.
>
> In summary:
>
> v24-0002: documentation is updated. I didn't apply this patch as-is. Instead, 
> I
> checked what you wrote and fixed some gaps in what I had written.
> v24-0003: as I said I don't think we need to add it, however, I won't fight
> against it if people want to add this check.
> v24-0004: I spent some time on it. This patch is not completed. I cleaned it 
> up
> and include the start_standby_server code. It starts the server using the
> specified socket directory, port and username, hence, preventing external 
> client
> connections during the execution.
> v24-0005: partially applied
> v24-0006: applied with cosmetic change
> v24-0007: applied with cosmetic change
> v24-0008: applied
> v24-0009: applied with cosmetic change
> v24-0010: not applied. Based on #15, I refactored this code a bit. pg_fatal is
> not used when there is a database connection open. Instead, pg_log_error()
> followed by disconnect_database(). In cases that it should exit immediately,
> disconnect_database() has a new parameter (exit_on_error) that controls if it
> needs to call exit(1). I go ahead and did the same for connect_database().
> v24-0011: partially applied. I included some of the suggestions (18, 19, and 
> 21).
> v24-0012: not applied. Under reflection, after working on v24-0004, the target
> server will start with new parameters (that only accepts local connections),
> hence, during the execution is not possible anymore to detect if the target
> server is a primary for another server. I added a sentence for it in the
> documentation (Warning section).
> v24-0013: good catch. Applied.
> v24-0014: partially applied. After some experiments I decided to use a small
> number of attempts. The current code didn't reset the counter if the 
> connection
> is reestablished. I included the documentation suggestion. I didn't include 
> the
> IF EXISTS in the DROP PUBLICATION because it doesn't solve the issue. Instead,
> I refactored the drop_publication() to not try again if the DROP PUBLICATION
> failed.
> v24-0015: not applied. I refactored the exit code to do the right thing: print
> error message, disconnect database (if applicable) and exit.
> v24-0016: not applied. But checked if the information was presented in the
> documentation; it is.
> v24-0017: good catch. Applied.
> v24-0018: not applied.
>
> I removed almost all boolean return and include the error logic inside the
> function. It reads better. I also changed the connect|disconnect_database
> functions to include the error logic inside it. There is a new parameter
> on_error_exit for it. I removed the action parameter from pg_ctl_status() -- I
> think someone suggested it -- and the error message was moved to outside the
> function. I improved the cleanup routine. It provides useful information if it
> cannot remove the object (publication or replication slot) from the primary.
>
> Since I applied v24-0004, I realized that extra service starts / stops are
> required. It means pg_createsubscriber doesn't start the transformation with
> the current standby settings. Instead, it stops the standby if it is running and
> start it with the provided command-line options (socket, port,
> listen_addresses). It has a few drawbacks:
> * See v34-0012. It cannot detect if the target server is a primary for another
>   server. It is documented.
> * I also removed the check for whether the standby is running. If the standby was stopped 
> a
>   long time ago, it will take some time to reach the start point.
> * Dry run mode has to start / stop the service to work correctly. Is it an
>   issue?
>
> However, while I decided to include the --retain option, I'm thinking about
> removing it.
> If the logging is enabled, the information during the pg_createsubscriber will
> be available. The client log can be redirected to a file for future 
> inspection.
>
> Comments?

I applied the v25 patch and did RUN the 'pg_createsubscriber' command.
I was unable to execute it and experienced the following 

Re: Synchronizing slots from primary to standby

2024-03-04 Thread Amit Kapila
On Tue, Mar 5, 2024 at 6:10 AM Peter Smith  wrote:
>
> ==
> src/backend/replication/walsender.c
>
> 5. NeedToWaitForWal
>
> + /*
> + * Check if the standby slots have caught up to the flushed position. It
> + * is good to wait up to the flushed position and then let the WalSender
> + * send the changes to logical subscribers one by one which are already
> + * covered by the flushed position without needing to wait on every change
> + * for standby confirmation.
> + */
> + if (NeedToWaitForStandbys(flushed_lsn, wait_event))
> + return true;
> +
> + *wait_event = 0;
> + return false;
> +}
> +
>
> 5a.
> The comment (or part of it?) seems misplaced because it is talking
> WalSender sending changes, but that is not happening in this function.
>

I don't think so. This is invoked only by walsender and a static
function. I don't see any other better place to mention this.

> Also, isn't what this is saying already described by the other comment
> in the caller? e.g.:
>

Oh no, here we are explaining the wait order.

-- 
With Regards,
Amit Kapila.




Re: Synchronizing slots from primary to standby

2024-03-04 Thread Amit Kapila
On Mon, Mar 4, 2024 at 2:27 PM Bertrand Drouvot
 wrote:
>
> On Thu, Feb 29, 2024 at 03:38:59PM +0530, Amit Kapila wrote:
> > On Thu, Feb 29, 2024 at 9:13 AM Peter Smith  wrote:
> > >
> > > On Tue, Feb 27, 2024 at 11:35 PM Amit Kapila  
> > > wrote:
> > > >
> > >
> > > > Also, adding wait sounds
> > > > more like a boolean. So, I don't see the proposed names any better
> > > > than the current one.
> > > >
> > >
> > > Anyway, the point is that the current GUC name 'standby_slot_names' is
> > > not ideal IMO because it doesn't have enough meaning by itself -- e.g.
> > > you have to read the accompanying comment or documentation to have any
> > > idea of its purpose.
> > >
> >
> > Yeah, one has to read the description but that is true for other
> > parameters like "temp_tablespaces". I don't have any better ideas but
> > open to suggestions.
>
> What about "non_lagging_standby_slots"?
>

I still prefer the current one as that at least resembles with
existing synchronous_standby_names. I think we can change the GUC name
if we get an agreement on a better name before release. At this stage,
let's move with the current one.

-- 
With Regards,
Amit Kapila.




Re: Improve eviction algorithm in ReorderBuffer

2024-03-04 Thread vignesh C
On Wed, 28 Feb 2024 at 11:40, Amit Kapila  wrote:
>
> On Mon, Feb 26, 2024 at 7:54 PM Masahiko Sawada  wrote:
> >
>
> A few comments on 0003:
> ===
> 1.
> +/*
> + * Threshold of the total number of top-level and sub transactions
> that controls
> + * whether we switch the memory track state. While the MAINTAIN_HEAP state is
> + * effective when there are many transactions being decoded, in many systems
> + * there is generally no need to use it as long as all transactions
> being decoded
> + * are top-level transactions. Therefore, we use MaxConnections as
> the threshold
> + * so we can prevent switch to the state unless we use subtransactions.
> + */
> +#define REORDER_BUFFER_MEM_TRACK_THRESHOLD MaxConnections
>
> The comment seems to imply that MAINTAIN_HEAP is useful for large
> number of transactions but ReorderBufferLargestTXN() switches to this
> state even when there is one transaction. So, basically we use the
> binary_heap technique to get the largest even when we have one
> transaction but we don't maintain that heap unless we have
> REORDER_BUFFER_MEM_TRACK_THRESHOLD number of transactions are
> in-progress. This means there is some additional work when (build and
> reset heap each time when we pick largest xact) we have fewer
> transactions in the system but that may not be impacting us because of
> other costs involved like serializing all the changes. I think once we
> can try to stress test this by setting
> debug_logical_replication_streaming to 'immediate' to see if the new
> mechanism has any overhead.

I ran the test with a transaction having many inserts:

       | 5000  | 10000  | 20000  | 100000  | 1000000  | 10000000
-------|-------|--------|--------|---------|----------|----------
Head   | 26.31 | 48.84  | 93.65  | 480.05  | 4808.29  | 47020.16
Patch  | 26.35 | 50.8   | 97.99  | 484.8   | 4856.95  | 48108.89

The same test with debug_logical_replication_streaming = 'immediate':

       | 5000  | 10000  | 20000  | 100000  | 1000000  | 10000000
-------|-------|--------|--------|---------|----------|-----------
Head   | 59.29 | 115.84 | 227.21 | 1156.08 | 11367.42 | 113986.14
Patch  | 62.45 | 120.48 | 240.56 | 1185.12 | 11855.37 | 119921.81

The execution time is in milliseconds. The column header indicates the
number of inserts in the transaction.
In this case I noticed that the test execution with patch was taking
slightly more time.

Regards,
Vignesh




Re: Support "Right Semi Join" plan shapes

2024-03-04 Thread wenhui qiu
Hi Richard,
 Agree, +1. I think it can be pushed now.

On Tue, 5 Mar 2024 at 10:44, Richard Guo  wrote:

>
> On Tue, Jan 30, 2024 at 2:51 PM Alena Rybakina 
> wrote:
>
>> I have reviewed your patch and I think it is better to add an Assert for
>> JOIN_RIGHT_SEMI to the ExecMergeJoin and ExecNestLoop functions to
>> prevent the use of RIGHT_SEMI for these types of connections (NestedLoop
>> and MergeJoin).
>
>
> Hmm, I don't see why this is necessary.  The planner should already
> guarantee that we won't have nestloops/mergejoins with right-semi joins.
>
> Thanks
> Richard
>


Re: Shared detoast Datum proposal

2024-03-04 Thread Andy Fan


>
>> 2. It is more likely to use up all the memory that is allowed. For
>> example: if we set the limit to 1MB, then we keep more data that will
>> not be used, eventually consuming all of the 1MB.
>> 
>> My method resolves this with some help from other modules (kind of
>> invasive) but can control eviction well and use as little memory as
>> possible.
>> 
>
> Is the memory usage really an issue? Sure, it'd be nice to evict entries
> as soon as we can prove they are not needed anymore, but if the cache
> limit is set to 1MB it's not really a problem to use 1MB.

This might be a key point which leads us to some different directions, so
I want to explain more about this, to see if we can get some agreement
here.

It is a bit hard to decide which memory limit to set: 1MB, 10MB, 40MB,
or 100MB. In my current case it is at least 40MB. A low memory limit
makes the cache ineffective, while a high limit causes the potential
memory use-up issue in the TOAST cache design. But in my method, even if
we set a higher value, memory is used only when it is really (nearly)
needed, and no more values are cached until the limit is hit. This makes
a noticeable difference when we want to set a high limit and we have
some highly active sessions, like 100 * 40MB = 4GB. 

> On 3/4/24 18:08, Andy Fan wrote:
>> ...
>>>
 I assumed that releasing all of the memory at the end of the executor at
 once is not an option since it may consume too much memory. Then, when and
 which entry to release becomes a trouble for me. For example:

   QUERY PLAN
 --
  Nested Loop
Join Filter: (t1.a = t2.a)
->  Seq Scan on t1
->  Seq Scan on t2
 (4 rows)

 In this case t1.a needs a longer lifespan than t2.a since it is
 in the outer relation. Without help from the slot's life-cycle system, I
 can't think of an answer to the above question.

>>>
>>> This is true, but how likely are such plans? I mean, surely no one would
>>> do a nested loop with sequential scans on reasonably large tables, so how
>>> representative is this example?
>> 
>> Actually this is the simplest join case; we still have the same problem
>> with Nested Loop + Index Scan, which will be pretty common. 
>> 
>
> Yes, I understand there are cases where LRU eviction may not be the best
> choice - like here, where the "t1" should stay in the cache. But there
> are also cases where this is the wrong choice, and LRU would be better.
>
> For example a couple paragraphs down you suggest to enforce the memory
> limit by disabling detoasting if the memory limit is reached. That means
> the detoasting can get disabled because there's a single access to the
> attribute somewhere "up the plan tree". But what if the other attributes
> (which now won't be detoasted) are accessed many times until then?

I am not sure I follow here, but I want to explain more about the
disable-detoast-sharing logic when the memory limit is hit. When this
happens, detoast sharing is disabled, but since the detoasted datum is
released very soon, when the slot->tts_values[*] is discarded, the
'disabled' state turns back to 'enabled' quickly. So it is not true
that once sharing gets disabled, it can't be enabled anymore for the
given query.

> I think LRU is a pretty good "default" algorithm if we don't have a very
> good idea of the exact life span of the values, etc. Which is why other
> nodes (e.g. Memoize) use LRU too.

> But I wonder if there's a way to count how many times an attribute is
> accessed (or is likely to be). That might be used to inform a better
> eviction strategy.

Yes, but in the current case we can get a better estimation with the
help of the plan shape, and Memoize depends on some planner information
as well. If we bypass the planner information and try to resolve it at
the cache level, the code may become too complex as well, and all the
cost is run-time overhead, while the other way it is a planning-time
overhead.

> Also, we don't need to evict the whole entry - we could evict just the
> data part (guaranteed to be fairly large), but keep the header, and keep
> the counts, expected number of hits, and other info. And use this to
> e.g. release entries that reached the expected number of hits. But I'd
> probably start with LRU and only do this as an improvement later.

A great lesson learnt here, thanks for sharing this!

As for the current use case, what I want to highlight is that we are
"caching" "user data" "locally".

USER DATA indicates it might be very large: it is not common to have 1M
tables, but it is quite common to have 1M tuples in one scan, so keeping
the headers may add extra memory usage as well, like 10M * 24 bytes
= 240MB. LOCALLY means it is not friendly to multiple active sessions.
CACHE indicates it is hard to evict correctly. My method also has the
USER DATA and LOCALLY attributes, but it would be better at eviction;
eviction then leads to the memory usage issue which is 

Re: PostgreSQL Contributors Updates

2024-03-04 Thread vignesh C
On Mon, 4 Mar 2024 at 17:43, Aleksander Alekseev
 wrote:
>
> > > All,
> > >
> > > The PostgreSQL Contributor Page
> > > (https://www.postgresql.org/community/contributors/) includes people who
> > > have made substantial, long-term contributions of time and effort to the
> > > PostgreSQL project. The PostgreSQL Contributors Team recognizes the
> > > following people for their contributions.
> > >
> > > New PostgreSQL Contributors:
> > >
> > > * Bertrand Drouvot
> > > * Gabriele Bartolini
> > > * Richard Guo
> > >
> > > New PostgreSQL Major Contributors:
> > >
> > > * Alexander Lakhin
> > > * Daniel Gustafsson
> > > * Dean Rasheed
> > > * John Naylor
> > > * Melanie Plageman
> > > * Nathan Bossart
> > >
> > > Thank you and congratulations to all!
> >
> > +1. Congratulations to all!
>
> Congratulations to all!

Congratulations to all!




Re: Support "Right Semi Join" plan shapes

2024-03-04 Thread Richard Guo
On Tue, Jan 30, 2024 at 2:51 PM Alena Rybakina 
wrote:

> I have reviewed your patch and I think it is better to add an Assert for
> JOIN_RIGHT_SEMI to the ExecMergeJoin and ExecNestLoop functions to
> prevent the use of RIGHT_SEMI for these types of connections (NestedLoop
> and MergeJoin).


Hmm, I don't see why this is necessary.  The planner should already
guarantee that we won't have nestloops/mergejoins with right-semi joins.

Thanks
Richard


Re: Support "Right Semi Join" plan shapes

2024-03-04 Thread Richard Guo
On Mon, Mar 4, 2024 at 10:33 AM wenhui qiu  wrote:

> HI Richard
>  Now it is starting the last commitfest for v17, can you respond to
> Alena Rybakina points?
>

Thanks for reminding.  Will do that soon.

Thanks
Richard


Re: Improve eviction algorithm in ReorderBuffer

2024-03-04 Thread Peter Smith
Hi, Here are some review comments for v7-0001

1.
/*
 * binaryheap_free
 *
 * Releases memory used by the given binaryheap.
 */
void
binaryheap_free(binaryheap *heap)
{
pfree(heap);
}


Shouldn't the above function (not modified by the patch) also first
free the memory allocated for heap->bh_nodes?
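Something like this, perhaps (a minimal sketch, assuming bh_nodes is
palloc'd separately, as the patched binaryheap_allocate does):

void
binaryheap_free(binaryheap *heap)
{
	pfree(heap->bh_nodes);	/* free the separately allocated node array */
	pfree(heap);			/* then the heap struct itself */
}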

~~~

2.
+/*
+ * Make sure there is enough space for nodes.
+ */
+static void
+bh_enlarge_node_array(binaryheap *heap)
+{
+ heap->bh_space *= 2;
+ heap->bh_nodes = repalloc(heap->bh_nodes,
+   sizeof(bh_node_type) * heap->bh_space);
+}

Strictly speaking, this function doesn't really "Make sure" of
anything because the caller does the check whether we need more space.
All that happens here is allocating more space. Maybe this function
comment should say something like "Double the space allocated for
nodes."
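FWIW, the caller-side pattern would then read something like this
(sketch only, names as in the patch):

	/* grow the node array only when it is already full */
	if (heap->bh_size >= heap->bh_space)
		bh_enlarge_node_array(heap);
	heap->bh_nodes[heap->bh_size++] = d;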

--
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: vacuumdb/clusterdb/reindexdb: allow specifying objects to process in all databases

2024-03-04 Thread Nathan Bossart
On Mon, Oct 23, 2023 at 03:25:42PM -0500, Nathan Bossart wrote:
> rebased

I saw that this thread was referenced elsewhere [0], so I figured I'd take
a fresh look.  From a quick glance, I'd say 0001 and 0002 are pretty
reasonable and could probably be committed for v17.  0003 probably requires
some more attention.  If there is indeed interest in these changes, I'll
try to spend some more time on it.

[0] https://postgr.es/m/E0D2F0CE-D27C-49B1-902B-AD8D2427F07E%40yandex-team.ru

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Switching XLog source from archive to streaming when primary available

2024-03-04 Thread Nathan Bossart
cfbot claims that this one needs another rebase.

I've spent some time thinking about this one.  I'll admit I'm a bit worried
about adding more complexity to this state machine, but I also haven't
thought of any other viable approaches, and this still seems like a useful
feature.  So, for now, I think we should continue with the current
approach.

+fails to switch to stream mode, it falls back to archive mode. If this
+parameter value is specified without units, it is taken as
+milliseconds. Default is 5min. With a lower value

Does this really need to be milliseconds?  I would think that any
reasonable setting would be at least on the order of seconds.

+attempts. To avoid this, it is recommended to set a reasonable value.

I think we might want to suggest what a "reasonable value" is.

+   static bool canSwitchSource = false;
+   boolswitchSource = false;

IIUC "canSwitchSource" indicates that we are trying to force a switch to
streaming, but we are currently exhausting anything that's present in the
pg_wal directory, while "switchSource" indicates that we should force a
switch to streaming right now.  Furthermore, "canSwitchSource" is static
while "switchSource" is not.  Is there any way to simplify this?  For
example, would it be possible to make an enum that tracks the
streaming_replication_retry_interval state?
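For instance, the states might be (a hypothetical sketch; names
invented):

typedef enum
{
	SWITCH_SOURCE_NONE,		/* no switch to streaming requested */
	SWITCH_SOURCE_PENDING,	/* interval elapsed; drain pg_wal first */
	SWITCH_SOURCE_NOW		/* force the switch to streaming now */
} XLogSourceSwitchState;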

 			/*
 			 * Don't allow any retry loops to occur during nonblocking
-			 * readahead.  Let the caller process everything that has been
-			 * decoded already first.
+			 * readahead if we failed to read from the current source. Let the
+			 * caller process everything that has been decoded already first.
 			 */
-			if (nonblocking)
+			if (nonblocking && lastSourceFailed)
 				return XLREAD_WOULDBLOCK;

Why do we skip this when "switchSource" is set?

+   /* Reset the WAL source switch state */
+   if (switchSource)
+   {
+   Assert(canSwitchSource);
+   Assert(currentSource == XLOG_FROM_STREAM);
+   Assert(oldSource == XLOG_FROM_ARCHIVE);
+   switchSource = false;
+   canSwitchSource = false;
+   }

How do we know that oldSource is guaranteed to be XLOG_FROM_ARCHIVE?  Is
there no way it could be XLOG_FROM_PG_WAL?

+#streaming_replication_retry_interval = 5min	# time after which standby
+					# attempts to switch WAL source from archive to
+					# streaming replication
+					# in milliseconds; 0 disables

I think we might want to turn this feature off by default, at least for the
first release.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




RE: Some shared memory chunks are allocated even if related processes won't start

2024-03-04 Thread Hayato Kuroda (Fujitsu)
Dear Alvaro,

Thanks for giving comments!

> > I agreed it sounds good, but I don't think it can be implemented by
> > the current interface. An interface for dynamically allocating memory is
> > GetNamedDSMSegment(), and it returns the same shared memory region if
> > input names are the same.  Therefore, there is no way to re-alloc the
> > shared memory.
> 
> Yeah, I was imagining something like this: the workitem-array becomes a
> struct, which has a name and a "next" pointer and a variable number of
> workitem slots; the AutoVacuumShmem struct has a pointer to the first
> workitem-struct and the last one; when a workitem is requested by
> brininsert, we initially allocate via GetNamedDSMSegment("workitem-0") a
> workitem-struct with a smallish number of elements; if we request
> another workitem and the array is full, we allocate another array via
> GetNamedDSMSegment("workitem-1") and store a pointer to it in workitem-0
> (so that the list can be followed by an autovacuum worker that's
> processing the database), and it's also set as the tail of the list in
> AutoVacuumShmem (so that we know where to store further work items).
> When all items in a workitem-struct are processed, we can free it
> (I guess via dsm_unpin_segment), and make AutoVacuumShmem->av_workitems
> point to the next one in the list.
> 
> This way, the "array" can grow arbitrarily.
>

Basically sounds good. My concerns are:

* GetNamedDSMSegment() does not return a raw pointer to the dsm_segment. This
  means that it may be difficult to do dsm_unpin_segment on the caller side.
* Dynamic shared memory is recorded in a dshash (dsm_registry_table) and the
  entry won't be deleted, so a reference to the chunk might remain.

Overall, the dsm_registry may also need to be extended. I have not started
working on this yet, so I will share results after trying these ideas.
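For reference, a rough sketch of the chained structure being discussed
(hypothetical names; each chunk would live in its own named DSM
segment):

typedef struct AutoVacuumWorkItems
{
	char		av_next_segment[NAMEDATALEN];	/* GetNamedDSMSegment() name
												 * of the next chunk, or
												 * empty for the list tail */
	int			av_nitems;		/* number of workitem slots in this chunk */
	AutoVacuumWorkItem av_items[FLEXIBLE_ARRAY_MEMBER];
} AutoVacuumWorkItems;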

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-04 Thread Masahiko Sawada
On Mon, Mar 4, 2024 at 8:48 PM John Naylor  wrote:
>
> On Mon, Mar 4, 2024 at 1:05 PM Masahiko Sawada  wrote:
> >
> > On Sun, Mar 3, 2024 at 2:43 PM John Naylor  wrote:
>
> > > Right, I should have said "reset". Resetting a context will delete
> > > it's children as well, and seems like it should work to reset the tree
> > > context, and we don't have to know whether that context actually
> > > contains leaves at all. That should allow copying "tree context" to
> > > "leaf context" in the case where we have no special context for
> > > leaves.
> >
> > Resetting the tree->context seems to work. But I think we should note
> > for callers that the dsa_area passed to RT_CREATE should be created in
> > a different context than the context passed to RT_CREATE because
> > otherwise RT_FREE() will also free the dsa_area. For example, the
> > following code in test_radixtree.c will no longer work:
> >
> > dsa = dsa_create(tranche_id);
> > radixtree = rt_create(CurrentMemoryContext, dsa, tranche_id);
> > :
> > rt_free(radixtree);
> > dsa_detach(dsa); // dsa is already freed.
> >
> > So I think that a practical usage of the radix tree will be that the
> > caller  creates a memory context for a radix tree and passes it to
> > RT_CREATE().
>
> That sounds workable to me.
>
> > I've attached an update patch set:
> >
> > - 0008 updates RT_FREE_RECURSE().
>
> Thanks!
>
> > - 0009 patch is an updated version of cleanup radix tree memory handling.
>
> Looks pretty good, as does the rest. I'm going through again,
> squashing and making tiny adjustments to the template. The only thing
> not done is changing the test with many values to resemble the perf
> test more.
>
> I wrote:
> > > Secondly, I thought about my recent work to skip checking if we first
> > > need to create a root node, and that has a harmless (for vacuum at
> > > least) but slightly untidy behavior: When RT_SET is first called, and
> > > the key is bigger than 255, new nodes will go on top of the root node.
> > > These have chunk '0'. If all subsequent keys are big enough, the
> > > original root node will stay empty. If all keys are deleted, there will
> > > be a chain of empty nodes remaining. Again, I believe this is
> > > harmless, but to make it tidy, it should be easy to teach RT_EXTEND_UP
> > > to call out to RT_EXTEND_DOWN if it finds the tree is empty. I can work
> > > on this, but likely not today.
> >
> > This turns out to be a lot trickier than it looked, so it seems best
> > to allow a trivial amount of waste, as long as it's documented
> > somewhere. It also wouldn't be terrible to re-add those branches,
> > since they're highly predictable.
>
> I put a little more work into this, and got it working, just needs a
> small amount of finicky coding. I'll share tomorrow.
>
> I have a question about RT_FREE_RECURSE:
>
> + check_stack_depth();
> + CHECK_FOR_INTERRUPTS();
>
> I'm not sure why these are here: The first seems overly paranoid,
> although harmless, but the second is probably a bad idea. Why should
> the user be able to interrupt the freeing of memory?

Good catch. We should not check the interruption there.

> Also, I'm not quite happy that RT_ITER has a copy of a pointer to the
> tree, leading to coding like "iter->tree->ctl->root". I *think* it
> would be easier to read if the tree was a parameter to these iteration
> functions. That would require an API change, so the tests/tidstore
> would have some churn. I can do that, but before trying I wanted to
> see what you think -- is there some reason to keep the current way?

I considered both usages; there are two reasons for the current style.
I'm concerned that if we pass both the tree and RT_ITER to iteration
functions, the caller could mistakenly pass a different tree than the
one that was specified to create the RT_ITER. And the second reason is
just to make it consistent with other data structures such as
dynahash.c and dshash.c, but I now realized that in simplehash.h we
pass both the hash table and the iterator.
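To illustrate the two styles (a sketch only; the exact signatures in the
template differ):

	/* current style: the RT_ITER carries the tree pointer internally */
	iter = RT_BEGIN_ITERATE(tree);
	while (RT_ITERATE_NEXT(iter, &key, &value))
		/* ... use key/value ... */ ;

	/* alternative: pass the tree explicitly, as simplehash.h does */
	while (RT_ITERATE_NEXT(tree, iter, &key, &value))
		/* ... use key/value ... */ ;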

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Preserve subscription OIDs during pg_upgrade

2024-03-04 Thread Bruce Momjian
On Mon, Feb 26, 2024 at 09:51:40AM +0530, Robert Haas wrote:
> > I am not sure that it is a good idea to relax that for PG17 at this
> > stage of the development cycle, though, as we have already done a lot
> > in this area for pg_upgrade and it may require more tweaks during the
> > beta period depending on the feedback received, so I would suggest to
> > do more improvements for the 18 cycle instead once we have a cleaner
> > picture of the whole.
> 
> That's fair.
> 
> I want to say that, unlike Tom, I'm basically in favor of preserving
> OIDs in more places across updates. It seems to have little downside
> and improve the understandability of the outcome. But that's separate
> from whether it is a good idea to build on that infrastructure in any
> particular way in the time we have left for this release.

Yes, the _minimal_ approach has changed in the past few years to make
pg_upgrade debugging easier.  The original design was ultra-conservative
where it could be, considering how radical the core functionality was.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: Fix race condition in InvalidatePossiblyObsoleteSlot()

2024-03-04 Thread Michael Paquier
On Mon, Feb 26, 2024 at 02:01:45PM +, Bertrand Drouvot wrote:
> Though [1] mentioned up-thread is not pushed yet, I'm sharing the POC patch 
> now
> (see the attached).

I have looked at what you have here.

First, in a build where 818fefd8fd is included, this makes the test
script a lot slower.  Most of the logic is quick, but we're spending
10s or so checking that catalog_xmin has advanced.  Could it be
possible to make that faster?

A second issue is the failure mode when 818fefd8fd is reverted.  The
test is getting stuck when we are waiting on the standby to catch up,
until a timeout decides to kick in to fail the test, and all the
previous tests pass.  Could it be possible to make that more
responsive?  I assume that in the failure mode we would get an
incorrect conflict_reason for injection_inactiveslot, succeeding in
checking the failure.

+my $terminated = 0;
+for (my $i = 0; $i < 10 * $PostgreSQL::Test::Utils::timeout_default; $i++)
+{
+if ($node_standby->log_contains(
+'terminating process .* to release replication slot 
\"injection_activeslot\"', $logstart))
+{
+$terminated = 1;
+last;
+}
+usleep(100_000);
+}
+ok($terminated, 'terminating process holding the active slot is logged 
with injection point');

The LOG exists when we are sure that the startup process is waiting
in the injection point, so this loop could be replaced with something
like:
+   $node_standby->wait_for_event('startup', 'TerminateProcessHoldingSlot');
+   ok( $node_standby->log_contains('terminating process .* .. ', 'termin .. ';)

Nit: the name of the injection point should be
terminate-process-holding-slot rather than
TerminateProcessHoldingSlot, to be consistent with the other ones. 
--
Michael


signature.asc
Description: PGP signature


Re: Synchronizing slots from primary to standby

2024-03-04 Thread Peter Smith
Here are some review comments for v105-0001

==
doc/src/sgml/config.sgml

1.
+   
+The standbys corresponding to the physical replication slots in
+standby_slot_names must configure
+sync_replication_slots = true so they can receive
+logical failover slots changes from the primary.
+   

/slots changes/slot changes/

==
doc/src/sgml/func.sgml

2.
+The function may be waiting if the specified slot is a logical failover
+slot and standby_slot_names
+is configured.

I know this has been through multiple versions already, but this
latest wording "may be waiting..." doesn't seem very good to me.

How about one of these?

* The function may not be able to return immediately if the specified
slot is a logical failover slot and standby_slot_names is configured.

* The function return might be blocked if the specified slot is a
logical failover slot and standby_slot_names is configured.

* If the specified slot is a logical failover slot then the function
will block until all physical slots specified in standby_slot_names
have confirmed WAL receipt.

* If the specified slot is a logical failover slot then the function
will not return until all physical slots specified in
standby_slot_names have confirmed WAL receipt.

~~~

3.
+slot may return to an earlier position. The function may be waiting if
+the specified slot is a logical failover slot and
+standby_slot_names


Same as previous review comment #2

==
src/backend/replication/slot.c

4. WaitForStandbyConfirmation

+ * Used by logical decoding SQL functions that acquired logical failover slot.

IIUC it doesn't work like that. pg_logical_slot_get_changes_guts()
calls here unconditionally (i.e. the SQL functions don't even check if
they are failover slots before calling this) so the comment seems
misleading/redundant.

==
src/backend/replication/walsender.c

5. NeedToWaitForWal

+ /*
+ * Check if the standby slots have caught up to the flushed position. It
+ * is good to wait up to the flushed position and then let the WalSender
+ * send the changes to logical subscribers one by one which are already
+ * covered by the flushed position without needing to wait on every change
+ * for standby confirmation.
+ */
+ if (NeedToWaitForStandbys(flushed_lsn, wait_event))
+ return true;
+
+ *wait_event = 0;
+ return false;
+}
+

5a.
The comment (or part of it?) seems misplaced because it is talking about
the WalSender sending changes, but that is not happening in this function.

Also, isn't what this is saying already described by the other comment
in the caller? e.g.:

+ /*
+ * Within the loop, we wait for the necessary WALs to be flushed to
+ * disk first, followed by waiting for standbys to catch up if there
+ * are enough WALs or upon receiving the shutdown signal. To avoid the
+ * scenario where standbys need to catch up to a newer WAL location in
+ * each iteration, we update our idea of the currently flushed
+ * position only if we are not waiting for standbys to catch up.
+ */

~

5b.
Most of the code is unnecessary. AFAICT all this is exactly the same as just 1 line:

return NeedToWaitForStandbys(flushed_lsn, wait_event);

~~~

6. WalSndWaitForWal

+ /*
+ * Within the loop, we wait for the necessary WALs to be flushed to
+ * disk first, followed by waiting for standbys to catch up if there
+ * are enough WALs or upon receiving the shutdown signal. To avoid the
+ * scenario where standbys need to catch up to a newer WAL location in
+ * each iteration, we update our idea of the currently flushed
+ * position only if we are not waiting for standbys to catch up.
+ */

Regarding that 1st sentence: maybe this logic used to be done
explicitly "within the loop" but IIUC this logic is now hidden inside
NeedToWaitForWal() so the comment should mention that.

--
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: [PATCH] updates to docs about HOT updates for BRIN

2024-03-04 Thread Jeff Davis
On Tue, 2024-02-27 at 09:48 -0500, Stephen Frost wrote:
> Attached is an updated patch which drops the 'such as' and adds a
> sentence mentioning that BRIN is the only in-core summarizing index.

The original patch reads more clearly to me. In v4, summarizing (the
exception) feels like it's dominating the description.

Also, is it standard practice to backport this kind of doc update? I
ordinarily wouldn't be inclined to do so, but v4 seems targeted at 16
as well.

Attached my own suggested wording that hopefully addresses Stephen and
Alvaro's concerns. I agree that it's tricky to write so I took a more
minimalist approach:

 * I got rid of the "In summary" sentence because (a) it's confusing
now that we're talking about summarizing indexes; and (b) it's not
summarizing anything, it's just redundant.

 * I removed the mention of partial or expression indexes. It's a bit
redundant and doesn't seem especially helpful in this context.

If this is agreeable I can commit it.

Regards,
Jeff Davis

From d17ecaf1af52ba5b055c19b465d77cc53f825747 Mon Sep 17 00:00:00 2001
From: Stephen Frost 
Date: Mon, 26 Feb 2024 17:17:54 -0500
Subject: [PATCH v5] docs: Update HOT update docs for 19d8e2308b

Commit 19d8e2308b changed when the HOT update optimization is possible
but neglected to update the Heap-Only Tuples (HOT) documentation.  This
patch updates that documentation accordingly.

Author: Elizabeth Christensen
Reviewed-By: Stephen Frost, Alvaro Herrera
Discussion: https://postgr.es/m/CABoUFXRjisr58Ct_3VsFEdQx+fJeQTWTdJnM7XAp=8mubto...@mail.gmail.com
---
 doc/src/sgml/storage.sgml | 21 +++--
 1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/doc/src/sgml/storage.sgml b/doc/src/sgml/storage.sgml
index 652946db7d..2e1914b95b 100644
--- a/doc/src/sgml/storage.sgml
+++ b/doc/src/sgml/storage.sgml
@@ -1097,8 +1097,10 @@ data. Empty in ordinary tables.
   

 
- The update does not modify any columns referenced by the table's
- indexes, including expression and partial indexes.
+ The update does not modify any columns referenced by the table's indexes,
+ not including summarizing indexes.  The only summarizing index method in
+ the core PostgreSQL distribution is BRIN.
  


@@ -1114,7 +1116,8 @@ data. Empty in ordinary tables.
   

 
- New index entries are not needed to represent updated rows.
+ New index entries are not needed to represent updated rows, however,
+ summary indexes may still need to be updated.
 


@@ -1130,14 +1133,12 @@ data. Empty in ordinary tables.
  
 
  
-  In summary, heap-only tuple updates can only be created
-  if columns used by indexes are not updated.  You can
-  increase the likelihood of sufficient page space for
-  HOT updates by decreasing a table's fillfactor.
-  If you don't, HOT updates will still happen because
-  new rows will naturally migrate to new pages and existing pages with
-  sufficient free space for new row versions.  The system view
+  You can increase the likelihood of sufficient page space for
+  HOT updates by decreasing a table's fillfactor.  If you
+  don't, HOT updates will still happen because new rows
+  will naturally migrate to new pages and existing pages with sufficient free
+  space for new row versions.  The system view pg_stat_all_tables
   allows monitoring of the occurrence of HOT and non-HOT updates.
  
-- 
2.34.1



Re: Avoid is possible a expensive function call (src/backend/utils/adt/varlena.c)

2024-03-04 Thread Tom Lane
Michael Paquier  writes:
> On Mon, Mar 04, 2024 at 03:08:03PM -0300, Ranier Vilela wrote:
>> I can't see any user validation at the function pg_newlocale_from_collation.

> Matthias is right, look closer.  I can see more than one check,
> especially note the one related to the collation version mismatch that
> should not be silently ignored.

The fast path through that code doesn't include any checks, true,
but the point is that finding the entry proves that we made those
checks previously.  I can't agree with making those semantics
squishy in order to save a few cycles in the exact-equality case.

regards, tom lane




Re: Add new error_action COPY ON_ERROR "log"

2024-03-04 Thread Michael Paquier
On Mon, Mar 04, 2024 at 05:00:00AM +0530, Bharath Rupireddy wrote:
> How about an extra option to error_action ignore-with-verbose which is
> similar to ignore but when specified emits one NOTICE per malformed
> row? With this, one can say COPY x FROM stdin (ON_ERROR
> ignore-with-verbose);.
> 
> Alternatively, we can think of adding a new option verbose altogether
> which can be used for not only this but for reporting some other COPY
> related info/errors etc. With this, one can say COPY x FROM stdin
> (VERBOSE on, ON_ERROR ignore);.

I would suggest a completely separate option, because that offers more
flexibility as each option has a separate meaning.  My main concern in
using one option to control them all is that one may want in the 
future to be able to specify more combinations of actions at query
level, especially if more modes are added to the ON_ERROR mode.  One
option would prevent that.

Perhaps ERROR_VERBOSE or ERROR_VERBOSITY would be better names, but
I'm never wedded to my naming suggestions.  Bad history with the
matter.
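So the command could end up looking something like this (hypothetical
syntax, with the option name still to be settled):

COPY x FROM stdin (ON_ERROR ignore, ERROR_VERBOSITY verbose);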

> There's also another way of having a separate GUC, but -100 from me
> for it. Because, it not only increases the total number of GUCs by 1,
> but also might set a wrong precedent to have a new GUC for controlling
> command level outputs.

What does this have to do with GUCs?  The ON_ERROR option does not
have one.
--
Michael


signature.asc
Description: PGP signature


Re: a wrong index choose when statistics is out of date

2024-03-04 Thread David Rowley
On Tue, 5 Mar 2024 at 00:37, Andy Fan  wrote:
>
> David Rowley  writes:
> > I don't think it would be right to fudge the costs in any way, but I
> > think the risk factor for IndexPaths could take into account the
> > number of unmatched index clauses and increment the risk factor, or
> > "certainty_factor" as it is currently in my brain-based design. That
> > way add_path() would be more likely to prefer the index that matches
> > the most conditions.
>
> This is somewhat similar to my proposal at [1]?  What do you think
> about treating 'col op const' as 'col op $1' for the marked column?
> This could just resolve a subset of the questions in your mind, but the
> method looks to have a solid reason.

Do you mean this?

> + /*
> + * To make the planner more robust to handle some inaccurate statistics
> + * issue, we will add a extra cost to qpquals so that the less qpquals
> + * the lower cost it has.
> + */
> + cpu_run_cost += 0.01 * list_length(qpquals);

I don't think it's a good idea to add cost penalties like you proposed
there. This is what I meant by "I don't think it would be right to
fudge the costs in any way".

If you modify the costs to add some small penalty so that the planner
is more likely to favour some other plan, what happens if we then
decide the other plan has some issue and we want to penalise that for
some other reason? Adding the 2nd penalty might result in the original
plan choice again. Which one should be penalised more? I think the
uncertainty needs to be tracked separately.

Fudging the costs like this is also unlikely to play nicely with
add_path's use of STD_FUZZ_FACTOR.  There'd be an incentive to do
things like total_cost *= STD_FUZZ_FACTOR; to ensure we get a large
enough penalty.

David

> [1] 
> https://www.postgresql.org/message-id/CAApHDvovVWCbeR4v%2BA4Dkwb%3DYS_GuJG9OyCm8jZu%2B%2BcP2xsY_A%40mail.gmail.com




Re: Avoid is possible a expensive function call (src/backend/utils/adt/varlena.c)

2024-03-04 Thread Michael Paquier
On Mon, Mar 04, 2024 at 03:08:03PM -0300, Ranier Vilela wrote:
> I can't see any user validation at the function pg_newlocale_from_collation.

Matthias is right, look closer.  I can see more than one check,
especially note the one related to the collation version mismatch that
should not be silently ignored.

> meson test pass all checks.

Collations are harder to test because they depend on the environment
where the test happens, especially with ICU.
--
Michael


signature.asc
Description: PGP signature


Re: Adding deprecation notices to pgcrypto documentation

2024-03-04 Thread Nathan Bossart
On Mon, Mar 04, 2024 at 10:03:13PM +0100, Daniel Gustafsson wrote:
> Cleaning out my inbox I came across this which I still think is worth doing,
> any objections to going ahead with it?

I think the general idea is reasonable, but I have two small comments:

* Should this be a "Warning" and/or moved to the top of the page?  This
  seems like a relatively important notice that folks should see when
  beginning to use pgcrypto.

* Should we actually document the exact list of algorithms along with
  detailed reasons?  This list seems prone to becoming outdated.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

2024-03-04 Thread Tom Lane
I wrote:
> It looks like SLRU_PAGE_EMPTY has (by chance, or deliberately)
> the numeric value of zero, so I guess the majority of our BF
> animals are understanding this as "address != NULL".  But that
> doesn't look like a useful test to be making.

In hopes of noticing whether there are other similar thinkos,
I permuted the order of the SlruPageStatus enum values, and
now I get the expected warnings from gcc:

In file included from ../../../../src/include/postgres.h:45,
 from slru.c:59:
slru.c: In function ‘SimpleLruWaitIO’:
slru.c:436:38: warning: comparison between pointer and integer
  Assert(&shared->page_status[slotno] != SLRU_PAGE_EMPTY);
  ^~
../../../../src/include/c.h:862:9: note: in definition of macro ‘Assert’
   if (!(condition)) \
 ^
slru.c: In function ‘SimpleLruWritePage’:
slru.c:717:43: warning: comparison between pointer and integer
  Assert(&ctl->shared->page_status[slotno] != SLRU_PAGE_EMPTY);
   ^~
../../../../src/include/c.h:862:9: note: in definition of macro ‘Assert’
   if (!(condition)) \
 ^

So it looks like it's just these two places.
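Presumably the intended checks simply drop the "&", i.e. (untested):

Assert(shared->page_status[slotno] != SLRU_PAGE_EMPTY);
Assert(ctl->shared->page_status[slotno] != SLRU_PAGE_EMPTY);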

regards, tom lane




Re: Injection points: some tools to wait and wake

2024-03-04 Thread Michael Paquier
On Mon, Mar 04, 2024 at 10:51:41AM +0100, Jelte Fennema-Nio wrote:
> Yeah, it makes sense that you'd want to backport fixes/changes to
> this. As long as you put a disclaimer in the docs that you can do that
> for this module, I think it would be fine. Our tests fairly regularly
> break anyway when changing minor versions of postgres in our CI, e.g.
> due to improvements in the output of isolationtester. So if changes to
> this module require some changes that's fine by me. Seems much nicer
> than having to copy-paste the code.

In my experience, anybody who does serious testing with their product
integrated with Postgres has two or three types of builds with their
own scripts: one with assertions, -DG and other developer-oriented
options enabled, and one for production deployments with more
optimized options like -O2.  Once there are custom scripts to build
and package Postgres, do we really need to move that to contrib/ at
all?  make install would work for a test module as long as the command
is run locally in its directory.
--
Michael


signature.asc
Description: PGP signature


Re: Popcount optimization using AVX512

2024-03-04 Thread Nathan Bossart
(Please don't top-post on the Postgres lists.)

On Mon, Mar 04, 2024 at 09:39:36PM +, Amonson, Paul D wrote:
> First, apologies on the patch. Find re-attached updated version.

Thanks for the new version of the patch.

>> -#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
>> +#define LARGE_OFF_T ((((off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31)
>> +<< 31))
>>
>> IME this means that the autoconf you are using has been patched.  A
>> quick search on the mailing lists seems to indicate that it might be
>> specific to Debian [1].
>  
> I am not sure what the ask is here?  I made changes to the configure.ac
> and ran autoconf2.69 to get builds to succeed. Do you have a separate
> feedback here?

These LARGE_OFF_T changes are unrelated to the patch at hand and should be
removed.  This likely means that you are using a patched autoconf that is
making these extra changes.
 
> As for the refactoring, this was done to satisfy previous review feedback
> about applying the AVX512 CFLAGS to the entire pg_bitutils.c file. Mainly
> to avoid a segfault due to the AVX512 flags. If it's OK, I would prefer to
> make a single commit as the change is pretty small and straightforward.

Okay.  The only reason I suggest this is to ease review.  For example, if
there is some required refactoring that doesn't involve any functionality
changes, it can be advantageous to get that part reviewed and committed
first so that reviewers can better focus on the code for the new feature.
But, of course, that isn't necessary and/or isn't possible in all cases.

> I am not sure I understand the comment about the SIZEOF_VOID_P checks.
> Aren't they necessary to choose which functions to call based on 32 or 64
> bit architectures?

Yes.  My comment was that the patch appeared to make unnecessary changes to
this code.  Perhaps I am misunderstanding something here.

> Would this change qualify for Workflow A as described in [0] and can be
> picked up by a committer, given it has been reviewed by multiple
> committers so far? The scope of the change is pretty contained as well.

I think so.  I would still encourage you to create an entry for this so
that it is automatically tested via cfbot [0].

[0] http://commitfest.cputube.org/

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

2024-03-04 Thread Tom Lane
Alvaro Herrera  writes:
> Pushed that way, but we can discuss further wording improvements/changes
> if someone wants to propose any.

I just noticed that drongo is complaining about two lines added
by commit 53c2a97a9:

 drongo| 2024-03-04 14:34:52 | 
../pgsql/src/backend/access/transam/slru.c(436): warning C4047: '!=': 
'SlruPageStatus *' differs in levels of indirection from 'int'
 drongo| 2024-03-04 14:34:52 | 
../pgsql/src/backend/access/transam/slru.c(717): warning C4047: '!=': 
'SlruPageStatus *' differs in levels of indirection from 'int'

These lines are

Assert(&shared->page_status[slotno] != SLRU_PAGE_EMPTY);

Assert(&ctl->shared->page_status[slotno] != SLRU_PAGE_EMPTY);

These are comparing the address of something with an enum value,
which surely cannot be sane.  Is the "&" operator incorrect?

It looks like SLRU_PAGE_EMPTY has (by chance, or deliberately)
the numeric value of zero, so I guess the majority of our BF
animals are understanding this as "address != NULL".  But that
doesn't look like a useful test to be making.

regards, tom lane




Re: Improve the log message output of basic_archive when basic_archive.archive_directory parameter is not set

2024-03-04 Thread Tom Lane
Nathan Bossart  writes:
> On Mon, Mar 04, 2024 at 09:27:23PM +0100, Daniel Gustafsson wrote:
>> Reviewed-by: Daniel Gustafsson, ¡lvaro Herrera
>> Alvaro's name seems wrong.

> Hm.  It looks alright to me.  I copied the name from his e-mail signature,
> which has an accent over the first 'A'.  I assume that's why it's not
> showing up correctly in some places.

I think that git has an expectation of commit log entries being in
UTF8.  The committed message looks okay from my end, but maybe some
encoding mangling happened to the version Daniel was looking at?

regards, tom lane




Re: [PATCH] Exponential backoff for auth_delay

2024-03-04 Thread Michael Banck
Hi,

On Mon, Mar 04, 2024 at 03:50:07PM -0500, Robert Haas wrote:
> I agree that two GUCs here seems to be one more than necessary, but I
> wonder whether we couldn't just say 0 means no exponential backoff and
> any other value is the maximum time. 

Alright, I have changed it so that auth_delay.milliseconds and
auth_delay.max_milliseconds are the only GUCs, both defaulting to 0. If
the latter is 0, the former's value is always taken. If the latter is
non-zero and larger than the former, exponential backoff is applied with
the latter's value as the maximum delay (e.g., with milliseconds=1000 and
max_milliseconds=8000, consecutive failures from one host would be delayed
by roughly 1s, 2s, 4s, and then 8s from there on).

If the latter is smaller than the former, then auth_delay just sets the
delay to the latter; I don't think this is a problem or confusing, or
should this be considered a misconfiguration?

> The idea that 0 means unlimited doesn't seem useful in practice. 

Yeah, that was more how it was coded than a real policy decision, so
let's do away with it.

V5 attached.


Michael
>From 3563d77061480b7e022255b968a39086b0cc8814 Mon Sep 17 00:00:00 2001
From: Michael Banck 
Date: Wed, 27 Dec 2023 15:55:39 +0100
Subject: [PATCH v5] Add optional exponential backoff to auth_delay contrib
 module.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

This adds the new auth_delay.max_milliseconds GUC. If set (its default is 0),
auth_delay adds exponential backoff with this GUC's value as maximum delay.

The exponential backoff is tracked per remote host and doubled for every failed
login attempt (i.e., wrong password, not just missing pg_hba line or database)
and reset to auth_delay.milliseconds after a successful authentication or when
no authentication attempts have been made for 5*max_milliseconds from that
host.

Authors: Michael Banck, based on an earlier patch by 成之焕
Reviewed-by: Abhijit Menon-Sen, Tomas Vondra
Discussion: https://postgr.es/m/ahwaxacqiwivoehs5yejpqog.1.1668569845751.hmail.zhch...@ceresdata.com
---
 contrib/auth_delay/auth_delay.c  | 216 ++-
 doc/src/sgml/auth-delay.sgml |  31 -
 src/tools/pgindent/typedefs.list |   1 +
 3 files changed, 244 insertions(+), 4 deletions(-)

diff --git a/contrib/auth_delay/auth_delay.c b/contrib/auth_delay/auth_delay.c
index ff0e1fd461..5fb123d133 100644
--- a/contrib/auth_delay/auth_delay.c
+++ b/contrib/auth_delay/auth_delay.c
@@ -14,24 +14,50 @@
 #include 
 
 #include "libpq/auth.h"
+#include "miscadmin.h"
 #include "port.h"
+#include "storage/dsm_registry.h"
+#include "storage/ipc.h"
+#include "storage/lwlock.h"
+#include "storage/shmem.h"
 #include "utils/guc.h"
 #include "utils/timestamp.h"
 
 PG_MODULE_MAGIC;
 
+#define MAX_CONN_RECORDS 100
+
 /* GUC Variables */
 static int	auth_delay_milliseconds = 0;
+static int	auth_delay_max_milliseconds = 0;
 
 /* Original Hook */
 static ClientAuthentication_hook_type original_client_auth_hook = NULL;
 
+typedef struct AuthConnRecord
+{
+	char		remote_host[NI_MAXHOST];
+	double		sleep_time;		/* in milliseconds */
+	TimestampTz last_failed_auth;
+} AuthConnRecord;
+
+static shmem_startup_hook_type shmem_startup_next = NULL;
+static AuthConnRecord *acr_array = NULL;
+
+static AuthConnRecord *auth_delay_find_acr_for_host(char *remote_host);
+static AuthConnRecord *auth_delay_find_free_acr(void);
+static double auth_delay_increase_delay_after_failed_conn_auth(Port *port);
+static void auth_delay_cleanup_conn_record(Port *port);
+static void auth_delay_expire_conn_records(Port *port);
+
 /*
  * Check authentication
  */
 static void
 auth_delay_checks(Port *port, int status)
 {
+	double		delay = auth_delay_milliseconds;
+
 	/*
 	 * Any other plugins which use ClientAuthentication_hook.
 	 */
@@ -39,20 +65,190 @@ auth_delay_checks(Port *port, int status)
 		original_client_auth_hook(port, status);
 
 	/*
-	 * Inject a short delay if authentication failed.
+	 * We handle both STATUS_ERROR and STATUS_OK - the third option
+	 * (STATUS_EOF) is disregarded.
+	 *
+	 * In case of STATUS_ERROR we inject a short delay, optionally with
+	 * exponential backoff.
+	 */
+	if (status == STATUS_ERROR)
+	{
+		if (auth_delay_max_milliseconds > 0)
+		{
+			/*
+			 * Delay by 2^n seconds after each authentication failure from a
+			 * particular host, where n is the number of consecutive
+			 * authentication failures.
+			 */
+			delay = auth_delay_increase_delay_after_failed_conn_auth(port);
+
+			/*
+			 * Clamp delay to a maximum of auth_delay_max_milliseconds.
+			 */
+			delay = Min(delay, auth_delay_max_milliseconds);
+		}
+
+		if (delay > 0)
+		{
+			elog(DEBUG1, "Authentication delayed for %g seconds due to auth_delay", delay / 1000.0);
+			pg_usleep(1000L * (long) delay);
+		}
+
+		/*
+		 * Expire delays from other hosts after auth_delay_max_milliseconds *
+		 * 5.
+		 */
+		auth_delay_expire_conn_records(port);
+	}
+
+	/*
+	 * Remove host-specific delay if authentication succeeded.
+	 */
+	if (status == STATUS_OK)
+		auth_delay_cleanup_conn_record(port);
+}
+
+static double

Re: Improve the log message output of basic_archive when basic_archive.archive_directory parameter is not set

2024-03-04 Thread Nathan Bossart
On Mon, Mar 04, 2024 at 09:27:23PM +0100, Daniel Gustafsson wrote:
> Looks good from a read-through, I like it.  A few comments on the commit
> message only:
> 
> actionable details about the source of the miconfiguration.  This
> s/miconfiguration/misconfiguration/

I reworded the commit message a bit to avoid the word "misconfiguration,"
as it felt a bit misleading to me.  In any case, this was fixed, albeit
indirectly.

> Reviewed-by: Daniel Gustafsson, ¡lvaro Herrera
> Alvaro's name seems wrong.

Hm.  It looks alright to me.  I copied the name from his e-mail signature,
which has an accent over the first 'A'.  I assume that's why it's not
showing up correctly in some places.

Anyway, I've committed this now.  Thanks for taking a look!

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




RE: Popcount optimization using AVX512

2024-03-04 Thread Amonson, Paul D
Hi,

First, apologies on the patch. Find re-attached updated version.
 
Now I have some questions
#1
 
> -#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
> +#define LARGE_OFF_T ((((off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31)
> +<< 31))
>
> IME this means that the autoconf you are using has been patched.  A quick 
> search on the mailing lists seems to indicate that it might be specific to 
> Debian [1].
 
I am not sure what the ask is here?  I made changes to the configure.ac and ran 
autoconf2.69 to get builds to succeed. Do you have a separate feedback here? 
 
#2 
As for the refactoring, this was done to satisfy previous review feedback about 
applying the AVX512 CFLAGS to the entire pg_bitutils.c file. Mainly to avoid 
a segfault due to the AVX512 flags. If it's OK, I would prefer to make a single 
commit as the change is pretty small and straightforward.
 
#3
I am not sure I understand the comment about the SIZEOF_VOID_P checks. Aren't 
they necessary to choose which functions to call based on 32 or 64 bit 
architectures?
 
#4
Would this change qualify for Workflow A as described in [0] and can be picked 
up by a committer, given it has been reviewed by multiple committers so far? 
The scope of the change is pretty contained as well. 
 
[0] https://wiki.postgresql.org/wiki/Submitting_a_Patch

Thanks,
Paul


-Original Message-
From: Nathan Bossart  
Sent: Friday, March 1, 2024 1:45 PM
To: Amonson, Paul D 
Cc: Andres Freund ; Alvaro Herrera 
; Shankaran, Akash ; Noah 
Misch ; Tom Lane ; Matthias van de Meent 
; pgsql-hackers@lists.postgresql.org
Subject: Re: Popcount optimization using AVX512

Thanks for the new version of the patch.  I didn't see a commitfest entry for 
this one, and unfortunately I think it's too late to add it for the March 
commitfest.  I would encourage you to add it to July's commitfest [0] so that 
we can get some routine cfbot coverage.

On Tue, Feb 27, 2024 at 08:46:06PM +, Amonson, Paul D wrote:
> After consulting some Intel internal experts on MSVC the linking issue 
> as it stood was not resolved. Instead, I created a MSVC ONLY work-around.
> This adds one extra functional call on the Windows builds (The linker 
> resolves a real function just fine but not a function pointer of the 
> same name). This extra latency does not exist on any of the other 
> platforms. I also believe I addressed all issues raised in the 
> previous reviews. The new pg_popcnt_x86_64_accel.c file is now the 
> ONLY file compiled with the
> AVX512 compiler flags. I added support for the MSVC compiler flag as 
> well. Both meson and autoconf are updated with the new refactor.
> 
> I am attaching the new patch.

I think this patch might be missing the new files.

-#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
+#define LARGE_OFF_T ((((off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) 
+<< 31))

IME this means that the autoconf you are using has been patched.  A quick 
search on the mailing lists seems to indicate that it might be specific to 
Debian [1].

-static int pg_popcount32_slow(uint32 word);
-static int pg_popcount64_slow(uint64 word);
+intpg_popcount32_slow(uint32 word);
+intpg_popcount64_slow(uint64 word);
+uint64 pg_popcount_slow(const char *buf, int bytes);

This patch appears to do a lot of refactoring.  Would it be possible to break 
out the refactoring parts into a prerequisite patch that could be reviewed and 
committed independently from the AVX512 stuff?

-#if SIZEOF_VOID_P >= 8
+#if SIZEOF_VOID_P == 8
/* Process in 64-bit chunks if the buffer is aligned. */
-   if (buf == (const char *) TYPEALIGN(8, buf))
+   if (buf == (const char *)TYPEALIGN(8, buf))
{
-   const uint64 *words = (const uint64 *) buf;
+   const uint64 *words = (const uint64 *)buf;
 
while (bytes >= 8)
{
@@ -309,9 +213,9 @@ pg_popcount(const char *buf, int bytes)
bytes -= 8;
}
 
-   buf = (const char *) words;
+   buf = (const char *)words;
}
-#else
+#elif SIZEOF_VOID_P == 4
/* Process in 32-bit chunks if the buffer is aligned. */
if (buf == (const char *) TYPEALIGN(4, buf))
{

Most, if not all, of these changes seem extraneous.  Do we actually need to 
more strictly check SIZEOF_VOID_P?

[0] https://commitfest.postgresql.org/48/
[1] https://postgr.es/m/20230211020042.uthdgj72kp3xlqam%40awork3.anarazel.de

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com


v5-0001-Add-support-for-AVX512-implemented-POPCNT.patch
Description: v5-0001-Add-support-for-AVX512-implemented-POPCNT.patch


Re: Adding deprecation notices to pgcrypto documentation

2024-03-04 Thread Daniel Gustafsson
> On 16 Nov 2023, at 21:49, Daniel Gustafsson  wrote:
> 
> In the "Allow tests to pass in OpenSSL FIPS mode" thread [0] it was discovered
> that 3DES is joining the ranks of NIST disallowed algorithms.  The attached
> patch adds a small note to the pgcrypto documentation about deprecated uses of
> algorithms.  I've kept it to "official" notices such as RFC's and NIST SP's.
> There might be more that deserve a notice, but this seemed like a good start.
> 
> Any thoughts on whether this would be helpful?

Cleaning out my inbox I came across this which I still think is worth doing,
any objections to going ahead with it?

--
Daniel Gustafsson





Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan

2024-03-04 Thread Matthias van de Meent
On Sat, 2 Mar 2024 at 02:30, Peter Geoghegan  wrote:
>
> On Thu, Feb 15, 2024 at 6:36 PM Peter Geoghegan  wrote:
> > Attached is v11, which now says something like that in the commit
> > message.
>
> Attached is v12.

Some initial comments on the documentation:

> +that searches for multiple values together.  Queries that use certain
> +SQL constructs to search for rows matching any value
> +out of a list (or an array) of multiple scalar values might perform
> +multiple primitive index scans (up to one primitive scan
> +per scalar value) at runtime.  See <xref linkend="functions-comparisons"/>
> +for details.

I don't think the "see <xref linkend="functions-comparisons"/> for details" is
correctly worded: The surrounding text implies that it would contain
details about in which precise situations multiple primitive index
scans would be consumed, while it only contains documentation about
IN/NOT IN/ANY/ALL/SOME.

Something like the following would fit better IMO:

+that searches for multiple values together.  Queries that use certain
+SQL constructs to search for rows matching any value
+out of a list or array of multiple scalar values (such as those described
+in <xref linkend="functions-comparisons"/>) might perform multiple primitive
+index scans (up to one primitive scan per scalar value) at runtime.

Then there is a second issue in the paragraph: Inverted indexes such
as GIN might well decide to start counting more than one "primitive
scan" per scalar value, because they may need to go through their
internal structure more than once to produce results for a single
scalar value; e.g. with queries WHERE textfield LIKE '%some%word%', a
trigram index would likely use at least 4 descents here: one for each
of "som", "ome", "wor", "ord".

> > All that really remains now is to research how we might integrate this
> > work with the recently added continuescanPrechecked/haveFirstMatch
> > stuff from Alexander Korotkov, if at all.
>
> The main change in v12 is that I've integrated both the
> continuescanPrechecked and the haveFirstMatch optimizations. Both of
> these fields are now page-level state, shared between the _bt_readpage
> caller and the _bt_checkkeys/_bt_advance_array_keys callees (so they
> appear next to the new home for _bt_checkkeys' continuescan field, in
> the new page state struct).

Cool. I'm planning to review the rest of this patch this
week/tomorrow, could you take some time to review some of my btree
patches, too?

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: [PATCH] Exponential backoff for auth_delay

2024-03-04 Thread Robert Haas
On Mon, Mar 4, 2024 at 2:43 PM Michael Banck  wrote:
> > 3) Do we actually need the exponential_backoff GUC? Wouldn't it be
> > sufficient to have the maximum value, and if it's -1 treat that as no
> > backoff?
>
> That is a good question, I guess that makes sense.
>
> The next question then is: what should the default for (now)
> auth_delay.max_milliseconds be in this case, -1? Or do we say that as
> the default for auth_delay.milliseconds is 0 anyway, why would somebody
> not want exponential backoff when they switch it on and keep it at the
> current 10s/1ms)?
>
> I have not changed that for now, pending further input.

I agree that two GUCs here seems to be one more than necessary, but I
wonder whether we couldn't just say 0 means no exponential backoff and
any other value is the maximum time. The idea that 0 means unlimited
doesn't seem useful in practice. There's always going to be some
limit, at least by the number of bits we have in the data type that
we're using to do the calculation. But that limit is basically never
the right answer. I mean, I think 2^31 milliseconds is roughly 25
days, but it seems unlikely to me that delays measured in days
helpfully more secure than delays measured in minutes, and they don't
seem very convenient for users either, and do you really want a failed
connection to linger for days before failing? That seems like you're
DOSing yourself. If somebody wants to configure a very large value
explicitly, cool, they can do as they like, but we don't need to
complicate the interface to make it easier for them to do so.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Improve the log message output of basic_archive when basic_archive.archive_directory parameter is not set

2024-03-04 Thread Daniel Gustafsson
> On 4 Mar 2024, at 18:22, Nathan Bossart  wrote:
> 
> On Mon, Mar 04, 2024 at 03:21:59PM +0100, Daniel Gustafsson wrote:
>> Looking at this again I think this is about ready to go in.  My only comment 
>> is
>> that doc/src/sgml/archive-modules.sgml probably should be updated to refer to
>> setting the errdetail, especially since we document the errormessage there.
> 
> Thanks for reviewing.  How does this look?

Looks good from a read-through, I like it.  A few comments on the commit
message only:

actionable details about the source of the miconfiguration.  This
s/miconfiguration/misconfiguration/

Reviewed-by: Daniel Gustafsson, ¡lvaro Herrera
Alvaro's name seems wrong.

--
Daniel Gustafsson





Re: Shared detoast Datum proposal

2024-03-04 Thread Tomas Vondra
On 3/4/24 18:08, Andy Fan wrote:
> ...
>>
>>> I assumed that releasing all of the memory at the end of the executor at
>>> once is not an option since it may consume too much memory. Then, when and
>>> which entry to release becomes a trouble for me. For example:
>>>
>>>   QUERY PLAN
>>> --
>>>  Nested Loop
>>>Join Filter: (t1.a = t2.a)
>>>->  Seq Scan on t1
>>>->  Seq Scan on t2
>>> (4 rows)
>>>
>>> In this case t1.a needs a longer lifespan than t2.a since it is
>>> in the outer relation. Without help from the slot's life-cycle system, I
>>> can't think of an answer to the above question.
>>>
>>
>> This is true, but how likely are such plans? I mean, surely no one would
>> do a nested loop with sequential scans on reasonably large tables, so how
>> representative is this example?
> 
> Actually this is the simplest join case; we still have the same problem with
> Nested Loop + Index Scan, which will be pretty common.
> 

Yes, I understand there are cases where LRU eviction may not be the best
choice - like here, where "t1" should stay in the cache. But there
are also cases where this is the wrong choice, and LRU would be better.

For example, a couple of paragraphs down you suggest enforcing the memory
limit by disabling detoasting once the limit is reached. That means
the detoasting can get disabled because there's a single access to the
attribute somewhere "up the plan tree". But what if the other attributes
(which now won't be detoasted) are accessed many times until then?

I think LRU is a pretty good "default" algorithm if we don't have a very
good idea of the exact life span of the values, etc. Which is why other
nodes (e.g. Memoize) use LRU too.

But I wonder if there's a way to count how many times an attribute is
accessed (or is likely to be). That might be used to inform a better
eviction strategy.

Also, we don't need to evict the whole entry - we could evict just the
data part (guaranteed to be fairly large), but keep the header, and keep
the counts, expected number of hits, and other info. And use this to
e.g. release entries that reached the expected number of hits. But I'd
probably start with LRU and only do this as an improvement later.
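
To make that concrete, a minimal sketch of what such a cache entry could
look like (field names here are illustrative assumptions, not code from
either patch):

typedef struct ToastCacheEntry
{
	Oid			toastrelid;		/* TOAST table the value lives in */
	Oid			valueid;		/* va_valueid identifying the value */
	int32		rawsize;		/* size of the detoasted data */
	char	   *data;			/* detoasted bytes, or NULL once evicted */
	uint64		nhits;			/* accesses observed so far */
	uint64		expected_hits;	/* estimate, if we can compute one */
	dlist_node	lru_node;		/* position in the LRU list (lib/ilist.h) */
} ToastCacheEntry;

Evicting just "data" while keeping the rest of the entry would let the cache
retain the hit counters across evictions, as suggested above.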

>> Also, this leads me to the need of having some sort of memory limit. If
>> we may be keeping entries for extended periods of time, and we don't
>> have any way to limit the amount of memory, that does not seem great.
>>
>> AFAIK if we detoast everything into tts_values[] there's no way to
>> implement and enforce such limit. What happens if there's a row with
>> multiple large-ish TOAST values? What happens if those rows are in
>> different (and distant) parts of the plan?
> 
> I think this can be done by tracking the memory usage at the EState level
> or in a global variable, disabling detoasting when usage exceeds the limit,
> and resuming it when we free detoasted datums we no longer need. I think
> no other changes need to be done.
> 

That seems like a fair amount of additional complexity. And what if the
toasted values are accessed in context without EState (I haven't checked
how common / important that is)?

And I'm not sure just disabling the detoast is a good way to enforce a memory
limit, as explained earlier.

>> It seems far easier to limit the memory with the toast cache.
> 
> I think the memory limit and entry eviction is the key issue now. IMO,
> there are still some differences even when both methods can support a
> memory limit. The reason is my patch can guarantee the cached memory will
> be reused, so if we set the limit to 10MB, we know all of the 10MB is
> useful, but the TOAST cache method probably can't guarantee that; then
> when we want to make it effective, we have to set a higher limit for
> this.
> 

Can it actually guarantee that? It can guarantee the slot may be used,
but I don't see how it could guarantee the detoasted value will be used.
We may be keeping the slot for other reasons. And even if it could
guarantee the detoasted value will be used, does that actually prove
it's better to keep that value? What if it's only used once, but it's
blocking detoasting of values that will be used 10x that?

If we detoast a 10MB value in the outer side of the Nested Loop, what if
the inner path has multiple accesses to another 10MB value that now
can't be detoasted (as a shared value)?

> 
>>> Another difference between the 2 methods is that my method has many
>>> modifications in createplan.c/setref.c/execExpr.c/execExprInterp.c, but
>>> it can save some run-time effort, like the hash_search find / enter in
>>> method 2, since I put the values directly into tts_values[*].
>>>
>>> I'm not sure factor 2 makes a real measurable difference in practice,
>>> so my current concern mainly comes from factor 1.
>>> """
>>>
>>
>> This seems a bit dismissive of the (possible) issue.
> 
> Hmm, sorry about this, that is not my intention:(
> 

No need to apologize.

>> It'd be good to do
>> some measurements, especially on simple queries 

Re: Fix a typo in pg_rotate_logfile

2024-03-04 Thread Daniel Gustafsson
> On 14 Feb 2024, at 21:48, Daniel Gustafsson  wrote:
>> On 14 Feb 2024, at 19:51, Nathan Bossart  wrote:
>> On Wed, Feb 14, 2024 at 10:04:49AM -0500, Tom Lane wrote:
>>> Daniel Gustafsson  writes:

 Attached is a diff to show what it would look like to remove adminpack 
 (catalog
 version bump omitted on purpose to avoid conflicts until commit).
>>> 
>>> I don't see any references you missed, so +1.
>> 
>> Seems reasonable to me, too.
> 
> Thanks!  I'll put this in the next CF to keep it open for comments a bit
> longer, but will close it early in the CF.

This has now been pushed, adminpack has left the building.

--
Daniel Gustafsson





Re: postgres_fdw fails to see that array type belongs to extension

2024-03-04 Thread Tom Lane
I wrote:
> Now that the multirange issue is fixed (3e8235ba4), here's a
> new version that includes the needed recursion in ALTER EXTENSION.
> I spent some more effort on a proper test case, too.

I looked this over again and pushed it.

regards, tom lane




Re: Add system identifier to backup manifest

2024-03-04 Thread Robert Haas
On Mon, Mar 4, 2024 at 12:35 AM Amul Sul  wrote:
> Yes, you are correct. Both of the current callers of get_controlfile() have
> access to the root directory.
>
> I have dropped the 0002 patch -- I don't have a very strong opinion about
> refactoring get_controlfile(), apart from saying that it might be good to
> have both versions :).

I don't have an enormously strong opinion on what the right thing to
do is here either, but I am not convinced that the change proposed by
Michael is an improvement. After all, that leaves us with the
situation where we know the path to the control file in three
different places. First, verify_backup_file() does a strcmp() against
the string "global/pg_control" to decide whether to call
verify_system_identifier(). Then, verify_system_identifier() uses that
string to construct a pathname to the file that will be read. Then,
get_controlfile() reconstructs the same pathname using its own logic.
That's all pretty disagreeable. Hard-coded constants are hard to avoid
completely, but here it looks an awful lot like we're trying to
hardcode the same constant into as many different places as we can.
The now-dropped patch seems like an effort to avoid this, and while
it's possible that it wasn't the best way to avoid this, I still think
avoiding it somehow is probably the right idea.
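
For what it's worth, one low-tech way to avoid spelling the string in three
places would be a single symbolic constant; xlog_internal.h already defines
XLOG_CONTROL_FILE as "global/pg_control". A sketch (the variable names are
assumptions for illustration, not the actual patch):

#include "access/xlog_internal.h"	/* XLOG_CONTROL_FILE */

/* in verify_backup_file(), deciding whether to verify the identifier */
if (strcmp(relpath, XLOG_CONTROL_FILE) == 0)
    verify_system_identifier(context, fullpath);

/* in verify_system_identifier(), building the path from the same macro */
snprintf(controlpath, sizeof(controlpath), "%s/%s",
         backup_directory, XLOG_CONTROL_FILE);

That still leaves get_controlfile() reconstructing the path internally, which
is the part the dropped refactoring patch was trying to address.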

I get a compiler warning with 0002, too:

../pgsql/src/backend/backup/basebackup_incremental.c:960:22: warning:
call to undeclared function 'GetSystemIdentifier'; ISO C99 and later
do not support implicit function declarations
[-Wimplicit-function-declaration]
system_identifier = GetSystemIdentifier();
^
1 warning generated.

But I've committed 0001.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: [PATCH] Exponential backoff for auth_delay

2024-03-04 Thread Michael Banck
Hi,

On Wed, Feb 21, 2024 at 10:26:11PM +0100, Tomas Vondra wrote:
> Thanks for the patch. I took a closer look at v3, so let me share some
> review comments. Please push back if you happen to disagree with some of
> it, some of this is going to be more a matter of personal preference.

Thanks! As my patch was based on a previous patch, some of the decisions
were carry-overs I am not overly attached to.
 
> 1) I think it's a bit weird to have two options specifying amount of
> time, but one is in seconds and one in milliseconds. Won't that be
> unnecessarily confusing? Could we do both in milliseconds?

Alright, I changed that.
 
See below for a discussion about the GUCs in general.
 
> 2) The C code defines the GUC as auth_delay.exponential_backoff, but the
> SGML docs say auth_delay.exp_backoff.

Right, an oversight from the last version where the GUC name got changed
but I forgot to change the documentation, fixed.
 
> 3) Do we actually need the exponential_backoff GUC? Wouldn't it be
> sufficient to have the maximum value, and if it's -1 treat that as no
> backoff?
 
That is a good question, I guess that makes sense.

The next question then is: what should the default for (now)
auth_delay.max_milliseconds be in this case, -1? Or do we say that as
the default for auth_delay.milliseconds is 0 anyway, why would somebody
not want exponential backoff when they switch it on and keep it at the
current 10s/1ms?

I have not changed that for now, pending further input.
 
> 4) I think the SGML docs should more clearly explain that the delay is
> initialized to auth_delay.milliseconds, and reset to this value after
> successful authentication. The wording kinda implies that, but it's not
> quite clear I think.

Ok, I added some text to that end. I also added a note that
auth_delay.max_milliseconds will mean that the delay doubling never
stops.
 
> 4) I've been really confused by the change from
> 
>  if (status != STATUS_OK)
>to
>  if (status == STATUS_ERROR)
> 
> in auth_delay_checks, until I realized those two codes are not the only
> ones, and we intentionally ignore STATUS_EOF. I think it'd be good to
> mention that in a comment, it's not quite obvious (I only realized it
> because the e-mail mentioned it).

Yeah I agree, I tried to explain that now.
 
> 5) I kinda like the custom that functions in a contrib module start with
> a clear common prefix, say auth_delay_ in this case. Matter of personal
> preference, ofc.

Ok, I changed the functions to have an auth_delay_ prefix throughout.
 
> 6) I'm a bit skeptical about some acr_array details. Firstly, why would
> 50 entries be enough? Seems like a pretty low and arbitrary number ...
> Also, what happens if someone attempts to authenticate, fails to do so,
> and then disconnects and never tries again? Or just changes IP? Seems
> like the entry will remain in the array forever, no?

Yeah, that is how v3 of this patch worked. I have changed that now, see
below.

> Seems like a way to cause a "limited" DoS - do auth failure from 50
> different hosts, to fill the table, and now everyone has to wait the
> maximum amount of time (if they fail to authenticate).

Right, though the problem would only exist on authentication failures,
so it is really rather limited.
 
> I think it'd be good to consider:
> 
> - Making the acr_array a hash table, and larger than 50 entries (I
> wonder if that should be dynamic / customizable by GUC?).

I would say a GUC would be overkill for this, as it is mostly an
implementation detail.

More generally, I think now that entries are expired (see below), this
should be less of a concern, so I have not changed this to a hash table
for now but doubled MAX_CONN_RECORDS to 100 entries.
 
> - Make sure the entries are eventually expired, based on time (for
> example after 2*max_delay?).

I went with 5*max_milliseconds - the AuthConnRecord struct now has a
last_failed_auth timestamp member; if we increase the delay for a host,
we check if any other host expired in the meantime and remove it if so.
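
For readers following along, the doubling-with-cap and expiry logic described
here could be sketched roughly as follows (the struct and GUC names follow
the thread, but the helpers are illustrative, not the v4 patch itself):

#include "postgres.h"
#include "utils/timestamp.h"
#include <netdb.h>				/* NI_MAXHOST */

typedef struct AuthConnRecord
{
	char		remote_host[NI_MAXHOST];
	double		sleep_time;			/* current delay, in milliseconds */
	TimestampTz last_failed_auth;	/* for expiry */
} AuthConnRecord;

static void
auth_delay_record_failure(AuthConnRecord *rec)
{
	/* double the delay, but clamp it to the configured maximum */
	rec->sleep_time *= 2;
	if (rec->sleep_time > auth_delay_max_milliseconds)
		rec->sleep_time = auth_delay_max_milliseconds;
	rec->last_failed_auth = GetCurrentTimestamp();
}

/* an entry is considered stale after 5 * max_milliseconds */
static bool
auth_delay_expired(AuthConnRecord *rec, TimestampTz now)
{
	return TimestampDifferenceExceeds(rec->last_failed_auth, now,
									  5 * auth_delay_max_milliseconds);
}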
 
> - It would be a good idea to log something when we get into the "full
> table" and start delaying everyone by max_delay_seconds. (How would
> anyone know that's what's happening, seems rather confusing.)

Right, I added a log line for that. However, I made it LOG instead of
WARNING as I don't think the client would ever see it, would he?

Attached is v4 with the above changes.


Cheers,

Michael
>From 6520fb1e768bb8bc26cafad014d08d7e9ac7f12a Mon Sep 17 00:00:00 2001
From: Michael Banck 
Date: Wed, 27 Dec 2023 15:55:39 +0100
Subject: [PATCH v4] Add optional exponential backoff to auth_delay contrib
 module.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

This adds two new GUCs for auth_delay, exponential_backoff and
max_milliseconds. The former controls whether exponential backoff should be
used or not, the latter sets a maximum delay (default is 10s) in case
exponential backoff is 

Re: incremental backup mishandles XLOG_DBASE_CREATE_FILE_COPY

2024-03-04 Thread Robert Haas
On Sat, Feb 24, 2024 at 12:10 PM Noah Misch  wrote:
> Agreed, those don't touch relation data files.  I think you've got all
> relation data file mutations.  XLOG_DBASE_CREATE_FILE_COPY and XLOG_DBASE_DROP
> are the only record types that touch a relation data file without mentioning
> it in XLogRecordBlockHeader, XACT_XINFO_HAS_RELFILELOCATORS, or an RM_SMGR_ID
> rlocator field.

Thanks for the review. I have committed this.

--
Robert Haas
EDB: http://www.enterprisedb.com




Re: Tidy fill hstv array (src/backend/access/heap/pruneheap.c)

2024-03-04 Thread Andres Freund
Hi,

On 2024-03-04 08:47:11 -0300, Ranier Vilela wrote:
> Does filling a memory area, one by one, with branches, need strong evidence
> to prove that it is better than filling a memory area, all at once, without
> branches?

That's a bogus comparison:

a) the memset version modifies the whole array, rather than just elements that
   correspond to an item - often there will be far fewer items than the
   maximally possible number

b) the memset version initializes array elements that will be overwritten
   with an actual value anyway

c) switching to memset does not elide any branches, as the branch is still
   needed

And even without those, it'd still not be obviously better to use an
ahead-of-time memset(), as storing lots of values at once is more likely to be
bound by memory bandwidth than interleaving different work.
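
To illustrate a) and b) with a generic, self-contained example (this is not
the pruneheap.c code, just the shape of the argument):

#include <string.h>

#define NSLOTS 256

/* memset first: writes all NSLOTS bytes, then overwrites the live ones */
static void
fill_with_memset(signed char status[NSLOTS], const int *items, int nitems)
{
	memset(status, -1, NSLOTS);
	for (int i = 0; i < nitems; i++)
		status[items[i]] = 1;	/* these slots get written twice */
}

/* per-item: writes only the live slots; deciding which slots are live
 * requires the items[] information (a branch or a list) either way */
static void
fill_per_item(signed char status[NSLOTS], const int *items, int nitems)
{
	for (int i = 0; i < nitems; i++)
		status[items[i]] = 1;
}

When nitems is far below NSLOTS, the memset version touches far more memory
for no benefit, which is exactly point a).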

Andres




Re: RFC: Logging plan of the running query

2024-03-04 Thread Robert Haas
On Sat, Mar 2, 2024 at 10:46 AM James Coleman  wrote:
> If I can rephrase this idea: it's basically "delay this interrupt
> until inline to the next ExecProcNode execution".

Yes, but it's not just that. It also means that the code which would
handle the interrupt doesn't need to be called at every ExecProcNode.
Only when the interrupt actually arrives do we enable the code that
handles it.
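
A sketch of that pattern, reusing the executor's existing
ExecProcNode/ExecProcNodeReal indirection (the flag and the LogQueryPlan
helper are assumed names, not the actual patch):

static volatile sig_atomic_t log_plan_pending = false;

/* one-shot wrapper: logs the plan once, then restores the fast path */
static TupleTableSlot *
ExecProcNodeWithPlanLogging(PlanState *node)
{
	if (log_plan_pending)
	{
		log_plan_pending = false;
		LogQueryPlan(node);		/* assumed helper */
	}
	node->ExecProcNode = node->ExecProcNodeReal;	/* back to normal */
	return node->ExecProcNode(node);
}

The interrupt handler only sets the flag and swaps the wrapper in, so the
steady state pays nothing; the check exists only while an interrupt is
outstanding.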

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Avoid is possible a expensive function call (src/backend/utils/adt/varlena.c)

2024-03-04 Thread Ranier Vilela
On Mon, Mar 4, 2024 at 14:54, Matthias van de Meent
<boekewurm+postg...@gmail.com> wrote:

> On Mon, 4 Mar 2024 at 18:39, Ranier Vilela  wrote:
> >
> > Hi,
> >
> > The function var_strcmp is a critical function.
> > Inside the function, there is a shortcut condition,
> > which allows for a quick exit.
> >
> > Unfortunately, the current code calls a very expensive function
> > beforehand, which means that if the test is true, all the call time is wasted.
> > So, IMO, it's better to postpone the function call until it is
> > actually necessary.
>
> Thank you for your contribution.
>
> I agree it would be better, but your current patch is incorrect,
> because we need to check if the user has access to the locale (and
> throw an error if not) before we return that the two strings are
> equal.
>
I can't see any user validation in the function pg_newlocale_from_collation.

meson test passes all checks.

best regards,
Ranier Vilela


Re: Shared detoast Datum proposal

2024-03-04 Thread Andy Fan


>
> I decided to whip up a PoC patch implementing this approach, to get a
> better idea of how it compares to the original proposal, both in terms
> of performance and code complexity.
>
> Attached are two patches that both add a simple cache in detoast.c, but
> differ in when exactly the caching happens.
>
> toast-cache-v1 - caching happens in toast_fetch_datum, which means this
> happens before decompression
>
> toast-cache-v2 - caching happens in detoast_attr, after decompression
...
>
> I think implementation-wise this is pretty non-invasive.
..
>
> Performance-wise, these patches are slower than with Andy's patch. For
> example for the jsonb Q1, I see master ~500ms, Andy's patch ~100ms and
> v2 at ~150ms. I'm sure there's a number of optimization opportunities
> and v2 could get closer to the 100ms.
>
> v1 is not competitive at all in this jsonb/Q1 test - it's just as slow
> as master, because the data set is small enough to be fully cached, so
> there's no I/O and the decompression is the actual bottleneck.
>
> That being said, I'm not convinced v1 would be a bad choice for some
> cases. If there's memory pressure enough to evict TOAST, it's quite
> possible the I/O would become the bottleneck. At which point it might be
> a good trade off to cache compressed data, because then we'd cache more
> detoasted values.
>
> OTOH it's likely the access to TOAST values is localized (in temporal
> sense), i.e. we read it from disk, detoast it a couple times in a short
> time interval, and then move to the next row. That'd mean v2 is probably
> the correct trade off.

I don't have a different thought about the above statement, and thanks for
the PoC code, which will make later testing much easier.

>> 
>> """
>> An alternative design: toast cache
>> -
>> 
>> This method was proposed by Tomas during the review process. IIUC, this
>> method would maintain a local HTAB which maps a toast datum to a detoasted
>> datum, and the entry is maintained / used in the detoast_attr
>> function. Within this method, the overall design is pretty clear and the
>> code modification can be controlled in toasting system only.
>> 
>
> Right.
>
>> I assumed that releasing all of the memory once, at the end of the executor,
>> is not an option since it may consume too much memory. Then, when and
>> which entry to release becomes a trouble for me. For example:
>> 
>>   QUERY PLAN
>> --
>>  Nested Loop
>>Join Filter: (t1.a = t2.a)
>>->  Seq Scan on t1
>>->  Seq Scan on t2
>> (4 rows)
>> 
>> In this case t1.a needs a longer lifespan than t2.a since it is
>> in the outer relation. Without the help of the slot's life-cycle system, I
>> can't think of an answer for the above question.
>> 
>
> This is true, but how likely are such plans? I mean, surely no one would
> do a nested loop with sequential scans on reasonably large tables, so how
> representative is this example?

Actually this is the simplest join case; we still have the same problem with
Nested Loop + Index Scan, which will be pretty common.

> Also, this leads me to the need of having some sort of memory limit. If
> we may be keeping entries for extended periods of time, and we don't
> have any way to limit the amount of memory, that does not seem great.
>
> AFAIK if we detoast everything into tts_values[] there's no way to
> implement and enforce such limit. What happens if there's a row with
> multiple large-ish TOAST values? What happens if those rows are in
> different (and distant) parts of the plan?

I think this can be done by tracking the memory usage at the EState level
or in a global variable, disabling detoasting when usage exceeds the limit,
and resuming it when we free detoasted datums we no longer need. I think
no other changes need to be done.

> It seems far easier to limit the memory with the toast cache.

I think the memory limit and entry eviction is the key issue now. IMO,
there are still some differences even when both methods can support a
memory limit. The reason is my patch can guarantee the cached memory will
be reused, so if we set the limit to 10MB, we know all of the 10MB is
useful, but the TOAST cache method probably can't guarantee that; then
when we want to make it effective, we have to set a higher limit for
this.


>> Another difference between the 2 methods is that my method has many
>> modifications in createplan.c/setref.c/execExpr.c/execExprInterp.c, but
>> it can save some run-time effort, like the hash_search find / enter in
>> method 2, since I put the values directly into tts_values[*].
>> 
>> I'm not sure factor 2 makes a real measurable difference in practice,
>> so my current concern mainly comes from factor 1.
>> """
>> 
>
> This seems a bit dismissive of the (possible) issue.

Hmm, sorry about this, that is not my intention:(

> It'd be good to do
> some measurements, especially on simple queries that can't benefit from
> the detoasting (e.g. because there's no varlena attribute).

This testing 

Re: Avoid is possible a expensive function call (src/backend/utils/adt/varlena.c)

2024-03-04 Thread Matthias van de Meent
On Mon, 4 Mar 2024 at 18:39, Ranier Vilela  wrote:
>
> Hi,
>
> The function var_strcmp is a critical function.
> Inside the function, there is a shortcut condition,
> which allows for a quick exit.
>
> Unfortunately, the current code calls a very expensive function beforehand,
> which means that if the test is true, all the call time is wasted.
> So, IMO, it's better to postpone the function call until it is actually
> necessary.

Thank you for your contribution.

I agree it would be better, but your current patch is incorrect,
because we need to check if the user has access to the locale (and
throw an error if not) before we return that the two strings are
equal.
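
To spell out the ordering concern in code (a sketch assuming the usual
varlena.c includes, not the actual function; whether
pg_newlocale_from_collation really performs such a check is what Ranier
disputes elsewhere in the thread):

static int
varstr_cmp_sketch(const char *a, int lena, const char *b, int lenb,
				  Oid collid)
{
	/*
	 * If this call can raise an error for an unusable collation, hoisting
	 * the bytewise fast path above it would let "equal" inputs skip that
	 * error entirely; that is the objection to the proposed reordering.
	 */
	pg_locale_t mylocale = pg_newlocale_from_collation(collid);

	if (lena == lenb && memcmp(a, b, lena) == 0)
		return 0;				/* fast path: bytewise equal */

	/* full locale-aware comparison elided; it would use mylocale */
	(void) mylocale;
	return (lena < lenb) ? -1 : 1;	/* placeholder result */
}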

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Avoid is possible a expensive function call (src/backend/utils/adt/varlena.c)

2024-03-04 Thread Ranier Vilela
Hi,

The function var_strcmp is a critical function.
Inside the function, there is a shortcut condition,
which allows for a quick exit.

Unfortunately, the current code calls a very expensive function beforehand,
which means that if the test is true, all the call time is wasted.
So, IMO, it's better to postpone the function call until it is
actually necessary.

best regards,
Ranier Vilela


avoid-call-expensive-function-varlena.patch
Description: Binary data


Re: pgsql: Improve performance of subsystems on top of SLRU

2024-03-04 Thread Alvaro Herrera
FWIW there's a stupid bug in 0002, which is fixed here.  I'm writing a
simple test for it.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"Now I have my system running, not a byte was off the shelf;
It rarely breaks and when it does I fix the code myself.
It's stable, clean and elegant, and lightning fast as well,
And it doesn't cost a nickel, so Bill Gates can go to hell."
>From ecf613092a3cbc4c5112d30af7eea55b668decec Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Mon, 4 Mar 2024 11:49:01 +0100
Subject: [PATCH v3] Rework redundant loop in subtrans.c

---
 src/backend/access/transam/subtrans.c | 29 +++
 1 file changed, 7 insertions(+), 22 deletions(-)

diff --git a/src/backend/access/transam/subtrans.c b/src/backend/access/transam/subtrans.c
index dc9566fb51..50bb1d8cfc 100644
--- a/src/backend/access/transam/subtrans.c
+++ b/src/backend/access/transam/subtrans.c
@@ -311,7 +311,7 @@ StartupSUBTRANS(TransactionId oldestActiveXID)
 	FullTransactionId nextXid;
 	int64		startPage;
 	int64		endPage;
-	LWLock	   *prevlock;
+	LWLock	   *prevlock = NULL;
 	LWLock	   *lock;
 
 	/*
@@ -324,42 +324,27 @@ StartupSUBTRANS(TransactionId oldestActiveXID)
 	nextXid = TransamVariables->nextXid;
 	endPage = TransactionIdToPage(XidFromFullTransactionId(nextXid));
 
-	prevlock = SimpleLruGetBankLock(SubTransCtl, startPage);
-	LWLockAcquire(prevlock, LW_EXCLUSIVE);
-	while (startPage != endPage)
+	for (;;)
 	{
 		lock = SimpleLruGetBankLock(SubTransCtl, startPage);
-
-		/*
-		 * Check if we need to acquire the lock on the new bank then release
-		 * the lock on the old bank and acquire on the new bank.
-		 */
 		if (prevlock != lock)
 		{
-			LWLockRelease(prevlock);
+			if (prevlock)
+				LWLockRelease(prevlock);
 			LWLockAcquire(lock, LW_EXCLUSIVE);
 			prevlock = lock;
 		}
 
 		(void) ZeroSUBTRANSPage(startPage);
+		if (startPage == endPage)
+			break;
+
 		startPage++;
 		/* must account for wraparound */
 		if (startPage > TransactionIdToPage(MaxTransactionId))
 			startPage = 0;
 	}
 
-	lock = SimpleLruGetBankLock(SubTransCtl, startPage);
-
-	/*
-	 * Check if we need to acquire the lock on the new bank then release the
-	 * lock on the old bank and acquire on the new bank.
-	 */
-	if (prevlock != lock)
-	{
-		LWLockRelease(prevlock);
-		LWLockAcquire(lock, LW_EXCLUSIVE);
-	}
-	(void) ZeroSUBTRANSPage(startPage);
 	LWLockRelease(lock);
 }
 
-- 
2.39.2



Re: Improve the log message output of basic_archive when basic_archive.archive_directory parameter is not set

2024-03-04 Thread Nathan Bossart
On Mon, Mar 04, 2024 at 03:21:59PM +0100, Daniel Gustafsson wrote:
> Looking at this again I think this is about ready to go in.  My only comment is
> that doc/src/sgml/archive-modules.sgml probably should be updated to refer to
> setting the errdetail, especially since we document the errormessage there.

Thanks for reviewing.  How does this look?

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From 437e4fa9ec27ecf870530fc579cd7673dfcf96af Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Mon, 4 Mar 2024 11:15:37 -0600
Subject: [PATCH v5 1/1] Add macro for customizing an archive module WARNING
 message.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Presently, if an archive module's check_configured_cb callback
returns false, a generic WARNING about the archive module
misconfiguration is emitted, which unfortunately provides no
actionable details about the source of the miconfiguration.  This
commit introduces a macro that archive module authors can use to
add a DETAIL line to this WARNING.

Co-authored-by: Tung Nguyen
Reviewed-by: Daniel Gustafsson, Álvaro Herrera
Discussion: https://postgr.es/m/4109578306242a7cd5661171647e11b2%40oss.nttdata.com
---
 contrib/basic_archive/basic_archive.c |  7 ++-
 doc/src/sgml/archive-modules.sgml | 12 
 src/backend/archive/shell_archive.c   |  7 ++-
 src/backend/postmaster/pgarch.c   |  8 +++-
 src/include/archive/archive_module.h  |  8 
 5 files changed, 39 insertions(+), 3 deletions(-)

diff --git a/contrib/basic_archive/basic_archive.c b/contrib/basic_archive/basic_archive.c
index 804567e919..6b102e9072 100644
--- a/contrib/basic_archive/basic_archive.c
+++ b/contrib/basic_archive/basic_archive.c
@@ -161,7 +161,12 @@ check_archive_directory(char **newval, void **extra, GucSource source)
 static bool
 basic_archive_configured(ArchiveModuleState *state)
 {
-	return archive_directory != NULL && archive_directory[0] != '\0';
+	if (archive_directory != NULL && archive_directory[0] != '\0')
+		return true;
+
+	arch_module_check_errdetail("%s is not set.",
+"basic_archive.archive_directory");
+	return false;
 }
 
 /*
diff --git a/doc/src/sgml/archive-modules.sgml b/doc/src/sgml/archive-modules.sgml
index 7064307d9e..cf7438a759 100644
--- a/doc/src/sgml/archive-modules.sgml
+++ b/doc/src/sgml/archive-modules.sgml
@@ -114,6 +114,18 @@ WARNING:  archive_mode enabled, yet archiving is not configured
 In the latter case, the server will periodically call this function, and
 archiving will proceed only when it returns true.

+
+   
+
+ When returning false, it may be useful to append some
+ additional information to the generic warning message.  To do that, provide
+ a message to the arch_module_check_errdetail macro
+ before returning false.  Like
+ errdetail(), this macro accepts a format string
+ followed by an optional list of arguments.  The resulting string will be
+ emitted as the DETAIL line of the warning message.
+
+   
   
 
   
diff --git a/src/backend/archive/shell_archive.c b/src/backend/archive/shell_archive.c
index c95b732495..bff0ab800d 100644
--- a/src/backend/archive/shell_archive.c
+++ b/src/backend/archive/shell_archive.c
@@ -45,7 +45,12 @@ shell_archive_init(void)
 static bool
 shell_archive_configured(ArchiveModuleState *state)
 {
-	return XLogArchiveCommand[0] != '\0';
+	if (XLogArchiveCommand[0] != '\0')
+		return true;
+
+	arch_module_check_errdetail("%s is not set.",
+"archive_command");
+	return false;
 }
 
 static bool
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index bb0eb13a89..f97035ca03 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -88,6 +88,7 @@ typedef struct PgArchData
 } PgArchData;
 
 char	   *XLogArchiveLibrary = "";
+char	   *arch_module_check_errdetail_string;
 
 
 /* --
@@ -401,12 +402,17 @@ pgarch_ArchiverCopyLoop(void)
 			 */
 			HandlePgArchInterrupts();
 
+			/* Reset variables that might be set by the callback */
+			arch_module_check_errdetail_string = NULL;
+
			/* can't do anything if not configured ... */
			if (ArchiveCallbacks->check_configured_cb != NULL &&
				!ArchiveCallbacks->check_configured_cb(archive_module_state))
			{
				ereport(WARNING,
-						(errmsg("archive_mode enabled, yet archiving is not configured")));
+						(errmsg("archive_mode enabled, yet archiving is not configured"),
+						 arch_module_check_errdetail_string ?
+						 errdetail_internal("%s", arch_module_check_errdetail_string) : 0));
				return;
			}
 
diff --git a/src/include/archive/archive_module.h b/src/include/archive/archive_module.h
index fd59b9faf4..763af76e54 100644
--- a/src/include/archive/archive_module.h
+++ b/src/include/archive/archive_module.h
@@ -56,4 +56,12 @@ typedef const ArchiveModuleCallbacks *(*ArchiveModuleInit) (void);
 
 extern PGDLLEXPORT const 

Re: Tidy fill hstv array (src/backend/access/heap/pruneheap.c)

2024-03-04 Thread Andrey M. Borodin



> On 4 Mar 2024, at 16:47, Ranier Vilela  wrote:
> 
> Does filling a memory area, one by one, with branches, need strong evidence 
> to prove that it is better than filling a memory area, all at once, without 
> branches? 

I apologise for being too quick to decide to mark the patch RwF. Would you mind
if I restore the patch to "Needs review" so we can have more feedback?


Best regards, Andrey Borodin.



Re: a wrong index choose when statistics is out of date

2024-03-04 Thread Tomas Vondra
On 3/4/24 06:33, David Rowley wrote:
> On Sun, 3 Mar 2024 at 20:08, Andy Fan  wrote:
>> The issue can be reproduced with the following steps:
>>
>> create table t (.., created_at timestamp, a int, b int);
>>
>> create index idx_1 on t(created_at, a);
>> create index idx_2 on t(created_at, b);
>>
>> query:
>> select * from t where created_at = current_timestamp and b = 1;
>>
>> index (created_at, a) rather than (created_at, b) may be chosen for the
>> above query if the statistics think "created_at = current_timestamp"
>> matches no rows; then both indexes look equally good, when actually that
>> is only true because the statistics are out of date.
> 
> I don't think there's really anything too special about the fact that
> the created_at column is always increasing. We commonly get 1-row
> estimates after multiplying the selectivities from individual stats.
> Your example just seems like yet another reason that this could
> happen.
> 
> I've been periodically talking about introducing "risk" as a factor
> that the planner should consider.  I did provide some detail in [1]
> about the design that was in my head at that time.  I'd not previously
> thought that it could also solve this problem, but after reading your
> email, I think it can.
> 
> I don't think it would be right to fudge the costs in any way, but I
> think the risk factor for IndexPaths could take into account the
> number of unmatched index clauses and increment the risk factor, or
> "certainty_factor" as it is currently in my brain-based design. That
> way add_path() would be more likely to prefer the index that matches
> the most conditions.
> 
> The exact maths to calculate the certainty_factor for this case I
> don't quite have worked out yet. I plan to work on documenting the
> design of this and try and get a prototype patch out sometime during
> this coming southern hemisphere winter so that there's at least a full
> cycle of feedback opportunity before the PG18 freeze.
> 

I've been thinking about this stuff too, so I'm curious to hear what
kind of plan you come up with. Every time I tried to formalize a more
concrete plan, I ended up with a very complex (and possibly yet more
fragile) approach.

I think we'd need to consider a couple things:


1) reliability of cardinality estimates

I think this is pretty much the same concept as confidence intervals,
i.e. knowing not just the regular estimate, but also a range where the
actual value lies with high confidence (e.g. 90%).

For a single clause this might not be terribly difficult, but for more
complex cases (multiple conditions, ...) it seems far more complex :-(
For example, let's say we know confidence intervals for two conditions.
What's the confidence interval when combined using AND or OR?
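
As a toy example: under an independence assumption selectivities multiply,
so the intervals would too. If clause A's selectivity lies in [0.01, 0.05]
and clause B's in [0.10, 0.20], the AND combination would lie in
[0.001, 0.01]. But if the clauses are correlated - say A implies B - the
true combined selectivity equals A's own selectivity, i.e. up to 0.05, well
outside the naive product interval. The combination rule itself depends on
the unknown correlation.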


2) robustness of the paths

Knowing just the confidence intervals does not seem sufficient, though.
The other thing that matters is how this affects the paths, how robust
the paths are. I mean, if we have alternative paths with costs that flip
somewhere in the confidence interval - which one to pick? Surely one
thing to consider is how fast the costs change for each path.


3) complexity of the model

I suppose we'd prefer a systematic approach (and not some ad hoc
solution for one particular path/plan type). So this would be somehow
integrated into the cost model, making it yet more complex. I'm quite
worried about this (not necessarily because of performance reasons).

I wonder if trying to improve the robustness solely by changes in the
planning phase is a very promising approach. I mean - sure, we should
improve that, but by definition it relies on a priori information. And
not only may the stats be stale - they're a very lossy approximation of the
actual data. Even if the stats are perfectly up to date / very detailed,
there's still going to be a lot of uncertainty.


I wonder if we'd be better off if we experimented with more robust
plans, like SmoothScan [1] or g-join [2].


regards

[1]
https://stratos.seas.harvard.edu/sites/scholar.harvard.edu/files/stratos/files/smooth_vldbj.pdf

[2]
http://wwwlgis.informatik.uni-kl.de/cms/fileadmin/users/haerder/2011/JoinAndGrouping.pdf

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Extract numeric filed in JSONB more effectively

2024-03-04 Thread Andy Fan


Peter Eisentraut  writes:

> On 09.02.24 10:05, Andy Fan wrote:
>> 2. Where has the current feature been blocked for the past few months?
>> It's an error-message compatibility issue! Continue with the above setup:
>> master:
>> select * from tb where (a->'b')::numeric > 3::numeric;
>> ERROR:  cannot cast jsonb string to type numeric
>> select * from tb where (a->'b')::int4 > 3::numeric;
>> ERROR:  cannot cast jsonb string to type integer
>> You can see the error message is different (numeric vs integer).
>> Patched:
>> We still can get the same error message as master BUT the code
>> looks odd.
>> select * from tb where (a->'b')::int4 > 3;
>>  QUERY PLAN
>> ---
>>   Seq Scan on public.tb
>> Output: a
>> Filter: 
>> ((jsonb_finish_numeric(jsonb_object_field_start((tb.a)::internal, 
>> 'b'::text), '23'::oid))::integer > 3)
>> (3 rows)
>> You can see "jsonb_finish_numeric(..,  '23::oid)" the '23::oid' is
>> just
>> for the *"integer"* output in error message:
>> "cannot cast jsonb string to type*integer*"
>> Now the situation is that either we use the odd argument (23::oid) in
>> jsonb_finish_numeric, or we use an error message incompatible with the
>> previous version. I'm not sure which way is better, but this is the
>> place where the current feature is blocked.
>
> I'm not bothered by that.  It also happens on occasion in the backend C
> code that we pass around extra information to be able to construct
> better error messages.  The functions here are not backend C code, but
> they are internal functions, so similar considerations can apply.

Thanks for speaking on this!

>
> But I have a different question about this patch set.  This has some
> overlap with the JSON_VALUE function that is being discussed at
> [0][1]. For example, if I apply the patch
> v39-0001-Add-SQL-JSON-query-functions.patch from that thread, I can run
>
> select count(*) from tb where json_value(a, '$.a' returning numeric) = 2;
>
> and I get a noticeable performance boost over
>
> select count(*) from tb where cast (a->'a' as numeric) = 2;

Here is my test and profile about the above 2 queries.

create table tb(a jsonb);
insert into tb
select jsonb_build_object('a', i) from generate_series(1, 1)i;

cat a.sql
select count(*) from tb
where json_value(a, '$.a' returning numeric) = 2;

cat b.sql
select count(*) from tb where cast (a->'a' as numeric) = 2;

pgbench -n -f xxx.sql postgres -T100 | grep lat

Then here is the result:

| query | master  | patched (patch here and jsonb_value) |
|-------+---------+--------------------------------------|
| a.sql | /   | 2.59 (ms)   |
| b.sql | 3.34 ms | 1.75 (ms)   |

As we can see, the patch here has the best performance (this result looks
to be different from yours?).

After checking the code, I am sure both patches *don't* have the problem
in master where it gets a JsonbValue first, converts it to jsonb, and
then casts to numeric.

Then I perf the result, and find the below stuff:

JSON_VALUE

-   32.02% 4.30%  postgres  postgres   [.] JsonPathValue
   - 27.72% JsonPathValue
  - 22.63% executeJsonPath (inlined)
 - 19.97% executeItem (inlined)
- executeItemOptUnwrapTarget
   - 17.79% executeNextItem
  - 15.49% executeItem (inlined)
 - executeItemOptUnwrapTarget
+ 8.50% getKeyJsonValueFromContainer (note here..)
+ 2.30% executeNextItem
  0.79% findJsonbValueFromContainer
+ 0.68% check_stack_depth
  + 1.51% jspGetNext
   + 0.73% check_stack_depth
   1.27% jspInitByBuffer
   0.85% JsonbExtractScalar
  + 4.91% DatumGetJsonbP (inlined)

Patch here for b.sql:
-

-   19.98% 2.10%  postgres  postgres   [.] jsonb_object_field_start
   - 17.88% jsonb_object_field_start
  - 17.70% jsonb_object_field_internal (inlined)
 + 11.03% getKeyJsonValueFromContainer
 - 6.26% DatumGetJsonbP (inlined)
+ 5.78% detoast_attr
   + 1.21% _start
   + 0.54% 0x55ddb44552a

JSON_VALUE has a much longer way to get to getKeyJsonValueFromContainer,
so I think JSON_VALUE is probably designed for more complex paths, which
need extra effort; that likely explains the performance difference
above.

I added Amit and Alvaro to this thread in case they can have more
insight on this.

> So some questions to think about:
>
> 1. Compare performance of base case vs. this patch vs. json_value.

Done, as above. 
>
> 2. Can json_value be optimized further?

Hmm, I have some trouble understanding A's performance boost over B, and
thus which is better. But in my test above, it looks like the patch here is
better in the given case, and the difference may come from 

Re: Shared detoast Datum proposal

2024-03-04 Thread Tomas Vondra


On 3/4/24 02:23, Andy Fan wrote:
> 
> Tomas Vondra  writes:
> 
>> On 2/26/24 16:29, Andy Fan wrote:
>>>
>>> ...>
>>> I can understand the benefits of the TOAST cache, but the following
>>> issues look not good to me, IIUC:
>>>
>>> 1). we can't put the result into tts_values[*] since without the planner
>>> decision, we don't know if this will break the small_tlist logic. But if we
>>> put it into the cache only, that means a cache lookup as an overhead is
>>> unavoidable.
>>
>> True - if you're comparing having the detoasted value in tts_values[*]
>> directly with having to do a lookup in a cache, then yes, there's a bit
>> of an overhead.
>>
>> But I think from the discussion it's clear having to detoast the value
>> into tts_values[*] has some weaknesses too, in particular:
>>
>> - It requires decisions which attributes to detoast eagerly, which is
>> quite invasive (having to walk the plan, ...).
>>
>> - I'm sure there will be cases where we choose to not detoast, but it
>> would be beneficial to detoast.
>>
>> - Detoasting just the initial slices does not seem compatible with this.
>>
>> IMHO the overhead of the cache lookup would be negligible compared to
>> the repeated detoasting of the value (which is the current baseline). I
>> somewhat doubt the difference compared to tts_values[*] will be even
>> measurable.
>>
>>>
>>> 2). It is hard to decide which entry should be evicted without attaching
>>> it to the TupleTableSlot's life-cycle; then we can't guarantee the entry
>>> we keep is the entry we will reuse soon.
>>>
>>
>> True. But is that really a problem? I imagined we'd set some sort of
>> memory limit on the cache (work_mem?), and evict oldest entries. So the
>> entries would eventually get evicted, and the memory limit would ensure
>> we don't consume arbitrary amounts of memory.
>>
> 
> Here is a copy from the shared_detoast_datum.org in the patch. My feeling
> is that when / which entry to free is the key problem, and the run-time
> (detoast_attr) overhead vs. the createplan.c overhead is a small difference
> as well. The overhead I paid for createplan.c/setref.c doesn't look huge
> either.

I decided to whip up a PoC patch implementing this approach, to get a
better idea of how it compares to the original proposal, both in terms
of performance and code complexity.

Attached are two patches that both add a simple cache in detoast.c, but
differ in when exactly the caching happens.

toast-cache-v1 - caching happens in toast_fetch_datum, which means this
happens before decompression

toast-cache-v2 - caching happens in detoast_attr, after decompression

I started with v1, but that did not help with the example workloads
(from the original post) at all. Only after I changed this to cache
decompressed data (in v2) did it become competitive.

There's a GUC to enable the cache (enable_toast_cache, off by default), to
make experimentation easier.

The cache is invalidated at the end of a transaction - I think this
should be OK, because the rows may be deleted but can't be cleaned up
(or replaced with a new TOAST value). This means the cache could cover
multiple queries - not sure if that's a good thing or bad thing.

I haven't implemented any sort of cache eviction. I agree that's a hard
problem in general, but I doubt we can do better than LRU. I also didn't
implement any memory limit.

FWIW I think it's a fairly serious issue that Andy's approach does not have
any way to limit the memory. It detoasts the values eagerly, but
what if the rows have multiple large values? What if we end up keeping
multiple such rows (from different parts of the plan) at once?

I also haven't implemented caching for sliced data. I think modifying
the code to support this should not be difficult - it'd require tracking
how much data we have in the cache, and considering that during lookup
and when adding entries to cache.

I've implemented the cache as "global". Maybe it should be tied to query
or executor state, but then it'd be harder to access it from detoast.c
(we'd have to pass the cache pointer in some way, and it wouldn't work
for use cases that don't go through executor).

I think implementation-wise this is pretty non-invasive.
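
As a sketch of the shape this takes (the callback wiring is an assumption
about the approach, not necessarily the attached PoC):

static HTAB *toast_cache = NULL;

/* registered once via RegisterXactCallback() */
static void
toast_cache_xact_callback(XactEvent event, void *arg)
{
	if (event == XACT_EVENT_COMMIT || event == XACT_EVENT_ABORT)
	{
		if (toast_cache != NULL)
			hash_destroy(toast_cache);
		toast_cache = NULL;
	}
}

Dropping the whole hash at commit/abort is what makes the "rows can be
deleted but not cleaned up mid-transaction" argument above work.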

Performance-wise, these patches are slower than with Andy's patch. For
example for the jsonb Q1, I see master ~500ms, Andy's patch ~100ms and
v2 at ~150ms. I'm sure there's a number of optimization opportunities
and v2 could get closer to the 100ms.

v1 is not competitive at all in this jsonb/Q1 test - it's just as slow
as master, because the data set is small enough to be fully cached, so
there's no I/O and the decompression is the actual bottleneck.

That being said, I'm not convinced v1 would be a bad choice for some
cases. If there's memory pressure enough to evict TOAST, it's quite
possible the I/O would become the bottleneck. At which point it might be
a good trade off to cache compressed data, because then we'd cache more
detoasted values.

OTOH it's likely the access to TOAST values is 

Re: LogwrtResult contended spinlock

2024-03-04 Thread Bharath Rupireddy
Thanks for looking into this.

On Thu, Feb 22, 2024 at 1:54 AM Jeff Davis  wrote:
>
> > 3.
> > @@ -6371,7 +6373,9 @@ GetFlushRecPtr(TimeLineID *insertTLI)
> >  If at all, a read
> > barrier is warranted here, we can use atomic read with full barrier
>
> I don't think we need a full barrier but I'm fine with using
> pg_atomic_read_membarrier_u64() if it's better for whatever reason.

For the sake of clarity and correctness, I've used
pg_atomic_read_membarrier_u64 everywhere for reading
XLogCtl->LogwrtResult.Write and XLogCtl->LogwrtResult.Flush.

> > 5. I guess we'd better use pg_atomic_read_u64_impl and
> > pg_atomic_compare_exchange_u64_impl in
> > pg_atomic_monotonic_advance_u64
> > to reduce one level of function indirections.
>
> Aren't they inlined?

Yes, all of them are inlined. But, it seems like XXX_impl functions
are being used in implementing exposed functions as a convention.
Therefore, having pg_atomic_read_u64_impl and
pg_atomic_compare_exchange_u64_impl doesn't sound bad IMV.

> > 6.
> > + * Full barrier semantics (even when value is unchanged).
> >
> > +currval = pg_atomic_read_u64(ptr);
> > +if (currval >= target_)
> > +{
> > +pg_memory_barrier();
>
> I don't think they are exactly equivalent: in the current patch, the
> first pg_atomic_read_u64() could be reordered with earlier reads;
> whereas that wouldn't work if using pg_atomic_read_membarrier_u64() it
> could not be. I'm not sure whether that could create a performance
> problem or not.

I left it as-is.
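
For readers without the patch at hand, the primitive under discussion is
roughly the following (a sketch based on the quoted fragment, not the
committed code):

static inline void
pg_atomic_monotonic_advance_u64(volatile pg_atomic_uint64 *ptr, uint64 target_)
{
	uint64		currval;

	currval = pg_atomic_read_u64(ptr);
	if (currval >= target_)
	{
		pg_memory_barrier();	/* full barrier even when value unchanged */
		return;
	}

	while (!pg_atomic_compare_exchange_u64(ptr, &currval, target_))
	{
		/* someone else advanced it; done if they went at least as far */
		if (currval >= target_)
		{
			pg_memory_barrier();
			return;
		}
	}
	/* on success the compare-exchange itself is a full barrier */
}

Jeff's point above is that the initial pg_atomic_read_u64() is a plain read,
so it can be reordered with earlier reads, unlike
pg_atomic_read_membarrier_u64().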

> > 9.
> > +copyptr = pg_atomic_read_u64(&XLogCtl->LogwrtResult.Copy);
> > +if (startptr + count > copyptr)
> > +ereport(WARNING,
> > +(errmsg("request to read past end of generated WAL;
> > request %X/%X, current position %X/%X",
> > +LSN_FORMAT_ARGS(startptr + count),
> > LSN_FORMAT_ARGS(copyptr))));
> >
> > Any specific reason for this to be a WARNING rather than an ERROR?
>
> Good question. WaitXLogInsertionsToFinish() uses a LOG level message
> for the same situation. They should probably be the same log level, and
> I would think it would be either PANIC or WARNING. I have no idea why
> LOG was chosen.

WaitXLogInsertionsToFinish adjusts upto after LOG so that the wait is
never past the current insert position even if a caller asks for
reading WAL that doesn't yet exist. And the comment there says "Here
we just assume that to mean that all WAL that has been reserved needs
to be finished."

In contrast, WALReadFromBuffers kind of enforces callers to do
WaitXLogInsertionsToFinish (IOW asks callers to send in the WAL that
exists in the server). Therefore, an ERROR seems a reasonable choice
to me, if PANIC sounds rather strong affecting all the postgres
processes.

FWIW, a PANIC when requested to flush past the end of WAL in
WaitXLogInsertionsToFinish instead of LOG seems to be good. CF bot
animals don't complain -
https://github.com/BRupireddy2/postgres/tree/be_harsh_when_request_to_flush_past_end_of_WAL_WIP.

> 0001:
>
> * The comments on the two versions of the functions are redundant, and
> the style in that header seems to be to omit the comment from the u64
> version.

Removed comments atop 64-bit version.

> * I'm not sure the AssertPointerAlignment is needed in the u32 version?

Borrowed them from pg_atomic_read_u32 and
pg_atomic_compare_exchange_u32, just like how they assert before
calling XXX_impl versions. I don't see any problem with them.

> 0002:
>
> * All updates to the non-shared LogwrtResult should update both values.
> It's confusing to update those local values independently, because it
> violates the invariant that LogwrtResult.Flush <= LogwrtResult.Write.
>
> > 2. I guess we need to update both the Write and Flush local copies in
> > AdvanceXLInsertBuffer.
>
> I agree. Whenever we update the non-shared LogwrtResult, let's update
> the whole thing. Otherwise it's confusing.

Yes, it's done that way now with a macro XLogUpdateLocalLogWrtResult
using pg_atomic_read_membarrier_u64 to read both Write and Flush ptrs.

> * pg_memory_barrier() is not needed right before a spinlock

Got rid of it as we read both Flush and Write local copies with
pg_atomic_read_membarrier_u64.

> * As mentioned above, I think GetFlushRecPtr() needs two read barriers.
> Also, I think the same for GetXLogWriteRecPtr().

Yes, it's done that way now with a macro XLogUpdateLocalLogWrtResult
using pg_atomic_read_membarrier_u64 to read both Write and Flush ptrs.

> * In general, for any place using both Write and Flush, I think Flush
> should be loaded first, followed by a read barrier, followed by a load
> of the Write pointer.

Why read Flush first rather than Write? I think it's enough to do
{read Write,  read barrier, read Flush}. This works because Write is
monotonically advanced first before Flush using full barriers and we
don't get reordering issues between the readers and writers no? Am I
missing anything here?
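
One way to see the suggestion: writers always advance Write before Flush,
preserving Flush <= Write. A reader that loads Write first and Flush second
can have a writer advance both pointers in between, pairing an old Write with
a new Flush and locally observing Flush > Write. Reading Flush first (with a
barrier before reading Write) cannot produce that inversion, since the Write
loaded afterwards is at least as new as the Flush already loaded. That
appears to be the rationale for the Flush-first ordering.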

> And I think in most of those places there should
> be 

Re: Synchronizing slots from primary to standby

2024-03-04 Thread Bertrand Drouvot
Hi,

On Mon, Mar 04, 2024 at 01:28:04PM +, Zhijie Hou (Fujitsu) wrote:
> Attach the V105 patch set

Thanks!

Sorry I missed those during the previous review:

1 ===

Commit message: "these functions will block until"

s/block/wait/ ?

2 ===

+when used with logical failover slots, will block until all

s/block/wait/ ?

It seems those are the 2 remaining "block" that could deserve the proposed
above change.

3 ===

+   invalidated = slot->data.invalidated != RS_INVAL_NONE;
+   inactive = slot->active_pid == 0;

invalidated = (slot->data.invalidated != RS_INVAL_NONE);
inactive = (slot->active_pid == 0);

instead?

I think it's easier to read and it looks like this is the way it's written in
other places (at least the few I checked).

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Experiments with Postgres and SSL

2024-03-04 Thread Heikki Linnakangas

On 01/03/2024 23:49, Jacob Champion wrote:

On Wed, Feb 28, 2024 at 4:10 AM Heikki Linnakangas  wrote:

I think we'd want to *avoid* changing the major protocol version in a
way that would introduce a new roundtrip, though.


I'm starting to get up to speed with this patchset. So far I'm mostly
testing how it works; I have yet to take an in-depth look at the
implementation.


Thank you!


I'll squint more closely at the MITM-protection changes in 0008 later.
First impressions, though: it looks like that code has gotten much
less straightforward, which I think is dangerous given the attack it's
preventing. (Off-topic: I'm skeptical of future 0-RTT support. Our
protocol doesn't seem particularly replay-safe to me.)


Let's drop that patch. AFAICS it's not needed by the rest of the patches.


If we're interested in ALPN negotiation in the future, we may also
want to look at GREASE [1] to keep those options open in the presence
of third-party implementations. Unfortunately OpenSSL doesn't do this
automatically yet.


Can you elaborate? Do we need to do something extra in the server to be 
compatible with GREASE?



If we don't have a reason not to, it'd be good to follow the strictest
recommendations from [2] to avoid cross-protocol attacks. (For anyone
currently running web servers and Postgres on the same host, they
really don't want browsers "talking" to their Postgres servers.) That
would mean checking the negotiated ALPN on both the server and client
side, and failing if it's not what we expect.


Hmm, I thought that's what the patches do. But looking closer, libpq
is not checking that ALPN was used. We should add that. Am I right?



I'm not excited about the proliferation of connection options. I don't
have a lot of ideas on how to fix it, though, other than to note that
the current sslnegotiation option names are very unintuitive to me:
- "postgres": only legacy handshakes
- "direct": might be direct... or maybe legacy
- "requiredirect": only direct handshakes... unless other options are
enabled and then we fall back again to legacy? How many people willing
to break TLS compatibility with old servers via "requiredirect" are
going to be okay with lazy fallback to GSS or otherwise?


Yeah, this is my biggest complaint about all this. Not so much the names 
of the options, but the number of combinations of different options, and 
how we're going to test them all. I don't have any great solutions, 
except adding a lot of tests to cover them, like Matthias did.



Heikki mentioned possibly hard-coding a TLS alert if direct SSL is
attempted without server TLS support. I think that's a cool idea, but
without an official "TLS not supported" alert code (which, honestly,
would be strange to standardize) I'm kinda -0.5 on it. If the client
tells me about a handshake_failure or similar, I'm going to start
investigating protocol versions and ciphersuites; I'm not going to
think to myself that maybe the server lacks TLS support altogether.


Agreed.


(Plus, we need to have a good error message when connecting to older
servers anyway. I think we should be able to key off of the EOF coming
back from OpenSSL; it'd be a good excuse to give that part of the code
some love.)


Hmm, if OpenSSL sends ClientHello and the server responds with a 
Postgres error packet, OpenSSL will presumably consume the error packet 
or at least part of it. But with our custom BIO, we can peek at the 
server response before handing it to OpenSSL.


If it helps, we could backport a nicer error message to old server 
versions, similar to what we did with SCRAM in commit 96d0f988b1.



For the record, I'm adding some one-off tests for this feature to a
local copy of my OAuth pytest suite, which is designed to do the kinds
of testing you're running into trouble with. It's not in any way
viable for a PG17 commit, but if you're interested I can make the
patches available.


Yes please, it would be nice to see what tests you've performed, and 
have it archived.


--
Heikki Linnakangas
Neon (https://neon.tech)





Re: pgsql: Improve performance of subsystems on top of SLRU

2024-03-04 Thread Alvaro Herrera
On 2024-Mar-03, Tom Lane wrote:

> This is certainly simpler, but I notice that it holds the current
> LWLock across the line
> 
>   ptr = (MultiXactMember *) palloc(length * sizeof(MultiXactMember));
> 
> where the old code did not.  Could the palloc take long enough that
> holding the lock is bad?

Hmm, I guess most of the time it shouldn't be much of a problem (if the
length is small so we can palloc without malloc'ing); but it could be in
the worst case.  But the fix is simple: just release the lock before.
There's no correctness argument for holding it all the way down.  I was
just confused about how the original code worked.

> Also, with this coding the "lock = NULL;" assignment just before
> "goto retry" is a dead store.  Not sure if Coverity or other static
> analyzers would whine about that.

Oh, right.  I removed that.

On 2024-Mar-04, Dilip Kumar wrote:

> I am not sure about the other changes, I mean that makes the code much
> simpler but now we are not releasing the 'MultiXactOffsetCtl' related
> bank lock, and later in the following loop, we are comparing that lock
> against 'MultiXactMemberCtl' related bank lock. This makes code
> simpler because now in the loop we are sure that we are always holding
> the lock but I do not like comparing the bank locks for 2 different
> SLRUs, although there is no problem as there would not be a common
> lock address,

True.  This can be addressed in the same way Tom's first comment is:
just release the lock before entering the second loop, and setting lock
to NULL.  This brings the code to a similar state than before, except
that the additional LWLock * variables are in a tighter scope.  That's
in 0001.


Now, I had a look at the other users of slru.c and noticed in subtrans.c
that in StartupSUBTRANS we have some duplicate code that I think could be
rewritten by making the "while" block test the condition at the end
instead of at the start; changed that in 0002.  I'll leave this one for
later, because I want to add some test code for it -- right now it's
pretty much test-uncovered code.


I also looked at slru.c for uses of shared->bank_locks and noticed a
couple that could be made simpler.  That's 0003 here.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"If it is not right, do not do it.
If it is not true, do not say it." (Marcus Aurelius, Meditations)
>From a6be7cac5de83a36e056147f3e81bea2eb1096bd Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Sun, 3 Mar 2024 15:20:36 +0100
Subject: [PATCH v2 1/3] rework locking code in GetMultiXactIdMembers

Per Coverity
---
 src/backend/access/transam/multixact.c | 53 +++---
 1 file changed, 22 insertions(+), 31 deletions(-)

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index cd476b94fa..83b578dced 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1247,14 +1247,12 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
 	MultiXactOffset offset;
 	int			length;
 	int			truelength;
-	int			i;
 	MultiXactId oldestMXact;
 	MultiXactId nextMXact;
 	MultiXactId tmpMXact;
 	MultiXactOffset nextOffset;
 	MultiXactMember *ptr;
 	LWLock	   *lock;
-	LWLock	   *prevlock = NULL;
 
 	debug_elog3(DEBUG2, "GetMembers: asked for %u", multi);
 
@@ -1361,18 +1359,9 @@ retry:
 	pageno = MultiXactIdToOffsetPage(multi);
 	entryno = MultiXactIdToOffsetEntry(multi);
 
-	/*
-	 * If this page falls under a different bank, release the old bank's lock
-	 * and acquire the lock of the new bank.
-	 */
+	/* Acquire the bank lock for the page we need. */
 	lock = SimpleLruGetBankLock(MultiXactOffsetCtl, pageno);
-	if (lock != prevlock)
-	{
-		if (prevlock != NULL)
-			LWLockRelease(prevlock);
-		LWLockAcquire(lock, LW_EXCLUSIVE);
-		prevlock = lock;
-	}
+	LWLockAcquire(lock, LW_EXCLUSIVE);
 
 	slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, multi);
 	offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
@@ -1407,17 +1396,19 @@ retry:
 
 		if (pageno != prev_pageno)
 		{
+			LWLock	   *newlock;
+
 			/*
 			 * Since we're going to access a different SLRU page, if this page
 			 * falls under a different bank, release the old bank's lock and
 			 * acquire the lock of the new bank.
 			 */
-			lock = SimpleLruGetBankLock(MultiXactOffsetCtl, pageno);
-			if (prevlock != lock)
+			newlock = SimpleLruGetBankLock(MultiXactOffsetCtl, pageno);
+			if (newlock != lock)
 			{
-				LWLockRelease(prevlock);
-				LWLockAcquire(lock, LW_EXCLUSIVE);
-				prevlock = lock;
+				LWLockRelease(lock);
+				LWLockAcquire(newlock, LW_EXCLUSIVE);
+				lock = newlock;
 			}
 			slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, tmpMXact);
 		}
@@ -1429,8 +1420,7 @@ retry:
 		if (nextMXOffset == 0)
 		{
 			/* Corner case 2: next multixact is still being filled in */
-			LWLockRelease(prevlock);
-			prevlock = NULL;
+			LWLockRelease(lock);
 			

Re: remaining sql/json patches

2024-03-04 Thread Alvaro Herrera
On 2024-Mar-04, Erik Rijkers wrote:

> In my hands (applying with patch), the patches, esp. 0001, do not apply.
> But I see the cfbot builds without problem so maybe just ignore these FAILED
> lines.  Better get them merged - so I can test there...

It's because of dbbca2cf299b.  It should apply cleanly if you do "git
checkout dbbca2cf299b^" first ...  That commit is so recent that
evidently the cfbot hasn't had a chance to try this patch again since it
went in, which is why it's still green.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"If you have nothing to say, maybe you need just the right tool to help you
not say it."   (New York Times, about Microsoft PowerPoint)




Re: remaining sql/json patches

2024-03-04 Thread Erik Rijkers

On 3/4/24 10:40, Amit Langote wrote:

Hi Jian,

Thanks for the reviews and sorry for the late reply. Replying to all
emails in one.


> [v40-0001-Add-SQL-JSON-query-functions.patch]
> [v40-0002-Show-function-name-in-TableFuncScan.patch]
> [v40-0003-JSON_TABLE.patch]

In my hands (applying with patch), the patches, esp. 0001, do not apply. 
 But I see the cfbot builds without problem so maybe just ignore these 
FAILED lines.  Better get them merged - so I can test there...


Erik


checking file doc/src/sgml/func.sgml
checking file src/backend/catalog/sql_features.txt
checking file src/backend/executor/execExpr.c
Hunk #1 succeeded at 48 with fuzz 2 (offset -1 lines).
Hunk #2 succeeded at 88 (offset -1 lines).
Hunk #3 succeeded at 2419 (offset -1 lines).
Hunk #4 succeeded at 4195 (offset -1 lines).
checking file src/backend/executor/execExprInterp.c
Hunk #1 succeeded at 72 (offset -1 lines).
Hunk #2 succeeded at 180 (offset -1 lines).
Hunk #3 succeeded at 485 (offset -1 lines).
Hunk #4 succeeded at 1560 (offset -1 lines).
Hunk #5 succeeded at 4242 (offset -1 lines).
checking file src/backend/jit/llvm/llvmjit_expr.c
checking file src/backend/jit/llvm/llvmjit_types.c
checking file src/backend/nodes/makefuncs.c
Hunk #1 succeeded at 856 (offset -1 lines).
checking file src/backend/nodes/nodeFuncs.c
Hunk #1 succeeded at 233 (offset -1 lines).
Hunk #2 succeeded at 517 (offset -1 lines).
Hunk #3 succeeded at 1019 (offset -1 lines).
Hunk #4 succeeded at 1276 (offset -1 lines).
Hunk #5 succeeded at 1617 (offset -1 lines).
Hunk #6 succeeded at 2381 (offset -1 lines).
Hunk #7 succeeded at 3429 (offset -1 lines).
Hunk #8 succeeded at 4164 (offset -1 lines).
checking file src/backend/optimizer/path/costsize.c
Hunk #1 succeeded at 4878 (offset -1 lines).
checking file src/backend/optimizer/util/clauses.c
Hunk #1 succeeded at 50 (offset -3 lines).
Hunk #2 succeeded at 415 (offset -3 lines).
checking file src/backend/parser/gram.y
checking file src/backend/parser/parse_expr.c
checking file src/backend/parser/parse_target.c
Hunk #1 succeeded at 1988 (offset -1 lines).
checking file src/backend/utils/adt/formatting.c
Hunk #1 succeeded at 4465 (offset -1 lines).
checking file src/backend/utils/adt/jsonb.c
Hunk #1 succeeded at 2159 (offset -4 lines).
checking file src/backend/utils/adt/jsonfuncs.c
checking file src/backend/utils/adt/jsonpath.c
Hunk #1 FAILED at 68.
Hunk #2 succeeded at 1239 (offset -1 lines).
1 out of 2 hunks FAILED
checking file src/backend/utils/adt/jsonpath_exec.c
Hunk #1 succeeded at 229 (offset -5 lines).
Hunk #2 succeeded at 2866 (offset -5 lines).
Hunk #3 succeeded at 3751 (offset -5 lines).
checking file src/backend/utils/adt/ruleutils.c
Hunk #1 succeeded at 474 (offset -1 lines).
Hunk #2 succeeded at 518 (offset -1 lines).
Hunk #3 succeeded at 8303 (offset -1 lines).
Hunk #4 succeeded at 8475 (offset -1 lines).
Hunk #5 succeeded at 8591 (offset -1 lines).
Hunk #6 succeeded at 9808 (offset -1 lines).
Hunk #7 succeeded at 9858 (offset -1 lines).
Hunk #8 succeeded at 10039 (offset -1 lines).
Hunk #9 succeeded at 10909 (offset -1 lines).
checking file src/include/executor/execExpr.h
checking file src/include/nodes/execnodes.h
checking file src/include/nodes/makefuncs.h
checking file src/include/nodes/parsenodes.h
checking file src/include/nodes/primnodes.h
checking file src/include/parser/kwlist.h
checking file src/include/utils/formatting.h
checking file src/include/utils/jsonb.h
checking file src/include/utils/jsonfuncs.h
checking file src/include/utils/jsonpath.h
checking file src/interfaces/ecpg/preproc/ecpg.trailer
checking file src/test/regress/expected/sqljson_queryfuncs.out
checking file src/test/regress/parallel_schedule
checking file src/test/regress/sql/sqljson_queryfuncs.sql
checking file src/tools/pgindent/typedefs.list




Re: Statistics Import and Export

2024-03-04 Thread Bertrand Drouvot
Hi,

On Tue, Feb 20, 2024 at 02:24:52AM -0500, Corey Huinker wrote:
> On Thu, Feb 15, 2024 at 4:09 AM Corey Huinker 
> wrote:
> 
> > Posting v5 updates of pg_import_rel_stats() and pg_import_ext_stats(),
> > which address many of the concerns listed earlier.
> >
> > Leaving the export/import scripts off for the time being, as they haven't
> > changed and the next likely change is to fold them into pg_dump.
> >
> >
> >
> v6 posted below.

Thanks!

I had in mind to look at it but it looks like a rebase is needed.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Improve the log message output of basic_archive when basic_archive.archive_directory parameter is not set

2024-03-04 Thread Daniel Gustafsson
> On 6 Jan 2024, at 00:03, Nathan Bossart  wrote:

> I gave it a try.

Looking at this again, I think this is about ready to go in.  My only comment is
that doc/src/sgml/archive-modules.sgml should probably be updated to refer to
setting the errdetail, especially since we document the error message there.

--
Daniel Gustafsson





Provide a pg_truncate_freespacemap function

2024-03-04 Thread Ronan Dunklau
Hello,

As we are currently experiencing an FSM corruption issue [1], we need to 
rebuild the FSM when we detect it. 

I noticed we have something to truncate a visibility map, but nothing for the 
free space map, so I propose the attached (liberally copied from the VM 
counterpart) to allow truncating an FSM without incurring downtime, as 
currently our only options are to either VACUUM FULL the table or stop the 
cluster and remove the FSM manually.

Does that seem correct?


[1] https://www.postgresql.org/message-id/flat/1925490.taCxCBeP46%40aivenlaptop#7ace95c95cab17b6d92607e5362984ac

--
Ronan Dunklau



From b3e28e4542311094ee7177b39cc9cdf3672d1b55 Mon Sep 17 00:00:00 2001
From: Ronan Dunklau 
Date: Fri, 1 Mar 2024 08:57:49 +0100
Subject: [PATCH] Provide a pg_truncate_freespacemap function.

Similar to the pg_truncate_visibility_map function, this one makes it
possible to avoid downtime when fixing an FSM after a corruption event.
---
 contrib/pg_freespacemap/Makefile  |  5 +-
 .../pg_freespacemap--1.2--1.3.sql | 13 +
 contrib/pg_freespacemap/pg_freespacemap.c | 58 +++
 .../pg_freespacemap/pg_freespacemap.control   |  2 +-
 4 files changed, 75 insertions(+), 3 deletions(-)
 create mode 100644 contrib/pg_freespacemap/pg_freespacemap--1.2--1.3.sql

diff --git a/contrib/pg_freespacemap/Makefile b/contrib/pg_freespacemap/Makefile
index b48e4b255bc..0f4b52f5812 100644
--- a/contrib/pg_freespacemap/Makefile
+++ b/contrib/pg_freespacemap/Makefile
@@ -6,8 +6,9 @@ OBJS = \
 	pg_freespacemap.o
 
 EXTENSION = pg_freespacemap
-DATA = pg_freespacemap--1.1.sql pg_freespacemap--1.1--1.2.sql \
-	pg_freespacemap--1.0--1.1.sql
+DATA = pg_freespacemap--1.1.sql pg_freespacemap--1.2--1.3.sql \
+			pg_freespacemap--1.1--1.2.sql \
+			pg_freespacemap--1.0--1.1.sql
 PGFILEDESC = "pg_freespacemap - monitoring of free space map"
 
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/pg_freespacemap/pg_freespacemap.conf
diff --git a/contrib/pg_freespacemap/pg_freespacemap--1.2--1.3.sql b/contrib/pg_freespacemap/pg_freespacemap--1.2--1.3.sql
new file mode 100644
index 000..1f25877a175
--- /dev/null
+++ b/contrib/pg_freespacemap/pg_freespacemap--1.2--1.3.sql
@@ -0,0 +1,13 @@
+/* contrib/pg_freespacemap/pg_freespacemap--1.2--1.3.sql */
+
+-- complain if script is sourced in psql, rather than via ALTER EXTENSION
+\echo Use "ALTER EXTENSION pg_freespacemap UPDATE TO '1.3'" to load this file. \quit
+
+CREATE FUNCTION pg_truncate_freespace_map(regclass)
+RETURNS void
+AS 'MODULE_PATHNAME', 'pg_truncate_freespace_map'
+LANGUAGE C STRICT
+PARALLEL UNSAFE;
+
+
+
diff --git a/contrib/pg_freespacemap/pg_freespacemap.c b/contrib/pg_freespacemap/pg_freespacemap.c
index b82cab2d97e..17f6a631ad8 100644
--- a/contrib/pg_freespacemap/pg_freespacemap.c
+++ b/contrib/pg_freespacemap/pg_freespacemap.c
@@ -9,8 +9,12 @@
 #include "postgres.h"
 
 #include "access/relation.h"
+#include "access/xloginsert.h"
+#include "catalog/storage_xlog.h"
 #include "funcapi.h"
 #include "storage/freespace.h"
+#include "storage/smgr.h"
+#include "utils/rel.h"
 
 PG_MODULE_MAGIC;
 
@@ -20,6 +24,9 @@ PG_MODULE_MAGIC;
  */
 PG_FUNCTION_INFO_V1(pg_freespace);
 
+/* Truncate the free space map, in case of corruption */
+PG_FUNCTION_INFO_V1(pg_truncate_freespace_map);
+
 Datum
 pg_freespace(PG_FUNCTION_ARGS)
 {
@@ -40,3 +47,54 @@ pg_freespace(PG_FUNCTION_ARGS)
 	relation_close(rel, AccessShareLock);
 	PG_RETURN_INT16(freespace);
 }
+
+
+Datum
+pg_truncate_freespace_map(PG_FUNCTION_ARGS)
+{
+	Oid			relid = PG_GETARG_OID(0);
+	Relation	rel;
+	ForkNumber	fork;
+	BlockNumber block;
+
+	rel = relation_open(relid, AccessExclusiveLock);
+
+	/* Only some relkinds have a freespacemap */
+	if (!RELKIND_HAS_TABLE_AM(rel->rd_rel->relkind))
+		ereport(ERROR,
+(errcode(ERRCODE_WRONG_OBJECT_TYPE),
+ errmsg("relation \"%s\" is of wrong relation kind",
+		RelationGetRelationName(rel)),
+ errdetail_relkind_not_supported(rel->rd_rel->relkind)));
+
+
+	/* Forcibly reset cached file size */
+	RelationGetSmgr(rel)->smgr_cached_nblocks[FSM_FORKNUM] = InvalidBlockNumber;
+
+	/* Just pretend we're going to wipe out the whole rel */
+	block = FreeSpaceMapPrepareTruncateRel(rel, 0);
+
+	if (BlockNumberIsValid(block))
+	{
+		fork = FSM_FORKNUM;
+		smgrtruncate(RelationGetSmgr(rel), &fork, 1, &block);
+	}
+
+	if (RelationNeedsWAL(rel))
+	{
+		xl_smgr_truncate xlrec;
+
+		xlrec.blkno = 0;
+		xlrec.rlocator = rel->rd_locator;
+		xlrec.flags = SMGR_TRUNCATE_FSM;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+
+		XLogInsert(RM_SMGR_ID, XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+	}
+
+	relation_close(rel, AccessExclusiveLock);
+
+	PG_RETURN_VOID();
+}
diff --git a/contrib/pg_freespacemap/pg_freespacemap.control b/contrib/pg_freespacemap/pg_freespacemap.control
index ac8fc5050a9..1992320691b 100644
--- a/contrib/pg_freespacemap/pg_freespacemap.control
+++ b/contrib/pg_freespacemap/pg_freespacemap.control
@@ -1,5 +1,5 @@
 # 

Re: Some shared memory chunks are allocated even if related processes won't start

2024-03-04 Thread 'Alvaro Herrera'
Hello Hayato,

On 2024-Mar-04, Hayato Kuroda (Fujitsu) wrote:

> OK, I understood that my initial proposal is not so valuable, so I can
> withdraw it.

Yeah, that's what it seems to me.

> About the suggestion, you imagined AutoVacuumRequestWork() and
> brininsert(), right?

Correct.

> I agree it sounds good, but I don't think it can be implemented with
> the current interface. The interface for dynamically allocating shared
> memory is GetNamedDSMSegment(), and it returns the same shared memory
> region if the input names are the same.  Therefore, there is no way to
> re-allocate the shared memory.

Yeah, I was imagining something like this: the workitem-array becomes a
struct, which has a name and a "next" pointer and a variable number of
workitem slots; the AutoVacuumShmem struct has a pointer to the first
workitem-struct and the last one; when a workitem is requested by
brininsert, we initially allocate via GetNamedDSMSegment("workitem-0") a
workitem-struct with a smallish number of elements; if we request
another workitem and the array is full, we allocate another array via
GetNamedDSMSegment("workitem-1") and store a pointer to it in workitem-0
(so that the list can be followed by an autovacuum worker that's
processing the database), and it's also set as the tail of the list in
AutoVacuumShmem (so that we know where to store further work items).
When all items in a workitem-struct are processed, we can free it
(I guess via dsm_unpin_segment), and make AutoVacuumShmem->av_workitems
point to the next one in the list.

This way, the "array" can grow arbitrarily.
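
To illustrate the shape I have in mind (names and fields hypothetical, not
from any committed code):

typedef struct AutoVacuumWorkItems
{
	dsm_handle	next;		/* next array in the list, or DSM_HANDLE_INVALID */
	int			num_items;	/* number of slots in items[] */
	AutoVacuumWorkItem items[FLEXIBLE_ARRAY_MEMBER];
} AutoVacuumWorkItems;

AutoVacuumShmem would then keep handles for the head and tail of this list
instead of the fixed-size av_workItems array.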

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/




Re: Dump-restore loosing 'attnotnull' bit for DEFERRABLE PRIMARY KEY column(s).

2024-03-04 Thread Dean Rasheed
On Mon, 4 Mar 2024 at 12:34, Aleksander Alekseev
 wrote:
>
> > This was an experimental patch, I was looking for the comment on the 
> > proposed
> > approach -- whether we could simply skip the throwaway NOT NULL constraint 
> > for
> > deferred PK constraint.  Moreover,  skip that for any PK constraint.
>
> I confirm that the patch fixes the bug. All the tests pass. Looks like
> RfC to me.
>

I don't think that this is the right fix. ISTM that the real issue is
that dropping a NOT NULL constraint should not mark the column as
nullable if it is part of a PK, whether or not that PK is deferrable
-- a deferrable PK still marks a column as not nullable.
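
For example (a quick check; attnotnull stays set even though the PK is
deferrable):

CREATE TABLE t (id int PRIMARY KEY DEFERRABLE);

SELECT attnotnull FROM pg_attribute
WHERE attrelid = 't'::regclass AND attname = 'id';
-- returns true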

The reason pg_dump creates these throwaway NOT NULL constraints is to
avoid a table scan to check for NULLs when the PK is later created.
That rationale still applies to deferrable PKs, so we still want the
throwaway NOT NULL constraints in that case, otherwise we'd be hurting
performance of restore.
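
In other words, the dump output should keep roughly this shape for a
deferrable PK too (simplified; the constraint name is illustrative):

CREATE TABLE t (id integer CONSTRAINT throwaway_not_null NOT NULL);
COPY t (id) FROM stdin;
...
ALTER TABLE t ADD CONSTRAINT t_pkey PRIMARY KEY (id) DEFERRABLE;
ALTER TABLE t DROP CONSTRAINT throwaway_not_null;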

Regards,
Dean




Re: Failures in constraints regression test, "read only 0 of 8192 bytes"

2024-03-04 Thread Tomas Vondra



On 3/4/24 14:16, Ronan Dunklau wrote:
> On Saturday, March 2, 2024 at 23:29:52 CET, Tomas Vondra wrote:
>> These are "my" animals (running at a local university). There's a couple
>> interesting details:
> 
> Hi Tomas,
> do you still have the failing cluster data?
> 
> Noah pointed me to this thread, and it looks a bit similar to the FSM 
> corruption issue I'm facing:
> https://www.postgresql.org/message-id/1925490.taCxCBeP46%40aivenlaptop
> 
> So if you still have the data, it would be nice to see if you indeed have a 
> corrupted FSM, and if you have indications when it happened.
> 

Sorry, I nuked the buildroot so I don't have the data anymore. Let's see
if it fails again.

regards
-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




RE: Synchronizing slots from primary to standby

2024-03-04 Thread Zhijie Hou (Fujitsu)
On Monday, March 4, 2024 5:52 PM Amit Kapila  wrote:
> 
> On Mon, Mar 4, 2024 at 9:35 AM Peter Smith 
> wrote:
> >
> > OK, if the code will remain as-is wouldn't it be better to anyway
> > change the function name to indicate what it really does?
> >
> > e.g.  NeedToWaitForWal --> NeedToWaitForWalFlushOrStandbys
> >
> 
> This seems too long. I would prefer the current name NeedToWaitForWal as
> waiting for WAL means waiting to flush the WAL and waiting to replicate it to
> standby. On similar lines, the variable name standby_slot_oldest_flush_lsn 
> looks
> too long. How about ss_oldest_flush_lsn (where ss indicates standby_slots)?
> 
> Apart from this, I have made minor modifications in the attached.

Thanks, I have merged it.

Attached is the V105 patch set, which addresses Peter's, Amit's and
Bertrand's comments.

This version also includes the following changes:
* We found a string matching issue for query_until() and fixed it.
* Removed one unused parameter from NeedToWaitForStandbys.
* Disabled the sub before testing pg_logical_slot_get_changes in 040.pl,
to prevent this test from catching the warning from another walsender.
* Ran pgindent.

Best Regards,
Hou zj


v105-0001-Allow-logical-walsenders-to-wait-for-the-physic.patch
Description:  v105-0001-Allow-logical-walsenders-to-wait-for-the-physic.patch


v105-0002-Document-the-steps-to-check-if-the-standby-is-r.patch
Description:  v105-0002-Document-the-steps-to-check-if-the-standby-is-r.patch


RE: Synchronizing slots from primary to standby

2024-03-04 Thread Zhijie Hou (Fujitsu)
On Monday, March 4, 2024 9:55 AM Peter Smith  wrote:
> 
> On Sun, Mar 3, 2024 at 6:51 PM Zhijie Hou (Fujitsu) 
> wrote:
> >
> > On Sunday, March 3, 2024 7:47 AM Peter Smith 
> wrote:
> > >
> 
> > > 3.
> > > +   
> > > +
> > > + Value * is not accepted as it is 
> > > inappropriate to
> > > + block logical replication for physical slots that either lack
> > > + associated standbys or have standbys associated that are
> > > + not
> > > enabled
> > > + for replication slot synchronization. (see
> > > + <xref linkend="logicaldecoding-replication-slots-synchronization"/>).
> > > +
> > > +   
> > >
> > > Why does the document need to provide an excuse/reason for the rule?
> > > You could just say something like:
> > >
> > > SUGGESTION
> > > The slots must be named explicitly. For example, specifying wildcard
> > > values like * is not permitted.
> >
> > As suggested by Amit, I moved this to code comments.
> 
> Was the total removal of this note deliberate? I only suggested removing the
> *reason* for the rule, not the entire rule. Otherwise, the user won't know to
> avoid doing this until they try it and get an error.

OK, Added.

> 
> 
> > >
> > > 9. NeedToWaitForWal
> > >
> > > + /*
> > > + * Check if the standby slots have caught up to the flushed position.
> > > + It
> > > + * is good to wait up to flushed position and then let it send the
> > > + changes
> > > + * to logical subscribers one by one which are already covered in
> > > + flushed
> > > + * position without needing to wait on every change for standby
> > > + * confirmation. Note that after receiving the shutdown signal, an
> > > + ERROR
> > > + * is reported if any slots are dropped, invalidated, or inactive.
> > > + This
> > > + * measure is taken to prevent the walsender from waiting indefinitely.
> > > + */
> > > + if (NeedToWaitForStandby(target_lsn, flushed_lsn, wait_event))
> > > + return true;
> > >
> > > I felt it was confusing things for this function to also call to the
> > > other one -- it seems an overlapping/muddling of the purpose of these.
> > > I think it will be easier to understand if the calling code just
> > > calls to one or both of these functions as required.
> >
> > Same as Amit, I didn't change this.
> 
> AFAICT my previous review comment was misinterpreted. Please see [1] for
> more details.
> 
> 
> 
> Here are some more review comments for v104-1

Thanks!

> 
> ==
> Commit message
> 
> 1.
> Additionally, The SQL functions pg_logical_slot_get_changes,
> pg_logical_slot_peek_changes and pg_replication_slot_advance are modified
> to wait for the replication slots specified in 'standby_slot_names' to catch 
> up
> before returning.
> 
> ~
> 
> Maybe that should be expressed using similar wording as the docs...
> 
> SUGGESTION
> Additionally, The SQL functions ... are modified. Now, when used with
> failover-enabled logical slots, these functions will block until all physical 
> slots
> specified in 'standby_slot_names' have confirmed WAL receipt.

Changed.

> 
> ==
> doc/src/sgml/config.sgml
> 
> 2.
> +pg_logical_slot_peek_changes,
> +when used with failover enabled logical slots, will block until all
> +physical slots specified in
> standby_slot_names have
> +confirmed WAL receipt.
> 
> /failover enabled logical slots/failover-enabled logical slots/

Changed. Note that for this comment and the remaining comments,
I used the wording we agreed on later (logical failover slot).

> 
> ==
> doc/src/sgml/func.sgml
> 
> 3.
> +The function may be blocked if the specified slot is a failover 
> enabled
> +slot and <link linkend="guc-standby-slot-names"><varname>standby_slot_names</varname></link>
> +is configured.
> 
> 
> /a failover enabled slot//a failover-enabled slot/

Changed.

> 
> ~~~
> 
> 4.
> +slot may return to an earlier position. The function may be blocked 
> if
> +the specified slot is a failover enabled slot and
> + linkend="guc-standby-slot-names">standby_slot_names me>
> +is configured.
> 
> /a failover enabled slot//a failover-enabled slot/

Changed.

> 
> ==
> src/backend/replication/slot.c
> 
> 5.
> +/*
> + * Wait for physical standbys to confirm receiving the given lsn.
> + *
> + * Used by logical decoding SQL functions that acquired failover enabled 
> slot.
> + * It waits for physical standbys corresponding to the physical slots
> +specified
> + * in the standby_slot_names GUC.
> + */
> +void
> +WaitForStandbyConfirmation(XLogRecPtr wait_for_lsn)
> 
> /failover enabled slot/failover-enabled slot/

Changed.

> 
> ~~~
> 
> 6.
> + /*
> + * Don't need to wait for the standby to catch up if the current
> + acquired
> + * slot is not a failover enabled slot, or there is no value in
> + * standby_slot_names.
> + */
> 
> /failover enabled slot/failover-enabled slot/

Changed.

> 
> ==
> src/backend/replication/slotfuncs.c
> 
> 7.
> +
> + /*
> + * Wake up logical 

RE: Synchronizing slots from primary to standby

2024-03-04 Thread Zhijie Hou (Fujitsu)
On Monday, March 4, 2024 7:22 PM Bertrand Drouvot 
 wrote:
> 
> Hi,
> 
> On Sun, Mar 03, 2024 at 07:56:32AM +, Zhijie Hou (Fujitsu) wrote:
> > Here is the V104 patch which addressed above and Peter's comments.
> 
> Thanks!
> 
> A few more random comments:

Thanks for the comments!

> 
> 1 ===
> 
> +The function may be blocked if the specified slot is a failover
> + enabled
> 
> s/blocked/waiting/ ?

Changed.

> 
> 2 ===
> 
> +* specified slot when waiting for them to catch up. See
> +* StandbySlotsHaveCaughtup for details.
> 
> s/StandbySlotsHaveCaughtup/StandbySlotsHaveCaughtup()/ ?

Changed.

> 
> 3 ===
> 
> +   /* Now verify if the specified slots really exist and have
> + correct type */
> 
> remove "really"?

Changed.

> 
> 4 ===
> 
> +   /*
> +* Don't need to wait for the standbys to catch up if there is no 
> value in
> +* standby_slot_names.
> +*/
> +   if (standby_slot_names_list == NIL)
> +   return true;
> +
> +   /*
> +* Don't need to wait for the standbys to catch up if we are on a
> standby
> +* server, since we do not support syncing slots to cascading 
> standbys.
> +*/
> +   if (RecoveryInProgress())
> +   return true;
> +
> +   /*
> +* Don't need to wait for the standbys to catch up if they are already
> +* beyond the specified WAL location.
> +*/
> +   if (!XLogRecPtrIsInvalid(standby_slot_oldest_flush_lsn) &&
> +   standby_slot_oldest_flush_lsn >= wait_for_lsn)
> +   return true;
> 
> What about using OR conditions instead?
> 
> 5 ===
> 
> +static bool
> +NeedToWaitForStandby(XLogRecPtr target_lsn, XLogRecPtr flushed_lsn,
> +uint32 *wait_event)
> 
> Not a big deal but does it need to return a bool? (I mean it all depends of 
> the
> *wait_event value). Is it for better code readability in the caller?
> 
> 6 ===
> 
> +static bool
> +NeedToWaitForWal(XLogRecPtr target_lsn, XLogRecPtr flushed_lsn,
> +uint32 *wait_event)
> 
> Same questions as for NeedToWaitForStandby().

I also feel the current style looks a bit cleaner, so I didn't change these.

Best Regards,
Hou zj




Re: Failures in constraints regression test, "read only 0 of 8192 bytes"

2024-03-04 Thread Ronan Dunklau
On Saturday, March 2, 2024 at 23:29:52 CET, Tomas Vondra wrote:
> These are "my" animals (running at a local university). There's a couple
> interesting details:

Hi Tomas,
do you still have the failing cluster data?

Noah pointed me to this thread, and it looks a bit similar to the FSM 
corruption issue I'm facing:
https://www.postgresql.org/message-id/1925490.taCxCBeP46%40aivenlaptop

So if you still have the data, it would be nice to see if you indeed have a 
corrupted FSM, and if you have indications when it happened.

Best regards,

--
Ronan Dunklau






Replication conflicts not processed in ClientWrite

2024-03-04 Thread Magnus Hagander
When a backend is blocked on writing data (such as with a network
error or a very slow client), indicated with wait event ClientWrite,
it appears to not properly notice that it's overrunning
max_standby_streaming_delay, and therefore does not cancel the
transaction on the backend.

I've reproduced this repeatedly on Ubuntu 20.04 with PostgreSQL 15 out
of the Debian packages. Curiously enough, if I install the debug
symbols and restart, in order to get a backtrace, it starts processing
the cancellation again and I can no longer reproduce the problem. So it
sounds like some timing issue.

My simple test was, with session 1 on the standby and session 2 on the primary:
Session 1: begin transaction isolation level repeatable read;
Session 1: select count(*) from testtable;
Session 2: alter table testtable rename to testtable2;
Session 1: select * from testtable t1 cross join testtable t2;
kill -STOP 

At this point, replication lag starts growing on the standby and it
never terminates the session.

If I then SIGCONT it, it will get terminated by replication conflict.

If I kill the session hard, the replication lag recovers immediately.

AFAICT if the conflict happens at ClientRead, for example, it's
picked up immediately, but there's something in ClientWrite that
prevents it.

My first thought would be OpenSSL, but this is reproducible both on
tls-over-tcp and on unix sockets.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/




RE: Some shared memory chunks are allocated even if related processes won't start

2024-03-04 Thread Hayato Kuroda (Fujitsu)
Dear Alvaro,

Thanks for discussing!

> 
> I think it would be worth allocating AutoVacuumShmem->av_workItems using
> dynamic shmem allocation, particularly to prevent workitems from being
> discarded just because the array is full¹; but other than that, the
> struct is just 64 bytes long so I doubt it's useful to allocate it
> dynamically.
> 
> ¹ I mean, if the array is full, just allocate another array, point to it
> from the original one, and keep going.

OK, I understood that my initial proposal is not so valuable, so I can withdraw 
it.

About the suggestion, you imagined AutoVacuumRequestWork() and brininsert(),
right? I agree it sounds good, but I don't think it can be implemented with
the current interface. The interface for dynamically allocating shared
memory is GetNamedDSMSegment(), and it returns the same shared memory
region if the input names are the same. Therefore, there is no way to
re-allocate the shared memory.
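
For example (a sketch; the struct and callback names are made up), a second
call with the same name just hands back the existing mapping, regardless of
the size requested:

bool		found;
WorkItemArray *items;

/* Returns the already-registered segment if "autovacuum-workitems"
 * exists; the size argument only matters for the first caller. */
items = (WorkItemArray *) GetNamedDSMSegment("autovacuum-workitems",
											 sizeof(WorkItemArray),
											 workitem_array_init,
											 &found);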

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



Re: type cache cleanup improvements

2024-03-04 Thread Aleksander Alekseev
Hi Teodor,

> I'd like to suggest two independent patches to improve the performance of
> type cache cleanup. I found a case where type cache cleanup was the reason
> for low performance: in short, a customer creates 100 thousand temporary
> tables in one transaction.
>
> 1 mapRelType.patch
>   It just adds a local map between a relation and its type, as suggested in
> the comment above TypeCacheRelCallback(). Unfortunately, using the syscache
> here was impossible because this callback can be called outside a
> transaction, which makes catalog lookups impossible.
>
> 2 hash_seq_init_with_hash_value.patch
>   TypeCacheTypCallback() loops over the type hash to find the entry with a
> given hash value. There are two problems here: 1) there is no dynahash
> interface to search for an entry with a given hash value, and 2) the hash
> value calculation algorithm differs from the system cache's, yet the
> incoming hash value comes from the system cache. The patch addresses both
> issues. It adds a hash_seq_init_with_hash_value() call, which initializes a
> hash sequential scan over the single bucket that could contain an entry
> with the given hash value, so hash_seq_search() will iterate only over such
> entries. The patch also changes the hash algorithm to match the syscache.
> Additionally, the patch does some small refactoring of dynahash, adding a
> common function hash_do_lookup() which does the initial lookup in the hash.
>
> Some artificial performance test is in attachment, command to test is 'time 
> psql
> < custom_types_and_array.sql', here I show only last rollback time and total
> execution time:
> 1) master 92d2ab7554f92b841ea71bcc72eaa8ab11aae662
> Time: 33353,288 ms (00:33,353)
> psql < custom_types_and_array.sql  0,82s user 0,71s system 1% cpu 1:28,36 
> total
>
> 2) mapRelType.patch
> Time: 7455,581 ms (00:07,456)
> psql < custom_types_and_array.sql  1,39s user 1,19s system 6% cpu 41,220 total
>
> 3) hash_seq_init_with_hash_value.patch
> Time: 24975,886 ms (00:24,976)
> psql < custom_types_and_array.sql  1,33s user 1,25s system 3% cpu 1:19,77 
> total
>
> 4) both
> Time: 89,446 ms
> psql < custom_types_and_array.sql  0,72s user 0,52s system 10% cpu 12,137 
> total

These changes look very promising. Unfortunately the proposed patches
conflict with each other regardless the order of applying:

```
error: patch failed: src/backend/utils/cache/typcache.c:356
error: src/backend/utils/cache/typcache.c: patch does not apply
```

So it's difficult to confirm case 4, not to mention the fact that we
are unable to test the patches on cfbot.

Could you please rebase the patches against the recent master branch
(in any order) and submit the result of `git format-patch` ?

-- 
Best regards,
Aleksander Alekseev




Re: abi-compliance-checker

2024-03-04 Thread Peter Eisentraut

On 27.02.24 14:25, Alvaro Herrera wrote:

I like this idea and I think we should integrate it with the objective
of it becoming the workhorse of ABI-stability testing.  However, I do
not think that the subsequent patches should be part of the tree at all;
certainly not the produced .xml files in your 0004, as that would be far
too unstable and would cause a lot of pointless churn.


Looking at this again, if we don't want the .xml files in the tree, then 
we don't really need this patch series.  Most of the delicate work in 
the 0001 patch was to find the right abidw option combinations to 
reduce the variability in the .xml output files (--no-show-locs is an 
obvious example).  If we don't want to record the .xml files in the 
tree, then we don't need all these complications.


For example, if we want to check the postgres backend ABI across minor 
versions, we could just compile it multiple times and compare the 
binaries directly:


git checkout REL_16_0
meson setup build-0
meson compile -C build-0

git checkout REL_16_STABLE
meson setup build-1
meson compile -C build-1

abidiff --no-added-syms build-0/src/backend/postgres 
build-1/src/backend/postgres




The way I see this working, is that we set up a buildfarm animal [per
architecture] that runs the new rules produced by the abidw option and
stores the result locally, so that for stable branches it can turn red
when an ABI-breaking change with the previous minor release of the same
branch is introduced. There's no point in it ever turning red in the
master branch, since we're obviously not concerned with ABI changes there.


Maybe the way forward here is to write a buildfarm module for this, and 
then see from there what further needs there are.






Re: Hooking into ExplainOneQuery() complicated by missing standard_ExplainOneQuery

2024-03-04 Thread Aleksander Alekseev
Hi,

> I wanted to hook into the EXPLAIN output for queries and add some extra 
> information, but since there is no standard_ExplainOneQuery() I had to copy 
> the code and create my own version.
>
> Since the pattern with other hooks for a function WhateverFunction() seems to 
> be that there is a standard_WhateverFunction() for each 
> WhateverFunction_hook, I created a patch to follow this pattern for your 
> consideration.
>
> I was also considering adding a callback so that you can annotate any node 
> with explanatory information that is not a custom scan node. This could be 
> used to propagate and summarize information from custom scan nodes, but I had 
> no immediate use for that so did not add it here. I would still be interested 
> in hearing if you think this is something that would be useful to the 
> community.

Thanks for the patch. LGTM.

I registered the patch on the nearest open CF [1] and marked it as
RfC. It is a pretty straightforward refactoring.

[1]: https://commitfest.postgresql.org/48/4879/

-- 
Best regards,
Aleksander Alekseev




Re: Dump-restore loosing 'attnotnull' bit for DEFERRABLE PRIMARY KEY column(s).

2024-03-04 Thread Aleksander Alekseev
Hi hackers,

>> I did not see any test addition for this, can we add it?
>
>
> Ok, added it in the attached version.
>
> This was an experimental patch, I was looking for the comment on the proposed
> approach -- whether we could simply skip the throwaway NOT NULL constraint for
> deferred PK constraint.  Moreover,  skip that for any PK constraint.

I confirm that the patch fixes the bug. All the tests pass. Looks like
RfC to me.

-- 
Best regards,
Aleksander Alekseev




Re: Extract numeric filed in JSONB more effectively

2024-03-04 Thread Peter Eisentraut

On 09.02.24 10:05, Andy Fan wrote:

2. Where is the current feature blocked for the past few months?

It's an error-message compatibility issue! Continuing with the above setup:

master:

select * from tb where (a->'b')::numeric > 3::numeric;
ERROR:  cannot cast jsonb string to type numeric

select * from tb where (a->'b')::int4 > 3::numeric;
ERROR:  cannot cast jsonb string to type integer

You can see the error message is different (numeric vs integer).


Patched:

We still can get the same error message as master BUT the code
looks odd.

select * from tb where (a->'b')::int4 > 3;
 QUERY PLAN
---
  Seq Scan on public.tb
Output: a
Filter: ((jsonb_finish_numeric(jsonb_object_field_start((tb.a)::internal, 
'b'::text), '23'::oid))::integer > 3)
(3 rows)

You can see "jsonb_finish_numeric(.., '23'::oid)"; the '23'::oid is just
for the *"integer"* output in the error message:

"cannot cast jsonb string to type*integer*"

Now the situation is: either we use the odd argument (23::oid) in
jsonb_finish_numeric, or we use an error message incompatible with the
previous version. I'm not sure which way is better, but this is the
place where the current feature is blocked.


I'm not bothered by that.  It also happens on occasion in the backend C 
code that we pass around extra information to be able to construct 
better error messages.  The functions here are not backend C code, but 
they are internal functions, so similar considerations can apply.



But I have a different question about this patch set.  This has some 
overlap with the JSON_VALUE function that is being discussed at [0][1]. 
For example, if I apply the patch 
v39-0001-Add-SQL-JSON-query-functions.patch from that thread, I can run


select count(*) from tb where json_value(a, '$.a' returning numeric) = 2;

and I get a noticeable performance boost over

select count(*) from tb where cast (a->'a' as numeric) = 2;

So some questions to think about:

1. Compare performance of base case vs. this patch vs. json_value.

2. Can json_value be optimized further?

3. Is this patch still needed?

3a. If yes, should the internal rewriting make use of json_value or 
share code with it?



[0]: 
https://www.postgresql.org/message-id/flat/CA+HiwqE4XTdfb1nW=ojoy_tqsrhyt-q_kb6i5d4xckyrlc1...@mail.gmail.com

[1]: https://commitfest.postgresql.org/47/4377/




Re: Shared detoast Datum proposal

2024-03-04 Thread Andy Fan


Tomas Vondra  writes:

>>> But I'm a bit surprised the patch needs to pass a
>>> MemoryContext to so many places as a function argument - that seems to
>>> go against how we work with memory contexts. Doubly so because it seems
>>> to only ever pass CurrentMemoryContext anyway. So what's that about?
>> 
>> I think you are talking about the argument like this:
>>  
>>  /* --
>> - * detoast_attr -
>> + * detoast_attr_ext -
>>   *
>>   *  Public entry point to get back a toasted value from compression
>>   *  or external storage.  The result is always non-extended varlena form.
>>   *
>> + * ctx: The memory context which the final value belongs to.
>> + *
>>   * Note some callers assume that if the input is an EXTERNAL or COMPRESSED
>>   * datum, the result will be a pfree'able chunk.
>>   * --
>>   */
>> 
>> +extern struct varlena *
>> +detoast_attr_ext(struct varlena *attr, MemoryContext ctx)
>> 
>> This is mainly because 'detoast_attr' will allocate more memory than the
>> "final detoast datum"; for example, the code to scan the toast relation
>> needs some memory as well. What I want is to keep only the memory for
>> the final detoast datum so that the other memory can be released sooner,
>> so I added the function argument for that. 
>> 
>
> What exactly does detoast_attr allocate in order to scan toast relation?
> Does that happen in master, or just with the patch?

It is in the current master; for example, the TupleTableSlot creation
needed for scanning the toast relation requires a memory allocation.

> If with master, I
> suggest to ignore that / treat that as a separate issue and leave it for
> a different patch.

OK, I can make it a separate commit in the next version. 

> In any case, the custom is to allocate results in the context that is
> set in CurrentMemoryContext at the moment of the call, and if there's
> substantial amount of allocations that we want to free soon, we either
> do that by explicit pfree() calls, or create a small temporary context
> in the function (but that's more expensive).
>
> I don't think we should invent a new convention where the context is
> passed as an argument, unless absolutely necessary.

Hmm, in this case, if we don't add the new argument, we have to get the
detoast datum in Context-1 and copy it to the desired memory context,
which is the thing I want to avoid.  I think we have the same decision
to make for the TOAST cache method as well.
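
Without the argument, the caller would be left with roughly this extra copy
(sketch only; tmpcxt stands for some short-lived context):

MemoryContext oldcxt;
struct varlena *tmp;
struct varlena *result;

oldcxt = MemoryContextSwitchTo(tmpcxt);
tmp = detoast_attr(attr);
MemoryContextSwitchTo(oldcxt);

/* copy into the desired context, then drop everything else */
result = (struct varlena *) palloc(VARSIZE_ANY(tmp));
memcpy(result, tmp, VARSIZE_ANY(tmp));
MemoryContextReset(tmpcxt);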

-- 
Best Regards
Andy Fan





Re: PostgreSQL Contributors Updates

2024-03-04 Thread Aleksander Alekseev
> > All,
> >
> > The PostgreSQL Contributor Page
> > (https://www.postgresql.org/community/contributors/) includes people who
> > have made substantial, long-term contributions of time and effort to the
> > PostgreSQL project. The PostgreSQL Contributors Team recognizes the
> > following people for their contributions.
> >
> > New PostgreSQL Contributors:
> >
> > * Bertrand Drouvot
> > * Gabriele Bartolini
> > * Richard Guo
> >
> > New PostgreSQL Major Contributors:
> >
> > * Alexander Lakhin
> > * Daniel Gustafsson
> > * Dean Rasheed
> > * John Naylor
> > * Melanie Plageman
> > * Nathan Bossart
> >
> > Thank you and congratulations to all!
>
> +1. Congratulations to all!

Congratulations to all!

-- 
Best regards,
Aleksander Alekseev




Re: Commitfest Manager for March

2024-03-04 Thread Aleksander Alekseev
Hi,

> >> The call for a CFM volunteer is still open.
> >
> > I always wanted to try. And most of the stuff I'm interested in is already 
> > committed.
> >
> > But given importance of last commitfest before feature freeze, we might be 
> > interested in more experienced CFM.
> > If I can do something useful - I'm up for it.
>
> I'm absolutely convinced you have more than enough experience with postgres
> hacking to do an excellent job.  I'm happy to give a hand as well.
>
> Thanks for volunteering!
>
> (someone from pginfra will give you the required admin permissions on the CF
> app)

Thanks for volunteering, Andrey!

If you need any help please let me know.

-- 
Best regards,
Aleksander Alekseev



