Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-27 Thread Masahiko Sawada
On Wed, Mar 27, 2024 at 5:43 PM Masahiko Sawada  wrote:
>
> On Wed, Mar 27, 2024 at 9:25 AM John Naylor  wrote:
> >
> > On Mon, Mar 25, 2024 at 8:07 PM Masahiko Sawada  
> > wrote:
> > >
> > > On Mon, Mar 25, 2024 at 3:25 PM John Naylor  
> > > wrote:
> > > >
> > > > On Fri, Mar 22, 2024 at 12:20 PM Masahiko Sawada 
> > > >  wrote:
> >
> > > > - * remaining LP_DEAD line pointers on the page in the dead_items
> > > > - * array. These dead items include those pruned by lazy_scan_prune()
> > > > - * as well we line pointers previously marked LP_DEAD.
> > > > + * remaining LP_DEAD line pointers on the page in the dead_items.
> > > > + * These dead items include those pruned by lazy_scan_prune() as well
> > > > + * we line pointers previously marked LP_DEAD.
> > > >
> > > > Here maybe "into dead_items".
> >
> > - * remaining LP_DEAD line pointers on the page in the dead_items.
> > + * remaining LP_DEAD line pointers on the page into the dead_items.
> >
> > Let me explain. It used to be "in the dead_items array." It is not an
> > array anymore, so it was changed to "in the dead_items". dead_items is
> > a variable name, and names don't take "the". "into dead_items" seems
> > most natural to me, but there are other possible phrasings.
>
> Thanks for the explanation. I was distracted. Fixed in the latest patch.
>
> >
> > > > > > Did you try it with 1MB m_w_m?
> > > > >
> > > > > I've incorporated the above comments and test results look good to me.
> > > >
> > > > Could you be more specific about what the test was?
> > > > Does it work with 1MB m_w_m?
> > >
> > > If m_w_m is 1MB, both the initial and maximum segment sizes are 256kB.
> > >
> > > FYI other test cases I tested were:
> > >
> > > * m_w_m = 2199023254528 (maximum value)
> > > initial: 1MB
> > > max: 128GB
> > >
> > > * m_w_m = 64MB (default)
> > > initial: 1MB
> > > max: 8MB
> >
> > If the test was a vacuum, how big a table was needed to hit 128GB?
>
> I just checked how TidStoreCreateLocal() calculated the initial and
> max segment sizes while changing m_w_m, so I didn't check how big the
> segments actually allocated were in the maximum value test case.
>
> >
> > > > The existing comment slipped past my radar, but max_bytes is not a
> > > > limit, it's a hint. Come to think of it, it never was a limit in the
> > > > normal sense, but in earlier patches it was the criteria for reporting
> > > > "I'm full" when asked.
> > >
> > > Updated the comment.
> >
> > + * max_bytes is not a limit; it's used to choose the memory block sizes of
> > + * a memory context for TID storage in order for the total memory 
> > consumption
> > + * not to be overshot a lot. The caller can use the max_bytes as the 
> > criteria
> > + * for reporting whether it's full or not.
> >
> > This is good information. I suggest this edit:
> >
> > "max_bytes" is not an internally-enforced limit; it is used only as a
> > hint to cap the memory block size of the memory context for TID
> > storage. This reduces space wastage due to over-allocation. If the
> > caller wants to monitor memory usage, it must compare its limit with
> > the value reported by TidStoreMemoryUsage().
> >
> > Other comments:
>
> Thanks for the suggestion!
>
> >
> > v79-0002 looks good to me.
> >
> > v79-0003:
> >
> > "With this commit, when creating a shared TidStore, a dedicated DSA
> > area is created for TID storage instead of using the provided DSA
> > area."
> >
> > This is very subtle, but "the provided..." implies there still is one.
> > -> "a provided..."
> >
> > + * Similar to TidStoreCreateLocal() but create a shared TidStore on a
> > + * DSA area. The TID storage will live in the DSA area, and a memory
> > + * context rt_context will have only meta data of the radix tree.
> >
> > -> "the memory context"
>
> Fixed in the latest patch.
>
> >
> > I think you can go ahead and commit 0002 and 0003/4.
>
> I've pushed the 0002 (dsa init and max segment size) patch, and will
> push the attached 0001 patch next.

Pushed the refactoring patch.

I've attached the rebased vacuum improvement patch for cfbot. I
mentioned in the commit message that this patch eliminates the 1GB
limitation.

I think the patch is in good shape. Do you have other comments or
suggestions, John?

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


v81-0001-Use-TidStore-for-dead-tuple-TIDs-storage-during-.patch
Description: Binary data


Re: BUG: deadlock between autovacuum worker and client backend during removal of orphan temp tables with sequences

2024-03-27 Thread Akshat Jaimini
Hi,
I am currently trying to review the submitted patch, but I am not able to
apply it to the master branch.

Regards,
Akshat Jaimini

Re: Change GUC hashtable to use simplehash?

2024-03-27 Thread Jeff Davis
On Wed, 2024-03-27 at 13:44 +0700, John Naylor wrote:
> Thanks for the pointers! In v20-0001, I've drafted checking this
> spelling first, since pg_attribute_no_sanitize_alignment has a similar
> version check. Then it checks for no_sanitize_address using
> __has_attribute, which goes back to gcc 5. That's plenty for the
> buildfarm and CI, and I'm not sure it's worth expending additional
> effort to cover more cases. (A similar attribute exists for MSVC in
> case it comes up.)

0001 looks good to me, thank you.

> v21-0003 adds a new file hashfn_unstable.c for convenience functions
> and converts all the duplicate frontend uses of hash_string_pointer.

Why not make hash_string() inline, too? I'm fine with it either way,
I'm just curious why you went to the trouble to create a new .c file so
it didn't have to be inlined.

Regards,
Jeff Davis





Re: BitmapHeapScan streaming read user and prelim refactoring

2024-03-27 Thread Thomas Munro
With the unexplained but apparently somewhat systematic regression
patterns on certain tests and settings, I wonder if they might be due
to read_stream.c trying to form larger reads, making it a bit lazier.
It tries to see what the next block will be before issuing the
fadvise.  I think that means that with small I/O concurrency settings,
there might be contrived access patterns where it loses, and needs
effective_io_concurrency to be set one notch higher to keep up, or
something like that.  One way to test that idea would be to run the
tests with io_combine_limit = 1 (meaning 1 block).  It issues advice
eagerly when io_combine_limit is reached, so I suppose it should be
exactly as eager as master.  The only difference then should be that
it automatically suppresses sequential fadvise calls.
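
For concreteness, a minimal sketch of that experiment (the query is a
hypothetical stand-in for one of the regressing tests):

-- io_combine_limit = 1 means 1 block, i.e. no combining, so advice
-- should be issued as eagerly as on master
SET io_combine_limit = 1;
-- separately, try one notch more readahead
SET effective_io_concurrency = 2;
EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) FROM t WHERE a BETWEEN 1000 AND 2000;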




Re: Can't find not null constraint, but \d+ shows that

2024-03-27 Thread jian he
On Wed, Mar 27, 2024 at 10:26 PM Tender Wang  wrote:
>
> Alvaro Herrera  于2024年3月26日周二 23:25写道:
>>
>> On 2024-Mar-26, Tender Wang wrote:
>>
>> > postgres=# CREATE TABLE t1(c0 int, c1 int);
>> > postgres=# ALTER TABLE  t1 ADD CONSTRAINT Q PRIMARY KEY(c0, c1);
>> > postgres=# ALTER TABLE  t1 DROP c1;
>> >
>> > postgres=# ALTER TABLE  t1  ALTER c0 DROP NOT NULL;
>> > ERROR:  could not find not-null constraint on column "c0", relation "t1"
>>
>> Ooh, hah, what happens here is that we drop the PK constraint
>> indirectly, so we only go via doDeletion rather than the tablecmds.c
>> code, so we don't check the attnotnull flags that the PK was protecting.
>>
>> > The attached patch is my workaround solution.  Look forward your apply.
>>

after applying v2-0001-Fix-pg_attribute-attnotnull-not-reset-when-droppe.patch

Something is off; now I cannot drop a table.
demo:
CREATE TABLE t2(c0 int, c1 int);
ALTER TABLE  t2 ADD CONSTRAINT t2_pk PRIMARY KEY(c0, c1);
ALTER TABLE t2 ALTER COLUMN c0 ADD GENERATED ALWAYS AS IDENTITY;
DROP TABLE t2 cascade;
Similarly, maybe there will be some issue with replica identity.


+ /*
+ * If this was a NOT NULL or the primary key, the constrained columns must
+ * have had pg_attribute.attnotnull set.  See if we need to reset it, and
+ * do so.
+ */
+ if (unconstrained_cols)
should it be if (unconstrained_cols != NIL),
given that unconstrained_cols is a List? Also, the name
"unconstrained_cols" seems unintuitive;
maybe pk_attnums, pk_cols, or pk_columns would be better.


+ attrel = table_open(AttributeRelationId, RowExclusiveLock);
+ rel = table_open(con->conrelid, RowExclusiveLock);
I am not sure why we are using RowExclusiveLock for con->conrelid,
given that we use AccessExclusiveLock at:
/*
* If the constraint is for a relation, open and exclusive-lock the
* relation it's for.
*/
rel = table_open(con->conrelid, AccessExclusiveLock);


+ /*
+ * Since the above deletion has been made visible, we can now
+ * search for any remaining constraints on this column (or these
+ * columns, in the case we're dropping a multicol primary key.)
+ * Then, verify whether any further NOT NULL or primary key
+ * exists, and reset attnotnull if none.
+ *
+ * However, if this is a generated identity column, abort the
+ * whole thing with a specific error message, because the
+ * constraint is required in that case.
+ */
+ contup = findNotNullConstraintAttnum(RelationGetRelid(rel), attnum);
+ if (contup ||
+ bms_is_member(attnum - FirstLowInvalidHeapAttributeNumber,
+  pkcols))
+ continue;

I didn't get this part.
If you drop a primary key,
the "NOT NULL" constraint within pg_constraint should definitely be removed,
so contup should surely be NULL?


  /*
- * The parser will create AT_AttSetNotNull subcommands for
- * columns of PRIMARY KEY indexes/constraints, but we need
- * not do anything with them here, because the columns'
- * NOT NULL marks will already have been propagated into
- * the new table definition.
+ * PK drop now will reset pg_attribute attnotnull to false.
+ * We should set attnotnull to true again.
  */
Dropping the PK will now reset pg_attribute.attnotnull to false,
which is what we should be expecting.
The comment doesn't explain why we should set attnotnull to true again.




Re: Streaming I/O, vectored I/O (WIP)

2024-03-27 Thread Thomas Munro
New version with some cosmetic/comment changes, and Melanie's
read_stream_reset() function merged, as required by her sequential
scan user patch.  I tweaked it slightly: it might as well share code
with read_stream_end().  I think setting distance = 1 is fine for now,
and we might later want to adjust that as we learn more about more
interesting users of _reset().
From 6b66a6412c90c8f696a8b5890596ba1ab7477191 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Mon, 26 Feb 2024 23:48:31 +1300
Subject: [PATCH v11 1/4] Provide vectored variant of ReadBuffer().

Break ReadBuffer() up into two steps: StartReadBuffers() and
WaitReadBuffers().  This has two advantages:

1.  Multiple consecutive blocks can be read with one system call.
2.  Advice (hints of future reads) can optionally be issued to the kernel.

The traditional ReadBuffer() function is now implemented in terms of
those functions, to avoid duplication.  For now we still only read a
block at a time so there is no change to generated system calls yet, but
later commits will provide infrastructure to help build up larger calls.

Callers should respect the new GUC io_combine_limit, and the limit on
per-backend pins which is now exposed as a public interface.

With some more infrastructure in later work, StartReadBuffers() could
be extended to start real asynchronous I/O instead of advice.

Author: Thomas Munro 
Author: Andres Freund  (optimization tweaks)
Reviewed-by: Melanie Plageman 
Reviewed-by: Heikki Linnakangas 
Reviewed-by: Nazir Bilal Yavuz 
Reviewed-by: Dilip Kumar 
Reviewed-by: Andres Freund 
Discussion: https://postgr.es/m/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6ut5tum2g...@mail.gmail.com
---
 doc/src/sgml/config.sgml  |  14 +
 src/backend/storage/buffer/bufmgr.c   | 701 --
 src/backend/storage/buffer/localbuf.c |  14 +-
 src/backend/utils/misc/guc_tables.c   |  14 +
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/include/storage/bufmgr.h  |  41 +-
 src/tools/pgindent/typedefs.list  |   1 +
 7 files changed, 564 insertions(+), 222 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 5468637e2ef..f3736000ad2 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2719,6 +2719,20 @@ include_dir 'conf.d'

   
 
+  
+   io_combine_limit (integer)
+   
+io_combine_limit configuration parameter
+   
+   
+   
+
+ Controls the largest I/O size in operations that combine I/O.
+ The default is 128kB.
+
+   
+  
+
   
max_worker_processes (integer)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f0f8d4259c5..7123cbbaa2a 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -19,6 +19,11 @@
  *		and pin it so that no one can destroy it while this process
  *		is using it.
  *
+ * StartReadBuffers() -- as above, but for multiple contiguous blocks in
+ *		two steps.
+ *
+ * WaitReadBuffers() -- second step of StartReadBuffers().
+ *
  * ReleaseBuffer() -- unpin a buffer
  *
  * MarkBufferDirty() -- mark a pinned buffer's contents as "dirty".
@@ -160,6 +165,9 @@ int			checkpoint_flush_after = DEFAULT_CHECKPOINT_FLUSH_AFTER;
 int			bgwriter_flush_after = DEFAULT_BGWRITER_FLUSH_AFTER;
 int			backend_flush_after = DEFAULT_BACKEND_FLUSH_AFTER;
 
+/* Limit on how many blocks should be handled in single I/O operations. */
+int			io_combine_limit = DEFAULT_IO_COMBINE_LIMIT;
+
 /* local state for LockBufferForCleanup */
 static BufferDesc *PinCountWaitBuf = NULL;
 
@@ -471,10 +479,9 @@ ForgetPrivateRefCountEntry(PrivateRefCountEntry *ref)
 )
 
 
-static Buffer ReadBuffer_common(SMgrRelation smgr, char relpersistence,
+static Buffer ReadBuffer_common(BufferManagerRelation bmr,
 ForkNumber forkNum, BlockNumber blockNum,
-ReadBufferMode mode, BufferAccessStrategy strategy,
-bool *hit);
+ReadBufferMode mode, BufferAccessStrategy strategy);
 static BlockNumber ExtendBufferedRelCommon(BufferManagerRelation bmr,
 		   ForkNumber fork,
 		   BufferAccessStrategy strategy,
@@ -500,7 +507,7 @@ static uint32 WaitBufHdrUnlocked(BufferDesc *buf);
 static int	SyncOneBuffer(int buf_id, bool skip_recently_used,
 		  WritebackContext *wb_context);
 static void WaitIO(BufferDesc *buf);
-static bool StartBufferIO(BufferDesc *buf, bool forInput);
+static bool StartBufferIO(BufferDesc *buf, bool forInput, bool nowait);
 static void TerminateBufferIO(BufferDesc *buf, bool clear_dirty,
 			  uint32 set_flag_bits, bool forget_owner);
 static void AbortBufferIO(Buffer buffer);
@@ -781,7 +788,6 @@ Buffer
 ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
    ReadBufferMode mode, BufferAccessStrategy strategy)
 {
-	bool		hit;
 	Buffer		buf;
 
 	/*
@@ -794,15 +800,9 @@ 

Re: Proposal: Introduce row-level security templates

2024-03-27 Thread Aadhav Vignesh
Hi Stojan,


> I do think there should be the option to attach it immediately to
> tables, something like `CREATE POLICY TEMPLATE <name> ATTACH TO
> <table> AS ...` but it’s not a deal-breaker for me.


> I would say that because templates are not active until they’re
> attached, the logic for “validating” the query part should come when
> the template is being attached on the table. So a non-qualifying
> `USING` clause would error out on the `ATTACH` command. Similarly,
> altering the policy template would force validation on the clause over
> the already attached tables to the policy.


That's an interesting idea. I believe that it could be achieved with some
modification, but I'm thinking about the preferred behavior on attaching
templates to tables: do we want to error out instantly if we encounter a
case where the qualifier isn't applicable to a particular table, or do we
let the template get attached to other tables silently?

> I would also suggest looking into the ability to attach a template to
> a schema. There are some obvious benefits of this — creating a table
> in the schema automatically gets the correct RLS behavior without
> having to explicitly attach a template or policies, making it
> secure-by-default. There is some interesting behavior of this as well
> which _may_ be beneficial. For example, if you create a schema-wide
> template, say `user_id = current_user`, and you create a table that
> doesn’t have a `user_id` column — the creation would fail. This is in
> many practical situations beneficial, as it lessens the likelihood of
> creating a table that can’t be properly secured in that application
> context.


I like this idea, as a schema-level template would be beneficial in some
cases, but this change would introduce more rigidity or disruptions. For
example, if a table needs to be created in a schema without any
restrictions on access, it would fail as the schema now enforces RLS checks
on table creation. I do feel that this proposal has its benefits, but this
also introduces a binary/dichotomous decision: either you enable RLS on
each table in the schema, or you don't.

One way to solve this is to manually modify each table that doesn't need
RLS checks by disabling it: `ALTER TABLE <table_name> DISABLE ROW LEVEL
SECURITY;`, but I'm not sure if this is ideal, as this introduces more
operational/administration complexity.

> 1. Make the `ON table_name` part of the `CREATE POLICY` statement
> optional, which would create the “template.” This would require
> altering the internals of the policy-table relationship to support 0,
> 1, 2, … tables instead of the current 1. Again I have no idea how this
> is implemented internally, but it could be a fairly simple change
> without having to introduce new concepts, objects, and commands.


Interesting: what does `0` entail in this case? The current behavior is to
enforce a policy on a table; if that's made optional, would a `CREATE POLICY`
statement that specifies no tables be considered a schema-level policy?

Wouldn't it be better if we had a way to explicitly specify when a
schema-level policy is required to be created? With the proposed behavior,
there might be cases where users might accidentally trigger/enforce a
schema-level policy if they failed to specify any table names.

> 2. Have templates only as the object that enables the one-to-many
> relationship between a policy and table. For example, you could create
> a policy like `CREATE POLICY owned_by_user ON table ...`, and then you
> could do something like `CREATE POLICY TEMPLATE owned_by_user AS
> POLICY schema.table.owned_by_user ATTACH TO tables...`. So essentially
> the “template object” just references an already existing policy
> attached to a table, and it allows you to attach it to other tables
> too.


I believe that's possible by utilizing the system catalogs, and finding
references to the policy as you mentioned, but it's highly sensitive to
cases where the original policy is deleted, as now you can't refer to the
original policy. There can be modifications made to `DROP POLICY` to also
remove the top-level/parent template when the original policy is deleted,
but I'm not sure if that behavior is preferred.
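
To make the two shapes concrete, here is a rough sketch of the syntaxes
discussed above (pure proposal; none of this exists in PostgreSQL, and all
object names are hypothetical):

-- option 1: CREATE POLICY with the ON clause made optional
CREATE POLICY owned_by_user
    USING (user_id = current_user);

-- option 2: a template that references an existing policy
CREATE POLICY owned_by_user ON app.orders
    USING (user_id = current_user);
CREATE POLICY TEMPLATE owned_by_user_tpl
    AS POLICY app.orders.owned_by_user
    ATTACH TO app.invoices, app.payments;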

Thanks,
Aadhav


RE: Synchronizing slots from primary to standby

2024-03-27 Thread Zhijie Hou (Fujitsu)
Hi,

While analyzing one BF error[1], we found an issue in slotsync: since we don't
perform logical decoding for the synced slots when syncing the lsn/xmin of a
slot, no logical snapshots will be serialized to disk. So, when a user starts to
use these synced slots after promotion, the consistent snapshot needs to be
re-built from the restart_lsn if the WAL (xl_running_xacts) at the restart_lsn
position indicates that there are running transactions. This, however, could
cause the data from before the consistent point to be missed[2].

This issue doesn't exist on the primary because the snapshot at restart_lsn
should have been serialized to disk (SnapBuildProcessRunningXacts ->
SnapBuildSerialize), so even if the logical decoding restarts, it can find
consistent snapshot immediately at restart_lsn.

To fix this, we could use fast-forward logical decoding to advance the synced
slot's lsn/xmin when syncing these values, instead of directly updating the
slot's info. This way, the snapshot will be serialized to disk when decoding.
If we cannot reach the consistent point at the remote restart_lsn, the
slot is marked as temporary and will be persisted once it reaches the
consistent point. I am still analyzing the fix and will share it once ready.


[1] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=culicidae&dt=2024-03-19%2010%3A03%3A06
[2] The steps to reproduce the data miss issue on a primary->standby setup:
 
Note, we need to set LOG_SNAPSHOT_INTERVAL_MS to a bigger number (150) to
prevent concurrent LogStandbySnapshot() calls, and enable sync_replication_slots
on the standby.
 
1. Create a failover logical slot on the primary.
SELECT 'init' FROM pg_create_logical_replication_slot('logicalslot', 
'test_decoding', false, false, true);
 
2. Use the following steps to advance the restart_lsn of the failover slot to a
position where the xl_running_xacts at that position indicates that there is
running transaction.
 
TXN1
BEGIN;
create table dummy1(a int);
 
TXN2
SELECT pg_log_standby_snapshot();
 
TXN1
COMMIT;

TXN1
BEGIN;
create table dummy2(a int);
 
TXN2
SELECT pg_log_standby_snapshot();
 
TXN1
COMMIT;
 
-- the restart_lsn will be advanced to a position where there was 1 running
transaction. And we need to wait for the restart_lsn to be synced to the
standby.
SELECT pg_replication_slot_advance('logicalslot', pg_current_wal_lsn());
 
-- insert some data here before calling next pg_log_standby_snapshot().
INSERT INTO reptable VALUES(999);
 
3. Promote the standby and try to consume the change(999) from the synced slot
on the standby. We will find that no change is returned.
 
select * from pg_logical_slot_get_changes('logicalslot', NULL, NULL);

Best Regards,
Hou zj




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-27 Thread shveta malik
On Wed, Mar 27, 2024 at 9:00 PM Bharath Rupireddy
 wrote:
>
> Thanks. I'm attaching v29 patches. 0001 managing inactive_since on the
> standby for sync slots. 0002 implementing inactive timeout GUC based
> invalidation mechanism.
>
> Please have a look.

Thanks for the patches. v29-001 looks good to me.

thanks
Shveta




Re: DOCS: add helpful partitioning links

2024-03-27 Thread Ashutosh Bapat
LGTM.

The commitfest entry is marked as RFC already.

Thanks for taking care of the comments.

-- 
Best Wishes,
Ashutosh Bapat

On Thu, Mar 28, 2024 at 5:54 AM Robert Treat  wrote:

> On Mon, Mar 25, 2024 at 6:43 AM Ashutosh Bapat
>  wrote:
> > On Fri, Mar 22, 2024 at 10:58 PM Robert Treat  wrote:
> >> v5 patch attached which I think further improves clarity/brevity of
> >> this section. I've left the patch name the same for simplicity, but
> >> I'd agree that the commit would now be more along the lines of editing
> >> / improvements / copyrighting of "Partition Maintenance" docs.
> >
> >
> > Right. Minor suggestions.
> >
> > - It is recommended to drop the now-redundant CHECK
> > - constraint after the ATTACH PARTITION is complete.  If
> > - the table being attached is itself a partitioned table, then each of its
> > + As illustrated above, it is recommended to avoid this scan by creating a
> > + CHECK constraint on the to be attached table that
> >
> > Instead of "to be attached table", "table to be attached" reads better.
> > You may want to add "as a partition" after that.
> >
>
> That sounds more awkward to me, but I've done some rewording to avoid both.
>
> >   Similarly, if the partitioned table has a DEFAULT
> >   partition, it is recommended to create a CHECK
> >   constraint which excludes the to-be-attached partition's constraint.  If
> > - this is not done then the DEFAULT partition will be
> > + this is not done, the DEFAULT partition must be
> >
> > I am not sure whether replacing "will" by "must" is correct. Usually I
> > have seen "will" being used in such sentences; "must" seems appropriate
> > given the necessity.
> >
>
> OK
>
> Updated patch attached.
>
>
> Robert Treat
> https://xzilla.net
>


Re: Fix parallel vacuum buffer usage reporting

2024-03-27 Thread Masahiko Sawada
Hi,

Thank you for the report.

On Fri, Feb 9, 2024 at 6:10 PM Anthonin Bonnefoy
 wrote:
>
> Hi,
>
> With a db setup with pgbench, we add an additional index:
> CREATE INDEX ON pgbench_accounts(abalance)
>
> And trigger several updates and vacuum to reach a stable amount of
> dirtied pages:
> UPDATE pgbench_accounts set abalance = abalance + 1 WHERE aid=1; CHECKPOINT;
> VACUUM (VERBOSE, INDEX_CLEANUP ON) pgbench_accounts
>
> The vacuum will report the following:
> INFO:  vacuuming "postgres.public.pgbench_accounts"
> INFO:  launched 1 parallel vacuum worker for index vacuuming (planned: 1)
> INFO:  finished vacuuming "postgres.public.pgbench_accounts": index scans: 1
> ...
> buffer usage: 122 hits, 165 misses, 4 dirtied
>
> 4 pages were reported dirtied. However, we have 5 dirtied blocks at
> the end of the vacuum when looking at pg_buffercache:
>
> SELECT c.relname, b.relfilenode
>   FROM pg_buffercache b
>   LEFT JOIN pg_class c ON b.relfilenode = pg_relation_filenode(c.oid)
>  WHERE isdirty = true;
>             relname            | relfilenode
> -------------------------------+-------------
>  pg_class                      |        1259
>  pgbench_accounts              |       16400
>  pgbench_accounts              |       16400
>  pgbench_accounts_pkey         |       16408
>  pgbench_accounts_abalance_idx |       16480
>
> The missing dirty block comes from the parallel worker vacuuming the
> abalance index. Running vacuum with parallel disabled will give the
> correct result.
>
> Vacuum uses dedicated VacuumPage{Hit,Miss,Dirty} globals to track
> buffer usage. However, those values are not collected at the end of
> parallel vacuum workers, leading to incorrect buffer count.

True. I think we should fix it also on backbranches.

>
> Those vacuum specific globals are redundant with the existing
> pgBufferUsage and only used in the verbose output. This patch removes
> them and replaces them by pgBufferUsage which is already correctly
> collected at the end of parallel workers, fixing the buffer count.

It seems to make sense to remove VacuumPageHit and friends, only on
the master branch, if we can use BufferUsage instead.

As for the proposed patch, the following part should handle the temp tables too:

        appendStringInfo(&buf, _("avg read rate: %.3f MB/s, avg write rate: %.3f MB/s\n"),
                         read_rate, write_rate);
        appendStringInfo(&buf, _("buffer usage: %lld hits, %lld misses, %lld dirtied\n"),
-                        (long long) AnalyzePageHit,
-                        (long long) AnalyzePageMiss,
-                        (long long) AnalyzePageDirty);
+                        (long long) bufferusage.shared_blks_hit,
+                        (long long) bufferusage.shared_blks_read,
+                        (long long) bufferusage.shared_blks_dirtied);
        appendStringInfo(&buf, _("system usage: %s"), pg_rusage_show(&ru0));

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Properly pathify the union planner

2024-03-27 Thread David Rowley
On Thu, 28 Mar 2024 at 15:56, Tom Lane  wrote:
> I haven't studied the underlying problem yet, so I'm not quite
> buying into whether we need this struct at all ...

The problem is, when planning a UNION child query, we want to try and
produce some Paths that would suit the top-level UNION query so that a
Merge Append -> Unique can be used rather than an Append -> Sort ->
Unique or Append -> Hash Aggregate.
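
As a hedged illustration of the plan shapes in question (table and index
names are hypothetical, assuming btree indexes on t1.a and t2.a):

-- with suitably sorted child paths, the planner could produce
-- Merge Append -> Unique instead of Append -> Sort -> Unique
EXPLAIN (COSTS OFF)
SELECT a FROM t1 UNION SELECT a FROM t2;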

The difficulty is informing the UNION child query about what it is.  I
thought I could do root->parent_root->parse->setOperations for a UNION
child to know what it is, but that breaks for a query such as:

WITH q1(x) AS (SELECT 1)
   SELECT FROM q1 UNION SELECT FROM q1

as the CTE also has root->parent_root->parse->setOperations set and in
the above case, that's a problem as there's some code that tries to
match the non-resjunk child targetlists up with the SetOperationStmt's
SortGroupClauses, but there's a mismatch for the CTE.  The actual
UNION children should have a 1:1 match for non-junk columns.

> but assuming
> we do, I feel like "PlannerContext" is a pretty poor name.
> There's basically nothing to distinguish it from "PlannerInfo",
> not to mention that readers would likely assume it's a memory
> context of some sort.
>
> Perhaps "SubqueryContext" or the like would be better?  It
> still has the conflict with memory contexts though.

Maybe something with "Parameters" in the name?

David




Re: Add pg_basetype() function to obtain a DOMAIN base type

2024-03-27 Thread Tom Lane
jian he  writes:
> I noticed psql \dD doesn't show the base type of a domain.
> One use of this feature would be in psql \dD.

Your 0002 will cause \dD to fail entirely against an older server.
I'm not necessarily against adding this info, but you can't just
ignore the expectations for psql \d commands:

 * Support for the various \d ("describe") commands.  Note that the current
 * expectation is that all functions in this file will succeed when working
 * with servers of versions 9.2 and up.  It's okay to omit irrelevant
 * information for an old server, but not to fail outright.  (But failing
 * against a pre-9.2 server is allowed.)

regards, tom lane




Re: Properly pathify the union planner

2024-03-27 Thread Tom Lane
Richard Guo  writes:
> On Wed, Mar 27, 2024 at 6:34 PM David Rowley  wrote:
>> The attached is roughly what I had in mind.  I've not taken the time
>> to see what comments need to be updated, so the attached aims only to
>> assist discussion.

> I like this idea.

I haven't studied the underlying problem yet, so I'm not quite
buying into whether we need this struct at all ... but assuming
we do, I feel like "PlannerContext" is a pretty poor name.
There's basically nothing to distinguish it from "PlannerInfo",
not to mention that readers would likely assume it's a memory
context of some sort.

Perhaps "SubqueryContext" or the like would be better?  It
still has the conflict with memory contexts though.

regards, tom lane




Re: Add pg_basetype() function to obtain a DOMAIN base type

2024-03-27 Thread jian he
On Thu, Mar 21, 2024 at 10:34 AM jian he  wrote:
>
> On Mon, Mar 18, 2024 at 11:43 PM Tom Lane  wrote:
> >
> > Alexander Korotkov  writes:
> > > On Mon, Mar 18, 2024 at 2:01 AM jian he  
> > > wrote:
> > >> `
> > >> Datum
> > >> pg_basetype(PG_FUNCTION_ARGS)
> > >> {
> > >>  Oid oid;
> > >>
> > >>  oid =  get_fn_expr_argtype(fcinfo->flinfo, 0);
> > >>  PG_RETURN_OID(getBaseType(oid));
> > >> }
> > >> `
> >
> > > Looks good to me.  But should it be named pg_basetypeof()?
> >
> > I still don't like this approach.  It forces the function to be
> > used in a particular way that's highly redundant with pg_typeof.
> > I think we'd be better off with
> >
> > pg_basetype(PG_FUNCTION_ARGS)
> > {
> > Oid typid = PG_GETARG_OID(0);
> >
> > PG_RETURN_OID(getBaseType(typid));
> > }
> >
> > The use-case that the other definition handles would be implemented
> > like
> >
> > pg_basetype(pg_typeof(expression))
> >
>
> I'm trying to do it this way.
> I'm not sure the following error message is expected.
>
> SELECT pg_basetype(-1);
> ERROR:  cache lookup failed for type 4294967295

I think the error message should be fine,
even though
`select '-1'::oid::regtype;` returns 4294967295.

I noticed psql \dD doesn't show the base type of a domain.
One use of this feature would be in psql \dD.

now we can:
\dD mytext_child_2
   List of domains
 Schema |      Name      |      Type      | Basetype | Collation | Nullable | Default | Check
--------+----------------+----------------+----------+-----------+----------+---------+-------
 public | mytext_child_2 | mytext_child_1 | text     |           |          |         |
(1 row)
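
For reference, a small usage sketch of the proposed function (assuming the
v6 patch below is applied; the domain chain matches the \dD example above):

CREATE DOMAIN mytext AS text;
CREATE DOMAIN mytext_child_1 AS mytext;
CREATE DOMAIN mytext_child_2 AS mytext_child_1;
SELECT pg_basetype('mytext_child_2'::regtype);  -- returns text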
From 5e60c542c52059cdcdb8a7a2b1cec561f43f7a66 Mon Sep 17 00:00:00 2001
From: jian he 
Date: Thu, 28 Mar 2024 10:45:15 +0800
Subject: [PATCH v6 2/2] make psql \dD displays the domain's basetype.

Previously, psql \dD only showed the type that the domain is directly based on.
Now add a column to display the primitive base type (one that is not a domain)
of a domain.
---
 src/bin/psql/describe.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 6433497b..34fcaef8 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -,6 +,7 @@ listDomains(const char *pattern, bool verbose, bool showSystem)
 	  "SELECT n.nspname as \"%s\",\n"
 	  "   t.typname as \"%s\",\n"
 	  "   pg_catalog.format_type(t.typbasetype, t.typtypmod) as \"%s\",\n"
+	  "   pg_catalog.pg_basetype(t.typbasetype) as \"%s\",\n"
 	  "   (SELECT c.collname FROM pg_catalog.pg_collation c, pg_catalog.pg_type bt\n"
 	  "WHERE c.oid = t.typcollation AND bt.oid = t.typbasetype AND t.typcollation <> bt.typcollation) as \"%s\",\n"
 	  "   CASE WHEN t.typnotnull THEN 'not null' END as \"%s\",\n"
@@ -4454,6 +4455,7 @@ listDomains(const char *pattern, bool verbose, bool showSystem)
 	  gettext_noop("Schema"),
 	  gettext_noop("Name"),
 	  gettext_noop("Type"),
+	  gettext_noop("Basetype"),
 	  gettext_noop("Collation"),
 	  gettext_noop("Nullable"),
 	  gettext_noop("Default"),
-- 
2.34.1

From aad43e327e4f3b3cbc5bd5d4d1944e70446dc865 Mon Sep 17 00:00:00 2001
From: jian he 
Date: Thu, 21 Mar 2024 10:23:04 +0800
Subject: [PATCH v6 1/2] Add pg_basetype(regtype) function to return the
 basetype of a domain

Currently, obtaining the base type of a domain involves a complex SQL query,
especially in the case of recursive domain types.

To solve this, use the already available getBaseType() function,
and expose it as the pg_basetype SQL function.

If the argument is not a domain type, return the type of the argument as is.
If the argument is a domain type, pg_basetype will recurse the domain dependency chain
and return the real innermost base type of the domain.

discussion:  https://postgr.es/m/CAGRrpzZSX8j=MQcbCSEisFA=ic=k3bknvfnfjav1divjxfh...@mail.gmail.com
---
 doc/src/sgml/func.sgml   | 26 +
 src/backend/utils/adt/misc.c | 10 ++
 src/include/catalog/pg_proc.dat  |  3 +++
 src/test/regress/expected/domain.out | 29 
 src/test/regress/sql/domain.sql  | 14 ++
 5 files changed, 82 insertions(+)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 93b0bc2b..e9db88f5 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -25129,6 +25129,32 @@ SELECT pg_type_is_visible('myschema.widget'::regtype);

   
 
+  
+   
+
+ pg_basetype
+
+pg_basetype ( regtype )
+regtype
+   
+   
+   Returns the OID of the base type of a domain identified by its type OID.
+   If the argument is not the OID of a domain type, it returns the argument as is.
+   If there's a chain of domain dependencies, it recurses until it finds the base type.
+   
+   
+For 

Re: Properly pathify the union planner

2024-03-27 Thread Richard Guo
On Wed, Mar 27, 2024 at 6:34 PM David Rowley  wrote:

> On Wed, 27 Mar 2024 at 22:47, David Rowley  wrote:
> > I did wonder when first working on this patch if subquery_planner()
> > should grow an extra parameter, or maybe consolidate some existing
> > ones by passing some struct that provides the planner with a bit more
> > context about the query.  A few of the existing parameters are likely
> > candidates for being in such a struct. e.g. hasRecursion and
> > tuple_fraction. A SetOperationStmt could go in there too.
>
> The attached is roughly what I had in mind.  I've not taken the time
> to see what comments need to be updated, so the attached aims only to
> assist discussion.


I like this idea.  And there may be future applications for having such
a struct if we want to pass down additional information to subquery
planning, such as ordering requirements from outer query.

Thanks
Richard


Re: Remove some redundant set_cheapest() calls

2024-03-27 Thread Richard Guo
On Wed, Mar 27, 2024 at 10:59 PM Tom Lane  wrote:

> Richard Guo  writes:
> > On Wed, Mar 27, 2024 at 4:06 AM Tom Lane  wrote:
> >> I'm less convinced about changing this.  I'd rather keep it consistent
> >> with mark_dummy_rel.
>
> > Hm, I wonder if we should revise the comment there that states "but not
> > when called from elsewhere", as it does not seem to be true.
>
> I'd be okay with wording like "This is redundant in current usage
> because set_rel_pathlist will do it later, but it's cheap so we keep
> it for consistency with mark_dummy_rel".  What do you think?


That works for me.  Thanks for the wording.

Thanks
Richard


Re: Fix some resources leaks (src/bin/pg_basebackup/pg_createsubscriber.c)

2024-03-27 Thread Euler Taveira
On Wed, Mar 27, 2024, at 8:50 PM, Ranier Vilela wrote:
> Coverity has some reports in the new code
> pg_createsubscriber.c
> I think that Coverity is right.

It depends on your definition of "right". If your program execution is ephemeral
and the leak is just before exiting, do you think it's worth it?

> 1.
> CID 1542682: (#1 of 1): Resource leak (RESOURCE_LEAK)
> leaked_storage: Variable buf going out of scope leaks the storage it points 
> to.

It will exit in the next instruction.

> 2.
> CID 1542704: (#1 of 1): Resource leak (RESOURCE_LEAK)
> leaked_storage: Variable conn going out of scope leaks the storage it points 
> to.

The connect_database function with exit_on_error set to false is used in 2
routines:

* cleanup_objects_atexit: that's about to exit;
* drop_primary_replication_slot: that will execute a few routines before 
exiting.

> 3.
> CID 1542691: (#1 of 1): Resource leak (RESOURCE_LEAK)
> leaked_storage: Variable str going out of scope leaks the storage it points 
> to.

It will exit in the next instruction.

Having said that, applying this patch is just a matter of style.


--
Euler Taveira
EDB   https://www.enterprisedb.com/


Re: Combine Prune and Freeze records emitted by vacuum

2024-03-27 Thread Melanie Plageman
On Thu, Mar 28, 2024 at 01:04:04AM +0200, Heikki Linnakangas wrote:
> On 27/03/2024 20:26, Melanie Plageman wrote:
> > On Wed, Mar 27, 2024 at 12:18 PM Heikki Linnakangas  wrote:
> > > 
> > > On 27/03/2024 17:18, Melanie Plageman wrote:
> > > > I need some way to modify the control flow or accounting such that I
> > > > know which HEAPTUPLE_RECENTLY_DEAD tuples will not be marked LP_DEAD.
> > > > And a way to consider freezing and do live tuple accounting for these
> > > > and HEAPTUPLE_LIVE tuples exactly once.
> > > 
> > > Just a quick update: I've been massaging this some more today, and I
> > > think I'm onto something palatable. I'll send an updated patch later
> > > today, but the key is to note that for each item on the page, there is
> > > one point where we determine the fate of the item, whether it's pruned
> > > or not. That can happen at different points in heap_page_prune().
> > > That's also when we set marked[offnum] = true. Whenever that happens, we
> > > call one of the heap_page_prune_record_*() subroutines. We already
> > > have those subroutines for when a tuple is marked as dead or unused, but
> > > let's add similar subroutines for the case that we're leaving the tuple
> > > unchanged. If we move all the bookkeeping logic to those subroutines, we
> > > can ensure that it gets done exactly once for each tuple, and at that
> > > point we know what we are going to do to the tuple, so we can count it
> > > correctly. So heap_prune_chain() decides what to do with each tuple, and
> > > ensures that each tuple is marked only once, and the subroutines update
> > > all the variables, add the item to the correct arrays etc. depending on
> > > what we're doing with it.
> > 
> > Yes, this would be ideal.
> 
> Well, that took me a lot longer than expected. My approach of "make sure you
> call the right heap_prune_record_*() subroutine in all cases" didn't work out
> quite as easily as I thought.
> know if a non-DEAD tuple that is part of a HOT chain will be visited later
> as part of the chain processing, or needs to be counted at the top of
> heap_prune_chain().
> 
> The solution I came up with is to add a third phase to pruning. At the top
> of heap_prune_chain(), if we see a live heap-only tuple, and we're not sure
> if it will be counted later as part of a HOT chain, we stash it away and
> revisit it later, after processing all the hot chains. That's somewhat
> similar to your 'counted' array, but not quite.

I like this idea (the third phase). I've just started reviewing this but
I thought I would give the initial thoughts I had inline.

> One change with this is that live_tuples and many of the other fields are
> now again updated, even if the caller doesn't need them. It was hard to skip
> them in a way that would save any cycles, with the other refactorings.

I am worried we are writing checks that are going to have to come out of
SELECT queries' bank accounts, but I'll do some benchmarking when we're
all done with major refactoring.

> From 2f38628373ccfb6e8f8fd883955056030092569d Mon Sep 17 00:00:00 2001
> From: Heikki Linnakangas 
> Date: Thu, 28 Mar 2024 00:16:09 +0200
> Subject: [PATCH v8 21/22] Add comment about a pre-existing issue
> 
> Not sure if we want to keep this, but I wanted to document it for
> discussion at least.
> ---
>  src/backend/access/heap/pruneheap.c | 17 +
>  1 file changed, 17 insertions(+)
> 
> diff --git a/src/backend/access/heap/pruneheap.c 
> b/src/backend/access/heap/pruneheap.c
> index e37ba655a7d..2b720ab6aa1 100644
> --- a/src/backend/access/heap/pruneheap.c
> +++ b/src/backend/access/heap/pruneheap.c
> @@ -792,6 +792,23 @@ heap_prune_chain(Buffer buffer, OffsetNumber rootoffnum,
> 	 * Note that we might first arrive at a dead heap-only tuple
> 	 * either here or while following a chain below.  Whichever path
> 	 * gets there first will mark the tuple unused.
> +	 *
> +	 * FIXME: The next paragraph isn't new with these patches, but
> +	 * just something I realized while looking at this. But maybe we should
> +	 * add a comment like this? Or is it too much detail?

I think a comment is a good idea.

> +	 *
> +	 * Whether we arrive at the dead HOT tuple first here or while
> +	 * following a chain below affects whether preceding RECENTLY_DEAD
> +	 * tuples in the chain can be removed or not.  Imagine that you
> +	 * have a chain with two tuples: RECENTLY_DEAD -> DEAD.  If we
> +	 * reach the RECENTLY_DEAD tuple first, the chain-following logic
> +	 * will find the DEAD tuple and conclude that both tuples are in
> +	 * fact dead and can be removed.  But if 

Re: add AVX2 support to simd.h

2024-03-27 Thread Nathan Bossart
On Wed, Mar 27, 2024 at 04:37:35PM -0500, Nathan Bossart wrote:
> On Wed, Mar 27, 2024 at 05:10:13PM -0400, Tom Lane wrote:
>> LGTM otherwise, and I like the fact that the #if structure
>> gets a lot less messy.
> 
> Thanks for reviewing.  I've attached a v2 that I intend to commit when I
> get a chance.

Committed.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

2024-03-27 Thread Masahiko Sawada
Hi,

On Thu, Jan 18, 2024 at 5:33 PM Masahiko Sawada  wrote:
>
> On Thu, Jan 18, 2024 at 4:59 PM Alexander Korotkov  
> wrote:
> >
> > On Thu, Jan 18, 2024 at 4:16 AM torikoshia  
> > wrote:
> > > On 2024-01-18 10:10, jian he wrote:
> > > > On Thu, Jan 18, 2024 at 8:57 AM Masahiko Sawada 
> > > > wrote:
> > > >> On Thu, Jan 18, 2024 at 6:38 AM Tom Lane  wrote:
> > > >> > Kyotaro-san's suggestion isn't bad, though I might shorten it to
> > > >> > error_action {error|ignore|log} (or perhaps "stop" instead of 
> > > >> > "error")?
> > > >> > You will need a separate parameter anyway to specify the destination
> > > >> > of "log", unless "none" became an illegal table name when I wasn't
> > > >> > looking.  I don't buy that one parameter that has some special values
> > > >> > while other values could be names will be a good design.  Moreover,
> > > >> > what if we want to support (say) log-to-file along with log-to-table?
> > > >> > Trying to distinguish a file name from a table name without any other
> > > >> > context seems impossible.
> > > >>
> > > >> I've been thinking we can add more values to this option to log errors
> > > >> not only to the server logs but also to the error table (not sure
> > > >> details but I imagined an error table is created for each table on
> > > >> error), without an additional option for the destination name. The
> > > >> values would be like error_action {error|ignore|save-logs|save-table}.
> > > >>
> > > >
> > > > another idea:
> > > > on_error {error|ignore|other_future_option}
> > > > if not specified then by default ERROR.
> > > > You can also specify ERROR or IGNORE for now.
> > > >
> > > > I agree, the parameter "error_action" is better than "location".
> > >
> > > I'm not sure whether error_action or on_error is better, but either way
> > > "error_action error" and "on_error error" seems a bit odd to me.
> > > I feel "stop" is better for both cases as Tom suggested.
> >
> > OK.  What about this?
> > on_error {stop|ignore|other_future_option}
> > where other_future_option might be compound like "file 'copy.log'" or
> > "table 'copy_log'".
>
> +1
>

I realized that the ON_ERROR syntax synopsis in the documentation is not
correct. The option doesn't require the value to be quoted, and the
value can be omitted. The attached patch fixes it.
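
For reference, a hedged sketch of the corrected synopsis (table and file
names are hypothetical):

-- the ON_ERROR value needs no quotes:
COPY t FROM '/tmp/data.csv' WITH (ON_ERROR ignore);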

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


0001-doc-Fix-COPY-ON_ERROR-option-syntax-synopsis.patch
Description: Binary data


Re: Streaming I/O, vectored I/O (WIP)

2024-03-27 Thread Thomas Munro
On Thu, Mar 28, 2024 at 2:02 PM Thomas Munro  wrote:
> ... In practice on a non-toy system, that's always going to be
> io_combine_limit.  ...

And to be more explicit about that: you're right that we initialise
max_pinned_buffers such that it's usually at least io_combine_limit,
but then if you have a very small buffer pool it gets clobbered back
down again by LimitAdditionalPins() and may finish up as low as 1.
You're not allowed to pin more than 1/Nth of the whole buffer pool,
where N is approximately max connections (well it's not exactly that
but that's the general idea).  So it's a degenerate case, but it can
happen that max_pinned_buffers is lower than io_combine_limit and then
it's important not to set distance higher or you'd exceed the allowed
limits (or more likely the circular data structure would implode).




Re: Streaming I/O, vectored I/O (WIP)

2024-03-27 Thread Thomas Munro
On Mon, Mar 25, 2024 at 2:02 AM Thomas Munro  wrote:
> On Wed, Mar 20, 2024 at 4:04 AM Heikki Linnakangas  wrote:
> > >   /*
> > >* Skip the initial ramp-up phase if the caller says we're going to 
> > > be
> > >* reading the whole relation.  This way we start out doing 
> > > full-sized
> > >* reads.
> > >*/
> > >   if (flags & PGSR_FLAG_FULL)
> > >   pgsr->distance = Min(MAX_BUFFERS_PER_TRANSFER, 
> > > pgsr->max_pinned_buffers);
> > >   else
> > >   pgsr->distance = 1;
> >
> > Should this be "Max(MAX_BUFFERS_PER_TRANSFER,
> > pgsr->max_pinned_buffers)"? max_pinned_buffers cannot be smaller than
> > MAX_BUFFERS_PER_TRANSFER though, given how it's initialized earlier. So
> > perhaps just 'pgsr->distance = pgsr->max_pinned_buffers' ?
>
> Right, done.

BTW I forgot to mention that in v10 I changed my mind and debugged my
way back to the original coding, which now looks like this:

/*
 * Skip the initial ramp-up phase if the caller says we're going to be
 * reading the whole relation.  This way we start out assuming we'll be
 * doing full io_combine_limit sized reads (behavior B).
 */
if (flags & READ_STREAM_FULL)
stream->distance = Min(max_pinned_buffers, io_combine_limit);
else
stream->distance = 1;

It's not OK for distance to exceed max_pinned_buffers.  But if
max_pinned_buffers is huge, remember that the goal here is to access
'behavior B' meaning wide read calls but no unnecessary extra
look-ahead beyond what is needed for that, so we also don't want to
exceed io_combine_limit.  Therefore we want the minimum of those two
numbers.  In practice on a non-toy system, that's always going to be
io_combine_limit.  But I'm not sure how many users of READ_STREAM_FULL
there will be, and I am starting to wonder if it's a good name for the
flag, or even generally very useful.  It's sort of saying "I expect to
do I/O, and it'll be sequential, and I won't give up until the end".
But how many users can really make those claims?  pg_prewarm is unusual
in that it contains an explicit assumption that the cache is cold and
we want to warm it up.  But maybe we should just let the adaptive
algorithm do its thing.  It only takes a few reads to go from 1 ->
io_combine_limit.

Thinking harder, if we're going to keep this and not just be fully
adaptive, perhaps there should be a flag READ_STREAM_COLD, where you
hint that the data is not expected to be cached, and you'd combine
that with the _SEQUENTIAL hint.  pg_prewarm hints _COLD | _SEQUENTIAL.
Then the initial distance would be something that uses the flag
combinations to select initial behavior A, B, or C (and we'll quickly
adjust if you're wrong):

if (!(flags & READ_STREAM_COLD))
stream->distance = 1;
else if (flags & READ_STREAM_SEQUENTIAL)
stream->distance = Min(max_pinned_buffers, io_combine_limit);
else
stream->distance = max_pinned_buffers;

But probably almost all users, especially in the executor, haven't
really got much of a clue what they're going to do, so they'd use the
initial starting position of 1 (A) and we'd soon figure it out.  Maybe
overengineering for pg_prewarm is a waste of time and we should just
delete the flag instead and hard-code 1.




Re: Add new error_action COPY ON_ERROR "log"

2024-03-27 Thread Masahiko Sawada
On Thu, Mar 28, 2024 at 2:49 AM Bharath Rupireddy
 wrote:
>
> On Wed, Mar 27, 2024 at 7:42 AM Masahiko Sawada  wrote:
> >
> > I think that there are two options to handle it:
> >
> > 1. change COPY grammar to accept DEFAULT as an option value.
> > 2. change tab-completion to complete 'DEFAULT' instead of DEFAULT,
> > and update the doc too.
> >
> > As for the documentation, we can add single-quotes as follows:
> >
> >  ENCODING 'encoding_name'
> > +LOG_VERBOSITY [ 'mode' ]
> >
> > I thought option (2) was better, but there seems to be no precedent for
> > completing a single-quoted string other than file names. So
> > option (1) could be clearer.
> >
> > What do you think?
>
> There is another option: change log_verbosity to {none, verbose} or
> {none, skip_row_info} (discussed here:
> https://www.postgresql.org/message-id/Zelrqq-pkfkvsjPn%40paquier.xyz
> -- we might extend this option to other use-cases in future). I tend
> to agree with you about supporting log_verbosity set to default without
> quotes, just to be consistent with other commands that allow that [1].
> And, thanks for quickly identifying where to change in the gram.y.
> With that change, now I have changed all the new tests added to use
> log_verbosity default without quotes.
>
> FWIW, a recent commit [2] did the same. Therefore, I don't see a
> problem supporting it that way for COPY log_verbosity.
>
> Please find the attached v13 patch with the change.

Thank you for updating the patch quickly, and sharing the reference.

I think {default, verbose} is a good start and more consistent with
other similar features. We can add other modes later.

Regarding the syntax change, since copy_generic_opt_arg rule is only
for COPY option syntax, the change doesn't affect other syntaxes. I've
confirmed the tab-completion works fine.
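
As a quick sketch of what was verified (with the v13 patch applied; table
and file names are hypothetical):

COPY t FROM '/tmp/data.csv' WITH (ON_ERROR ignore, LOG_VERBOSITY verbose);
COPY t FROM '/tmp/data.csv' WITH (LOG_VERBOSITY default);  -- unquoted DEFAULT now parses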

I'll push the patch, barring any objections.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Streaming read-ready sequential scan code

2024-03-27 Thread Melanie Plageman
On Fri, Mar 8, 2024 at 4:56 PM Melanie Plageman
 wrote:
>
> On Sat, Mar 02, 2024 at 06:07:48PM -0500, Melanie Plageman wrote:
> > On Wed, Feb 28, 2024 at 12:30 PM Melanie Plageman
> >  wrote:
> > >
> > > On Mon, Feb 26, 2024 at 03:56:57PM -0500, Melanie Plageman wrote:
> > > > On Mon, Feb 19, 2024 at 6:05 PM Melanie Plageman
> > > >  wrote:
> > > > >
> > > > > On Mon, Jan 29, 2024 at 4:17 PM Melanie Plageman
> > > > >  wrote:
> > > > > >
> > > > > > There is an outstanding question about where to allocate the
> > > > > > PgStreamingRead object for sequential scans
> > > > >
> > > > > I've written three alternative implementations of the actual streaming
> > > > > read user for sequential scan which handle the question of where to
> > > > > allocate the streaming read object and how to handle changing scan
> > > > > direction in different ways.
> > > > >
> > > > > Option A) 
> > > > > https://github.com/melanieplageman/postgres/tree/seqscan_pgsr_initscan_allocation
> > > > > - Allocates the streaming read object in initscan(). Since we do not
> > > > > know the scan direction at this time, if the scan ends up not being a
> > > > > forwards scan, the streaming read object must later be freed -- so
> > > > > this will sometimes allocate a streaming read object it never uses.
> > > > > - Only supports ForwardScanDirection and once the scan direction
> > > > > changes, streaming is never supported again -- even if we return to
> > > > > ForwardScanDirection
> > > > > - Must maintain a "fallback" codepath that does not use the streaming 
> > > > > read API
> > > >
> > > > Attached is a version of this patch which implements a "reset"
> > > > function for the streaming read API which should be cheaper than the
> > > > full pg_streaming_read_free() on rescan. This can easily be ported to
> > > > work on any of my proposed implementations (A/B/C). I implemented it
> > > > on A as an example.
> > >
> > > Attached is the latest version of this patchset -- rebased in light of
> > > Thomas' updates to the streaming read API [1]. I have chosen the
> > > approach I think we should go with. It is a hybrid of my previously
> > > proposed approaches.
> >
> > While investigating some performance concerns, Andres pointed out that
> > the members I add to HeapScanDescData in this patch push rs_cindex and
> > rs_ntuples to another cacheline and introduce a 4-byte hole. Attached
> > v4's HeapScanDescData is as well-packed as on master and its members
> > are reordered so that rs_cindex and rs_ntuples are back on the second
> > cacheline of the struct's data.
>
> I did some additional profiling and realized that dropping the
> unlikely() from the places we check rs_inited frequently was negatively
> impacting performance. v5 adds those back and also makes a few other
> very minor cleanups.
>
> Note that this patch set has a not yet released version of Thomas
> Munro's Streaming Read API with a new ramp-up logic which seems to fix a
> performance issue I saw with my test case when all of the sequential
> scan's blocks are in shared buffers. Once he sends the official new
> version, I will rebase this and point to his explanation in that thread.

Attached v6 has the version of the streaming read API mentioned here
[1]. This resolved the fully-in-shared-buffers regressions
investigated in that thread by Andres, Bilal, and Thomas.

The one outstanding item for the sequential scan streaming read user
is deciding how the BAS_BULKREAD buffer access strategy should
interact with the streaming read infrastructure. We discussed a bit
off-list, and it seems clear that the ring must be at least as large
as io_combine_limit. This should be no problem for BAS_BULKREAD
because its ring is 16 MB. The question is whether or not we need to
do anything right now to ensure there aren't adverse interactions
between io_combine_limit, max_ios, and the buffer access strategy ring
buffer size.

- Melanie

[1] 
https://www.postgresql.org/message-id/CA%2BhUKGJTwrS7F%3DuJPx3SeigMiQiW%2BLJaOkjGyZdCntwyMR%3DuAw%40mail.gmail.com
From bed26d391190f4411eccc4533d188e5dba6e8f72 Mon Sep 17 00:00:00 2001
From: Melanie Plageman 
Date: Sat, 27 Jan 2024 18:39:37 -0500
Subject: [PATCH v6 1/6] Split heapgetpage into two parts

heapgetpage(), a per-block utility function used in heap scans, read a
passed-in block into a buffer and then, if page-at-a-time processing was
enabled, pruned the page and built an array of its visible tuples. This
was used for sequential and sample scans.

Future commits will add support for streaming reads. The streaming read
API will read in the blocks specified by a callback, but any significant
per-page processing should be done synchronously on the buffer yielded
by the streaming read API. To support this, separate the logic in
heapgetpage() to read a block into a buffer from that which prunes the
page and builds an array of the visible tuples. The former is now
heapfetchbuf() and the latter is now heapbuildvis().

Future commits will push 

Re: Possibility to disable `ALTER SYSTEM`

2024-03-27 Thread David G. Johnston
On Wed, Mar 27, 2024 at 5:43 PM David G. Johnston <
david.g.johns...@gmail.com> wrote:

>
> This section is also the main entry point for users into the configuration
> subsystem and hasn't been updated to reflect this new feature.  That seems
> like an oversight that needs to be corrected.
>
>
Shouldn't the "alter system" reference page receive an update as well?

David J.


Re: Possibility to disable `ALTER SYSTEM`

2024-03-27 Thread David G. Johnston
On Wed, Mar 27, 2024 at 5:17 PM Bruce Momjian  wrote:

> On Thu, Mar 28, 2024 at 12:43:29AM +0100, Jelte Fennema-Nio wrote:
> > +  xreflabel="allow_alter_system">
> > +  allow_alter_system (boolean)
> > +  
> > +   allow_alter_system configuration
> parameter
> > +  
> > +  
> > +  
> > +   
> > +When allow_alter_system is set to
> > +off, an error is returned if the
> ALTER
> > +SYSTEM command is used. This parameter can only be
> set in
>
> "command is used." -> "command is issued." ?
>

"command is executed" seems even better.  I'd take used over issued.


> > +the postgresql.conf file or on the server
> command
> > +line. The default value is on.
> > +   
> > +
> > +   
> > +Note that this setting cannot be regarded as a security
> feature. It
>
> "setting cannot be regarded" -> "setting should not be regarded"
>

"setting must not be regarded" is the correct option here.  Stronger than
should; we are unable to control whether someone can/does regard it
differently.


> > +
> > +   
> > +Turning this setting off is intended for environments where the
> > +configuration of PostgreSQL is
> managed by
> > +some outside mechanism.
> > +In such environments, a well-intentioned superuser might
> > +mistakenly use ALTER
> SYSTEM
> > +to change the configuration instead of using the outside
> mechanism.
> > +This might even appear to update the configuration as intended,
> but
>
> "This might even appear to update" -> "This might temporarily update"
>

I strongly prefer temporarily over may/might/could.



>
> > +then might be discarded at some point in the future when that
> outside
>
> "that outside" -> "the outside"
>

I feel like "external" is a more context-appropriate term here than "outside".

External also has precedent.
https://www.postgresql.org/docs/current/config-setting.html#CONFIG-INCLUDES
"External tools may also modify postgresql.auto.conf. It is not recommended
to do this while the server is running,"

That suggests using "external tools" instead of "outside mechanisms"

This section is also the main entry point for users into the configuration
subsystem and hasn't been updated to reflect this new feature.  That seems
like an oversight that needs to be corrected.

> +   
> > +
> > +   
> > +This parameter only controls the use of ALTER
> SYSTEM.
> > +The settings stored in
> postgresql.auto.conf always
>
> "always" -> "still"
>

Neither qualifier is needed, nor does one seem clearly better than the
other.  "Always" is true, so the weaker "still" seems like the worse choice.

The following is a complete and clear sentence.

The settings stored in postgresql.auto.conf take effect even if
allow_alter_system is set to off.


> Should this paragraph be moved after or as part of the paragraph about
> modifying postgresql.auto.conf?
>
>
I like it by itself.

David J.


Re: [HACKERS] make async slave to wait for lsn to be replayed

2024-03-27 Thread Alexander Korotkov
On Tue, Mar 26, 2024 at 4:06 PM Kartyshov Ivan
 wrote:
> Thank you for your interest in the patch.
> I understand your questions, but I fully support Alexander Korotkov's idea
> to commit the minimal required functionality, and then keep working on
> other improvements.

I did further improvements in the patch.

Notably, I decided to rename the procedure to
pg_wait_for_wal_replay_lsn().  This makes the name look consistent
with other WAL-related functions.  Also it clearly states that we're
waiting for the LSN to be replayed (not received, written, or flushed).
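
On a standby, usage would look roughly like this (illustrative LSN; see
the patch for the exact signature):

CALL pg_wait_for_wal_replay_lsn('0/306EE20');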

Also, I made improvements to the docs and the commit message, and some minor code fixes.

I'm continuing to look at this patch.

--
Regards,
Alexander Korotkov




Re: DOCS: add helpful partitioning links

2024-03-27 Thread Robert Treat
On Mon, Mar 25, 2024 at 6:43 AM Ashutosh Bapat
 wrote:
> On Fri, Mar 22, 2024 at 10:58 PM Robert Treat  wrote:
>> v5 patch attached which I think further improves clarity/brevity of
>> this section. I've left the patch name the same for simplicity, but
>> I'd agree that the commit would now be more along the lines of editing
>> / improvements / copyediting of "Partition Maintenance" docs.
>
>
> Right. Minor suggestions.
>
> - It is recommended to drop the now-redundant CHECK
> - constraint after the ATTACH PARTITION is complete.  
> If
> - the table being attached is itself a partitioned table, then each of its
> + As illustrated above, it is recommended to avoid this scan by creating a
> + CHECK constraint on the to be attached table that
>
> Instead of "to be attached table", "table to be attached" reads better. You 
> may want to add "as a partition" after that.
>

That sounds more awkward to me, but I've done some rewording to avoid both.

>   Similarly, if the partitioned table has a DEFAULT
>   partition, it is recommended to create a CHECK
>   constraint which excludes the to-be-attached partition's constraint.  If
> - this is not done then the DEFAULT partition will be
> + this is not done, the DEFAULT partition must be
>
> I am not sure whether replacing "will" with "must" is correct. Usually I have 
> seen "will" being used in such sentences, but "must" seems appropriate given 
> the necessity.
>

OK

Updated patch attached.


Robert Treat
https://xzilla.net


improve-partition-links_v6.patch
Description: Binary data


Re: Possibility to disable `ALTER SYSTEM`

2024-03-27 Thread Bruce Momjian
On Thu, Mar 28, 2024 at 12:43:29AM +0100, Jelte Fennema-Nio wrote:
> +  xreflabel="allow_alter_system">
> +  allow_alter_system (boolean)
> +  
> +   allow_alter_system configuration 
> parameter
> +  
> +  
> +  
> +   
> +When allow_alter_system is set to
> +off, an error is returned if the ALTER
> +SYSTEM command is used. This parameter can only be set in

"command is used." -> "command is issued." ?

> +the postgresql.conf file or on the server 
> command
> +line. The default value is on.
> +   
> +
> +   
> +Note that this setting cannot be regarded as a security feature. It

"setting cannot be regarded" -> "setting should not be regarded"

> +only disables the ALTER SYSTEM command. It does 
> not
> +prevent a superuser from changing the configuration using other SQL
> +commands. A superuser has many ways of executing shell commands at
> +the operating system level, and can therefore modify
> +postgresql.auto.conf regardless of the value of
> +this setting.

I like that you explained how this can be bypassed.
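
For example (an illustrative bypass only, with a hypothetical data
directory path), a superuser could still do:

COPY (SELECT 'work_mem = ''64MB''') TO PROGRAM
    'cat >> /path/to/datadir/postgresql.auto.conf';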

> +
> +   
> +Turning this setting off is intended for environments where the
> +configuration of PostgreSQL is managed by
> +some outside mechanism.
> +In such environments, a well-intentioned superuser might
> +mistakenly use ALTER SYSTEM
> +to change the configuration instead of using the outside mechanism.
> +This might even appear to update the configuration as intended, but

"This might even appear to update" -> "This might temporarily update"

> +then might be discarded at some point in the future when that outside

"that outside" -> "the outside"

> +mechanism updates the configuration.
> +Setting this parameter to off can
> +help to avoid such mistakes.

"help to avoid" ->  "help avoid"

> +   
> +
> +   
> +This parameter only controls the use of ALTER 
> SYSTEM.
> +The settings stored in postgresql.auto.conf 
> always

"always" -> "still"

Should this paragraph be moved after or as part of the paragraph about
modifying postgresql.auto.conf?

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: Possibility to disable `ALTER SYSTEM`

2024-03-27 Thread Bruce Momjian
On Thu, Mar 28, 2024 at 12:47:46AM +0100, Jelte Fennema-Nio wrote:
> On Wed, 27 Mar 2024 at 23:06, Bruce Momjian  wrote:
> > > > > > +some outside mechanism. In such environments, using 
> > > > > > ALTER
> > > > > > +SYSTEM to make configuration changes might 
> > > > > > appear to work,
> > > > > > +but then may be discarded at some point in the future when 
> > > > > > that outside
> > > > >
> > > > > "might"
> > > >
> > > > This does not seem like a mistake to me. I'm not sure why you think it 
> > > > is.
> > >
> > > I also think the original sentence was correct, but I don't think it
> > > read very naturally. Changed it now in hopes to improve that.
> >
> > So, might means "possibility" while "may" means permission, so "might"
> > is clearer here.
> 
> Aaah, I misunderstood your original feedback then. I thought you
> didn't like the use of "might" in "might appear to work". But I now
> realize you meant "may be discarded" should be changed to "might be
> discarded". I addressed that in my latest version of the patch.

Thanks.  I did the may/might/can changes in the docs years ago so I
remember the distinction.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Fix some resources leaks (src/bin/pg_basebackup/pg_createsubscriber.c)

2024-03-27 Thread Ranier Vilela
Hi,

Per Coverity.

Coverity has some reports in the new code
pg_createsubscriber.c
I think that Coverity is right.

1.
CID 1542682: (#1 of 1): Resource leak (RESOURCE_LEAK)
leaked_storage: Variable buf going out of scope leaks the storage it points
to.

2.
CID 1542704: (#1 of 1): Resource leak (RESOURCE_LEAK)
leaked_storage: Variable conn going out of scope leaks the storage it
points to.

3.
CID 1542691: (#1 of 1): Resource leak (RESOURCE_LEAK)
leaked_storage: Variable str going out of scope leaks the storage it points
to.
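
The usual fix pattern for these reports is to release the resource on the
early-exit path, along these lines (a sketch of the pattern, not the
attached patch itself):

conn = PQconnectdb(conninfo);
if (PQstatus(conn) != CONNECTION_OK)
{
	pg_log_error("connection failed: %s", PQerrorMessage(conn));
	PQfinish(conn);		/* previously leaked on this path */
	return NULL;
}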

Patch attached.

best regards,
Ranier Vilela


fix-some-issues-pg_createsubscriber.patch
Description: Binary data


Re: Possibility to disable `ALTER SYSTEM`

2024-03-27 Thread Jelte Fennema-Nio
On Wed, 27 Mar 2024 at 23:06, Bruce Momjian  wrote:
> > > > > +some outside mechanism. In such environments, using 
> > > > > ALTER
> > > > > +SYSTEM to make configuration changes might appear 
> > > > > to work,
> > > > > +but then may be discarded at some point in the future when 
> > > > > that outside
> > > >
> > > > "might"
> > >
> > > This does not seem like a mistake to me. I'm not sure why you think it is.
> >
> > I also think the original sentence was correct, but I don't think it
> > read very naturally. Changed it now in hopes to improve that.
>
> So, might means "possibility" while "may" means permission, so "might"
> is clearer here.

Aaah, I misunderstood your original feedback then. I thought you
didn't like the use of "might" in "might appear to work". But I now
realize you meant "may be discarded" should be changed to "might be
discarded". I addressed that in my latest version of the patch.




Re: Crash on UNION with PG 17

2024-03-27 Thread David Rowley
On Thu, 28 Mar 2024 at 04:34, Regina Obe  wrote:
> The issue can be exercised without postgis installed as follows:
>
>
> CREATE TABLE edge_data AS
> SELECT i AS edge_id, i + 1 AS start_node, i + 2 As end_node
> FROM generate_series(1,10) AS i;
>
>   WITH edge AS (
> SELECT start_node, end_node
> FROM edge_data
> WHERE edge_id = 1
>   )
>   SELECT start_node id FROM edge UNION
>   SELECT end_node FROM edge;

Thanks for the report.

There's some discussion about this in [1] along with a proposed way to
fix it.  The proposed fix does alter the function signature of an
important and externally visible planner function, so will be waiting
for some feedback on that before moving ahead with fixing.

[1] 
https://www.postgresql.org/message-id/242fc7c6-a8aa-2daf-ac4c-0a231e261...@gmail.com

David




Re: Possibility to disable `ALTER SYSTEM`

2024-03-27 Thread Jelte Fennema-Nio
On Wed, 27 Mar 2024 at 20:10, Maciek Sakrejda  wrote:
>
> On Wed, Mar 27, 2024, 11:46 Robert Haas  wrote:
>>
>> On Wed, Mar 27, 2024 at 1:12 PM Isaac Morland  
>> wrote:
>> > On Wed, 27 Mar 2024 at 13:05, Greg Sabino Mullane  
>> > wrote:
>> >>> The purpose of the setting is to prevent accidental 
>> >>> modifications via ALTER SYSTEM in environments where
>> >> The emphasis on 'accidental' seems a bit heavy here, and odd. Surely, 
>> >> just "to prevent modifications via ALTER SYSTEM in environments where..." 
>> >> is enough?
>> > Not necessarily disagreeing, but it's very important nobody ever mistake 
>> > this for a security feature. I don't know if the extra word "accidental" 
>> > is necessary, but I think that's the motivation.
>>
>> I think the emphasis is entirely warranted in this case.
>
> +1. And while "non-malicious" may technically be more correct, I don't think 
> it's any clearer.

Attached is a new version of the patch with some sentences reworded. I
changed accidentally to mistakenly (which still has emphasis). And I
hope with the rewording it's now clearer to the reader why that
emphasis is there.


v9-0001-Add-allow_alter_system-GUC.patch
Description: Binary data


Re: Psql meta-command conninfo+

2024-03-27 Thread Imseih (AWS), Sami
Hi,

Thanks for working on this.

I have a few comments about the current patch.

1/ I looked through other psql meta-commands, and the "+" adds details but
does not change the output format. In this patch, conninfo and conninfo+
have completely different output: the former is a string with all the details
and the latter is a table. Should conninfo just be a table with the minimal
info, with the additional details displayed when "+" is appended?

instead of \conninfo displaying a string

You are connected to database "postgres" as user "postgres" on host "127.0.0.1" 
(address "127.0.0.1") at port "5432".

It can instead display the same information in table format

Current Connection Information
-[ RECORD 1 ]------+----------
Database           | postgres
Authenticated User | postgres
Socket Directory   |
Host               | 127.0.0.1
Port               | 5432

and \conninfo+ can show the rest of the details

Current Connection Information
-[ RECORD 1 ]------+----------
Database           | postgres
Authenticated User | postgres
Socket Directory   |
Host               | 127.0.0.1
Port               | 5432
Backend PID        | 1234
...

2/ GSSAPI details should show the full details, such as "principal",
"encrypted" and "credentials_delegated".

This also means that pg_stat_ssl will need to be joined with pg_stat_gssapi:

FROM pg_catalog.pg_stat_ssl ssl
     LEFT JOIN pg_catalog.pg_stat_gssapi gssapi ON ssl.pid = gssapi.pid
WHERE ssl.pid = pg_catalog.pg_backend_pid()
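
Building on that join, the query behind \conninfo+ could look roughly like
this (column names taken from the documented views; the exact shape is of
course up to the patch):

SELECT ssl.ssl AS "SSL Connection",
       ssl.version AS "SSL Protocol Version",
       ssl.cipher AS "SSL Cipher",
       gssapi.gss_authenticated AS "GSSAPI Authenticated",
       gssapi.principal AS "GSSAPI Principal",
       gssapi.encrypted AS "GSSAPI Encrypted",
       gssapi.credentials_delegated AS "GSSAPI Credentials Delegated"
FROM pg_catalog.pg_stat_ssl ssl
     LEFT JOIN pg_catalog.pg_stat_gssapi gssapi ON ssl.pid = gssapi.pid
WHERE ssl.pid = pg_catalog.pg_backend_pid();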


3/ A very nice-to-have is "Application Name", for the case where one
sets application_name within a session or in the connection string.

Regards,

Sami Imseih
Amazon Web Services (AWS)





Re: Possibility to disable `ALTER SYSTEM`

2024-03-27 Thread Jelte Fennema-Nio
On Wed, 27 Mar 2024 at 23:23, Bruce Momjian  wrote:
>
> On Wed, Mar 27, 2024 at 11:10:31AM -0400, Robert Haas wrote:
> > > Is this really a patch we think we can push into PG 17. I am having my
> > > doubts.
> >
> > If the worst thing that happens in PG 17 is that we push a patch that
> > needs a few documentation corrections, we're going to be doing
> > fabulously well.
>
> My point is that we are designing the user API in the last weeks of the
> commitfest, which usually ends badly for us, and the fact the docs were
> not even right in the patch just reinforces that concern.

This user API is exactly the same as the original patch from Gabriele
in September (apart from enable->allow). And we spent half a year
discussing other designs for the user API. So I disagree that we're
designing the user API in the last weeks of the commitfest.




RE: Popcount optimization using AVX512

2024-03-27 Thread Amonson, Paul D
> -Original Message-
> From: Nathan Bossart 
> Sent: Wednesday, March 27, 2024 3:00 PM
> To: Amonson, Paul D 
> 
> ...  (I realize that I'm essentially
> recanting much of my previous feedback, which I apologize for.)

It happens. LOL. As long as the algorithm for AVX-512 is not altered, I am 
confident that your new refactor will be fine. :)

Thanks,
Paul





Re: Crash on UNION with PG 17

2024-03-27 Thread Bruce Momjian
On Wed, Mar 27, 2024 at 11:33:55AM -0400, Regina Obe wrote:
> Our PostGIS bot that follows the master branch has been crashing for the
> past couple of days on one of our tests
> 
> https://trac.osgeo.org/postgis/ticket/5701
> 
> I traced the issue down to this commit:
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=66c0185a3d14b
> bbf51d0fc9d267093ffec735231
> 
> 
> The issue can be exercised without postgis installed as follows:
> 
> 
> CREATE TABLE edge_data AS 
> SELECT i AS edge_id, i + 1 AS start_node, i + 2 As end_node
> FROM generate_series(1,10) AS i;
> 
>   WITH edge AS (
> SELECT start_node, end_node
> FROM edge_data
> WHERE edge_id = 1
>   )
>   SELECT start_node id FROM edge UNION
>   SELECT end_node FROM edge;

I can confirm the crash in git master.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: Possibility to disable `ALTER SYSTEM`

2024-03-27 Thread Bruce Momjian
On Wed, Mar 27, 2024 at 03:20:38PM -0700, David G. Johnston wrote:
> On Wed, Mar 27, 2024 at 3:18 PM David G. Johnston 
> wrote:
> 
> On Wed, Mar 27, 2024 at 3:13 PM Bruce Momjian  wrote:
> 
> On Wed, Mar 27, 2024 at 06:09:02PM -0400, Bruce Momjian wrote:
> > On Wed, Mar 27, 2024 at 11:05:55AM -0400, Robert Haas wrote:
> > > On Wed, Mar 27, 2024 at 10:43 AM Jelte Fennema-Nio <
> postg...@jeltef.nl> wrote:
> > > > Alright, changed the GUC name to "allow_alter_system" since that
> seems
> > > > to have the most "votes". One other option would be to call it
> simply
> > > > "alter_system", just like "jit" is not called "allow_jit" or
> > > > "enable_jit".
> > > >
> > > > But personally I feel that the "allow_alter_system" is clearer
> than
> > > > plain "alter_system" for the GUC name.
> > >
> > > I agree, and have committed your 0001.
> >
> > So, I email "Is this really a patch we think we can push into PG 17.
> I
> > am having my doubts," and the patch is applied a few hours after my
> > email.  Wow.
> 
> Also odd is that I don't see the commit in git master, so now I am
> confused.
> 
> 
> The main feature being discussed is in the 0002 patch while Robert pushed 
> a
> doc section rename in the 0001 patch.
> 
> 
> 
> Well, the internal category name was changed though the docs did remain
> unchanged.

Yes, I figured that out, thank you.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: Possibility to disable `ALTER SYSTEM`

2024-03-27 Thread Bruce Momjian
On Wed, Mar 27, 2024 at 11:10:31AM -0400, Robert Haas wrote:
> > Is this really a patch we think we can push into PG 17. I am having my
> > doubts.
> 
> If the worst thing that happens in PG 17 is that we push a patch that
> needs a few documentation corrections, we're going to be doing
> fabulously well.

My point is that we are designing the user API in the last weeks of the
commitfest, which usually ends badly for us, and the fact the docs were
not even right in the patch just reinforces that concern.

But, as I stated in another email, you said you committed the patch,
yet I don't see it committed in git master, so I am confused.

Ah, I figured it out.  You were talking about the GUC renaming:

commit de7e96bd0fc
Author: Robert Haas 
Date:   Wed Mar 27 10:45:28 2024 -0400

Rename COMPAT_OPTIONS_CLIENT to COMPAT_OPTIONS_OTHER.

The user-facing name is "Other Platforms and Clients", but the
internal name seems too focused on clients specifically, especially
given the plan to add a new setting to this section that is about
platform or deployment model compatibility rather than client
compatibility.

Jelte Fennema-Nio

Discussion: 
http://postgr.es/m/cageczqtfmbdim6w3av+3wesnhxjvpmutecjxvvst91sqbdo...@mail.gmail.com

Please ignore my complaints, and my apologies.

As far as the GUC change goes, let's just be careful, since we have a bad
history of pushing things near the end that we regret.  I am not saying
that would be this feature, but let's be careful.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: Possibility to disable `ALTER SYSTEM`

2024-03-27 Thread David G. Johnston
On Wed, Mar 27, 2024 at 3:18 PM David G. Johnston <
david.g.johns...@gmail.com> wrote:

> On Wed, Mar 27, 2024 at 3:13 PM Bruce Momjian  wrote:
>
>> On Wed, Mar 27, 2024 at 06:09:02PM -0400, Bruce Momjian wrote:
>> > On Wed, Mar 27, 2024 at 11:05:55AM -0400, Robert Haas wrote:
>> > > On Wed, Mar 27, 2024 at 10:43 AM Jelte Fennema-Nio <
>> postg...@jeltef.nl> wrote:
>> > > > Alright, changed the GUC name to "allow_alter_system" since that
>> seems
>> > > > to have the most "votes". One other option would be to call it
>> simply
>> > > > "alter_system", just like "jit" is not called "allow_jit" or
>> > > > "enable_jit".
>> > > >
>> > > > But personally I feel that the "allow_alter_system" is clearer than
>> > > > plain "alter_system" for the GUC name.
>> > >
>> > > I agree, and have committed your 0001.
>> >
>> > So, I email "Is this really a patch we think we can push into PG 17. I
>> > am having my doubts," and the patch is applied a few hours after my
>> > email.  Wow.
>>
>> Also odd is that I don't see the commit in git master, so now I am
>> confused.
>>
>
> The main feature being discussed is in the 0002 patch while Robert pushed
> a doc section rename in the 0001 patch.
>
>
Well, the internal category name was changed though the docs did remain
unchanged.

David J.


Re: Possibility to disable `ALTER SYSTEM`

2024-03-27 Thread David G. Johnston
On Wed, Mar 27, 2024 at 3:13 PM Bruce Momjian  wrote:

> On Wed, Mar 27, 2024 at 06:09:02PM -0400, Bruce Momjian wrote:
> > On Wed, Mar 27, 2024 at 11:05:55AM -0400, Robert Haas wrote:
> > > On Wed, Mar 27, 2024 at 10:43 AM Jelte Fennema-Nio 
> wrote:
> > > > Alright, changed the GUC name to "allow_alter_system" since that
> seems
> > > > to have the most "votes". One other option would be to call it simply
> > > > "alter_system", just like "jit" is not called "allow_jit" or
> > > > "enable_jit".
> > > >
> > > > But personally I feel that the "allow_alter_system" is clearer than
> > > > plain "alter_system" for the GUC name.
> > >
> > > I agree, and have committed your 0001.
> >
> > So, I email "Is this really a patch we think we can push into PG 17. I
> > am having my doubts," and the patch is applied a few hours after my
> > email.  Wow.
>
> Also odd is that I don't see the commit in git master, so now I am
> confused.
>

The main feature being discussed is in the 0002 patch while Robert pushed a
doc section rename in the 0001 patch.

David J.


Re: Built-in CTYPE provider

2024-03-27 Thread Jeff Davis
On Tue, 2024-03-26 at 08:04 +0100, Peter Eisentraut wrote:
> The patch set v27 is ok with me, modulo (a) discussion about initcap 
> semantics, and (b) what collation to assign to ucs_basic, which can
> be 
> revisited later.

I held off on the refactoring patch for lc_{ctype|collate}_is_c().
There's an explicit "NB: pg_newlocale_from_collation is only supposed
to be called on non-C-equivalent locales" comment in DefineCollation().

What I'd like to do is make it possible to create valid pg_locale_t
objects out of C locales, which can be used anywhere a real locale can
be used. Callers can still check lc_{collate|ctype}_is_c() for various
reasons; but if they did call pg_newlocale_from_collation on a C locale
it would at least work for the pg_locale.h APIs. That would be a
slightly simpler and safer API, and make it easier to do the collation
version check consistently.

That's not very complicated, but it's a bit invasive and probably out
of scope for v17. It might be part of another change I had intended for
a while, which is to make NULL an invalid pg_locale_t, and use a
different representation to mean "use the server environment". That
would clean up a lot of checks for NULL.

For now, we'd still like to add the version number to the builtin
collations, so that leaves us with two options:

(a) Perform the version check in lc_{collate|ctype}_is_c(), which
duplicates some code and creates some inconsistency in how the version
is checked for different providers.

(b) Don't worry about it and just commit the version change in v27-
0001. The version check is already performed correctly on the database
without changes, even if the locale is "C". And there are already three
built-in "C" collations: "C", "POSIX", and UCS_BASIC; so it's not clear
why someone would create even more of them. And even if they did, there
would be no reason to give them a warning because we haven't
incremented the version, so there's no chance of a mismatch.
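
(For reference, the recorded and current versions can be compared with
something like the query below; whether collversion should be populated at
all for these built-in collations is exactly the question here.)

SELECT collname, collversion, pg_collation_actual_version(oid)
FROM pg_collation
WHERE collname IN ('C', 'POSIX', 'ucs_basic');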

I'm inclined toward (b). Thoughts?

Regards,
Jeff Davis





Re: Possibility to disable `ALTER SYSTEM`

2024-03-27 Thread Bruce Momjian
On Wed, Mar 27, 2024 at 06:09:02PM -0400, Bruce Momjian wrote:
> On Wed, Mar 27, 2024 at 11:05:55AM -0400, Robert Haas wrote:
> > On Wed, Mar 27, 2024 at 10:43 AM Jelte Fennema-Nio  
> > wrote:
> > > Alright, changed the GUC name to "allow_alter_system" since that seems
> > > to have the most "votes". One other option would be to call it simply
> > > "alter_system", just like "jit" is not called "allow_jit" or
> > > "enable_jit".
> > >
> > > But personally I feel that the "allow_alter_system" is clearer than
> > > plain "alter_system" for the GUC name.
> > 
> > I agree, and have committed your 0001.
> 
> So, I email "Is this really a patch we think we can push into PG 17. I
> am having my doubts," and the patch is applied a few hours after my
> email.  Wow.

Also odd is that I don't see the commit in git master, so now I am
confused.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: Large block sizes support in Linux

2024-03-27 Thread Stephen Frost
Greetings,

* Pankaj Raghav (ker...@pankajraghav.com) wrote:
> On 23/03/2024 05:53, Thomas Munro wrote:
> > On Fri, Mar 22, 2024 at 10:56 PM Pankaj Raghav (Samsung)
> >  wrote:
> >> My team and I have been working on adding Large block size(LBS)
> >> support to XFS in Linux[1]. Once this feature lands upstream, we will be
> >> able to create XFS with FS block size > page size of the system on Linux.
> >> We also gave a talk about it in Linux Plumbers conference recently[2]
> >> for more context. The initial support is only for XFS but more FSs will
> >> follow later.
> > 
> > Very cool!

Yes, this is very cool sounding and could be a real difference for PG.

> > (I used XFS on IRIX in the 90s, and it had large blocks then, a
> > feature lost in the port to Linux AFAIK.)
> 
> Yes, I heard this also from the Maintainer of XFS that they had to drop
> this functionality when they did the port. :)

I also recall the days of XFS on IRIX...  Many moons ago.

> > The short version is that we (and MySQL, via a different scheme with
> > different tradeoffs) could avoid writing all our stuff out twice if we
> > could count on atomic writes of a suitable size on power failure, so
> > the benefits are very large.  As far as I know, there are two things
> > we need from the kernel and storage to do that on "overwrite"
> > filesystems like XFS:
> > 
> > 1.  The disk must promise that its atomicity-on-power-failure is a
> > multiple of our block size -- something like NVMe AWUPF, right?  My
> > devices seem to say 0 :-(  Or I guess the filesystem has to
> > compensate, but then it's not exactly an overwrite filesystem
> > anymore...
> 
> 0 means 1 logical block, which might be 4k in your case. Typically device
> vendors have to put extra hardware to guarantee bigger atomic block sizes.

If I'm following correctly, this would mean that PG with FPW=off
(assuming everything else works) would be safe on more systems if PG
supported a 4K block size than if PG supported only 8K blocks, right?

There's been discussion and even some patches posted around the idea of
having run-time support in PG for different block sizes.  Currently,
it's a compile-time option with the default being 8K, meaning that's the
only option on a huge number of the deployed PG environments out there.
Moving it to run-time has some challenges and there's concerns about the
performance ... but if it meant we could run safely with FPW=off, that's
a pretty big deal.  On the other hand, if the expectation is that
basically everything will support atomic 8K, then we might be able to
simply keep that and not deal with supporting different page sizes at
run-time (of course, this is only one of the considerations in play, but
it could be particularly key, if I'm following correctly).
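
(Today that choice is fixed at build time, e.g. building with
./configure --with-blocksize=4 for 4kB pages, with 8kB being the default,
which is why nearly every deployed cluster is on 8K.)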

Appreciate any insights you can share on this.

Thanks!

Stephen


signature.asc
Description: PGP signature


Re: incorrect results and different plan with 2 very similar queries

2024-03-27 Thread Dave Cramer
Dave Cramer


On Wed, 27 Mar 2024 at 17:57, David Rowley  wrote:

> On Thu, 28 Mar 2024 at 10:33, Dave Cramer  wrote:
> > There is a report on the pgjdbc github JDBC Driver shows erratic
> behavior when filtering on CURRENT_DATE · pgjdbc/pgjdbc · Discussion #3184 (
> github.com)
> >
> > Here are the plans.
> >
> > JDBC - Nested Loop (incorrect result)
> >
> > Index Cond: (mutation >= ((CURRENT_DATE -
> '1971-12-31'::date) - 28))
>
> > JDBC - Hash Right (correct result)
> >
> > Recheck Cond: (mutation >= ((CURRENT_DATE -
> '1971-12-31'::date) - 29))
>
> I don't see any version details or queries, but going by the
> conditions above, the queries don't appear to be the same, so
> different results aren't too surprising and not a demonstration that
> there's any sort of bug.
>

Sorry, you are correct. Version is 12.14. Here is the query

SELECT
  p.partseqno_i
, p.partno
, p.partmatch
, pfe.average_price
, pfe.sales_price
, pfe.purch_price
, pfe.average_price_2
, pfe.avg_repair_cost
, pfe.average_price_func
, pfe.fsv
, pfe.fsv_func
, p.status

FROM part p
LEFT JOIN part_fa_entity pfe ON (p.partno = pfe.partno)
WHERE 1=1
AND (p.mutation >= (CURRENT_DATE - '1971-12-31'::date)-27) ORDER BY p.partno

Dave


> David
>


Re: Possibility to disable `ALTER SYSTEM`

2024-03-27 Thread Bruce Momjian
On Wed, Mar 27, 2024 at 11:05:55AM -0400, Robert Haas wrote:
> On Wed, Mar 27, 2024 at 10:43 AM Jelte Fennema-Nio  wrote:
> > Alright, changed the GUC name to "allow_alter_system" since that seems
> > to have the most "votes". One other option would be to call it simply
> > "alter_system", just like "jit" is not called "allow_jit" or
> > "enable_jit".
> >
> > But personally I feel that the "allow_alter_system" is clearer than
> > plain "alter_system" for the GUC name.
> 
> I agree, and have committed your 0001.

So, I email "Is this really a patch we think we can push into PG 17. I
am having my doubts," and the patch is applied a few hours after my
email.  Wow.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: Streaming I/O, vectored I/O (WIP)

2024-03-27 Thread Thomas Munro
On Thu, Mar 28, 2024 at 10:52 AM Thomas Munro  wrote:
> I think 1 is good, as a rescan is even more likely to find the pages
> in cache, and if that turns out to be wrong it'll very soon adjust.

Hmm, no I take that back, it probably won't be due to the
strategy/ring...  I see your point now... when I had a separate flag,
the old distance was remembered across but now I'm zapping it.  I was
trying to minimise the number of variables that have to be tested in
the fast path by consolidating.  Hmm, it is signed -- would it be too
weird if we used a negative number for "finished", so we can just flip
it on reset?
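
Something like this is what I mean (sketch; the helper names are invented
for illustration):

static inline void
read_stream_mark_finished(ReadStream *stream)
{
	stream->distance = -stream->distance;	/* <= 0 now means "finished" */
}

static inline void
read_stream_unmark_finished(ReadStream *stream)
{
	stream->distance = -stream->distance;	/* restore previous look-ahead */
}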




Re: Possibility to disable `ALTER SYSTEM`

2024-03-27 Thread Bruce Momjian
On Wed, Mar 27, 2024 at 04:50:27PM +0100, Jelte Fennema-Nio wrote:
> > This wording was suggested upthread. I think the point here is that if
> > the superuser is logging in from the local machine, it's obvious that
> > they can do whatever they want. The point is to emphasize that a
> > superuser without a local login can, too.
> 
> Changed this from "remotely using other means" to "using other SQL commands".

Yes, I like the SQL emphasis since "remote" just doesn't seem like the
right thing to highlight here.

> > > > +some outside mechanism. In such environments, using 
> > > > ALTER
> > > > +SYSTEM to make configuration changes might appear to 
> > > > work,
> > > > +but then may be discarded at some point in the future when 
> > > > that outside
> > >
> > > "might"
> >
> > This does not seem like a mistake to me. I'm not sure why you think it is.
> 
> I also think the original sentence was correct, but I don't think it
> read very naturally. Changed it now in hopes to improve that.

So, might means "possibility" while "may" means permission, so "might"
is clearer here.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: Popcount optimization using AVX512

2024-03-27 Thread Nathan Bossart
On Mon, Mar 25, 2024 at 03:05:51PM -0500, Nathan Bossart wrote:
> On Mon, Mar 25, 2024 at 06:42:36PM +, Amonson, Paul D wrote:
>> Ok, CI turned green after my re-post of the patches.  Can this please get
>> merged?
> 
> Thanks for the new patches.  I intend to take another look soon.

Thanks for your patience.  I spent most of my afternoon looking into the
latest patch set, but I needed to do a CHECKPOINT and take a break.  I am
in the middle of doing some rather heavy editorialization, but the core of
your changes will remain the same (and so I still intend to give you
authorship credit).  I've attached what I have so far, which is still
missing the configuration checks and the changes to make sure the extra
compiler flags make it to the right places.

Unless something pops up while I work on the remainder of this patch, I
think we'll end up going with a simpler approach.  I originally set out to
make this look like the CRC32C stuff (e.g., a file per implementation), but
that seemed primarily useful if we can choose which files need to be
compiled at configure-time.  However, the TRY_POPCNT_FAST macro is defined
at compile-time (AFAICT for good reason [0]), so we end up having to
compile all the files in many cases anyway, and we continue to need to
surround lots of code with "#ifdef TRY_POPCNT_FAST" or similar.  So, my
current thinking is that we should only move the AVX512 stuff to its own
file for the purposes of compiling it with special flags when possible.  (I
realize that I'm essentially recanting much of my previous feedback, which
I apologize for.)

[0] 
https://postgr.es/m/CAApHDvrONNcYxGV6C0O3ZmaL0BvXBWY%2BrBOCBuYcQVUOURwhkA%40mail.gmail.com

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From 031eb4a365665edd304f0281ad7e412341504749 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Wed, 27 Mar 2024 16:39:24 -0500
Subject: [PATCH v13 1/1] AVX512 popcount support

---
 src/include/port/pg_bitutils.h | 16 +++
 src/port/Makefile  |  1 +
 src/port/meson.build   |  1 +
 src/port/pg_bitutils.c | 53 
 src/port/pg_popcount_avx512.c  | 88 ++
 5 files changed, 125 insertions(+), 34 deletions(-)
 create mode 100644 src/port/pg_popcount_avx512.c

diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 53e5239717..4b1e4d92b4 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -298,6 +298,22 @@ pg_ceil_log2_64(uint64 num)
 #endif
 #endif
 
+/*
+ * We can also try to use the AVX512 popcount instruction on some systems.
+ * The implementation of that is located in its own file because it may
+ * require special compiler flags that we don't want to apply to any other
+ * files.
+ */
+#if defined(TRY_POPCNT_FAST) && \
+	defined(HAVE__IMMINTRIN) && \
+	defined(HAVE__AVX512_POPCNT)
+#if defined(HAVE__GET_CPUID_COUNT) || defined(HAVE__CPUIDEX)
+#define TRY_POPCNT_AVX512 1
+extern bool pg_popcount_avx512_available(void);
+extern uint64 pg_popcount_avx512(const char *buf, int bytes);
+#endif
+#endif
+
 #ifdef TRY_POPCNT_FAST
 /* Attempt to use the POPCNT instruction, but perform a runtime check first */
 extern PGDLLIMPORT int (*pg_popcount32) (uint32 word);
diff --git a/src/port/Makefile b/src/port/Makefile
index dcc8737e68..eb1e56fe41 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -44,6 +44,7 @@ OBJS = \
 	noblock.o \
 	path.o \
 	pg_bitutils.o \
+	pg_popcount_avx512.o \
 	pg_strong_random.o \
 	pgcheckdir.o \
 	pgmkdirp.o \
diff --git a/src/port/meson.build b/src/port/meson.build
index 92b593e6ef..c77bbd3168 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -7,6 +7,7 @@ pgport_sources = [
   'noblock.c',
   'path.c',
   'pg_bitutils.c',
+  'pg_popcount_avx512.c',
   'pg_strong_random.c',
   'pgcheckdir.c',
   'pgmkdirp.c',
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index 1197696e97..2f9a6690e0 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -142,20 +142,18 @@ pg_popcount_available(void)
 	return (exx[2] & (1 << 23)) != 0;	/* POPCNT */
 }
 
-/*
- * These functions get called on the first call to pg_popcount32 etc.
- * They detect whether we can use the asm implementations, and replace
- * the function pointers so that subsequent calls are routed directly to
- * the chosen implementation.
- */
-static int
-pg_popcount32_choose(uint32 word)
+static inline void
+choose_popcount_functions(void)
 {
 	if (pg_popcount_available())
 	{
 		pg_popcount32 = pg_popcount32_fast;
 		pg_popcount64 = pg_popcount64_fast;
 		pg_popcount = pg_popcount_fast;
+#ifdef TRY_POPCNT_AVX512
+		if (pg_popcount_avx512_available())
+			pg_popcount = pg_popcount_avx512;
+#endif
 	}
 	else
 	{
@@ -163,45 +161,32 @@ pg_popcount32_choose(uint32 word)
 		pg_popcount64 = pg_popcount64_slow;
 		pg_popcount = pg_popcount_slow;
 	}
+}
 
+/*
+ * These functions get called on the first call to pg_popcount32 etc.
+ * They detect 

Re: incorrect results and different plan with 2 very similar queries

2024-03-27 Thread David Rowley
On Thu, 28 Mar 2024 at 10:33, Dave Cramer  wrote:
> There is a report on the pgjdbc github JDBC Driver shows erratic behavior 
> when filtering on CURRENT_DATE · pgjdbc/pgjdbc · Discussion #3184 (github.com)
>
> Here are the plans.
>
> JDBC - Nested Loop (incorrect result)
>
> Index Cond: (mutation >= ((CURRENT_DATE - 
> '1971-12-31'::date) - 28))

> JDBC - Hash Right (correct result)
>
> Recheck Cond: (mutation >= ((CURRENT_DATE - 
> '1971-12-31'::date) - 29))

I don't see any version details or queries, but going by the
conditions above, the queries don't appear to be the same, so
different results aren't too surprising and not a demonstration that
there's any sort of bug.

David




Re: Streaming I/O, vectored I/O (WIP)

2024-03-27 Thread Thomas Munro
On Thu, Mar 28, 2024 at 9:43 AM Melanie Plageman
 wrote:
> For sequential scan, I added a little reset function to the streaming
> read API (read_stream_reset()) that just releases all the buffers.
> Previously, it set finished to true before releasing the buffers (to
> indicate it was done) and then set it back to false after. Now, I'll
> set distance to 0 before releasing the buffers and !0 after. I could
> just restore whatever value distance had before I set it to 0. Or I
> could set it to 1. But, thinking about it, are we sure we want to ramp
> up in the same way on rescans? Maybe we want to use some information
> from the previous scan to determine what to set distance to? Maybe I'm
> overcomplicating it...

I think 1 is good, as a rescan is even more likely to find the pages
in cache, and if that turns out to be wrong it'll very soon adjust.




Re: add AVX2 support to simd.h

2024-03-27 Thread Nathan Bossart
On Wed, Mar 27, 2024 at 05:10:13PM -0400, Tom Lane wrote:
> Shouldn't "i" be declared uint32, since nelem is?

Yes, that's a mistake.

> BTW, I wonder why these functions don't declare their array
> arguments like "const uint32 *base".

They probably should.  I don't see any reason not to, and my compiler
doesn't complain, either.
 
> LGTM otherwise, and I like the fact that the #if structure
> gets a lot less messy.

Thanks for reviewing.  I've attached a v2 that I intend to commit when I
get a chance.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From 6f259917230bf70284f0ec1186bed17d9b4a Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Wed, 27 Mar 2024 13:50:17 -0500
Subject: [PATCH v2 1/1] improve style of pg_lfind32()

---
 src/include/port/pg_lfind.h | 62 -
 1 file changed, 26 insertions(+), 36 deletions(-)

diff --git a/src/include/port/pg_lfind.h b/src/include/port/pg_lfind.h
index 33e8471b03..4b1431ed00 100644
--- a/src/include/port/pg_lfind.h
+++ b/src/include/port/pg_lfind.h
@@ -80,6 +80,24 @@ pg_lfind8_le(uint8 key, uint8 *base, uint32 nelem)
 	return false;
 }
 
+/*
+ * pg_lfind32_one_by_one_helper
+ *
+ * Searches the array of integers one-by-one.  The caller is responsible for
+ * ensuring that there are at least "nelem" integers in the array.
+ */
+static inline bool
+pg_lfind32_one_by_one_helper(uint32 key, const uint32 *base, uint32 nelem)
+{
+	for (uint32 i = 0; i < nelem; i++)
+	{
+		if (key == base[i])
+			return true;
+	}
+
+	return false;
+}
+
 #ifndef USE_NO_SIMD
 /*
  * pg_lfind32_simd_helper
@@ -88,7 +106,7 @@ pg_lfind8_le(uint8 key, uint8 *base, uint32 nelem)
  * ensuring that there are at least 4-registers-worth of integers remaining.
  */
 static inline bool
-pg_lfind32_simd_helper(const Vector32 keys, uint32 *base)
+pg_lfind32_simd_helper(const Vector32 keys, const uint32 *base)
 {
 	const uint32 nelem_per_vector = sizeof(Vector32) / sizeof(uint32);
 	Vector32	vals1,
@@ -132,11 +150,10 @@ pg_lfind32_simd_helper(const Vector32 keys, uint32 *base)
  * return false.
  */
 static inline bool
-pg_lfind32(uint32 key, uint32 *base, uint32 nelem)
+pg_lfind32(uint32 key, const uint32 *base, uint32 nelem)
 {
-	uint32		i = 0;
-
 #ifndef USE_NO_SIMD
+	uint32		i = 0;
 
 	/*
 	 * For better instruction-level parallelism, each loop iteration operates
@@ -150,25 +167,15 @@ pg_lfind32(uint32 key, uint32 *base, uint32 nelem)
 	const uint32 tail_idx = nelem & ~(nelem_per_iteration - 1);
 
 #if defined(USE_ASSERT_CHECKING)
-	bool		assert_result = false;
-
-	/* pre-compute the result for assert checking */
-	for (int j = 0; j < nelem; j++)
-	{
-		if (key == base[j])
-		{
-			assert_result = true;
-			break;
-		}
-	}
+	bool		assert_result = pg_lfind32_one_by_one_helper(key, base, nelem);
 #endif
 
 	/*
-	 * If there aren't enough elements for the SIMD code, jump to the standard
+	 * If there aren't enough elements for the SIMD code, use the standard
 	 * one-by-one linear search code.
 	 */
 	if (nelem < nelem_per_iteration)
-		goto one_by_one;
+		return pg_lfind32_one_by_one_helper(key, base, nelem);
 
 	/*
 	 * Process as many elements as possible with a block of 4 registers.
@@ -193,27 +200,10 @@ pg_lfind32(uint32 key, uint32 *base, uint32 nelem)
 	 */
 	Assert(assert_result == pg_lfind32_simd_helper(keys, &base[nelem - nelem_per_iteration]));
 	return pg_lfind32_simd_helper(keys, &base[nelem - nelem_per_iteration]);
-
-one_by_one:
-
-#endif			/* ! USE_NO_SIMD */
-
+#else
 	/* Process the elements one at a time. */
-	for (; i < nelem; i++)
-	{
-		if (key == base[i])
-		{
-#ifndef USE_NO_SIMD
-			Assert(assert_result == true);
+	return pg_lfind32_one_by_one_helper(key, base, nelem);
 #endif
-			return true;
-		}
-	}
-
-#ifndef USE_NO_SIMD
-	Assert(assert_result == false);
-#endif
-	return false;
 }
 
 #endif			/* PG_LFIND_H */
-- 
2.25.1



incorrect results and different plan with 2 very similar queries

2024-03-27 Thread Dave Cramer
Greetings,

There is a report on the pgjdbc github JDBC Driver shows erratic behavior
when filtering on CURRENT_DATE · pgjdbc/pgjdbc · Discussion #3184
(github.com) 

Here are the plans.

JDBC - Nested Loop (incorrect result)

Sort  (cost=1071.31..1071.60 rows=114 width=83) (actual time=2.894..2.912
rows=330 loops=1)
  Sort Key: p.partno
  Sort Method: quicksort  Memory: 70kB
  ->  Nested Loop Left Join  (cost=9.46..1067.42 rows=114 width=83) (actual
time=0.082..2.446 rows=330 loops=1)
->  Bitmap Heap Scan on part p  (cost=9.18..295.79 rows=114
width=29) (actual time=0.064..0.502 rows=330 loops=1)
  Recheck Cond: (mutation >= ((CURRENT_DATE -
'1971-12-31'::date) - 28))
  Heap Blocks: exact=181
  ->  Bitmap Index Scan on i_42609  (cost=0.00..9.15 rows=114
width=0) (actual time=0.041..0.041 rows=344 loops=1)
Index Cond: (mutation >= ((CURRENT_DATE -
'1971-12-31'::date) - 28))
->  Index Scan using i_39773 on part_fa_entity pfe
(cost=0.29..6.76 rows=1 width=65) (actual time=0.005..0.005 rows=1
loops=330)
  Index Cond: ((partno)::text = (p.partno)::text)
Planning Time: 0.418 ms
Execution Time: 2.971 ms

JDBC - Hash Right (correct result)

Sort  (cost=1352.35..1352.94 rows=238 width=83) (actual time=5.214..5.236
rows=345 loops=1)
  Sort Key: p.partno
  Sort Method: quicksort  Memory: 73kB
  ->  Hash Right Join  (cost=472.00..1342.95 rows=238 width=83) (actual
time=0.654..4.714 rows=345 loops=1)
Hash Cond: ((pfe.partno)::text = (p.partno)::text)
->  Seq Scan on part_fa_entity pfe  (cost=0.00..837.27 rows=12827
width=65) (actual time=0.009..2.191 rows=12827 loops=1)
->  Hash  (cost=469.03..469.03 rows=238 width=29) (actual
time=0.623..0.624 rows=345 loops=1)
  Buckets: 1024  Batches: 1  Memory Usage: 30kB
  ->  Bitmap Heap Scan on part p  (cost=18.14..469.03 rows=238
width=29) (actual time=0.073..0.532 rows=345 loops=1)
Recheck Cond: (mutation >= ((CURRENT_DATE -
'1971-12-31'::date) - 29))
Heap Blocks: exact=186
->  Bitmap Index Scan on i_42609  (cost=0.00..18.08
rows=238 width=0) (actual time=0.049..0.049 rows=359 loops=1)
  Index Cond: (mutation >= ((CURRENT_DATE -
'1971-12-31'::date) - 29))
Planning Time: 0.304 ms
Execution Time: 5.292 ms

AppX - Nested Loop (correct result)

Sort  (cost=1071.31..1071.60 rows=114 width=83) (actual time=3.083..3.102
rows=330 loops=1)
  Output: p.partseqno_i, p.partno, p.partmatch, pfe.average_price,
pfe.sales_price, pfe.purch_price, pfe.average_price_2, pfe.avg_repair_cost,
pfe.average_price_func, pfe.fsv, pfe.fsv_func, p.status
  Sort Key: p.partno
  Sort Method: quicksort  Memory: 71kB
  ->  Nested Loop Left Join  (cost=9.46..1067.42 rows=114 width=83) (actual
time=0.069..2.471 rows=330 loops=1)
Output: p.partseqno_i, p.partno, p.partmatch, pfe.average_price,
pfe.sales_price, pfe.purch_price, pfe.average_price_2, pfe.avg_repair_cost,
pfe.average_price_func, pfe.fsv, pfe.fsv_func, p.status
->  Bitmap Heap Scan on .part p  (cost=9.18..295.79 rows=114
width=29) (actual time=0.054..0.308 rows=330 loops=1)
  Output: p.min_safety_stock, p.manual_safety_stock,
p.extended_stateno_i, p.partno, p.partmatch, p.partseqno_i, p.description,
p.remarks, p.specification, p.ata_chapter, p.vendor, p.weight, p.storetime,
p.alert_qty, p.measure_unit, p.waste_code, p.reord_level, p.safety_stock,
p.max_purch, p.ac_typ, p.mat_class, p.mat_type, p.country_origin,
p.reorder, p.tool, p.repairable, p.avg_ta_time, p.default_supplier,
p.default_repair, p.special_contract, p.fixed_asset,
p.reorder_last_mutator, p.reorder_last_mutation, p.max_shop_visit,
p.shop_visit_reset_condition, p.special_measure_unit, p.manufacturer,
p.pma, p.resource_type_id, p.counter_template_groupno_i, p.mutation,
p.mutator, p.status, p.mutation_time, p.created_by, p.created_date
  Recheck Cond: (p.mutation >= ((CURRENT_DATE -
'1971-12-31'::date) - 28))
  Heap Blocks: exact=181
  ->  Bitmap Index Scan on i_42609  (cost=0.00..9.15 rows=114
width=0) (actual time=0.033..0.034 rows=341 loops=1)
Index Cond: (p.mutation >= ((CURRENT_DATE -
'1971-12-31'::date) - 28))
->  Index Scan using i_39773 on .part_fa_entity pfe
(cost=0.29..6.76 rows=1 width=65) (actual time=0.005..0.006 rows=1
loops=330)
  Output: pfe.part_fa_entityno_i, pfe.partno, pfe.entityno_i,
pfe.average_price, pfe.sales_price, pfe.purch_price, pfe.average_price_2,
pfe.avg_repair_cost, pfe.avg_repair_cost_func, pfe.fa_qty,
pfe.fa_open_iv_qty, pfe.fa_start_qty, pfe.fa_start_price,
pfe.fa_start_price_2, pfe.mutation, pfe.mutator, pfe.status,
pfe.mutation_time, pfe.created_by, pfe.created_date,
pfe.average_price_func, pfe.fa_start_price_func, pfe.fsv, pfe.fsv_func
  Index Cond: ((pfe.partno)::text = 

Re: [PATCH] plpython function causes server panic

2024-03-27 Thread Tom Lane
Robert Haas  writes:
> On Mon, Mar 25, 2024 at 11:36 AM Tom Lane  wrote:
>> ...  I don't see why this case is different,
>> especially when the added cost compared to HEAD is not much more than
>> one C function call.

> Well, I explained why *I* thought it was different, but obviously you
> don't agree.

After mulling it over for awhile, I still think the extra checking
is appropriate, especially since this patch is enlarging the set of
things that can happen in parallel mode.  How do you want to proceed?

regards, tom lane




Re: add AVX2 support to simd.h

2024-03-27 Thread Tom Lane
Nathan Bossart  writes:
> Here's what I had in mind.  My usual benchmark seems to indicate that this
> shouldn't impact performance.

Shouldn't "i" be declared uint32, since nelem is?

BTW, I wonder why these functions don't declare their array
arguments like "const uint32 *base".

LGTM otherwise, and I like the fact that the #if structure
gets a lot less messy.

regards, tom lane




Re: Streaming I/O, vectored I/O (WIP)

2024-03-27 Thread Melanie Plageman
On Wed, Mar 27, 2024 at 10:11 AM Thomas Munro  wrote:
>
> I got rid of "finished" (now represented by distance == 0, I was
> removing branches and variables).  I got rid of "started", which can
> now be deduced (used for suppressing advice when you're calling
> _next() because you need a block and we need to read it immediately),
> see the function argument suppress_advice.

I started rebasing the sequential scan streaming read user over this
new version, and this change (finished now represented with distance
== 0) made me realize that I'm not sure what to set distance to on
rescan.

For sequential scan, I added a little reset function to the streaming
read API (read_stream_reset()) that just releases all the buffers.
Previously, it set finished to true before releasing the buffers (to
indicate it was done) and then set it back to false after. Now, I'll
set distance to 0 before releasing the buffers and !0 after. I could
just restore whatever value distance had before I set it to 0. Or I
could set it to 1. But, thinking about it, are we sure we want to ramp
up in the same way on rescans? Maybe we want to use some information
from the previous scan to determine what to set distance to? Maybe I'm
overcomplicating it...
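
In other words, something like this (a sketch, using the consolidated
"distance" field):

void
read_stream_reset(ReadStream *stream)
{
	stream->distance = 0;	/* look "finished" while draining buffers */
	/* ... unpin all buffers held by the stream ... */
	stream->distance = 1;	/* or the pre-reset value? ramps up again */
}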

> Here is a new proposal for the names, updated in v10:
>
> read_stream_begin_relation()
> read_stream_next_buffer()
> void read_stream_end()

Personally, I'm happy with these.

- Melanie




Re: Add SPLIT PARTITION/MERGE PARTITIONS commands

2024-03-27 Thread Dmitry Koval

Hi!

> I've fixed that by skipping a copy of the identity of another
> partition (remove CREATE_TABLE_LIKE_IDENTITY from
> TableLikeClause.options).

Thanks for correction!
Probably I should have looked at the code more closely after commit [1]. 
I'm also very glad that situation [2] was reproduced.


> When merging partitions you're creating a merged partition like the
> parent table.   But when splitting a partition you're creating new
> partitions like the partition being split.  What motivates this
> difference?

When splitting a partition, I planned to set parameters for each of the 
new partitions (for example, tablespace parameter).
It would make sense if we want to transfer part of the data of the 
partition being split to a slower (archive) storage device.
Right now I haven't seen any interest in this functionality, so it 
hasn't been implemented yet. But I think this will be needed in the future.


Special thanks for the hint that new structures should be added to the 
list src/tools/pgindent/typedefs.list.


Links.
[1] 
https://github.com/postgres/postgres/commit/699586315704a8268808e3bdba4cb5924a038c49


[2] 
https://www.postgresql.org/message-id/171085360143.2046436.7217841141682511557.pgcf%40coridan.postgresql.org


--
With best regards,
Dmitry Koval

Postgres Professional: http://postgrespro.com




Re: Use streaming read API in ANALYZE

2024-03-27 Thread Melanie Plageman
On Tue, Mar 26, 2024 at 02:51:27PM +0300, Nazir Bilal Yavuz wrote:
> Hi,
> 
> On Wed, 28 Feb 2024 at 14:42, Nazir Bilal Yavuz  wrote:
> >
> >
> > The new version of the streaming read API [1] is posted. I updated the
> > streaming read API changes patch (0001), using the streaming read API
> > in ANALYZE patch (0002) remains the same. This should make it easier
> > to review as it can be applied on top of master
> >
> >
> 
> The new version of the streaming read API is posted [1]. I rebased the
> patch on top of master and v9 of the streaming read API.
> 
> There is a minimal change in the 'using the streaming read API in ANALYZE
> patch (0002)', I changed STREAMING_READ_FULL to STREAMING_READ_MAINTENANCE
> to copy exactly the same behavior as before. Also, some benchmarking
> results:
> 
> I created a 22 GB table and set the size of shared buffers to 30GB, the
> rest is default.
> 
> ╔═══╦═╦╗
> ║   ║  Avg Timings in ms  ║║
> ╠═══╬══╦══╬╣
> ║   ║  master  ║  patched ║ percentage ║
> ╠═══╬══╬══╬╣
> ║ Both OS cache and ║  ║  ║║
> ║  shared buffers are clear ║ 513.9247 ║ 463.1019 ║%9.9║
> ╠═══╬══╬══╬╣
> ║   OS cache is loaded but  ║  ║  ║║
> ║  shared buffers are clear ║ 423.1097 ║ 354.3277 ║%16.3   ║
> ╠═══╬══╬══╬╣
> ║ Shared buffers are loaded ║  ║  ║║
> ║   ║  89.2846 ║  84.6952 ║%5.1║
> ╚═══╩══╩══╩╝
> 
> Any kind of feedback would be appreciated.

Thanks for working on this!

A general review comment: I noticed you have the old streaming read
(pgsr) naming still in a few places (including comments) -- so I would
just make sure and update everywhere when you rebase in Thomas' latest
version of the read stream API.

> From c7500cc1b9068ff0b704181442999cd8bed58658 Mon Sep 17 00:00:00 2001
> From: Nazir Bilal Yavuz 
> Date: Mon, 19 Feb 2024 14:30:47 +0300
> Subject: [PATCH v3 2/2] Use streaming read API in ANALYZE
>
> --- a/src/backend/commands/analyze.c
> +++ b/src/backend/commands/analyze.c
> @@ -1102,6 +1102,26 @@ examine_attribute(Relation onerel, int attnum, Node 
> *index_expr)
>   return stats;
>  }
>  
> +/*
> + * Prefetch callback function to get next block number while using
> + * BlockSampling algorithm
> + */
> +static BlockNumber
> +pg_block_sampling_streaming_read_next(StreamingRead *stream,
> +                                      void *user_data,
> +                                      void *per_buffer_data)

I don't think you need the pg_ prefix.

> +{
> + BlockSamplerData *bs = user_data;
> + BlockNumber *current_block = per_buffer_data;

Why can't you just do BufferGetBlockNumber() on the buffer returned from
the read stream API instead of allocating per_buffer_data for the block
number?
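
i.e. something like this (sketch; the next-buffer function name is
approximate for this version of the API):

buf = streaming_read_buffer_next(stream, NULL);
if (!BufferIsValid(buf))
	return false;		/* stream exhausted */
block = BufferGetBlockNumber(buf);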

> +
> + if (BlockSampler_HasMore(bs))
> + *current_block = BlockSampler_Next(bs);
> + else
> + *current_block = InvalidBlockNumber;
> +
> + return *current_block;


I think we'd like to keep the read stream code in heapam-specific code.
Instead of doing streaming_read_buffer_begin() here, you could put this
in heap_beginscan() or initscan() guarded by
scan->rs_base.rs_flags & SO_TYPE_ANALYZE

same with streaming_read_buffer_end()/heap_endscan().

You'd also then need to save the reference to the read stream in the
HeapScanDescData.

> +	stream = streaming_read_buffer_begin(STREAMING_READ_MAINTENANCE,
> +	                                     vac_strategy,
> +	                                     BMR_REL(scan->rs_rd),
> +	                                     MAIN_FORKNUM,
> +	                                     pg_block_sampling_streaming_read_next,
> +	                                     &bs,
> +	                                     sizeof(BlockSamplerData));
>  
>   /* Outer loop over blocks to sample */

In fact, I think you could use this opportunity to get rid of the block
dependency in acquire_sample_rows() altogether.

Looking at the code now, it seems like you could just invoke
heapam_scan_analyze_next_block() (maybe rename it to
heapam_scan_analyze_next_buffer() or something) from
heapam_scan_analyze_next_tuple() and remove
table_scan_analyze_next_block() entirely.

Then table AMs can figure out how they want 

Re: Slow GRANT ROLE on PostgreSQL 16 with thousands of ROLEs

2024-03-27 Thread Ranier Vilela
Em qua., 27 de mar. de 2024 às 14:35, Nathan Bossart <
nathandboss...@gmail.com> escreveu:

> On Wed, Mar 27, 2024 at 01:47:38PM -0300, Ranier Vilela wrote:
> > Em qua., 27 de mar. de 2024 às 13:41, Nathan Bossart <
> > nathandboss...@gmail.com> escreveu:
> >> On Wed, Mar 27, 2024 at 01:21:23PM -0300, Ranier Vilela wrote:
> >> > I think that commit d365ae7
> >> > <https://github.com/postgres/postgres/commit/d365ae705409f5d9c81da4b668f59c3598feb512>
> >> > left an oversight.
> >> >
> >> > If admin_role is a NULL pointer, it can be dereferenced
> >> > in the main loop of the function roles_is_member_of and,
> >> > worse, IMO, could corrupt arbitrary memory.
> >> >
> >> > First, a better shortcut is to test whether admin_role is NOT NULL.
> >> > Second, !OidIsValid(*admin_role) no longer seems necessary.
> >> >
> >> > Or am I missing something?
> >>
> >> If admin_role is NULL, then admin_of is expected to be set to
> InvalidOid.
> >> See the assertion at the top of the function.  AFAICT the code that
> >> dereferences admin_role short-circuits if admin_of is invalid.
> >>
> > These conditions seem a little fragile and confusing to me.
> > A simple test protects the pointer and avoids a series of checks
> > that are unnecessary when the pointer is NULL.
>
> Maybe.  But that doesn't seem like an oversight in commit d365ae7.
>
Sorry for overstating it.

>
> -   if (otherid == admin_of && form->admin_option &&
> -       OidIsValid(admin_of) && !OidIsValid(*admin_role))
> +   if (admin_role != NULL && otherid == admin_of &&
> +       form->admin_option && OidIsValid(admin_of))
>         *admin_role = memberid;
>
> I'm not following why it's safe to remove the !OidIsValid(*admin_role)
> check here.  We don't want to overwrite a previously-set value of
> *admin_role, as per the comment above roles_is_member_of():
>
>  * If admin_of is not InvalidOid, this function sets *admin_role, either
>  * to the OID of the first role in the result list that directly possesses
>  * ADMIN OPTION on the role corresponding to admin_of, or to InvalidOid if
>  * there is no such role.
>
Ok. If admin_role is NOT NULL, then *admin_role is InvalidOid, from the
initialization at the top of the function.

I think a cheap test, *admin_role == InvalidOid, would be enough?
What do you think?
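
Sketched, the combined test would be (equivalent to the existing
!OidIsValid(*admin_role) check):

    if (admin_role != NULL && *admin_role == InvalidOid &&
        otherid == admin_of && form->admin_option)
        *admin_role = memberid;     /* record only the first such role */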

v1 attached.

best regards,
Ranier Vilela


v1-avoid-dereference-a-null-pointer-acl.patch
Description: Binary data


Re: Possibility to disable `ALTER SYSTEM`

2024-03-27 Thread Maciek Sakrejda
On Wed, Mar 27, 2024, 11:46 Robert Haas  wrote:

> On Wed, Mar 27, 2024 at 1:12 PM Isaac Morland 
> wrote:
> > On Wed, 27 Mar 2024 at 13:05, Greg Sabino Mullane 
> wrote:
> >>> The purpose of the setting is to prevent accidental
> >>> modifications via ALTER SYSTEM in environments where
> >> The emphasis on 'accidental' seems a bit heavy here, and odd. Surely,
> >> just "to prevent modifications via ALTER SYSTEM in environments where..."
> >> is enough?
> > Not necessarily disagreeing, but it's very important nobody ever mistake
> this for a security feature. I don't know if the extra word "accidental" is
> necessary, but I think that's the motivation.
>
> I think the emphasis is entirely warranted in this case.


+1. And while "non-malicious" may technically be more correct, I don't
think it's any clearer.

>


Re: add AVX2 support to simd.h

2024-03-27 Thread Nathan Bossart
On Tue, Mar 26, 2024 at 09:48:57PM -0400, Tom Lane wrote:
> Nathan Bossart  writes:
>> I just did the minimal fix for now, i.e., I moved the new label into the
>> SIMD section of the function.  I think it would be better stylistically to
>> move the one-by-one logic to an inline helper function, but I didn't do
>> that just in case it might negatively impact performance.  I'll look into
>> this and will follow up with another patch if it looks good.
> 
> Sounds like a plan.

Here's what I had in mind.  My usual benchmark seems to indicate that this
shouldn't impact performance.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From c3f163753246c5ec82dd8c5dba70232cbeebbf2a Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Wed, 27 Mar 2024 13:50:17 -0500
Subject: [PATCH v1 1/1] improve style of pg_lfind32()

---
 src/include/port/pg_lfind.h | 58 +++--
 1 file changed, 24 insertions(+), 34 deletions(-)

diff --git a/src/include/port/pg_lfind.h b/src/include/port/pg_lfind.h
index 33e8471b03..5b76cc8937 100644
--- a/src/include/port/pg_lfind.h
+++ b/src/include/port/pg_lfind.h
@@ -80,6 +80,24 @@ pg_lfind8_le(uint8 key, uint8 *base, uint32 nelem)
 	return false;
 }
 
+/*
+ * pg_lfind32_one_by_one_helper
+ *
+ * Searches the array of integers one-by-one.  The caller is responsible for
+ * ensuring that there are at least "nelem" integers in the array.
+ */
+static inline bool
+pg_lfind32_one_by_one_helper(uint32 key, uint32 *base, uint32 nelem)
+{
+	for (int i = 0; i < nelem; i++)
+	{
+		if (key == base[i])
+			return true;
+	}
+
+	return false;
+}
+
 #ifndef USE_NO_SIMD
 /*
  * pg_lfind32_simd_helper
@@ -134,9 +152,8 @@ pg_lfind32_simd_helper(const Vector32 keys, uint32 *base)
 static inline bool
 pg_lfind32(uint32 key, uint32 *base, uint32 nelem)
 {
-	uint32		i = 0;
-
 #ifndef USE_NO_SIMD
+	uint32		i = 0;
 
 	/*
 	 * For better instruction-level parallelism, each loop iteration operates
@@ -150,25 +167,15 @@ pg_lfind32(uint32 key, uint32 *base, uint32 nelem)
 	const uint32 tail_idx = nelem & ~(nelem_per_iteration - 1);
 
 #if defined(USE_ASSERT_CHECKING)
-	bool		assert_result = false;
-
-	/* pre-compute the result for assert checking */
-	for (int j = 0; j < nelem; j++)
-	{
-		if (key == base[j])
-		{
-			assert_result = true;
-			break;
-		}
-	}
+	bool		assert_result = pg_lfind32_one_by_one_helper(key, base, nelem);
 #endif
 
 	/*
-	 * If there aren't enough elements for the SIMD code, jump to the standard
+	 * If there aren't enough elements for the SIMD code, use the standard
 	 * one-by-one linear search code.
 	 */
 	if (nelem < nelem_per_iteration)
-		goto one_by_one;
+		return pg_lfind32_one_by_one_helper(key, base, nelem);
 
 	/*
 	 * Process as many elements as possible with a block of 4 registers.
@@ -193,27 +200,10 @@ pg_lfind32(uint32 key, uint32 *base, uint32 nelem)
 	 */
	Assert(assert_result == pg_lfind32_simd_helper(keys, &base[nelem - nelem_per_iteration]));
	return pg_lfind32_simd_helper(keys, &base[nelem - nelem_per_iteration]);
-
-one_by_one:
-
-#endif			/* ! USE_NO_SIMD */
-
+#else
 	/* Process the elements one at a time. */
-	for (; i < nelem; i++)
-	{
-		if (key == base[i])
-		{
-#ifndef USE_NO_SIMD
-			Assert(assert_result == true);
+	return pg_lfind32_one_by_one_helper(key, base, nelem);
 #endif
-			return true;
-		}
-	}
-
-#ifndef USE_NO_SIMD
-	Assert(assert_result == false);
-#endif
-	return false;
 }
 
 #endif			/* PG_LFIND_H */
-- 
2.25.1
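
For what it's worth, callers are unaffected by this restructuring; a
minimal usage sketch of pg_lfind32() (the wrapper name here is made up):

    #include "postgres.h"
    #include "port/pg_lfind.h"

    static bool
    contains_xid(uint32 xid, uint32 *xids, uint32 nxids)
    {
        /* Same result whether the SIMD or the one-by-one path runs. */
        return pg_lfind32(xid, xids, nxids);
    }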



Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

2024-03-27 Thread Alvaro Herrera
On 2024-Mar-27, Alvaro Herrera wrote:

> I changed it so that the error messages are returned as translated
> phrases, and was bothered by the fact that if errors happen repeatedly,
> the memory for them might be leaked.  Maybe this is fine depending on
> the caller's memory context, but since it's only at most one string each
> time, it's quite easy to just keep track of it so that we can release it
> on the next.

(Actually this sounds clever but fails pretty obviously if the caller
does free the string, such as in a memory context reset.  So I guess we
have to just accept the potential leakage.)

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"La conclusión que podemos sacar de esos estudios es que
no podemos sacar ninguna conclusión de ellos" (Tanenbaum)




Re: Possibility to disable `ALTER SYSTEM`

2024-03-27 Thread Robert Haas
On Wed, Mar 27, 2024 at 1:12 PM Isaac Morland  wrote:
> On Wed, 27 Mar 2024 at 13:05, Greg Sabino Mullane  wrote:
>>> The purpose of the setting is to prevent accidental 
>>> modifications via ALTER SYSTEM in environments where
>> The emphasis on 'accidental' seems a bit heavy here, and odd. Surely, just 
>> "to prevent modifications via ALTER SYSTEM in environments where..." is 
>> enough?
> Not necessarily disagreeing, but it's very important nobody ever mistake this 
> for a security feature. I don't know if the extra word "accidental" is 
> necessary, but I think that's the motivation.

I think the emphasis is entirely warranted in this case.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Combine Prune and Freeze records emitted by vacuum

2024-03-27 Thread Melanie Plageman
On Wed, Mar 27, 2024 at 2:26 PM Melanie Plageman
 wrote:
>
> On Wed, Mar 27, 2024 at 12:18 PM Heikki Linnakangas  wrote:
> >
> > On 27/03/2024 17:18, Melanie Plageman wrote:
> > > I need some way to modify the control flow or accounting such that I
> > > know which HEAPTUPLE_RECENTLY_DEAD tuples will not be marked LP_DEAD.
> > > And a way to consider freezing and do live tuple accounting for these
> > > and HEAPTUPLE_LIVE tuples exactly once.
> >
> > Just a quick update: I've been massaging this some more today, and I
> > think I'm onto something palatable. I'll send an updated patch later
> > today, but the key is to note that for each item on the page, there is
> > one point where we determine the fate of the item, whether it's pruned
> > or not. That can happen at different points in heap_page_prune().
> > That's also when we set marked[offnum] = true. Whenever that happens, we
> > always call one of the heap_page_prune_record_*() subroutines. We already
> > have those subroutines for when a tuple is marked as dead or unused, but
> > let's add similar subroutines for the case that we're leaving the tuple
> > unchanged. If we move all the bookkeeping logic to those subroutines, we
> > can ensure that it gets done exactly once for each tuple, and at that
> > point we know what we are going to do to the tuple, so we can count it
> > correctly. So heap_prune_chain() decides what to do with each tuple, and
> > ensures that each tuple is marked only once, and the subroutines update
> > all the variables, add the item to the correct arrays etc. depending on
> > what we're doing with it.
>
> Yes, this would be ideal.
>
> I was doing some experimentation with pageinspect today (trying to
> find that single place where live tuples' fates are decided) and it
> seems like a heap-only tuple that is not HOT-updated will usually be
> the one at the end of the chain. Which seems like it would be covered
> by adding a record_live() type function call in the loop of
> heap_prune_chain():
>
> /*
>  * If the tuple is not HOT-updated, then we are at the end of this
>  * HOT-update chain.
>  */
> if (!HeapTupleHeaderIsHotUpdated(htup))
> {
> heap_prune_record_live_or_recently_dead(dp, prstate,
> offnum, presult);
> break;
> }
>
> but that doesn't end up producing the same results as
>
> if (HeapTupleHeaderIsHeapOnly(htup)
> && !HeapTupleHeaderIsHotUpdated(htup) &&
> presult->htsv[rootoffnum] == HEAPTUPLE_DEAD)
> heap_prune_record_live_or_recently_dead(dp, prstate,
> offnum, presult);

sorry, that should say presult->htsv[rootoffnum] != HEAPTUPLE_DEAD.

The latter should be a subset of the former. But instead it seems
there are cases I missed by doing only the former.

- Melanie




Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

2024-03-27 Thread Alvaro Herrera
On 2024-Mar-22, Jelte Fennema-Nio wrote:

> On Thu, 21 Mar 2024 at 03:54, Noah Misch  wrote:
> >
> > On Mon, Mar 18, 2024 at 07:40:10PM +0100, Alvaro Herrera wrote:
> > > I enabled the test again and also pushed the changes to dblink,
> > > isolationtester and fe_utils (which AFAICS is used by pg_dump,
> >
> > I recommend adding a libpqsrv_cancel() function to libpq-be-fe-helpers.h, to
> > use from dblink and postgres_fdw.  pgxn modules calling PQcancel() from the
> > backend (citus pg_bulkload plproxy pmpp) then have a better chance to adopt
> > the new way.
> 
> Done

Nice, thanks.  I played with it a bit, mostly trying to figure out if
the chosen API is usable.  I toyed with making it return boolean success
and the error message as an output argument, because I was nervous about
what'd happen in OOM.  But since this is backend environment, what
actually happens is that we elog(ERROR) anyway, so we never return a
NULL error message.  So after the detour I think Jelte's API is okay.

I changed it so that the error messages are returned as translated
phrases, and was bothered by the fact that if errors happen repeatedly,
the memory for them might be leaked.  Maybe this is fine depending on
the caller's memory context, but since it's only at most one string each
time, it's quite easy to just keep track of it so that we can release it
on the next.

I ended up reducing the two PG_TRY blocks to a single one.  I see no
reason to split them up, and this way it looks more legible.

What do you think?

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/
"Tiene valor aquel que admite que es un cobarde" (Fernandel)
diff --git a/contrib/dblink/dblink.c b/contrib/dblink/dblink.c
index edbc9ab02a..de858e165a 100644
--- a/contrib/dblink/dblink.c
+++ b/contrib/dblink/dblink.c
@@ -1347,25 +1347,16 @@ Datum
 dblink_cancel_query(PG_FUNCTION_ARGS)
 {
 	PGconn	   *conn;
-	PGcancelConn *cancelConn;
 	char	   *msg;
+	TimestampTz endtime;
 
 	dblink_init();
 	conn = dblink_get_named_conn(text_to_cstring(PG_GETARG_TEXT_PP(0)));
-	cancelConn = PQcancelCreate(conn);
-
-	PG_TRY();
-	{
-		if (!PQcancelBlocking(cancelConn))
-			msg = pchomp(PQcancelErrorMessage(cancelConn));
-		else
-			msg = "OK";
-	}
-	PG_FINALLY();
-	{
-		PQcancelFinish(cancelConn);
-	}
-	PG_END_TRY();
+	endtime = TimestampTzPlusMilliseconds(GetCurrentTimestamp(),
+										  30000);
+	msg = libpqsrv_cancel(conn, endtime);
+	if (msg == NULL)
+		msg = "OK";
 
 	PG_RETURN_TEXT_P(cstring_to_text(msg));
 }
diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index 4931ebf591..2532e453c4 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -133,7 +133,7 @@ static void pgfdw_inval_callback(Datum arg, int cacheid, uint32 hashvalue);
 static void pgfdw_reject_incomplete_xact_state_change(ConnCacheEntry *entry);
 static void pgfdw_reset_xact_state(ConnCacheEntry *entry, bool toplevel);
 static bool pgfdw_cancel_query(PGconn *conn);
-static bool pgfdw_cancel_query_begin(PGconn *conn);
+static bool pgfdw_cancel_query_begin(PGconn *conn, TimestampTz endtime);
 static bool pgfdw_cancel_query_end(PGconn *conn, TimestampTz endtime,
    bool consume_input);
 static bool pgfdw_exec_cleanup_query(PGconn *conn, const char *query,
@@ -1315,36 +1315,31 @@ pgfdw_cancel_query(PGconn *conn)
 	endtime = TimestampTzPlusMilliseconds(GetCurrentTimestamp(),
 		  CONNECTION_CLEANUP_TIMEOUT);
 
-	if (!pgfdw_cancel_query_begin(conn))
+	if (!pgfdw_cancel_query_begin(conn, endtime))
 		return false;
 	return pgfdw_cancel_query_end(conn, endtime, false);
 }
 
+/*
+ * Submit a cancel request to the given connection, waiting only until
+ * the given time.
+ *
+ * We sleep interruptibly until we receive confirmation that the cancel
+ * request has been accepted, and if it is, return true; if the timeout
+ * lapses without that, or the request fails for whatever reason, return
+ * false.
+ */
 static bool
-pgfdw_cancel_query_begin(PGconn *conn)
+pgfdw_cancel_query_begin(PGconn *conn, TimestampTz endtime)
 {
-	PGcancel   *cancel;
-	char		errbuf[256];
+	char	   *errormsg = libpqsrv_cancel(conn, endtime);
 
-	/*
-	 * Issue cancel request.  Unfortunately, there's no good way to limit the
-	 * amount of time that we might block inside PQgetCancel().
-	 */
-	if ((cancel = PQgetCancel(conn)))
-	{
-		if (!PQcancel(cancel, errbuf, sizeof(errbuf)))
-		{
-			ereport(WARNING,
-	(errcode(ERRCODE_CONNECTION_FAILURE),
-	 errmsg("could not send cancel request: %s",
-			errbuf)));
-			PQfreeCancel(cancel);
-			return false;
-		}
-		PQfreeCancel(cancel);
-	}
+	if (errormsg != NULL)
+		ereport(WARNING,
+				errcode(ERRCODE_CONNECTION_FAILURE),
+				errmsg("could not send cancel request: %s", errormsg));
 
-	return true;
+	return errormsg == NULL;
 }
 
 static bool
@@ -1685,7 +1680,11 @@ pgfdw_abort_cleanup_begin(ConnCacheEntry *entry, bool toplevel,
 	 */
 	if (PQtransactionStatus(entry->conn) == 

Re: Combine Prune and Freeze records emitted by vacuum

2024-03-27 Thread Melanie Plageman
On Wed, Mar 27, 2024 at 12:18 PM Heikki Linnakangas  wrote:
>
> On 27/03/2024 17:18, Melanie Plageman wrote:
> > I need some way to modify the control flow or accounting such that I
> > know which HEAPTUPLE_RECENTLY_DEAD tuples will not be marked LP_DEAD.
> > And a way to consider freezing and do live tuple accounting for these
> > and HEAPTUPLE_LIVE tuples exactly once.
>
> Just a quick update: I've been massaging this some more today, and I
> think I'm onto something palatable. I'll send an updated patch later
> today, but the key is to note that for each item on the page, there is
> one point where we determine the fate of the item, whether it's pruned
> or not. That can happen at different points in heap_page_prune().
> That's also when we set marked[offnum] = true. Whenever that happens, we
> always call one of the heap_page_prune_record_*() subroutines. We already
> have those subroutines for when a tuple is marked as dead or unused, but
> let's add similar subroutines for the case that we're leaving the tuple
> unchanged. If we move all the bookkeeping logic to those subroutines, we
> can ensure that it gets done exactly once for each tuple, and at that
> point we know what we are going to do to the tuple, so we can count it
> correctly. So heap_prune_chain() decides what to do with each tuple, and
> ensures that each tuple is marked only once, and the subroutines update
> all the variables, add the item to the correct arrays etc. depending on
> what we're doing with it.

Yes, this would be ideal.

I was doing some experimentation with pageinspect today (trying to
find that single place where live tuples' fates are decided) and it
seems like a heap-only tuple that is not HOT-updated will usually be
the one at the end of the chain. Which seems like it would be covered
by adding a record_live() type function call in the loop of
heap_prune_chain():

/*
 * If the tuple is not HOT-updated, then we are at the end of this
 * HOT-update chain.
 */
if (!HeapTupleHeaderIsHotUpdated(htup))
{
heap_prune_record_live_or_recently_dead(dp, prstate,
offnum, presult);
break;
}

but that doesn't end up producing the same results as

if (HeapTupleHeaderIsHeapOnly(htup)
&& !HeapTupleHeaderIsHotUpdated(htup) &&
presult->htsv[rootoffnum] == HEAPTUPLE_DEAD)
heap_prune_record_live_or_recently_dead(dp, prstate,
offnum, presult);

at the top of heap_prune_chain().

- Melanie




Re: Add new error_action COPY ON_ERROR "log"

2024-03-27 Thread Bharath Rupireddy
On Wed, Mar 27, 2024 at 7:42 AM Masahiko Sawada  wrote:
>
> I think that there are two options to handle it:
>
> 1. change COPY grammar to accept DEFAULT as an option value.
> 2. change tab-completion to complement 'DEFAULT' instead of DEFAULT,
> and update the doc too.
>
> As for the documentation, we can add single-quotes as follows:
>
>  ENCODING 'encoding_name'
> +LOG_VERBOSITY [ 'mode' ]
>
> I thought the option (2) is better but there seems no precedent of
> complementing a single-quoted string other than file names. So the
> option (1) could be clearer.
>
> What do you think?

There is another option: change log_verbosity to {none, verbose} or
{none, skip_row_info} (discussed here
https://www.postgresql.org/message-id/Zelrqq-pkfkvsjPn%40paquier.xyz
on the grounds that we might extend this option to other use-cases in
the future). I tend to agree with you that we should support setting
log_verbosity to default without quotes, just to be consistent with
other commands that allow that [1]. And thanks for quickly identifying
what to change in gram.y. With that change, all the new tests added now
use log_verbosity default without quotes.

FWIW, a recent commit [2] did the same. Therefore, I don't see a
problem supporting it that way for COPY log_verbosity.

Please find the attached v13 patch with the change.

[1]
column_compression:
            COMPRESSION ColId        { $$ = $2; }
            | COMPRESSION DEFAULT    { $$ = pstrdup("default"); }
        ;

column_storage:
            STORAGE ColId            { $$ = $2; }
            | STORAGE DEFAULT        { $$ = pstrdup("default"); }
        ;

[2]
commit b9424d014e195386a83b0f1fe9f5a8e5727e46ea
Author: Tom Lane 
Date:   Thu Nov 10 18:20:49 2022 -0500

Support writing "CREATE/ALTER TABLE ... SET STORAGE DEFAULT".

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From 26941c91ec7c2cfe92bf31eb7dc999f60137b809 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Wed, 27 Mar 2024 17:36:40 +
Subject: [PATCH v13] Add new COPY option LOG_VERBOSITY.

This commit adds a new COPY option LOG_VERBOSITY, which controls the
amount of messages emitted during processing. Valid values are
'default' and 'verbose'.

This is currently used in COPY FROM when ON_ERROR option is set to
ignore. If 'verbose' is specified, additional information for each
discarded row is emitted.

Author: Bharath Rupireddy
Reviewed-by: Michael Paquier, Atsushi Torikoshi, Masahiko Sawada
Discussion: https://www.postgresql.org/message-id/CALj2ACUk700cYhx1ATRQyRw-fBM%2BaRo6auRAitKGff7XNmYfqQ%40mail.gmail.com
---
 doc/src/sgml/ref/copy.sgml   | 25 +++--
 src/backend/commands/copy.c  | 38 ++
 src/backend/commands/copyfrom.c  | 10 +++
 src/backend/commands/copyfromparse.c | 35 
 src/backend/parser/gram.y|  1 +
 src/bin/psql/tab-complete.c  |  6 +++-
 src/include/commands/copy.h  | 11 
 src/test/regress/expected/copy2.out  | 41 +++-
 src/test/regress/sql/copy2.sql   | 24 +++-
 src/tools/pgindent/typedefs.list |  1 +
 10 files changed, 181 insertions(+), 11 deletions(-)

diff --git a/doc/src/sgml/ref/copy.sgml b/doc/src/sgml/ref/copy.sgml
index 6c83e30ed0..12ae49ce9f 100644
--- a/doc/src/sgml/ref/copy.sgml
+++ b/doc/src/sgml/ref/copy.sgml
@@ -45,6 +45,7 @@ COPY { table_name [ ( column_name [, ...] ) | * }
 ON_ERROR 'error_action'
 ENCODING 'encoding_name'
+LOG_VERBOSITY [ mode ]
 
  
 
@@ -400,8 +401,12 @@ COPY { table_name [ ( FORMAT is text or csv.
  
  
-  A NOTICE message containing the ignored row count is emitted at the end
-  of the COPY FROM if at least one row was discarded.
+  A NOTICE message containing the ignored row count is
+  emitted at the end of the COPY FROM if at least one
+  row was discarded. When LOG_VERBOSITY option is set to
+  verbose, a NOTICE message
+  containing the line of the input file and the column name whose input
+  conversion has failed is emitted for each discarded row.
  
 

@@ -418,6 +423,22 @@ COPY { table_name [ ( 

 
+   
+LOG_VERBOSITY
+
+ 
+  Specify the amount of messages emitted by a COPY
+  command: default or verbose. If
+  verbose is specified, additional messages are emitted
+  during processing.
+ 
+ 
+  This is currently used in COPY FROM command when
+  ON_ERROR option is set to ignore.
+  
+
+   
+

 WHERE
 
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 28cf8b040a..67d5c3f7d0 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -422,6 +422,36 @@ defGetCopyOnErrorChoice(DefElem *def, ParseState *pstate, bool is_from)
 	return COPY_ON_ERROR_STOP;	/* keep 

Re: Built-in CTYPE provider

2024-03-27 Thread Jeff Davis
On Wed, 2024-03-27 at 16:53 +0100, Daniel Verite wrote:
>  provider | isalpha | isdigit 
> --+-+-
>  ICU  | f   | t
>  glibc    | t   | f
>  builtin  | f   | f

The "ICU" above is really the behvior of the Postgres ICU provider as
we implemented it, it's not something forced on us by ICU.

For the ICU provider, pg_wc_isalpha() is defined as u_isalpha()[1] and
pg_wc_isdigit() is defined as u_isdigit()[2]. Those, in turn, are
defined by ICU to be equivalent to java.lang.Character.isLetter() and
java.lang.Character.isDigit().
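
Roughly, the wiring amounts to this (a simplified sketch, not the actual
source; the real definitions live in the regex/locale code and also handle
the other providers):

    #include <unicode/uchar.h>

    static int
    icu_wc_isalpha(UChar32 wc)
    {
        return u_isalpha(wc);   /* java.lang.Character.isLetter() semantics */
    }

    static int
    icu_wc_isdigit(UChar32 wc)
    {
        return u_isdigit(wc);   /* java.lang.Character.isDigit() semantics */
    }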

ICU documents[3] how regex character classes should be implemented
using the ICU APIs, and cites Unicode TR#18 [4] as the source. Despite
being under the heading "...for C/POSIX character classes...", [3] says
it's based on the "Standard" variant of [4], rather than "POSIX
Compatible".

(Aside: the Postgres ICU provider doesn't match what [3] suggests for
the "alpha" class. For the character U+FF11 it doesn't matter, but I
suspect there are differences for other characters. This should be
fixed.)

The differences between PG_C_UTF8 and what ICU suggests are just
because the former uses the "POSIX Compatible" definitions and the
latter uses "Standard".

I implemented both the "Standard" and "POSIX Compatible" compatibility
properties in ad49994538, so it would be easy to change what PG_C_UTF8
uses.

[1]
https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/uchar_8h.html#aecff8611dfb1814d1770350378b3b283
[2] 
https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/uchar_8h.html#a42b37828d86daa0fed18b381130ce1e6
[3] 
https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/uchar_8h.html#details
[4] 
http://www.unicode.org/reports/tr18/#Compatibility_Properties

> Are we fine with pg_c_utf8 differing from both ICU's point of view
> (U+ff11 is digit and not alpha) and glibc point of view (U+ff11 is
> not
> digit, but it's alpha)?

Yes, some differences are to be expected.

But I'm fine making a change to PG_C_UTF8 if it makes sense, as long as
we can point to something other than "glibc version 2.35 does it this
way".

Regards,
Jeff Davis





Re: Possibility to disable `ALTER SYSTEM`

2024-03-27 Thread David G. Johnston
On Wed, Mar 27, 2024 at 10:12 AM Isaac Morland 
wrote:

> On Wed, 27 Mar 2024 at 13:05, Greg Sabino Mullane 
> wrote:
>
>> The purpose of the setting is to prevent accidental
>>> modifications via ALTER SYSTEM in environments where
>>
>>
>> The emphasis on 'accidental' seems a bit heavy here, and odd. Surely,
>> just "to prevent modifications via ALTER SYSTEM in environments where..."
>> is enough?
>>
>
> Not necessarily disagreeing, but it's very important nobody ever mistake
> this for a security feature. I don't know if the extra word "accidental" is
> necessary, but I think that's the motivation.
>

Prevent non-malicious modifications via ALTER SYSTEM in environments where
...

David J.


Re: Slow GRANT ROLE on PostgreSQL 16 with thousands of ROLEs

2024-03-27 Thread Nathan Bossart
On Wed, Mar 27, 2024 at 01:47:38PM -0300, Ranier Vilela wrote:
> On Wed, Mar 27, 2024 at 1:41 PM, Nathan Bossart <
> nathandboss...@gmail.com> wrote:
>> On Wed, Mar 27, 2024 at 01:21:23PM -0300, Ranier Vilela wrote:
>> > I think that commit d365ae7
>> > <https://github.com/postgres/postgres/commit/d365ae705409f5d9c81da4b668f59c3598feb512>
>> > left an oversight.
>> >
>> > If admin_role is a NULL pointer, it can be dereferenced
>> > in the main loop of the function roles_is_member_of and,
>> > worse, IMO, could corrupt arbitrary memory.
>> >
>> > First, a better shortcut is to test whether admin_role is NOT NULL.
>> > Second, !OidIsValid(*admin_role) no longer seems necessary.
>> >
>> > Or am I missing something?
>>
>> If admin_role is NULL, then admin_of is expected to be set to InvalidOid.
>> See the assertion at the top of the function.  AFAICT the code that
>> dereferences admin_role short-circuits if admin_of is invalid.
>>
> These conditions seem a little fragile and confusing to me.
> A simple test protects the pointer and avoids a series of checks
> that are unnecessary when the pointer is NULL.

Maybe.  But that doesn't seem like an oversight in commit d365ae7.

-   if (otherid == admin_of && form->admin_option &&
-       OidIsValid(admin_of) && !OidIsValid(*admin_role))
+   if (admin_role != NULL && otherid == admin_of &&
+       form->admin_option && OidIsValid(admin_of))
        *admin_role = memberid;

I'm not following why it's safe to remove the !OidIsValid(*admin_role)
check here.  We don't want to overwrite a previously-set value of
*admin_role, as per the comment above roles_is_member_of():

 * If admin_of is not InvalidOid, this function sets *admin_role, either
 * to the OID of the first role in the result list that directly possesses
 * ADMIN OPTION on the role corresponding to admin_of, or to InvalidOid if
 * there is no such role.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Adding application_name to the error and notice message fields

2024-03-27 Thread Mitar
Hi!

Oh, I can use PQparameterStatus() to obtain the application_name of the
current connection. It seems, then, that it is not necessary to add this
information to the notice message.
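
A minimal sketch of that lookup (application_name is reported by the
server, so libpq answers this from its cache without a round trip):

    #include <stdio.h>
    #include <libpq-fe.h>

    static void
    show_app_name(PGconn *conn)
    {
        const char *appname = PQparameterStatus(conn, "application_name");

        printf("application_name: %s\n", appname ? appname : "(unknown)");
    }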


Mitar

On Wed, Mar 27, 2024 at 4:22 PM Mitar  wrote:
>
> Hi!
>
> We take care to always set application_name to improve our log lines
> where we use %a in log_line_prefix to log application name, per [1].
> But notices which are sent to the client do not have the application
> name and are thus hard to attribute correctly. Could "a" be added with
> the application name (when available) to the error and notice message
> fields [2]?
>
>
> Mitar
>
> [1] 
> https://www.postgresql.org/docs/current/runtime-config-logging.html#GUC-LOG-LINE-PREFIX
> [2] https://www.postgresql.org/docs/current/protocol-error-fields.html
>
> --
> https://mitar.tnode.com/
> https://twitter.com/mitar_m
> https://noc.social/@mitar



-- 
https://mitar.tnode.com/
https://twitter.com/mitar_m
https://noc.social/@mitar




Re: Combine Prune and Freeze records emitted by vacuum

2024-03-27 Thread Peter Geoghegan
On Tue, Mar 19, 2024 at 9:36 PM Melanie Plageman
 wrote:
>  * "Freeze" NewRelfrozenXid/NewRelminMxid trackers.
>  *
>  * Trackers used when heap_freeze_execute_prepared freezes, or when 
> there
>  * are zero freeze plans for a page.  It is always valid for 
> vacuumlazy.c
>  * to freeze any page, by definition.  This even includes pages that 
> have
>  * no tuples with storage to consider in the first place.  That way 
> the
>  * 'totally_frozen' results from heap_prepare_freeze_tuple can always 
> be
>  * used in the same way, even when no freeze plans need to be 
> executed to
>  * "freeze the page".  Only the "freeze" path needs to consider the 
> need
>  * to set pages all-frozen in the visibility map under this scheme.
>  *
>  * When we freeze a page, we generally freeze all XIDs < OldestXmin, 
> only
>  * leaving behind XIDs that are ineligible for freezing, if any.  And 
> so
>  * you might wonder why these trackers are necessary at all; why 
> should
>  * _any_ page that VACUUM freezes _ever_ be left with XIDs/MXIDs that
>  * ratchet back the top-level NewRelfrozenXid/NewRelminMxid trackers?
>  *
>  * It is useful to use a definition of "freeze the page" that does not
>  * overspecify how MultiXacts are affected.  heap_prepare_freeze_tuple
>  * generally prefers to remove Multis eagerly, but lazy processing is 
> used
>  * in cases where laziness allows VACUUM to avoid allocating a new 
> Multi.
>  * The "freeze the page" trackers enable this flexibility.
>  */
>
> So, I don't really know if it is right to just check presult->nfrozen >
> 0 when updating relminmxid. I have changed it to the way you suggested.
> But we can change it back.

I think that this is just about safe. I had to check, though. I see
that the FRM_NOOP case (within
FreezeMultiXactId/heap_prepare_freeze_tuple) will ratchet back both
sets of trackers (both the freeze and no freeze variants). However,
it's rather hard to see that this is true.

The intent here was that cases where "presult->nfrozen == 0" would
always take the "freeze" path. That seems more natural to me, at
least, since I think of the freeze path as the default choice. By
definition, lazy_scan_prune() can always take the freeze path -- even
when the page has no tuples with storage. But it cannot always take
the no-freeze path -- "disobeying" pagefrz.freeze_required creates the
risk that relfrozenxid/relminmxid will be advanced to unsafe values at
the end of the VACUUM. IMV you should stick with that approach now,
even if it is currently safe to do it the other way around.

-- 
Peter Geoghegan




Re: Possibility to disable `ALTER SYSTEM`

2024-03-27 Thread Isaac Morland
On Wed, 27 Mar 2024 at 13:05, Greg Sabino Mullane 
wrote:

> The purpose of the setting is to prevent accidental
>> modifications via ALTER SYSTEM in environments where
>
>
> The emphasis on 'accidental' seems a bit heavy here, and odd. Surely, just
> "to prevent modifications via ALTER SYSTEM in environments where..." is
> enough?
>

Not necessarily disagreeing, but it's very important nobody ever mistake
this for a security feature. I don't know if the extra word "accidental" is
necessary, but I think that's the motivation.


Re: Possibility to disable `ALTER SYSTEM`

2024-03-27 Thread Greg Sabino Mullane
>
> The purpose of the setting is to prevent accidental
> modifications via ALTER SYSTEM in environments where


The emphasis on 'accidental' seems a bit heavy here, and odd. Surely, just
"to prevent modifications via ALTER SYSTEM in environments where..." is
enough?

Cheers,
Greg


Re: Query Regarding frame options initialization in Window aggregate state

2024-03-27 Thread Tom Lane
Vallimaharajan G  writes:
> I am currently referring to the Postgres source code (psql (PostgreSQL) 14.3) 
> and came across a particular section related to window aggregate 
> initialization that has left me with a question. Specifically, I noticed a 
> conditional case in the initialization of per aggregate 
> (initialize_peraggregate in nodeWindowAgg.c) where the winstate frameOptions 
> is being checked, but it appears that frameOptions is not set before this 
> conditional case.

You are right, and that's a bug.  It's not of major consequence ---
it would just cause us to set up for moving-aggregate mode when
we won't ever actually move the frame head -- but it's at least a
little bit inefficient.

While I'm looking at it, there's a pretty obvious typo in the
adjacent comment:

- * and collect the right set of fields from the pg_attribute entry.
+ * and collect the right set of fields from the pg_aggregate entry.

regards, tom lane




Re: Slow GRANT ROLE on PostgreSQL 16 with thousands of ROLEs

2024-03-27 Thread Ranier Vilela
On Wed, Mar 27, 2024 at 1:41 PM, Nathan Bossart <
nathandboss...@gmail.com> wrote:

> On Wed, Mar 27, 2024 at 01:21:23PM -0300, Ranier Vilela wrote:
> > Nathan Bossart  writes:
> >>Committed with that change. Thanks for the guidance on this one.
> >
> > I think that commit d365ae7
> > <https://github.com/postgres/postgres/commit/d365ae705409f5d9c81da4b668f59c3598feb512>
> > left an oversight.
> >
> > If admin_role is a NULL pointer, it can be dereferenced
> > in the main loop of the function roles_is_member_of and,
> > worse, IMO, could corrupt arbitrary memory.
> >
> > First, a better shortcut is to test whether admin_role is NOT NULL.
> > Second, !OidIsValid(*admin_role) no longer seems necessary.
> >
> > Or am I missing something?
>
> If admin_role is NULL, then admin_of is expected to be set to InvalidOid.
> See the assertion at the top of the function.  AFAICT the code that
> dereferences admin_role short-circuits if admin_of is invalid.
>
These conditions seem a little fragile and confusing to me.
A simple test protects the pointer and avoids a series of checks
that are unnecessary when the pointer is NULL.

best regards,
Ranier Vilela


Re: Slow GRANT ROLE on PostgreSQL 16 with thousands of ROLEs

2024-03-27 Thread Nathan Bossart
On Wed, Mar 27, 2024 at 01:21:23PM -0300, Ranier Vilela wrote:
> Nathan Bossart  writes:
>>Committed with that change. Thanks for the guidance on this one.
> 
> I think that commit d365ae7 left an oversight.
> 
> If admin_role is a NULL pointer, it can be dereferenced
> in the main loop of the function roles_is_member_of and,
> worse, IMO, could corrupt arbitrary memory.
> 
> First, a better shortcut is to test whether admin_role is NOT NULL.
> Second, !OidIsValid(*admin_role) no longer seems necessary.
> 
> Or am I missing something?

If admin_role is NULL, then admin_of is expected to be set to InvalidOid.
See the assertion at the top of the function.  AFAICT the code that
dereferences admin_role short-circuits if admin_of is invalid.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Slow GRANT ROLE on PostgreSQL 16 with thousands of ROLEs

2024-03-27 Thread Ranier Vilela
Hi,

Nathan Bossart  writes:
>Committed with that change. Thanks for the guidance on this one.

I think that commit d365ae7 left an oversight.

If admin_role is a NULL pointer, it can be dereferenced
in the main loop of the function roles_is_member_of and,
worse, IMO, could corrupt arbitrary memory.

First, a better shortcut is to test whether admin_role is NOT NULL.
Second, !OidIsValid(*admin_role) no longer seems necessary.

Or am I missing something?

best regards,
Ranier Vilela


avoid-dereference-a-null-pointer-acl.patch
Description: Binary data


Re: Combine Prune and Freeze records emitted by vacuum

2024-03-27 Thread Heikki Linnakangas

On 27/03/2024 17:18, Melanie Plageman wrote:

> I need some way to modify the control flow or accounting such that I
> know which HEAPTUPLE_RECENTLY_DEAD tuples will not be marked LP_DEAD.
> And a way to consider freezing and do live tuple accounting for these
> and HEAPTUPLE_LIVE tuples exactly once.


Just a quick update: I've been massaging this some more today, and I
think I'm onto something palatable. I'll send an updated patch later
today, but the key is to note that for each item on the page, there is
one point where we determine the fate of the item, whether it's pruned
or not. That can happen at different points in heap_page_prune().
That's also when we set marked[offnum] = true. Whenever that happens, we
always call one of the heap_page_prune_record_*() subroutines. We already
have those subroutines for when a tuple is marked as dead or unused, but 
let's add similar subroutines for the case that we're leaving the tuple 
unchanged. If we move all the bookkeeping logic to those subroutines, we 
can ensure that it gets done exactly once for each tuple, and at that 
point we know what we are going to do to the tuple, so we can count it 
correctly. So heap_prune_chain() decides what to do with each tuple, and 
ensures that each tuple is marked only once, and the subroutines update 
all the variables, add the item to the correct arrays etc. depending on 
what we're doing with it.
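
For concreteness, a hedged sketch of one such subroutine for the
leave-it-alone case (the name and the PruneState fields are assumptions
for illustration, not the actual patch):

    static void
    heap_prune_record_unchanged(PruneState *prstate, OffsetNumber offnum)
    {
        Assert(!prstate->marked[offnum]);
        prstate->marked[offnum] = true;

        /*
         * All bookkeeping for tuples we leave in place happens here, once:
         * live/recently-dead counters, hastup, all-visible tracking, etc.
         */
    }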


--
Heikki Linnakangas
Neon (https://neon.tech)





Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-27 Thread Bertrand Drouvot
Hi,

On Wed, Mar 27, 2024 at 09:00:37PM +0530, Bharath Rupireddy wrote:
> On Wed, Mar 27, 2024 at 6:54 PM Bertrand Drouvot
>  wrote:
> >
> > Hi,
> >
> > On Wed, Mar 27, 2024 at 05:55:05PM +0530, Bharath Rupireddy wrote:
> > > On Wed, Mar 27, 2024 at 3:42 PM Bertrand Drouvot
> > > Please see the attached v28 patch.
> >
> > Thanks!
> >
> > 1 === sorry I missed it in the previous review
> >
> >     if (!(RecoveryInProgress() && slot->data.synced))
> > +   {
> >         now = GetCurrentTimestamp();
> > +       update_inactive_since = true;
> > +   }
> > +   else
> > +       update_inactive_since = false;
> >
> > I think update_inactive_since is not needed, we could rely on (now > 0) 
> > instead.
> 
> I thought of using it, but at the expense of readability. I prefer to
> use a variable instead.

That's fine too.

> However, I changed the variable name to the more meaningful is_slot_being_synced.

Yeah makes sense and even easier to read.

v29-0001 LGTM.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: pgsql: Clean up role created in new subscription test.

2024-03-27 Thread Daniel Gustafsson
> On 27 Mar 2024, at 16:26, Tom Lane  wrote:
> 
> Daniel Gustafsson  writes:
>> The only option is to make the check opt-in via a command-line parameter for
>> use it in the main regress tests, and not use it for the contrib tests.  The
>> attached v7 does that and passes CI, but I wonder if it's worth it all with
>> that restriction?
> 
> Yeah, that seems hardly worth it --- and it's only an accident of
> implementation that we don't run the main tests in parallel with
> something else, anyway.  Too bad, but ...

Agreed.  The exercise did catch the leftovers in 0001, so I'll go ahead with
those before closing the CF entry.

--
Daniel Gustafsson





Re: Flushing large data immediately in pqcomm

2024-03-27 Thread Robert Haas
On Wed, Mar 27, 2024 at 7:39 AM David Rowley  wrote:
> Robert, I understand you'd like a bit more from this patch. I'm
> wondering if you planning on blocking another committer from going
> ahead with this? Or if you have a reason why the current state of the
> patch is not a meaningful enough improvement that would justify
> possibly not getting any improvements in this area for PG17?

So, I think that the first version of the patch, when it got a big
chunk of data, would just flush whatever was already in the buffer and
then send the rest without copying. The current version, as I
understand it, only does that if the buffer is empty; otherwise, it
copies data as much data as it can into the partially-filled buffer. I
think that change addresses most of my concern about the approach; the
old way could, I believe, lead to an increased total number of flushes
with the right usage pattern, but I don't believe that's possible with
the revised approach. I do kind of wonder whether there is some more
fine-tuning of the approach that would improve things further, but I
realize that we have very limited time to figure this out, and there's
no sense letting the perfect be the enemy of the good.
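
As I understand it, the revised send path amounts to something like this,
expressed against the pqcomm.c internals (a hedged sketch; the actual
patch may differ in detail):

    /* conceptually, inside internal_putbytes(s, len) */
    while (len > 0)
    {
        size_t      amount;

        /* Empty buffer and a large payload?  Send it without copying. */
        if (PqSendPointer == PqSendStart && len >= (size_t) PqSendBufferSize)
        {
            ssize_t     n = secure_write(MyProcPort, (void *) s, len);

            if (n < 0)
                return EOF;
            s += n;
            len -= n;
            continue;
        }

        /* Otherwise copy as much as fits into the partially-filled buffer. */
        if (PqSendPointer >= PqSendBufferSize && internal_flush())
            return EOF;
        amount = Min((size_t) (PqSendBufferSize - PqSendPointer), len);
        memcpy(PqSendBuffer + PqSendPointer, s, amount);
        PqSendPointer += amount;
        s += amount;
        len -= amount;
    }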

So in short... no, I don't have big concerns at this point. Melih's
latest benchmarks look fairly promising to me, too.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Built-in CTYPE provider

2024-03-27 Thread Daniel Verite
Jeff Davis wrote:

> The tests include initcap('123abc') which is '123abc' in the PG_C_UTF8
> collation vs '123Abc' in PG_UNICODE_FAST.
> 
> The reason for the latter behavior is that the Unicode Default Case
> Conversion algorithm for toTitlecase() advances to the next Cased
> character before mapping to titlecase, and digits are not Cased. ICU
> has a configurable adjustment, and defaults in a way that produces
> '123abc'.

Even aside from ICU, there's a behavioral difference between glibc
and pg_c_utf8 for codepoints in the decimal digit category
outside of the US-ASCII range '0'..'9':

select initcap(concat(chr(0xff11), 'a') collate "C.utf8");   -- glibc 2.35
 initcap 
-
 1a

select initcap(concat(chr(0xff11), 'a') collate "pg_c_utf8");
 initcap 
-
 1A

Both collations consider that chr(0xff11) is not a digit
(isdigit()=>false) but C.utf8 says that it's alpha, whereas pg_c_utf8
says it's neither digit nor alpha.

AFAIU this is why in the above initcap() call, pg_c_utf8 considers
that 'a' is the first alphanumeric, whereas C.utf8 considers that '1'
is the first alphanumeric, leading to different capitalizations.

Comparing the 3 providers:

WITH v(provider,type,result) AS (values
 ('ICU', 'isalpha', chr(0xff11) ~ '[[:alpha:]]' collate "unicode"),
 ('glibc', 'isalpha', chr(0xff11) ~ '[[:alpha:]]' collate "C.utf8"),
 ('builtin', 'isalpha', chr(0xff11) ~ '[[:alpha:]]' collate "pg_c_utf8"),
 ('ICU', 'isdigit', chr(0xff11) ~ '[[:digit:]]' collate "unicode"),
 ('glibc', 'isdigit', chr(0xff11) ~ '[[:digit:]]' collate "C.utf8"),
 ('builtin', 'isdigit', chr(0xff11) ~ '[[:digit:]]' collate "pg_c_utf8")
 )
select * from v
\crosstabview


 provider | isalpha | isdigit 
--+-+-
 ICU  | f   | t
 glibc| t   | f
 builtin  | f   | f


Are we fine with pg_c_utf8 differing from both ICU's point of view
(U+ff11 is digit and not alpha) and glibc point of view (U+ff11 is not
digit, but it's alpha)?

Aside from initcap(), this is going to be significant for regular
expressions.


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite




Can't compile PG 17 (master) from git under Msys2 autoconf

2024-03-27 Thread Regina Obe
I've been having trouble compiling PG 17 using msys2 / mingw64 (sorry, my
meson setup is a bit broken at the moment, so I couldn't test that).

Both my msys2 envs  (Rev2, Built by MSYS2 project) 13.2.0  and my older
setup (x86_64-posix-seh-rev0, Built by MinGW-W64 project) 8.1.0

Have the same issue:

The error is 

rm -f '../../src/include/storage/lwlocknames.h'
cp -pR ../../backend/storage/lmgr/lwlocknames.h
'../../src/include/storage/lwlocknames.h'
cp: cannot stat '../../backend/storage/lmgr/lwlocknames.h': No such file or
directory
make[1]: *** [Makefile:143: ../../src/include/storage/lwlocknames.h] Error 1
make[1]: Leaving directory '/projects/postgresql/postgresql-git/src/backend'
make: *** [../../src/Makefile.global:382: submake-generated-headers] Error 2


I traced the issue down to this change in
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=721856ff24b

$(top_builddir)/src/include/storage/lwlocknames.h:
storage/lmgr/lwlocknames.h
-   prereqdir=`cd '$(dir $<)' >/dev/null && pwd` && \
- cd '$(dir $@)' && rm -f $(notdir $@) && \
- $(LN_S) "$$prereqdir/$(notdir $<)" .
+   rm -f '$@'
+   $(LN_S) ../../backend/$< '$@'
 
 $(top_builddir)/src/include/utils/wait_event_types.h:
utils/activity/wait_event_types.h
-   prereqdir=`cd '$(dir $<)' >/dev/null && pwd` && \
- cd '$(dir $@)' && rm -f $(notdir $@) && \
- $(LN_S) "$$prereqdir/$(notdir $<)" .
+   rm -f '$@'
+   $(LN_S) ../../backend/$< '$@'


Reverting just the above change fixes the issue.  To be honest, I'm not sure
what all of that does, so I'm not sure of the best way to move forward.
My Linux autoconf build systems do not have this issue.

Thanks,
Regina





Re: Possibility to disable `ALTER SYSTEM`

2024-03-27 Thread Jelte Fennema-Nio
On Wed, 27 Mar 2024 at 16:10, Robert Haas  wrote:
>
> On Wed, Mar 27, 2024 at 11:01 AM Bruce Momjian  wrote:
> > Uh, the above is clearly wrong.  I think you mean "off" on the second line.
>
> Woops. When the name changed from externally_managed_configuration to
> allow_alter_system, the sense of it was reversed, and I guess Jelte
> missed flipping the documentation references around.

Yeah, that's definitely what happened. I did change a few, but I
indeed missed a few others (or maybe flipped some twice by accident,
or hadn't been flipped before when it reversed previously).

> > Why "remotely"?
>
> This wording was suggested upthread. I think the point here is that if
> the superuser is logging in from the local machine, it's obvious that
> they can do whatever they want. The point is to emphasize that a
> superuser without a local login can, too.

Changed this from "remotely using other means" to "using other SQL commands".

> > "its"?
>
> Yeah, that seems like an extra word.

Changed this to "the configuration of PostgreSQL"

> > > +some outside mechanism. In such environments, using 
> > > ALTER
> > > +SYSTEM to make configuration changes might appear to 
> > > work,
> > > +but then may be discarded at some point in the future when that 
> > > outside
> >
> > "might"
>
> This does not seem like a mistake to me. I'm not sure why you think it is.

I also think the original sentence was correct, but I don't think it
read very naturally. Changed it now in hopes to improve that.

> > > +mechanism updates the configuration. Setting this parameter to
> > > +on can help to avoid such mistakes.
> > > +   
> >
> > "off"
>
> This is another case that needs to be fixed now that the sense of the
> GUC is reversed. (We'd better make sure the code has the test the
> right way around, too.)

Fixed this one too, and the code is the right way around.
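
For reference, the GUC entry in the attached patch would look roughly like
this (a guc_tables.c-style sketch; COMPAT_OPTIONS_OTHER assumes the 0001
rename below, and the variable name is an assumption):

    {
        {"allow_alter_system", PGC_SIGHUP, COMPAT_OPTIONS_OTHER,
            gettext_noop("Allows running the ALTER SYSTEM command."),
            NULL
        },
        &AllowAlterSystem,
        true,
        NULL, NULL, NULL
    },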


v8-0001-Rename-COMPAT_OPTIONS_CLIENT-to-COMPAT_OPTIONS_OT.patch
Description: Binary data


v8-0002-Add-allow_alter_system-GUC.patch
Description: Binary data


Crash on UNION with PG 17

2024-03-27 Thread Regina Obe
Our PostGIS bot that follows the master branch has been crashing for the past
couple of days on one of our tests:

https://trac.osgeo.org/postgis/ticket/5701

I traced the issue down to this commit:
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=66c0185a3d14b
bbf51d0fc9d267093ffec735231


The issue can be exercised without postgis installed as follows:


CREATE TABLE edge_data AS 
SELECT i AS edge_id, i + 1 AS start_node, i + 2 As end_node
FROM generate_series(1,10) AS i;

  WITH edge AS (
SELECT start_node, end_node
FROM edge_data
WHERE edge_id = 1
  )
  SELECT start_node id FROM edge UNION
  SELECT end_node FROM edge;


If I run using UNION ALL, this works fine:

  WITH edge AS (
SELECT start_node, end_node
FROM edge_data
WHERE edge_id = 1
  )
  SELECT start_node id FROM edge UNION ALL
  SELECT end_node FROM edge;


Thanks,
Regina






Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-27 Thread Bharath Rupireddy
On Wed, Mar 27, 2024 at 6:54 PM Bertrand Drouvot
 wrote:
>
> Hi,
>
> On Wed, Mar 27, 2024 at 05:55:05PM +0530, Bharath Rupireddy wrote:
> > On Wed, Mar 27, 2024 at 3:42 PM Bertrand Drouvot
> > Please see the attached v28 patch.
>
> Thanks!
>
> 1 === sorry I missed it in the previous review
>
>     if (!(RecoveryInProgress() && slot->data.synced))
> +   {
>         now = GetCurrentTimestamp();
> +       update_inactive_since = true;
> +   }
> +   else
> +       update_inactive_since = false;
>
> I think update_inactive_since is not needed, we could rely on (now > 0) 
> instead.

I thought of using it, but at the expense of readability. I prefer to
use a variable instead. However, I changed the variable name to the more
meaningful is_slot_being_synced.
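
For reference, the two variants discussed, sketched side by side (variable
and field names follow this thread, not necessarily the final patch):

    TimestampTz now = 0;

    if (!(RecoveryInProgress() && slot->data.synced))
        now = GetCurrentTimestamp();

    /* variant 1: reuse 'now' itself as the signal */
    if (now > 0)
        slot->inactive_since = now;

    /* variant 2 (v29): a named flag, arguably more readable */
    if (!is_slot_being_synced)
        slot->inactive_since = now;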

> 2 ===
>
> +=item $node->get_slot_inactive_since_value(self, primary, slot_name, dbname)
> +
> +Get inactive_since column value for a given replication slot validating it
> +against optional reference time.
> +
> +=cut
> +
> +sub get_slot_inactive_since_value
> +{
>
> shouldn't be "=item $node->get_slot_inactive_since_value(self, slot_name, 
> reference_time)"
> instead?

Ugh. Changed.

> Apart from the above, LGTM.

Thanks. I'm attaching the v29 patches: 0001 manages inactive_since on the
standby for sync slots, and 0002 implements the inactive-timeout GUC-based
invalidation mechanism.

Please have a look.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com


v29-0001-Maintain-inactive_since-for-synced-slots-correct.patch
Description: Binary data


v29-0002-Add-inactive_timeout-based-replication-slot-inva.patch
Description: Binary data


Re: Adding OLD/NEW support to RETURNING

2024-03-27 Thread Jeff Davis
On Wed, 2024-03-27 at 13:19 +, Dean Rasheed wrote:

> What I'm most worried about now is that there are other areas of
> functionality like that, that I'm overlooking, and which will
> interact
> with this feature in non-trivial ways.

Agreed. I'm not sure exactly how we'd find those other areas (if they
exist) aside from just having more eyes on the code.

> 
> So on reflection, rather than trying to rush to get this into v17, I
> think it would be better to leave it to v18.

Thank you for letting me know. That allows some time for others to have
a look.

Regards,
Jeff Davis





Re: pgsql: Allow using syncfs() in frontend utilities.

2024-03-27 Thread Robert Haas
On Wed, Mar 27, 2024 at 11:25 AM Nathan Bossart
 wrote:
> Committed.  Again, I apologize this took so long.

No worries from my side; I only noticed recently. I guess Peter's been
waiting a while, though. Thanks for committing.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: pgsql: Track last_inactive_time in pg_replication_slots.

2024-03-27 Thread Robert Haas
On Wed, Mar 27, 2024 at 11:06 AM Nathan Bossart
 wrote:
> IMHO the use-case where this doesn't work so well is when you have many,
> many servers to administer (e.g., a cloud provider).  In those cases,
> picking a default timeout to try to prevent wraparound is going to be much
> less accurate, as any reasonable value you pick is still going to be
> insufficient in some cases.  I think the XID-based parameter would be
> better here; if the server is at imminent risk of an outage due to
> wraparound, invalidating the slots is probably a reasonable course of
> action.

Well, I'm certainly willing to believe that your experience with
administering servers in the cloud is superior to mine. I don't really
understand why it makes a difference, though. Whether you have one
server or many, I agree that it is reasonable to invalidate slots when
XID wraparound looms. But also, whether you have one server or many,
by the time wraparound looms, you will typically have crippling bloat
as well. If preventing that bloat isn't important or there are other
defenses against it, then I see the value of the XID-based cutoff as a
backstop. And I will admit that in an on-prem installation, I've
occasionally seen situations where lots of XIDs got burned without
really causing a lot of bloat -- say, because there are heavily
updated staging tables which are periodically truncated, and very
little modification to long-lived data.

I'm not really strongly against having an XID-based threshold if smart
people such as yourself want it. I just think for a lot of users it's
going to be fiddly to get right.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: pgsql: Clean up role created in new subscription test.

2024-03-27 Thread Tom Lane
Daniel Gustafsson  writes:
> The only option is to make the check opt-in via a command-line parameter for
> use it in the main regress tests, and not use it for the contrib tests.  The
> attached v7 does that and passes CI, but I wonder if it's worth it all with
> that restriction?

Yeah, that seems hardly worth it --- and it's only an accident of
implementation that we don't run the main tests in parallel with
something else, anyway.  Too bad, but ...

regards, tom lane




Re: pgsql: Allow using syncfs() in frontend utilities.

2024-03-27 Thread Nathan Bossart
On Wed, Mar 27, 2024 at 10:41:42AM -0400, Robert Haas wrote:
> On Tue, Mar 26, 2024 at 12:34 PM Nathan Bossart
>  wrote:
>> Here's a first attempt at a patch based on Robert's suggestion from
>> upthread.
> 
> WFM.

Committed.  Again, I apologize this took so long.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Adding application_name to the error and notice message fields

2024-03-27 Thread Mitar
Hi!

We take care to always set application_name to improve our log lines
where we use %a in log_line_prefix to log application name, per [1].
But notices which are sent to the client do not have the application
name and are thus hard to attribute correctly. Could "a" be added with
the application name (when available) to the error and notice message
fields [2]?


Mitar

[1] 
https://www.postgresql.org/docs/current/runtime-config-logging.html#GUC-LOG-LINE-PREFIX
[2] https://www.postgresql.org/docs/current/protocol-error-fields.html

-- 
https://mitar.tnode.com/
https://twitter.com/mitar_m
https://noc.social/@mitar




Re: Combine Prune and Freeze records emitted by vacuum

2024-03-27 Thread Melanie Plageman
On Tue, Mar 26, 2024 at 5:46 PM Melanie Plageman
 wrote:
>
> On Mon, Mar 25, 2024 at 09:33:38PM +0200, Heikki Linnakangas wrote:
> > On 24/03/2024 18:32, Melanie Plageman wrote:
> > > On Thu, Mar 21, 2024 at 9:28 AM Heikki Linnakangas  
> > > wrote:
> > > >
> > > > In heap_page_prune_and_freeze(), we now do some extra work on each live
> > > > tuple, to set the all_visible_except_removable correctly. And also to
> > > > update live_tuples, recently_dead_tuples and hastup. When we're not
> > > > freezing, that's a waste of cycles, the caller doesn't care. I hope it's
> > > > enough that it doesn't matter, but is it?
> > >
> > > Last year on an early version of the patch set I did some pgbench
> > > tpcb-like benchmarks -- since there is a lot of on-access pruning in
> > > that workload -- and I don't remember it being a showstopper. The code
> > > has changed a fair bit since then. However, I think it might be safer
> > > to pass a flag "by_vacuum" to heap_page_prune_and_freeze() and skip
> > > the rest of the loop after heap_prune_satisifies_vacuum() when
> > > on-access pruning invokes it. I had avoided that because it felt ugly
> > > and error-prone, however it addresses a few other of your points as
> > > well.
> >
> > Ok. I'm not a fan of the name 'by_vacuum' though. It'd be nice if the
> > argument described what it does, rather than who it's for. For example,
> > 'need_all_visible'. If set to true, the function determines 'all_visible',
> > otherwise it does not.
>
> A very rough v7 is attached. The whole thing is rebased over master and
> then 0016 contains an attempt at the refactor we discussed in this
> email.
>
> Instead of just using the PruneReason to avoid doing the extra steps
> when on-access pruning calls heap_page_prune_and_freeze(), I've made an
> "actions" variable and defined different flags for it. One of them is
> a replacement for the existing mark_unused_now flag. I defined another
> one, PRUNE_DO_TRY_FREEZE, which could be used in place of checking if
> pagefrz is NULL.
>
> There is a whole group of activities that only the vacuum caller does
> outside of freezing -- setting hastup, counting live and recently dead
> tuples, and determining whole-page visibility and a snapshot conflict
> horizon for updating the VM. But I didn't want to introduce a separate
> flag for each of them, because then I would have to check each flag
> before taking the corresponding action. That would mean lots of extra
> branching, and on-access pruning does none of those actions while
> vacuum does all of them.
>
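For illustration, a minimal sketch of what that bitmask could look like
(PRUNE_DO_TRY_FREEZE is named above; the other flag names here are
assumptions, not the attached patch), with the vacuum-only work grouped
under a single flag rather than one flag per action:

/*
 * Sketch only: names other than PRUNE_DO_TRY_FREEZE are illustrative.
 */
#define PRUNE_DO_MARK_UNUSED_NOW    (1 << 0)    /* replaces mark_unused_now */
#define PRUNE_DO_TRY_FREEZE         (1 << 1)    /* replaces "pagefrz != NULL" */
#define PRUNE_DO_VACUUM_EXTRAS      (1 << 2)    /* hastup, tuple counts, VM info */

/* A vacuum caller might then pass something like: */
int         actions = PRUNE_DO_TRY_FREEZE | PRUNE_DO_VACUUM_EXTRAS;

/* ... while on-access pruning passes 0 and skips all of the above. */
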
> > I started to look closer at the loops in heap_prune_chain() and how they
> > update all the various flags and counters. There's a lot going on there. We
> > have:
> >
> > - live_tuples counter
> > - recently_dead_tuples counter
> > - all_visible[_except_removable]
> > - all_frozen
> > - visibility_cutoff_xid
> > - hastup
> > - prstate.frozen array
> > - nnewlpdead
> > - deadoffsets array
> >
> > And that doesn't even include all the local variables and the final
> > dead/redirected arrays.
> >
> > Some of those are set in the first loop that initializes 'htsv' for each
> > tuple on the page. Others are updated in heap_prune_chain(). Some are
> > updated in both. It's hard to follow which are set where.
> >
> > I think recently_dead_tuples is updated incorrectly for tuples that are
> > part of a completely dead HOT chain. For example, imagine a HOT chain with
> > two tuples: RECENTLY_DEAD -> DEAD. heap_prune_chain() would follow the
> > chain, see the DEAD tuple at the end of the chain, and mark both tuples for
> > pruning. However, we already incremented 'recently_dead_tuples' in the
> > first loop, which is wrong if we remove the tuple.
>
> Ah, yes, you are so right about this bug.
>
> > Maybe that's the only bug like this, but I'm a little scared. Is there
> > something we could do to make this simpler? Maybe move all the new work that
> > we added to the first loop into heap_prune_chain()? Maybe introduce a few
> > more helper heap_prune_record_*() functions to update the flags and
> > counters also for live and insert/delete-in-progress tuples and for dead
> > line pointers? Something like heap_prune_record_live() and
> > heap_prune_record_lp_dead().
>
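As a rough sketch (field and enum names assumed from the counters
listed above, not taken from the attached patch), such a helper might
look like the following; calling it only for tuples we have decided to
keep would also avoid the recently_dead_tuples double-count described
above:

/*
 * Sketch: update counters for a tuple that will be kept on the page.
 * Assumes PruneState carries the counters discussed in this thread.
 */
static void
heap_prune_record_live_or_recently_dead(PruneState *prstate,
                                        HTSV_Result status)
{
    switch (status)
    {
        case HEAPTUPLE_LIVE:
            prstate->live_tuples++;
            break;
        case HEAPTUPLE_RECENTLY_DEAD:
            prstate->recently_dead_tuples++;
            break;
        default:
            /* insert/delete-in-progress handling elided in this sketch */
            break;
    }
    prstate->hastup = true;     /* the page keeps at least one tuple */
}
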
> I like the idea of a heap_prune_record_live_or_recently_dead() function.
> That's what I've attempted to implement in the attached 0016. I haven't
> updated and cleaned up everything (especially comments) in the refactor,
> but there are two major issues:
>
> 1) In heap_prune_chain(), a heap-only tuple which is not HOT updated may
> end up being a live tuple that is not part of any chain, or it may end
> up as the redirect target in a HOT chain. At the top of
> heap_prune_chain(), we return if (HeapTupleHeaderIsHeapOnly(htup)). We
> may come back to this tuple later if it is part of a chain. If we
> don't, we need to have called heap_prune_record_live_or_recently_dead().
> However, there are other redirect targets that do not meet this
> criterion, so
> 

Re: Possibility to disable `ALTER SYSTEM`

2024-03-27 Thread Robert Haas
On Wed, Mar 27, 2024 at 11:01 AM Bruce Momjian  wrote:
> Uh, the above is clearly wrong.  I think you mean "off" on the second line.

Woops. When the name changed from externally_managed_configuration to
allow_alter_system, the sense of it was reversed, and I guess Jelte
missed flipping the documentation references around. I likely would
have made the same mistake, but it's easily fixed.

> > +
> > +Note that this setting cannot be regarded as a security feature. It
> > +only disables the ALTER SYSTEM command. It does not
> > +prevent a superuser from changing the configuration remotely using
>
> Why "remotely"?

This wording was suggested upthread. I think the point here is that if
the superuser is logging in from the local machine, it's obvious that
they can do whatever they want. The intent is to emphasize that a
superuser without a local login can, too.

> "its"?

Yeah, that seems like an extra word.

> > +some outside mechanism. In such environments, using ALTER
> > +SYSTEM to make configuration changes might appear to work,
> > +but then may be discarded at some point in the future when that outside
>
> "might"

This does not seem like a mistake to me. I'm not sure why you think it is.

> > +mechanism updates the configuration. Setting this parameter to
> > +on can help to avoid such mistakes.
> > +   
>
> "off"

This is another case that needs to be fixed now that the sense of the
GUC is reversed. (We'd better make sure the code has the test the
right way around, too.)

> Is this really a patch we think we can push into PG 17? I am having my
> doubts.

If the worst thing that happens in PG 17 is that we push a patch that
needs a few documentation corrections, we're going to be doing
fabulously well.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: pg_upgrade failing for 200+ million Large Objects

2024-03-27 Thread Nathan Bossart
On Wed, Mar 27, 2024 at 10:54:05AM -0400, Tom Lane wrote:
> Michael Banck  writes:
>> What is the status of this? In the commitfest, this patch is marked as
>> "Needs Review" with Nathan as reviewer - Nathan, were you going to take
>> another look at this or was your mail from January 12th a full review?
> 
> In my mind the ball is in Nathan's court.  I feel it's about
> committable, but he might not agree.

I'll prioritize another round of review on this one.  FWIW I don't remember
having any major concerns on a previous version of the patch set I looked
at.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Possibility to disable `ALTER SYSTEM`

2024-03-27 Thread Robert Haas
On Wed, Mar 27, 2024 at 10:43 AM Jelte Fennema-Nio  wrote:
> Alright, changed the GUC name to "allow_alter_system" since that seems
> to have the most "votes". One other option would be to call it simply
> "alter_system", just like "jit" is not called "allow_jit" or
> "enable_jit".
>
> But personally I feel that "allow_alter_system" is clearer than
> plain "alter_system" for the GUC name.

I agree, and have committed your 0001.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: pgsql: Track last_inactive_time in pg_replication_slots.

2024-03-27 Thread Nathan Bossart
On Wed, Mar 27, 2024 at 10:33:28AM -0400, Robert Haas wrote:
> FWIW, I thought the time-based one sounded more useful. I think it
> would be poor planning to say "well, if the slot reaches an XID age of
> a billion, kill it so we don't wrap around," because while that likely
> will prevent me from getting into wraparound trouble, my database is
> likely to become horribly bloated long before the cutoff is reached. I
> thought it would be easier to reason in terms of time: I don't expect
> a slave to ever be down for more than X period of time, say an hour or
> whatever, so if it is, forget about it. Or alternatively, I know that
> if a slave does go down for more than X period of time, I start to get
> bloat, so cut it off at that point and I'll rebuild it later. I feel
> like these are things where people's intuition is going to be much
> stronger when reckoning in units of wall-clock time, which everyone
> deals with every day in one way or another, rather than in XID-based
> units that are, at least in my view, just a lot less intuitive.

I don't disagree with this point in the context of a user who is managing a
single server or just a handful of servers.  They are going to understand
their workload best and can reason about the right value for the timeout.
I think they'd still benefit from having an XID-based setting as a backstop
in case the timeout is still not sufficient to prevent wraparound, but it's
not nearly as important in that case.

IMHO the use-case where this doesn't work so well is when you have many,
many servers to administer (e.g., a cloud provider).  In those cases,
picking a default timeout to try to prevent wraparound is going to be much
less accurate, as any reasonable value you pick is still going to be
insufficient in some cases.  I think the XID-based parameter would be
better here; if the server is at imminent risk of an outage due to
wraparound, invalidating the slots is probably a reasonable course of
action.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



