Re: Table AM Interface Enhancements
On Sat, 2024-03-30 at 23:33 +0200, Alexander Korotkov wrote:
> I've pushed 0001, 0002 and 0006.

Sorry to jump into this discussion so late. I had worked on something like the custom reloptions (0002) in the past, and there were some complications that don't seem to be addressed in commit c95c25f9af.

* At minimum I think it needs some direction (comments, docs, tests) that shows how it's supposed to be used.

* The bytea returned by the reloptions() method is not in a trivial format. It's a StdRdOptions struct with string values stored after the end of the struct. To build the bytea internally, there's some infrastructure like allocateRelOptStruct() and fillRelOptions(), and it's not very easy to extend those to support a few custom options.

* If we ever decide to add a string option to StdRdOptions, I think the design breaks, because the code that looks for those string values wouldn't know how to skip over the custom options. Perhaps we can just promise to never do that, but we should make it explicit somehow.

* Most existing heap reloptions (other than fillfactor) are used by other parts of the system (like autovacuum) so should be considered valid for any AM. Most AMs will just want to add a handful of their own options on top, so it would be good to demonstrate how this should be done.

* There are still places that are inappropriately calling heap_reloptions directly. For instance, in ProcessUtilitySlow(), it seems to assume that a toast table is a heap?

Regards,
Jeff Davis
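To make the second bullet concrete, here is a hypothetical miniature of the layout being described: fixed-size members first, string values appended after the end of the struct and addressed by byte offset. All names here (MiniRelOptions, mini_build, mini_get_string) are invented for illustration; the real StdRdOptions/fillRelOptions machinery is considerably more involved.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical miniature of the reloptions layout described above:
 * fixed-size fields in the struct, string values stored in the bytes
 * following it, each recorded as an offset from the start of the blob
 * (0 meaning "not set"). */
typedef struct MiniRelOptions
{
	int			fillfactor;
	int			str_option_offset;	/* offset of a string value, or 0 */
} MiniRelOptions;

static MiniRelOptions *
mini_build(int fillfactor, const char *strval)
{
	size_t		slen = strval ? strlen(strval) + 1 : 0;
	MiniRelOptions *opts = calloc(1, sizeof(MiniRelOptions) + slen);

	opts->fillfactor = fillfactor;
	if (strval)
	{
		/* string payload lives immediately past the struct */
		opts->str_option_offset = (int) sizeof(MiniRelOptions);
		memcpy((char *) opts + opts->str_option_offset, strval, slen);
	}
	return opts;
}

static const char *
mini_get_string(const MiniRelOptions *opts)
{
	return opts->str_option_offset
		? (const char *) opts + opts->str_option_offset
		: NULL;
}
```

The fragility being pointed out falls out of this layout: anything that walks the blob by offset has to know about every appended value, so custom AM options and any future string option in StdRdOptions would have to agree on how to skip over each other.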
Re: ALTER TABLE SET ACCESS METHOD on partitioned tables
Justin Pryzby writes:
> On Sun, Mar 31, 2024 at 12:00:00PM +0300, Alexander Lakhin wrote:
>> I've stumbled upon a test failure caused by the test query added in that
>> commit:
>> +ERROR: deadlock detected
>> +DETAIL: Process 3076180 waits for AccessShareLock on relation 1259 of
>> database 16386; blocked by process 3076181.
>> +Process 3076181 waits for AccessShareLock on relation 2601 of database
>> 16386; blocked by process 3076180.
> I think that means that, although it was cute to use pg_am in the reproducer
> given in the problem report, it's not a good choice to use here in the
> sql regression tests.

Another case here:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sevengill&dt=2024-04-02%2001%3A32%3A17

AFAICS, e2395cdbe posits that taking exclusive lock on pg_am in the middle of a bunch of concurrent regression scripts couldn't possibly cause any problems. Really?

regards, tom lane
Re: Synchronizing slots from primary to standby
Hi,

On Tue, Apr 02, 2024 at 04:24:49AM +, Zhijie Hou (Fujitsu) wrote:
> On Monday, April 1, 2024 9:28 PM Bertrand Drouvot wrote:
> > On Mon, Apr 01, 2024 at 05:04:53PM +0530, Amit Kapila wrote:
> > > On Mon, Apr 1, 2024 at 2:51 PM Bertrand Drouvot wrote:
> > > >
> > > > 2 ===
> > > >
> > > > + {
> > > > + if (SnapBuildSnapshotExists(remote_slot->restart_lsn))
> > > > + {
> > > >
> > > > That could call SnapBuildSnapshotExists() multiple times for the
> > > > same "restart_lsn" (for example in case of multiple remote slots to
> > > > sync).
> > > >
> > > > What if the sync worker records the last lsn it asks for
> > > > serialization (and serialized)? Then we could check that value first
> > > > before deciding to call (or not) SnapBuildSnapshotExists() on it?
> > > >
> > > > It's not ideal because it would record "only the last one" but that
> > > > would be simple enough for now (currently there is only one sync
> > > > worker so that scenario is likely to happen).
> > >
> > > Yeah, we could do that but I am not sure how much it can help. I guess
> > > we could do some tests to see if it helps.
> >
> > Yeah, not sure either. I just think it can only help and shouldn't make
> > things worse (but could avoid extra SnapBuildSnapshotExists() calls).
>
> Thanks for the idea. I tried some tests based on Nisha's setup[1].

Thank you and Nisha and Shveta for the testing!

> I tried to advance the slots on the primary to the same restart_lsn before
> calling sync_replication_slots(), and reduced the data generated by pgbench.

Agree that this scenario makes sense to try to see the impact of SnapBuildSnapshotExists().

> The SnapBuildSnapshotExists is still not noticeable in the profile.

SnapBuildSnapshotExists()'s number of calls is probably negligible when compared to the IO calls generated by the fast forward logical decoding in this scenario.

> So, I feel we could leave this as a further improvement once we encounter
> scenarios where the duplicate SnapBuildSnapshotExists call becomes
> noticeable.

Sounds reasonable to me.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
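The "remember the last serialized restart_lsn" idea discussed above can be sketched as a one-entry cache in front of the existence check. This is illustrative C, not the actual slotsync code; snapshot_exists() is a counting stub standing in for SnapBuildSnapshotExists(), so the effect of the cache is visible.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

/* Counting stub standing in for SnapBuildSnapshotExists(). */
static int	snapshot_exists_calls = 0;

static bool
snapshot_exists(XLogRecPtr lsn)
{
	(void) lsn;
	snapshot_exists_calls++;
	return true;				/* pretend a serialized snapshot is always found */
}

/* One-entry cache: the last restart_lsn for which a serialized snapshot
 * was confirmed.  0 plays the role of InvalidXLogRecPtr here. */
static XLogRecPtr last_serialized_lsn = 0;

static bool
snapshot_exists_cached(XLogRecPtr restart_lsn)
{
	/* Same LSN as last time: skip the lookup entirely. */
	if (restart_lsn != 0 && restart_lsn == last_serialized_lsn)
		return true;

	if (snapshot_exists(restart_lsn))
	{
		last_serialized_lsn = restart_lsn;
		return true;
	}
	return false;
}
```

As noted in the thread, recording "only the last one" is enough while a single sync worker processes slots one by one, since consecutive slots with the same restart_lsn are exactly the repetitive case.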
Re: Why is parula failing?
"Tharakan, Robins" writes:
>> I've now switched to GCC v13.2 and triggered a run. Let's see if the tests
>> stabilize now.
> So although HEAD ran fine, I saw multiple failures (v12, v13, v16), all of
> which passed on subsequent tries, some of which were even "signal 6: Aborted".

Ugh...

> Also, it'd be great if someone could point me to a way to update the
> "Compiler" section in "System Detail" on the buildfarm page (it wrongly
> shows GCC as v7.3.1).

The update_personality.pl script in the buildfarm client distro is what to use to adjust OS version or compiler version data.

regards, tom lane
RE: Why is parula failing?
> I've now switched to GCC v13.2 and triggered a run. Let's see if the tests
> stabilize now.

So although HEAD ran fine, I saw multiple failures (v12, v13, v16), all of which passed on subsequent tries, some of which were even "signal 6: Aborted". FWIW, I compiled gcc v13.2 (default options) from source, which IIUC shouldn't be to blame. Two other possible reasons could be that the buildfarm doesn't have an aarch64 + gcc 13.2 combination (on a quick check I couldn't see any), or that this hardware is flaky. Either way, this instance is a throw-away, so let me know if this isn't helping. I'll swap it out in case there's not much benefit to be had.

Also, it'd be great if someone could point me to a way to update the "Compiler" section in "System Detail" on the buildfarm page (it wrongly shows GCC as v7.3.1).

- thanks
robins
Re: Synchronizing slots from primary to standby
On Mon, Apr 1, 2024 at 5:05 PM Amit Kapila wrote:
> > 2 ===
> >
> > + {
> > + if (SnapBuildSnapshotExists(remote_slot->restart_lsn))
> > + {
> >
> > That could call SnapBuildSnapshotExists() multiple times for the same
> > "restart_lsn" (for example in case of multiple remote slots to sync).
> >
> > What if the sync worker records the last lsn it asks for serialization
> > (and serialized)? Then we could check that value first before deciding
> > to call (or not) SnapBuildSnapshotExists() on it?
> >
> > It's not ideal because it would record "only the last one" but that
> > would be simple enough for now (currently there is only one sync worker
> > so that scenario is likely to happen).
>
> Yeah, we could do that but I am not sure how much it can help. I guess
> we could do some tests to see if it helps.

I had a look at the test results conducted by Nisha; I did not find any repetitive restart_lsn from the primary being synced to the standby for that particular test of 100 slots. Unless we have some concrete test in mind (having repetitive restart_lsn), I do not think that by using the given tests we can establish the benefit of the suggested optimization.
Attached the log files of all slots test for reference, thanks
Shveta

 slot_name | database | xmin | cxmin | restart_lsn | flushlsn | wal_status | conf | failover | synced | temp
-----------+----------+------+-------+-------------+------------+------------+------+----------+--------+------
 slot_1| newdb1 | | 772 | 0/AD0 | 0/E43E9240 | reserved | f| t| f | f
 slot_10 | newdb2 | | 772 | 0/A0002C8 | 0/E43E9240 | reserved | f| t| f | f
 slot_100 | newdb20 | | 772 | 0/A005F90 | 0/E43E9240 | reserved | f| t| f | f
 slot_11 | newdb3 | | 772 | 0/A000300 | 0/E43E9240 | reserved | f| t| f | f
 slot_12 | newdb3 | | 772 | 0/A000338 | 0/E43E9240 | reserved | f| t| f | f
 slot_13 | newdb3 | | 772 | 0/A000370 | 0/E43E9240 | reserved | f| t| f | f
 slot_14 | newdb3 | | 772 | 0/A0003A8 | 0/E43E9240 | reserved | f| t| f | f
 slot_15 | newdb3 | | 772 | 0/A0003E0 | 0/E43E9240 | reserved | f| t| f | f
 slot_16 | newdb4 | | 772 | 0/A000418 | 0/E43E9240 | reserved | f| t| f | f
 slot_17 | newdb4 | | 772 | 0/A000450 | 0/E43E9240 | reserved | f| t| f | f
 slot_18 | newdb4 | | 772 | 0/A000488 | 0/E43E9240 | reserved | f| t| f | f
 slot_19 | newdb4 | | 772 | 0/A0004C0 | 0/E43E9240 | reserved | f| t| f | f
 slot_2| newdb1 | | 772 | 0/A000108 | 0/E43E9240 | reserved | f| t| f | f
 slot_20 | newdb4 | | 772 | 0/A0004F8 | 0/E43E9240 | reserved | f| t| f | f
 slot_21 | newdb5 | | 772 | 0/A000530 | 0/E43E9240 | reserved | f| t| f | f
 slot_22 | newdb5 | | 772 | 0/A000568 | 0/E43E9240 | reserved | f| t| f | f
 slot_23 | newdb5 | | 772 | 0/A0005A0 | 0/E43E9240 | reserved | f| t| f | f
 slot_24 | newdb5 | | 772 | 0/A0005D8 | 0/E43E9240 | reserved | f| t| f | f
 slot_25 | newdb5 | | 772 | 0/A000610 | 0/E43E9240 | reserved | f| t| f | f
 slot_26 | newdb6 | | 772 | 0/A000648 | 0/E43E9240 | reserved | f| t| f | f
 slot_27 | newdb6 | | 772 | 0/A000680 | 0/E43E9240 | reserved | f| t| f | f
 slot_28 | newdb6 | | 772 | 0/A0006B8 | 0/E43E9240 | reserved | f| t| f | f
 slot_29 | newdb6 | | 772 | 0/A0006F0 | 0/E43E9240 | reserved | f| t| f | f
 slot_3| newdb1 | | 772 | 0/A000140 | 0/E43E9240 | reserved | f| t| f | f
 slot_30 | newdb6 | | 772 | 0/A000728 | 0/E43E9240 | reserved | f| t| f | f
 slot_31 | newdb7 | | 772 | 0/A000760 | 0/E43E9240 | reserved | f| t| f | f
 slot_32 | newdb7 | | 772 | 0/A000798 | 0/E43E9240 | reserved | f| t| f | f
 slot_33 | newdb7 | | 772 | 0/A0007D0 | 0/E43E9240 | reserved | f| t| f | f
 slot_34 | newdb7 | | 772 | 0/A000808 | 0/E43E9240 | reserved | f| t| f | f
 slot_35 | newdb7 | | 772 | 0/A000840 | 0/E43E9240 | reserved | f| t| f | f
 slot_36 | newdb8 | | 772 | 0/A000878 | 0/E43E9240 | reserved | f| t| f | f
 slot_37 |
RE: Synchronizing slots from primary to standby
On Tuesday, April 2, 2024 8:43 AM Bharath Rupireddy wrote:
> On Mon, Apr 1, 2024 at 11:36 AM Zhijie Hou (Fujitsu) wrote:
> >
> > Attach the V4 patch which includes the optimization to skip the
> > decoding if the snapshot at the syncing restart_lsn is already
> > serialized. It can avoid most of the duplicate decoding in my test,
> > and I am doing some more tests locally.
>
> Thanks for the patch. I'm thinking if we can reduce the amount of work
> that we do for synced slots in each sync worker cycle. With that context
> in mind, why do we need to create decoding context every time?
> Can't we create it once, store it in an in-memory structure and use it
> for each sync worker cycle? Is there any problem with it? What do you
> think?

Thanks for the idea. I think the cost of creating the decoding context is relatively minor when compared to the IO cost. After generating the profiles for the tests shared by Nisha[1], it appears that StartupDecodingContext is not an issue.

While the suggested refactoring is an option, I think we can consider this as a future improvement, addressing it only if we encounter scenarios where the creation of the decoding context becomes a bottleneck.

[1] https://www.postgresql.org/message-id/CALj2ACUeij5tFzJ1-cuoUh%2Bmhj33v%2BYgqD_gHYUpRdXSCSBbhw%40mail.gmail.com

Best Regards,
Hou zj
Re: Synchronizing slots from primary to standby
On Mon, Apr 1, 2024 at 6:58 PM Bertrand Drouvot wrote: > > On Mon, Apr 01, 2024 at 05:04:53PM +0530, Amit Kapila wrote: > > On Mon, Apr 1, 2024 at 2:51 PM Bertrand Drouvot > > wrote: > > > Then there is no need to call WaitForStandbyConfirmation() as it could go > > > until > > > the RecoveryInProgress() in StandbySlotsHaveCaughtup() for nothing (as we > > > already > > > know it). > > > > > > > Won't it will normally return from the first check in > > WaitForStandbyConfirmation() because standby_slot_names_config is not > > set on standby? > > I think standby_slot_names can be set on a standby. One could want to set it > in > a cascading standby env (though it won't have any real effects until the > standby > is promoted). > Yeah, it is possible but doesn't seem worth additional checks for this micro-optimization. -- With Regards, Amit Kapila.
Re: [HACKERS] make async slave to wait for lsn to be replayed
Alexander Korotkov writes:

Hello,

> > Did you forget to attach the new patch?
>
> Yes, here it is.
>
> --
> Regards,
> Alexander Korotkov
>
> [4. text/x-diff;
> v17-0001-Implement-pg_wal_replay_wait-stored-procedure.patch]...

+
+pg_wal_replay_wait (
+    target_lsn pg_lsn,
+    timeout bigint DEFAULT 0)
+void
+

Should we return the milliseconds of waiting time? I think this information may be useful for users if they want to know how long it waited, for monitoring purposes.

--
Best Regards
Andy Fan
Re: Change GUC hashtable to use simplehash?
On Tue, Mar 5, 2024 at 5:30 PM John Naylor wrote:
> On Tue, Jan 30, 2024 at 5:04 PM John Naylor wrote:
> > On Tue, Jan 30, 2024 at 4:13 AM Ants Aasma wrote:
> > > But given that we know the data length and we have it in a register
> > > already, it's easy enough to just mask out data past the end with a
> > > shift. See patch 1. Performance benefit is about 1.5x. Measured on a
> > > small test harness that just hashes and finalizes an array of strings,
> > > with a data dependency between consecutive hashes (next address
> > > depends on the previous hash output).
> >
> > Interesting work! I've taken this idea and (I'm guessing, haven't
> > tested) improved it by re-using an intermediate step for the
> > conditional, simplifying the creation of the mask, and moving the
> > bitscan out of the longest dependency chain.
>
> This needed a rebase, and is now 0001. I plan to push this soon.

I pushed but had to revert -- my version (and I believe both) failed to keep the invariant that the aligned and unaligned paths must result in the same hash. It's clear to me how to fix, but I've injured my strong hand and won't be typing much for a couple of days. I'll prioritize the removal of strlen calls for v17, since the optimization can wait and there is also a valgrind issue I haven't looked into.
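The mask-by-shift trick under discussion can be illustrated with a toy chunked hash. This is a sketch, not the committed fasthash code: to stay free of out-of-bounds reads it memcpy's only the valid tail bytes (where the in-tree code loads a whole word and relies on the mask), but it shows both the `(1 << (len * 8)) - 1` mask and the aligned-vs-unaligned invariant whose violation forced the revert.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Toy stand-in for the hash under discussion: consume the input in
 * 64-bit chunks, then mask the final partial chunk so that bytes past
 * the end of the data cannot influence the result.  The multiplier and
 * seed constants are arbitrary. */
static uint64_t
hash_bytes(const char *buf, size_t len)
{
	uint64_t	h = UINT64_C(0x9E3779B97F4A7C15);

	while (len >= 8)
	{
		uint64_t	chunk;

		memcpy(&chunk, buf, 8);
		h = (h ^ chunk) * UINT64_C(0x100000001B3);
		buf += 8;
		len -= 8;
	}
	if (len > 0)
	{
		uint64_t	chunk = 0;

		/* Sketch-only: copy just the valid bytes instead of over-reading. */
		memcpy(&chunk, buf, len);
		/* The mask-by-shift from the thread; len < 8 here, so no UB. */
		chunk &= (UINT64_C(1) << (len * 8)) - 1;
		h = (h ^ chunk) * UINT64_C(0x100000001B3);
	}
	return h;
}
```

Because the tail is masked, hashing the same 13 bytes at two different alignments, surrounded by different garbage, must give the same answer on any given machine — the invariant the aligned and unaligned code paths have to agree on.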
Re: Introduce XID age and inactive timeout based replication slot invalidation
On Mon, Apr 1, 2024 at 12:18 PM Bharath Rupireddy wrote:
> On Fri, Mar 29, 2024 at 9:39 AM Amit Kapila wrote:
> >
> > Commit message states: "why we can't just update inactive_since for
> > synced slots on the standby with the value received from remote slot
> > on the primary. This is consistent with any other slot parameter i.e.
> > all of them are synced from the primary."
> >
> > The inactive_since is not consistent with other slot parameters which
> > we copy. We don't perform anything related to those other parameters,
> > like say two_phase, which can change that property. However, we
> > do acquire the slot, advance the slot (as per recent discussion [1]),
> > and release it. Since these operations can impact inactive_since, it
> > seems to me that inactive_since is not the same as other parameters.
> > It can have a different value than the primary. Why would anyone want
> > to know the value of inactive_since from primary after the standby is
> > promoted?
>
> After thinking about it for a while now, it feels to me that the
> synced slots (slots on the standby that are being synced from the
> primary) can have their own inactive_since value. Fundamentally,
> inactive_since is set to 0 when slot is acquired and set to current
> time when slot is released, no matter who acquires and releases it -
> be it walsenders for replication, or backends for slot advance, or
> backends for slot sync using pg_sync_replication_slots, or backends
> for other slot functions, or background sync worker. Remember the
> earlier patch was updating inactive_since just for walsenders, but
> then the suggestion was to update it unconditionally -
> https://www.postgresql.org/message-id/CAJpy0uD64X%3D2ENmbHaRiWTKeQawr-rbGoy_GdhQQLVXzUSKTMg%40mail.gmail.com.
> Whoever syncs the slot, *actually* acquires the slot i.e. makes it
> theirs, syncs it from the primary, and releases it. IMO, no
> differentiation is to be made for synced slots.
FWIW, coming to this thread late, I think that the inactive_since should not be synchronized from the primary. The wall clocks are different on the primary and the standby, so having the primary's timestamp on the standby can confuse users, especially when there is a big clock drift. Also, as Amit mentioned, inactive_since seems not to be consistent with other parameters we copy. The replication_slot_inactive_timeout feature should work on the standby independently from the primary, like other slot invalidation mechanisms, and it should be based on its own local clock.

> Coming to a future patch for inactive timeout based slot invalidation,
> we can either allow invalidation without any differentiation for
> synced slots or restrict invalidation to avoid more sync work. For
> instance, if inactive timeout is kept low on the standby, the sync
> worker will be doing more work as it drops and recreates a slot
> repeatedly if it keeps getting invalidated. Another thing is that the
> standby takes independent invalidation decisions for synced slots.
> AFAICS, invalidation due to wal_removal is the sole reason (out
> of all available invalidation reasons) for a synced slot to get
> invalidated independently of the primary. Check
> https://www.postgresql.org/message-id/CAA4eK1JXBwTaDRD_%3D8t6UB1fhRNjC1C%2BgH4YdDxj_9U6djLnXw%40mail.gmail.com
> for the suggestion on why we had better not differentiate invalidation
> decisions for synced slots.
>
> The assumption of letting synced slots have their own inactive_since
> not only simplifies the code, but also looks less confusing and more
> meaningful to the user. The only code that we put in on top of the
> committed code is to use InRecovery in place of
> RecoveryInProgress() in RestoreSlotFromDisk() to fix the issue raised
> by Shveta upthread.

If we want to invalidate the synced slots due to the timeout, I think we need to define what is "inactive" for synced slots.
Suppose that the slotsync worker updates the local (synced) slot's inactive_since whenever releasing the slot, irrespective of the actual LSNs (or other slot parameters) having been updated. I think that this idea cannot handle a slot that is not acquired on the primary. In this case, the remote slot is inactive but the local slot is regarded as active. WAL files are piled up on the standby (and on the primary) as the slot's LSNs don't move forward. I think we want to regard such a slot as "inactive" also on the standby and invalidate it because of the timeout. > > > Now, the other concern is that calling GetCurrentTimestamp() > > could be costly when the values for the slot are not going to be > > updated but if that happens we can optimize such that before acquiring > > the slot we can have some minimal pre-checks to ensure whether we need > > to update the slot or not. If we use such pre-checks, another problem might happen; it cannot handle a case where the slot is acquired on the primary but its LSNs don't move forward. Imagine a logical replication conflict happened on the
Re: Asymmetric partition-wise JOIN
On 15/10/2023 13:25, Alexander Korotkov wrote:
> Great! I'm looking forward to the revised patch.

Revising the code and opinions before restarting this work, I found two different possible strategies mentioned in the thread:

1. 'Common Resources' - share the materialised result of the inner table scan (a hash table in the case of HashJoin) to join each partition one by one. It gives us a profit in the case of parallel append and possibly other cases, like the one shown in the initial message.

2. 'Individual strategies' - by limiting the AJ feature to cases when the JOIN clause contains a partitioning expression, we can push an additional scan clause into each copy of the inner table scan, reduce the number of tuples scanned, and even prune something because of proven zero input.

I see the pros and cons of both approaches. The first option is more straightforward, and its outcome is obvious in the case of parallel append. But how can we guarantee the same join type for each join? Why should we ignore the positive effect of different strategies for different partitions?

The second strategy is more expensive for the optimiser, especially in the multipartition case. But as I can predict, it is easier to implement and looks more natural for the architecture.

What do you think about that?

--
regards,
Andrei Lepikhov
Postgres Professional
RE: Why is parula failing?
> ... in connection with which, I can't help noticing that parula is using a > very old compiler: > > configure: using compiler=gcc (GCC) 7.3.1 20180712 (Red Hat 7.3.1-17) > > From some quick checking around, that would have to be near the beginning of > aarch64 > support in RHEL (Fedora hadn't promoted aarch64 to a primary architecture > until earlier > that same year). It's not exactly hard to believe that there were some > lingering compiler bugs. > I wonder why parula is using that when its underlying system seems markedly > newer (the kernel at least has a recent build date). Parula used GCC v7.3.1 since that's what came by default. I've now switched to GCC v13.2 and triggered a run. Let's see if the tests stabilize now. - Robins
Re: Table AM Interface Enhancements
On Tue, 02 Apr 2024 at 07:59, Alexander Korotkov wrote: > On Mon, Apr 1, 2024 at 7:36 PM Tom Lane wrote: >> Coverity complained about what you did in RelationParseRelOptions >> in c95c25f9a: >> >> *** CID 1595992: Null pointer dereferences (FORWARD_NULL) >> /srv/coverity/git/pgsql-git/postgresql/src/backend/utils/cache/relcache.c: >> 499 in RelationParseRelOptions() >> 493 >> 494 /* >> 495 * Fetch reloptions from tuple; have to use a hardwired >> descriptor because >> 496 * we might not have any other for pg_class yet (consider >> executing this >> 497 * code for pg_class itself) >> 498 */ >> >>> CID 1595992: Null pointer dereferences (FORWARD_NULL) >> >>> Passing null pointer "tableam" to "extractRelOptions", which >> >>> dereferences it. >> 499 options = extractRelOptions(tuple, GetPgClassDescriptor(), >> 500 >> tableam, amoptsfn); >> 501 >> >> I see that extractRelOptions only uses the tableam argument for some >> relkinds, and RelationParseRelOptions does set it up for those >> relkinds --- but Coverity's complaint isn't without merit, because >> those two switch statements are looking at *different copies of the >> relkind*, which in theory could be different. This all seems quite >> messy and poorly factored. Can't we do better? Why do we need to >> involve two copies of allegedly the same pg_class tuple, anyhow? > > I wasn't registered at Coverity yet. Now I've registered and am > waiting for approval to access the PostgreSQL analysis data. > > I wonder why Coverity complains about tableam, but not amoptsfn. > Their usage patterns are very similar. > > It appears that relation->rd_rel isn't the full copy of pg_class tuple > (see AllocateRelationDesc). RelationParseRelOptions() is going to > update relation->rd_options, and thus needs a full pg_class tuple to > fetch options out of it. However, it is really unnecessary to access > both tuples at the same time. We can use a full tuple, not > relation->rd_rel, in both cases. See the attached patch. 
>
> --
> Regards,
> Alexander Korotkov

+	Form_pg_class classForm = (Form_pg_class) GETSTRUCT(tuple);
+;

There is an additional semicolon in the code.

--
Regards,
Japin Li
Re: Add new error_action COPY ON_ERROR "log"
On Mon, Apr 1, 2024 at 10:03 AM Masahiko Sawada wrote: > > On Sat, Mar 30, 2024 at 11:05 PM Bharath Rupireddy > wrote: > > > > On Thu, Mar 28, 2024 at 6:31 PM Masahiko Sawada > > wrote: > > > > > > That is, > > > since the LOG_VERBOSITY option is an enum parameter, it might make > > > more sense to require the value, instead of making the value optional. > > > For example, the following command could not be obvious for users: > > > > > > COPY test FROM stdin (ON_ERROR ignore, LOG_VERBOSITY); > > > > Agreed. Please see the attached v14 patch. > > Thank you for updating the patch! > > > The LOG_VERBOSITY now needs > > a value to be specified. Note that I've not added any test for this > > case as there seemed to be no such tests so far generating "ERROR: > > <> requires a parameter". I don't mind adding one for > > LOG_VERBOSITY though. > > +1 > > One minor point: > > ENCODING 'encoding_name' > +LOG_VERBOSITY [ mode ] > > > '[' and ']' are not necessary because the value is no longer optional. > > I've attached the updated patch. I'll push it, barring any objections. > Pushed. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: Synchronizing slots from primary to standby
On Mon, Apr 1, 2024 at 11:36 AM Zhijie Hou (Fujitsu) wrote: > > Attach the V4 patch which includes the optimization to skip the decoding if > the snapshot at the syncing restart_lsn is already serialized. It can avoid > most > of the duplicate decoding in my test, and I am doing some more tests locally. Thanks for the patch. I'm thinking if we can reduce the amount of work that we do for synced slots in each sync worker cycle. With that context in mind, why do we need to create decoding context every time? Can't we create it once, store it in an in-memory structure and use it for each sync worker cycle? Is there any problem with it? What do you think? -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
RE: Synchronizing slots from primary to standby
On Monday, April 1, 2024 7:30 PM Amit Kapila wrote: > > On Mon, Apr 1, 2024 at 6:26 AM Zhijie Hou (Fujitsu) > wrote: > > > > On Friday, March 29, 2024 2:50 PM Amit Kapila > wrote: > > > > > > > > > > > > > > 2. > > > +extern XLogRecPtr pg_logical_replication_slot_advance(XLogRecPtr > moveto, > > > + bool *found_consistent_point); > > > + > > > > > > This API looks a bit awkward as the functionality doesn't match the > > > name. How about having a function with name > > > LogicalSlotAdvanceAndCheckReadynessForDecoding(moveto, > > > ready_for_decoding) with the same functionality as your patch has > > > for > > > pg_logical_replication_slot_advance() and then invoke it both from > > > pg_logical_replication_slot_advance and slotsync.c. The function > > > name is too big, we can think of a shorter name. Any ideas? > > > > How about LogicalSlotAdvanceAndCheckDecodingState() Or just > > LogicalSlotAdvanceAndCheckDecoding()? > > > > It is about snapbuild state, so how about naming the function as > LogicalSlotAdvanceAndCheckSnapState()? It looks better to me, so changed. > > I have made quite a few cosmetic changes in comments and code. See > attached. This is atop your latest patch. Can you please review and include > these changes in the next version? Thanks, I have reviewed and merged them. Attach the V5 patch set which addressed above comments and ran pgindent. I will think and test the improvement suggested by Bertrand[1] and reply after that. [1] https://www.postgresql.org/message-id/Zgp8n9QD5nYSESnM%40ip-10-97-1-34.eu-west-3.compute.internal Best Regards, Hou zj v5-0001-advance-the-restart_lsn-of-synced-slots-using-log.patch Description: v5-0001-advance-the-restart_lsn-of-synced-slots-using-log.patch
Re: Why is parula failing?
David Rowley writes: > On Sat, 30 Mar 2024 at 09:17, Tom Lane wrote: >> ... in connection with which, I can't help noticing that parula >> is using a very old compiler: >> configure: using compiler=gcc (GCC) 7.3.1 20180712 (Red Hat 7.3.1-17) >> I wonder why parula is using that when its underlying system seems >> markedly newer (the kernel at least has a recent build date). > It could be, but wouldn't the bug have to relate to the locking code > and be caused by some other backend stomping on the memory? > Otherwise, shouldn't it be failing consistently every run rather than > sporadically? Your guess is as good as mine ... but we still haven't seen this class of failure on any other animal, so the idea that it's strictly a chance timing issue is getting weaker all the time. regards, tom lane
Confusing #if nesting in hmac_openssl.c
I noticed that buildfarm member batfish has been complaining like this for awhile:

hmac_openssl.c:90:1: warning: unused function 'ResourceOwnerRememberHMAC' [-Wunused-function]
hmac_openssl.c:95:1: warning: unused function 'ResourceOwnerForgetHMAC' [-Wunused-function]

Looking at the code, this is all from commit e6bdfd970, and apparently batfish is our only animal that doesn't HAVE_HMAC_CTX_NEW. I tried to understand the #if nesting and soon got very confused. I don't think it is helpful to put the resource owner manipulations inside #ifdef HAVE_HMAC_CTX_NEW and HAVE_HMAC_CTX_FREE --- probably, it would never be the case that only one of those is defined, but it just seems messy. What do you think of rearranging it as attached?

regards, tom lane

diff --git a/src/common/hmac_openssl.c b/src/common/hmac_openssl.c
index d10f7e5af7..4a62c8832f 100644
--- a/src/common/hmac_openssl.c
+++ b/src/common/hmac_openssl.c
@@ -41,6 +41,7 @@
  */
 #ifndef FRONTEND
 #ifdef HAVE_HMAC_CTX_NEW
+#define USE_RESOWNER_FOR_HMAC
 #define ALLOC(size) MemoryContextAlloc(TopMemoryContext, size)
 #else
 #define ALLOC(size) palloc(size)
@@ -67,13 +68,13 @@ struct pg_hmac_ctx
 	pg_hmac_errno error;
 	const char *errreason;
 
-#ifndef FRONTEND
+#ifdef USE_RESOWNER_FOR_HMAC
 	ResourceOwner resowner;
 #endif
 };
 
 /* ResourceOwner callbacks to hold HMAC contexts */
-#ifndef FRONTEND
+#ifdef USE_RESOWNER_FOR_HMAC
 static void ResOwnerReleaseHMAC(Datum res);
 
 static const ResourceOwnerDesc hmac_resowner_desc =
@@ -138,10 +139,12 @@ pg_hmac_create(pg_cryptohash_type type)
 	 * previous runs.
 	 */
 	ERR_clear_error();
-#ifdef HAVE_HMAC_CTX_NEW
-#ifndef FRONTEND
+
+#ifdef USE_RESOWNER_FOR_HMAC
 	ResourceOwnerEnlarge(CurrentResourceOwner);
 #endif
+
+#ifdef HAVE_HMAC_CTX_NEW
 	ctx->hmacctx = HMAC_CTX_new();
 #else
 	ctx->hmacctx = ALLOC(sizeof(HMAC_CTX));
@@ -159,14 +162,14 @@ pg_hmac_create(pg_cryptohash_type type)
 		return NULL;
 	}
 
-#ifdef HAVE_HMAC_CTX_NEW
-#ifndef FRONTEND
+#ifndef HAVE_HMAC_CTX_NEW
+	memset(ctx->hmacctx, 0, sizeof(HMAC_CTX));
+#endif							/* HAVE_HMAC_CTX_NEW */
+
+#ifdef USE_RESOWNER_FOR_HMAC
 	ctx->resowner = CurrentResourceOwner;
 	ResourceOwnerRememberHMAC(CurrentResourceOwner, ctx);
 #endif
-#else
-	memset(ctx->hmacctx, 0, sizeof(HMAC_CTX));
-#endif							/* HAVE_HMAC_CTX_NEW */
 
 	return ctx;
 }
@@ -327,15 +330,16 @@ pg_hmac_free(pg_hmac_ctx *ctx)
 
 #ifdef HAVE_HMAC_CTX_FREE
 	HMAC_CTX_free(ctx->hmacctx);
-#ifndef FRONTEND
-	if (ctx->resowner)
-		ResourceOwnerForgetHMAC(ctx->resowner, ctx);
-#endif
 #else
 	explicit_bzero(ctx->hmacctx, sizeof(HMAC_CTX));
 	FREE(ctx->hmacctx);
 #endif
 
+#ifdef USE_RESOWNER_FOR_HMAC
+	if (ctx->resowner)
+		ResourceOwnerForgetHMAC(ctx->resowner, ctx);
+#endif
+
 	explicit_bzero(ctx, sizeof(pg_hmac_ctx));
 	FREE(ctx);
 }
@@ -375,7 +379,7 @@ pg_hmac_error(pg_hmac_ctx *ctx)
 
 /* ResourceOwner callbacks */
 
-#ifndef FRONTEND
+#ifdef USE_RESOWNER_FOR_HMAC
 static void
 ResOwnerReleaseHMAC(Datum res)
 {
Re: Table AM Interface Enhancements
On Mon, Apr 1, 2024 at 7:36 PM Tom Lane wrote: > Coverity complained about what you did in RelationParseRelOptions > in c95c25f9a: > > *** CID 1595992: Null pointer dereferences (FORWARD_NULL) > /srv/coverity/git/pgsql-git/postgresql/src/backend/utils/cache/relcache.c: > 499 in RelationParseRelOptions() > 493 > 494 /* > 495 * Fetch reloptions from tuple; have to use a hardwired > descriptor because > 496 * we might not have any other for pg_class yet (consider > executing this > 497 * code for pg_class itself) > 498 */ > >>> CID 1595992: Null pointer dereferences (FORWARD_NULL) > >>> Passing null pointer "tableam" to "extractRelOptions", which > >>> dereferences it. > 499 options = extractRelOptions(tuple, GetPgClassDescriptor(), > 500 > tableam, amoptsfn); > 501 > > I see that extractRelOptions only uses the tableam argument for some > relkinds, and RelationParseRelOptions does set it up for those > relkinds --- but Coverity's complaint isn't without merit, because > those two switch statements are looking at *different copies of the > relkind*, which in theory could be different. This all seems quite > messy and poorly factored. Can't we do better? Why do we need to > involve two copies of allegedly the same pg_class tuple, anyhow? I wasn't registered at Coverity yet. Now I've registered and am waiting for approval to access the PostgreSQL analysis data. I wonder why Coverity complains about tableam, but not amoptsfn. Their usage patterns are very similar. It appears that relation->rd_rel isn't the full copy of pg_class tuple (see AllocateRelationDesc). RelationParseRelOptions() is going to update relation->rd_options, and thus needs a full pg_class tuple to fetch options out of it. However, it is really unnecessary to access both tuples at the same time. We can use a full tuple, not relation->rd_rel, in both cases. See the attached patch. -- Regards, Alexander Korotkov fix-RelationParseRelOptions-Coverity-complain.patch Description: Binary data
Re: On disable_cost
On Mon, Apr 1, 2024 at 5:00 PM Tom Lane wrote:

> Very interesting, thanks for the summary. So the fact that disable_cost is additive across plan nodes is actually a pretty important property of the current setup. I think this is closely related to one argument you made against my upthread idea of using IEEE Infinity for disable_cost: that'd mask whether more than one of the sub-plans had been disabled.

Yes, exactly. I just hadn't quite put the pieces together.

> Yeah, that seems like the next thing to try if anyone plans to pursue this further. That'd essentially do what we're doing now except that disable_cost is its own "order of infinity", entirely separate from normal costs.

Right. I think that's actually what I had in mind in the last paragraph of http://postgr.es/m/CA+TgmoY+Ltw7B=1FSFSN4yHcu2roWrz-ijBovj-99LZU=9h...@mail.gmail.com but that was a while ago and I'd lost track of why it actually mattered.

But I also have questions about whether that's really the right approach. I think the approach of just not generating paths we don't want in the first place merits more consideration. We do that in some cases already, but not in others, and I'm not clear why. Like, if index scans, index-only scans, sorts, nested loops, and hash joins are disabled, something is going to have to give, because the only remaining join type is a merge join yet we've ruled out every possible way of getting the data into some order, but I'm not sure whether there's some reason that we need exactly the behavior that we have right now rather than anything else. Maybe it would be OK to just insist on at least one unparameterized, non-partial path at the baserel level, and then if that ends up forcing us to ignore the join-type restrictions higher up, so be it. Or maybe that's not OK and after I try that out I'll end up writing another email about how I was a bit clueless about all of this. I don't know.
But I feel like it merits more investigation, because I'm having trouble shaking the theory that what we've got right now is pretty arbitrary. And also ... looking at the regression tests, and also thinking about the kinds of problems that I think people run into in real deployments, I can't help feeling like we've somehow got this whole thing backwards. enable_wunk imagines that you want to plan as normal except with one particular plan type excluded from consideration. And maybe that makes sense if the point of the enable_wunk GUC is that the planner feature might be buggy and you might therefore want to turn it off to protect yourself, or if the planner feature might be expensive and you might want to turn it off to save cycles. But surely that's not the case with something like enable_seqscan or enable_indexscan. What I think we're mostly doing in the regression tests is shutting off every relevant type of plan except one. I theorize that what we actually want to do is tell the planner what we do want to happen, rather than what we don't want to happen, but we've got this weird set of GUCs that do the opposite of that and we're super-attached to them because they've existed forever. I don't really have a concrete proposal here, but I wonder if what we're actually talking about here is spending time and energy polishing a mechanism that nobody likes in the first place. -- Robert Haas EDB: http://www.enterprisedb.com
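The additive property discussed in this exchange can be illustrated with a toy cost model (plain Python with invented node costs; nothing here is actual planner code):

```python
# Toy cost model (invented numbers, not planner code) showing why an
# additive disable_cost keeps "how many sub-plans were disabled"
# comparable between plans, while an IEEE-infinity penalty masks it.

import math

DISABLE_COST = 1.0e10   # per-disabled-node penalty, as in costsize.c today

def total_cost(node_costs, disabled_flags, penalty):
    return sum(cost + (penalty if disabled else 0.0)
               for cost, disabled in zip(node_costs, disabled_flags))

# Plan A has one disabled node, plan B has two.
plan_a = total_cost([100.0, 200.0], [True, False], DISABLE_COST)
plan_b = total_cost([100.0, 200.0], [True, True], DISABLE_COST)
assert plan_a < plan_b             # additive: fewer disabled nodes wins

inf_a = total_cost([100.0, 200.0], [True, False], math.inf)
inf_b = total_cost([100.0, 200.0], [True, True], math.inf)
assert inf_a == inf_b == math.inf  # infinity: the distinction is lost
```

This is the "order of infinity" point in miniature: a separate disabled-node counter compared before the numeric cost would preserve the same ordering without inflating the cost values themselves.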
Silence Meson warning on HEAD
A warning was recently[0] introduced into the Meson build:

WARNING: Project targets '>=0.54' but uses feature introduced in '0.55.0': Passing executable/found program object to script parameter of add_dist_script

There are 3 ways to solve the issue that I have laid out in 3 separate patches that you can pick at your leisure:

1. version-check.diff: Wrap the offending line in a Meson version check.
2. perl-string.diff: Pass the perl program as a string via its .path() method.
3. meson-bump.diff: Bump the minimum meson version from 0.54 to 0.55, at least.

I chose to do some analysis on option 3. The other 2 options are perfectly valid. I looked at all the systems in the buildfarm, and found the following results. The first column is the list of systems we test HEAD against in the buildfarm. The second column is the Meson version available to those systems, based on what I could find. The third column is whether or not we test Meson on that system.

System                                       Meson Version  Tested
AlmaLinux 8                                  0.58.2
AlmaLinux 9                                  0.63.3
Alpine Linux 3.19.1                          1.3.0
Amazon Linux 2                               0.55.1
Amazon Linux 2023                            0.62.2
ArchLinux                                    1.4.0
CentOS 7                                     0.55.1
CentOS 7.1                                   0.55.1
CentOS 7.4                                   0.55.1
CentOS 7.9.2009                              0.55.1
CentOS Stream 8                              0.58.2
Debian 10                                    0.56.1
Debian 11                                    0.56.2
Debian 11.5                                  0.56.2
Debian 12                                    1.0.1
Debian 7.0
Debian 8.11
Debian 9
Debian 9.3
Debian unstable                              1.4.0          x
DragonFly BSD 6.2.2                          0.63.2
Fedora 38                                    1.0.1
Fedora 39                                    1.3.2          x
Fedora 40                                    1.3.2
FreeBSD 12.2
FreeBSD 13.1                                 0.57.1
FreeBSD 14                                   1.4.0
Loongnix-Server 8.4.0
macOS 13.6
macOS 14.0
macOS 14.0 / MacPorts                        1.4.0
macOS 14.3                                   1.4.0          x
NetBSD 10.0                                  1.2.3
NetBSD 9.2                                   1.2.3
OmniOS / illumos r151038
OpenBSD 6.8
OpenBSD 6.9
OpenBSD 7.3
OpenBSD 7.4                                  1.2.1
OpenIndiana/illumos hipster rolling release  1.4.0
openSUSE 15.0                                0.46.0
openSUSE 15.3                                0.54.2
Oracle Linux 9                               0.63.3
Photon 2.0
Photon 3.0
Raspbian 7.8                                 0.56.2
Raspbian 8.0                                 1.0.1
Red Hat Enterprise Linux 7                   0.55.1
Red Hat Enterprise Linux 7.1                 0.55.1
Red Hat Enterprise Linux 7.9                 0.55.1
Red Hat Enterprise Linux 9.2                 0.63.3
Solaris 11.4.42 CBE                          0.59.2
SUSE Linux Enterprise Server 12 SP5
SUSE Linux Enterprise Server 15 SP2
Ubuntu 16.04                                 0.40.1
Ubuntu 16.04.3                               0.40.1
Ubuntu 18.04                                 0.45.1
Ubuntu 18.04.5
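A rough sketch of how the first two options might look in meson.build (the script name here is a placeholder, not the real one; the actual changes are in the attached diffs):

```meson
# Option 1 (version-check.diff), sketched: only pass the found-program
# object where Meson is new enough to accept it (feature added in
# 0.55.0), otherwise fall back to the string form.
if meson.version().version_compare('>=0.55')
  meson.add_dist_script(perl, 'placeholder_dist_script.pl')
else
  meson.add_dist_script(perl.path(), 'placeholder_dist_script.pl')
endif

# Option 2 (perl-string.diff), sketched: always pass perl as a plain
# string via its .path() method, which 0.54 already accepts.
meson.add_dist_script(perl.path(), 'placeholder_dist_script.pl')
```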
Re: Why is parula failing?
On Sat, 30 Mar 2024 at 09:17, Tom Lane wrote: > > I wrote: > > I'd not looked closely enough at the previous failure, because > > now that I have, this is well out in WTFF territory: how can > > reltuples be greater than zero when relpages is zero? This can't > > be a state that autovacuum would have left behind, unless it's > > really seriously broken. I think we need to be looking for > > explanations like "memory stomp" or "compiler bug". > > ... in connection with which, I can't help noticing that parula > is using a very old compiler: > > configure: using compiler=gcc (GCC) 7.3.1 20180712 (Red Hat 7.3.1-17) > > From some quick checking around, that would have to be near the > beginning of aarch64 support in RHEL (Fedora hadn't promoted aarch64 > to a primary architecture until earlier that same year). It's not > exactly hard to believe that there were some lingering compiler bugs. > I wonder why parula is using that when its underlying system seems > markedly newer (the kernel at least has a recent build date). It could be, but wouldn't the bug have to relate to the locking code and be caused by some other backend stomping on the memory? Otherwise, shouldn't it be failing consistently every run rather than sporadically? David
Re: Crash on UNION with PG 17
On Thu, 28 Mar 2024 at 04:34, Regina Obe wrote: > CREATE TABLE edge_data AS > SELECT i AS edge_id, i + 1 AS start_node, i + 2 As end_node > FROM generate_series(1,10) AS i; > > WITH edge AS ( > SELECT start_node, end_node > FROM edge_data > WHERE edge_id = 1 > ) > SELECT start_node id FROM edge UNION > SELECT end_node FROM edge; As of d5d2205c8, this query should work as expected. Thank you for letting us know about this. David
Re: Properly pathify the union planner
On Fri, 29 Mar 2024 at 08:53, Tom Lane wrote: > On third thought, I'm not at all convinced that we even want to > invent this struct as compared to just adding another parameter > to subquery_planner. The problem with a struct is what happens > the next time we need to add a parameter? If we add yet another > function parameter, we can count on the compiler to complain > about call sites that didn't get the memo. Adding a field > within an existing struct provokes no such warning, leading > to situations with uninitialized fields that might accidentally > work during testing, but fail the minute they get to the field. I agree it's best to break callers that don't update their code to consider passing or not passing a SetOperationStmt. I've just committed a fix to do it that way. This also seems to be the path of least resistance, which also appeals. I opted to add a new test alongside the existing tests which validate set operations with an empty SELECT list work. The new tests include the variation that the set operation has both a materialized and non-materialized CTE as a child. This was only a problem with a materialized CTE, but I opted to include a non-materialized one as I don't expect that we'll get this exact problem again. I was just keen on getting more coverage with a couple of cheap tests. Thanks for your input on this. I'll review your other comments shortly. David
Re: Reports on obsolete Postgres versions
On Wed, Mar 13, 2024 at 02:04:16PM -0400, Robert Treat wrote:

> I tend to agree with Bruce, and major/minor seems to be the more common usage within the industry; iirc, debian, ubuntu, gnome, suse, and mariadb all use that nomenclature; and ISTR some distros who release packaged versions of postgres with custom patches applied (ie 12.4-2 for postgres 12.4 patchlevel 2).
>
> BTW, as a reminder, we do have this statement, in bold, in the "upgrading" section of the versioning page: "We always recommend that all users run the latest available minor release for whatever major version is in use." There are actually several other phrases and wordings on that page that could probably be propagated as replacement language in some of these other areas.

I ended up writing the attached doc patch. I found that some of our text was overly wordy, causing the impact of what we were trying to say to be lessened. We might want to go farther than this patch, but I think it is an improvement.

I also moved the text to the bottom of the section --- previously, our wording referenced minor releases, then we talked about major releases, and then minor releases. This gives a more natural flow.

-- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Only you can decide what is important to you.

diff --git a/templates/support/versioning.html b/templates/support/versioning.html index d48e11e0..cee06954 100644 --- a/templates/support/versioning.html +++ b/templates/support/versioning.html @@ -45,15 +45,8 @@ number, e.g. 9.5.3 to 9.5.4. Upgrading - -We always recommend that all users run the latest available minor -release for whatever major version is in use. - - - - -Major versions usually change the internal format of system tables and data -files. These changes are often complex, so we do not maintain backward +Major versions usually change the internal format of the system tables.
+These changes are often complex, so we do not maintain backward compatibility of all stored data. A dump/reload of the database or use of the pg_upgrade module is required for major upgrades. We also recommend reading the @@ -65,18 +58,25 @@ versions prior to doing so. -Upgrading to a minor release does not normally require a dump and restore; you -can stop the database server, install the updated binaries, and restart the -server. For some releases, manual changes may be required to complete the -upgrade, so always read the release notes before upgrading. +Minor release upgrades do not require a dump and restore; you simply stop +the database server, install the updated binaries, and restart the server. +Such upgrades might require manual changes to complete so always read +the release notes first. -While upgrading will always contain some level of risk, PostgreSQL minor releases -fix only frequently-encountered bugs, security -issues, and data corruption problems to reduce the risk associated with -upgrading. For minor releases, the community considers not upgrading to be -riskier than upgrading. +Minor releases only fix frequently-encountered bugs, security issues, and data corruption +problems, so such upgrades are very low risk. The community +considers performing minor upgrades to be less risky than continuing to +run superseded minor versions. + + + + +We recommend that users always run the latest minor release associated +with their major version. + Releases
Re: Popcount optimization using AVX512
On Tue, Apr 02, 2024 at 01:09:57AM +0300, Ants Aasma wrote: > On Tue, 2 Apr 2024 at 00:31, Nathan Bossart wrote: >> On Tue, Apr 02, 2024 at 12:11:59AM +0300, Ants Aasma wrote: >> > What about using the masking capabilities of AVX-512 to handle the >> > tail in the same code path? Masked out portions of a load instruction >> > will not generate an exception. To allow byte level granularity >> > masking, -mavx512bw is needed. Based on wikipedia this will only >> > disable this fast path on Knights Mill (Xeon Phi), in all other cases >> > VPOPCNTQ implies availability of BW. >> >> Sounds promising. IMHO we should really be sure that these kinds of loads >> won't generate segfaults and the like due to the masked-out portions. I >> searched around a little bit but haven't found anything that seemed >> definitive. > > Interestingly the Intel software developer manual is not exactly > crystal clear on how memory faults with masks work, but volume 2A > chapter 2.8 [1] does specify that MOVDQU8 is of exception class E4.nb > that supports memory fault suppression on page fault. Perhaps Paul or Akash could chime in here... -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Popcount optimization using AVX512
Here is a v19 of the patch set. I moved out the refactoring of the function pointer selection code to 0001. I think this is a good change independent of $SUBJECT, and I plan to commit this soon. In 0002, I changed the syslogger.c usage of pg_popcount() to use pg_number_of_ones instead. This is standard practice elsewhere where the popcount functions are unlikely to win. I'll probably commit this one soon, too, as it's even more trivial than 0001. 0003 is the AVX512 POPCNT patch. Besides refactoring out 0001, there are no changes from v18. 0004 is an early proof-of-concept for using AVX512 for the visibility map code. The code is missing comments, and I haven't performed any benchmarking yet, but I figured I'd post it because it demonstrates how it's possible to build upon 0003 in other areas. AFAICT the main open question is the function call overhead in 0003 that Alvaro brought up earlier. After 0002 is committed, I believe the only in-tree caller of pg_popcount() with very few bytes is bit_count(), and I'm not sure it's worth expending too much energy to make sure there are absolutely no regressions there. However, I'm happy to do so if folks feel that it is necessary, and I'd be grateful for thoughts on how to proceed on this one. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From cedad23b7b35e77fde164b1d577c37fb07a578c6 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Mon, 1 Apr 2024 16:37:53 -0500 Subject: [PATCH v19 1/4] refactor popcount function choosing --- src/port/pg_bitutils.c | 37 + 1 file changed, 9 insertions(+), 28 deletions(-) diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c index 1197696e97..28312f3dd9 100644 --- a/src/port/pg_bitutils.c +++ b/src/port/pg_bitutils.c @@ -148,8 +148,8 @@ pg_popcount_available(void) * the function pointers so that subsequent calls are routed directly to * the chosen implementation. 
*/ -static int -pg_popcount32_choose(uint32 word) +static inline void +choose_popcount_functions(void) { if (pg_popcount_available()) { @@ -163,45 +163,26 @@ pg_popcount32_choose(uint32 word) pg_popcount64 = pg_popcount64_slow; pg_popcount = pg_popcount_slow; } +} +static int +pg_popcount32_choose(uint32 word) +{ + choose_popcount_functions(); return pg_popcount32(word); } static int pg_popcount64_choose(uint64 word) { - if (pg_popcount_available()) - { - pg_popcount32 = pg_popcount32_fast; - pg_popcount64 = pg_popcount64_fast; - pg_popcount = pg_popcount_fast; - } - else - { - pg_popcount32 = pg_popcount32_slow; - pg_popcount64 = pg_popcount64_slow; - pg_popcount = pg_popcount_slow; - } - + choose_popcount_functions(); return pg_popcount64(word); } static uint64 pg_popcount_choose(const char *buf, int bytes) { - if (pg_popcount_available()) - { - pg_popcount32 = pg_popcount32_fast; - pg_popcount64 = pg_popcount64_fast; - pg_popcount = pg_popcount_fast; - } - else - { - pg_popcount32 = pg_popcount32_slow; - pg_popcount64 = pg_popcount64_slow; - pg_popcount = pg_popcount_slow; - } - + choose_popcount_functions(); return pg_popcount(buf, bytes); } -- 2.25.1 >From 038b74045b006c5d8a5470364f2041370ec0b083 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Mon, 1 Apr 2024 16:47:22 -0500 Subject: [PATCH v19 2/4] use pg_number_of_ones instead of pg_popcount for single byte --- src/backend/postmaster/syslogger.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/backend/postmaster/syslogger.c b/src/backend/postmaster/syslogger.c index 08efe74cc9..437947dbb9 100644 --- a/src/backend/postmaster/syslogger.c +++ b/src/backend/postmaster/syslogger.c @@ -898,7 +898,7 @@ process_pipe_input(char *logbuffer, int *bytes_in_logbuffer) if (p.nuls[0] == '\0' && p.nuls[1] == '\0' && p.len > 0 && p.len <= PIPE_MAX_PAYLOAD && p.pid != 0 && - pg_popcount((char *) &dest_flags, 1) == 1) + pg_number_of_ones[dest_flags] == 1) { List *buffer_list; ListCell *cell; -- 2.25.1 >From
73ee8d6018b047856e63ad075641a0dcfe889417 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Wed, 27 Mar 2024 16:39:24 -0500 Subject: [PATCH v19 3/4] AVX512 popcount support --- config/c-compiler.m4 | 58 ++ configure| 252 +++ configure.ac | 51 ++ meson.build | 87 + src/Makefile.global.in | 5 + src/include/pg_config.h.in | 12 ++ src/include/port/pg_bitutils.h | 15 ++ src/makefiles/meson.build| 4 +- src/port/Makefile| 11 ++ src/port/meson.build | 6 +- src/port/pg_bitutils.c | 7 +- src/port/pg_popcount_avx512.c| 49 ++ src/port/pg_popcount_avx512_choose.c | 71 src/test/regress/expected/bit.out| 24 +++ src/test/regress/sql/bit.sql | 4 + 15 files changed,
Re: Popcount optimization using AVX512
On Tue, 2 Apr 2024 at 00:31, Nathan Bossart wrote: > > On Tue, Apr 02, 2024 at 12:11:59AM +0300, Ants Aasma wrote: > > What about using the masking capabilities of AVX-512 to handle the > > tail in the same code path? Masked out portions of a load instruction > > will not generate an exception. To allow byte level granularity > > masking, -mavx512bw is needed. Based on wikipedia this will only > > disable this fast path on Knights Mill (Xeon Phi), in all other cases > > VPOPCNTQ implies availability of BW. > > Sounds promising. IMHO we should really be sure that these kinds of loads > won't generate segfaults and the like due to the masked-out portions. I > searched around a little bit but haven't found anything that seemed > definitive. Interestingly the Intel software developer manual is not exactly crystal clear on how memory faults with masks work, but volume 2A chapter 2.8 [1] does specify that MOVDQU8 is of exception class E4.nb that supports memory fault suppression on page fault. Regards, Ants Aasma [1] https://cdrdv2-public.intel.com/819712/253666-sdm-vol-2a.pdf
Re: [PoC] Federated Authn/z with OAUTHBEARER
On Thu, Mar 28, 2024 at 3:34 PM Daniel Gustafsson wrote:

> In jsonapi.c, makeJsonLexContextCstringLen initializes a JsonLexContext with palloc0 which would need to be ported over to use ALLOC for frontend code.

Seems reasonable (but see below, too).

> On that note, the errorhandling in parse_oauth_json() for content-type checks attempts to free the JsonLexContext even before it has been created. Here we can just return false.

Agreed. They're zero-initialized, so freeJsonLexContext() is safe IIUC, but it's clearer not to call the free function at all.

But for these additions:

> - makeJsonLexContextCstringLen(, resp->data, resp->len, PG_UTF8, true);
> + if (!makeJsonLexContextCstringLen(, resp->data, resp->len, PG_UTF8, true))
> + {
> + actx_error(actx, "out of memory");
> + return false;
> + }

...since we're using the stack-based API as opposed to the heap-based API, they shouldn't be possible to hit. Any failures in createStrVal() are deferred to parse time on purpose.

> - echo 'libpq must not be calling any function which invokes exit'; exit 1; \
> + echo 'libpq must not be calling any function which invokes exit'; \
> The offending codepath in libcurl was in the NTLM_WB module, a very old and obscure form of NTLM support which was replaced (yet remained in the tree) a long time ago by a full NTLM implementation. Based on the findings in this thread it was deprecated with a removal date set to April 2024 [0]. A bug in the 8.4.0 release however disconnected NTLM_WB from the build and given the lack of complaints it was decided to leave as is, so we can base our libcurl requirements on 8.4.0 while keeping the exit() check intact.

Of the Cirrus machines, it looks like only FreeBSD has a new enough libcurl for that. Ubuntu won't until 24.04, Debian Bookworm doesn't have it unless you're running backports, RHEL 9 is still on 7.x... I think requiring libcurl 8 is effectively saying no one will be able to use this for a long time.
Is there an alternative? > + else if (strcasecmp(content_type, "application/json") != 0) > This needs to handle parameters as well since it will now fail if the charset > parameter is appended (which undoubtedly will be pretty common). The easiest > way is probably to just verify the mediatype and skip the parameters since we > know it can only be charset? Good catch. application/json no longer defines charsets officially [1], so I think we should be able to just ignore them. The new strncasecmp needs to handle a spurious prefix, too; I have that on my TODO list. > + /* TODO investigate using conn->Pfdebug and CURLOPT_DEBUGFUNCTION here */ > + CHECK_SETOPT(actx, CURLOPT_VERBOSE, 1L, return false); > + CHECK_SETOPT(actx, CURLOPT_ERRORBUFFER, actx->curl_err, return false); > CURLOPT_ERRORBUFFER is the old and finicky way of extracting error messages, > we > should absolutely move to using CURLOPT_DEBUGFUNCTION instead. This new way doesn't do the same thing. Here's a sample error: connection to server at "127.0.0.1", port 56619 failed: failed to fetch OpenID discovery document: Weird server reply ( Trying 127.0.0.1:36647... Connected to localhost (127.0.0.1) port 36647 (#0) Mark bundle as not supporting multiuse HTTP 1.0, assume close after body Invalid Content-Length: value Closing connection 0 ) IMO that's too much noise. Prior to the change, the same error would have been connection to server at "127.0.0.1", port 56619 failed: failed to fetch OpenID discovery document: Weird server reply (Invalid Content-Length: value) The error buffer is finicky for sure, but it's also a generic one-line explanation of what went wrong... Is there an alternative API for that I'm missing? > + /* && response_code != 401 TODO */ ) > Why is this marked with a TODO, do you remember? 
Yeah -- I have a feeling that 401s coming back are going to need more helpful hints to the user, since it implies that libpq itself hasn't authenticated correctly as opposed to some user-related auth failure. I was hoping to find some sample behaviors in the wild and record those into the suite. > + print("# OAuth provider (PID $pid) is listening on port $port\n"); > Code running under Test::More need to use diag() for printing non-test output > like this. Ah, thanks. > +#if LIBCURL_VERSION_MAJOR <= 8 && LIBCURL_VERSION_MINOR < 4 I don't think this catches versions like 7.76, does it? Maybe `LIBCURL_VERSION_MAJOR < 8 || (LIBCURL_VERSION_MAJOR == 8 && LIBCURL_VERSION_MINOR < 4)`, or else `LIBCURL_VERSION_NUM < 0x080400`? > my $pid = open(my $read_fh, "-|", $ENV{PYTHON}, "t/oauth_server.py") > - // die "failed to start OAuth server: $!"; > + or die "failed to start OAuth server: $!"; > > - read($read_fh, $port, 7) // die "failed to read port number: $!"; > + read($read_fh, $port, 7) or die "failed to read port number: $!"; The first hunk here looks good (thanks for the catch!) but I think the second
Re: Security lessons from liblzma
On 2024-03-31 Su 17:12, Andres Freund wrote: Hi, On 2024-03-31 12:18:29 +0200, Michael Banck wrote: If you ask where they are maintained, the answer is here: https://salsa.debian.org/postgresql/postgresql/-/tree/17/debian/patches?ref_type=heads the other major versions have their own branch. Luckily these are all quite small, leaving little space to hide stuff. I'd still like to get rid of at least some of them. I've previously proposed a patch to make pkglibdir configurable, I think we should just go for that. For the various defines, ISTM we should just make them #ifndef guarded, then they could be overridden by defining them at configure time. Some of them, like DEFAULT_PGSOCKET_DIR seem to be overridden by just about every distro. And others would be nice to easily override anyway, I e.g. dislike the default DEFAULT_PAGER value. +1 to this proposal. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Re: Combine Prune and Freeze records emitted by vacuum
On 01/04/2024 20:22, Melanie Plageman wrote:

From 17e183835a968e81daf7b74a4164b243e2de35aa Mon Sep 17 00:00:00 2001 From: Melanie Plageman Date: Fri, 29 Mar 2024 19:43:09 -0400 Subject: [PATCH v11 3/7] Introduce PRUNE_DO_* actions

We will eventually take additional actions in heap_page_prune() at the discretion of the caller. For now, introduce these PRUNE_DO_* macros and turn mark_unused_now, a parameter to heap_page_prune(), into a PRUNE_DO_ action.

--- src/backend/access/heap/pruneheap.c | 51 ++-- src/backend/access/heap/vacuumlazy.c | 11 -- src/include/access/heapam.h | 13 ++- 3 files changed, 46 insertions(+), 29 deletions(-)

diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c index fb0ad834f1b..30965c3c5a1 100644 --- a/src/backend/access/heap/pruneheap.c +++ b/src/backend/access/heap/pruneheap.c @@ -29,10 +29,11 @@ /* Working data for heap_page_prune and subroutines */ typedef struct { + /* PRUNE_DO_* arguments */ + uint8 actions;

I wasn't sure if actions is a good name. What do you think?

Committed this part, with the name 'options'. There's some precedent for that in heap_insert(). I decided to keep it a separate bool field here in the PruneState struct, though, and only changed it in the heap_page_prune() function signature. It didn't feel worth the code churn here, and 'prstate.mark_unused_now' is shorter than "(prstate.options & HEAP_PRUNE_PAGE_MARK_UNUSED_NOW) != 0" anyway.

-- Heikki Linnakangas Neon (https://neon.tech)
Re: Statistics Import and Export
Corey Huinker writes: > A boolean is what we had before, I'm quite comfortable with that, and it > addresses my silent-failure concerns. WFM. regards, tom lane
Re: pg_combinebackup --copy-file-range
On Tue, Apr 2, 2024 at 8:43 AM Tomas Vondra wrote:

> And I think he's right, and my tests confirm this. I did a trivial patch to align the blocks to 8K boundary, by forcing the header to be a multiple of 8K (I think 4K alignment would be enough). See the 0001 patch that does this.
>
> And if I measure the disk space used by pg_combinebackup, and compare the results with results without the patch ([1] from a couple days back), I see this:
>
>    pct    not aligned    aligned
>    -----------------------------
>     1%           689M        19M
>    10%          3172M        22M
>    20%         13797M        27M
>
> Yes, those numbers are correct. I didn't believe this at first, but the backups are valid/verified, checksums are OK, etc. BTRFS has similar numbers (e.g. drop from 20GB to 600MB).

Fantastic.

> I think we absolutely need to align the blocks in the incremental files, and I think we should do that now. I think 8K would work, but maybe we should add alignment parameter to basebackup & manifest?
>
> The reason why I think maybe this should be a basebackup parameter is the recent discussion about large fs blocks - it seems to be in the works, so maybe better to be ready and not assume all fs have 4K.
>
> And I think we probably want to do this now, because this affects all tools dealing with incremental backups - even if someone writes a custom version of pg_combinebackup, it will have to deal with misaligned data. Perhaps there might be something like pg_basebackup that "transforms" the data received from the server (and also the backup manifest), but that does not seem like a great direction.

+1, and I think BLCKSZ is the right choice.

> I was very puzzled by the awful performance on ZFS. When every other fs (EXT4/XFS/BTRFS) took 150-200 seconds to run pg_combinebackup, it took 900-1000 seconds on ZFS, no matter what I did. I tried all the tuning advice I could think of, with almost no effect.
>
> Ultimately I decided that it probably is the "no readahead" behavior I've observed on ZFS.
> I assume it's because it doesn't use the page cache where the regular readahead is detected etc. And there's no prefetching in pg_combinebackup, so I decided to do an experiment and added a trivial explicit prefetch when reconstructing the file - every time we'd read data from a file, we do posix_fadvise for up to 128 blocks ahead (similar to what bitmap heap scan code does). See 0002.
>
> And tadaaa - the duration dropped from 900-1000 seconds to only about 250-300 seconds, so an improvement of a factor of 3-4x. I think this is pretty massive.

Interesting. ZFS certainly has its own prefetching heuristics with lots of logic and settings, but it could be that it's using strict-next-block detection of access pattern (ie what I called effective_io_readahead_window=0 in the streaming I/O thread) instead of a window (ie like the Linux block device level read ahead where, AFAIK, if you access anything in that sliding window it is triggered), and perhaps your test has a lot of non-contiguous but close-enough blocks? (Also reminds me of the similar discussion on the BHS thread about distinguishing sequential access from mostly-sequential-but-with-lots-of-holes-like-Swiss-cheese, and the fine line between them.)

You could double-check this and related settings (for example I think it might disable itself automatically if you're on a VM with small RAM size): https://openzfs.github.io/openzfs-docs/Performance%20and%20Tuning/Module%20Parameters.html#zfs-prefetch-disable

> There's a couple more interesting ZFS details - the prefetching seems to be necessary even when using copy_file_range() and don't need to read the data (to calculate checksums). This is why the "manifest=off" chart has the strange group of high bars at the end - the copy cases are fast because prefetch happens, but if we switch to copy_file_range() there are no prefetches and it gets slow.
Hmm, at a guess, it might be due to prefetching the dnode (root object for a file) and block pointers, ie the structure but not the data itself. > This is a bit bizarre, especially because the manifest=on cases are > still fast, exactly because the pread + prefetching still happens. I'm > sure users would find this puzzling. > > Unfortunately, the prefetching is not beneficial for all filesystems. > For XFS it does not seem to make any difference, but on BTRFS it seems > to cause a regression. > > I think this means we may need a "--prefetch" option, that'd force > prefetching, probably both before pread and copy_file_range. Otherwise > people on ZFS are doomed and will have poor performance. Seems reasonable if you can't fix it by tuning ZFS. (Might also be an interesting research topic for a potential ZFS patch: prefetch_swiss_cheese_window_size. I will not be nerd-sniped into reading the relevant source today, but I'll figure it out soonish...) > So I took a stab
Re: Popcount optimization using AVX512
On Tue, Apr 02, 2024 at 12:11:59AM +0300, Ants Aasma wrote: > What about using the masking capabilities of AVX-512 to handle the > tail in the same code path? Masked out portions of a load instruction > will not generate an exception. To allow byte level granularity > masking, -mavx512bw is needed. Based on wikipedia this will only > disable this fast path on Knights Mill (Xeon Phi), in all other cases > VPOPCNTQ implies availability of BW. Sounds promising. IMHO we should really be sure that these kinds of loads won't generate segfaults and the like due to the masked-out portions. I searched around a little bit but haven't found anything that seemed definitive. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Psql meta-command conninfo+
> Here I considered your suggestion (Sami and Álvaro's). However, I haven't yet > added the links for the functions system_user(), current_user(), and > session_user(). > I'm not sure how to do it. Any suggestion on how to create/add the link? Here is an example [1] where the session information functions are referenced. The public doc for this example is in [2]. Something similar can be done here. For example: The session user's name; see the session_user() function for more details. [1] https://github.com/postgres/postgres/blob/master/doc/src/sgml/system-views.sgml#L1663-L1664 [2] https://www.postgresql.org/docs/16/view-pg-locks.html Regards, Sami
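For what it's worth, the usual way to create such a link in the SGML docs is inline <function> markup plus an <xref> pointing at the section or table id. A sketch of what the sentence could look like follows; the linkend id here is from memory and should be verified against doc/src/sgml/func.sgml before use:

```sgml
The session user's name; see the <function>session_user()</function>
function in <xref linkend="functions-info-session-table"/> for more
details.
```

The <xref> renders as a numbered table/section reference, so the docs stay correct even if the target table is renumbered.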
Re: Statistics Import and Export
> > If we are envisioning that the function might emit multiple warnings > per call, a useful definition could be to return the number of > warnings (so zero is good, not-zero is bad). But I'm not sure that's > really better than a boolean result. pg_dump/pg_restore won't notice > anyway, but perhaps other programs using these functions would care. > A boolean is what we had before, I'm quite comfortable with that, and it addresses my silent-failure concerns.
Re: Popcount optimization using AVX512
On Mon, 1 Apr 2024 at 18:53, Nathan Bossart wrote: > > On Mon, Apr 01, 2024 at 01:06:12PM +0200, Alvaro Herrera wrote: > > On 2024-Mar-31, Nathan Bossart wrote: > >> +popcnt = _mm512_reduce_add_epi64(accum); > >> +return popcnt + pg_popcount_fast(buf, bytes); > > > > Hmm, doesn't this arrangement cause an extra function call to > > pg_popcount_fast to be used here? Given the level of micro-optimization > > being used by this code, I would have thought that you'd have tried to > > avoid that. (At least, maybe avoid the call if bytes is 0, no?) > > Yes, it does. I did another benchmark on very small arrays and can see the > overhead. This is the time in milliseconds to run pg_popcount() on an > array 1 billion times:
>
>   size (bytes)    HEAD        AVX512-POPCNT
>   1               1707.685    3480.424
>   2               1926.694    4606.182
>   4               3210.412    5284.506
>   8               1920.703    3640.968
>   16              2936.91     4045.586
>   32              3627.956    5538.418
>   64              5347.213    3748.212
>
> I suspect that anything below 64 bytes will see this regression, as that is > the earliest point where there are enough bytes for ZMM registers. What about using the masking capabilities of AVX-512 to handle the tail in the same code path? Masked out portions of a load instruction will not generate an exception. To allow byte level granularity masking, -mavx512bw is needed. Based on wikipedia this will only disable this fast path on Knights Mill (Xeon Phi), in all other cases VPOPCNTQ implies availability of BW. Attached is an example of what I mean. I did not have a machine to test it with, but the code generated looks sane. I added the clang pragma because it insisted on unrolling otherwise and based on how the instruction dependencies look that is probably not too helpful even for large cases (needs to be tested). The configure check and compile flags of course need to be amended for BW.
Regards, Ants Aasma

diff --git a/src/port/pg_popcount_avx512.c b/src/port/pg_popcount_avx512.c
index f86558d1ee5..7fb2ada16c9 100644
--- a/src/port/pg_popcount_avx512.c
+++ b/src/port/pg_popcount_avx512.c
@@ -30,20 +30,27 @@
 uint64
 pg_popcount_avx512(const char *buf, int bytes)
 {
-	uint64		popcnt;
+	__m512i		val, cnt;
+	__mmask64	remaining_mask;
 	__m512i		accum = _mm512_setzero_si512();
 
-	for (; bytes >= sizeof(__m512i); bytes -= sizeof(__m512i))
+	#pragma clang loop unroll(disable)
+	for (; bytes > sizeof(__m512i); bytes -= sizeof(__m512i))
 	{
-		const __m512i val = _mm512_loadu_si512((const __m512i *) buf);
-		const __m512i cnt = _mm512_popcnt_epi64(val);
+		val = _mm512_loadu_si512((const __m512i *) buf);
+		cnt = _mm512_popcnt_epi64(val);
 		accum = _mm512_add_epi64(accum, cnt);
 		buf += sizeof(__m512i);
 	}
 
-	popcnt = _mm512_reduce_add_epi64(accum);
-	return popcnt + pg_popcount_fast(buf, bytes);
+	remaining_mask = ~0ULL >> (sizeof(__m512i) - bytes);
+	val = _mm512_maskz_loadu_epi8(remaining_mask, (const __m512i *) buf);
+	cnt = _mm512_popcnt_epi64(val);
+
+	accum = _mm512_add_epi64(accum, cnt);
+
+	return _mm512_reduce_add_epi64(accum);
 }
 
 #endif							/* TRY_POPCNT_FAST */
Re: Statistics Import and Export
Corey Huinker writes: > Any thoughts about going back to having a return value, a caller could then > see that the function returned NULL rather than whatever the expected value > was (example: TRUE)? If we are envisioning that the function might emit multiple warnings per call, a useful definition could be to return the number of warnings (so zero is good, not-zero is bad). But I'm not sure that's really better than a boolean result. pg_dump/pg_restore won't notice anyway, but perhaps other programs using these functions would care. regards, tom lane
Re: Security lessons from liblzma
Bruce Momjian writes: > On Mon, Apr 1, 2024 at 03:17:55PM -0400, Tom Lane wrote: >> AFAIK, every open-source distro makes all the pieces needed to >> rebuild their packages available to users. It wouldn't be much >> of an open-source situation otherwise. You do have to learn >> their package build process. > I wasn't clear if all the projects provide a source tree that can be > verified against the project's source tree, and then independent > patches, or if the patches were integrated and therefore harder to > verify against the project source tree. In the systems I'm familiar with, an SRPM-or-equivalent includes the pristine upstream tarball and then some patch files to apply to it. The patch files have to be maintained anyway, and if you don't ship them then you're not shipping "source". regards, tom lane
Re: On disable_cost
Robert Haas writes: > One of the things I realized relatively early is that the patch does > nothing to propagate disable_cost upward through the plan tree. > ... > After straining my brain over various plan changes for a long time, > and hacking on the code somewhat, I realized that just propagating the > Boolean upward is insufficient to set things right. That's basically > because I was being dumb when I said this: >> I don't think we should care how MANY disabled nodes appear in a >> plan, particularly. Very interesting, thanks for the summary. So the fact that disable_cost is additive across plan nodes is actually a pretty important property of the current setup. I think this is closely related to one argument you made against my upthread idea of using IEEE Infinity for disable_cost: that'd mask whether more than one of the sub-plans had been disabled. > ... And while there's probably more than one way > to make it work, the easiest thing seems to be to just have a > disabled-counter in every node that gets initialized to the total > disabled-counter values of all of its children, and then you add 1 if > that node is itself doing something that is disabled, i.e. the exact > opposite of what I said in the quote above. Yeah, that seems like the next thing to try if anyone plans to pursue this further. That'd essentially do what we're doing now except that disable_cost is its own "order of infinity", entirely separate from normal costs. regards, tom lane
Re: Security lessons from liblzma
On Mon, Apr 1, 2024 at 03:17:55PM -0400, Tom Lane wrote: > Bruce Momjian writes: > > I was more asking if users have access to patches so they could recreate > > the build by using the Postgres git tree and supplied OS-specific > > patches. > > AFAIK, every open-source distro makes all the pieces needed to > rebuild their packages available to users. It wouldn't be much > of an open-source situation otherwise. You do have to learn > their package build process. I wasn't clear if all the projects provide a source tree that can be verified against the project's source tree, and then independent patches, or if the patches were integrated and therefore harder to verify against the project source tree. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Only you can decide what is important to you.
Re: Security lessons from liblzma
On Fri, Mar 29, 2024 at 06:37:24PM -0400, Bruce Momjian wrote: > You might have seen reports today about a very complex exploit added to > recent versions of liblzma. Fortunately, it was only enabled two months > ago and has not been pushed to most stable operating systems like Debian > and Ubuntu. The original detection report is: > > https://www.openwall.com/lists/oss-security/2024/03/29/4 I was watching this video about the exploit: https://www.youtube.com/watch?v=bS9em7Bg0iU and at 2:29, they mention "hero software developer", our own Andres Freund as the person who discovered the exploit. I noticed the author's name at the openwall email link above, but I assumed it was someone else with the same name. They mentioned it was found while researching Postgres performance, and then I noticed the email address matched! I thought the analogy he uses at the end of the video is very clear. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Only you can decide what is important to you.
Re: On disable_cost
On Tue, Mar 12, 2024 at 1:32 PM Tom Lane wrote: > > 1. You stated that it changes lots of plans in the regression tests, > > but you haven't provided any sort of analysis of why those plans > > changed. I'm kind of surprised that there would be "tons" of plan > > changes. You (or someone) should look into why that's happening. > > I've not read the patch, but given this description I would expect > there to be *zero* regression changes --- I don't think we have any > test cases that depend on disable_cost being finite. If there's more > than zero changes, that very likely indicates a bug in the patch. > Even if there are places where the output legitimately changes, you > need to justify each one and make sure that you didn't invalidate the > intent of that test case. I spent some more time poking at this patch. It's missing a ton of important stuff and is wrong in a whole bunch of really serious ways, and I'm not going to try to mention all of them in this email. But I do want to talk about some of the more interesting realizations that came to me as I was working my way through this. One of the things I realized relatively early is that the patch does nothing to propagate disable_cost upward through the plan tree. That means that if you have a choice between, say, Sort-over-Append-over-SeqScan and MergeAppend-over-IndexScan, as we do in the regression tests, disabling IndexScan doesn't change the plan with the patch applied, as it does in master. That's because only the IndexScan node ends up flagged as disabled. Once we start stacking other plan nodes on top of disabled plan nodes, the resultant plans are at no disadvantage compared to plans containing no disabled nodes. The IndexScan plan survives initially, despite being disabled, because it's got a sort order. That lets us use it to build a MergeAppend path, and that MergeAppend path is not disabled, and compares favorably on cost. 
After straining my brain over various plan changes for a long time, and hacking on the code somewhat, I realized that just propagating the Boolean upward is insufficient to set things right. That's basically because I was being dumb when I said this: > I don't think we should care how MANY disabled nodes appear in a > plan, particularly. Suppose we try to plan a Nested Loop with SeqScan disabled, but there's no alternative to a SeqScan for the outer side of the join. If we suppose an upward-propagating Boolean, every path for the join is disabled because every path for the outer side is disabled. That means that we have no reason to avoid paths where the inner side also uses a disabled path. When we loop over the inner rel's pathlist looking for ways to build a path for the join, we may find some disabled paths there, and the join paths we build from those paths are disabled, but so are the join paths where we use a non-disabled path on the inner side. So those paths are just competing with each other on cost, and the path type that is supposedly disabled on the outer side of the join ends up not really being very disabled at all. More precisely, if disabling a plan type causes paths to be discarded completely before the join paths are constructed, then they actually do get removed from consideration. But if those paths make it into inner rel's path list, even way out towards the end, then paths derived from them can jump to the front of the joinrel's path list. The same kind of problem happens with Append or MergeAppend nodes. The regression tests expect that we'll avoid disabled plan types whenever possible even if we can't avoid them completely; for instance, the matest0 table intentionally omits an index on one child table. Disabling sequential scans is expected to disable them for all of the other child tables even though for that particular child table there is no other option. 
But that behavior is hard to achieve if every path for the parent rel is "equally disabled". You either want the path that uses only the one required SeqScan to be not-disabled even though one of its children is disabled ... or you want the disabled flag to be more than a Boolean. And while there's probably more than one way to make it work, the easiest thing seems to be to just have a disabled-counter in every node that gets initialized to the total disabled-counter values of all of its children, and then you add 1 if that node is itself doing something that is disabled, i.e. the exact opposite of what I said in the quote above. I haven't done enough work to know whether that would match the current behavior, let alone whether it would have acceptable performance, and I'm not at all convinced that's the right direction anyway. Actually, at the moment, I don't have a very good idea at all what the right direction is. I do have a feeling that this patch is probably not going in the right direction, but I might be wrong about that too. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Allow non-superuser to cancel superuser tasks.
On Mon, Apr 01, 2024 at 02:29:29PM +, Leung, Anthony wrote: > I've attached the updated patch with some improvements. Thanks!

+ pg_signal_autovacuum
+ Allow terminating backend running autovacuum

I think we should be more precise here by calling out the exact types of workers: "Allow signaling autovacuum worker processes to..."

-		if ((!OidIsValid(proc->roleId) || superuser_arg(proc->roleId)) &&
-			!superuser())
-			return SIGNAL_BACKEND_NOSUPERUSER;
-
-		/* Users can signal backends they have role membership in. */
-		if (!has_privs_of_role(GetUserId(), proc->roleId) &&
-			!has_privs_of_role(GetUserId(), ROLE_PG_SIGNAL_BACKEND))
-			return SIGNAL_BACKEND_NOPERMISSION;
+		if (!superuser())
+		{
+			if (!OidIsValid(proc->roleId))
+			{
+				/*
+				 * We only allow user with pg_signal_autovacuum role to terminate
+				 * autovacuum worker as an exception.
+				 */
+				if (!(pg_stat_is_backend_autovac_worker(proc->backendId) &&
+					  has_privs_of_role(GetUserId(), ROLE_PG_SIGNAL_AUTOVACUUM)))
+					return SIGNAL_BACKEND_NOSUPERUSER;
+			}
+			else
+			{
+				if (superuser_arg(proc->roleId))
+					return SIGNAL_BACKEND_NOSUPERUSER;
+
+				/* Users can signal backends they have role membership in. */
+				if (!has_privs_of_role(GetUserId(), proc->roleId) &&
+					!has_privs_of_role(GetUserId(), ROLE_PG_SIGNAL_BACKEND))
+					return SIGNAL_BACKEND_NOPERMISSION;
+			}
+		}

I don't think we should rely on !OidIsValid(proc->roleId) for signaling autovacuum workers. That might not always be true, and I don't see any need to rely on that otherwise. IMHO we should just add a few lines before the existing code, which doesn't need to be changed at all:

	if (pg_stat_is_backend_autovac_worker(proc->backendId) &&
		!has_privs_of_role(GetUserId(), ROLE_PG_SIGNAL_AUTOVACUUM))
		return SIGNAL_BACKEND_NOAUTOVACUUM;

I also think we need to return something besides SIGNAL_BACKEND_NOSUPERUSER in this case. Specifically, we probably need to introduce a new value and provide the relevant error messages in pg_cancel_backend() and pg_terminate_backend().
+/* --
+ * pg_stat_is_backend_autovac_worker() -
+ *
+ * Return whether the backend of the given backend id is of type autovacuum worker.
+ */
+bool
+pg_stat_is_backend_autovac_worker(BackendId beid)
+{
+	PgBackendStatus *ret;
+
+	Assert(beid != InvalidBackendId);
+
+	ret = pgstat_get_beentry_by_backend_id(beid);
+
+	if (!ret)
+		return false;
+
+	return ret->st_backendType == B_AUTOVAC_WORKER;
+}

Can we have this function return the backend type so that we don't have to create a new function for every possible type? That might be handy in the future. I haven't looked too closely, but I'm pretty skeptical that the test suite in your patch would be stable. Unfortunately, I don't have any better ideas at the moment besides not adding a test for this new role. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Streaming read-ready sequential scan code
On Wed, Mar 27, 2024 at 08:47:03PM -0400, Melanie Plageman wrote: > On Fri, Mar 8, 2024 at 4:56 PM Melanie Plageman > wrote: > > > > On Sat, Mar 02, 2024 at 06:07:48PM -0500, Melanie Plageman wrote: > > > On Wed, Feb 28, 2024 at 12:30 PM Melanie Plageman > > > wrote: > > > > > > > > On Mon, Feb 26, 2024 at 03:56:57PM -0500, Melanie Plageman wrote: > > > > > On Mon, Feb 19, 2024 at 6:05 PM Melanie Plageman > > > > > wrote: > > > > > > > > > > > > On Mon, Jan 29, 2024 at 4:17 PM Melanie Plageman > > > > > > wrote: > > > > > > > > > > > > > > There is an outstanding question about where to allocate the > > > > > > > PgStreamingRead object for sequential scans > > > > > > > > > > > > I've written three alternative implementations of the actual > > > > > > streaming > > > > > > read user for sequential scan which handle the question of where to > > > > > > allocate the streaming read object and how to handle changing scan > > > > > > direction in different ways. > > > > > > > > > > > > Option A) > > > > > > https://github.com/melanieplageman/postgres/tree/seqscan_pgsr_initscan_allocation > > > > > > - Allocates the streaming read object in initscan(). Since we do not > > > > > > know the scan direction at this time, if the scan ends up not being > > > > > > a > > > > > > forwards scan, the streaming read object must later be freed -- so > > > > > > this will sometimes allocate a streaming read object it never uses. > > > > > > - Only supports ForwardScanDirection and once the scan direction > > > > > > changes, streaming is never supported again -- even if we return to > > > > > > ForwardScanDirection > > > > > > - Must maintain a "fallback" codepath that does not use the > > > > > > streaming read API > > > > > > > > > > Attached is a version of this patch which implements a "reset" > > > > > function for the streaming read API which should be cheaper than the > > > > > full pg_streaming_read_free() on rescan. 
This can easily be ported to > > > > > work on any of my proposed implementations (A/B/C). I implemented it > > > > > on A as an example. > > > > > > > > Attached is the latest version of this patchset -- rebased in light of > > > > Thomas' updates to the streaming read API [1]. I have chosen the > > > > approach I think we should go with. It is a hybrid of my previously > > > > proposed approaches. > > > > > > While investigating some performance concerns, Andres pointed out that > > > the members I add to HeapScanDescData in this patch push rs_cindex and > > > rs_ntuples to another cacheline and introduce a 4-byte hole. Attached > > > v4's HeapScanDescData is as well-packed as on master and its members > > > are reordered so that rs_cindex and rs_ntuples are back on the second > > > cacheline of the struct's data. > > > > I did some additional profiling and realized that dropping the > > unlikely() from the places we check rs_inited frequently was negatively > > impacting performance. v5 adds those back and also makes a few other > > very minor cleanups. > > > > Note that this patch set has a not yet released version of Thomas > > Munro's Streaming Read API with a new ramp-up logic which seems to fix a > > performance issue I saw with my test case when all of the sequential > > scan's blocks are in shared buffers. Once he sends the official new > > version, I will rebase this and point to his explanation in that thread. > > Attached v6 has the version of the streaming read API mentioned here > [1]. This resolved the fully-in-shared-buffers regressions > investigated in that thread by Andres, Bilal, and Thomas. Attached v7 has version 14 of the streaming read API as well as a few small tweaks to comments and code. I noticed that 0001 in the set posed a small regression from master for a sequential scan of a relation already in shared buffers.
While investigating this, I saw that heapfetchbuf() was still not being inlined (compiled at -O2) and when I promoted heapfetchbuf() from static inline to static pg_attribute_always_inline, most of the very small regression I saw went away. I don't know if I squashed the issue entirely, though. - Melanie >From db6219d9ea689fdd0150aa0fbbeaa2ee11364aa8 Mon Sep 17 00:00:00 2001 From: Melanie Plageman Date: Sat, 27 Jan 2024 18:39:37 -0500 Subject: [PATCH v7 1/4] Split heapgetpage into two parts heapgetpage(), a per-block utility function used in heap scans, read a passed-in block into a buffer and then, if page-at-a-time processing was enabled, pruned the page and built an array of its visible tuples. This was used for sequential and sample scans. Future commits will add support for streaming reads. The streaming read API will read in the blocks specified by a callback, but any significant per-page processing should be done synchronously on the buffer yielded by the streaming read API. To support this, separate the logic in heapgetpage() to read a block into a buffer from that which prunes the page and builds an array of the visible tuples. The former
Re: Statistics Import and Export
> > > I still think that we could just declare the function strict, if we > use the variadic-any approach. Passing a null in any position is > indisputable caller error. However, if you're allergic to silently > doing nothing in such a case, we could have pg_set_attribute_stats > check each argument and throw an error. (Or warn and keep going; > but according to the design principle I posited earlier, this'd be > the sort of thing we don't need to tolerate.) > Any thoughts about going back to having a return value, a caller could then see that the function returned NULL rather than whatever the expected value was (example: TRUE)?
RE: Psql meta-command conninfo+
Friends, good afternoon! (v25) >if (pset.version >= 14) > one thing; >else if (pset.version > 90500) > second thing; > else > third thing; > >This also appears where you add the GSSAPI columns; and it's also in the >final part where you append the FROM clause, though it looks a bit >different there. > >- You have three lines to output a semicolon at the end of the query >based on version number. Remove the first two, and just have a final >one where the semicolon is added unconditionally. Adjustment made. > - I don't think one for each item in the docs is reasonable. > There's too much vertical whitespace in the output. Maybe do this > instead: > >[...] >database connection. When + is appended, >more details about the connection are displayed in table >format: > > > > Database: The name of the current > database on this connection. > > > > Authenticated User: The authenticated > user at the time of psql connection with the server. > > > ... > Adjustment made. But I think it needs a review glance. > - This text is wrong to start with "Returns the": > > System User: Returns the authentication method and the identity (if > any) that the user presented during the authentication cycle before > they were assigned a database role. It is represented as > auth_method:identity or NULL if the user has not been authenticated. > > That minor point aside, I disagree with Sami about repeating the docs > for system_user() here. I would just say "The authentication data > provided for this connection; see the function system_user() for more > details." with a link to the appropriate section of the docs. Making > us edit this doc if we ever modify the behavior of the function is not > great. Here I considered your suggestion (Sami and Álvaro's). However, I haven't yet added the links for the functions system_user(), current_user(), and session_user(). I'm not sure how to do it. Any suggestion on how to create/add the link? Regards, Maiquel Grassi. 
v25-0001-psql-meta-command-conninfo-plus.patch Description: v25-0001-psql-meta-command-conninfo-plus.patch
Re: Broken error detection in genbki.pl
On 2024-03-20 We 12:44, Tom Lane wrote: David Wheeler complained over in [1] that genbki.pl fails to produce a useful error message if there's anything wrong in a catalog data file. He's right about that, but the band-aid patch he proposed doesn't improve the situation much. The actual problem is that key error messages in genbki.pl expect $bki_values->{line_number} to be set, which it is not because we're invoking Catalog::ParseData with $preserve_formatting = 0, and that runs a fast path that doesn't do line-by-line processing and hence doesn't/can't fill that field. I'm quite sure that those error cases worked as-intended when first coded. I did not check the git history, but I suppose that somebody added the non-preserve_formatting fast path later without any consideration for the damage it was doing to error reporting ability. I'm of the opinion that this change was penny-wise and pound-foolish. On my machine, the time to re-make the bki files with the fast path enabled is 0.597s, and with it disabled (measured with the attached quick-hack patch) it's 0.612s. Is that savings worth future hackers having to guess what they broke and where in a large .dat file? As you can see from the quick-hack patch, it's kind of a mess to use the $preserve_formatting = 1 case, because there are a lot of loops that have to be taught to ignore comment lines, which we don't really care about except in reformat_dat_file.pl. What I suggest doing, but have not yet coded up, is to nuke the fast path in Catalog::ParseData and reinterpret the $preserve_formatting argument as controlling whether comments and whitespace are entered in the output data structure, but not whether we parse it line-by-line. That should fix this problem with zero change in the callers, and also buy back a little bit of the cost compared to this quick hack. Thoughts? 
regards, tom lane [1] https://www.postgresql.org/message-id/60EF4E11-BC1C-4034-B37B-695662D28AD2%40justatheory.com Makes sense. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Re: pg_upgrade failing for 200+ million Large Objects
On Mon, Apr 01, 2024 at 03:28:26PM -0400, Tom Lane wrote: > Nathan Bossart writes: >> The one design point that worries me a little is the non-configurability of >> --transaction-size in pg_upgrade. I think it's fine to default it to 1,000 >> or something, but given how often I've had to fiddle with >> max_locks_per_transaction, I'm wondering if we might regret hard-coding it. > > Well, we could add a command-line switch to pg_upgrade, but I'm > unconvinced that it'd be worth the trouble. I think a very large > fraction of users invoke pg_upgrade by means of packager-supplied > scripts that are unlikely to provide a way to pass through such > a switch. I'm inclined to say let's leave it as-is until we get > some actual field requests for a switch. Okay. I'll let you know if I see anything. IIRC usually the pg_dump side of pg_upgrade is more prone to lock exhaustion, so you may very well be right that this is unnecessary. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: pg_upgrade failing for 200+ million Large Objects
Nathan Bossart writes: > Sorry for taking so long to get back to this one. Overall, I think the > code is in decent shape. Thanks for looking at it! > The one design point that worries me a little is the non-configurability of > --transaction-size in pg_upgrade. I think it's fine to default it to 1,000 > or something, but given how often I've had to fiddle with > max_locks_per_transaction, I'm wondering if we might regret hard-coding it. Well, we could add a command-line switch to pg_upgrade, but I'm unconvinced that it'd be worth the trouble. I think a very large fraction of users invoke pg_upgrade by means of packager-supplied scripts that are unlikely to provide a way to pass through such a switch. I'm inclined to say let's leave it as-is until we get some actual field requests for a switch. regards, tom lane
Re: Statistics Import and Export
Corey Huinker writes: > So what's the behavior when the user fails to supply a parameter that is > currently NOT NULL checked (example: avg_width)? Is that a WARN-and-exit? I still think that we could just declare the function strict, if we use the variadic-any approach. Passing a null in any position is indisputable caller error. However, if you're allergic to silently doing nothing in such a case, we could have pg_set_attribute_stats check each argument and throw an error. (Or warn and keep going; but according to the design principle I posited earlier, this'd be the sort of thing we don't need to tolerate.) regards, tom lane
RE: Psql meta-command conninfo+
Database: The database name of the connection. Authenticated User: The authenticated database user of the connection. Socket Directory: The Unix socket directory of the connection. NULL if host or hostaddr are specified. Host: The host name or address of the connection. NULL if a Unix socket is used. Server Port: The port number of the server for this connection. NULL if a Unix socket is used. Server Address: The IP address of the host name. NULL if a Unix socket is used. Client Address: The IP address of the client connected to this backend. NULL if a Unix socket is used. Client Port: The port of the client connected to this backend. NULL if a Unix socket is used. Backend PID: The process id of the backend for the connection. Application name: psql is the default for a psql connection unless otherwise specified in the configuration parameter. -//- Hi Sami, I considered your suggestions and Álvaro's. Regards, Maiquel Grassi.
Re: pg_upgrade failing for 200+ million Large Objects
On Wed, Mar 27, 2024 at 10:08:26AM -0500, Nathan Bossart wrote: > On Wed, Mar 27, 2024 at 10:54:05AM -0400, Tom Lane wrote: >> Michael Banck writes: >>> What is the status of this? In the commitfest, this patch is marked as >>> "Needs Review" with Nathan as reviewer - Nathan, were you going to take >>> another look at this or was your mail from January 12th a full review? >> >> In my mind the ball is in Nathan's court. I feel it's about >> committable, but he might not agree. > > I'll prioritize another round of review on this one. FWIW I don't remember > having any major concerns on a previous version of the patch set I looked > at. Sorry for taking so long to get back to this one. Overall, I think the code is in decent shape. Nothing stands out after a couple of passes. The small amount of runtime improvement cited upthread is indeed a bit disappointing, but IIUC this at least sets the stage for additional parallelism in the future, and the memory/disk usage improvements are nothing to sneeze at, either. The one design point that worries me a little is the non-configurability of --transaction-size in pg_upgrade. I think it's fine to default it to 1,000 or something, but given how often I've had to fiddle with max_locks_per_transaction, I'm wondering if we might regret hard-coding it. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Security lessons from liblzma
Bruce Momjian writes: > I was more asking if users have access to patches so they could recreate > the build by using the Postgres git tree and supplied OS-specific > patches. AFAIK, every open-source distro makes all the pieces needed to rebuild their packages available to users. It wouldn't be much of an open-source situation otherwise. You do have to learn their package build process. regards, tom lane
Re: Statistics Import and Export
> Ah, yeah, you could change the function to have more parameters,
> given the assumption that all calls will be named-parameter style.
> I still suggest that my proposal is more robust for the case where
> the dump lists parameters that the receiving system doesn't have.

So what's the behavior when the user fails to supply a parameter that is
currently NOT NULL checked (example: avg_width)? Is that a WARN-and-exit?
Re: Statistics Import and Export
> I think pg_upgrade could check for the existence of extended statistics
> in any database and adjust the analyze recommendation wording
> accordingly.

+1
Re: Combine Prune and Freeze records emitted by vacuum
On Mon, Apr 1, 2024 at 1:37 PM Heikki Linnakangas wrote:
>
> Committed the first of the remaining patches with those changes. And
> also this, which is worth calling out:
>
> > 	if (prstate.htsv[offnum] == HEAPTUPLE_DEAD)
> > 	{
> > 		ItemId		itemid = PageGetItemId(page, offnum);
> > 		HeapTupleHeader htup = (HeapTupleHeader) PageGetItem(page, itemid);
> >
> > 		if (likely(!HeapTupleHeaderIsHotUpdated(htup)))
> > 		{
> > 			HeapTupleHeaderAdvanceConflictHorizon(htup,
> > 												  &prstate.latest_xid_removed);
> > 			heap_prune_record_unused(&prstate, offnum, true);
> > 		}
> > 		else
> > 		{
> > 			/*
> > 			 * This tuple should've been processed and removed as part of
> > 			 * a HOT chain, so something's wrong. To preserve evidence,
> > 			 * we don't dare to remove it. We cannot leave behind a DEAD
> > 			 * tuple either, because that will cause VACUUM to error out.
> > 			 * Throwing an error with a distinct error message seems like
> > 			 * the least bad option.
> > 			 */
> > 			elog(ERROR, "dead heap-only tuple (%u, %d) is not linked to from any HOT chain",
> > 				 blockno, offnum);
> > 		}
> > 	}
> > 	else
> > 		heap_prune_record_unchanged_lp_normal(page, &prstate, offnum);
>
> As you can see, I turned that into a hard error. Previously, that code
> was at the top of heap_prune_chain(), and it was normal to see DEAD
> heap-only tuples there, because they would presumably get processed
> later as part of a HOT chain. But now this is done after all the HOT
> chains have already been processed.
>
> Previously if there was a dead heap-only tuple like that on the page for
> some reason, it was silently not processed by heap_prune_chain()
> (because it was assumed that it would be processed later as part of a
> HOT chain), and was left behind as a HEAPTUPLE_DEAD tuple. If the
> pruning was done as part of VACUUM, VACUUM would fail with "ERROR:
> unexpected HeapTupleSatisfiesVacuum result". Or am I missing something?

I think you are right. I wasn't sure if there was some way for a HOT,
DEAD tuple to be not HOT-updated, but that doesn't make much sense.

> Now you get that above error also on on-access pruning, which is not
> ideal. But I don't remember hearing about corruption like that, and
> you'd get the error on VACUUM anyway.

Yea, that makes sense.

One thing I don't really understand is why vacuum has its own system for
saving and restoring error information for context messages
(LVSavedErrInfo and update/restore_vacuum_err_info()). I'll confess I
don't know much about how error cleanup works in any sub-system. But it
stuck out to me that vacuum has its own. I assume it is okay to add new
error messages and they somehow will work with the existing system?

- Melanie
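For what it's worth, the save/restore machinery Melanie asks about boils down to a small pattern: the error-context callback reads the current phase/block from a struct, and any nested operation saves the caller's values, overwrites them, and restores them afterwards. A stripped-down sketch (invented names, not the vacuumlazy.c code):

```c
#include <assert.h>

/*
 * Stripped-down sketch of the pattern behind LVSavedErrInfo and
 * update/restore_vacuum_err_info() (invented names, not the vacuumlazy.c
 * code).  The error-context callback reads the current phase/block from a
 * struct; a nested operation saves the caller's values, overwrites them,
 * and restores them on the way out, so any error raised in between is
 * reported with the right context.
 */
typedef struct
{
    int      phase;
    unsigned blkno;
} ErrInfo;

static ErrInfo current;         /* what the errcontext callback would read */

static void
update_err_info(ErrInfo *saved, int phase, unsigned blkno)
{
    if (saved)
        *saved = current;       /* remember the caller's context */
    current.phase = phase;
    current.blkno = blkno;
}

static void
restore_err_info(const ErrInfo *saved)
{
    current = *saved;
}
```

Because new messages only read the struct at report time, adding error messages composes with the existing save/restore calls without further changes.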
Re: Statistics Import and Export
> That's not what I suggested at all. The function parameters would
> be declared anyarray, but the values passed to them would be coerced
> to the correct concrete array types. So as far as the coercion rules
> are concerned this'd be equivalent to the variadic-any approach.

+1

> > That's pretty persuasive. It also means that we need to trap for error in
> > the array_in() calls, as that function does not yet have a _safe() mode.
>
> Well, the approach I'm advocating for would have the array input and
> coercion done by the calling query before control ever reaches
> pg_set_attribute_stats, so that any incorrect-for-the-data-type values
> would result in hard errors. I think that's okay for the same reason
> you probably figured you didn't have to trap array_in: it's the fault
> of the originating pg_dump if it offers a value that doesn't coerce to
> the datatype it claims the value is of.

+1
Re: [plpython] Add missing volatile qualifier.
Nathan Bossart writes:
> On Mon, Apr 01, 2024 at 11:57:07AM -0400, Tom Lane wrote:
>> Good catch! It looks like the consequences of a failure would be
>> pretty minimal --- AFAICS, no worse than a possible failure to remove
>> a refcount on Py_None --- but that's still a bug.

> Huh. I seem to have dropped that "volatile" shortly before committing for
> some reason [0].

Oh, I'd forgotten that discussion. Given that we were both confused
about the need for it, all the more reason to try to avoid using a
within-PG_TRY assignment.

> Your fix seems fine to me.

Thanks for looking, I'll push it shortly.

regards, tom lane
Re: [plpython] Add missing volatile qualifier.
On Mon, Apr 01, 2024 at 11:57:07AM -0400, Tom Lane wrote:
> Xing Guo writes:
>> I'm playing a toy static analysis checker with PostgreSQL and found a
>> variable is missing volatile qualifier.
>
> Good catch! It looks like the consequences of a failure would be
> pretty minimal --- AFAICS, no worse than a possible failure to remove
> a refcount on Py_None --- but that's still a bug.

Huh. I seem to have dropped that "volatile" shortly before committing
for some reason [0].

> I don't care for your proposed fix though. I think the real problem
> here is schizophrenia about when to set up pltargs, and we could
> fix it more nicely as attached. (Perhaps the Asserts are overkill
> though.)

Your fix seems fine to me.

[0] https://postgr.es/m/20230504234235.GA2419591%40nathanxps13

-- Nathan Bossart Amazon Web Services: https://aws.amazon.com
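For readers following along: PG_TRY/PG_CATCH are built on setjmp/longjmp, and C only guarantees the post-longjmp value of an automatic variable if it is declared volatile (C11 7.13.2.1). A stand-alone sketch of the rule — plain setjmp, not the PL/Python code, with "resource" playing the role of pltargs:

```c
#include <setjmp.h>
#include <stddef.h>
#include <assert.h>

/*
 * Stand-alone sketch (not the PL/Python code): a local that is modified
 * after setjmp() and read after longjmp() must be volatile, otherwise its
 * value in the "catch" path is indeterminate.
 */
static jmp_buf env;
static int cleanup_ran;

static void
do_work(int fail)
{
    if (fail)
        longjmp(env, 1);        /* simulates elog(ERROR, ...) */
}

static void
run(int fail)
{
    static char buf[8];
    /* volatile: its value at longjmp time is guaranteed to be seen below */
    void *volatile resource = NULL;

    cleanup_ran = 0;
    if (setjmp(env) == 0)
    {
        resource = buf;         /* modified after setjmp ("inside PG_TRY") */
        do_work(fail);
    }
    else
    {
        /* read after longjmp ("inside PG_CATCH"): needs the volatile above */
        if (resource != NULL)
            cleanup_ran = 1;
    }
}
```

Without the qualifier the compiler may keep `resource` in a register that longjmp does not restore — exactly the class of bug the missing qualifier reintroduced, and why Tom suggests avoiding within-PG_TRY assignments altogether.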
Re: Statistics Import and Export
Jeff Davis writes: > On Sun, 2024-03-31 at 14:48 -0400, Tom Lane wrote: >> What happens when >> somebody adds a new stakind (and hence new pg_stats column)? > Why would you need to overload in this case? Wouldn't we just define a > new function with more optional named parameters? Ah, yeah, you could change the function to have more parameters, given the assumption that all calls will be named-parameter style. I still suggest that my proposal is more robust for the case where the dump lists parameters that the receiving system doesn't have. regards, tom lane
Re: Fix some resources leaks (src/bin/pg_basebackup/pg_createsubscriber.c)
On Mon, Apr 1, 2024 at 14:52, Tom Lane wrote:
> "Euler Taveira" writes:
> > On Wed, Mar 27, 2024, at 8:50 PM, Ranier Vilela wrote:
> >> Coverity has some reports in the new code
> >> pg_createsubscriber.c
> >> I think that Coverity is right.
>
> > It depends on your "right" definition. If your program execution is ephemeral
> > and the leak is just before exiting, do you think it's worth it?
>
> I agree with Ranier, actually. The functions in question don't
> exit() themselves, but return control to a caller that might or
> might not choose to exit. I don't think they should be assuming
> that an exit will happen promptly, even if that's true today.
>
> The community Coverity run found a few additional leaks of the same
> kind in that file. I pushed a merged fix for all of them.

Thanks Tom, for the commit.

best regards,
Ranier Vilela
Re: Statistics Import and Export
Jeff Davis writes: > On Mon, 2024-04-01 at 13:11 -0400, Bruce Momjian wrote: >> Reality check --- are we still targeting this feature for PG 17? > I see a few useful pieces here: > 1. Support import of statistics (i.e. > pg_set_{relation|attribute}_stats()). > 2. Support pg_dump of stats > 3. Support pg_upgrade with stats > It's possible that not all of them make it, but let's not dismiss the > entire feature yet. The unresolved questions largely have to do with the interactions between these pieces. I think we would seriously regret setting any one of them in stone before all three are ready to go. regards, tom lane
Re: psql's FETCH_COUNT (cursor) is not being respected for CTEs
Laurenz Albe wrote:

> Here is the code review for patch number 2:
>
> +static void
> +CloseGOutput(FILE *gfile_fout, bool is_pipe)
>
> It makes sense to factor out this code.
> But shouldn't these functions have a prototype at the beginning of the file?

Looking at the other static functions in psql/common.c, there are 22 of
them but only 3 have prototypes at the top of the file. These 3 functions
are called before being defined, so these prototypes are mandatory.
The other static functions that are defined before being called happen
not to have forward declarations, so SetupGOutput() and CloseGOutput()
follow that model.

> Here is a suggestion for a consolidated comment:
>
> Fetch the result in chunks if FETCH_COUNT is set. We don't enable chunking
> if SHOW_ALL_RESULTS is false, since that requires us to accumulate all rows
> before we can tell what should be displayed, which would counter the idea
> of FETCH_COUNT. Chunk fetching is also disabled if \gset, \crosstab,
> \gexec and \watch are used.

OK, done like that.

> + if (fetch_count > 0 && result_status == PGRES_TUPLES_CHUNK)
>
> Could it be that result_status == PGRES_TUPLES_CHUNK, but fetch_count is 0?
> if not, perhaps there should be an Assert that verifies that, and the "if"
> statement should only check for the latter condition.

Good point. In fact it can be simplified to
if (result_status == PGRES_TUPLES_CHUNK), and fetch_count as a variable
can be removed from the function. Done that way.

> > --- a/src/bin/psql/t/001_basic.pl
> > +++ b/src/bin/psql/t/001_basic.pl
> > @@ -184,10 +184,10 @@ like(
> >   "\\set FETCH_COUNT 1\nSELECT error;\n\\errverbose",
> >   on_error_stop => 0))[2],
> >   qr/\A^psql::2: ERROR: .*$
> > -^LINE 2: SELECT error;$
> > +^LINE 1: SELECT error;$
> > ^ *^.*$
> > ^psql::3: error: ERROR: [0-9A-Z]{5}: .*$
> > -^LINE 2: SELECT error;$
> > +^LINE 1: SELECT error;$
>
> Why does the output change? Perhaps there is a good and harmless
> explanation, but the naïve expectation would be that it doesn't.

Unpatched, psql builds this query:

  DECLARE _psql_cursor NO SCROLL CURSOR FOR \n

therefore the user query starts at line 2. With the patch, the user
query is sent as-is, starting at line 1, hence the different error
location.

> After fixing the problem manually, it builds without warning.
> The regression tests pass, and the feature works as expected.

Thanks for testing. Updated patches are attached.

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite

From 0fefaee7d5b3003ad0d089ea9e92675c6f50245f Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Daniel=20V=C3=A9rit=C3=A9?=
Date: Mon, 1 Apr 2024 19:46:20 +0200
Subject: [PATCH v7 1/2] Implement retrieval of results in chunks with libpq.

This mode is similar to the single-row mode except that chunks of results
contain up to N rows instead of a single row. It is meant to reduce the
overhead of the row-by-row allocations for large result sets. The mode is
selected with PQsetChunkedRowsMode(int maxRows) and results have the new
status code PGRES_TUPLES_CHUNK.
---
 doc/src/sgml/libpq.sgml                         |  98 +++
 .../libpqwalreceiver/libpqwalreceiver.c         |   1 +
 src/bin/pg_amcheck/pg_amcheck.c                 |   1 +
 src/interfaces/libpq/exports.txt                |   1 +
 src/interfaces/libpq/fe-exec.c                  | 117 +++---
 src/interfaces/libpq/libpq-fe.h                 |   4 +-
 src/interfaces/libpq/libpq-int.h                |   7 +-
 7 files changed, 185 insertions(+), 44 deletions(-)

diff --git a/doc/src/sgml/libpq.sgml b/doc/src/sgml/libpq.sgml
index d3e87056f2..1814921d5a 100644
--- a/doc/src/sgml/libpq.sgml
+++ b/doc/src/sgml/libpq.sgml
@@ -3545,7 +3545,20 @@ ExecStatusType PQresultStatus(const PGresult *res);
 The PGresult contains a single result tuple from the
 current command. This status occurs only when single-row mode has been
 selected for the query
-(see ).
+(see ).
+
+ PGRES_TUPLES_CHUNK
+
+The PGresult contains several tuples
+from the current command. The count of tuples cannot exceed
+the maximum passed to .
+This status occurs only when the chunked mode has been selected
+for the query (see ).
@@ -5197,8 +5210,8 @@ PGresult *PQgetResult(PGconn *conn);
 Another frequently-desired feature that can be obtained with
  and
- is retrieving large query results a row at a time. This is discussed
- in .
+ is retrieving large query results a limited number of rows at a time. This is discussed
+ in .
@@ -5562,12 +5575,13 @@ int PQflush(PGconn *conn);
- To enter single-row mode, call PQsetSingleRowMode
- before
Re: Fix some resources leaks (src/bin/pg_basebackup/pg_createsubscriber.c)
"Euler Taveira" writes: > On Wed, Mar 27, 2024, at 8:50 PM, Ranier Vilela wrote: >> Coverity has some reports in the new code >> pg_createsubscriber.c >> I think that Coverity is right. > It depends on your "right" definition. If your program execution is ephemeral > and the leak is just before exiting, do you think it's worth it? I agree with Ranier, actually. The functions in question don't exit() themselves, but return control to a caller that might or might not choose to exit. I don't think they should be assuming that an exit will happen promptly, even if that's true today. The community Coverity run found a few additional leaks of the same kind in that file. I pushed a merged fix for all of them. regards, tom lane
Re: Psql meta-command conninfo+
> That minor point aside, I disagree with Sami about repeating the docs
> for system_user() here. I would just say "The authentication data
> provided for this connection; see the function system_user() for more
> details."

+1. FWIW, up the thread [1] I did mention we should link to the
functions page, but did not have a very strong opinion on that. I do
like your suggestion, and the same should be done for "Current User"
and "Session User" as well.

[1] https://www.postgresql.org/message-id/640B2586-EECF-44C0-B474-CA8510F8EAFC%40amazon.com

Regards,
Sami
Re: Statistics Import and Export
On Sun, 2024-03-31 at 14:48 -0400, Tom Lane wrote:
> What happens when
> somebody adds a new stakind (and hence new pg_stats column)?
> You could try to add an overloaded pg_set_attribute_stats
> version with more parameters, but I'm pretty sure that would
> lead to "ambiguous function call" failures when trying to load
> old dump files containing only the original parameters.

Why would you need to overload in this case? Wouldn't we just define a
new function with more optional named parameters?

> The
> present design is also fragile in that an unrecognized parameter
> will lead to a parse-time failure and no function call happening,
> which is less robust than I'd like.

I agree on this point; I found this annoying when testing the feature.

> So this leads me to suggest that we'd be best off with a VARIADIC
> ANY signature, where the variadic part consists of alternating
> parameter labels and values:

I didn't consider this and I think it has a lot of advantages. It's
slightly unfortunate that we can't make them explicitly name/value
pairs, but pg_dump can use whitespace or even SQL comments to make it
more readable.

Regards, Jeff Davis
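To illustrate the robustness argument in C terms — purely hypothetical names and types, the real function would receive Datums through fmgr, not va_arg — the point of alternating labels and values is that an unrecognized label can be skipped with a warning, whereas an unrecognized *named parameter* fails at parse time before the function even runs:

```c
#include <stdarg.h>
#include <string.h>
#include <assert.h>

/*
 * Hypothetical sketch of the "alternating parameter labels and values"
 * convention (invented names/types; not the patch's code).  An unknown
 * label from a newer pg_dump is counted and skipped instead of failing
 * the whole call.
 */
typedef struct
{
    double null_frac;
    int    avg_width;
    int    warnings;            /* unrecognized labels seen */
} AttrStats;

/* labels and values alternate; a NULL label terminates the list */
static void
set_attribute_stats(AttrStats *stats, ...)
{
    va_list     args;
    const char *label;

    va_start(args, stats);
    while ((label = va_arg(args, const char *)) != NULL)
    {
        if (strcmp(label, "null_frac") == 0)
            stats->null_frac = va_arg(args, double);
        else if (strcmp(label, "avg_width") == 0)
            stats->avg_width = va_arg(args, int);
        else
        {
            /*
             * Unknown label: skip its value and keep going.  (For
             * simplicity this sketch assumes unknown values are
             * pointer-sized; the real thing would inspect the Datum type.)
             */
            (void) va_arg(args, void *);
            stats->warnings++;
        }
    }
    va_end(args);
}
```

A dump produced by a newer server can then carry extra label/value pairs that an older receiving system warns about and ignores, which is the robustness Tom is after.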
Re: Combine Prune and Freeze records emitted by vacuum
On 01/04/2024 19:08, Melanie Plageman wrote:
> On Mon, Apr 01, 2024 at 05:17:51PM +0300, Heikki Linnakangas wrote:
> > diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
> > @@ -256,15 +270,16 @@ heap_page_prune(Relation relation, Buffer buffer,
> >  	tup.t_tableOid = RelationGetRelid(relation);
> >
> >  	/*
> > -	 * Determine HTSV for all tuples.
> > +	 * Determine HTSV for all tuples, and queue them up for processing as HOT
> > +	 * chain roots or as a heap-only items.
>
> Reading this comment now as a whole, I would add something like
> "Determining HTSV for all tuples once is required for correctness" to
> the start of the second paragraph. The new conjunction on the first
> paragraph sentence followed by the next paragraph is a bit confusing
> because it sounds like queuing them up for processing is required for
> correctness (which, I suppose it is on some level). Basically, I'm
> just saying that it is now less clear what is required for correctness.

Fixed.

> While I recognize this is a matter of style and not important, I
> personally prefer this for reverse looping:
>
>     for (int i = prstate.nheaponly_items; i --> 0;)

I don't think we use that style anywhere in the Postgres source tree
currently. (And I don't like it ;-) )

> I do think a comment about the reverse order would be nice. I know it
> says something above the first loop to this effect:
>
>     * Processing the items in reverse order (and thus the tuples in
>     * increasing order) increases prefetching efficiency significantly /
>     * decreases the number of cache misses.
>
> So perhaps we could just say "as above, process the items in reverse
> order"

I'm actually not sure why it makes a difference. I would assume all the
data to already be in CPU cache at this point, since the first loop
already accessed it, so I think there's something else going on. But I
didn't investigate it deeper. Anyway, added a comment.

Committed the first of the remaining patches with those changes. And
also this, which is worth calling out:

	if (prstate.htsv[offnum] == HEAPTUPLE_DEAD)
	{
		ItemId		itemid = PageGetItemId(page, offnum);
		HeapTupleHeader htup = (HeapTupleHeader) PageGetItem(page, itemid);

		if (likely(!HeapTupleHeaderIsHotUpdated(htup)))
		{
			HeapTupleHeaderAdvanceConflictHorizon(htup,
												  &prstate.latest_xid_removed);
			heap_prune_record_unused(&prstate, offnum, true);
		}
		else
		{
			/*
			 * This tuple should've been processed and removed as part of
			 * a HOT chain, so something's wrong.  To preserve evidence,
			 * we don't dare to remove it.  We cannot leave behind a DEAD
			 * tuple either, because that will cause VACUUM to error out.
			 * Throwing an error with a distinct error message seems like
			 * the least bad option.
			 */
			elog(ERROR, "dead heap-only tuple (%u, %d) is not linked to from any HOT chain",
				 blockno, offnum);
		}
	}
	else
		heap_prune_record_unchanged_lp_normal(page, &prstate, offnum);

As you can see, I turned that into a hard error. Previously, that code
was at the top of heap_prune_chain(), and it was normal to see DEAD
heap-only tuples there, because they would presumably get processed
later as part of a HOT chain. But now this is done after all the HOT
chains have already been processed.

Previously if there was a dead heap-only tuple like that on the page
for some reason, it was silently not processed by heap_prune_chain()
(because it was assumed that it would be processed later as part of a
HOT chain), and was left behind as a HEAPTUPLE_DEAD tuple. If the
pruning was done as part of VACUUM, VACUUM would fail with "ERROR:
unexpected HeapTupleSatisfiesVacuum result". Or am I missing something?

Now you get that above error also on on-access pruning, which is not
ideal. But I don't remember hearing about corruption like that, and
you'd get the error on VACUUM anyway.

With the next patches, heap_prune_record_unchanged() will do more, and
will also throw an error on a HEAPTUPLE_LIVE tuple, so even though in
the first patch we could print just a WARNING and move on, it gets more
awkward with the rest of the patches.

(Continuing with the remaining patches..)

-- Heikki Linnakangas Neon (https://neon.tech)
Re: Statistics Import and Export
On Sun, Mar 31, 2024 at 07:04:47PM -0400, Tom Lane wrote:
> Corey Huinker writes:
> >> I can't quibble with that view of what has priority. I'm just
> >> suggesting that redesigning what pg_upgrade does in this area
> >> should come later than doing something about extended stats.
>
> > I mostly agree, with the caveat that pg_upgrade's existing message saying
> > that optimizer stats were not carried over wouldn't be 100% true anymore.
>
> I think we can tweak the message wording. I just don't want to be
> doing major redesign of the behavior, nor adding fundamentally new
> monitoring capabilities.

I think pg_upgrade could check for the existence of extended statistics
in any database and adjust the analyze recommendation wording
accordingly.

-- Bruce Momjian https://momjian.us EDB https://enterprisedb.com
Only you can decide what is important to you.
Re: Statistics Import and Export
On Mon, 2024-04-01 at 13:11 -0400, Bruce Momjian wrote: > Reality check --- are we still targeting this feature for PG 17? I see a few useful pieces here: 1. Support import of statistics (i.e. pg_set_{relation|attribute}_stats()). 2. Support pg_dump of stats 3. Support pg_upgrade with stats It's possible that not all of them make it, but let's not dismiss the entire feature yet. Regards, Jeff Davis
Re: Statistics Import and Export
Bruce Momjian writes: > Reality check --- are we still targeting this feature for PG 17? I'm not sure. I think if we put our heads down we could finish the changes I'm suggesting and resolve the other issues this week. However, it is starting to feel like the sort of large, barely-ready patch that we often regret cramming in at the last minute. Maybe we should agree that the first v18 CF would be a better time to commit it. regards, tom lane
Re: Security lessons from liblzma
On Sun, Mar 31, 2024 at 02:12:57PM -0700, Andres Freund wrote:
> Hi,
>
> On 2024-03-31 12:18:29 +0200, Michael Banck wrote:
> > If you ask where they are maintained, the answer is here:
> >
> > https://salsa.debian.org/postgresql/postgresql/-/tree/17/debian/patches?ref_type=heads
> >
> > the other major versions have their own branch.
>
> Luckily these are all quite small, leaving little space to hide stuff. I'd
> still like to get rid of at least some of them.
>
> I've previously proposed a patch to make pkglibdir configurable, I think we
> should just go for that.
>
> For the various defines, ISTM we should just make them #ifndef guarded, then
> they could be overridden by defining them at configure time. Some of them,
> like DEFAULT_PGSOCKET_DIR seem to be overridden by just about every
> distro. And others would be nice to easily override anyway, I e.g. dislike
> the default DEFAULT_PAGER value.

I realize we can move some changes into our code, but packagers are
still going to need a way to do immediate adjustments to match their OS
in time frames that don't match the Postgres release schedule. I was
more asking if users have access to patches so they could recreate the
build by using the Postgres git tree and supplied OS-specific patches.

-- Bruce Momjian https://momjian.us EDB https://enterprisedb.com
Only you can decide what is important to you.
Re: Combine Prune and Freeze records emitted by vacuum
On Mon, Apr 01, 2024 at 05:17:51PM +0300, Heikki Linnakangas wrote:
> On 30/03/2024 07:57, Melanie Plageman wrote:
> > The final state of the code could definitely use more cleanup. I've been
> > staring at it for awhile, so I could use some thoughts/ideas about what
> > part to focus on improving.
>
> Committed some of the changes. I plan to commit at least the first of these
> remaining patches later today. I'm happy with it now, but I'll give it a
> final glance over after dinner.
>
> I'll continue to review the rest after that, but attached is what I have
> now.

Review for 0003-0006 (I didn't have any new thoughts on 0002). I know
you didn't modify them much/at all, but I noticed some things in my
code that could be better.

> From 17e183835a968e81daf7b74a4164b243e2de35aa Mon Sep 17 00:00:00 2001
> From: Melanie Plageman
> Date: Fri, 29 Mar 2024 19:43:09 -0400
> Subject: [PATCH v11 3/7] Introduce PRUNE_DO_* actions
>
> We will eventually take additional actions in heap_page_prune() at the
> discretion of the caller. For now, introduce these PRUNE_DO_* macros and
> turn mark_unused_now, a paramter to heap_page_prune(), into a PRUNE_DO_

paramter -> parameter

> action.
> ---
>  src/backend/access/heap/pruneheap.c  | 51 ++--
>  src/backend/access/heap/vacuumlazy.c | 11 --
>  src/include/access/heapam.h          | 13 ++-
>  3 files changed, 46 insertions(+), 29 deletions(-)
>
> diff --git a/src/backend/access/heap/pruneheap.c
> b/src/backend/access/heap/pruneheap.c
> index fb0ad834f1b..30965c3c5a1 100644
> --- a/src/backend/access/heap/pruneheap.c
> +++ b/src/backend/access/heap/pruneheap.c
> @@ -29,10 +29,11 @@
>  /* Working data for heap_page_prune and subroutines */
>  typedef struct
>  {
> +	/* PRUNE_DO_* arguments */
> +	uint8		actions;

I wasn't sure if actions is a good name. What do you think?

> +
>  	/* tuple visibility test, initialized for the relation */
>  	GlobalVisState *vistest;
> -	/* whether or not dead items can be set LP_UNUSED during pruning */
> -	bool		mark_unused_now;
>
>  	TransactionId new_prune_xid;	/* new prune hint value for page */
>  	TransactionId snapshotConflictHorizon;	/* latest xid removed */

> diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
> index 32a3fbce961..35b8486c34a 100644
> --- a/src/include/access/heapam.h
> +++ b/src/include/access/heapam.h
> @@ -191,6 +191,17 @@ typedef struct HeapPageFreeze
>
>  } HeapPageFreeze;
>
> +/*
> + * Actions that can be taken during pruning and freezing. By default, we will
> + * at least attempt regular pruning.
> + */
>
> +/*
> + * PRUNE_DO_MARK_UNUSED_NOW indicates whether or not dead items can be set
> + * LP_UNUSED during pruning.
> + */
> +#define PRUNE_DO_MARK_UNUSED_NOW (1 << 1)

No reason for me to waste the zeroth bit here. I just realized that I
did this with XLHP_IS_CATALOG_REL too.

    #define XLHP_IS_CATALOG_REL (1 << 1)

> From 083690b946e19ab5e536a9f2689772e7b91d2a70 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman
> Date: Fri, 29 Mar 2024 21:22:14 -0400
> Subject: [PATCH v11 4/7] Prepare freeze tuples in heap_page_prune()
>
> diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
> index 35b8486c34a..ac129692c13 100644
> --- a/src/include/access/heapam.h
> +++ b/src/include/access/heapam.h
>
> +/*
> + * Prepare to freeze if advantageous or required and try to advance
> + * relfrozenxid and relminmxid. To attempt freezing, we will need to determine
> + * if the page is all frozen. So, if this action is set, we will also inform
> + * the caller if the page is all-visible and/or all-frozen and calculate a

I guess we don't inform the caller if the page is all-visible, so this
is not quite right.

> + * snapshot conflict horizon for updating the visibility map.
> + */
> +#define PRUNE_DO_TRY_FREEZE (1 << 2)

> From ef8cb2c089ad9474a6da309593029c08f71b0bb9 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman
> Date: Fri, 29 Mar 2024 21:36:37 -0400
> Subject: [PATCH v11 5/7] Set hastup in heap_page_prune
>
> diff --git a/src/backend/access/heap/pruneheap.c
> b/src/backend/access/heap/pruneheap.c
> index 8bdd6389b25..65b0ed185ff 100644
> --- a/src/backend/access/heap/pruneheap.c
> +++ b/src/backend/access/heap/pruneheap.c
> @@ -66,6 +66,9 @@ typedef struct
>  	bool		processed[MaxHeapTuplesPerPage + 1];
>
>  	int			ndeleted;		/* Number of tuples deleted from the page */
> +
> +	/* Whether or not the page makes rel truncation unsafe */
> +	bool		hastup;
>  } PruneState;
>
>  /* Local functions */
> @@ -271,6 +274,7 @@ heap_page_prune(Relation relation, Buffer buffer,
>  	prstate.ndeleted = 0;
>  	prstate.nroot_items = 0;
>  	prstate.nheaponly_items = 0;
> +	prstate.hastup = false;
>
>  	/*
>  	 * If we will prepare to
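Melanie's point about the wasted zeroth bit, in miniature — simplified flags, not the heapam.h definitions — starting at `(1 << 0)` keeps the mask dense, and membership tests stay plain bitwise ANDs:

```c
#include <stdint.h>
#include <assert.h>

/*
 * Miniature version of the PRUNE_DO_* action-mask idea (simplified flags,
 * not the heapam.h definitions).  Starting at (1 << 0) avoids wasting the
 * zeroth bit as noted above.
 */
#define DO_MARK_UNUSED_NOW  (1 << 0)
#define DO_TRY_FREEZE       (1 << 1)

static int
actions_include(uint8_t actions, uint8_t flag)
{
    return (actions & flag) != 0;
}
```

Skipping bit zero is harmless today, but with a uint8 mask it costs one of only eight available action bits.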
Re: Statistics Import and Export
Jeff Davis writes:
> On Sat, 2024-03-30 at 20:08 -0400, Tom Lane wrote:
>> I haven't looked at the details, but I'm really a bit surprised
>> by Jeff's assertion that CREATE INDEX destroys statistics on the
>> base table. That seems wrong from here, and maybe something we
>> could have it not do. (I do realize that it recalculates reltuples
>> and relpages, but so what? If it updates those, the results should
>> be perfectly accurate.)

> In the v15 of the patch I was looking at, "pg_dump -s" included the
> statistics. The stats appeared first in the dump, followed by the
> CREATE INDEX commands. The latter overwrote the relpages/reltuples set
> by the former.

> While zeros are the right answers for a schema-only dump, it defeated
> the purpose of including relpages/reltuples stats in the dump, and
> caused the pg_upgrade TAP test to fail.

> You're right that there are a number of ways this could be resolved --
> I don't think it's an inherent problem.

I'm inclined to call it not a problem at all. While I do agree there
are use-cases for injecting false statistics with these functions, I do
not think that pg_dump has to cater to such use-cases. In any case, I
remain of the opinion that stats are data and should not be included in
a -s dump (with some sort of exception for pg_upgrade). If the data has
been loaded, then a subsequent overwrite by CREATE INDEX should not be
a problem.

regards, tom lane
Re: Statistics Import and Export
Reality check --- are we still targeting this feature for PG 17? -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Only you can decide what is important to you.
Re: Statistics Import and Export
On Sat, 2024-03-30 at 20:08 -0400, Tom Lane wrote: > I haven't looked at the details, but I'm really a bit surprised > by Jeff's assertion that CREATE INDEX destroys statistics on the > base table. That seems wrong from here, and maybe something we > could have it not do. (I do realize that it recalculates reltuples > and relpages, but so what? If it updates those, the results should > be perfectly accurate.) In the v15 of the patch I was looking at, "pg_dump -s" included the statistics. The stats appeared first in the dump, followed by the CREATE INDEX commands. The latter overwrote the relpages/reltuples set by the former. While zeros are the right answers for a schema-only dump, it defeated the purpose of including relpages/reltuples stats in the dump, and caused the pg_upgrade TAP test to fail. You're right that there are a number of ways this could be resolved -- I don't think it's an inherent problem. Regards, Jeff Davis
Re: Table AM Interface Enhancements
On Mon, Apr 1, 2024 at 7:36 PM Tom Lane wrote:
>
> Coverity complained about what you did in RelationParseRelOptions
> in c95c25f9a:
>
> *** CID 1595992: Null pointer dereferences (FORWARD_NULL)
> /srv/coverity/git/pgsql-git/postgresql/src/backend/utils/cache/relcache.c: 499 in RelationParseRelOptions()
> 493
> 494 	/*
> 495 	 * Fetch reloptions from tuple; have to use a hardwired descriptor because
> 496 	 * we might not have any other for pg_class yet (consider executing this
> 497 	 * code for pg_class itself)
> 498 	 */
> >>> CID 1595992: Null pointer dereferences (FORWARD_NULL)
> >>> Passing null pointer "tableam" to "extractRelOptions", which dereferences it.
> 499 	options = extractRelOptions(tuple, GetPgClassDescriptor(),
> 500 								tableam, amoptsfn);
> 501
>
> I see that extractRelOptions only uses the tableam argument for some
> relkinds, and RelationParseRelOptions does set it up for those
> relkinds --- but Coverity's complaint isn't without merit, because
> those two switch statements are looking at *different copies of the
> relkind*, which in theory could be different. This all seems quite
> messy and poorly factored. Can't we do better? Why do we need to
> involve two copies of allegedly the same pg_class tuple, anyhow?

Thank you for reporting this, Tom. I'm planning to investigate this
later today.

--
Regards,
Alexander Korotkov
Re: Table AM Interface Enhancements
Coverity complained about what you did in RelationParseRelOptions
in c95c25f9a:

*** CID 1595992: Null pointer dereferences (FORWARD_NULL)
/srv/coverity/git/pgsql-git/postgresql/src/backend/utils/cache/relcache.c: 499 in RelationParseRelOptions()
493
494 	/*
495 	 * Fetch reloptions from tuple; have to use a hardwired descriptor because
496 	 * we might not have any other for pg_class yet (consider executing this
497 	 * code for pg_class itself)
498 	 */
>>> CID 1595992: Null pointer dereferences (FORWARD_NULL)
>>> Passing null pointer "tableam" to "extractRelOptions", which dereferences it.
499 	options = extractRelOptions(tuple, GetPgClassDescriptor(),
500 								tableam, amoptsfn);
501

I see that extractRelOptions only uses the tableam argument for some
relkinds, and RelationParseRelOptions does set it up for those
relkinds --- but Coverity's complaint isn't without merit, because
those two switch statements are looking at *different copies of the
relkind*, which in theory could be different. This all seems quite
messy and poorly factored. Can't we do better? Why do we need to
involve two copies of allegedly the same pg_class tuple, anyhow?

regards, tom lane
Re: Psql meta-command conninfo+
Hello Yeah, that's what I meant about the translations, thanks. Two more comments, - You don't need a two-level conditional here if (pset.sversion >= 90500) { if (pset.sversion < 14) appendPQExpBuffer(, " ssl.ssl AS \"%s\",\n" " ssl.version AS \"%s\",\n" " ssl.cipher AS \"%s\",\n" " ssl.compression AS \"%s\",\n", _("SSL Connection"), _("SSL Protocol"), _("SSL Cipher"), _("SSL Compression")); if (pset.sversion >= 14) appendPQExpBuffer(, " ssl.ssl AS \"%s\",\n" " ssl.version AS \"%s\",\n" " ssl.cipher AS \"%s\",\n" " NULL AS \"%s\",\n", _("SSL Connection"), _("SSL Protocol"), _("SSL Cipher"), _("SSL Compression")); } else appendPQExpBuffer(, " NULL AS \"%s\",\n" " NULL AS \"%s\",\n" " NULL AS \"%s\",\n" " NULL AS \"%s\",\n", _("SSL Connection"), _("SSL Protocol"), _("SSL Cipher"), _("SSL Compression")); Instead you can just do something like if (pset.version >= 14) one thing; else if (pset.version > 90500) second thing; else third thing; This also appears where you add the GSSAPI columns; and it's also in the final part where you append the FROM clause, though it looks a bit different there. - You have three lines to output a semicolon at the end of the query based on version number. Remove the first two, and just have a final one where the semicolon is added unconditionally. - I don't think one for each item in the docs is reasonable. There's too much vertical whitespace in the output. Maybe do this instead: [...] database connection. When + is appended, more details about the connection are displayed in table format: Database: The name of the current database on this connection. Authenticated User: The authenticated user at the time of psql connection with the server. ... - This text is wrong to start with "Returns the": System User: Returns the authentication method and the identity (if any) that the user presented during the authentication cycle before they were assigned a database role. 
It is represented as auth_method:identity or NULL if the user has not been authenticated. That minor point aside, I disagree with Sami about repeating the docs for system_user() here. I would just say "The authentication data provided for this connection; see the function system_user() for more details." with a link to the appropriate section of the docs. Making us edit this doc if we ever modify the behavior of the function is not great. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "We're here to devour each other alive" (Hobbes)
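[As an aside, the flattened conditional suggested earlier in this thread can be sketched standalone. This is only an illustration with invented names, not psql's actual code: "server_version_num" follows the PG numeric convention (140000 for v14, 90500 for 9.5), and the returned strings stand in for the SQL fragments psql would append with appendPQExpBuffer.]

```c
/*
 * Hedged sketch of the if / else if / else chain suggested above,
 * replacing the two-level conditional.  Names are illustrative only.
 */
static const char *
ssl_compression_expr(int server_version_num)
{
	if (server_version_num >= 140000)
		return "NULL";				/* column removed in v14, emit NULL */
	else if (server_version_num >= 90500)
		return "ssl.compression";	/* pg_stat_ssl exists since 9.5 */
	else
		return "NULL";				/* no pg_stat_ssl view before 9.5 */
}
```

The same shape would apply to the GSSAPI columns and the FROM-clause assembly mentioned above.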
Re: psql's FETCH_COUNT (cursor) is not being respected for CTEs
Laurenz Albe wrote: > I had a look at patch 0001 (0002 will follow). Thanks for reviewing this! I've implemented the suggested doc changes. A patch update will follow with the next part of the review. > > --- a/src/interfaces/libpq/fe-exec.c > > +++ b/src/interfaces/libpq/fe-exec.c > > @@ -41,7 +41,8 @@ char *const pgresStatus[] = { > > "PGRES_COPY_BOTH", > > "PGRES_SINGLE_TUPLE", > > "PGRES_PIPELINE_SYNC", > > - "PGRES_PIPELINE_ABORTED" > > + "PGRES_PIPELINE_ABORTED", > > + "PGRES_TUPLES_CHUNK" > > }; > > I think that PGRES_SINGLE_TUPLE and PGRES_TUPLES_CHUNK should be next to > each other, but that's no big thing. > The same applies to the change in src/interfaces/libpq/libpq-fe.h I assume we can't renumber/reorder existing values, otherwise it would be an ABI break. We can only add new values. > I understand that we need to keep the single-row mode for compatibility > reasons. But I think that under the hood, "single-row mode" should be the > same as "chunk mode with chunk size one". I've implemented it like that at first, and wasn't thrilled with the result. libpq still has to return PGRES_SINGLE_TUPLE in single-row mode and PGRES_TUPLES_CHUNK with chunks of size 1, so the mutualization did not work that well in practice. I also contemplated not creating PGRES_TUPLES_CHUNK and instead using PGRES_SINGLE_TUPLE for N rows, but I found it too ugly. Best regards, -- Daniel Vérité https://postgresql.verite.pro/ Twitter: @DanielVerite
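[The ABI constraint mentioned above can be illustrated with a toy enum. This is not libpq's actual header, just a sketch of why existing result-status values cannot be renumbered and new ones like a chunk-mode status can only be appended.]

```c
/*
 * Toy illustration: applications compiled against the old enum have the
 * numeric values baked into their binaries, so existing members must keep
 * their positions and new members can only go at the end.
 */
typedef enum
{
	DEMO_TUPLES_OK = 0,
	DEMO_SINGLE_TUPLE,			/* existing: must stay at value 1 */
	DEMO_PIPELINE_ABORTED,		/* existing: must stay at value 2 */
	DEMO_TUPLES_CHUNK			/* new: appended, old values unchanged */
} DemoExecStatus;
```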
Re: Combine Prune and Freeze records emitted by vacuum
On Mon, Apr 01, 2024 at 05:17:51PM +0300, Heikki Linnakangas wrote: > On 30/03/2024 07:57, Melanie Plageman wrote: > > On Fri, Mar 29, 2024 at 12:32:21PM -0400, Melanie Plageman wrote: > > > On Fri, Mar 29, 2024 at 12:00 PM Heikki Linnakangas > > > wrote: > > > > Here's another idea: In the first loop through the offsets, where we > > > > gather the HTSV status of each item, also collect the offsets of all HOT > > > > and non-HOT items to two separate arrays. Call heap_prune_chain() for > > > > all the non-HOT items first, and then process any remaining HOT tuples > > > > that haven't been marked yet. > > > > > > That's an interesting idea. I'll try it out and see how it works. > > > > Attached v10 implements this method of dividing tuples into HOT and > > non-HOT and processing the potential HOT chains first then processing > > tuples not marked by calling heap_prune_chain(). > > > > I have applied the refactoring of heap_prune_chain() to master and then > > built the other patches on top of that. > > Committed some of the changes. Continuing to reviewing the rest. Thanks! > > The early patches in the set include some additional comment cleanup as > > well. 0001 is fairly polished. 0004 could use some variable renaming > > (this patch partitions the tuples into HOT and not HOT and then > > processes them). I was struggling with some of the names here > > (chainmembers and chaincandidates is confusing). > > I didn't understand why you wanted to juggle both partitions in the same > array. So I separated them into two arrays, and called them 'root_items' and > 'heaponly_items'. I thought it was worth it to save the space. And the algorithm for doing it seemed pretty straightforward. But looking at your patch, it is a lot easier to understand with two arrays (since, for example, they can each have a name). > In some micro-benchmarks, the order that the items were processed made a > measurable difference. So I'm processing the items in the reverse order. 
> That roughly matches the order the items are processed in master, as it > iterates the offsets from high-to-low in the first loop, and low-to-high in > the second loop. This makes sense. I noticed there isn't a comment about this above the loop. It might be worth it to mention it. Below is a review of only 0001. I'll look at the others shortly. > From a6ab891779876e7cc1b4fb6fddb09f52f0094646 Mon Sep 17 00:00:00 2001 > From: Heikki Linnakangas > Date: Mon, 1 Apr 2024 16:59:38 +0300 > Subject: [PATCH v11 1/7] Handle non-chain tuples outside of heap_prune_chain() > --- > src/backend/access/heap/pruneheap.c | 264 +--- > 1 file changed, 166 insertions(+), 98 deletions(-) > > diff --git a/src/backend/access/heap/pruneheap.c > b/src/backend/access/heap/pruneheap.c > @@ -256,15 +270,16 @@ heap_page_prune(Relation relation, Buffer buffer, > tup.t_tableOid = RelationGetRelid(relation); > > /* > - * Determine HTSV for all tuples. > + * Determine HTSV for all tuples, and queue them up for processing as > HOT > + * chain roots or as a heap-only items. Reading this comment now as a whole, I would add something like "Determining HTSV for all tuples once is required for correctness" to the start of the second paragraph. The new conjunction on the first paragraph sentence followed by the next paragraph is a bit confusing because it sounds like queuing them up for processing is required for correctness (which, I suppose it is on some level). Basically, I'm just saying that it is now less clear what is required for correctness. >* This is required for correctness to deal with cases where running > HTSV >* twice could result in different results (e.g. RECENTLY_DEAD can turn > to >* DEAD if another checked item causes GlobalVisTestIsRemovableFullXid() >* to update the horizon, INSERT_IN_PROGRESS can change to DEAD if the > - * inserting transaction aborts, ...).
That in turn could cause > - * heap_prune_chain() to behave incorrectly if a tuple is reached twice, > - * once directly via a heap_prune_chain() and once following a HOT > chain. > + * inserting transaction aborts, ...). VACUUM assumes that there are no > + * normal DEAD tuples left on the page after pruning, so it needs to > have > + * the same understanding of what is DEAD and what is not. >* >* It's also good for performance. Most commonly tuples within a page > are >* stored at decreasing offsets (while the items are stored at > increasing > @@ -282,52 +297,140 @@ heap_page_prune(Relation relation, Buffer buffer, > + /* > + * Process any heap-only tuples that were not already processed as part > of > + * a HOT chain. > + */ While I recognize this is a matter of style and not important, I personally prefer this for reverse looping: for (int i = prstate.nheaponly_items; i --> 0;) I do think a comment about the
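[A self-contained sketch of the root/heap-only partitioning discussed in this thread, with the reverse iteration applied to the heap-only array. All names are invented for illustration; the real code in pruneheap.c operates on page item offsets and HTSV results, not plain ints.]

```c
#define MAX_ITEMS 16

/* Invented stand-in for the pruning state's two item arrays. */
typedef struct
{
	int			root_items[MAX_ITEMS];
	int			nroot_items;
	int			heaponly_items[MAX_ITEMS];
	int			nheaponly_items;
} demo_prune_state;

/*
 * One pass over the offsets splits them into HOT-chain roots and
 * heap-only items, mirroring the two-array approach described above.
 * "is_heaponly" stands in for the real HOT-chain membership test.
 */
static void
demo_partition(demo_prune_state *ps, const int *offsets,
			   const int *is_heaponly, int n)
{
	ps->nroot_items = 0;
	ps->nheaponly_items = 0;
	for (int i = 0; i < n; i++)
	{
		if (is_heaponly[i])
			ps->heaponly_items[ps->nheaponly_items++] = offsets[i];
		else
			ps->root_items[ps->nroot_items++] = offsets[i];
	}
}

/* Process heap-only items highest-offset-first (reverse iteration). */
static int
demo_process_heaponly_reversed(const demo_prune_state *ps, int *out)
{
	int			n = 0;

	for (int i = ps->nheaponly_items; i-- > 0;)
		out[n++] = ps->heaponly_items[i];
	return n;
}
```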
Re: psql not responding to SIGINT upon db reconnection
On Mon Mar 25, 2024 at 1:44 PM CDT, Robert Haas wrote: On Fri, Mar 22, 2024 at 4:58 PM Tristan Partin wrote: > I had a question about parameter naming. Right now I have a mix of > camel-case and snake-case in the function signature since that is what > I inherited. Should I change that to be consistent? If so, which case > would you like? Uh... PostgreSQL is kind of the wild west in that regard. The thing to do is look for nearby precedents, but that doesn't help much here because in the very same file, libpq-fe.h, we have: extern int PQsetResultAttrs(PGresult *res, int numAttributes, PGresAttDesc *attDescs); extern int PQsetvalue(PGresult *res, int tup_num, int field_num, char *value, int len); Since the existing naming is consistent with one of those two styles, I'd probably just leave it be. + The function returns a value greater than 0 if the specified condition + is met, 0 if a timeout occurred, or -1 if an error + or interrupt occurred. In the event forRead and We either need to tell people how to find out which error it was, or if that's not possible and we can't reasonably make it possible, we need to tell them why they shouldn't care. Because there's nothing more delightful than someone who shows up and says "hey, I tried to do XYZ, and I got an error," as if that were sufficient information for me to do something useful. + end_time is the time in the future in seconds starting from the UNIX + epoch in which you would like the function to return if the condition is not met. This sentence seems a bit contorted to me, like maybe Yoda wrote it. I was about to try to rephrase it and maybe split it in two when I wondered why we need to document how time_t works at all. Can't we just say something like "If end_time is not -1, it specifies the time at which this function should stop waiting for the condition to be met" -- and maybe move it to the end of the first paragraph, so it's before where we list the meanings of the return values? 
Incorporated feedback, I have :). -- Tristan Partin Neon (https://neon.tech) From 14c794d9699f7b9f27d1a4ec026f748c6b7d8853 Mon Sep 17 00:00:00 2001 From: Tristan Partin Date: Tue, 30 Jan 2024 15:40:57 -0600 Subject: [PATCH v11 1/2] Expose PQsocketPoll for frontends PQsocketPoll should help with state machine processing when connecting to a database asynchronously via PQconnectStart(). --- doc/src/sgml/libpq.sgml | 40 +++- src/interfaces/libpq/exports.txt | 1 + src/interfaces/libpq/fe-misc.c | 7 +++--- src/interfaces/libpq/libpq-fe.h | 4 4 files changed, 47 insertions(+), 5 deletions(-) diff --git a/doc/src/sgml/libpq.sgml b/doc/src/sgml/libpq.sgml index d3e87056f2c..e69feacfe6a 100644 --- a/doc/src/sgml/libpq.sgml +++ b/doc/src/sgml/libpq.sgml @@ -262,6 +262,41 @@ PGconn *PQsetdb(char *pghost, + + PQsocketPollPQsocketPoll + + + nonblocking connection + Poll a connections underlying socket descriptor retrieved with . + +int PQsocketPoll(int sock, int forRead, int forWrite, time_t end_time); + + + + + This function sets up polling of a file descriptor. The underlying function is either + poll(2) or select(2), depending on platform + support. The primary use of this function is iterating through the connection sequence + described in the documentation of . If + forRead is specified, the function waits for the socket to be ready + for reading. If forWrite is specified, the function waits for the + socket to be ready for write. See POLLIN and POLLOUT + from poll(2), or readfds and + writefds from select(2) for more information. If + end_time is not -1, it specifies the time at which + this function should stop waiting for the condition to be met. + + + + The function returns a value greater than 0 if the specified condition + is met, 0 if a timeout occurred, or -1 if an error + occurred. The error can be retrieved by checking the errno(3) value. In + the event forRead and forWrite are not set, the + function immediately returns a timeout condition. 
+ + + + PQconnectStartParamsPQconnectStartParams PQconnectStartPQconnectStart @@ -358,7 +393,10 @@ PostgresPollingStatusType PQconnectPoll(PGconn *conn); Loop thus: If PQconnectPoll(conn) last returned PGRES_POLLING_READING, wait until the socket is ready to read (as indicated by select(), poll(), or - similar system function). + similar system function). Note that PQsocketPoll + can help reduce boilerplate by abstracting the setup of + select(2) or poll(2) if it is + available on your system. Then call PQconnectPoll(conn) again. Conversely, if PQconnectPoll(conn) last returned PGRES_POLLING_WRITING, wait until the socket is ready diff --git
Re: [plpython] Add missing volatile qualifier.
Xing Guo writes: > I'm playing a toy static analysis checker with PostgreSQL and found a > variable is missing volatile qualifier. Good catch! It looks like the consequences of a failure would be pretty minimal --- AFAICS, no worse than a possible failure to remove a refcount on Py_None --- but that's still a bug. I don't care for your proposed fix though. I think the real problem here is schizophrenia about when to set up pltargs, and we could fix it more nicely as attached. (Perhaps the Asserts are overkill though.) regards, tom lane diff --git a/src/pl/plpython/plpy_exec.c b/src/pl/plpython/plpy_exec.c index e06fde1dd9..3145c69699 100644 --- a/src/pl/plpython/plpy_exec.c +++ b/src/pl/plpython/plpy_exec.c @@ -689,7 +689,7 @@ PLy_trigger_build_args(FunctionCallInfo fcinfo, PLyProcedure *proc, HeapTuple *r *pltrelid, *plttablename, *plttableschema, - *pltargs = NULL, + *pltargs, *pytnew, *pytold, *pltdata; @@ -713,6 +713,11 @@ PLy_trigger_build_args(FunctionCallInfo fcinfo, PLyProcedure *proc, HeapTuple *r return NULL; } } + else + { + Py_INCREF(Py_None); + pltargs = Py_None; + } PG_TRY(); { @@ -856,7 +861,7 @@ PLy_trigger_build_args(FunctionCallInfo fcinfo, PLyProcedure *proc, HeapTuple *r PyObject *pltarg; /* pltargs should have been allocated before the PG_TRY block. */ - Assert(pltargs); + Assert(pltargs && pltargs != Py_None); for (i = 0; i < tdata->tg_trigger->tgnargs; i++) { @@ -870,8 +875,7 @@ PLy_trigger_build_args(FunctionCallInfo fcinfo, PLyProcedure *proc, HeapTuple *r } else { - Py_INCREF(Py_None); - pltargs = Py_None; + Assert(pltargs == Py_None); } PyDict_SetItemString(pltdata, "args", pltargs); Py_DECREF(pltargs);
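[The hazard behind both versions of the fix can be shown with a minimal standalone setjmp/longjmp example, which is what PG_TRY/PG_CATCH are built on: a local variable modified between setjmp() and longjmp() has an indeterminate value after the jump unless it is qualified volatile. This is only an illustration, not the plpython code.]

```c
#include <setjmp.h>

static jmp_buf demo_env;

/*
 * Minimal illustration of the pltargs problem: without the volatile
 * qualifier, "x" could legally hold a stale (register-cached) value
 * after longjmp() returns control to the setjmp() call.
 */
static int
volatile_demo(void)
{
	volatile int x = 0;			/* drop volatile and this becomes undefined */

	if (setjmp(demo_env) == 0)
	{
		x = 42;					/* modified between setjmp and longjmp */
		longjmp(demo_env, 1);	/* models an error raised inside PG_TRY */
	}
	return x;					/* well-defined only because x is volatile */
}
```

Tom's version sidesteps the issue for pltargs by assigning it only before PG_TRY, so no qualifier is needed at all.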
Re: Popcount optimization using AVX512
On Mon, Apr 01, 2024 at 01:06:12PM +0200, Alvaro Herrera wrote: > On 2024-Mar-31, Nathan Bossart wrote: >> +popcnt = _mm512_reduce_add_epi64(accum); >> +return popcnt + pg_popcount_fast(buf, bytes); > > Hmm, doesn't this arrangement cause an extra function call to > pg_popcount_fast to be used here? Given the level of micro-optimization > being used by this code, I would have thought that you'd have tried to > avoid that. (At least, maybe avoid the call if bytes is 0, no?) Yes, it does. I did another benchmark on very small arrays and can see the overhead. This is the time in milliseconds to run pg_popcount() on an array 1 billion times:

    size (bytes)    HEAD        AVX512-POPCNT
    1               1707.685    3480.424
    2               1926.694    4606.182
    4               3210.412    5284.506
    8               1920.703    3640.968
    16              2936.910    4045.586
    32              3627.956    5538.418
    64              5347.213    3748.212

I suspect that anything below 64 bytes will see this regression, as that is the earliest point where there are enough bytes for ZMM registers. We could avoid the call if there are no remaining bytes, but the numbers for the smallest arrays probably wouldn't improve much, and that might actually add some overhead due to branching. The other option to avoid this overhead is to put most of pg_bitutils.c into its header file so that we can inline the call. Reviewing the current callers of pg_popcount(), IIUC the only ones that are passing very small arrays are the bit_count() implementations and a call in the syslogger for a single byte. I don't know how much to worry about the overhead for bit_count() since there's presumably a bunch of other overhead, and the syslogger one could probably be fixed via an inline function that pulled the value from pg_number_of_ones (which would probably be an improvement over the status quo, anyway). But this is all to save a couple of nanoseconds... -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
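[For reference, the small-input fallback behind the inlining idea can be sketched standalone. This is an illustration with an invented name, not pg_bitutils.c itself: for buffers under the 64-byte SIMD threshold discussed above, a plain per-byte loop (or a 256-entry lookup table like pg_number_of_ones) avoids any dispatch overhead.]

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hedged sketch of a scalar popcount for small buffers, using
 * Kernighan's trick (clearing the lowest set bit until zero).
 * An always-inline version of something like this is what the
 * text suggests for callers such as bit_count() and the
 * syslogger's single-byte case.
 */
static uint64_t
popcount_small(const char *buf, size_t bytes)
{
	uint64_t	cnt = 0;

	for (size_t i = 0; i < bytes; i++)
	{
		unsigned char b = (unsigned char) buf[i];

		while (b)
		{
			b &= (unsigned char) (b - 1);	/* clear lowest set bit */
			cnt++;
		}
	}
	return cnt;
}
```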
Re: Statistics Import and Export
Corey Huinker writes: >> IIRC, "variadic any" requires having at least one variadic parameter. >> But that seems fine --- what would be the point, or even the >> semantics, of calling pg_set_attribute_stats with no data fields? > If my pg_dump run emitted a bunch of stats that could never be imported, > I'd want to know. With silent failures, I don't. What do you think would be silent about that? If there's a complaint to be made, it's that it'd be a hard failure ("no such function"). To be clear, I'm ok with emitting ERROR for something that pg_dump clearly did wrong, which in this case would be emitting a set_statistics call for an attribute it had exactly no stats values for. What I think needs to be WARN is conditions that the originating pg_dump couldn't have foreseen, for example cross-version differences. If we do try to check things like sort order, that complaint obviously has to be WARN, since it's checking something potentially different from what was correct at the source server. >> Perhaps we could >> invent a new backend function that extracts the actual element type >> of a non-null anyarray argument. > A backend function that we can't guarantee exists on the source system. :( [ shrug... ] If this doesn't work for source servers below v17, that would be a little sad, but it wouldn't be the end of the world. I see your point that that is an argument for finding another way, though. >> Another way we could get to no-coercions is to stick with your >> signature but declare the relevant parameters as anyarray instead of >> text. > I'm a bit confused here. AFAIK we can't construct an anyarray in SQL: > # select '{1,2,3}'::anyarray; > ERROR: cannot accept a value of type anyarray That's not what I suggested at all. The function parameters would be declared anyarray, but the values passed to them would be coerced to the correct concrete array types. So as far as the coercion rules are concerned this'd be equivalent to the variadic-any approach. 
> That's pretty persuasive. It also means that we need to trap for error in > the array_in() calls, as that function does not yet have a _safe() mode. Well, the approach I'm advocating for would have the array input and coercion done by the calling query before control ever reaches pg_set_attribute_stats, so that any incorrect-for-the-data-type values would result in hard errors. I think that's okay for the same reason you probably figured you didn't have to trap array_in: it's the fault of the originating pg_dump if it offers a value that doesn't coerce to the datatype it claims the value is of. My formulation is a bit safer though in that it's the originating pg_dump, not the receiving server, that is in charge of saying which type that is. (If that type doesn't agree with what the receiving server thinks it should be, that's a condition that pg_set_attribute_stats itself will detect, and then it can WARN and move on to the next thing.) regards, tom lane
RE: Psql meta-command conninfo+
Hi! (v24) > I meant those columns to be just examples, but this should be applied to > all the other columns in the output as well. Applied the translation to all columns. Regards, Maiquel Grassi. v24-0001-psql-meta-command-conninfo-plus.patch
Re: Allow non-superuser to cancel superuser tasks.
I took the liberty of continuing to work on this after chatting with Nathan. I've attached the updated patch with some improvements. Thanks. -- Anthony Leung Amazon Web Services: https://aws.amazon.com v1-0001-Introduce-pg_signal_autovacuum-role-to-allow-non-sup.patch
Re: Security lessons from liblzma
On Apr 1, 2024, at 06:55, walt...@technowledgy.de wrote: > Also a configurable directoy to look up extensions, possibly even to be > changed at run-time like [2]. The patch says this: > >> This directory is prepended to paths when loading extensions (control and >> SQL files), and to the '$libdir' directive when loading modules that back >> functions. The location is made configurable to allow build-time testing of >> extensions that do not have been installed to their proper location yet. > > This seems like a great thing to have. This might also be relevant in light > of recent discussions in the ecosystem around extension management. > > All the path-related issues have in common, that while it's easy to move > files around to their proper locations later, they all need to adjust > pg_config's output. Funny timing, I was planning to resurrect this old patch[1] and propose that patch this week. One of motivators is the increasing use of Docker images in Kubernetes to run Postgres, where there’s a desire to keep the core service and extensions immutable, and to have a second directory mounted to a persistent volume into which other extensions can be installed and preserved independently of the Docker image. The current approach involves symlinking shenanigans[2] that complicate things pretty substantially, making it more difficult to administer. A second directory fit for purpose would be far better. There are some other motivators, so I’ll do some additional diligence and start a separate thread (or reply to the original[3]). Best, David [1] https://commitfest.postgresql.org/5/170/ [2] https://speakerdeck.com/ongres/postgres-extensions-in-kubernetes?slide=14 [3] https://www.postgresql.org/message-id/flat/51AE0845.8010600%40ocharles.org.uk
Re: Popcount optimization using AVX512
On 2024-Mar-31, Nathan Bossart wrote: > +uint64 > +pg_popcount_avx512(const char *buf, int bytes) > +{ > + uint64 popcnt; > + __m512i accum = _mm512_setzero_si512(); > + > + for (; bytes >= sizeof(__m512i); bytes -= sizeof(__m512i)) > + { > + const __m512i val = _mm512_loadu_si512((const __m512i > *) buf); > + const __m512i cnt = _mm512_popcnt_epi64(val); > + > + accum = _mm512_add_epi64(accum, cnt); > + buf += sizeof(__m512i); > + } > + > + popcnt = _mm512_reduce_add_epi64(accum); > + return popcnt + pg_popcount_fast(buf, bytes); > +} Hmm, doesn't this arrangement cause an extra function call to pg_popcount_fast to be used here? Given the level of micro-optimization being used by this code, I would have thought that you'd have tried to avoid that. (At least, maybe avoid the call if bytes is 0, no?) -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "El Maquinismo fue proscrito so pena de cosquilleo hasta la muerte" (Ijon Tichy en Viajes, Stanislaw Lem)
Re: Psql meta-command conninfo+
On 2024-Mar-31, Maiquel Grassi wrote: > Yes Álvaro, I have already appointed you as the patch reviewer. > It's true, even before publishing it on Commifest, you have > already provided good ideas and guidance. Thanks. > I adjusted the translation syntax for the SSL and GSSAPI columns. > After your validation, that is, an okay confirmation that it's fine, I'll > proceed with the others. I meant those columns to be just examples, but this should be applied to all the other columns in the output as well. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "No necesitamos banderas No reconocemos fronteras" (Jorge González)
[plpython] Add missing volatile qualifier.
Hi hackers, I'm playing a toy static analysis checker with PostgreSQL and found a variable is missing volatile qualifier. Best Regards, Xing From 7084055da64d0433b09e50faff630e551c363f27 Mon Sep 17 00:00:00 2001 From: Xing Guo Date: Mon, 1 Apr 2024 21:39:04 +0800 Subject: [PATCH v1] Add missing volatile qualifier. The object pltargs is modified in the PG_TRY block (line 874) and used in the PG_CATCH block (line 881) which should be qualified with volatile. --- src/pl/plpython/plpy_exec.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/pl/plpython/plpy_exec.c b/src/pl/plpython/plpy_exec.c index e06fde1dd9..cd71082a5f 100644 --- a/src/pl/plpython/plpy_exec.c +++ b/src/pl/plpython/plpy_exec.c @@ -689,7 +689,7 @@ PLy_trigger_build_args(FunctionCallInfo fcinfo, PLyProcedure *proc, HeapTuple *r *pltrelid, *plttablename, *plttableschema, - *pltargs = NULL, + *volatile pltargs = NULL, *pytnew, *pytold, *pltdata; -- 2.44.0
Re: Improve tab completion for ALTER DEFAULT PRIVILEGE and ALTER TABLE
On Thu, 28 Mar 2024 at 13:05, Masahiko Sawada wrote: > > Hi, > > Thank you for the patch! > > On Mon, Jul 3, 2023 at 12:12 AM vignesh C wrote: > > > > Hi, > > > > Improved tab completion for "ALTER DEFAULT PRIVILEGE" and "ALTER TABLE": > > 1) GRANT, REVOKE and FOR USER keyword was not displayed in tab > > completion of alter default privileges like the below statement: > > ALTER DEFAULT PRIVILEGES GRANT INSERT ON tables TO PUBLIC; > > ALTER DEFAULT PRIVILEGES REVOKE INSERT ON tables FROM PUBLIC; > > ALTER DEFAULT PRIVILEGES FOR USER vignesh revoke INSERT ON tables FROM dba1; > > +1 > > > > > 2) USER was not displayed for "ALTER DEFAULT PRIVILEGES IN SCHEMA > > public FOR " like in below statement: > > ALTER DEFAULT PRIVILEGES IN SCHEMA public FOR USER dba1 GRANT INSERT > > ON TABLES TO PUBLIC; > > Since there is no difference FOR USER and FOR ROLE, I'm not sure we > really want to support both in tab-completion. I have removed this change > > > > 3) "FOR GRANT OPTION" was not display for "ALTER DEFAULT PRIVILEGES > > REVOKE " like in below statement: > > alter default privileges revoke grant option for select ON tables FROM dba1; > > +1. 
But the v3 patch doesn't cover the following case: > > =# alter default privileges for role masahiko revoke [tab] > ALL CREATE DELETE EXECUTE INSERT MAINTAIN > REFERENCES SELECT TRIGGER TRUNCATE UPDATE USAGE Modified in the updated patch > And it doesn't cover MAINTAIN neither: > > =# alter default privileges revoke [tab] > ALL DELETE GRANT OPTION FOR REFERENCES > TRIGGER UPDATE > CREATE EXECUTE INSERT SELECT > TRUNCATE USAGE Modified in the updated patch > The patch adds the completions for ALTER DEFAULT PRIVILEGES REVOKE, > but we handle such case in GRANT and REVOKE part: > > (around L3958) > /* > * With ALTER DEFAULT PRIVILEGES, restrict completion to grantable > * privileges (can't grant roles) > */ > if (HeadMatches("ALTER", "DEFAULT", "PRIVILEGES")) > COMPLETE_WITH("SELECT", "INSERT", "UPDATE", > "DELETE", "TRUNCATE", "REFERENCES", "TRIGGER", > "CREATE", "EXECUTE", "USAGE", "MAINTAIN", "ALL"); The current patch handles the fix from here now. > Also, I think we can support WITH GRANT OPTION too. For example, > > =# alter default privileges for role masahiko grant all on tables to > public [tab] I have handled this in the updated patch > It's already supported in the GRANT statement. > > > > 4) "DATA TYPE" was missing in "ALTER TABLE table-name ALTER COLUMN > > column-name SET" like in: > > ALTER TABLE t1 ALTER COLUMN c1 SET DATA TYPE text; > > > > +1. The patch looks good to me, so pushed. Thanks for committing this. The updated patch has the changes for the above comments.
Regards, Vignesh From 7c3bbfb0c8c17bf5d4a7e2ce579d00b811844a06 Mon Sep 17 00:00:00 2001 From: Vignesh C Date: Sun, 2 Jul 2023 19:35:46 +0530 Subject: [PATCH v4] Fix missing tab completion in "ALTER DEFAULT PRIVILEGES" GRANT, REVOKE and FOR USER keyword was not displayed in tab completion of "ALTER DEFAULT PRIVILEGES" like the below statements: ALTER DEFAULT PRIVILEGES GRANT INSERT ON tables TO PUBLIC; ALTER DEFAULT PRIVILEGES REVOKE INSERT ON tables FROM PUBLIC; ALTER DEFAULT PRIVILEGES FOR USER testdba REVOKE INSERT ON tables FROM testdba; "GRANT OPTION OPTION" was not displayed in tab completion of "ALTER DEFAULT PRIVILEGES REVOKE " like in below statement: ALTER DEFAULT PRIVILEGES REVOKE GRANT OPTION FOR SELECT ON tables FROM testdba; "WITH GRANT OPTION" was not completed in tab completion of "ALTER DEFAULT PRIVILEGES ... GRANT ... to role" like in below statement: ALTER DEFAULT PRIVILEGES FOR ROLE testdba1 GRANT ALL ON tables to testdba2 WITH GRANT OPTION; --- src/bin/psql/tab-complete.c | 20 1 file changed, 16 insertions(+), 4 deletions(-) diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c index 82eb3955ab..620628948a 100644 --- a/src/bin/psql/tab-complete.c +++ b/src/bin/psql/tab-complete.c @@ -2147,7 +2147,7 @@ psql_completion(const char *text, int start, int end) /* ALTER DEFAULT PRIVILEGES */ else if (Matches("ALTER", "DEFAULT", "PRIVILEGES")) - COMPLETE_WITH("FOR ROLE", "IN SCHEMA"); + COMPLETE_WITH("FOR", "GRANT", "IN SCHEMA", "REVOKE"); /* ALTER DEFAULT PRIVILEGES FOR */ else if (Matches("ALTER", "DEFAULT", "PRIVILEGES", "FOR")) COMPLETE_WITH("ROLE"); @@ -3949,9 +3949,17 @@ psql_completion(const char *text, int start, int end) * privileges (can't grant roles) */ if (HeadMatches("ALTER", "DEFAULT", "PRIVILEGES")) - COMPLETE_WITH("SELECT", "INSERT", "UPDATE", - "DELETE", "TRUNCATE", "REFERENCES", "TRIGGER", - "CREATE", "EXECUTE", "USAGE", "MAINTAIN", "ALL"); + { + if (TailMatches("GRANT")) +COMPLETE_WITH("SELECT", "INSERT", "UPDATE", +
Re: Synchronizing slots from primary to standby
Hi, On Mon, Apr 01, 2024 at 05:04:53PM +0530, Amit Kapila wrote: > On Mon, Apr 1, 2024 at 2:51 PM Bertrand Drouvot > wrote: > > Then there is no need to call WaitForStandbyConfirmation() as it could go > > until > > the RecoveryInProgress() in StandbySlotsHaveCaughtup() for nothing (as we > > already > > know it). > > > > Won't it will normally return from the first check in > WaitForStandbyConfirmation() because standby_slot_names_config is not > set on standby? I think standby_slot_names can be set on a standby. One could want to set it in a cascading standby env (though it won't have any real effects until the standby is promoted). > > > 2 === > > > > + { > > + if (SnapBuildSnapshotExists(remote_slot->restart_lsn)) > > + { > > > > That could call SnapBuildSnapshotExists() multiple times for the same > > "restart_lsn" (for example in case of multiple remote slots to sync). > > > > What if the sync worker records the last lsn it asks for serialization (and > > serialized ? Then we could check that value first before deciding to call > > (or not) > > SnapBuildSnapshotExists() on it? > > > > It's not ideal because it would record "only the last one" but that would be > > simple enough for now (currently there is only one sync worker so that > > scenario > > is likely to happen). > > > > Yeah, we could do that but I am not sure how much it can help. I guess > we could do some tests to see if it helps. Yeah not sure either. I just think it can only help and shouldn't make things worst (but could avoid extra SnapBuildSnapshotExists() calls). Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Synchronizing slots from primary to standby
On Mon, Apr 1, 2024 at 10:40 AM Amit Kapila wrote: > > After this step and before the next, did you ensure that the slot sync > has synced the latest confirmed_flush/restart LSNs? You can query: > "select slot_name,restart_lsn, confirmed_flush_lsn from > pg_replication_slots;" to ensure the same on both the primary and > standby. Yes, after ensuring the slot is synced on the standby, the problem is reproduced for me and the proposed patch fixes it (i.e. able to see the changes even after the promotion). I'm just thinking if we can add a TAP test for this issue, but one key aspect of this reproducer is to not let someone write a RUNNING_XACTS WAL record on the primary in between before the standby promotion. Setting bgwriter_delay to max isn't helping me. I think we can think of using an injection point to add delay in LogStandbySnapshot() for having this problem reproduced consistently in a TAP test. Perhaps, we can think of adding this later after the fix is shipped. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Fix parameters order for relation_copy_for_cluster
Correction: so now the code is not broken.
Re: Teach predtest about IS [NOT] proofs
On Mon, Mar 25, 2024 at 5:53 PM Tom Lane wrote:
>
> James Coleman writes:
> > [ v6 patchset ]
>
> I went ahead and committed 0001 after one more round of review
>
> statements; my bad). I also added the changes in test_predtest.c from
> 0002. I attach a rebased version of 0002, as well as 0003 which isn't
> changed, mainly to keep the cfbot happy.
>
> I'm still not happy with what you did in predicate_refuted_by_recurse:
> it feels wrong and rather expensively so. There has to be a better
> way. Maybe strong vs. weak isn't quite the right formulation for
> refutation tests?

Possibly. Earlier I'd mused that:

> Alternatively (to avoid unnecessary CPU burn) we could modify
> predicate_implied_by_recurse (and functions called by it) to take an
> argument beyond "weak = true/false", e.g., an enum that allows for
> something like "WEAK", "STRONG", and "EITHER". That's a bigger change,
> so I didn't want to do that right away unless there was agreement on
> that direction.

I'm going to try implementing that and see how I feel about what it
looks like in practice.

Regards,
James Coleman
Re: Teach predtest about IS [NOT] proofs
On Mon, Mar 25, 2024 at 11:45 PM Tom Lane wrote:
>
> I wrote:
> > I went ahead and committed 0001 after one more round of review
> >
> > statements; my bad). I also added the changes in test_predtest.c from
> > 0002. I attach a rebased version of 0002, as well as 0003 which isn't
> > changed, mainly to keep the cfbot happy.
>
> [ squint.. ] Apparently I managed to hit ^K right before sending this
> email. The missing line was meant to be more or less
>
>       which found a couple of missing "break"
>
> Not too important, but perhaps future readers of the archives will
> be confused.

I was wondering myself :) so thanks for clarifying.

Regards,
James Coleman
Re: Emitting JSON to file using COPY TO
On Sat, Mar 9, 2024 at 9:13 AM jian he wrote:
>
> On Sat, Mar 9, 2024 at 2:03 AM Joe Conway wrote:
> >
> > On 3/8/24 12:28, Andrey M. Borodin wrote:
> > > Hello everyone!
> > >
> > > Thanks for working on this, really nice feature!
> > >
> > >> On 9 Jan 2024, at 01:40, Joe Conway wrote:
> > >>
> > >> Thanks -- will have a look
> > >
> > > Joe, recently folks proposed a lot of patches in this thread that seem
> > > to have diverted from the original way of implementation.
> > > As the author of CF entry [0], can you please comment on which patch
> > > version needs review?
> >
> > I don't know if I agree with the proposed changes, but I have also been
> > waiting to see how the parallel discussion regarding COPY extensibility
> > shakes out.
> >
> > And there were a couple of issues found that need to be tracked down.
> >
> > Additionally I have had time/availability challenges recently.
> >
> > Overall, chances seem slim that this will make it into 17, but I have
> > not quite given up hope yet either.
>
> Hi.
> Summary of the changes I've made in the v9 patches at [0]:
>
> meta: rebased. Now you need to use `git apply` or `git am`; previously,
> for copyto_json.007.diff, you needed to use GNU patch.
>
> At [1], Dean Rasheed found some corner cases where the returned slot's
> tts_tupleDescriptor from
> `
> ExecutorRun(cstate->queryDesc, ForwardScanDirection, 0, true);
> processed = ((DR_copy *) cstate->queryDesc->dest)->processed;
> `
> cannot be used for composite_to_json.
> Generally, DestReceiver->rStartup is what sends the TupleDesc to the
> DestReceiver. The COPY TO DestReceiver's rStartup function is
> copy_dest_startup, but copy_dest_startup is a no-op.
> That means, to make the final TupleDesc of a COPY TO (FORMAT JSON)
> operation bulletproof, we need to copy the tupDesc from CopyToState's
> queryDesc.
> This only applies when the COPY TO source is a query (example:
> copy (select 1) to stdout), not a table.
> The above is my interpretation.
Trying to simplify the explanation: first refer to the struct DestReceiver.
For COPY TO (FORMAT JSON), we didn't send the preliminary TupleDesc to the
DestReceiver via the rStartup function pointer within struct _DestReceiver.
In `CopyOneRowTo(CopyToState cstate, TupleTableSlot *slot)`, the slot is the
final slot returned via query execution, but we cannot use its tupdesc
(slot->tts_tupleDescriptor) to do composite_to_json, because the final
returned slot's TupleDesc may change during query execution. So we need to
copy the tupDesc from CopyToState's queryDesc.

Also rebased; now we can apply it cleanly.

From 17d9f3765bb74d863e229afed59af7dd923f5379 Mon Sep 17 00:00:00 2001
From: jian he
Date: Mon, 1 Apr 2024 19:47:40 +0800
Subject: [PATCH v10 1/2] introduce json format for COPY TO operation.

json format is only allowed in COPY TO operation.
also cannot be used with header option.

discussion: https://postgr.es/m/CALvfUkBxTYy5uWPFVwpk_7ii2zgT07t3d-yR_cy4sfrrLU%3Dkcg%40mail.gmail.com
discussion: https://postgr.es/m/6a04628d-0d53-41d9-9e35-5a8dc302c...@joeconway.com
---
 doc/src/sgml/ref/copy.sgml         |   5 ++
 src/backend/commands/copy.c        |  13 +++
 src/backend/commands/copyto.c      | 125 -
 src/backend/parser/gram.y          |   8 ++
 src/backend/utils/adt/json.c       |   5 +-
 src/include/commands/copy.h        |   1 +
 src/include/utils/json.h           |   2 +
 src/test/regress/expected/copy.out |  54 +
 src/test/regress/sql/copy.sql      |  38 +
 9 files changed, 208 insertions(+), 43 deletions(-)

diff --git a/doc/src/sgml/ref/copy.sgml b/doc/src/sgml/ref/copy.sgml
index 33ce7c4e..add84dbb 100644
--- a/doc/src/sgml/ref/copy.sgml
+++ b/doc/src/sgml/ref/copy.sgml
@@ -218,9 +218,14 @@ COPY { table_name [ ( text, csv (Comma Separated Values),
+      json (JavaScript Object Notation),
       or binary.
       The default is text.
+
+      The json option is allowed only in
+      COPY TO.
+
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index f75e1d70..02f16d9e 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -497,6 +497,8 @@ ProcessCopyOptions(ParseState *pstate,
 			 /* default format */ ;
 		else if (strcmp(fmt, "csv") == 0)
 			opts_out->csv_mode = true;
+		else if (strcmp(fmt, "json") == 0)
+			opts_out->json_mode = true;
 		else if (strcmp(fmt, "binary") == 0)
 			opts_out->binary = true;
 		else
@@ -740,6 +742,11 @@ ProcessCopyOptions(ParseState *pstate,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg("cannot specify HEADER in BINARY mode")));
 
+	if (opts_out->json_mode && opts_out->header_line)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("cannot specify HEADER in JSON mode")));
+
 	/* Check quote */
 	if (!opts_out->csv_mode && opts_out->quote != NULL)
 		ereport(ERROR,
@@ -817,6 +824,12 @@ ProcessCopyOptions(ParseState *pstate,
Re: Synchronizing slots from primary to standby
On Mon, Apr 1, 2024 at 2:51 PM Bertrand Drouvot wrote:
>
> Hi,
>
> On Mon, Apr 01, 2024 at 06:05:34AM +, Zhijie Hou (Fujitsu) wrote:
> > On Monday, April 1, 2024 8:56 AM Zhijie Hou (Fujitsu) wrote:
> > Attach the V4 patch which includes the optimization to skip the decoding if
> > the snapshot at the syncing restart_lsn is already serialized. It can avoid
> > most of the duplicate decoding in my test, and I am doing some more tests
> > locally.
>
> Thanks!
>
> 1 ===
>
> Same comment as in [1].
>
> In LogicalSlotAdvanceAndCheckReadynessForDecoding(), if we are synchronizing
> slots then I think that we can skip:
>
> +       /*
> +        * Wait for specified streaming replication standby servers (if any)
> +        * to confirm receipt of WAL up to moveto lsn.
> +        */
> +       WaitForStandbyConfirmation(moveto);
>
> Indeed, if we are dealing with a synced slot then we know we're in
> RecoveryInProgress().
>
> Then there is no need to call WaitForStandbyConfirmation() as it could go
> until the RecoveryInProgress() in StandbySlotsHaveCaughtup() for nothing
> (as we already know it).

Won't it normally return from the first check in
WaitForStandbyConfirmation() because standby_slot_names_config is not
set on standby?

> 2 ===
>
> +       {
> +               if (SnapBuildSnapshotExists(remote_slot->restart_lsn))
> +               {
>
> That could call SnapBuildSnapshotExists() multiple times for the same
> "restart_lsn" (for example in case of multiple remote slots to sync).
>
> What if the sync worker records the last lsn it asked for serialization
> (and serialized)? Then we could check that value first before deciding to
> call (or not) SnapBuildSnapshotExists() on it?
>
> It's not ideal because it would record "only the last one", but that would
> be simple enough for now (currently there is only one sync worker, so that
> scenario is likely to happen).

Yeah, we could do that but I am not sure how much it can help. I guess
we could do some tests to see if it helps.

--
With Regards,
Amit Kapila.