Re: Table AM Interface Enhancements
On Sat, 2024-03-30 at 23:33 +0200, Alexander Korotkov wrote:
> I've pushed 0001, 0002 and 0006.

Sorry to jump into this discussion so late. I had worked on something like the custom reloptions (0002) in the past, and there were some complications that don't seem to be addressed in commit c95c25f9af.

* At minimum I think it needs some direction (comments, docs, tests) that shows how it's supposed to be used.

* The bytea returned by the reloptions() method is not in a trivial format. It's a StdRdOptions struct with string values stored after the end of the struct. To build the bytea internally, there's some infrastructure like allocateRelOptStruct() and fillRelOptions(), and it's not very easy to extend those to support a few custom options.

* If we ever decide to add a string option to StdRdOptions, I think the design breaks, because the code that looks for those string values wouldn't know how to skip over the custom options. Perhaps we can just promise to never do that, but we should make it explicit somehow.

* Most existing heap reloptions (other than fillfactor) are used by other parts of the system (like autovacuum) so should be considered valid for any AM. Most AMs will just want to add a handful of their own options on top, so it would be good to demonstrate how this should be done.

* There are still places that are inappropriately calling heap_reloptions directly. For instance, in ProcessUtilitySlow(), it seems to assume that a toast table is a heap?

Regards,
Jeff Davis
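To make the second bullet concrete, here is a hypothetical miniature of the layout being described: fixed-size members first, string values appended after the end of the struct and addressed by byte offset. All names here (MiniRelOptions, mini_build, mini_get_string) are invented for illustration; the real StdRdOptions/fillRelOptions machinery is considerably more involved.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical miniature of the reloptions layout described above:
 * fixed-size fields in the struct, string values stored in the bytes
 * following it, each recorded as an offset from the start of the blob
 * (0 meaning "not set"). */
typedef struct MiniRelOptions
{
	int			fillfactor;
	int			str_option_offset;	/* offset of a string value, or 0 */
} MiniRelOptions;

static MiniRelOptions *
mini_build(int fillfactor, const char *strval)
{
	size_t		slen = strval ? strlen(strval) + 1 : 0;
	MiniRelOptions *opts = calloc(1, sizeof(MiniRelOptions) + slen);

	opts->fillfactor = fillfactor;
	if (strval)
	{
		/* string payload lives immediately past the struct */
		opts->str_option_offset = (int) sizeof(MiniRelOptions);
		memcpy((char *) opts + opts->str_option_offset, strval, slen);
	}
	return opts;
}

static const char *
mini_get_string(const MiniRelOptions *opts)
{
	return opts->str_option_offset
		? (const char *) opts + opts->str_option_offset
		: NULL;
}
```

The fragility being pointed out falls out of this layout: anything that walks the blob by offset has to know about every appended value, so custom AM options and any future string option in StdRdOptions would have to agree on how to skip over each other.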
Re: ALTER TABLE SET ACCESS METHOD on partitioned tables
Justin Pryzby writes:
> On Sun, Mar 31, 2024 at 12:00:00PM +0300, Alexander Lakhin wrote:
>> I've stumbled upon a test failure caused by the test query added in that
>> commit:
>> +ERROR: deadlock detected
>> +DETAIL: Process 3076180 waits for AccessShareLock on relation 1259 of
>> database 16386; blocked by process 3076181.
>> +Process 3076181 waits for AccessShareLock on relation 2601 of database
>> 16386; blocked by process 3076180.
> I think that means that, although it was cute to use pg_am in the reproducer
> given in the problem report, it's not a good choice to use here in the
> sql regression tests.

Another case here:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sevengill&dt=2024-04-02%2001%3A32%3A17

AFAICS, e2395cdbe posits that taking exclusive lock on pg_am in the middle of a bunch of concurrent regression scripts couldn't possibly cause any problems. Really?

regards, tom lane
Re: Synchronizing slots from primary to standby
Hi,

On Tue, Apr 02, 2024 at 04:24:49AM +, Zhijie Hou (Fujitsu) wrote:
> On Monday, April 1, 2024 9:28 PM Bertrand Drouvot wrote:
> > On Mon, Apr 01, 2024 at 05:04:53PM +0530, Amit Kapila wrote:
> > > On Mon, Apr 1, 2024 at 2:51 PM Bertrand Drouvot wrote:
> > > >
> > > > 2 ===
> > > >
> > > > + {
> > > > + if (SnapBuildSnapshotExists(remote_slot->restart_lsn))
> > > > + {
> > > >
> > > > That could call SnapBuildSnapshotExists() multiple times for the
> > > > same "restart_lsn" (for example in case of multiple remote slots to
> > > > sync).
> > > >
> > > > What if the sync worker records the last lsn it asks for
> > > > serialization (and serialized)? Then we could check that value first
> > > > before deciding to call (or not) SnapBuildSnapshotExists() on it?
> > > >
> > > > It's not ideal because it would record "only the last one" but that
> > > > would be simple enough for now (currently there is only one sync
> > > > worker so that scenario is likely to happen).
> > >
> > > Yeah, we could do that but I am not sure how much it can help. I guess
> > > we could do some tests to see if it helps.
> >
> > Yeah, not sure either. I just think it can only help and shouldn't make
> > things worse (but could avoid extra SnapBuildSnapshotExists() calls).
>
> Thanks for the idea. I tried some tests based on Nisha's setup[1].

Thank you and Nisha and Shveta for the testing!

> I tried to advance the slots on the primary to the same restart_lsn before
> calling sync_replication_slots(), and reduced the data generated by pgbench.

Agree that this scenario makes sense to try to see the impact of SnapBuildSnapshotExists().

> The SnapBuildSnapshotExists is still not noticeable in the profile.

SnapBuildSnapshotExists()'s number of calls is probably negligible when compared to the IO calls generated by the fast forward logical decoding in this scenario.

> So, I feel we could leave this as a further improvement once we encounter
> scenarios where the duplicate SnapBuildSnapshotExists call becomes
> noticeable.

Sounds reasonable to me.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
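The "remember the last serialized restart_lsn" idea discussed above can be sketched as a one-entry cache in front of the existence check. This is illustrative C, not the actual slotsync code; snapshot_exists() is a counting stub standing in for SnapBuildSnapshotExists(), so the effect of the cache is visible.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

/* Counting stub standing in for SnapBuildSnapshotExists(). */
static int	snapshot_exists_calls = 0;

static bool
snapshot_exists(XLogRecPtr lsn)
{
	(void) lsn;
	snapshot_exists_calls++;
	return true;				/* pretend a serialized snapshot is always found */
}

/* One-entry cache: the last restart_lsn for which a serialized snapshot
 * was confirmed.  0 plays the role of InvalidXLogRecPtr here. */
static XLogRecPtr last_serialized_lsn = 0;

static bool
snapshot_exists_cached(XLogRecPtr restart_lsn)
{
	/* Same LSN as last time: skip the lookup entirely. */
	if (restart_lsn != 0 && restart_lsn == last_serialized_lsn)
		return true;

	if (snapshot_exists(restart_lsn))
	{
		last_serialized_lsn = restart_lsn;
		return true;
	}
	return false;
}
```

As noted in the thread, recording "only the last one" is enough while a single sync worker processes slots one by one, since consecutive slots with the same restart_lsn are exactly the repetitive case.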
Re: Why is parula failing?
"Tharakan, Robins" writes:
>> I've now switched to GCC v13.2 and triggered a run. Let's see if the tests
>> stabilize now.
> So although HEAD ran fine, I saw multiple failures (v12, v13, v16), all of
> which passed on subsequent tries, some of which were even "signal 6: Aborted".

Ugh...

> Also, it'd be great if someone could point me to a way to update the
> "Compiler" section in "System Detail" on the buildfarm page (it wrongly
> shows GCC as v7.3.1).

The update_personality.pl script in the buildfarm client distro is what to use to adjust OS version or compiler version data.

regards, tom lane
RE: Why is parula failing?
> I've now switched to GCC v13.2 and triggered a run. Let's see if the tests
> stabilize now.

So although HEAD ran fine, I saw multiple failures (v12, v13, v16), all of which passed on subsequent tries, some of which were even "signal 6: Aborted". FWIW, I compiled gcc v13.2 (default options) from source, which IIUC shouldn't be to blame. Two other possible reasons could be that the buildfarm doesn't have an aarch64 + gcc 13.2 combination (on a quick check I couldn't see any), or that this hardware is flaky. Either way, this instance is a throw-away, so let me know if this isn't helping. I'll swap it out in case there's not much benefit to be had.

Also, it'd be great if someone could point me to a way to update the "Compiler" section in "System Detail" on the buildfarm page (it wrongly shows GCC as v7.3.1).

- thanks
robins
Re: Synchronizing slots from primary to standby
On Mon, Apr 1, 2024 at 5:05 PM Amit Kapila wrote:
> > 2 ===
> >
> > + {
> > + if (SnapBuildSnapshotExists(remote_slot->restart_lsn))
> > + {
> >
> > That could call SnapBuildSnapshotExists() multiple times for the same
> > "restart_lsn" (for example in case of multiple remote slots to sync).
> >
> > What if the sync worker records the last lsn it asks for serialization
> > (and serialized)? Then we could check that value first before deciding
> > to call (or not) SnapBuildSnapshotExists() on it?
> >
> > It's not ideal because it would record "only the last one" but that
> > would be simple enough for now (currently there is only one sync worker
> > so that scenario is likely to happen).
>
> Yeah, we could do that but I am not sure how much it can help. I guess
> we could do some tests to see if it helps.

I had a look at the test results conducted by Nisha; I did not find any repetitive restart_lsn from the primary being synced to the standby for that particular test of 100 slots. Unless we have some concrete test in mind (having repetitive restart_lsn), I do not think that by using the given tests we can establish the benefit of the suggested optimization.
Attached the log files of all slots test for reference, thanks
Shveta

 slot_name | database | xmin | cxmin | restart_lsn | flushlsn | wal_status | conf | failover | synced | temp
-----------+----------+------+-------+-------------+------------+------------+------+----------+--------+------
 slot_1| newdb1 | | 772 | 0/AD0 | 0/E43E9240 | reserved | f| t| f | f
 slot_10 | newdb2 | | 772 | 0/A0002C8 | 0/E43E9240 | reserved | f| t| f | f
 slot_100 | newdb20 | | 772 | 0/A005F90 | 0/E43E9240 | reserved | f| t| f | f
 slot_11 | newdb3 | | 772 | 0/A000300 | 0/E43E9240 | reserved | f| t| f | f
 slot_12 | newdb3 | | 772 | 0/A000338 | 0/E43E9240 | reserved | f| t| f | f
 slot_13 | newdb3 | | 772 | 0/A000370 | 0/E43E9240 | reserved | f| t| f | f
 slot_14 | newdb3 | | 772 | 0/A0003A8 | 0/E43E9240 | reserved | f| t| f | f
 slot_15 | newdb3 | | 772 | 0/A0003E0 | 0/E43E9240 | reserved | f| t| f | f
 slot_16 | newdb4 | | 772 | 0/A000418 | 0/E43E9240 | reserved | f| t| f | f
 slot_17 | newdb4 | | 772 | 0/A000450 | 0/E43E9240 | reserved | f| t| f | f
 slot_18 | newdb4 | | 772 | 0/A000488 | 0/E43E9240 | reserved | f| t| f | f
 slot_19 | newdb4 | | 772 | 0/A0004C0 | 0/E43E9240 | reserved | f| t| f | f
 slot_2| newdb1 | | 772 | 0/A000108 | 0/E43E9240 | reserved | f| t| f | f
 slot_20 | newdb4 | | 772 | 0/A0004F8 | 0/E43E9240 | reserved | f| t| f | f
 slot_21 | newdb5 | | 772 | 0/A000530 | 0/E43E9240 | reserved | f| t| f | f
 slot_22 | newdb5 | | 772 | 0/A000568 | 0/E43E9240 | reserved | f| t| f | f
 slot_23 | newdb5 | | 772 | 0/A0005A0 | 0/E43E9240 | reserved | f| t| f | f
 slot_24 | newdb5 | | 772 | 0/A0005D8 | 0/E43E9240 | reserved | f| t| f | f
 slot_25 | newdb5 | | 772 | 0/A000610 | 0/E43E9240 | reserved | f| t| f | f
 slot_26 | newdb6 | | 772 | 0/A000648 | 0/E43E9240 | reserved | f| t| f | f
 slot_27 | newdb6 | | 772 | 0/A000680 | 0/E43E9240 | reserved | f| t| f | f
 slot_28 | newdb6 | | 772 | 0/A0006B8 | 0/E43E9240 | reserved | f| t| f | f
 slot_29 | newdb6 | | 772 | 0/A0006F0 | 0/E43E9240 | reserved | f| t| f | f
 slot_3| newdb1 | | 772 | 0/A000140 | 0/E43E9240 | reserved | f| t| f | f
 slot_30 | newdb6 | | 772 | 0/A000728 | 0/E43E9240 | reserved | f| t| f | f
 slot_31 | newdb7 | | 772 | 0/A000760 | 0/E43E9240 | reserved | f| t| f | f
 slot_32 | newdb7 | | 772 | 0/A000798 | 0/E43E9240 | reserved | f| t| f | f
 slot_33 | newdb7 | | 772 | 0/A0007D0 | 0/E43E9240 | reserved | f| t| f | f
 slot_34 | newdb7 | | 772 | 0/A000808 | 0/E43E9240 | reserved | f| t| f | f
 slot_35 | newdb7 | | 772 | 0/A000840 | 0/E43E9240 | reserved | f| t| f | f
 slot_36 | newdb8 | | 772 | 0/A000878 | 0/E43E9240 | reserved | f| t| f | f
 slot_37 |
RE: Synchronizing slots from primary to standby
On Tuesday, April 2, 2024 8:43 AM Bharath Rupireddy wrote:
> On Mon, Apr 1, 2024 at 11:36 AM Zhijie Hou (Fujitsu) wrote:
> >
> > Attach the V4 patch which includes the optimization to skip the
> > decoding if the snapshot at the syncing restart_lsn is already
> > serialized. It can avoid most of the duplicate decoding in my test,
> > and I am doing some more tests locally.
>
> Thanks for the patch. I'm thinking if we can reduce the amount of work
> that we do for synced slots in each sync worker cycle. With that context
> in mind, why do we need to create decoding context every time?
> Can't we create it once, store it in an in-memory structure and use it
> for each sync worker cycle? Is there any problem with it? What do you
> think?

Thanks for the idea. I think the cost of creating the decoding context is relatively minor when compared to the IO cost. After generating the profiles for the tests shared by Nisha[1], it appears that StartupDecodingContext is not an issue.

While the suggested refactoring is an option, I think we can consider this as a future improvement, addressing it only if we encounter scenarios where the creation of the decoding context becomes a bottleneck.

[1] https://www.postgresql.org/message-id/CALj2ACUeij5tFzJ1-cuoUh%2Bmhj33v%2BYgqD_gHYUpRdXSCSBbhw%40mail.gmail.com

Best Regards,
Hou zj
Re: Synchronizing slots from primary to standby
On Mon, Apr 1, 2024 at 6:58 PM Bertrand Drouvot wrote: > > On Mon, Apr 01, 2024 at 05:04:53PM +0530, Amit Kapila wrote: > > On Mon, Apr 1, 2024 at 2:51 PM Bertrand Drouvot > > wrote: > > > Then there is no need to call WaitForStandbyConfirmation() as it could go > > > until > > > the RecoveryInProgress() in StandbySlotsHaveCaughtup() for nothing (as we > > > already > > > know it). > > > > > > > Won't it will normally return from the first check in > > WaitForStandbyConfirmation() because standby_slot_names_config is not > > set on standby? > > I think standby_slot_names can be set on a standby. One could want to set it > in > a cascading standby env (though it won't have any real effects until the > standby > is promoted). > Yeah, it is possible but doesn't seem worth additional checks for this micro-optimization. -- With Regards, Amit Kapila.
Re: [HACKERS] make async slave to wait for lsn to be replayed
Alexander Korotkov writes:

Hello,

> > Did you forget to attach the new patch?
>
> Yes, here it is.
>
> --
> Regards,
> Alexander Korotkov
>
> [4. text/x-diff;
> v17-0001-Implement-pg_wal_replay_wait-stored-procedure.patch]...

+
+pg_wal_replay_wait (
+    target_lsn pg_lsn,
+    timeout bigint DEFAULT 0)
+void
+

Should we return the milliseconds of waiting time? I think this information may be useful for users if they want to know how long it waited, for monitoring purposes.

--
Best Regards
Andy Fan
Re: Change GUC hashtable to use simplehash?
On Tue, Mar 5, 2024 at 5:30 PM John Naylor wrote:
> On Tue, Jan 30, 2024 at 5:04 PM John Naylor wrote:
> > On Tue, Jan 30, 2024 at 4:13 AM Ants Aasma wrote:
> > > But given that we know the data length and we have it in a register
> > > already, it's easy enough to just mask out data past the end with a
> > > shift. See patch 1. Performance benefit is about 1.5x. Measured on a
> > > small test harness that just hashes and finalizes an array of strings,
> > > with a data dependency between consecutive hashes (next address
> > > depends on the previous hash output).
> >
> > Interesting work! I've taken this idea and (I'm guessing, haven't
> > tested) improved it by re-using an intermediate step for the
> > conditional, simplifying the creation of the mask, and moving the
> > bitscan out of the longest dependency chain.
>
> This needed a rebase, and is now 0001. I plan to push this soon.

I pushed but had to revert -- my version (and I believe both) failed to keep the invariant that the aligned and unaligned paths must result in the same hash. It's clear to me how to fix, but I've injured my strong hand and won't be typing much for a couple of days. I'll prioritize the removal of strlen calls for v17, since the optimization can wait and there is also a valgrind issue I haven't looked into.
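The mask-by-shift trick under discussion can be illustrated with a toy chunked hash. This is a sketch, not the committed fasthash code: to stay free of out-of-bounds reads it memcpy's only the valid tail bytes (where the in-tree code loads a whole word and relies on the mask), but it shows both the `(1 << (len * 8)) - 1` mask and the aligned-vs-unaligned invariant whose violation forced the revert.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Toy stand-in for the hash under discussion: consume the input in
 * 64-bit chunks, then mask the final partial chunk so that bytes past
 * the end of the data cannot influence the result.  The multiplier and
 * seed constants are arbitrary. */
static uint64_t
hash_bytes(const char *buf, size_t len)
{
	uint64_t	h = UINT64_C(0x9E3779B97F4A7C15);

	while (len >= 8)
	{
		uint64_t	chunk;

		memcpy(&chunk, buf, 8);
		h = (h ^ chunk) * UINT64_C(0x100000001B3);
		buf += 8;
		len -= 8;
	}
	if (len > 0)
	{
		uint64_t	chunk = 0;

		/* Sketch-only: copy just the valid bytes instead of over-reading. */
		memcpy(&chunk, buf, len);
		/* The mask-by-shift from the thread; len < 8 here, so no UB. */
		chunk &= (UINT64_C(1) << (len * 8)) - 1;
		h = (h ^ chunk) * UINT64_C(0x100000001B3);
	}
	return h;
}
```

Because the tail is masked, hashing the same 13 bytes at two different alignments, surrounded by different garbage, must give the same answer on any given machine — the invariant the aligned and unaligned code paths have to agree on.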
Re: Introduce XID age and inactive timeout based replication slot invalidation
On Mon, Apr 1, 2024 at 12:18 PM Bharath Rupireddy wrote:
> On Fri, Mar 29, 2024 at 9:39 AM Amit Kapila wrote:
> >
> > Commit message states: "why we can't just update inactive_since for
> > synced slots on the standby with the value received from remote slot
> > on the primary. This is consistent with any other slot parameter i.e.
> > all of them are synced from the primary."
> >
> > The inactive_since is not consistent with other slot parameters which
> > we copy. We don't perform anything related to those other parameters,
> > like say two_phase, which can change that property. However, we
> > do acquire the slot, advance the slot (as per recent discussion [1]),
> > and release it. Since these operations can impact inactive_since, it
> > seems to me that inactive_since is not the same as other parameters.
> > It can have a different value than the primary. Why would anyone want
> > to know the value of inactive_since from primary after the standby is
> > promoted?
>
> After thinking about it for a while now, it feels to me that the
> synced slots (slots on the standby that are being synced from the
> primary) can have their own inactive_since value. Fundamentally,
> inactive_since is set to 0 when slot is acquired and set to current
> time when slot is released, no matter who acquires and releases it -
> be it walsenders for replication, or backends for slot advance, or
> backends for slot sync using pg_sync_replication_slots, or backends
> for other slot functions, or background sync worker. Remember the
> earlier patch was updating inactive_since just for walsenders, but
> then the suggestion was to update it unconditionally -
> https://www.postgresql.org/message-id/CAJpy0uD64X%3D2ENmbHaRiWTKeQawr-rbGoy_GdhQQLVXzUSKTMg%40mail.gmail.com.
> Whoever syncs the slot, *actually* acquires the slot i.e. makes it
> theirs, syncs it from the primary, and releases it. IMO, no
> differentiation is to be made for synced slots.
FWIW, coming to this thread late, I think that the inactive_since should not be synchronized from the primary. The wall clocks are different on the primary and the standby, so having the primary's timestamp on the standby can confuse users, especially when there is a big clock drift. Also, as Amit mentioned, inactive_since seems not to be consistent with other parameters we copy. The replication_slot_inactive_timeout feature should work on the standby independently from the primary, like other slot invalidation mechanisms, and it should be based on its own local clock.

> Coming to a future patch for inactive timeout based slot invalidation,
> we can either allow invalidation without any differentiation for
> synced slots or restrict invalidation to avoid more sync work. For
> instance, if inactive timeout is kept low on the standby, the sync
> worker will be doing more work as it drops and recreates a slot
> repeatedly if it keeps getting invalidated. Another thing is that the
> standby takes independent invalidation decisions for synced slots.
> AFAICS, invalidation due to wal_removal is the sole reason (out
> of all available invalidation reasons) for a synced slot to get
> invalidated independently of the primary. Check
> https://www.postgresql.org/message-id/CAA4eK1JXBwTaDRD_%3D8t6UB1fhRNjC1C%2BgH4YdDxj_9U6djLnXw%40mail.gmail.com
> for the suggestion on why we had better not differentiate invalidation
> decisions for synced slots.
>
> The assumption of letting synced slots have their own inactive_since
> not only simplifies the code, but also looks less confusing and more
> meaningful to the user. The only code that we put in on top of the
> committed code is to use InRecovery in place of
> RecoveryInProgress() in RestoreSlotFromDisk() to fix the issue raised
> by Shveta upthread.

If we want to invalidate the synced slots due to the timeout, I think we need to define what is "inactive" for synced slots.
Suppose that the slotsync worker updates the local (synced) slot's inactive_since whenever releasing the slot, irrespective of the actual LSNs (or other slot parameters) having been updated. I think that this idea cannot handle a slot that is not acquired on the primary. In this case, the remote slot is inactive but the local slot is regarded as active. WAL files are piled up on the standby (and on the primary) as the slot's LSNs don't move forward. I think we want to regard such a slot as "inactive" also on the standby and invalidate it because of the timeout. > > > Now, the other concern is that calling GetCurrentTimestamp() > > could be costly when the values for the slot are not going to be > > updated but if that happens we can optimize such that before acquiring > > the slot we can have some minimal pre-checks to ensure whether we need > > to update the slot or not. If we use such pre-checks, another problem might happen; it cannot handle a case where the slot is acquired on the primary but its LSNs don't move forward. Imagine a logical replication conflict happened on the
Re: Asymmetric partition-wise JOIN
On 15/10/2023 13:25, Alexander Korotkov wrote:
> Great! I'm looking forward to the revised patch.

Revising the code and opinions before restarting this work, I found two different possible strategies mentioned in the thread:

1. 'Common Resources' - share the materialised result of the inner table scan (a hash table in the case of HashJoin) to join each partition one by one. It gives us a profit in the case of parallel append and possibly other cases, like the one shown in the initial message.

2. 'Individual strategies' - by limiting the AJ feature to cases when the JOIN clause contains a partitioning expression, we can push an additional scan clause into each copy of the inner table scan, reduce the number of tuples scanned, and even prune something because of proven zero input.

I see the pros and cons of both approaches. The first option is more straightforward, and its outcome is obvious in the case of parallel append. But how can we guarantee the same join type for each join? Why should we ignore the positive effect of different strategies for different partitions?

The second strategy is more expensive for the optimiser, especially in the multipartition case. But as I can predict, it is easier to implement and looks more natural for the architecture.

What do you think about that?

--
regards,
Andrei Lepikhov
Postgres Professional
RE: Why is parula failing?
> ... in connection with which, I can't help noticing that parula is using a > very old compiler: > > configure: using compiler=gcc (GCC) 7.3.1 20180712 (Red Hat 7.3.1-17) > > From some quick checking around, that would have to be near the beginning of > aarch64 > support in RHEL (Fedora hadn't promoted aarch64 to a primary architecture > until earlier > that same year). It's not exactly hard to believe that there were some > lingering compiler bugs. > I wonder why parula is using that when its underlying system seems markedly > newer (the kernel at least has a recent build date). Parula used GCC v7.3.1 since that's what came by default. I've now switched to GCC v13.2 and triggered a run. Let's see if the tests stabilize now. - Robins
Re: Table AM Interface Enhancements
On Tue, 02 Apr 2024 at 07:59, Alexander Korotkov wrote: > On Mon, Apr 1, 2024 at 7:36 PM Tom Lane wrote: >> Coverity complained about what you did in RelationParseRelOptions >> in c95c25f9a: >> >> *** CID 1595992: Null pointer dereferences (FORWARD_NULL) >> /srv/coverity/git/pgsql-git/postgresql/src/backend/utils/cache/relcache.c: >> 499 in RelationParseRelOptions() >> 493 >> 494 /* >> 495 * Fetch reloptions from tuple; have to use a hardwired >> descriptor because >> 496 * we might not have any other for pg_class yet (consider >> executing this >> 497 * code for pg_class itself) >> 498 */ >> >>> CID 1595992: Null pointer dereferences (FORWARD_NULL) >> >>> Passing null pointer "tableam" to "extractRelOptions", which >> >>> dereferences it. >> 499 options = extractRelOptions(tuple, GetPgClassDescriptor(), >> 500 >> tableam, amoptsfn); >> 501 >> >> I see that extractRelOptions only uses the tableam argument for some >> relkinds, and RelationParseRelOptions does set it up for those >> relkinds --- but Coverity's complaint isn't without merit, because >> those two switch statements are looking at *different copies of the >> relkind*, which in theory could be different. This all seems quite >> messy and poorly factored. Can't we do better? Why do we need to >> involve two copies of allegedly the same pg_class tuple, anyhow? > > I wasn't registered at Coverity yet. Now I've registered and am > waiting for approval to access the PostgreSQL analysis data. > > I wonder why Coverity complains about tableam, but not amoptsfn. > Their usage patterns are very similar. > > It appears that relation->rd_rel isn't the full copy of pg_class tuple > (see AllocateRelationDesc). RelationParseRelOptions() is going to > update relation->rd_options, and thus needs a full pg_class tuple to > fetch options out of it. However, it is really unnecessary to access > both tuples at the same time. We can use a full tuple, not > relation->rd_rel, in both cases. See the attached patch. 
>
> --
> Regards,
> Alexander Korotkov

+	Form_pg_class classForm = (Form_pg_class) GETSTRUCT(tuple);
+;

There is an additional semicolon in the code.

--
Regards,
Japin Li
Re: Add new error_action COPY ON_ERROR "log"
On Mon, Apr 1, 2024 at 10:03 AM Masahiko Sawada wrote: > > On Sat, Mar 30, 2024 at 11:05 PM Bharath Rupireddy > wrote: > > > > On Thu, Mar 28, 2024 at 6:31 PM Masahiko Sawada > > wrote: > > > > > > That is, > > > since the LOG_VERBOSITY option is an enum parameter, it might make > > > more sense to require the value, instead of making the value optional. > > > For example, the following command could not be obvious for users: > > > > > > COPY test FROM stdin (ON_ERROR ignore, LOG_VERBOSITY); > > > > Agreed. Please see the attached v14 patch. > > Thank you for updating the patch! > > > The LOG_VERBOSITY now needs > > a value to be specified. Note that I've not added any test for this > > case as there seemed to be no such tests so far generating "ERROR: > > <> requires a parameter". I don't mind adding one for > > LOG_VERBOSITY though. > > +1 > > One minor point: > > ENCODING 'encoding_name' > +LOG_VERBOSITY [ mode ] > > > '[' and ']' are not necessary because the value is no longer optional. > > I've attached the updated patch. I'll push it, barring any objections. > Pushed. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: Synchronizing slots from primary to standby
On Mon, Apr 1, 2024 at 11:36 AM Zhijie Hou (Fujitsu) wrote: > > Attach the V4 patch which includes the optimization to skip the decoding if > the snapshot at the syncing restart_lsn is already serialized. It can avoid > most > of the duplicate decoding in my test, and I am doing some more tests locally. Thanks for the patch. I'm thinking if we can reduce the amount of work that we do for synced slots in each sync worker cycle. With that context in mind, why do we need to create decoding context every time? Can't we create it once, store it in an in-memory structure and use it for each sync worker cycle? Is there any problem with it? What do you think? -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
RE: Synchronizing slots from primary to standby
On Monday, April 1, 2024 7:30 PM Amit Kapila wrote: > > On Mon, Apr 1, 2024 at 6:26 AM Zhijie Hou (Fujitsu) > wrote: > > > > On Friday, March 29, 2024 2:50 PM Amit Kapila > wrote: > > > > > > > > > > > > > > 2. > > > +extern XLogRecPtr pg_logical_replication_slot_advance(XLogRecPtr > moveto, > > > + bool *found_consistent_point); > > > + > > > > > > This API looks a bit awkward as the functionality doesn't match the > > > name. How about having a function with name > > > LogicalSlotAdvanceAndCheckReadynessForDecoding(moveto, > > > ready_for_decoding) with the same functionality as your patch has > > > for > > > pg_logical_replication_slot_advance() and then invoke it both from > > > pg_logical_replication_slot_advance and slotsync.c. The function > > > name is too big, we can think of a shorter name. Any ideas? > > > > How about LogicalSlotAdvanceAndCheckDecodingState() Or just > > LogicalSlotAdvanceAndCheckDecoding()? > > > > It is about snapbuild state, so how about naming the function as > LogicalSlotAdvanceAndCheckSnapState()? It looks better to me, so changed. > > I have made quite a few cosmetic changes in comments and code. See > attached. This is atop your latest patch. Can you please review and include > these changes in the next version? Thanks, I have reviewed and merged them. Attach the V5 patch set which addressed above comments and ran pgindent. I will think and test the improvement suggested by Bertrand[1] and reply after that. [1] https://www.postgresql.org/message-id/Zgp8n9QD5nYSESnM%40ip-10-97-1-34.eu-west-3.compute.internal Best Regards, Hou zj v5-0001-advance-the-restart_lsn-of-synced-slots-using-log.patch Description: v5-0001-advance-the-restart_lsn-of-synced-slots-using-log.patch
Re: Why is parula failing?
David Rowley writes: > On Sat, 30 Mar 2024 at 09:17, Tom Lane wrote: >> ... in connection with which, I can't help noticing that parula >> is using a very old compiler: >> configure: using compiler=gcc (GCC) 7.3.1 20180712 (Red Hat 7.3.1-17) >> I wonder why parula is using that when its underlying system seems >> markedly newer (the kernel at least has a recent build date). > It could be, but wouldn't the bug have to relate to the locking code > and be caused by some other backend stomping on the memory? > Otherwise, shouldn't it be failing consistently every run rather than > sporadically? Your guess is as good as mine ... but we still haven't seen this class of failure on any other animal, so the idea that it's strictly a chance timing issue is getting weaker all the time. regards, tom lane
Confusing #if nesting in hmac_openssl.c
I noticed that buildfarm member batfish has been complaining like this for awhile:

hmac_openssl.c:90:1: warning: unused function 'ResourceOwnerRememberHMAC' [-Wunused-function]
hmac_openssl.c:95:1: warning: unused function 'ResourceOwnerForgetHMAC' [-Wunused-function]

Looking at the code, this is all from commit e6bdfd970, and apparently batfish is our only animal that doesn't HAVE_HMAC_CTX_NEW. I tried to understand the #if nesting and soon got very confused. I don't think it is helpful to put the resource owner manipulations inside #ifdef HAVE_HMAC_CTX_NEW and HAVE_HMAC_CTX_FREE --- probably, it would never be the case that only one of those is defined, but it just seems messy. What do you think of rearranging it as attached?

regards, tom lane

diff --git a/src/common/hmac_openssl.c b/src/common/hmac_openssl.c
index d10f7e5af7..4a62c8832f 100644
--- a/src/common/hmac_openssl.c
+++ b/src/common/hmac_openssl.c
@@ -41,6 +41,7 @@
  */
 #ifndef FRONTEND
 #ifdef HAVE_HMAC_CTX_NEW
+#define USE_RESOWNER_FOR_HMAC
 #define ALLOC(size) MemoryContextAlloc(TopMemoryContext, size)
 #else
 #define ALLOC(size) palloc(size)
@@ -67,13 +68,13 @@ struct pg_hmac_ctx
 	pg_hmac_errno error;
 	const char *errreason;
 
-#ifndef FRONTEND
+#ifdef USE_RESOWNER_FOR_HMAC
 	ResourceOwner resowner;
 #endif
 };
 
 /* ResourceOwner callbacks to hold HMAC contexts */
-#ifndef FRONTEND
+#ifdef USE_RESOWNER_FOR_HMAC
 static void ResOwnerReleaseHMAC(Datum res);
 
 static const ResourceOwnerDesc hmac_resowner_desc =
@@ -138,10 +139,12 @@ pg_hmac_create(pg_cryptohash_type type)
 	 * previous runs.
 	 */
 	ERR_clear_error();
-#ifdef HAVE_HMAC_CTX_NEW
-#ifndef FRONTEND
+
+#ifdef USE_RESOWNER_FOR_HMAC
 	ResourceOwnerEnlarge(CurrentResourceOwner);
 #endif
+
+#ifdef HAVE_HMAC_CTX_NEW
 	ctx->hmacctx = HMAC_CTX_new();
 #else
 	ctx->hmacctx = ALLOC(sizeof(HMAC_CTX));
@@ -159,14 +162,14 @@ pg_hmac_create(pg_cryptohash_type type)
 		return NULL;
 	}
 
-#ifdef HAVE_HMAC_CTX_NEW
-#ifndef FRONTEND
+#ifndef HAVE_HMAC_CTX_NEW
+	memset(ctx->hmacctx, 0, sizeof(HMAC_CTX));
+#endif							/* HAVE_HMAC_CTX_NEW */
+
+#ifdef USE_RESOWNER_FOR_HMAC
 	ctx->resowner = CurrentResourceOwner;
 	ResourceOwnerRememberHMAC(CurrentResourceOwner, ctx);
 #endif
-#else
-	memset(ctx->hmacctx, 0, sizeof(HMAC_CTX));
-#endif							/* HAVE_HMAC_CTX_NEW */
 
 	return ctx;
 }
@@ -327,15 +330,16 @@ pg_hmac_free(pg_hmac_ctx *ctx)
 
 #ifdef HAVE_HMAC_CTX_FREE
 	HMAC_CTX_free(ctx->hmacctx);
-#ifndef FRONTEND
-	if (ctx->resowner)
-		ResourceOwnerForgetHMAC(ctx->resowner, ctx);
-#endif
 #else
 	explicit_bzero(ctx->hmacctx, sizeof(HMAC_CTX));
 	FREE(ctx->hmacctx);
 #endif
 
+#ifdef USE_RESOWNER_FOR_HMAC
+	if (ctx->resowner)
+		ResourceOwnerForgetHMAC(ctx->resowner, ctx);
+#endif
+
 	explicit_bzero(ctx, sizeof(pg_hmac_ctx));
 	FREE(ctx);
 }
@@ -375,7 +379,7 @@ pg_hmac_error(pg_hmac_ctx *ctx)
 
 /* ResourceOwner callbacks */
 
-#ifndef FRONTEND
+#ifdef USE_RESOWNER_FOR_HMAC
 static void
 ResOwnerReleaseHMAC(Datum res)
 {
Re: Table AM Interface Enhancements
On Mon, Apr 1, 2024 at 7:36 PM Tom Lane wrote: > Coverity complained about what you did in RelationParseRelOptions > in c95c25f9a: > > *** CID 1595992: Null pointer dereferences (FORWARD_NULL) > /srv/coverity/git/pgsql-git/postgresql/src/backend/utils/cache/relcache.c: > 499 in RelationParseRelOptions() > 493 > 494 /* > 495 * Fetch reloptions from tuple; have to use a hardwired > descriptor because > 496 * we might not have any other for pg_class yet (consider > executing this > 497 * code for pg_class itself) > 498 */ > >>> CID 1595992: Null pointer dereferences (FORWARD_NULL) > >>> Passing null pointer "tableam" to "extractRelOptions", which > >>> dereferences it. > 499 options = extractRelOptions(tuple, GetPgClassDescriptor(), > 500 > tableam, amoptsfn); > 501 > > I see that extractRelOptions only uses the tableam argument for some > relkinds, and RelationParseRelOptions does set it up for those > relkinds --- but Coverity's complaint isn't without merit, because > those two switch statements are looking at *different copies of the > relkind*, which in theory could be different. This all seems quite > messy and poorly factored. Can't we do better? Why do we need to > involve two copies of allegedly the same pg_class tuple, anyhow? I wasn't registered at Coverity yet. Now I've registered and am waiting for approval to access the PostgreSQL analysis data. I wonder why Coverity complains about tableam, but not amoptsfn. Their usage patterns are very similar. It appears that relation->rd_rel isn't the full copy of pg_class tuple (see AllocateRelationDesc). RelationParseRelOptions() is going to update relation->rd_options, and thus needs a full pg_class tuple to fetch options out of it. However, it is really unnecessary to access both tuples at the same time. We can use a full tuple, not relation->rd_rel, in both cases. See the attached patch. -- Regards, Alexander Korotkov fix-RelationParseRelOptions-Coverity-complain.patch Description: Binary data
Re: On disable_cost
On Mon, Apr 1, 2024 at 5:00 PM Tom Lane wrote:

> Very interesting, thanks for the summary. So the fact that disable_cost is additive across plan nodes is actually a pretty important property of the current setup. I think this is closely related to one argument you made against my upthread idea of using IEEE Infinity for disable_cost: that'd mask whether more than one of the sub-plans had been disabled.

Yes, exactly. I just hadn't quite put the pieces together.

> Yeah, that seems like the next thing to try if anyone plans to pursue this further. That'd essentially do what we're doing now except that disable_cost is its own "order of infinity", entirely separate from normal costs.

Right. I think that's actually what I had in mind in the last paragraph of http://postgr.es/m/CA+TgmoY+Ltw7B=1FSFSN4yHcu2roWrz-ijBovj-99LZU=9h...@mail.gmail.com but that was a while ago and I'd lost track of why it actually mattered.

But I also have questions about whether that's really the right approach. I think the approach of just not generating paths we don't want in the first place merits more consideration. We do that in some cases already, but not in others, and I'm not clear why. Like, if index scans, index-only scans, sorts, nested loops, and hash joins are disabled, something is going to have to give, because the only remaining join type is a merge join yet we've ruled out every possible way of getting the data into some order, but I'm not sure whether there's some reason that we need exactly the behavior that we have right now rather than anything else. Maybe it would be OK to just insist on at least one unparameterized, non-partial path at the baserel level, and then if that ends up forcing us to ignore the join-type restrictions higher up, so be it. Or maybe that's not OK and after I try that out I'll end up writing another email about how I was a bit clueless about all of this. I don't know.
But I feel like it merits more investigation, because I'm having trouble shaking the theory that what we've got right now is pretty arbitrary. And also ... looking at the regression tests, and also thinking about the kinds of problems that I think people run into in real deployments, I can't help feeling like we've somehow got this whole thing backwards. enable_wunk imagines that you want to plan as normal except with one particular plan type excluded from consideration. And maybe that makes sense if the point of the enable_wunk GUC is that the planner feature might be buggy and you might therefore want to turn it off to protect yourself, or if the planner feature might be expensive and you might want to turn it off to save cycles. But surely that's not the case with something like enable_seqscan or enable_indexscan. What I think we're mostly doing in the regression tests is shutting off every relevant type of plan except one. I theorize that what we actually want to do is tell the planner what we do want to happen, rather than what we don't want to happen, but we've got this weird set of GUCs that do the opposite of that and we're super-attached to them because they've existed forever. I don't really have a concrete proposal here, but I wonder if what we're actually talking about here is spending time and energy polishing a mechanism that nobody likes in the first place. -- Robert Haas EDB: http://www.enterprisedb.com
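The additive property discussed in this exchange can be illustrated with a toy cost model (plain Python with invented node costs; nothing here is actual planner code):

```python
# Toy cost model (invented numbers, not planner code) showing why an
# additive disable_cost keeps "how many sub-plans were disabled"
# comparable between plans, while an IEEE-infinity penalty masks it.

import math

DISABLE_COST = 1.0e10   # per-disabled-node penalty, as in costsize.c today

def total_cost(node_costs, disabled_flags, penalty):
    return sum(cost + (penalty if disabled else 0.0)
               for cost, disabled in zip(node_costs, disabled_flags))

# Plan A has one disabled node, plan B has two.
plan_a = total_cost([100.0, 200.0], [True, False], DISABLE_COST)
plan_b = total_cost([100.0, 200.0], [True, True], DISABLE_COST)
assert plan_a < plan_b             # additive: fewer disabled nodes wins

inf_a = total_cost([100.0, 200.0], [True, False], math.inf)
inf_b = total_cost([100.0, 200.0], [True, True], math.inf)
assert inf_a == inf_b == math.inf  # infinity: the distinction is lost
```

This is the "order of infinity" point in miniature: a separate disabled-node counter compared before the numeric cost would preserve the same ordering without inflating the cost values themselves.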
Silence Meson warning on HEAD
A warning was recently[0] introduced into the Meson build:

WARNING: Project targets '>=0.54' but uses feature introduced in '0.55.0': Passing executable/found program object to script parameter of add_dist_script

There are 3 ways to solve the issue that I have laid out in 3 separate patches that you can pick at your leisure:

1. version-check.diff: Wrap the offending line in a Meson version check.
2. perl-string.diff: Pass the perl program as a string via its .path() method.
3. meson-bump.diff: Bump the minimum meson version from 0.54 to 0.55, at least.

I chose to do some analysis on option 3. The other 2 options are perfectly valid. I looked at all the systems in the buildfarm, and found the following results. The first column is the list of systems we test HEAD against in the buildfarm. The second column is the Meson version available to those systems, based on what I could find. The third column is whether or not we test Meson on that system.

System                                       Meson Version  Tested
AlmaLinux 8                                  0.58.2
AlmaLinux 9                                  0.63.3
Alpine Linux 3.19.1                          1.3.0
Amazon Linux 2                               0.55.1
Amazon Linux 2023                            0.62.2
ArchLinux                                    1.4.0
CentOS 7                                     0.55.1
CentOS 7.1                                   0.55.1
CentOS 7.4                                   0.55.1
CentOS 7.9.2009                              0.55.1
CentOS Stream 8                              0.58.2
Debian 10                                    0.56.1
Debian 11                                    0.56.2
Debian 11.5                                  0.56.2
Debian 12                                    1.0.1
Debian 7.0
Debian 8.11
Debian 9
Debian 9.3
Debian unstable                              1.4.0          x
DragonFly BSD 6.2.2                          0.63.2
Fedora 38                                    1.0.1
Fedora 39                                    1.3.2          x
Fedora 40                                    1.3.2
FreeBSD 12.2
FreeBSD 13.1                                 0.57.1
FreeBSD 14                                   1.4.0
Loongnix-Server 8.4.0
macOS 13.6
macOS 14.0
macOS 14.0 / MacPorts                        1.4.0
macOS 14.3                                   1.4.0          x
NetBSD 10.0                                  1.2.3
NetBSD 9.2                                   1.2.3
OmniOS / illumos r151038
OpenBSD 6.8
OpenBSD 6.9
OpenBSD 7.3
OpenBSD 7.4                                  1.2.1
OpenIndiana/illumos hipster rolling release  1.4.0
openSUSE 15.0                                0.46.0
openSUSE 15.3                                0.54.2
Oracle Linux 9                               0.63.3
Photon 2.0
Photon 3.0
Raspbian 7.8                                 0.56.2
Raspbian 8.0                                 1.0.1
Red Hat Enterprise Linux 7                   0.55.1
Red Hat Enterprise Linux 7.1                 0.55.1
Red Hat Enterprise Linux 7.9                 0.55.1
Red Hat Enterprise Linux 9.2                 0.63.3
Solaris 11.4.42 CBE                          0.59.2
SUSE Linux Enterprise Server 12 SP5
SUSE Linux Enterprise Server 15 SP2
Ubuntu 16.04                                 0.40.1
Ubuntu 16.04.3                               0.40.1
Ubuntu 18.04                                 0.45.1
Ubuntu 18.04.5
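A rough sketch of how the first two options might look in meson.build (the script name here is a placeholder, not the real one; the actual changes are in the attached diffs):

```meson
# Option 1 (version-check.diff), sketched: only pass the found-program
# object where Meson is new enough to accept it (feature added in
# 0.55.0), otherwise fall back to the string form.
if meson.version().version_compare('>=0.55')
  meson.add_dist_script(perl, 'placeholder_dist_script.pl')
else
  meson.add_dist_script(perl.path(), 'placeholder_dist_script.pl')
endif

# Option 2 (perl-string.diff), sketched: always pass perl as a plain
# string via its .path() method, which 0.54 already accepts.
meson.add_dist_script(perl.path(), 'placeholder_dist_script.pl')
```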
Re: Why is parula failing?
On Sat, 30 Mar 2024 at 09:17, Tom Lane wrote: > > I wrote: > > I'd not looked closely enough at the previous failure, because > > now that I have, this is well out in WTFF territory: how can > > reltuples be greater than zero when relpages is zero? This can't > > be a state that autovacuum would have left behind, unless it's > > really seriously broken. I think we need to be looking for > > explanations like "memory stomp" or "compiler bug". > > ... in connection with which, I can't help noticing that parula > is using a very old compiler: > > configure: using compiler=gcc (GCC) 7.3.1 20180712 (Red Hat 7.3.1-17) > > From some quick checking around, that would have to be near the > beginning of aarch64 support in RHEL (Fedora hadn't promoted aarch64 > to a primary architecture until earlier that same year). It's not > exactly hard to believe that there were some lingering compiler bugs. > I wonder why parula is using that when its underlying system seems > markedly newer (the kernel at least has a recent build date). It could be, but wouldn't the bug have to relate to the locking code and be caused by some other backend stomping on the memory? Otherwise, shouldn't it be failing consistently every run rather than sporadically? David
Re: Crash on UNION with PG 17
On Thu, 28 Mar 2024 at 04:34, Regina Obe wrote: > CREATE TABLE edge_data AS > SELECT i AS edge_id, i + 1 AS start_node, i + 2 As end_node > FROM generate_series(1,10) AS i; > > WITH edge AS ( > SELECT start_node, end_node > FROM edge_data > WHERE edge_id = 1 > ) > SELECT start_node id FROM edge UNION > SELECT end_node FROM edge; As of d5d2205c8, this query should work as expected. Thank you for letting us know about this. David
Re: Properly pathify the union planner
On Fri, 29 Mar 2024 at 08:53, Tom Lane wrote: > On third thought, I'm not at all convinced that we even want to > invent this struct as compared to just adding another parameter > to subquery_planner. The problem with a struct is what happens > the next time we need to add a parameter? If we add yet another > function parameter, we can count on the compiler to complain > about call sites that didn't get the memo. Adding a field > within an existing struct provokes no such warning, leading > to situations with uninitialized fields that might accidentally > work during testing, but fail the minute they get to the field. I agree it's best to break callers that don't update their code to consider passing or not passing a SetOperationStmt. I've just committed a fix to do it that way. This also seems to be the path of least resistance, which also appeals. I opted to add a new test alongside the existing tests which validate set operations with an empty SELECT list work. The new tests include the variation that the set operation has both a materialized and non-materialized CTE as a child. This was only a problem with a materialized CTE, but I opted to include a non-materialized one as I don't expect that we'll get this exact problem again. I was just keen on getting more coverage with a couple of cheap tests. Thanks for your input on this. I'll review your other comments shortly. David
Re: Reports on obsolete Postgres versions
On Wed, Mar 13, 2024 at 02:04:16PM -0400, Robert Treat wrote:

> I tend to agree with Bruce, and major/minor seems to be the more common usage within the industry; iirc, debian, ubuntu, gnome, suse, and mariadb all use that nomenclature; and ISTR some distros who release packaged versions of postgres with custom patches applied (ie 12.4-2 for postgres 12.4 patchlevel 2).
>
> BTW, as a reminder, we do have this statement, in bold, in the "upgrading" section of the versioning page: "We always recommend that all users run the latest available minor release for whatever major version is in use." There are actually several other phrases and wordings on that page that could probably be propagated as replacement language in some of these other areas.

I ended up writing the attached doc patch. I found that some of our text was overly wordy, causing the impact of what we were trying to say to be lessened. We might want to go farther than this patch, but I think it is an improvement.

I also moved the text to the bottom of the section --- previously, our wording referenced minor releases, then we talked about major releases, and then minor releases. This gives a more natural flow.

-- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Only you can decide what is important to you.

diff --git a/templates/support/versioning.html b/templates/support/versioning.html index d48e11e0..cee06954 100644 --- a/templates/support/versioning.html +++ b/templates/support/versioning.html @@ -45,15 +45,8 @@ number, e.g. 9.5.3 to 9.5.4. Upgrading - -We always recommend that all users run the latest available minor -release for whatever major version is in use. - - - - -Major versions usually change the internal format of system tables and data -files. These changes are often complex, so we do not maintain backward +Major versions usually change the internal format of the system tables.
+These changes are often complex, so we do not maintain backward compatibility of all stored data. A dump/reload of the database or use of the pg_upgrade module is required for major upgrades. We also recommend reading the @@ -65,18 +58,25 @@ versions prior to doing so. -Upgrading to a minor release does not normally require a dump and restore; you -can stop the database server, install the updated binaries, and restart the -server. For some releases, manual changes may be required to complete the -upgrade, so always read the release notes before upgrading. +Minor release upgrades do not require a dump and restore; you simply stop +the database server, install the updated binaries, and restart the server. +Such upgrades might require manual changes to complete so always read +the release notes first. -While upgrading will always contain some level of risk, PostgreSQL minor releases -fix only frequently-encountered bugs, security -issues, and data corruption problems to reduce the risk associated with -upgrading. For minor releases, the community considers not upgrading to be -riskier than upgrading. +Minor releases only fix frequently-encountered bugs, security issues, and data corruption +problems, so such upgrades are very low risk. The community +considers performing minor upgrades to be less risky than continuing to +run superseded minor versions. + + + + +We recommend that users always run the latest minor release associated +with their major version. + Releases
Re: Popcount optimization using AVX512
On Tue, Apr 02, 2024 at 01:09:57AM +0300, Ants Aasma wrote: > On Tue, 2 Apr 2024 at 00:31, Nathan Bossart wrote: >> On Tue, Apr 02, 2024 at 12:11:59AM +0300, Ants Aasma wrote: >> > What about using the masking capabilities of AVX-512 to handle the >> > tail in the same code path? Masked out portions of a load instruction >> > will not generate an exception. To allow byte level granularity >> > masking, -mavx512bw is needed. Based on wikipedia this will only >> > disable this fast path on Knights Mill (Xeon Phi), in all other cases >> > VPOPCNTQ implies availability of BW. >> >> Sounds promising. IMHO we should really be sure that these kinds of loads >> won't generate segfaults and the like due to the masked-out portions. I >> searched around a little bit but haven't found anything that seemed >> definitive. > > Interestingly the Intel software developer manual is not exactly > crystal clear on how memory faults with masks work, but volume 2A > chapter 2.8 [1] does specify that MOVDQU8 is of exception class E4.nb > that supports memory fault suppression on page fault. Perhaps Paul or Akash could chime in here... -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Popcount optimization using AVX512
Here is a v19 of the patch set. I moved out the refactoring of the function pointer selection code to 0001. I think this is a good change independent of $SUBJECT, and I plan to commit this soon. In 0002, I changed the syslogger.c usage of pg_popcount() to use pg_number_of_ones instead. This is standard practice elsewhere where the popcount functions are unlikely to win. I'll probably commit this one soon, too, as it's even more trivial than 0001. 0003 is the AVX512 POPCNT patch. Besides refactoring out 0001, there are no changes from v18. 0004 is an early proof-of-concept for using AVX512 for the visibility map code. The code is missing comments, and I haven't performed any benchmarking yet, but I figured I'd post it because it demonstrates how it's possible to build upon 0003 in other areas. AFAICT the main open question is the function call overhead in 0003 that Alvaro brought up earlier. After 0002 is committed, I believe the only in-tree caller of pg_popcount() with very few bytes is bit_count(), and I'm not sure it's worth expending too much energy to make sure there are absolutely no regressions there. However, I'm happy to do so if folks feel that it is necessary, and I'd be grateful for thoughts on how to proceed on this one. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From cedad23b7b35e77fde164b1d577c37fb07a578c6 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Mon, 1 Apr 2024 16:37:53 -0500 Subject: [PATCH v19 1/4] refactor popcount function choosing --- src/port/pg_bitutils.c | 37 + 1 file changed, 9 insertions(+), 28 deletions(-) diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c index 1197696e97..28312f3dd9 100644 --- a/src/port/pg_bitutils.c +++ b/src/port/pg_bitutils.c @@ -148,8 +148,8 @@ pg_popcount_available(void) * the function pointers so that subsequent calls are routed directly to * the chosen implementation. 
*/ -static int -pg_popcount32_choose(uint32 word) +static inline void +choose_popcount_functions(void) { if (pg_popcount_available()) { @@ -163,45 +163,26 @@ pg_popcount32_choose(uint32 word) pg_popcount64 = pg_popcount64_slow; pg_popcount = pg_popcount_slow; } +} +static int +pg_popcount32_choose(uint32 word) +{ + choose_popcount_functions(); return pg_popcount32(word); } static int pg_popcount64_choose(uint64 word) { - if (pg_popcount_available()) - { - pg_popcount32 = pg_popcount32_fast; - pg_popcount64 = pg_popcount64_fast; - pg_popcount = pg_popcount_fast; - } - else - { - pg_popcount32 = pg_popcount32_slow; - pg_popcount64 = pg_popcount64_slow; - pg_popcount = pg_popcount_slow; - } - + choose_popcount_functions(); return pg_popcount64(word); } static uint64 pg_popcount_choose(const char *buf, int bytes) { - if (pg_popcount_available()) - { - pg_popcount32 = pg_popcount32_fast; - pg_popcount64 = pg_popcount64_fast; - pg_popcount = pg_popcount_fast; - } - else - { - pg_popcount32 = pg_popcount32_slow; - pg_popcount64 = pg_popcount64_slow; - pg_popcount = pg_popcount_slow; - } - + choose_popcount_functions(); return pg_popcount(buf, bytes); } -- 2.25.1 >From 038b74045b006c5d8a5470364f2041370ec0b083 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Mon, 1 Apr 2024 16:47:22 -0500 Subject: [PATCH v19 2/4] use pg_number_of_ones instead of pg_popcount for single byte --- src/backend/postmaster/syslogger.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/backend/postmaster/syslogger.c b/src/backend/postmaster/syslogger.c index 08efe74cc9..437947dbb9 100644 --- a/src/backend/postmaster/syslogger.c +++ b/src/backend/postmaster/syslogger.c @@ -898,7 +898,7 @@ process_pipe_input(char *logbuffer, int *bytes_in_logbuffer) if (p.nuls[0] == '\0' && p.nuls[1] == '\0' && p.len > 0 && p.len <= PIPE_MAX_PAYLOAD && p.pid != 0 && - pg_popcount((char *) &dest_flags, 1) == 1) + pg_number_of_ones[dest_flags] == 1) { List *buffer_list; ListCell *cell; -- 2.25.1 >From
73ee8d6018b047856e63ad075641a0dcfe889417 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Wed, 27 Mar 2024 16:39:24 -0500 Subject: [PATCH v19 3/4] AVX512 popcount support --- config/c-compiler.m4 | 58 ++ configure| 252 +++ configure.ac | 51 ++ meson.build | 87 + src/Makefile.global.in | 5 + src/include/pg_config.h.in | 12 ++ src/include/port/pg_bitutils.h | 15 ++ src/makefiles/meson.build| 4 +- src/port/Makefile| 11 ++ src/port/meson.build | 6 +- src/port/pg_bitutils.c | 7 +- src/port/pg_popcount_avx512.c| 49 ++ src/port/pg_popcount_avx512_choose.c | 71 src/test/regress/expected/bit.out| 24 +++ src/test/regress/sql/bit.sql | 4 + 15 files changed,
Re: Popcount optimization using AVX512
On Tue, 2 Apr 2024 at 00:31, Nathan Bossart wrote: > > On Tue, Apr 02, 2024 at 12:11:59AM +0300, Ants Aasma wrote: > > What about using the masking capabilities of AVX-512 to handle the > > tail in the same code path? Masked out portions of a load instruction > > will not generate an exception. To allow byte level granularity > > masking, -mavx512bw is needed. Based on wikipedia this will only > > disable this fast path on Knights Mill (Xeon Phi), in all other cases > > VPOPCNTQ implies availability of BW. > > Sounds promising. IMHO we should really be sure that these kinds of loads > won't generate segfaults and the like due to the masked-out portions. I > searched around a little bit but haven't found anything that seemed > definitive. Interestingly the Intel software developer manual is not exactly crystal clear on how memory faults with masks work, but volume 2A chapter 2.8 [1] does specify that MOVDQU8 is of exception class E4.nb that supports memory fault suppression on page fault. Regards, Ants Aasma [1] https://cdrdv2-public.intel.com/819712/253666-sdm-vol-2a.pdf
Re: [PoC] Federated Authn/z with OAUTHBEARER
On Thu, Mar 28, 2024 at 3:34 PM Daniel Gustafsson wrote:

> In jsonapi.c, makeJsonLexContextCstringLen initializes a JsonLexContext with palloc0 which would need to be ported over to use ALLOC for frontend code.

Seems reasonable (but see below, too).

> On that note, the errorhandling in parse_oauth_json() for content-type checks attempts to free the JsonLexContext even before it has been created. Here we can just return false.

Agreed. They're zero-initialized, so freeJsonLexContext() is safe IIUC, but it's clearer not to call the free function at all.

But for these additions:

> - makeJsonLexContextCstringLen(, resp->data, resp->len, PG_UTF8, true);
> + if (!makeJsonLexContextCstringLen(, resp->data, resp->len, PG_UTF8, true))
> + {
> + actx_error(actx, "out of memory");
> + return false;
> + }

...since we're using the stack-based API as opposed to the heap-based API, they shouldn't be possible to hit. Any failures in createStrVal() are deferred to parse time on purpose.

> - echo 'libpq must not be calling any function which invokes exit'; exit 1; \
> + echo 'libpq must not be calling any function which invokes exit'; \
> The offending codepath in libcurl was in the NTLM_WB module, a very old and obscure form of NTLM support which was replaced (yet remained in the tree) a long time ago by a full NTLM implementation. Based on the findings in this thread it was deprecated with a removal date set to April 2024 [0]. A bug in the 8.4.0 release however disconnected NTLM_WB from the build and given the lack of complaints it was decided to leave as is, so we can base our libcurl requirements on 8.4.0 while keeping the exit() check intact.

Of the Cirrus machines, it looks like only FreeBSD has a new enough libcurl for that. Ubuntu won't until 24.04, Debian Bookworm doesn't have it unless you're running backports, RHEL 9 is still on 7.x... I think requiring libcurl 8 is effectively saying no one will be able to use this for a long time.
Is there an alternative? > + else if (strcasecmp(content_type, "application/json") != 0) > This needs to handle parameters as well since it will now fail if the charset > parameter is appended (which undoubtedly will be pretty common). The easiest > way is probably to just verify the mediatype and skip the parameters since we > know it can only be charset? Good catch. application/json no longer defines charsets officially [1], so I think we should be able to just ignore them. The new strncasecmp needs to handle a spurious prefix, too; I have that on my TODO list. > + /* TODO investigate using conn->Pfdebug and CURLOPT_DEBUGFUNCTION here */ > + CHECK_SETOPT(actx, CURLOPT_VERBOSE, 1L, return false); > + CHECK_SETOPT(actx, CURLOPT_ERRORBUFFER, actx->curl_err, return false); > CURLOPT_ERRORBUFFER is the old and finicky way of extracting error messages, > we > should absolutely move to using CURLOPT_DEBUGFUNCTION instead. This new way doesn't do the same thing. Here's a sample error: connection to server at "127.0.0.1", port 56619 failed: failed to fetch OpenID discovery document: Weird server reply ( Trying 127.0.0.1:36647... Connected to localhost (127.0.0.1) port 36647 (#0) Mark bundle as not supporting multiuse HTTP 1.0, assume close after body Invalid Content-Length: value Closing connection 0 ) IMO that's too much noise. Prior to the change, the same error would have been connection to server at "127.0.0.1", port 56619 failed: failed to fetch OpenID discovery document: Weird server reply (Invalid Content-Length: value) The error buffer is finicky for sure, but it's also a generic one-line explanation of what went wrong... Is there an alternative API for that I'm missing? > + /* && response_code != 401 TODO */ ) > Why is this marked with a TODO, do you remember? 
Yeah -- I have a feeling that 401s coming back are going to need more helpful hints to the user, since it implies that libpq itself hasn't authenticated correctly as opposed to some user-related auth failure. I was hoping to find some sample behaviors in the wild and record those into the suite. > + print("# OAuth provider (PID $pid) is listening on port $port\n"); > Code running under Test::More need to use diag() for printing non-test output > like this. Ah, thanks. > +#if LIBCURL_VERSION_MAJOR <= 8 && LIBCURL_VERSION_MINOR < 4 I don't think this catches versions like 7.76, does it? Maybe `LIBCURL_VERSION_MAJOR < 8 || (LIBCURL_VERSION_MAJOR == 8 && LIBCURL_VERSION_MINOR < 4)`, or else `LIBCURL_VERSION_NUM < 0x080400`? > my $pid = open(my $read_fh, "-|", $ENV{PYTHON}, "t/oauth_server.py") > - // die "failed to start OAuth server: $!"; > + or die "failed to start OAuth server: $!"; > > - read($read_fh, $port, 7) // die "failed to read port number: $!"; > + read($read_fh, $port, 7) or die "failed to read port number: $!"; The first hunk here looks good (thanks for the catch!) but I think the second
Re: Security lessons from liblzma
On 2024-03-31 Su 17:12, Andres Freund wrote: Hi, On 2024-03-31 12:18:29 +0200, Michael Banck wrote: If you ask where they are maintained, the answer is here: https://salsa.debian.org/postgresql/postgresql/-/tree/17/debian/patches?ref_type=heads the other major versions have their own branch. Luckily these are all quite small, leaving little space to hide stuff. I'd still like to get rid of at least some of them. I've previously proposed a patch to make pkglibdir configurable, I think we should just go for that. For the various defines, ISTM we should just make them #ifndef guarded, then they could be overridden by defining them at configure time. Some of them, like DEFAULT_PGSOCKET_DIR seem to be overridden by just about every distro. And others would be nice to easily override anyway, I e.g. dislike the default DEFAULT_PAGER value. +1 to this proposal. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Re: Combine Prune and Freeze records emitted by vacuum
On 01/04/2024 20:22, Melanie Plageman wrote:

From 17e183835a968e81daf7b74a4164b243e2de35aa Mon Sep 17 00:00:00 2001 From: Melanie Plageman Date: Fri, 29 Mar 2024 19:43:09 -0400 Subject: [PATCH v11 3/7] Introduce PRUNE_DO_* actions

We will eventually take additional actions in heap_page_prune() at the discretion of the caller. For now, introduce these PRUNE_DO_* macros and turn mark_unused_now, a parameter to heap_page_prune(), into a PRUNE_DO_ action.

--- src/backend/access/heap/pruneheap.c | 51 ++-- src/backend/access/heap/vacuumlazy.c | 11 -- src/include/access/heapam.h | 13 ++- 3 files changed, 46 insertions(+), 29 deletions(-)

diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c index fb0ad834f1b..30965c3c5a1 100644 --- a/src/backend/access/heap/pruneheap.c +++ b/src/backend/access/heap/pruneheap.c @@ -29,10 +29,11 @@ /* Working data for heap_page_prune and subroutines */ typedef struct { + /* PRUNE_DO_* arguments */ + uint8 actions;

I wasn't sure if actions is a good name. What do you think?

Committed this part, with the name 'options'. There's some precedent for that in heap_insert(). I decided to keep it a separate bool field here in the PruneState struct, though, and only changed it in the heap_page_prune() function signature. It didn't feel worth the code churn here, and 'prstate.mark_unused_now' is shorter than "(prstate.options & HEAP_PRUNE_PAGE_MARK_UNUSED_NOW) != 0" anyway.

-- Heikki Linnakangas Neon (https://neon.tech)
Re: Statistics Import and Export
Corey Huinker writes: > A boolean is what we had before, I'm quite comfortable with that, and it > addresses my silent-failure concerns. WFM. regards, tom lane
Re: pg_combinebackup --copy-file-range
On Tue, Apr 2, 2024 at 8:43 AM Tomas Vondra wrote:

> And I think he's right, and my tests confirm this. I did a trivial patch to align the blocks to 8K boundary, by forcing the header to be a multiple of 8K (I think 4K alignment would be enough). See the 0001 patch that does this.
>
> And if I measure the disk space used by pg_combinebackup, and compare the results with results without the patch ([1] from a couple days back), I see this:
>
>    pct    not aligned    aligned
>    -----------------------------
>     1%           689M        19M
>    10%          3172M        22M
>    20%         13797M        27M
>
> Yes, those numbers are correct. I didn't believe this at first, but the backups are valid/verified, checksums are OK, etc. BTRFS has similar numbers (e.g. drop from 20GB to 600MB).

Fantastic.

> I think we absolutely need to align the blocks in the incremental files, and I think we should do that now. I think 8K would work, but maybe we should add alignment parameter to basebackup & manifest?
>
> The reason why I think maybe this should be a basebackup parameter is the recent discussion about large fs blocks - it seems to be in the works, so maybe better to be ready and not assume all fs have 4K.
>
> And I think we probably want to do this now, because this affects all tools dealing with incremental backups - even if someone writes a custom version of pg_combinebackup, it will have to deal with misaligned data. Perhaps there might be something like pg_basebackup that "transforms" the data received from the server (and also the backup manifest), but that does not seem like a great direction.

+1, and I think BLCKSZ is the right choice.

> I was very puzzled by the awful performance on ZFS. When every other fs (EXT4/XFS/BTRFS) took 150-200 seconds to run pg_combinebackup, it took 900-1000 seconds on ZFS, no matter what I did. I tried all the tuning advice I could think of, with almost no effect.
>
> Ultimately I decided that it probably is the "no readahead" behavior I've observed on ZFS.
> I assume it's because it doesn't use the page cache where the regular readahead is detected etc. And there's no prefetching in pg_combinebackup, so I decided to do an experiment and added a trivial explicit prefetch when reconstructing the file - every time we'd read data from a file, we do posix_fadvise for up to 128 blocks ahead (similar to what bitmap heap scan code does). See 0002.
>
> And tadaaa - the duration dropped from 900-1000 seconds to only about 250-300 seconds, so an improvement of a factor of 3-4x. I think this is pretty massive.

Interesting. ZFS certainly has its own prefetching heuristics with lots of logic and settings, but it could be that it's using strict-next-block detection of access pattern (ie what I called effective_io_readahead_window=0 in the streaming I/O thread) instead of a window (ie like the Linux block device level read ahead where, AFAIK, if you access anything in that sliding window it is triggered), and perhaps your test has a lot of non-contiguous but close-enough blocks? (Also reminds me of the similar discussion on the BHS thread about distinguishing sequential access from mostly-sequential-but-with-lots-of-holes-like-Swiss-cheese, and the fine line between them.)

You could double-check this and related settings (for example I think it might disable itself automatically if you're on a VM with small RAM size): https://openzfs.github.io/openzfs-docs/Performance%20and%20Tuning/Module%20Parameters.html#zfs-prefetch-disable

> There's a couple more interesting ZFS details - the prefetching seems to be necessary even when using copy_file_range() and don't need to read the data (to calculate checksums). This is why the "manifest=off" chart has the strange group of high bars at the end - the copy cases are fast because prefetch happens, but if we switch to copy_file_range() there are no prefetches and it gets slow.
Hmm, at a guess, it might be due to prefetching the dnode (root object for a file) and block pointers, ie the structure but not the data itself. > This is a bit bizarre, especially because the manifest=on cases are > still fast, exactly because the pread + prefetching still happens. I'm > sure users would find this puzzling. > > Unfortunately, the prefetching is not beneficial for all filesystems. > For XFS it does not seem to make any difference, but on BTRFS it seems > to cause a regression. > > I think this means we may need a "--prefetch" option, that'd force > prefetching, probably both before pread and copy_file_range. Otherwise > people on ZFS are doomed and will have poor performance. Seems reasonable if you can't fix it by tuning ZFS. (Might also be an interesting research topic for a potential ZFS patch: prefetch_swiss_cheese_window_size. I will not be nerd-sniped into reading the relevant source today, but I'll figure it out soonish...) > So I took a stab
Re: Popcount optimization using AVX512
On Tue, Apr 02, 2024 at 12:11:59AM +0300, Ants Aasma wrote: > What about using the masking capabilities of AVX-512 to handle the > tail in the same code path? Masked out portions of a load instruction > will not generate an exception. To allow byte level granularity > masking, -mavx512bw is needed. Based on wikipedia this will only > disable this fast path on Knights Mill (Xeon Phi), in all other cases > VPOPCNTQ implies availability of BW. Sounds promising. IMHO we should really be sure that these kinds of loads won't generate segfaults and the like due to the masked-out portions. I searched around a little bit but haven't found anything that seemed definitive. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Psql meta-command conninfo+
> Here I considered your suggestion (Sami and Álvaro's). However, I haven't yet > added the links for the functions system_user(), current_user(), and > session_user(). > I'm not sure how to do it. Any suggestion on how to create/add the link? Here is an example [1] where the session information functions are referenced. The public doc for this example is in [2]. Something similar can be done here. For example: The session user's name; see the session_user() function for more details. [1] https://github.com/postgres/postgres/blob/master/doc/src/sgml/system-views.sgml#L1663-L1664 [2] https://www.postgresql.org/docs/16/view-pg-locks.html Regards, Sami
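For what it's worth, the usual way to create such a link in the SGML docs is inline <function> markup plus an <xref> pointing at the section or table id. A sketch of what the sentence could look like follows; the linkend id here is from memory and should be verified against doc/src/sgml/func.sgml before use:

```sgml
The session user's name; see the <function>session_user()</function>
function in <xref linkend="functions-info-session-table"/> for more
details.
```

The <xref> renders as a numbered table/section reference, so the docs stay correct even if the target table is renumbered.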
Re: Statistics Import and Export
> > If we are envisioning that the function might emit multiple warnings > per call, a useful definition could be to return the number of > warnings (so zero is good, not-zero is bad). But I'm not sure that's > really better than a boolean result. pg_dump/pg_restore won't notice > anyway, but perhaps other programs using these functions would care. > A boolean is what we had before, I'm quite comfortable with that, and it addresses my silent-failure concerns.
Re: Popcount optimization using AVX512
On Mon, 1 Apr 2024 at 18:53, Nathan Bossart wrote: > > On Mon, Apr 01, 2024 at 01:06:12PM +0200, Alvaro Herrera wrote: > > On 2024-Mar-31, Nathan Bossart wrote: > >> +popcnt = _mm512_reduce_add_epi64(accum); > >> +return popcnt + pg_popcount_fast(buf, bytes); > > > > Hmm, doesn't this arrangement cause an extra function call to > > pg_popcount_fast to be used here? Given the level of micro-optimization > > being used by this code, I would have thought that you'd have tried to > > avoid that. (At least, maybe avoid the call if bytes is 0, no?) > > Yes, it does. I did another benchmark on very small arrays and can see the > overhead. This is the time in milliseconds to run pg_popcount() on an > array 1 billion times:
>
>   size (bytes)    HEAD        AVX512-POPCNT
>   1               1707.685    3480.424
>   2               1926.694    4606.182
>   4               3210.412    5284.506
>   8               1920.703    3640.968
>   16              2936.91     4045.586
>   32              3627.956    5538.418
>   64              5347.213    3748.212
>
> I suspect that anything below 64 bytes will see this regression, as that is > the earliest point where there are enough bytes for ZMM registers. What about using the masking capabilities of AVX-512 to handle the tail in the same code path? Masked out portions of a load instruction will not generate an exception. To allow byte level granularity masking, -mavx512bw is needed. Based on wikipedia this will only disable this fast path on Knights Mill (Xeon Phi), in all other cases VPOPCNTQ implies availability of BW. Attached is an example of what I mean. I did not have a machine to test it with, but the code generated looks sane. I added the clang pragma because it insisted on unrolling otherwise and based on how the instruction dependencies look that is probably not too helpful even for large cases (needs to be tested). The configure check and compile flags of course need to be amended for BW.
Regards, Ants Aasma

diff --git a/src/port/pg_popcount_avx512.c b/src/port/pg_popcount_avx512.c
index f86558d1ee5..7fb2ada16c9 100644
--- a/src/port/pg_popcount_avx512.c
+++ b/src/port/pg_popcount_avx512.c
@@ -30,20 +30,27 @@
 uint64
 pg_popcount_avx512(const char *buf, int bytes)
 {
-	uint64		popcnt;
+	__m512i		val, cnt;
+	__mmask64	remaining_mask;
 	__m512i		accum = _mm512_setzero_si512();
 
-	for (; bytes >= sizeof(__m512i); bytes -= sizeof(__m512i))
+	#pragma clang loop unroll(disable)
+	for (; bytes > sizeof(__m512i); bytes -= sizeof(__m512i))
 	{
-		const __m512i val = _mm512_loadu_si512((const __m512i *) buf);
-		const __m512i cnt = _mm512_popcnt_epi64(val);
+		val = _mm512_loadu_si512((const __m512i *) buf);
+		cnt = _mm512_popcnt_epi64(val);
 		accum = _mm512_add_epi64(accum, cnt);
 		buf += sizeof(__m512i);
 	}
 
-	popcnt = _mm512_reduce_add_epi64(accum);
-	return popcnt + pg_popcount_fast(buf, bytes);
+	remaining_mask = ~0ULL >> (sizeof(__m512i) - bytes);
+	val = _mm512_maskz_loadu_epi8(remaining_mask, (const __m512i *) buf);
+	cnt = _mm512_popcnt_epi64(val);
+
+	accum = _mm512_add_epi64(accum, cnt);
+
+	return _mm512_reduce_add_epi64(accum);
 }
 
 #endif							/* TRY_POPCNT_FAST */
Re: Statistics Import and Export
Corey Huinker writes: > Any thoughts about going back to having a return value, a caller could then > see that the function returned NULL rather than whatever the expected value > was (example: TRUE)? If we are envisioning that the function might emit multiple warnings per call, a useful definition could be to return the number of warnings (so zero is good, not-zero is bad). But I'm not sure that's really better than a boolean result. pg_dump/pg_restore won't notice anyway, but perhaps other programs using these functions would care. regards, tom lane
Re: Security lessons from liblzma
Bruce Momjian writes: > On Mon, Apr 1, 2024 at 03:17:55PM -0400, Tom Lane wrote: >> AFAIK, every open-source distro makes all the pieces needed to >> rebuild their packages available to users. It wouldn't be much >> of an open-source situation otherwise. You do have to learn >> their package build process. > I wasn't clear if all the projects provide a source tree that can be > verified against the project's source tree, and then independent > patches, or if the patches were integrated and therefore harder to > verify against the project source tree. In the systems I'm familiar with, an SRPM-or-equivalent includes the pristine upstream tarball and then some patch files to apply to it. The patch files have to be maintained anyway, and if you don't ship them then you're not shipping "source". regards, tom lane
Re: On disable_cost
Robert Haas writes: > One of the things I realized relatively early is that the patch does > nothing to propagate disable_cost upward through the plan tree. > ... > After straining my brain over various plan changes for a long time, > and hacking on the code somewhat, I realized that just propagating the > Boolean upward is insufficient to set things right. That's basically > because I was being dumb when I said this: >> I don't think we should care how MANY disabled nodes appear in a >> plan, particularly. Very interesting, thanks for the summary. So the fact that disable_cost is additive across plan nodes is actually a pretty important property of the current setup. I think this is closely related to one argument you made against my upthread idea of using IEEE Infinity for disable_cost: that'd mask whether more than one of the sub-plans had been disabled. > ... And while there's probably more than one way > to make it work, the easiest thing seems to be to just have a > disabled-counter in every node that gets initialized to the total > disabled-counter values of all of its children, and then you add 1 if > that node is itself doing something that is disabled, i.e. the exact > opposite of what I said in the quote above. Yeah, that seems like the next thing to try if anyone plans to pursue this further. That'd essentially do what we're doing now except that disable_cost is its own "order of infinity", entirely separate from normal costs. regards, tom lane
Re: Security lessons from liblzma
On Mon, Apr 1, 2024 at 03:17:55PM -0400, Tom Lane wrote: > Bruce Momjian writes: > > I was more asking if users have access to patches so they could recreate > > the build by using the Postgres git tree and supplied OS-specific > > patches. > > AFAIK, every open-source distro makes all the pieces needed to > rebuild their packages available to users. It wouldn't be much > of an open-source situation otherwise. You do have to learn > their package build process. I wasn't clear if all the projects provide a source tree that can be verified against the project's source tree, and then independent patches, or if the patches were integrated and therefore harder to verify against the project source tree. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Only you can decide what is important to you.
Re: Security lessons from liblzma
On Fri, Mar 29, 2024 at 06:37:24PM -0400, Bruce Momjian wrote: > You might have seen reports today about a very complex exploit added to > recent versions of liblzma. Fortunately, it was only enabled two months > ago and has not been pushed to most stable operating systems like Debian > and Ubuntu. The original detection report is: > > https://www.openwall.com/lists/oss-security/2024/03/29/4 I was watching this video about the exploit: https://www.youtube.com/watch?v=bS9em7Bg0iU and at 2:29, they mention "hero software developer", our own Andres Freund as the person who discovered the exploit. I noticed the author's name at the openwall email link above, but I assumed it was someone else with the same name. They mentioned it was found while researching Postgres performance, and then I noticed the email address matched! I thought the analogy he uses at the end of the video is very clear. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Only you can decide what is important to you.
Re: On disable_cost
On Tue, Mar 12, 2024 at 1:32 PM Tom Lane wrote: > > 1. You stated that it changes lots of plans in the regression tests, > > but you haven't provided any sort of analysis of why those plans > > changed. I'm kind of surprised that there would be "tons" of plan > > changes. You (or someone) should look into why that's happening. > > I've not read the patch, but given this description I would expect > there to be *zero* regression changes --- I don't think we have any > test cases that depend on disable_cost being finite. If there's more > than zero changes, that very likely indicates a bug in the patch. > Even if there are places where the output legitimately changes, you > need to justify each one and make sure that you didn't invalidate the > intent of that test case. I spent some more time poking at this patch. It's missing a ton of important stuff and is wrong in a whole bunch of really serious ways, and I'm not going to try to mention all of them in this email. But I do want to talk about some of the more interesting realizations that came to me as I was working my way through this. One of the things I realized relatively early is that the patch does nothing to propagate disable_cost upward through the plan tree. That means that if you have a choice between, say, Sort-over-Append-over-SeqScan and MergeAppend-over-IndexScan, as we do in the regression tests, disabling IndexScan doesn't change the plan with the patch applied, as it does in master. That's because only the IndexScan node ends up flagged as disabled. Once we start stacking other plan nodes on top of disabled plan nodes, the resultant plans are at no disadvantage compared to plans containing no disabled nodes. The IndexScan plan survives initially, despite being disabled, because it's got a sort order. That lets us use it to build a MergeAppend path, and that MergeAppend path is not disabled, and compares favorably on cost. 
After straining my brain over various plan changes for a long time, and hacking on the code somewhat, I realized that just propagating the Boolean upward is insufficient to set things right. That's basically because I was being dumb when I said this: > I don't think we should care how MANY disabled nodes appear in a > plan, particularly. Suppose we try to plan a Nested Loop with SeqScan disabled, but there's no alternative to a SeqScan for the outer side of the join. If we suppose an upward-propagating Boolean, every path for the join is disabled because every path for the outer side is disabled. That means that we have no reason to avoid paths where the inner side also uses a disabled path. When we loop over the inner rel's pathlist looking for ways to build a path for the join, we may find some disabled paths there, and the join paths we build from those paths are disabled, but so are the join paths where we use a non-disabled path on the inner side. So those paths are just competing with each other on cost, and the path type that is supposedly disabled on the outer side of the join ends up not really being very disabled at all. More precisely, if disabling a plan type causes paths to be discarded completely before the join paths are constructed, then they actually do get removed from consideration. But if those paths make it into inner rel's path list, even way out towards the end, then paths derived from them can jump to the front of the joinrel's path list. The same kind of problem happens with Append or MergeAppend nodes. The regression tests expect that we'll avoid disabled plan types whenever possible even if we can't avoid them completely; for instance, the matest0 table intentionally omits an index on one child table. Disabling sequential scans is expected to disable them for all of the other child tables even though for that particular child table there is no other option. 
But that behavior is hard to achieve if every path for the parent rel is "equally disabled". You either want the path that uses only the one required SeqScan to be not-disabled even though one of its children is disabled ... or you want the disabled flag to be more than a Boolean. And while there's probably more than one way to make it work, the easiest thing seems to be to just have a disabled-counter in every node that gets initialized to the total disabled-counter values of all of its children, and then you add 1 if that node is itself doing something that is disabled, i.e. the exact opposite of what I said in the quote above. I haven't done enough work to know whether that would match the current behavior, let alone whether it would have acceptable performance, and I'm not at all convinced that's the right direction anyway. Actually, at the moment, I don't have a very good idea at all what the right direction is. I do have a feeling that this patch is probably not going in the right direction, but I might be wrong about that too. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Allow non-superuser to cancel superuser tasks.
On Mon, Apr 01, 2024 at 02:29:29PM +, Leung, Anthony wrote: > I've attached the updated patch with some improvements. Thanks!

+ pg_signal_autovacuum
+ Allow terminating backend running autovacuum

I think we should be more precise here by calling out the exact types of workers: "Allow signaling autovacuum worker processes to..."

-		if ((!OidIsValid(proc->roleId) || superuser_arg(proc->roleId)) &&
-			!superuser())
-			return SIGNAL_BACKEND_NOSUPERUSER;
-
-		/* Users can signal backends they have role membership in. */
-		if (!has_privs_of_role(GetUserId(), proc->roleId) &&
-			!has_privs_of_role(GetUserId(), ROLE_PG_SIGNAL_BACKEND))
-			return SIGNAL_BACKEND_NOPERMISSION;
+		if (!superuser())
+		{
+			if (!OidIsValid(proc->roleId))
+			{
+				/*
+				 * We only allow user with pg_signal_autovacuum role to terminate
+				 * autovacuum worker as an exception.
+				 */
+				if (!(pg_stat_is_backend_autovac_worker(proc->backendId) &&
+					  has_privs_of_role(GetUserId(), ROLE_PG_SIGNAL_AUTOVACUUM)))
+					return SIGNAL_BACKEND_NOSUPERUSER;
+			}
+			else
+			{
+				if (superuser_arg(proc->roleId))
+					return SIGNAL_BACKEND_NOSUPERUSER;
+
+				/* Users can signal backends they have role membership in. */
+				if (!has_privs_of_role(GetUserId(), proc->roleId) &&
+					!has_privs_of_role(GetUserId(), ROLE_PG_SIGNAL_BACKEND))
+					return SIGNAL_BACKEND_NOPERMISSION;
+			}
+		}

I don't think we should rely on !OidIsValid(proc->roleId) for signaling autovacuum workers. That might not always be true, and I don't see any need to rely on that otherwise. IMHO we should just add a few lines before the existing code, which doesn't need to be changed at all:

	if (pg_stat_is_backend_autovac_worker(proc->backendId) &&
		!has_privs_of_role(GetUserId(), ROLE_PG_SIGNAL_AUTOVACUUM))
		return SIGNAL_BACKEND_NOAUTOVACUUM;

I also think we need to return something besides SIGNAL_BACKEND_NOSUPERUSER in this case. Specifically, we probably need to introduce a new value and provide the relevant error messages in pg_cancel_backend() and pg_terminate_backend().
+/* --
+ * pg_stat_is_backend_autovac_worker() -
+ *
+ * Return whether the backend of the given backend id is of type autovacuum worker.
+ */
+bool
+pg_stat_is_backend_autovac_worker(BackendId beid)
+{
+	PgBackendStatus *ret;
+
+	Assert(beid != InvalidBackendId);
+
+	ret = pgstat_get_beentry_by_backend_id(beid);
+
+	if (!ret)
+		return false;
+
+	return ret->st_backendType == B_AUTOVAC_WORKER;
+}

Can we have this function return the backend type so that we don't have to create a new function for every possible type? That might be handy in the future. I haven't looked too closely, but I'm pretty skeptical that the test suite in your patch would be stable. Unfortunately, I don't have any better ideas at the moment besides not adding a test for this new role. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Streaming read-ready sequential scan code
On Wed, Mar 27, 2024 at 08:47:03PM -0400, Melanie Plageman wrote: > On Fri, Mar 8, 2024 at 4:56 PM Melanie Plageman > wrote: > > > > On Sat, Mar 02, 2024 at 06:07:48PM -0500, Melanie Plageman wrote: > > > On Wed, Feb 28, 2024 at 12:30 PM Melanie Plageman > > > wrote: > > > > > > > > On Mon, Feb 26, 2024 at 03:56:57PM -0500, Melanie Plageman wrote: > > > > > On Mon, Feb 19, 2024 at 6:05 PM Melanie Plageman > > > > > wrote: > > > > > > > > > > > > On Mon, Jan 29, 2024 at 4:17 PM Melanie Plageman > > > > > > wrote: > > > > > > > > > > > > > > There is an outstanding question about where to allocate the > > > > > > > PgStreamingRead object for sequential scans > > > > > > > > > > > > I've written three alternative implementations of the actual > > > > > > streaming > > > > > > read user for sequential scan which handle the question of where to > > > > > > allocate the streaming read object and how to handle changing scan > > > > > > direction in different ways. > > > > > > > > > > > > Option A) > > > > > > https://github.com/melanieplageman/postgres/tree/seqscan_pgsr_initscan_allocation > > > > > > - Allocates the streaming read object in initscan(). Since we do not > > > > > > know the scan direction at this time, if the scan ends up not being > > > > > > a > > > > > > forwards scan, the streaming read object must later be freed -- so > > > > > > this will sometimes allocate a streaming read object it never uses. > > > > > > - Only supports ForwardScanDirection and once the scan direction > > > > > > changes, streaming is never supported again -- even if we return to > > > > > > ForwardScanDirection > > > > > > - Must maintain a "fallback" codepath that does not use the > > > > > > streaming read API > > > > > > > > > > Attached is a version of this patch which implements a "reset" > > > > > function for the streaming read API which should be cheaper than the > > > > > full pg_streaming_read_free() on rescan. 
This can easily be ported to > > > > > work on any of my proposed implementations (A/B/C). I implemented it > > > > > on A as an example. > > > > > > > > Attached is the latest version of this patchset -- rebased in light of > > > > Thomas' updates to the streaming read API [1]. I have chosen the > > > > approach I think we should go with. It is a hybrid of my previously > > > > proposed approaches. > > > > > > While investigating some performance concerns, Andres pointed out that > > > the members I add to HeapScanDescData in this patch push rs_cindex and > > > rs_ntuples to another cacheline and introduce a 4-byte hole. Attached > > > v4's HeapScanDescData is as well-packed as on master and its members > > > are reordered so that rs_cindex and rs_ntuples are back on the second > > > cacheline of the struct's data. > > > > I did some additional profiling and realized that dropping the > > unlikely() from the places we check rs_inited frequently was negatively > > impacting performance. v5 adds those back and also makes a few other > > very minor cleanups. > > > > Note that this patch set has a not yet released version of Thomas > > Munro's Streaming Read API with a new ramp-up logic which seems to fix a > > performance issue I saw with my test case when all of the sequential > > scan's blocks are in shared buffers. Once he sends the official new > > version, I will rebase this and point to his explanation in that thread. > > Attached v6 has the version of the streaming read API mentioned here > [1]. This resolved the fully-in-shared-buffers regressions > investigated in that thread by Andres, Bilal, and Thomas. Attached v7 has version 14 of the streaming read API as well as a few small tweaks to comments and code. I noticed that 0001 in the set posed a small regression from master for a sequential scan of a relation already in shared buffers.
While investigating this, I saw that heapfetchbuf() was still not being inlined (compiled at -O2) and when I promoted heapfetchbuf() from static inline to static pg_attribute_always_inline, most of the very small regression I saw went away. I don't know if I squashed the issue entirely, though. - Melanie >From db6219d9ea689fdd0150aa0fbbeaa2ee11364aa8 Mon Sep 17 00:00:00 2001 From: Melanie Plageman Date: Sat, 27 Jan 2024 18:39:37 -0500 Subject: [PATCH v7 1/4] Split heapgetpage into two parts heapgetpage(), a per-block utility function used in heap scans, read a passed-in block into a buffer and then, if page-at-a-time processing was enabled, pruned the page and built an array of its visible tuples. This was used for sequential and sample scans. Future commits will add support for streaming reads. The streaming read API will read in the blocks specified by a callback, but any significant per-page processing should be done synchronously on the buffer yielded by the streaming read API. To support this, separate the logic in heapgetpage() to read a block into a buffer from that which prunes the page and builds an array of the visible tuples. The former
Re: Statistics Import and Export
> > > I still think that we could just declare the function strict, if we > use the variadic-any approach. Passing a null in any position is > indisputable caller error. However, if you're allergic to silently > doing nothing in such a case, we could have pg_set_attribute_stats > check each argument and throw an error. (Or warn and keep going; > but according to the design principle I posited earlier, this'd be > the sort of thing we don't need to tolerate.) > Any thoughts about going back to having a return value, a caller could then see that the function returned NULL rather than whatever the expected value was (example: TRUE)?
RE: Psql meta-command conninfo+
Friends, good afternoon! (v25) >if (pset.version >= 14) > one thing; >else if (pset.version > 90500) > second thing; > else > third thing; > >This also appears where you add the GSSAPI columns; and it's also in the >final part where you append the FROM clause, though it looks a bit >different there. > >- You have three lines to output a semicolon at the end of the query >based on version number. Remove the first two, and just have a final >one where the semicolon is added unconditionally. Adjustment made. > - I don't think one for each item in the docs is reasonable. > There's too much vertical whitespace in the output. Maybe do this > instead: > >[...] >database connection. When + is appended, >more details about the connection are displayed in table >format: > > > > Database: The name of the current > database on this connection. > > > > Authenticated User: The authenticated > user at the time of psql connection with the server. > > > ... > Adjustment made. But I think it needs a review glance. > - This text is wrong to start with "Returns the": > > System User: Returns the authentication method and the identity (if > any) that the user presented during the authentication cycle before > they were assigned a database role. It is represented as > auth_method:identity or NULL if the user has not been authenticated. > > That minor point aside, I disagree with Sami about repeating the docs > for system_user() here. I would just say "The authentication data > provided for this connection; see the function system_user() for more > details." with a link to the appropriate section of the docs. Making > us edit this doc if we ever modify the behavior of the function is not > great. Here I considered your suggestion (Sami and Álvaro's). However, I haven't yet added the links for the functions system_user(), current_user(), and session_user(). I'm not sure how to do it. Any suggestion on how to create/add the link? Regards, Maiquel Grassi. 
v25-0001-psql-meta-command-conninfo-plus.patch Description: v25-0001-psql-meta-command-conninfo-plus.patch
Re: Broken error detection in genbki.pl
On 2024-03-20 We 12:44, Tom Lane wrote: David Wheeler complained over in [1] that genbki.pl fails to produce a useful error message if there's anything wrong in a catalog data file. He's right about that, but the band-aid patch he proposed doesn't improve the situation much. The actual problem is that key error messages in genbki.pl expect $bki_values->{line_number} to be set, which it is not because we're invoking Catalog::ParseData with $preserve_formatting = 0, and that runs a fast path that doesn't do line-by-line processing and hence doesn't/can't fill that field. I'm quite sure that those error cases worked as-intended when first coded. I did not check the git history, but I suppose that somebody added the non-preserve_formatting fast path later without any consideration for the damage it was doing to error reporting ability. I'm of the opinion that this change was penny-wise and pound-foolish. On my machine, the time to re-make the bki files with the fast path enabled is 0.597s, and with it disabled (measured with the attached quick-hack patch) it's 0.612s. Is that savings worth future hackers having to guess what they broke and where in a large .dat file? As you can see from the quick-hack patch, it's kind of a mess to use the $preserve_formatting = 1 case, because there are a lot of loops that have to be taught to ignore comment lines, which we don't really care about except in reformat_dat_file.pl. What I suggest doing, but have not yet coded up, is to nuke the fast path in Catalog::ParseData and reinterpret the $preserve_formatting argument as controlling whether comments and whitespace are entered in the output data structure, but not whether we parse it line-by-line. That should fix this problem with zero change in the callers, and also buy back a little bit of the cost compared to this quick hack. Thoughts? 
regards, tom lane [1] https://www.postgresql.org/message-id/60EF4E11-BC1C-4034-B37B-695662D28AD2%40justatheory.com Makes sense. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Re: pg_upgrade failing for 200+ million Large Objects
On Mon, Apr 01, 2024 at 03:28:26PM -0400, Tom Lane wrote: > Nathan Bossart writes: >> The one design point that worries me a little is the non-configurability of >> --transaction-size in pg_upgrade. I think it's fine to default it to 1,000 >> or something, but given how often I've had to fiddle with >> max_locks_per_transaction, I'm wondering if we might regret hard-coding it. > > Well, we could add a command-line switch to pg_upgrade, but I'm > unconvinced that it'd be worth the trouble. I think a very large > fraction of users invoke pg_upgrade by means of packager-supplied > scripts that are unlikely to provide a way to pass through such > a switch. I'm inclined to say let's leave it as-is until we get > some actual field requests for a switch. Okay. I'll let you know if I see anything. IIRC usually the pg_dump side of pg_upgrade is more prone to lock exhaustion, so you may very well be right that this is unnecessary. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: pg_upgrade failing for 200+ million Large Objects
Nathan Bossart writes: > Sorry for taking so long to get back to this one. Overall, I think the > code is in decent shape. Thanks for looking at it! > The one design point that worries me a little is the non-configurability of > --transaction-size in pg_upgrade. I think it's fine to default it to 1,000 > or something, but given how often I've had to fiddle with > max_locks_per_transaction, I'm wondering if we might regret hard-coding it. Well, we could add a command-line switch to pg_upgrade, but I'm unconvinced that it'd be worth the trouble. I think a very large fraction of users invoke pg_upgrade by means of packager-supplied scripts that are unlikely to provide a way to pass through such a switch. I'm inclined to say let's leave it as-is until we get some actual field requests for a switch. regards, tom lane
Re: Statistics Import and Export
Corey Huinker writes: > So what's the behavior when the user fails to supply a parameter that is > currently NOT NULL checked (example: avg_width)? Is that a WARN-and-exit? I still think that we could just declare the function strict, if we use the variadic-any approach. Passing a null in any position is indisputable caller error. However, if you're allergic to silently doing nothing in such a case, we could have pg_set_attribute_stats check each argument and throw an error. (Or warn and keep going; but according to the design principle I posited earlier, this'd be the sort of thing we don't need to tolerate.) regards, tom lane
RE: Psql meta-command conninfo+
Database: The database name of the connection. Authenticated User: The authenticated database user of the connection. Socket Directory: The Unix socket directory of the connection. NULL if host or hostaddr are specified. Host: The host name or address of the connection. NULL if a Unix socket is used. Server Port: The port number of the server for this connection. NULL if a Unix socket is used. Server Address: The IP address of the host name. NULL if a Unix socket is used. Client Address: The IP address of the client connected to this backend. NULL if a Unix socket is used. Client Port: The port of the client connected to this backend. NULL if a Unix socket is used. Backend PID: The process id of the backend for the connection. Application name: psql is the default for a psql connection unless otherwise specified in the configuration parameter. -//- Hi Sami, I considered your suggestions and Álvaro's. Regards, Maiquel Grassi.
Re: pg_upgrade failing for 200+ million Large Objects
On Wed, Mar 27, 2024 at 10:08:26AM -0500, Nathan Bossart wrote: > On Wed, Mar 27, 2024 at 10:54:05AM -0400, Tom Lane wrote: >> Michael Banck writes: >>> What is the status of this? In the commitfest, this patch is marked as >>> "Needs Review" with Nathan as reviewer - Nathan, were you going to take >>> another look at this or was your mail from January 12th a full review? >> >> In my mind the ball is in Nathan's court. I feel it's about >> committable, but he might not agree. > > I'll prioritize another round of review on this one. FWIW I don't remember > having any major concerns on a previous version of the patch set I looked > at. Sorry for taking so long to get back to this one. Overall, I think the code is in decent shape. Nothing stands out after a couple of passes. The small amount of runtime improvement cited upthread is indeed a bit disappointing, but IIUC this at least sets the stage for additional parallelism in the future, and the memory/disk usage improvements are nothing to sneeze at, either. The one design point that worries me a little is the non-configurability of --transaction-size in pg_upgrade. I think it's fine to default it to 1,000 or something, but given how often I've had to fiddle with max_locks_per_transaction, I'm wondering if we might regret hard-coding it. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Security lessons from liblzma
Bruce Momjian writes: > I was more asking if users have access to patches so they could recreate > the build by using the Postgres git tree and supplied OS-specific > patches. AFAIK, every open-source distro makes all the pieces needed to rebuild their packages available to users. It wouldn't be much of an open-source situation otherwise. You do have to learn their package build process. regards, tom lane
Re: Statistics Import and Export
> Ah, yeah, you could change the function to have more parameters,
> given the assumption that all calls will be named-parameter style.
> I still suggest that my proposal is more robust for the case where
> the dump lists parameters that the receiving system doesn't have.

So what's the behavior when the user fails to supply a parameter that is
currently NOT NULL checked (example: avg_width)? Is that a WARN-and-exit?
Re: Statistics Import and Export
> I think pg_upgrade could check for the existence of extended statistics
> in any database and adjust the analyze recommendation wording
> accordingly.

+1
Re: Combine Prune and Freeze records emitted by vacuum
On Mon, Apr 1, 2024 at 1:37 PM Heikki Linnakangas wrote:
>
> Committed the first of the remaining patches with those changes. And
> also this, which is worth calling out:
>
> > 	if (prstate.htsv[offnum] == HEAPTUPLE_DEAD)
> > 	{
> > 		ItemId		itemid = PageGetItemId(page, offnum);
> > 		HeapTupleHeader htup = (HeapTupleHeader) PageGetItem(page, itemid);
> >
> > 		if (likely(!HeapTupleHeaderIsHotUpdated(htup)))
> > 		{
> > 			HeapTupleHeaderAdvanceConflictHorizon(htup,
> > 												  &prstate.latest_xid_removed);
> > 			heap_prune_record_unused(&prstate, offnum, true);
> > 		}
> > 		else
> > 		{
> > 			/*
> > 			 * This tuple should've been processed and removed as part of
> > 			 * a HOT chain, so something's wrong. To preserve evidence,
> > 			 * we don't dare to remove it. We cannot leave behind a DEAD
> > 			 * tuple either, because that will cause VACUUM to error out.
> > 			 * Throwing an error with a distinct error message seems like
> > 			 * the least bad option.
> > 			 */
> > 			elog(ERROR, "dead heap-only tuple (%u, %d) is not linked to from any HOT chain",
> > 				 blockno, offnum);
> > 		}
> > 	}
> > 	else
> > 		heap_prune_record_unchanged_lp_normal(page, &prstate, offnum);
>
> As you can see, I turned that into a hard error. Previously, that code
> was at the top of heap_prune_chain(), and it was normal to see DEAD
> heap-only tuples there, because they would presumably get processed
> later as part of a HOT chain. But now this is done after all the HOT
> chains have already been processed.
>
> Previously if there was a dead heap-only tuple like that on the page for
> some reason, it was silently not processed by heap_prune_chain()
> (because it was assumed that it would be processed later as part of a
> HOT chain), and was left behind as a HEAPTUPLE_DEAD tuple. If the
> pruning was done as part of VACUUM, VACUUM would fail with "ERROR:
> unexpected HeapTupleSatisfiesVacuum result". Or am I missing something?

I think you are right. I wasn't sure if there was some way for a HOT,
DEAD tuple to be not HOT-updated, but that doesn't make much sense.

> Now you get that above error also on on-access pruning, which is not
> ideal. But I don't remember hearing about corruption like that, and
> you'd get the error on VACUUM anyway.

Yea, that makes sense.

One thing I don't really understand is why vacuum has its own system for
saving and restoring error information for context messages
(LVSavedErrInfo and update/restore_vacuum_err_info()). I'll confess I
don't know much about how error cleanup works in any sub-system. But it
stuck out to me that vacuum has its own. I assume it is okay to add new
error messages and they somehow will work with the existing system?

- Melanie
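For what it's worth, the save/restore machinery Melanie asks about boils down to a small pattern: the error-context callback reads the current phase/block from a struct, and any nested operation saves the caller's values, overwrites them, and restores them afterwards. A stripped-down sketch (invented names, not the vacuumlazy.c code):

```c
#include <assert.h>

/*
 * Stripped-down sketch of the pattern behind LVSavedErrInfo and
 * update/restore_vacuum_err_info() (invented names, not the vacuumlazy.c
 * code).  The error-context callback reads the current phase/block from a
 * struct; a nested operation saves the caller's values, overwrites them,
 * and restores them on the way out, so any error raised in between is
 * reported with the right context.
 */
typedef struct
{
    int      phase;
    unsigned blkno;
} ErrInfo;

static ErrInfo current;         /* what the errcontext callback would read */

static void
update_err_info(ErrInfo *saved, int phase, unsigned blkno)
{
    if (saved)
        *saved = current;       /* remember the caller's context */
    current.phase = phase;
    current.blkno = blkno;
}

static void
restore_err_info(const ErrInfo *saved)
{
    current = *saved;
}
```

Because new messages only read the struct at report time, adding error messages composes with the existing save/restore calls without further changes.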
Re: Statistics Import and Export
> That's not what I suggested at all. The function parameters would
> be declared anyarray, but the values passed to them would be coerced
> to the correct concrete array types. So as far as the coercion rules
> are concerned this'd be equivalent to the variadic-any approach.

+1

> > That's pretty persuasive. It also means that we need to trap for error in
> > the array_in() calls, as that function does not yet have a _safe() mode.
>
> Well, the approach I'm advocating for would have the array input and
> coercion done by the calling query before control ever reaches
> pg_set_attribute_stats, so that any incorrect-for-the-data-type values
> would result in hard errors. I think that's okay for the same reason
> you probably figured you didn't have to trap array_in: it's the fault
> of the originating pg_dump if it offers a value that doesn't coerce to
> the datatype it claims the value is of.

+1
Re: [plpython] Add missing volatile qualifier.
Nathan Bossart writes:
> On Mon, Apr 01, 2024 at 11:57:07AM -0400, Tom Lane wrote:
>> Good catch! It looks like the consequences of a failure would be
>> pretty minimal --- AFAICS, no worse than a possible failure to remove
>> a refcount on Py_None --- but that's still a bug.

> Huh. I seem to have dropped that "volatile" shortly before committing for
> some reason [0].

Oh, I'd forgotten that discussion. Given that we were both confused
about the need for it, all the more reason to try to avoid using a
within-PG_TRY assignment.

> Your fix seems fine to me.

Thanks for looking, I'll push it shortly.

regards, tom lane
Re: [plpython] Add missing volatile qualifier.
On Mon, Apr 01, 2024 at 11:57:07AM -0400, Tom Lane wrote:
> Xing Guo writes:
>> I'm playing a toy static analysis checker with PostgreSQL and found a
>> variable is missing volatile qualifier.
>
> Good catch! It looks like the consequences of a failure would be
> pretty minimal --- AFAICS, no worse than a possible failure to remove
> a refcount on Py_None --- but that's still a bug.

Huh. I seem to have dropped that "volatile" shortly before committing
for some reason [0].

> I don't care for your proposed fix though. I think the real problem
> here is schizophrenia about when to set up pltargs, and we could
> fix it more nicely as attached. (Perhaps the Asserts are overkill
> though.)

Your fix seems fine to me.

[0] https://postgr.es/m/20230504234235.GA2419591%40nathanxps13

-- Nathan Bossart Amazon Web Services: https://aws.amazon.com
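For readers following along: PG_TRY/PG_CATCH are built on setjmp/longjmp, and C only guarantees the post-longjmp value of an automatic variable if it is declared volatile (C11 7.13.2.1). A stand-alone sketch of the rule — plain setjmp, not the PL/Python code, with "resource" playing the role of pltargs:

```c
#include <setjmp.h>
#include <stddef.h>
#include <assert.h>

/*
 * Stand-alone sketch (not the PL/Python code): a local that is modified
 * after setjmp() and read after longjmp() must be volatile, otherwise its
 * value in the "catch" path is indeterminate.
 */
static jmp_buf env;
static int cleanup_ran;

static void
do_work(int fail)
{
    if (fail)
        longjmp(env, 1);        /* simulates elog(ERROR, ...) */
}

static void
run(int fail)
{
    static char buf[8];
    /* volatile: its value at longjmp time is guaranteed to be seen below */
    void *volatile resource = NULL;

    cleanup_ran = 0;
    if (setjmp(env) == 0)
    {
        resource = buf;         /* modified after setjmp ("inside PG_TRY") */
        do_work(fail);
    }
    else
    {
        /* read after longjmp ("inside PG_CATCH"): needs the volatile above */
        if (resource != NULL)
            cleanup_ran = 1;
    }
}
```

Without the qualifier the compiler may keep `resource` in a register that longjmp does not restore — exactly the class of bug the missing qualifier reintroduced, and why Tom suggests avoiding within-PG_TRY assignments altogether.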
Re: Statistics Import and Export
Jeff Davis writes: > On Sun, 2024-03-31 at 14:48 -0400, Tom Lane wrote: >> What happens when >> somebody adds a new stakind (and hence new pg_stats column)? > Why would you need to overload in this case? Wouldn't we just define a > new function with more optional named parameters? Ah, yeah, you could change the function to have more parameters, given the assumption that all calls will be named-parameter style. I still suggest that my proposal is more robust for the case where the dump lists parameters that the receiving system doesn't have. regards, tom lane
Re: Fix some resources leaks (src/bin/pg_basebackup/pg_createsubscriber.c)
On Mon, Apr 1, 2024 at 14:52, Tom Lane wrote:
> "Euler Taveira" writes:
> > On Wed, Mar 27, 2024, at 8:50 PM, Ranier Vilela wrote:
> >> Coverity has some reports in the new code
> >> pg_createsubscriber.c
> >> I think that Coverity is right.
>
> > It depends on your "right" definition. If your program execution is ephemeral
> > and the leak is just before exiting, do you think it's worth it?
>
> I agree with Ranier, actually. The functions in question don't
> exit() themselves, but return control to a caller that might or
> might not choose to exit. I don't think they should be assuming
> that an exit will happen promptly, even if that's true today.
>
> The community Coverity run found a few additional leaks of the same
> kind in that file. I pushed a merged fix for all of them.

Thanks Tom, for the commit.

best regards,
Ranier Vilela
Re: Statistics Import and Export
Jeff Davis writes: > On Mon, 2024-04-01 at 13:11 -0400, Bruce Momjian wrote: >> Reality check --- are we still targeting this feature for PG 17? > I see a few useful pieces here: > 1. Support import of statistics (i.e. > pg_set_{relation|attribute}_stats()). > 2. Support pg_dump of stats > 3. Support pg_upgrade with stats > It's possible that not all of them make it, but let's not dismiss the > entire feature yet. The unresolved questions largely have to do with the interactions between these pieces. I think we would seriously regret setting any one of them in stone before all three are ready to go. regards, tom lane
Re: psql's FETCH_COUNT (cursor) is not being respected for CTEs
Laurenz Albe wrote:

> Here is the code review for patch number 2:
>
> +static void
> +CloseGOutput(FILE *gfile_fout, bool is_pipe)
>
> It makes sense to factor out this code.
> But shouldn't these functions have a prototype at the beginning of the file?

Looking at the other static functions in psql/common.c, there are 22 of
them but only 3 have prototypes at the top of the file. These 3 functions
are called before being defined, so these prototypes are mandatory.
The other static functions that are defined before being called happen
not to have forward declarations, so SetupGOutput() and CloseGOutput()
follow that model.

> Here is a suggestion for a consolidated comment:
>
> Fetch the result in chunks if FETCH_COUNT is set. We don't enable chunking
> if SHOW_ALL_RESULTS is false, since that requires us to accumulate all rows
> before we can tell what should be displayed, which would counter the idea
> of FETCH_COUNT. Chunk fetching is also disabled if \gset, \crosstab,
> \gexec and \watch are used.

OK, done like that.

> + if (fetch_count > 0 && result_status == PGRES_TUPLES_CHUNK)
>
> Could it be that result_status == PGRES_TUPLES_CHUNK, but fetch_count is 0?
> if not, perhaps there should be an Assert that verifies that, and the "if"
> statement should only check for the latter condition.

Good point. In fact it can be simplified to
if (result_status == PGRES_TUPLES_CHUNK), and fetch_count as a variable
can be removed from the function. Done that way.

> > --- a/src/bin/psql/t/001_basic.pl
> > +++ b/src/bin/psql/t/001_basic.pl
> > @@ -184,10 +184,10 @@ like(
> >   "\\set FETCH_COUNT 1\nSELECT error;\n\\errverbose",
> >   on_error_stop => 0))[2],
> >   qr/\A^psql::2: ERROR: .*$
> > -^LINE 2: SELECT error;$
> > +^LINE 1: SELECT error;$
> > ^ *^.*$
> > ^psql::3: error: ERROR: [0-9A-Z]{5}: .*$
> > -^LINE 2: SELECT error;$
> > +^LINE 1: SELECT error;$
>
> Why does the output change? Perhaps there is a good and harmless
> explanation, but the naïve expectation would be that it doesn't.

Unpatched, psql builds this query:

  DECLARE _psql_cursor NO SCROLL CURSOR FOR \n

therefore the user query starts at line 2. With the patch, the user
query is sent as-is, starting at line 1, hence the different error
location.

> After fixing the problem manually, it builds without warning.
> The regression tests pass, and the feature works as expected.

Thanks for testing. Updated patches are attached.

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite

From 0fefaee7d5b3003ad0d089ea9e92675c6f50245f Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Daniel=20V=C3=A9rit=C3=A9?=
Date: Mon, 1 Apr 2024 19:46:20 +0200
Subject: [PATCH v7 1/2] Implement retrieval of results in chunks with libpq.

This mode is similar to the single-row mode except that chunks of results
contain up to N rows instead of a single row. It is meant to reduce the
overhead of the row-by-row allocations for large result sets. The mode is
selected with PQsetChunkedRowsMode(int maxRows) and results have the new
status code PGRES_TUPLES_CHUNK.
---
 doc/src/sgml/libpq.sgml                         |  98 +++
 .../libpqwalreceiver/libpqwalreceiver.c         |   1 +
 src/bin/pg_amcheck/pg_amcheck.c                 |   1 +
 src/interfaces/libpq/exports.txt                |   1 +
 src/interfaces/libpq/fe-exec.c                  | 117 +++---
 src/interfaces/libpq/libpq-fe.h                 |   4 +-
 src/interfaces/libpq/libpq-int.h                |   7 +-
 7 files changed, 185 insertions(+), 44 deletions(-)

diff --git a/doc/src/sgml/libpq.sgml b/doc/src/sgml/libpq.sgml
index d3e87056f2..1814921d5a 100644
--- a/doc/src/sgml/libpq.sgml
+++ b/doc/src/sgml/libpq.sgml
@@ -3545,7 +3545,20 @@ ExecStatusType PQresultStatus(const PGresult *res);
 The PGresult contains a single result tuple from the
 current command. This status occurs only when single-row mode has been
 selected for the query
-(see ).
+(see ).
+
+ PGRES_TUPLES_CHUNK
+
+The PGresult contains several tuples
+from the current command. The count of tuples cannot exceed
+the maximum passed to .
+This status occurs only when the chunked mode has been selected
+for the query (see ).
@@ -5197,8 +5210,8 @@ PGresult *PQgetResult(PGconn *conn);
 Another frequently-desired feature that can be obtained with
  and
- is retrieving large query results a row at a time. This is discussed
- in .
+ is retrieving large query results a limited number of rows at a time. This is discussed
+ in .
@@ -5562,12 +5575,13 @@ int PQflush(PGconn *conn);
- To enter single-row mode, call PQsetSingleRowMode
- before
Re: Fix some resources leaks (src/bin/pg_basebackup/pg_createsubscriber.c)
"Euler Taveira" writes: > On Wed, Mar 27, 2024, at 8:50 PM, Ranier Vilela wrote: >> Coverity has some reports in the new code >> pg_createsubscriber.c >> I think that Coverity is right. > It depends on your "right" definition. If your program execution is ephemeral > and the leak is just before exiting, do you think it's worth it? I agree with Ranier, actually. The functions in question don't exit() themselves, but return control to a caller that might or might not choose to exit. I don't think they should be assuming that an exit will happen promptly, even if that's true today. The community Coverity run found a few additional leaks of the same kind in that file. I pushed a merged fix for all of them. regards, tom lane
Re: Psql meta-command conninfo+
> That minor point aside, I disagree with Sami about repeating the docs
> for system_user() here. I would just say "The authentication data
> provided for this connection; see the function system_user() for more
> details."

+1. FWIW, up the thread [1] I did mention we should link to the
functions page, but did not have a very strong opinion on that. I do
like your suggestion, and the same should be done for "Current User"
and "Session User" as well.

[1] https://www.postgresql.org/message-id/640B2586-EECF-44C0-B474-CA8510F8EAFC%40amazon.com

Regards,
Sami
Re: Statistics Import and Export
On Sun, 2024-03-31 at 14:48 -0400, Tom Lane wrote:
> What happens when
> somebody adds a new stakind (and hence new pg_stats column)?
> You could try to add an overloaded pg_set_attribute_stats
> version with more parameters, but I'm pretty sure that would
> lead to "ambiguous function call" failures when trying to load
> old dump files containing only the original parameters.

Why would you need to overload in this case? Wouldn't we just define a
new function with more optional named parameters?

> The
> present design is also fragile in that an unrecognized parameter
> will lead to a parse-time failure and no function call happening,
> which is less robust than I'd like.

I agree on this point; I found this annoying when testing the feature.

> So this leads me to suggest that we'd be best off with a VARIADIC
> ANY signature, where the variadic part consists of alternating
> parameter labels and values:

I didn't consider this and I think it has a lot of advantages. It's
slightly unfortunate that we can't make them explicitly name/value
pairs, but pg_dump can use whitespace or even SQL comments to make it
more readable.

Regards, Jeff Davis
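To illustrate the robustness argument in C terms — purely hypothetical names and types, the real function would receive Datums through fmgr, not va_arg — the point of alternating labels and values is that an unrecognized label can be skipped with a warning, whereas an unrecognized *named parameter* fails at parse time before the function even runs:

```c
#include <stdarg.h>
#include <string.h>
#include <assert.h>

/*
 * Hypothetical sketch of the "alternating parameter labels and values"
 * convention (invented names/types; not the patch's code).  An unknown
 * label from a newer pg_dump is counted and skipped instead of failing
 * the whole call.
 */
typedef struct
{
    double null_frac;
    int    avg_width;
    int    warnings;            /* unrecognized labels seen */
} AttrStats;

/* labels and values alternate; a NULL label terminates the list */
static void
set_attribute_stats(AttrStats *stats, ...)
{
    va_list     args;
    const char *label;

    va_start(args, stats);
    while ((label = va_arg(args, const char *)) != NULL)
    {
        if (strcmp(label, "null_frac") == 0)
            stats->null_frac = va_arg(args, double);
        else if (strcmp(label, "avg_width") == 0)
            stats->avg_width = va_arg(args, int);
        else
        {
            /*
             * Unknown label: skip its value and keep going.  (For
             * simplicity this sketch assumes unknown values are
             * pointer-sized; the real thing would inspect the Datum type.)
             */
            (void) va_arg(args, void *);
            stats->warnings++;
        }
    }
    va_end(args);
}
```

A dump produced by a newer server can then carry extra label/value pairs that an older receiving system warns about and ignores, which is the robustness Tom is after.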
Re: Combine Prune and Freeze records emitted by vacuum
On 01/04/2024 19:08, Melanie Plageman wrote:
> On Mon, Apr 01, 2024 at 05:17:51PM +0300, Heikki Linnakangas wrote:
> > diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
> > @@ -256,15 +270,16 @@ heap_page_prune(Relation relation, Buffer buffer,
> >  	tup.t_tableOid = RelationGetRelid(relation);
> >
> >  	/*
> > -	 * Determine HTSV for all tuples.
> > +	 * Determine HTSV for all tuples, and queue them up for processing as HOT
> > +	 * chain roots or as a heap-only items.
>
> Reading this comment now as a whole, I would add something like
> "Determining HTSV for all tuples once is required for correctness" to
> the start of the second paragraph. The new conjunction on the first
> paragraph sentence followed by the next paragraph is a bit confusing
> because it sounds like queuing them up for processing is required for
> correctness (which, I suppose it is on some level). Basically, I'm
> just saying that it is now less clear what is required for correctness.

Fixed.

> While I recognize this is a matter of style and not important, I
> personally prefer this for reverse looping:
>
>     for (int i = prstate.nheaponly_items; i --> 0;)

I don't think we use that style anywhere in the Postgres source tree
currently. (And I don't like it ;-) )

> I do think a comment about the reverse order would be nice. I know it
> says something above the first loop to this effect:
>
>     * Processing the items in reverse order (and thus the tuples in
>     * increasing order) increases prefetching efficiency significantly /
>     * decreases the number of cache misses.
>
> So perhaps we could just say "as above, process the items in reverse
> order"

I'm actually not sure why it makes a difference. I would assume all the
data to already be in CPU cache at this point, since the first loop
already accessed it, so I think there's something else going on. But I
didn't investigate it deeper. Anyway, added a comment.

Committed the first of the remaining patches with those changes. And
also this, which is worth calling out:

	if (prstate.htsv[offnum] == HEAPTUPLE_DEAD)
	{
		ItemId		itemid = PageGetItemId(page, offnum);
		HeapTupleHeader htup = (HeapTupleHeader) PageGetItem(page, itemid);

		if (likely(!HeapTupleHeaderIsHotUpdated(htup)))
		{
			HeapTupleHeaderAdvanceConflictHorizon(htup,
												  &prstate.latest_xid_removed);
			heap_prune_record_unused(&prstate, offnum, true);
		}
		else
		{
			/*
			 * This tuple should've been processed and removed as part of
			 * a HOT chain, so something's wrong.  To preserve evidence,
			 * we don't dare to remove it.  We cannot leave behind a DEAD
			 * tuple either, because that will cause VACUUM to error out.
			 * Throwing an error with a distinct error message seems like
			 * the least bad option.
			 */
			elog(ERROR, "dead heap-only tuple (%u, %d) is not linked to from any HOT chain",
				 blockno, offnum);
		}
	}
	else
		heap_prune_record_unchanged_lp_normal(page, &prstate, offnum);

As you can see, I turned that into a hard error. Previously, that code
was at the top of heap_prune_chain(), and it was normal to see DEAD
heap-only tuples there, because they would presumably get processed
later as part of a HOT chain. But now this is done after all the HOT
chains have already been processed.

Previously if there was a dead heap-only tuple like that on the page
for some reason, it was silently not processed by heap_prune_chain()
(because it was assumed that it would be processed later as part of a
HOT chain), and was left behind as a HEAPTUPLE_DEAD tuple. If the
pruning was done as part of VACUUM, VACUUM would fail with "ERROR:
unexpected HeapTupleSatisfiesVacuum result". Or am I missing something?

Now you get that above error also on on-access pruning, which is not
ideal. But I don't remember hearing about corruption like that, and
you'd get the error on VACUUM anyway.

With the next patches, heap_prune_record_unchanged() will do more, and
will also throw an error on a HEAPTUPLE_LIVE tuple, so even though in
the first patch we could print just a WARNING and move on, it gets more
awkward with the rest of the patches.

(Continuing with the remaining patches..)

-- Heikki Linnakangas Neon (https://neon.tech)
Re: Statistics Import and Export
On Sun, Mar 31, 2024 at 07:04:47PM -0400, Tom Lane wrote:
> Corey Huinker writes:
> >> I can't quibble with that view of what has priority. I'm just
> >> suggesting that redesigning what pg_upgrade does in this area
> >> should come later than doing something about extended stats.
>
> > I mostly agree, with the caveat that pg_upgrade's existing message saying
> > that optimizer stats were not carried over wouldn't be 100% true anymore.
>
> I think we can tweak the message wording. I just don't want to be
> doing major redesign of the behavior, nor adding fundamentally new
> monitoring capabilities.

I think pg_upgrade could check for the existence of extended statistics
in any database and adjust the analyze recommendation wording
accordingly.

-- Bruce Momjian https://momjian.us EDB https://enterprisedb.com
Only you can decide what is important to you.
Re: Statistics Import and Export
On Mon, 2024-04-01 at 13:11 -0400, Bruce Momjian wrote: > Reality check --- are we still targeting this feature for PG 17? I see a few useful pieces here: 1. Support import of statistics (i.e. pg_set_{relation|attribute}_stats()). 2. Support pg_dump of stats 3. Support pg_upgrade with stats It's possible that not all of them make it, but let's not dismiss the entire feature yet. Regards, Jeff Davis
Re: Statistics Import and Export
Bruce Momjian writes: > Reality check --- are we still targeting this feature for PG 17? I'm not sure. I think if we put our heads down we could finish the changes I'm suggesting and resolve the other issues this week. However, it is starting to feel like the sort of large, barely-ready patch that we often regret cramming in at the last minute. Maybe we should agree that the first v18 CF would be a better time to commit it. regards, tom lane
Re: Security lessons from liblzma
On Sun, Mar 31, 2024 at 02:12:57PM -0700, Andres Freund wrote:
> Hi,
>
> On 2024-03-31 12:18:29 +0200, Michael Banck wrote:
> > If you ask where they are maintained, the answer is here:
> >
> > https://salsa.debian.org/postgresql/postgresql/-/tree/17/debian/patches?ref_type=heads
> >
> > the other major versions have their own branch.
>
> Luckily these are all quite small, leaving little space to hide stuff. I'd
> still like to get rid of at least some of them.
>
> I've previously proposed a patch to make pkglibdir configurable, I think we
> should just go for that.
>
> For the various defines, ISTM we should just make them #ifndef guarded, then
> they could be overridden by defining them at configure time. Some of them,
> like DEFAULT_PGSOCKET_DIR seem to be overridden by just about every
> distro. And others would be nice to easily override anyway, I e.g. dislike
> the default DEFAULT_PAGER value.

I realize we can move some changes into our code, but packagers are
still going to need a way to do immediate adjustments to match their OS
in time frames that don't match the Postgres release schedule. I was
more asking if users have access to patches so they could recreate the
build by using the Postgres git tree and supplied OS-specific patches.

-- Bruce Momjian https://momjian.us EDB https://enterprisedb.com
Only you can decide what is important to you.
Re: Combine Prune and Freeze records emitted by vacuum
On Mon, Apr 01, 2024 at 05:17:51PM +0300, Heikki Linnakangas wrote:
> On 30/03/2024 07:57, Melanie Plageman wrote:
> > The final state of the code could definitely use more cleanup. I've been
> > staring at it for awhile, so I could use some thoughts/ideas about what
> > part to focus on improving.
>
> Committed some of the changes. I plan to commit at least the first of these
> remaining patches later today. I'm happy with it now, but I'll give it a
> final glance over after dinner.
>
> I'll continue to review the rest after that, but attached is what I have
> now.

Review for 0003-0006 (I didn't have any new thoughts on 0002). I know
you didn't modify them much/at all, but I noticed some things in my
code that could be better.

> From 17e183835a968e81daf7b74a4164b243e2de35aa Mon Sep 17 00:00:00 2001
> From: Melanie Plageman
> Date: Fri, 29 Mar 2024 19:43:09 -0400
> Subject: [PATCH v11 3/7] Introduce PRUNE_DO_* actions
>
> We will eventually take additional actions in heap_page_prune() at the
> discretion of the caller. For now, introduce these PRUNE_DO_* macros and
> turn mark_unused_now, a paramter to heap_page_prune(), into a PRUNE_DO_

paramter -> parameter

> action.
> ---
>  src/backend/access/heap/pruneheap.c  | 51 ++--
>  src/backend/access/heap/vacuumlazy.c | 11 --
>  src/include/access/heapam.h          | 13 ++-
>  3 files changed, 46 insertions(+), 29 deletions(-)
>
> diff --git a/src/backend/access/heap/pruneheap.c
> b/src/backend/access/heap/pruneheap.c
> index fb0ad834f1b..30965c3c5a1 100644
> --- a/src/backend/access/heap/pruneheap.c
> +++ b/src/backend/access/heap/pruneheap.c
> @@ -29,10 +29,11 @@
>  /* Working data for heap_page_prune and subroutines */
>  typedef struct
>  {
> +	/* PRUNE_DO_* arguments */
> +	uint8		actions;

I wasn't sure if actions is a good name. What do you think?

> +
>  	/* tuple visibility test, initialized for the relation */
>  	GlobalVisState *vistest;
> -	/* whether or not dead items can be set LP_UNUSED during pruning */
> -	bool		mark_unused_now;
>
>  	TransactionId new_prune_xid;	/* new prune hint value for page */
>  	TransactionId snapshotConflictHorizon;	/* latest xid removed */

> diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
> index 32a3fbce961..35b8486c34a 100644
> --- a/src/include/access/heapam.h
> +++ b/src/include/access/heapam.h
> @@ -191,6 +191,17 @@ typedef struct HeapPageFreeze
>
>  } HeapPageFreeze;
>
> +/*
> + * Actions that can be taken during pruning and freezing. By default, we will
> + * at least attempt regular pruning.
> + */
>
> +/*
> + * PRUNE_DO_MARK_UNUSED_NOW indicates whether or not dead items can be set
> + * LP_UNUSED during pruning.
> + */
> +#define PRUNE_DO_MARK_UNUSED_NOW (1 << 1)

No reason for me to waste the zeroth bit here. I just realized that I
did this with XLHP_IS_CATALOG_REL too.

    #define XLHP_IS_CATALOG_REL (1 << 1)

> From 083690b946e19ab5e536a9f2689772e7b91d2a70 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman
> Date: Fri, 29 Mar 2024 21:22:14 -0400
> Subject: [PATCH v11 4/7] Prepare freeze tuples in heap_page_prune()
>
> diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
> index 35b8486c34a..ac129692c13 100644
> --- a/src/include/access/heapam.h
> +++ b/src/include/access/heapam.h
>
> +/*
> + * Prepare to freeze if advantageous or required and try to advance
> + * relfrozenxid and relminmxid. To attempt freezing, we will need to determine
> + * if the page is all frozen. So, if this action is set, we will also inform
> + * the caller if the page is all-visible and/or all-frozen and calculate a

I guess we don't inform the caller if the page is all-visible, so this
is not quite right.

> + * snapshot conflict horizon for updating the visibility map.
> + */
> +#define PRUNE_DO_TRY_FREEZE (1 << 2)

> From ef8cb2c089ad9474a6da309593029c08f71b0bb9 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman
> Date: Fri, 29 Mar 2024 21:36:37 -0400
> Subject: [PATCH v11 5/7] Set hastup in heap_page_prune
>
> diff --git a/src/backend/access/heap/pruneheap.c
> b/src/backend/access/heap/pruneheap.c
> index 8bdd6389b25..65b0ed185ff 100644
> --- a/src/backend/access/heap/pruneheap.c
> +++ b/src/backend/access/heap/pruneheap.c
> @@ -66,6 +66,9 @@ typedef struct
>  	bool		processed[MaxHeapTuplesPerPage + 1];
>
>  	int			ndeleted;		/* Number of tuples deleted from the page */
> +
> +	/* Whether or not the page makes rel truncation unsafe */
> +	bool		hastup;
>  } PruneState;
>
>  /* Local functions */
> @@ -271,6 +274,7 @@ heap_page_prune(Relation relation, Buffer buffer,
>  	prstate.ndeleted = 0;
>  	prstate.nroot_items = 0;
>  	prstate.nheaponly_items = 0;
> +	prstate.hastup = false;
>
>  	/*
>  	 * If we will prepare to
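Melanie's point about the wasted zeroth bit, in miniature — simplified flags, not the heapam.h definitions — starting at `(1 << 0)` keeps the mask dense, and membership tests stay plain bitwise ANDs:

```c
#include <stdint.h>
#include <assert.h>

/*
 * Miniature version of the PRUNE_DO_* action-mask idea (simplified flags,
 * not the heapam.h definitions).  Starting at (1 << 0) avoids wasting the
 * zeroth bit as noted above.
 */
#define DO_MARK_UNUSED_NOW  (1 << 0)
#define DO_TRY_FREEZE       (1 << 1)

static int
actions_include(uint8_t actions, uint8_t flag)
{
    return (actions & flag) != 0;
}
```

Skipping bit zero is harmless today, but with a uint8 mask it costs one of only eight available action bits.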
Re: Statistics Import and Export
Jeff Davis writes:
> On Sat, 2024-03-30 at 20:08 -0400, Tom Lane wrote:
>> I haven't looked at the details, but I'm really a bit surprised
>> by Jeff's assertion that CREATE INDEX destroys statistics on the
>> base table. That seems wrong from here, and maybe something we
>> could have it not do. (I do realize that it recalculates reltuples
>> and relpages, but so what? If it updates those, the results should
>> be perfectly accurate.)

> In the v15 of the patch I was looking at, "pg_dump -s" included the
> statistics. The stats appeared first in the dump, followed by the
> CREATE INDEX commands. The latter overwrote the relpages/reltuples set
> by the former.

> While zeros are the right answers for a schema-only dump, it defeated
> the purpose of including relpages/reltuples stats in the dump, and
> caused the pg_upgrade TAP test to fail.

> You're right that there are a number of ways this could be resolved --
> I don't think it's an inherent problem.

I'm inclined to call it not a problem at all. While I do agree there
are use-cases for injecting false statistics with these functions, I do
not think that pg_dump has to cater to such use-cases. In any case, I
remain of the opinion that stats are data and should not be included in
a -s dump (with some sort of exception for pg_upgrade). If the data has
been loaded, then a subsequent overwrite by CREATE INDEX should not be
a problem.

regards, tom lane
Re: Statistics Import and Export
Reality check --- are we still targeting this feature for PG 17? -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Only you can decide what is important to you.
Re: Statistics Import and Export
On Sat, 2024-03-30 at 20:08 -0400, Tom Lane wrote: > I haven't looked at the details, but I'm really a bit surprised > by Jeff's assertion that CREATE INDEX destroys statistics on the > base table. That seems wrong from here, and maybe something we > could have it not do. (I do realize that it recalculates reltuples > and relpages, but so what? If it updates those, the results should > be perfectly accurate.) In the v15 of the patch I was looking at, "pg_dump -s" included the statistics. The stats appeared first in the dump, followed by the CREATE INDEX commands. The latter overwrote the relpages/reltuples set by the former. While zeros are the right answers for a schema-only dump, it defeated the purpose of including relpages/reltuples stats in the dump, and caused the pg_upgrade TAP test to fail. You're right that there are a number of ways this could be resolved -- I don't think it's an inherent problem. Regards, Jeff Davis
Re: Table AM Interface Enhancements
On Mon, Apr 1, 2024 at 7:36 PM Tom Lane wrote:
>
> Coverity complained about what you did in RelationParseRelOptions
> in c95c25f9a:
>
> *** CID 1595992: Null pointer dereferences (FORWARD_NULL)
> /srv/coverity/git/pgsql-git/postgresql/src/backend/utils/cache/relcache.c: 499 in RelationParseRelOptions()
> 493
> 494 	/*
> 495 	 * Fetch reloptions from tuple; have to use a hardwired descriptor because
> 496 	 * we might not have any other for pg_class yet (consider executing this
> 497 	 * code for pg_class itself)
> 498 	 */
> >>> CID 1595992: Null pointer dereferences (FORWARD_NULL)
> >>> Passing null pointer "tableam" to "extractRelOptions", which dereferences it.
> 499 	options = extractRelOptions(tuple, GetPgClassDescriptor(),
> 500 								tableam, amoptsfn);
> 501
>
> I see that extractRelOptions only uses the tableam argument for some
> relkinds, and RelationParseRelOptions does set it up for those
> relkinds --- but Coverity's complaint isn't without merit, because
> those two switch statements are looking at *different copies of the
> relkind*, which in theory could be different. This all seems quite
> messy and poorly factored. Can't we do better? Why do we need to
> involve two copies of allegedly the same pg_class tuple, anyhow?

Thank you for reporting this, Tom. I'm planning to investigate this
later today.

--
Regards,
Alexander Korotkov
Re: Table AM Interface Enhancements
Coverity complained about what you did in RelationParseRelOptions
in c95c25f9a:

*** CID 1595992: Null pointer dereferences (FORWARD_NULL)
/srv/coverity/git/pgsql-git/postgresql/src/backend/utils/cache/relcache.c: 499 in RelationParseRelOptions()
493
494 	/*
495 	 * Fetch reloptions from tuple; have to use a hardwired descriptor because
496 	 * we might not have any other for pg_class yet (consider executing this
497 	 * code for pg_class itself)
498 	 */
>>> CID 1595992: Null pointer dereferences (FORWARD_NULL)
>>> Passing null pointer "tableam" to "extractRelOptions", which dereferences it.
499 	options = extractRelOptions(tuple, GetPgClassDescriptor(),
500 								tableam, amoptsfn);
501

I see that extractRelOptions only uses the tableam argument for some
relkinds, and RelationParseRelOptions does set it up for those
relkinds --- but Coverity's complaint isn't without merit, because
those two switch statements are looking at *different copies of the
relkind*, which in theory could be different. This all seems quite
messy and poorly factored. Can't we do better? Why do we need to
involve two copies of allegedly the same pg_class tuple, anyhow?

regards, tom lane
Re: Psql meta-command conninfo+
Hello Yeah, that's what I meant about the translations, thanks. Two more comments, - You don't need a two-level conditional here if (pset.sversion >= 90500) { if (pset.sversion < 14) appendPQExpBuffer(, " ssl.ssl AS \"%s\",\n" " ssl.version AS \"%s\",\n" " ssl.cipher AS \"%s\",\n" " ssl.compression AS \"%s\",\n", _("SSL Connection"), _("SSL Protocol"), _("SSL Cipher"), _("SSL Compression")); if (pset.sversion >= 14) appendPQExpBuffer(, " ssl.ssl AS \"%s\",\n" " ssl.version AS \"%s\",\n" " ssl.cipher AS \"%s\",\n" " NULL AS \"%s\",\n", _("SSL Connection"), _("SSL Protocol"), _("SSL Cipher"), _("SSL Compression")); } else appendPQExpBuffer(, " NULL AS \"%s\",\n" " NULL AS \"%s\",\n" " NULL AS \"%s\",\n" " NULL AS \"%s\",\n", _("SSL Connection"), _("SSL Protocol"), _("SSL Cipher"), _("SSL Compression")); Instead you can just do something like if (pset.version >= 14) one thing; else if (pset.version > 90500) second thing; else third thing; This also appears where you add the GSSAPI columns; and it's also in the final part where you append the FROM clause, though it looks a bit different there. - You have three lines to output a semicolon at the end of the query based on version number. Remove the first two, and just have a final one where the semicolon is added unconditionally. - I don't think one for each item in the docs is reasonable. There's too much vertical whitespace in the output. Maybe do this instead: [...] database connection. When + is appended, more details about the connection are displayed in table format: Database: The name of the current database on this connection. Authenticated User: The authenticated user at the time of psql connection with the server. ... - This text is wrong to start with "Returns the": System User: Returns the authentication method and the identity (if any) that the user presented during the authentication cycle before they were assigned a database role. 
It is represented as auth_method:identity or NULL if the user has not been authenticated. That minor point aside, I disagree with Sami about repeating the docs for system_user() here. I would just say "The authentication data provided for this connection; see the function system_user() for more details." with a link to the appropriate section of the docs. Making us edit this doc if we ever modify the behavior of the function is not great. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "We're here to devour each other alive" (Hobbes)
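[As an aside, the flattened conditional suggested earlier in this thread can be sketched standalone. This is only an illustration with invented names, not psql's actual code: "server_version_num" follows the PG numeric convention (140000 for v14, 90500 for 9.5), and the returned strings stand in for the SQL fragments psql would append with appendPQExpBuffer.]

```c
/*
 * Hedged sketch of the if / else if / else chain suggested above,
 * replacing the two-level conditional.  Names are illustrative only.
 */
static const char *
ssl_compression_expr(int server_version_num)
{
	if (server_version_num >= 140000)
		return "NULL";				/* column removed in v14, emit NULL */
	else if (server_version_num >= 90500)
		return "ssl.compression";	/* pg_stat_ssl exists since 9.5 */
	else
		return "NULL";				/* no pg_stat_ssl view before 9.5 */
}
```

The same shape would apply to the GSSAPI columns and the FROM-clause assembly mentioned above.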
Re: psql's FETCH_COUNT (cursor) is not being respected for CTEs
Laurenz Albe wrote: > I had a look at patch 0001 (0002 will follow). Thanks for reviewing this! I've implemented the suggested doc changes. A patch update will follow with the next part of the review. > > --- a/src/interfaces/libpq/fe-exec.c > > +++ b/src/interfaces/libpq/fe-exec.c > > @@ -41,7 +41,8 @@ char *const pgresStatus[] = { > > "PGRES_COPY_BOTH", > > "PGRES_SINGLE_TUPLE", > > "PGRES_PIPELINE_SYNC", > > - "PGRES_PIPELINE_ABORTED" > > + "PGRES_PIPELINE_ABORTED", > > + "PGRES_TUPLES_CHUNK" > > }; > > I think that PGRES_SINGLE_TUPLE and PGRES_TUPLES_CHUNK should be next to > each other, but that's no big thing. > The same applies to the change in src/interfaces/libpq/libpq-fe.h I assume we can't renumber/reorder existing values, otherwise it would be an ABI break. We can only add new values. > I understand that we need to keep the single-row mode for compatibility > reasons. But I think that under the hood, "single-row mode" should be the > same as "chunk mode with chunk size one". I've implemented it like that at first, and wasn't thrilled with the result. libpq still has to return PGRES_SINGLE_TUPLE in single-row mode and PGRES_TUPLES_CHUNK with chunks of size 1, so the mutualization did not work that well in practice. I also contemplated not creating PGRES_TUPLES_CHUNK and instead using PGRES_SINGLE_TUPLE for N rows, but I found it too ugly. Best regards, -- Daniel Vérité https://postgresql.verite.pro/ Twitter: @DanielVerite
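[The ABI constraint mentioned above can be illustrated with a toy enum. This is not libpq's actual header, just a sketch of why existing result-status values cannot be renumbered and new ones like a chunk-mode status can only be appended.]

```c
/*
 * Toy illustration: applications compiled against the old enum have the
 * numeric values baked into their binaries, so existing members must keep
 * their positions and new members can only go at the end.
 */
typedef enum
{
	DEMO_TUPLES_OK = 0,
	DEMO_SINGLE_TUPLE,			/* existing: must stay at value 1 */
	DEMO_PIPELINE_ABORTED,		/* existing: must stay at value 2 */
	DEMO_TUPLES_CHUNK			/* new: appended, old values unchanged */
} DemoExecStatus;
```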
Re: Combine Prune and Freeze records emitted by vacuum
On Mon, Apr 01, 2024 at 05:17:51PM +0300, Heikki Linnakangas wrote: > On 30/03/2024 07:57, Melanie Plageman wrote: > > On Fri, Mar 29, 2024 at 12:32:21PM -0400, Melanie Plageman wrote: > > > On Fri, Mar 29, 2024 at 12:00 PM Heikki Linnakangas > > > wrote: > > > > Here's another idea: In the first loop through the offsets, where we > > > > gather the HTSV status of each item, also collect the offsets of all HOT > > > > and non-HOT items to two separate arrays. Call heap_prune_chain() for > > > > all the non-HOT items first, and then process any remaining HOT tuples > > > > that haven't been marked yet. > > > > > > That's an interesting idea. I'll try it out and see how it works. > > > > Attached v10 implements this method of dividing tuples into HOT and > > non-HOT and processing the potential HOT chains first then processing > > tuples not marked by calling heap_prune_chain(). > > > > I have applied the refactoring of heap_prune_chain() to master and then > > built the other patches on top of that. > > Committed some of the changes. Continuing to reviewing the rest. Thanks! > > The early patches in the set include some additional comment cleanup as > > well. 0001 is fairly polished. 0004 could use some variable renaming > > (this patch partitions the tuples into HOT and not HOT and then > > processes them). I was struggling with some of the names here > > (chainmembers and chaincandidates is confusing). > > I didn't understand why you wanted to juggle both partitions in the same > array. So I separated them into two arrays, and called them 'root_items' and > 'heaponly_items'. I thought it was worth it to save the space. And the algorithm for doing it seemed pretty straightforward. But looking at your patch, it is a lot easier to understand with two arrays (since, for example, they can each have a name). > In some micro-benchmarks, the order that the items were processed made a > measurable difference. So I'm processing the items in the reverse order. 
> That roughly matches the order the items are processed in master, as it > iterates the offsets from high-to-low in the first loop, and low-to-high in > the second loop. This makes sense. I noticed there isn't a comment about this above the loop. It might be worth it to mention it. Below is a review of only 0001. I'll look at the others shortly. > From a6ab891779876e7cc1b4fb6fddb09f52f0094646 Mon Sep 17 00:00:00 2001 > From: Heikki Linnakangas > Date: Mon, 1 Apr 2024 16:59:38 +0300 > Subject: [PATCH v11 1/7] Handle non-chain tuples outside of heap_prune_chain() > --- > src/backend/access/heap/pruneheap.c | 264 +--- > 1 file changed, 166 insertions(+), 98 deletions(-) > > diff --git a/src/backend/access/heap/pruneheap.c > b/src/backend/access/heap/pruneheap.c > @@ -256,15 +270,16 @@ heap_page_prune(Relation relation, Buffer buffer, > tup.t_tableOid = RelationGetRelid(relation); > > /* > - * Determine HTSV for all tuples. > + * Determine HTSV for all tuples, and queue them up for processing as > HOT > + * chain roots or as a heap-only items. Reading this comment now as a whole, I would add something like "Determining HTSV for all tuples once is required for correctness" to the start of the second paragraph. The new conjunction on the first paragraph sentence followed by the next paragraph is a bit confusing because it sounds like queuing them up for processing is required for correctness (which, I suppose it is on some level). Basically, I'm just saying that it is now less clear what is required for correctness. >* This is required for correctness to deal with cases where running > HTSV >* twice could result in different results (e.g. RECENTLY_DEAD can turn > to >* DEAD if another checked item causes GlobalVisTestIsRemovableFullXid() >* to update the horizon, INSERT_IN_PROGRESS can change to DEAD if the > - * inserting transaction aborts, ...).
That in turn could cause > - * heap_prune_chain() to behave incorrectly if a tuple is reached twice, > - * once directly via a heap_prune_chain() and once following a HOT > chain. > + * inserting transaction aborts, ...). VACUUM assumes that there are no > + * normal DEAD tuples left on the page after pruning, so it needs to > have > + * the same understanding of what is DEAD and what is not. >* >* It's also good for performance. Most commonly tuples within a page > are >* stored at decreasing offsets (while the items are stored at > increasing > @@ -282,52 +297,140 @@ heap_page_prune(Relation relation, Buffer buffer, > + /* > + * Process any heap-only tuples that were not already processed as part > of > + * a HOT chain. > + */ While I recognize this is a matter of style and not important, I personally prefer this for reverse looping: for (int i = prstate.nheaponly_items; i --> 0;) I do think a comment about the
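[A self-contained sketch of the root/heap-only partitioning discussed in this thread, with the reverse iteration applied to the heap-only array. All names are invented for illustration; the real code in pruneheap.c operates on page item offsets and HTSV results, not plain ints.]

```c
#define MAX_ITEMS 16

/* Invented stand-in for the pruning state's two item arrays. */
typedef struct
{
	int			root_items[MAX_ITEMS];
	int			nroot_items;
	int			heaponly_items[MAX_ITEMS];
	int			nheaponly_items;
} demo_prune_state;

/*
 * One pass over the offsets splits them into HOT-chain roots and
 * heap-only items, mirroring the two-array approach described above.
 * "is_heaponly" stands in for the real HOT-chain membership test.
 */
static void
demo_partition(demo_prune_state *ps, const int *offsets,
			   const int *is_heaponly, int n)
{
	ps->nroot_items = 0;
	ps->nheaponly_items = 0;
	for (int i = 0; i < n; i++)
	{
		if (is_heaponly[i])
			ps->heaponly_items[ps->nheaponly_items++] = offsets[i];
		else
			ps->root_items[ps->nroot_items++] = offsets[i];
	}
}

/* Process heap-only items highest-offset-first (reverse iteration). */
static int
demo_process_heaponly_reversed(const demo_prune_state *ps, int *out)
{
	int			n = 0;

	for (int i = ps->nheaponly_items; i-- > 0;)
		out[n++] = ps->heaponly_items[i];
	return n;
}
```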
Re: psql not responding to SIGINT upon db reconnection
On Mon Mar 25, 2024 at 1:44 PM CDT, Robert Haas wrote: On Fri, Mar 22, 2024 at 4:58 PM Tristan Partin wrote: > I had a question about parameter naming. Right now I have a mix of > camel-case and snake-case in the function signature since that is what > I inherited. Should I change that to be consistent? If so, which case > would you like? Uh... PostgreSQL is kind of the wild west in that regard. The thing to do is look for nearby precedents, but that doesn't help much here because in the very same file, libpq-fe.h, we have: extern int PQsetResultAttrs(PGresult *res, int numAttributes, PGresAttDesc *attDescs); extern int PQsetvalue(PGresult *res, int tup_num, int field_num, char *value, int len); Since the existing naming is consistent with one of those two styles, I'd probably just leave it be. + The function returns a value greater than 0 if the specified condition + is met, 0 if a timeout occurred, or -1 if an error + or interrupt occurred. In the event forRead and We either need to tell people how to find out which error it was, or if that's not possible and we can't reasonably make it possible, we need to tell them why they shouldn't care. Because there's nothing more delightful than someone who shows up and says "hey, I tried to do XYZ, and I got an error," as if that were sufficient information for me to do something useful. + end_time is the time in the future in seconds starting from the UNIX + epoch in which you would like the function to return if the condition is not met. This sentence seems a bit contorted to me, like maybe Yoda wrote it. I was about to try to rephrase it and maybe split it in two when I wondered why we need to document how time_t works at all. Can't we just say something like "If end_time is not -1, it specifies the time at which this function should stop waiting for the condition to be met" -- and maybe move it to the end of the first paragraph, so it's before where we list the meanings of the return values? 
Incorporated feedback, I have :). -- Tristan Partin Neon (https://neon.tech) From 14c794d9699f7b9f27d1a4ec026f748c6b7d8853 Mon Sep 17 00:00:00 2001 From: Tristan Partin Date: Tue, 30 Jan 2024 15:40:57 -0600 Subject: [PATCH v11 1/2] Expose PQsocketPoll for frontends PQsocketPoll should help with state machine processing when connecting to a database asynchronously via PQconnectStart(). --- doc/src/sgml/libpq.sgml | 40 +++- src/interfaces/libpq/exports.txt | 1 + src/interfaces/libpq/fe-misc.c | 7 +++--- src/interfaces/libpq/libpq-fe.h | 4 4 files changed, 47 insertions(+), 5 deletions(-) diff --git a/doc/src/sgml/libpq.sgml b/doc/src/sgml/libpq.sgml index d3e87056f2c..e69feacfe6a 100644 --- a/doc/src/sgml/libpq.sgml +++ b/doc/src/sgml/libpq.sgml @@ -262,6 +262,41 @@ PGconn *PQsetdb(char *pghost, + + PQsocketPollPQsocketPoll + + + nonblocking connection + Poll a connections underlying socket descriptor retrieved with . + +int PQsocketPoll(int sock, int forRead, int forWrite, time_t end_time); + + + + + This function sets up polling of a file descriptor. The underlying function is either + poll(2) or select(2), depending on platform + support. The primary use of this function is iterating through the connection sequence + described in the documentation of . If + forRead is specified, the function waits for the socket to be ready + for reading. If forWrite is specified, the function waits for the + socket to be ready for write. See POLLIN and POLLOUT + from poll(2), or readfds and + writefds from select(2) for more information. If + end_time is not -1, it specifies the time at which + this function should stop waiting for the condition to be met. + + + + The function returns a value greater than 0 if the specified condition + is met, 0 if a timeout occurred, or -1 if an error + occurred. The error can be retrieved by checking the errno(3) value. In + the event forRead and forWrite are not set, the + function immediately returns a timeout condition. 
+ + + + PQconnectStartParamsPQconnectStartParams PQconnectStartPQconnectStart @@ -358,7 +393,10 @@ PostgresPollingStatusType PQconnectPoll(PGconn *conn); Loop thus: If PQconnectPoll(conn) last returned PGRES_POLLING_READING, wait until the socket is ready to read (as indicated by select(), poll(), or - similar system function). + similar system function). Note that PQsocketPoll + can help reduce boilerplate by abstracting the setup of + select(2) or poll(2) if it is + available on your system. Then call PQconnectPoll(conn) again. Conversely, if PQconnectPoll(conn) last returned PGRES_POLLING_WRITING, wait until the socket is ready diff --git
Re: [plpython] Add missing volatile qualifier.
Xing Guo writes: > I'm playing a toy static analysis checker with PostgreSQL and found a > variable is missing volatile qualifier. Good catch! It looks like the consequences of a failure would be pretty minimal --- AFAICS, no worse than a possible failure to remove a refcount on Py_None --- but that's still a bug. I don't care for your proposed fix though. I think the real problem here is schizophrenia about when to set up pltargs, and we could fix it more nicely as attached. (Perhaps the Asserts are overkill though.) regards, tom lane diff --git a/src/pl/plpython/plpy_exec.c b/src/pl/plpython/plpy_exec.c index e06fde1dd9..3145c69699 100644 --- a/src/pl/plpython/plpy_exec.c +++ b/src/pl/plpython/plpy_exec.c @@ -689,7 +689,7 @@ PLy_trigger_build_args(FunctionCallInfo fcinfo, PLyProcedure *proc, HeapTuple *r *pltrelid, *plttablename, *plttableschema, - *pltargs = NULL, + *pltargs, *pytnew, *pytold, *pltdata; @@ -713,6 +713,11 @@ PLy_trigger_build_args(FunctionCallInfo fcinfo, PLyProcedure *proc, HeapTuple *r return NULL; } } + else + { + Py_INCREF(Py_None); + pltargs = Py_None; + } PG_TRY(); { @@ -856,7 +861,7 @@ PLy_trigger_build_args(FunctionCallInfo fcinfo, PLyProcedure *proc, HeapTuple *r PyObject *pltarg; /* pltargs should have been allocated before the PG_TRY block. */ - Assert(pltargs); + Assert(pltargs && pltargs != Py_None); for (i = 0; i < tdata->tg_trigger->tgnargs; i++) { @@ -870,8 +875,7 @@ PLy_trigger_build_args(FunctionCallInfo fcinfo, PLyProcedure *proc, HeapTuple *r } else { - Py_INCREF(Py_None); - pltargs = Py_None; + Assert(pltargs == Py_None); } PyDict_SetItemString(pltdata, "args", pltargs); Py_DECREF(pltargs);
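[The hazard behind both versions of the fix can be shown with a minimal standalone setjmp/longjmp example, which is what PG_TRY/PG_CATCH are built on: a local variable modified between setjmp() and longjmp() has an indeterminate value after the jump unless it is qualified volatile. This is only an illustration, not the plpython code.]

```c
#include <setjmp.h>

static jmp_buf demo_env;

/*
 * Minimal illustration of the pltargs problem: without the volatile
 * qualifier, "x" could legally hold a stale (register-cached) value
 * after longjmp() returns control to the setjmp() call.
 */
static int
volatile_demo(void)
{
	volatile int x = 0;			/* drop volatile and this becomes undefined */

	if (setjmp(demo_env) == 0)
	{
		x = 42;					/* modified between setjmp and longjmp */
		longjmp(demo_env, 1);	/* models an error raised inside PG_TRY */
	}
	return x;					/* well-defined only because x is volatile */
}
```

Tom's version sidesteps the issue for pltargs by assigning it only before PG_TRY, so no qualifier is needed at all.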
Re: Popcount optimization using AVX512
On Mon, Apr 01, 2024 at 01:06:12PM +0200, Alvaro Herrera wrote: > On 2024-Mar-31, Nathan Bossart wrote: >> +popcnt = _mm512_reduce_add_epi64(accum); >> +return popcnt + pg_popcount_fast(buf, bytes); > > Hmm, doesn't this arrangement cause an extra function call to > pg_popcount_fast to be used here? Given the level of micro-optimization > being used by this code, I would have thought that you'd have tried to > avoid that. (At least, maybe avoid the call if bytes is 0, no?) Yes, it does. I did another benchmark on very small arrays and can see the overhead. This is the time in milliseconds to run pg_popcount() on an array 1 billion times:

    size (bytes)    HEAD        AVX512-POPCNT
    1               1707.685    3480.424
    2               1926.694    4606.182
    4               3210.412    5284.506
    8               1920.703    3640.968
    16              2936.910    4045.586
    32              3627.956    5538.418
    64              5347.213    3748.212

I suspect that anything below 64 bytes will see this regression, as that is the earliest point where there are enough bytes for ZMM registers. We could avoid the call if there are no remaining bytes, but the numbers for the smallest arrays probably wouldn't improve much, and that might actually add some overhead due to branching. The other option to avoid this overhead is to put most of pg_bitutils.c into its header file so that we can inline the call. Reviewing the current callers of pg_popcount(), IIUC the only ones that are passing very small arrays are the bit_count() implementations and a call in the syslogger for a single byte. I don't know how much to worry about the overhead for bit_count() since there's presumably a bunch of other overhead, and the syslogger one could probably be fixed via an inline function that pulled the value from pg_number_of_ones (which would probably be an improvement over the status quo, anyway). But this is all to save a couple of nanoseconds... -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
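[For reference, the small-input fallback behind the inlining idea can be sketched standalone. This is an illustration with an invented name, not pg_bitutils.c itself: for buffers under the 64-byte SIMD threshold discussed above, a plain per-byte loop (or a 256-entry lookup table like pg_number_of_ones) avoids any dispatch overhead.]

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hedged sketch of a scalar popcount for small buffers, using
 * Kernighan's trick (clearing the lowest set bit until zero).
 * An always-inline version of something like this is what the
 * text suggests for callers such as bit_count() and the
 * syslogger's single-byte case.
 */
static uint64_t
popcount_small(const char *buf, size_t bytes)
{
	uint64_t	cnt = 0;

	for (size_t i = 0; i < bytes; i++)
	{
		unsigned char b = (unsigned char) buf[i];

		while (b)
		{
			b &= (unsigned char) (b - 1);	/* clear lowest set bit */
			cnt++;
		}
	}
	return cnt;
}
```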
Re: Statistics Import and Export
Corey Huinker writes: >> IIRC, "variadic any" requires having at least one variadic parameter. >> But that seems fine --- what would be the point, or even the >> semantics, of calling pg_set_attribute_stats with no data fields? > If my pg_dump run emitted a bunch of stats that could never be imported, > I'd want to know. With silent failures, I don't. What do you think would be silent about that? If there's a complaint to be made, it's that it'd be a hard failure ("no such function"). To be clear, I'm ok with emitting ERROR for something that pg_dump clearly did wrong, which in this case would be emitting a set_statistics call for an attribute it had exactly no stats values for. What I think needs to be WARN is conditions that the originating pg_dump couldn't have foreseen, for example cross-version differences. If we do try to check things like sort order, that complaint obviously has to be WARN, since it's checking something potentially different from what was correct at the source server. >> Perhaps we could >> invent a new backend function that extracts the actual element type >> of a non-null anyarray argument. > A backend function that we can't guarantee exists on the source system. :( [ shrug... ] If this doesn't work for source servers below v17, that would be a little sad, but it wouldn't be the end of the world. I see your point that that is an argument for finding another way, though. >> Another way we could get to no-coercions is to stick with your >> signature but declare the relevant parameters as anyarray instead of >> text. > I'm a bit confused here. AFAIK we can't construct an anyarray in SQL: > # select '{1,2,3}'::anyarray; > ERROR: cannot accept a value of type anyarray That's not what I suggested at all. The function parameters would be declared anyarray, but the values passed to them would be coerced to the correct concrete array types. So as far as the coercion rules are concerned this'd be equivalent to the variadic-any approach. 
> That's pretty persuasive. It also means that we need to trap for error in > the array_in() calls, as that function does not yet have a _safe() mode. Well, the approach I'm advocating for would have the array input and coercion done by the calling query before control ever reaches pg_set_attribute_stats, so that any incorrect-for-the-data-type values would result in hard errors. I think that's okay for the same reason you probably figured you didn't have to trap array_in: it's the fault of the originating pg_dump if it offers a value that doesn't coerce to the datatype it claims the value is of. My formulation is a bit safer though in that it's the originating pg_dump, not the receiving server, that is in charge of saying which type that is. (If that type doesn't agree with what the receiving server thinks it should be, that's a condition that pg_set_attribute_stats itself will detect, and then it can WARN and move on to the next thing.) regards, tom lane
RE: Psql meta-command conninfo+
Hi! (v24) > I meant those columns to be just examples, but this should be applied to > all the other columns in the output as well. Applied the translation to all columns. Regards, Maiquel Grassi. v24-0001-psql-meta-command-conninfo-plus.patch
Re: Allow non-superuser to cancel superuser tasks.
I took the liberty of continuing to work on this after chatting with Nathan. I've attached the updated patch with some improvements. Thanks. -- Anthony Leung Amazon Web Services: https://aws.amazon.com v1-0001-Introduce-pg_signal_autovacuum-role-to-allow-non-sup.patch
Re: Security lessons from liblzma
On Apr 1, 2024, at 06:55, walt...@technowledgy.de wrote: > Also a configurable directoy to look up extensions, possibly even to be > changed at run-time like [2]. The patch says this: > >> This directory is prepended to paths when loading extensions (control and >> SQL files), and to the '$libdir' directive when loading modules that back >> functions. The location is made configurable to allow build-time testing of >> extensions that do not have been installed to their proper location yet. > > This seems like a great thing to have. This might also be relevant in light > of recent discussions in the ecosystem around extension management. > > All the path-related issues have in common, that while it's easy to move > files around to their proper locations later, they all need to adjust > pg_config's output. Funny timing, I was planning to resurrect this old patch[1] and propose that patch this week. One of motivators is the increasing use of Docker images in Kubernetes to run Postgres, where there’s a desire to keep the core service and extensions immutable, and to have a second directory mounted to a persistent volume into which other extensions can be installed and preserved independently of the Docker image. The current approach involves symlinking shenanigans[2] that complicate things pretty substantially, making it more difficult to administer. A second directory fit for purpose would be far better. There are some other motivators, so I’ll do some additional diligence and start a separate thread (or reply to the original[3]). Best, David [1] https://commitfest.postgresql.org/5/170/ [2] https://speakerdeck.com/ongres/postgres-extensions-in-kubernetes?slide=14 [3] https://www.postgresql.org/message-id/flat/51AE0845.8010600%40ocharles.org.uk
Re: Popcount optimization using AVX512
On 2024-Mar-31, Nathan Bossart wrote: > +uint64 > +pg_popcount_avx512(const char *buf, int bytes) > +{ > + uint64 popcnt; > + __m512i accum = _mm512_setzero_si512(); > + > + for (; bytes >= sizeof(__m512i); bytes -= sizeof(__m512i)) > + { > + const __m512i val = _mm512_loadu_si512((const __m512i > *) buf); > + const __m512i cnt = _mm512_popcnt_epi64(val); > + > + accum = _mm512_add_epi64(accum, cnt); > + buf += sizeof(__m512i); > + } > + > + popcnt = _mm512_reduce_add_epi64(accum); > + return popcnt + pg_popcount_fast(buf, bytes); > +} Hmm, doesn't this arrangement cause an extra function call to pg_popcount_fast to be used here? Given the level of micro-optimization being used by this code, I would have thought that you'd have tried to avoid that. (At least, maybe avoid the call if bytes is 0, no?) -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "El Maquinismo fue proscrito so pena de cosquilleo hasta la muerte" (Ijon Tichy en Viajes, Stanislaw Lem)
Re: Psql meta-command conninfo+
On 2024-Mar-31, Maiquel Grassi wrote: > Yes Álvaro, I have already appointed you as the patch reviewer. > It's true, even before publishing it on Commifest, you have > already provided good ideas and guidance. Thanks. > I adjusted the translation syntax for the SSL and GSSAPI columns. > After your validation, that is, an okay confirmation that it's fine, I'll > proceed with the others. I meant those columns to be just examples, but this should be applied to all the other columns in the output as well. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "No necesitamos banderas No reconocemos fronteras" (Jorge González)
[plpython] Add missing volatile qualifier.
Hi hackers, I'm playing a toy static analysis checker with PostgreSQL and found a variable is missing volatile qualifier. Best Regards, Xing From 7084055da64d0433b09e50faff630e551c363f27 Mon Sep 17 00:00:00 2001 From: Xing Guo Date: Mon, 1 Apr 2024 21:39:04 +0800 Subject: [PATCH v1] Add missing volatile qualifier. The object pltargs is modified in the PG_TRY block (line 874) and used in the PG_CATCH block (line 881) which should be qualified with volatile. --- src/pl/plpython/plpy_exec.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/pl/plpython/plpy_exec.c b/src/pl/plpython/plpy_exec.c index e06fde1dd9..cd71082a5f 100644 --- a/src/pl/plpython/plpy_exec.c +++ b/src/pl/plpython/plpy_exec.c @@ -689,7 +689,7 @@ PLy_trigger_build_args(FunctionCallInfo fcinfo, PLyProcedure *proc, HeapTuple *r *pltrelid, *plttablename, *plttableschema, - *pltargs = NULL, + *volatile pltargs = NULL, *pytnew, *pytold, *pltdata; -- 2.44.0
Re: Improve tab completion for ALTER DEFAULT PRIVILEGE and ALTER TABLE
On Thu, 28 Mar 2024 at 13:05, Masahiko Sawada wrote: > > Hi, > > Thank you for the patch! > > On Mon, Jul 3, 2023 at 12:12 AM vignesh C wrote: > > > > Hi, > > > > Improved tab completion for "ALTER DEFAULT PRIVILEGE" and "ALTER TABLE": > > 1) GRANT, REVOKE and FOR USER keyword was not displayed in tab > > completion of alter default privileges like the below statement: > > ALTER DEFAULT PRIVILEGES GRANT INSERT ON tables TO PUBLIC; > > ALTER DEFAULT PRIVILEGES REVOKE INSERT ON tables FROM PUBLIC; > > ALTER DEFAULT PRIVILEGES FOR USER vignesh revoke INSERT ON tables FROM dba1; > > +1 > > > > > 2) USER was not displayed for "ALTER DEFAULT PRIVILEGES IN SCHEMA > > public FOR " like in below statement: > > ALTER DEFAULT PRIVILEGES IN SCHEMA public FOR USER dba1 GRANT INSERT > > ON TABLES TO PUBLIC; > > Since there is no difference FOR USER and FOR ROLE, I'm not sure we > really want to support both in tab-completion. I have removed this change > > > > 3) "FOR GRANT OPTION" was not display for "ALTER DEFAULT PRIVILEGES > > REVOKE " like in below statement: > > alter default privileges revoke grant option for select ON tables FROM dba1; > > +1. 
But the v3 patch doesn't cover the following case: > > =# alter default privileges for role masahiko revoke [tab] > ALL CREATE DELETE EXECUTE INSERT MAINTAIN > REFERENCES SELECT TRIGGER TRUNCATE UPDATE USAGE Modified in the updated patch > And it doesn't cover MAINTAIN neither: > > =# alter default privileges revoke [tab] > ALL DELETE GRANT OPTION FOR REFERENCES > TRIGGER UPDATE > CREATE EXECUTE INSERT SELECT > TRUNCATE USAGE Modified in the updated patch > The patch adds the completions for ALTER DEFAULT PRIVILEGES REVOKE, > but we handle such case in GRANT and REVOKE part: > > (around L3958) > /* > * With ALTER DEFAULT PRIVILEGES, restrict completion to grantable > * privileges (can't grant roles) > */ > if (HeadMatches("ALTER", "DEFAULT", "PRIVILEGES")) > COMPLETE_WITH("SELECT", "INSERT", "UPDATE", > "DELETE", "TRUNCATE", "REFERENCES", "TRIGGER", > "CREATE", "EXECUTE", "USAGE", "MAINTAIN", "ALL"); The current patch handles the fix from here now. > Also, I think we can support WITH GRANT OPTION too. For example, > > =# alter default privileges for role masahiko grant all on tables to > public [tab] I have handled this in the updated patch > It's already supported in the GRANT statement. > > > > 4) "DATA TYPE" was missing in "ALTER TABLE table-name ALTER COLUMN > > column-name SET" like in: > > ALTER TABLE t1 ALTER COLUMN c1 SET DATA TYPE text; > > > > +1. The patch looks good to me, so pushed. Thanks for committing this. The updated patch has the changes for the above comments.
Regards, Vignesh From 7c3bbfb0c8c17bf5d4a7e2ce579d00b811844a06 Mon Sep 17 00:00:00 2001 From: Vignesh C Date: Sun, 2 Jul 2023 19:35:46 +0530 Subject: [PATCH v4] Fix missing tab completion in "ALTER DEFAULT PRIVILEGES" GRANT, REVOKE and FOR USER keyword was not displayed in tab completion of "ALTER DEFAULT PRIVILEGES" like the below statements: ALTER DEFAULT PRIVILEGES GRANT INSERT ON tables TO PUBLIC; ALTER DEFAULT PRIVILEGES REVOKE INSERT ON tables FROM PUBLIC; ALTER DEFAULT PRIVILEGES FOR USER testdba REVOKE INSERT ON tables FROM testdba; "GRANT OPTION OPTION" was not displayed in tab completion of "ALTER DEFAULT PRIVILEGES REVOKE " like in below statement: ALTER DEFAULT PRIVILEGES REVOKE GRANT OPTION FOR SELECT ON tables FROM testdba; "WITH GRANT OPTION" was not completed in tab completion of "ALTER DEFAULT PRIVILEGES ... GRANT ... to role" like in below statement: ALTER DEFAULT PRIVILEGES FOR ROLE testdba1 GRANT ALL ON tables to testdba2 WITH GRANT OPTION; --- src/bin/psql/tab-complete.c | 20 1 file changed, 16 insertions(+), 4 deletions(-) diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c index 82eb3955ab..620628948a 100644 --- a/src/bin/psql/tab-complete.c +++ b/src/bin/psql/tab-complete.c @@ -2147,7 +2147,7 @@ psql_completion(const char *text, int start, int end) /* ALTER DEFAULT PRIVILEGES */ else if (Matches("ALTER", "DEFAULT", "PRIVILEGES")) - COMPLETE_WITH("FOR ROLE", "IN SCHEMA"); + COMPLETE_WITH("FOR", "GRANT", "IN SCHEMA", "REVOKE"); /* ALTER DEFAULT PRIVILEGES FOR */ else if (Matches("ALTER", "DEFAULT", "PRIVILEGES", "FOR")) COMPLETE_WITH("ROLE"); @@ -3949,9 +3949,17 @@ psql_completion(const char *text, int start, int end) * privileges (can't grant roles) */ if (HeadMatches("ALTER", "DEFAULT", "PRIVILEGES")) - COMPLETE_WITH("SELECT", "INSERT", "UPDATE", - "DELETE", "TRUNCATE", "REFERENCES", "TRIGGER", - "CREATE", "EXECUTE", "USAGE", "MAINTAIN", "ALL"); + { + if (TailMatches("GRANT")) +COMPLETE_WITH("SELECT", "INSERT", "UPDATE", +
Re: Synchronizing slots from primary to standby
Hi, On Mon, Apr 01, 2024 at 05:04:53PM +0530, Amit Kapila wrote: > On Mon, Apr 1, 2024 at 2:51 PM Bertrand Drouvot > wrote: > > Then there is no need to call WaitForStandbyConfirmation() as it could go > > until > > the RecoveryInProgress() in StandbySlotsHaveCaughtup() for nothing (as we > > already > > know it). > > > > Won't it will normally return from the first check in > WaitForStandbyConfirmation() because standby_slot_names_config is not > set on standby? I think standby_slot_names can be set on a standby. One could want to set it in a cascading standby env (though it won't have any real effects until the standby is promoted). > > > 2 === > > > > + { > > + if (SnapBuildSnapshotExists(remote_slot->restart_lsn)) > > + { > > > > That could call SnapBuildSnapshotExists() multiple times for the same > > "restart_lsn" (for example in case of multiple remote slots to sync). > > > > What if the sync worker records the last lsn it asks for serialization (and > > serialized ? Then we could check that value first before deciding to call > > (or not) > > SnapBuildSnapshotExists() on it? > > > > It's not ideal because it would record "only the last one" but that would be > > simple enough for now (currently there is only one sync worker so that > > scenario > > is likely to happen). > > > > Yeah, we could do that but I am not sure how much it can help. I guess > we could do some tests to see if it helps. Yeah not sure either. I just think it can only help and shouldn't make things worst (but could avoid extra SnapBuildSnapshotExists() calls). Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Synchronizing slots from primary to standby
On Mon, Apr 1, 2024 at 10:40 AM Amit Kapila wrote: > > After this step and before the next, did you ensure that the slot sync > has synced the latest confirmed_flush/restart LSNs? You can query: > "select slot_name,restart_lsn, confirmed_flush_lsn from > pg_replication_slots;" to ensure the same on both the primary and > standby. Yes, after ensuring the slot is synced on the standby, the problem is reproduced for me and the proposed patch fixes it (i.e. able to see the changes even after the promotion). I'm just thinking if we can add a TAP test for this issue, but one key aspect of this reproducer is to not let someone write a RUNNING_XACTS WAL record on the primary in between before the standby promotion. Setting bgwriter_delay to max isn't helping me. I think we can think of using an injection point to add delay in LogStandbySnapshot() for having this problem reproduced consistently in a TAP test. Perhaps, we can think of adding this later after the fix is shipped. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Fix parameters order for relation_copy_for_cluster
Correction: so now the code is not broken.
Re: Teach predtest about IS [NOT] proofs
On Mon, Mar 25, 2024 at 5:53 PM Tom Lane wrote:
>
> James Coleman writes:
> > [ v6 patchset ]
>
> I went ahead and committed 0001 after one more round of review
>
> statements; my bad). I also added the changes in test_predtest.c from
> 0002. I attach a rebased version of 0002, as well as 0003 which isn't
> changed, mainly to keep the cfbot happy.
>
> I'm still not happy with what you did in predicate_refuted_by_recurse:
> it feels wrong and rather expensively so. There has to be a better
> way. Maybe strong vs. weak isn't quite the right formulation for
> refutation tests?

Possibly. Earlier I'd mused that:

> Alternatively (to avoid unnecessary CPU burn) we could modify
> predicate_implied_by_recurse (and functions called by it) to take an
> argument beyond "weak = true/false", e.g., an enum that allows for
> something like "WEAK", "STRONG", and "EITHER". That's a bigger change,
> so I didn't want to do that right away unless there was agreement on
> that direction.

I'm going to try implementing that and see how I feel about what it
looks like in practice.

Regards,
James Coleman
Re: Teach predtest about IS [NOT] proofs
On Mon, Mar 25, 2024 at 11:45 PM Tom Lane wrote:
>
> I wrote:
> > I went ahead and committed 0001 after one more round of review
> >
> > statements; my bad). I also added the changes in test_predtest.c from
> > 0002. I attach a rebased version of 0002, as well as 0003 which isn't
> > changed, mainly to keep the cfbot happy.
>
> [ squint.. ] Apparently I managed to hit ^K right before sending this
> email. The missing line was meant to be more or less
>
>       which found a couple of missing "break"
>
> Not too important, but perhaps future readers of the archives will
> be confused.

I was wondering myself :) so thanks for clarifying.

Regards,
James Coleman
Re: Emitting JSON to file using COPY TO
On Sat, Mar 9, 2024 at 9:13 AM jian he wrote:
>
> On Sat, Mar 9, 2024 at 2:03 AM Joe Conway wrote:
> >
> > On 3/8/24 12:28, Andrey M. Borodin wrote:
> > > Hello everyone!
> > >
> > > Thanks for working on this, really nice feature!
> > >
> > >> On 9 Jan 2024, at 01:40, Joe Conway wrote:
> > >>
> > >> Thanks -- will have a look
> > >
> > > Joe, recently folks proposed a lot of patches in this thread that seem
> > > to have diverted from the original way of implementation.
> > > As the author of CF entry [0], can you please comment on which patch
> > > version needs review?
> >
> > I don't know if I agree with the proposed changes, but I have also been
> > waiting to see how the parallel discussion regarding COPY extensibility
> > shakes out.
> >
> > And there were a couple of issues found that need to be tracked down.
> >
> > Additionally I have had time/availability challenges recently.
> >
> > Overall, chances seem slim that this will make it into 17, but I have
> > not quite given up hope yet either.
>
> Hi.
> Summary of the changes I've made in the v9 patches at [0]:
>
> meta: rebased. Now you need to use `git apply` or `git am`; previously,
> for copyto_json.007.diff, you needed to use GNU patch.
>
> At [1], Dean Rasheed found some corner cases where the returned slot's
> tts_tupleDescriptor from
> `
> ExecutorRun(cstate->queryDesc, ForwardScanDirection, 0, true);
> processed = ((DR_copy *) cstate->queryDesc->dest)->processed;
> `
> cannot be used for composite_to_json.
> Generally, DestReceiver->rStartup is what sends the TupleDesc to the
> DestReceiver. The COPY TO DestReceiver's rStartup function is
> copy_dest_startup, but copy_dest_startup is a no-op.
> That means, to make the final TupleDesc of a COPY TO (FORMAT JSON)
> operation bulletproof, we need to copy the tupDesc from CopyToState's
> queryDesc.
> This only applies when the COPY TO source is a query (example:
> copy (select 1) to stdout), not a table.
> The above is my interpretation.
Trying to simplify the explanation: first refer to the struct DestReceiver.
For COPY TO (FORMAT JSON), we didn't send the preliminary TupleDesc to the
DestReceiver via the rStartup function pointer within struct _DestReceiver.
In `CopyOneRowTo(CopyToState cstate, TupleTableSlot *slot)`, the slot is the
final slot returned via query execution, but we cannot use its tupdesc
(slot->tts_tupleDescriptor) to do composite_to_json, because the final
returned slot's TupleDesc may change during query execution. So we need to
copy the tupDesc from CopyToState's queryDesc.

Also rebased; now we can apply it cleanly.

From 17d9f3765bb74d863e229afed59af7dd923f5379 Mon Sep 17 00:00:00 2001
From: jian he
Date: Mon, 1 Apr 2024 19:47:40 +0800
Subject: [PATCH v10 1/2] introduce json format for COPY TO operation.

json format is only allowed in COPY TO operation.
also cannot be used with header option.

discussion: https://postgr.es/m/CALvfUkBxTYy5uWPFVwpk_7ii2zgT07t3d-yR_cy4sfrrLU%3Dkcg%40mail.gmail.com
discussion: https://postgr.es/m/6a04628d-0d53-41d9-9e35-5a8dc302c...@joeconway.com
---
 doc/src/sgml/ref/copy.sgml         |   5 ++
 src/backend/commands/copy.c        |  13 +++
 src/backend/commands/copyto.c      | 125 -
 src/backend/parser/gram.y          |   8 ++
 src/backend/utils/adt/json.c       |   5 +-
 src/include/commands/copy.h        |   1 +
 src/include/utils/json.h           |   2 +
 src/test/regress/expected/copy.out |  54 +
 src/test/regress/sql/copy.sql      |  38 +
 9 files changed, 208 insertions(+), 43 deletions(-)

diff --git a/doc/src/sgml/ref/copy.sgml b/doc/src/sgml/ref/copy.sgml
index 33ce7c4e..add84dbb 100644
--- a/doc/src/sgml/ref/copy.sgml
+++ b/doc/src/sgml/ref/copy.sgml
@@ -218,9 +218,14 @@ COPY { table_name [ ( text, csv (Comma Separated Values),
+      json (JavaScript Object Notation),
       or binary.
       The default is text.
+
+      The json option is allowed only in
+      COPY TO.
+
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index f75e1d70..02f16d9e 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -497,6 +497,8 @@ ProcessCopyOptions(ParseState *pstate,
 			 /* default format */ ;
 		else if (strcmp(fmt, "csv") == 0)
 			opts_out->csv_mode = true;
+		else if (strcmp(fmt, "json") == 0)
+			opts_out->json_mode = true;
 		else if (strcmp(fmt, "binary") == 0)
 			opts_out->binary = true;
 		else
@@ -740,6 +742,11 @@ ProcessCopyOptions(ParseState *pstate,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg("cannot specify HEADER in BINARY mode")));
 
+	if (opts_out->json_mode && opts_out->header_line)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("cannot specify HEADER in JSON mode")));
+
 	/* Check quote */
 	if (!opts_out->csv_mode && opts_out->quote != NULL)
 		ereport(ERROR,
@@ -817,6 +824,12 @@ ProcessCopyOptions(ParseState *pstate,
Re: Synchronizing slots from primary to standby
On Mon, Apr 1, 2024 at 2:51 PM Bertrand Drouvot wrote:
>
> Hi,
>
> On Mon, Apr 01, 2024 at 06:05:34AM +, Zhijie Hou (Fujitsu) wrote:
> > On Monday, April 1, 2024 8:56 AM Zhijie Hou (Fujitsu) wrote:
> > Attach the V4 patch which includes the optimization to skip the decoding if
> > the snapshot at the syncing restart_lsn is already serialized. It can avoid
> > most of the duplicate decoding in my test, and I am doing some more tests
> > locally.
>
> Thanks!
>
> 1 ===
>
> Same comment as in [1].
>
> In LogicalSlotAdvanceAndCheckReadynessForDecoding(), if we are synchronizing
> slots then I think that we can skip:
>
> +       /*
> +        * Wait for specified streaming replication standby servers (if any)
> +        * to confirm receipt of WAL up to moveto lsn.
> +        */
> +       WaitForStandbyConfirmation(moveto);
>
> Indeed, if we are dealing with a synced slot then we know we're in
> RecoveryInProgress().
>
> Then there is no need to call WaitForStandbyConfirmation() as it could go
> until the RecoveryInProgress() in StandbySlotsHaveCaughtup() for nothing
> (as we already know it).

Won't it normally return from the first check in
WaitForStandbyConfirmation() because standby_slot_names_config is not
set on standby?

> 2 ===
>
> +       {
> +               if (SnapBuildSnapshotExists(remote_slot->restart_lsn))
> +               {
>
> That could call SnapBuildSnapshotExists() multiple times for the same
> "restart_lsn" (for example in case of multiple remote slots to sync).
>
> What if the sync worker records the last lsn it asked for serialization
> (and serialized)? Then we could check that value first before deciding to
> call (or not) SnapBuildSnapshotExists() on it?
>
> It's not ideal because it would record "only the last one", but that would
> be simple enough for now (currently there is only one sync worker, so that
> scenario is likely to happen).

Yeah, we could do that but I am not sure how much it can help. I guess
we could do some tests to see if it helps.

--
With Regards,
Amit Kapila.