Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
On Thu, Sep 29, 2022 at 08:09:56PM -0700, Nathan Bossart wrote:
> Looks reasonable to me.

0001, to move pg_pwritev_with_retry() to a new home, seems fine, so
applied.  Regarding 0002, using pg_pwrite_zeros() as a routine name, as
suggested by Thomas, sounds good to me.  However, I am not really a fan
of its dependency on PGAlignedXLogBlock, because it should be able to
work with any buffer of any size, as long as the input buffer is
aligned, shouldn't it?  For example, what about PGAlignedBlock?  So,
should we make this more extensible?  My guess would be the addition of
the block size and the block pointer to the arguments of
pg_pwrite_zeros(), in combination with a check to make sure that the
input buffer is MAXALIGN()'d (with an Assert() rather than just an
elog/pg_log_error?).
--
Michael

signature.asc
Description: PGP signature
Re: Allow file inclusion in pg_hba and pg_ident files
Hi,

On Thu, Oct 27, 2022 at 12:08:31PM +0900, Michael Paquier wrote:
> Putting things afresh, there are two different things here (sorry I
> need to see that typed ;p):
> 1) How do we want to check reliably the loading of the HBA and ident
> files on errors?

I guess you meant the failure to load HBA / ident files containing
invalid data?

> EXEC_BACKEND would reload an entire new thing for
> each connection, hence we need some loops to go through that.
> 2) How to check the contents of pg_hba_file_rules and
> pg_ident_file_mappings?
>
> There is a dependency between 1) and 2) once we try to check for error
> patterns in pg_hba_file_rules, because connections would just not
> happen.  This is not the case for pg_ident_file_mappings though, so we
> could still test for buggy patterns in pg_ident.conf (or any of its
> included parts) with some expected content of
> pg_ident_file_mappings.error after a successful connection.
>
> Hmm.  And what if we just gave up on the checks for error patterns in
> pg_hba_file_rules?

We discussed this problem in the past (1), and my understanding was
that detecting a -DEXEC_BACKEND/Win32 build and skipping those tests in
that case would be an acceptable solution to make sure there's at least
some coverage.  The proposed patch adds such an approach, making sure
that the failure is due to an invalid HBA file.  If you changed your
mind I can remove that part, but again I'd like to be sure of what you
exactly want before starting to rewrite stuff.

> One thing that we could do for this part is to
> include all the buggy patterns we want to check at once in pg_hba.conf
> in its included portions, then scan for all the logs produced after
> attempting to start a server as the loading of pg_hba.conf would
> produce one LOG line with its CONTEXT for each buggy entry.  The patch
> checks for error patterns with generate_log_err_rows(), but it looks
> like it would make the part 3 of the new test cleaner and easier to
> maintain in the long-term.
I'm not sure what you mean here.  The patch does check for all the
errors by looking at the LOG lines and CONTEXT lines, but to keep the
regexp simpler it doesn't try to make sure that each CONTEXT line
immediately follows the expected LOG line.  That's why the errors are
divided into two steps: a first step with a single error using some
inclusion, so we can validate that the CONTEXT line is entirely correct
(wrt. line number and such), and then every possible error pattern,
where we assume that the CONTEXT lines still follow their LOG entries
if they're found.  It also has the knowledge of which errors add a
CONTEXT line and which don't.  And that's done twice, for HBA and
ident.  The part 3 is just concatenating everything back, for HBA and
ident.  So long-term maintenance shouldn't get any harder, as there
won't be any need for more steps.  We can just keep appending stuff in
the 2nd step and all the tests should run as expected.

[1] https://www.postgresql.org/message-id/YtZLeJPlMutL9heh%40paquier.xyz
Adding doubly linked list type which stores the number of items in the list
As part of the AIO work [1], there are quite a number of dlist_heads
for which a counter is used to keep track of how many items are in the
list.  We also have a few places in master which do the same thing.  In
order to tidy this up and to help ensure that the count variable does
not get out of sync with the items which are stored in the list, how
about we introduce "dclist" which maintains the count for us?

I've attached a patch which does this.  The majority of the functions
for the new type are just wrappers around the equivalent dlist
function.  dclist provides all of the functionality that dlist does,
except there's no dclist_delete() function.  dlist_delete() can be done
by just knowing the element to delete and not the list that the element
belongs to.  With dclist, that's not possible, as we must also subtract
1 from the count variable, and obviously we need the dclist_head for
that.

I'll add this to the November commitfest.

David

[1] https://www.postgresql.org/message-id/flat/20210223100344.llw5an2akleng...@alap3.anarazel.de

From a24babfe52c7ab946f192f4872dc480ce18c8c0f Mon Sep 17 00:00:00 2001
From: David Rowley
Date: Thu, 27 Oct 2022 16:01:34 +1300
Subject: [PATCH v1] Add doubly linked count list implementation

We have various requirements when using a dlist_head to keep track of
the number of items in the list.  This, traditionally, has been done by
maintaining a counter variable in the calling code.  Here we tidy this
up by adding "dclist", which is very similar to dlists but also keeps
track of the number of items stored in the list.  Callers may use the
new dclist_count() function when they need to know how many items are
stored.  Obtaining the count is an O(1) operation.

For simplicity reasons, dclists and dlists both use dlist_node as their
node type and dlist_iter/dlist_mutable_iter as their iterator type.
dclists have all of the same functionality as dlists except there is no
function named dclist_delete().  To remove an item from a list
dclist_delete_from() must be used.  This requires knowing which dclist
the given item is stored in.

Additionally, here we also convert some dlists which keep track of the
number of items stored to make them use dclists instead.
---
 .../replication/logical/reorderbuffer.c |  58 ++--
 src/backend/replication/logical/snapbuild.c |   2 +-
 src/backend/utils/activity/pgstat_xact.c |  34 +-
 src/backend/utils/adt/ri_triggers.c |  13 +-
 src/include/lib/ilist.h | 313 +-
 src/include/replication/reorderbuffer.h |   6 +-
 src/include/utils/pgstat_internal.h |   3 +-
 src/tools/pgindent/typedefs.list |   1 +
 8 files changed, 351 insertions(+), 79 deletions(-)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index c29894041e..603c861461 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -349,8 +349,6 @@ ReorderBufferAllocate(void)
 	buffer->by_txn_last_xid = InvalidTransactionId;
 	buffer->by_txn_last_txn = NULL;
-	buffer->catchange_ntxns = 0;
-
 	buffer->outbuf = NULL;
 	buffer->outbufsize = 0;
 	buffer->size = 0;
@@ -368,7 +366,7 @@ ReorderBufferAllocate(void)
 	dlist_init(&buffer->toplevel_by_lsn);
 	dlist_init(&buffer->txns_by_base_snapshot_lsn);
-	dlist_init(&buffer->catchange_txns);
+	dclist_init(&buffer->catchange_txns);
 
 	/*
 	 * Ensure there's no stale data from prior uses of this slot, in case some
@@ -413,7 +411,7 @@ ReorderBufferGetTXN(ReorderBuffer *rb)
 	dlist_init(&txn->changes);
 	dlist_init(&txn->tuplecids);
-	dlist_init(&txn->subtxns);
+	dclist_init(&txn->subtxns);
 
 	/* InvalidCommandId is not zero, so set it explicitly */
 	txn->command_id = InvalidCommandId;
@@ -1066,14 +1064,13 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
 	subtxn->txn_flags |= RBTXN_IS_SUBXACT;
 	subtxn->toplevel_xid = xid;
-	Assert(subtxn->nsubtxns == 0);
+	Assert(dclist_count(&subtxn->subtxns) == 0);
 
 	/* set the reference to top-level transaction */
 	subtxn->toptxn = txn;
 
 	/* add to subtransaction list */
-	dlist_push_tail(&txn->subtxns, &subtxn->node);
-	txn->nsubtxns++;
+	dclist_push_tail(&txn->subtxns, &subtxn->node);
 
 	/* Possibly transfer the subtxn's snapshot to its top-level txn. */
 	ReorderBufferTransferSnapToParent(txn, subtxn);
@@ -1241,7 +1238,7 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	if (txn->nentries > 0)
 		nr_txns++;
-	dlist_foreach(cur_txn_i, &txn->subtxns)
+	dclist_foreach(cur_txn_i, &txn->subtxns)
 	{
 		ReorderBufferTXN *cur_txn;
@@ -1308,7 +1305,7 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	}
 	/* add subtransactions if they contain changes */
-	dlist_foreach(cur_txn_i,
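A minimal standalone sketch of the dclist idea being proposed: wrap the plain dlist operations and maintain a count so that dclist_count() is O(1). The struct layouts here are simplified stand-ins for the real types in src/include/lib/ilist.h, and the final patch may differ.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-in for the dlist_node in src/include/lib/ilist.h */
typedef struct dlist_node
{
	struct dlist_node *prev;
	struct dlist_node *next;
} dlist_node;

/* "dclist": a doubly linked list head that also tracks the member count */
typedef struct dclist_head
{
	dlist_node	head;			/* circular sentinel, as in dlist_head */
	uint32_t	count;			/* number of items stored in the list */
} dclist_head;

static void
dclist_init(dclist_head *h)
{
	h->head.prev = h->head.next = &h->head;
	h->count = 0;
}

static void
dclist_push_tail(dclist_head *h, dlist_node *n)
{
	n->next = &h->head;
	n->prev = h->head.prev;
	h->head.prev->next = n;
	h->head.prev = n;
	h->count++;					/* the wrapper keeps the count in sync */
}

/*
 * Unlike dlist_delete(), the head must be passed too, since the count has
 * to be decremented -- hence dclist_delete_from(), not dclist_delete().
 */
static void
dclist_delete_from(dclist_head *h, dlist_node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	h->count--;
}

static uint32_t
dclist_count(const dclist_head *h)
{
	return h->count;			/* O(1), no list traversal needed */
}
```

The absence of a plain dclist_delete() is the design point discussed above: deleting by node alone could not keep the counter consistent.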
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Oct 27, 2022 at 9:11 AM Masahiko Sawada wrote:
> True. I'm going to start with 6 bytes and will consider reducing it to
> 5 bytes.

Okay, let's plan on 6 for now, so we have the worst-case sizes up
front.  As discussed, I will attempt the size class decoupling after v8
and see how it goes.

> Encoding the kind in a pointer tag could be tricky given DSA

If it turns out to be unworkable, that's life.  If it's just tricky,
that can certainly be put off for future work.  I hope to at least test
it out with local memory.

> support so currently I'm thinking to pack the node kind and node
> capacity classes to uint8.

That won't work, if we need 128 for capacity, leaving no bits left.  I
want the capacity to be a number we can directly compare with the count
(we won't ever need to store 256 because that node will never grow).
Also, further to my last message, we need to access the kind quickly,
without more cycles.

> I've made some progress on investigating DSA support. I've written a
> draft patch for that and regression tests passed. I'll share it as a
> separate patch for discussion with the v8 radix tree patch.

Great!

> While implementing DSA support, I realized that we may not need to use
> pointer tagging to distinguish between backend-local address or
> dsa_pointer. In order to get a backend-local address from dsa_pointer,
> we need to pass dsa_area like:

I was not clear -- when I see how much code changes to accommodate DSA
pointers, I imagine I will pretty much know the places that would be
affected by tagging the pointer with the node kind.

Speaking of tests, there is currently no Meson support, but tests pass
because this library is not used anywhere in the backend yet, and
apparently the CI Meson builds don't know to run the regression test?
That will need to be done too.  However, it's okay to keep the
benchmarking module in autoconf, since it won't be committed.
> > +static inline void
> > +chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
> > +						  uint8 *dst_chunks, rt_node **dst_children, int count)
> > +{
> > +	memcpy(dst_chunks, src_chunks, sizeof(uint8) * count);
> > +	memcpy(dst_children, src_children, sizeof(rt_node *) * count);
> > +}
> >
> > gcc generates better code with something like this (but not
> > hard-coded) at the top:
> >
> > if (count > 4)
> > 	pg_unreachable();

Actually it just now occurred to me there's a bigger issue here: *We*
know this code can only get here iff count == 4, so why doesn't the
compiler know that?  I believe it boils down to

static rt_node_kind_info_elem rt_node_kind_info[RT_NODE_KIND_COUNT] = {

In the assembly, I see it checks if there is room in the node by doing
a runtime lookup in this array, which is not constant.  This might not
be important just yet, because I want to base the check on the proposed
node capacity instead, but I mention it as a reminder to us to make
sure we take all opportunities for the compiler to propagate constants.

--
John Naylor
EDB: http://www.enterprisedb.com
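A standalone sketch of the compiler hint being discussed, assuming pg_unreachable() maps to __builtin_unreachable() as it does on gcc/clang; the rt_node type here is an opaque stand-in, since only pointer copies matter for the illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#if defined(__GNUC__) || defined(__clang__)
#define pg_unreachable() __builtin_unreachable()
#else
#define pg_unreachable() ((void) 0)
#endif

typedef struct rt_node rt_node;	/* opaque stand-in for the real node type */

static inline void
chunk_children_array_copy(uint8_t *src_chunks, rt_node **src_children,
						  uint8_t *dst_chunks, rt_node **dst_children,
						  int count)
{
	/*
	 * Promise the compiler that count is bounded.  With that knowledge it
	 * can unroll or vectorize the two copies below instead of emitting
	 * generic variable-length memcpy calls.
	 */
	if (count > 4)
		pg_unreachable();

	memcpy(dst_chunks, src_chunks, sizeof(uint8_t) * count);
	memcpy(dst_children, src_children, sizeof(rt_node *) * count);
}
```

The flip side, as the message notes, is that the bound must be visible to the compiler as a constant; if the capacity comes from a runtime lookup in a non-constant table, the hint cannot propagate.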
Re: Allow file inclusion in pg_hba and pg_ident files
On Wed, Oct 26, 2022 at 11:32:14PM +0800, Julien Rouhaud wrote:
> Have you already done a rebase while working on the patch or are you
> intending to take care of it, or should I?  Let's not both do the
> work :)

Spoiler alert: I have not done a rebase yet ;)
--
Michael
Re: Allow file inclusion in pg_hba and pg_ident files
On Wed, Oct 26, 2022 at 11:32:14PM +0800, Julien Rouhaud wrote:
> I don't mind taking care of that, but before doing so I'd like to have
> some feedback on whether you're ok with my approach (per my initial
> email about it at [1]) or if you had some different ideas on how to do
> it.

Putting things afresh, there are two different things here (sorry I
need to see that typed ;p):
1) How do we want to check reliably the loading of the HBA and ident
files on errors?  EXEC_BACKEND would reload an entire new thing for
each connection, hence we need some loops to go through that.
2) How to check the contents of pg_hba_file_rules and
pg_ident_file_mappings?

There is a dependency between 1) and 2) once we try to check for error
patterns in pg_hba_file_rules, because connections would just not
happen.  This is not the case for pg_ident_file_mappings though, so we
could still test for buggy patterns in pg_ident.conf (or any of its
included parts) with some expected content of
pg_ident_file_mappings.error after a successful connection.

Hmm.  And what if we just gave up on the checks for error patterns in
pg_hba_file_rules?  One thing that we could do for this part is to
include all the buggy patterns we want to check at once in pg_hba.conf
in its included portions, then scan for all the logs produced after
attempting to start a server, as the loading of pg_hba.conf would
produce one LOG line with its CONTEXT for each buggy entry.  The patch
checks for error patterns with generate_log_err_rows(), but it looks
like it would make the part 3 of the new test cleaner and easier to
maintain in the long-term.
--
Michael
Re: GUC values - recommended way to declare the C variables?
On Wed, Oct 26, 2022 at 09:49:34PM -0500, Justin Pryzby wrote:
> In v4, Peter posted a 2-patch series with my patch as 001.
> But I pointed out that it's better to fix the initialization of the
> compile-time GUCs rather than exclude them from the check.
> Then Peter submitted v5 which does that, and isn't built on top of my
> patch.

Okidoki, thanks for the clarification.
--
Michael
Re: Use LIMIT instead of Unique for DISTINCT when all distinct pathkeys are redundant
On Wed, Oct 26, 2022 at 4:25 PM David Rowley wrote:
> One other thought I had about the duplicate "Limit" node in the final
> plan was that we could make the limit clause an Expr like
> LEAST(, 1).  That way we could ensure we get at most 1 row, but
> perhaps less if the expression given in the LIMIT clause evaluated to
> 0.  This will still work correctly when the existing limit evaluates
> to NULL.  I'm still just not that keen on this idea as it means still
> having to either edit the parse's limitCount or store the limit
> details in a new field in PlannerInfo and use that when making the
> final LimitPath.  However, I'm still not sure doing this is worth the
> extra complexity.

I find the duplicate "Limit" node not that concerning after realizing
it may appear in other queries, such as

explain (analyze, timing off, costs off)
select * from (select * from (select * from generate_series(1,100)i limit 10) limit 5) limit 1;

                                  QUERY PLAN
------------------------------------------------------------------------------
 Limit (actual rows=1 loops=1)
   ->  Limit (actual rows=1 loops=1)
         ->  Limit (actual rows=1 loops=1)
               ->  Function Scan on generate_series i (actual rows=1 loops=1)

Although the situation there is different, in that the Limit node is
actually atop a SubqueryScan which is removed afterwards, the final
plan still appears as a Limit node atop another Limit node.  So I
wonder whether we can just live with it, or resolve it in a separate
patch.

Thanks
Richard
Re: GUC values - recommended way to declare the C variables?
On Thu, Oct 27, 2022 at 1:33 PM Michael Paquier wrote:
> On Wed, Oct 26, 2022 at 09:14:37PM -0500, Justin Pryzby wrote:
> > It seems like you're reviewing the previous version of the patch,
> > rather than the one attached to the message you responded to (which
> > doesn't have anything to do with GUC_DEFAULT_COMPILE).
>
> It does not seem so as things stand, I have been looking at
> v5-0001-GUC-C-variable-sanity-check.patch as posted by Peter here:
> https://www.postgresql.org/message-id/cahut+puchjyxitgdtovhvdnjpbivllr49gwvs+8vwnfom4h...@mail.gmail.com
>
> In combination with a two-patch set as posted by you here:
> 0001-add-DYNAMIC_DEFAULT-for-settings-which-vary-by-.-con.patch
> 0002-WIP-test-guc-default-values.patch
> https://www.postgresql.org/message-id/20221024220544.gj16...@telsasoft.com
>
> These are the latest patch versions posted on their respective thread
> I am aware of, and based on the latest updates of each thread it still
> looked like there was a dependency between both.  So, is that the case
> or not?  If not, sorry if I misunderstood things.

No.  My v5 is no longer dependent on the other patch.

> > I don't know what you meant by "make the CF bot happy" (?)
>
> It is in my opinion confusing to see that the v5 posted on this
> thread, which was marked as ready for committer as of
> https://commitfest.postgresql.org/40/3934/, seems to rely on a
> facility that it makes no use of.  Hence it looks to me that this
> patch has been posted as-is to allow the CF bot to pass (I would have
> posted that as an isolated two-patch set with the first patch
> introducing the flag if need be).

Yeah, my v4 was posted along with the other GUC flag patch as a
prerequisite to make the cfbot happy.  This is no longer the case --
v5 is a single independent patch.  Sorry for the "ready for committer"
status being confusing.  At that time I thought it was.

> Anyway, per my previous comments in my last message of this thread as
> of https://www.postgresql.org/message-id/y1nnwftrnl3it...@paquier.xyz,
> I don't see a need for DYNAMIC_DEFAULT from the other thread, nor do I
> see a need for a style like that:
>
> +/* GUC variable */
> +bool update_process_title =
> +#ifdef WIN32
> +	false;
> +#else
> +	true;
> +#endif
>
> I think that it would be cleaner to use the same approach as
> checking_after_flush and similar GUCs with a centralized definition,
> rather than spreading such style in two places for each GUC that this
> patch touches (aka its declaration and its default value in
> guc_tables.c).  In any case, the patch of this thread still needs some
> adjustments IMO.

OK, I can make that adjustment if it is preferred.  I think it is the
same as what I already suggested a while ago [1] ("But probably it is
no problem to just add #defines...")

--
[1] https://www.postgresql.org/message-id/1113448.1665717297%40sss.pgh.pa.us

Kind Regards,
Peter Smith.
Fujitsu Australia.
Re: GUC values - recommended way to declare the C variables?
On Thu, Oct 27, 2022 at 11:33:48AM +0900, Michael Paquier wrote:
> On Wed, Oct 26, 2022 at 09:14:37PM -0500, Justin Pryzby wrote:
> > It seems like you're reviewing the previous version of the patch,
> > rather than the one attached to the message you responded to (which
> > doesn't have anything to do with GUC_DEFAULT_COMPILE).
>
> It does not seem so as things stand, I have been looking at
> v5-0001-GUC-C-variable-sanity-check.patch as posted by Peter here:
> https://www.postgresql.org/message-id/cahut+puchjyxitgdtovhvdnjpbivllr49gwvs+8vwnfom4h...@mail.gmail.com

This thread is about consistency of the global variables with what's
set by the GUC infrastructure.

In v4, Peter posted a 2-patch series with my patch as 001.  But I
pointed out that it's better to fix the initialization of the
compile-time GUCs rather than exclude them from the check.  Then Peter
submitted v5 which does that, and isn't built on top of my patch.

> In combination with a two-patch set as posted by you here:
> 0001-add-DYNAMIC_DEFAULT-for-settings-which-vary-by-.-con.patch
> 0002-WIP-test-guc-default-values.patch
> https://www.postgresql.org/message-id/20221024220544.gj16...@telsasoft.com

That's a separate thread regarding consistency of the default values
(annotations) shown in postgresql.conf.  (I'm not sure whether or not
my patch adding GUC flags is an agreed way forward, although they
might turn out to be useful for other purposes.)
--
Justin
Re: generic plans and "initial" pruning
On Mon, Oct 17, 2022 at 6:29 PM Amit Langote wrote:
> On Wed, Oct 12, 2022 at 4:36 PM Amit Langote wrote:
> > On Fri, Jul 29, 2022 at 1:20 PM Amit Langote wrote:
> > > On Thu, Jul 28, 2022 at 1:27 AM Robert Haas wrote:
> > > > 0001 adds es_part_prune_result but does not use it, so maybe the
> > > > introduction of that field should be deferred until it's needed
> > > > for something.
> > >
> > > Oops, looks like a mistake when breaking the patch.  Will move
> > > that bit to 0002.
> >
> > Fixed that and also noticed that I had defined PartitionPruneResult
> > in the wrong header (execnodes.h).  That led to PartitionPruneResult
> > nodes not being able to be written and read, because
> > src/backend/nodes/gen_node_support.pl doesn't create _out* and
> > _read* routines for the nodes defined in execnodes.h.  I moved its
> > definition to plannodes.h, even though it is not actually the
> > planner that instantiates those; no other include/nodes header
> > sounds better.
> >
> > One more thing I realized is that Bitmapsets added to the List
> > PartitionPruneResult.valid_subplan_offs_list are not actually
> > read/write-able.  That's a problem that I also faced in [1], so I
> > proposed a patch there to make Bitmapset a read/write-able Node and
> > mark (only) the Bitmapsets that are added into read/write-able node
> > trees with the corresponding NodeTag.  I'm including that patch here
> > as well (0002) for the main patch to work (pass
> > -DWRITE_READ_PARSE_PLAN_TREES build tests), though it might make
> > sense to discuss it in its own thread?
>
> Had second thoughts on the use of a List of Bitmapsets for this, such
> that the make-Bitmapset-Nodes patch is no longer needed.
>
> I had defined PartitionPruneResult such that it stood for the results
> of pruning for all PartitionPruneInfos contained in
> PlannedStmt.partPruneInfos (covering all Append/MergeAppend nodes
> that can use partition pruning in a given plan).  So, it had a List
> of Bitmapsets.  I think it's perhaps better for PartitionPruneResult
> to cover only one PartitionPruneInfo and thus need only a Bitmapset
> and not a List thereof, which I have implemented in the attached
> updated patch 0002.  So, instead of needing to pass around a
> PartitionPruneResult with each PlannedStmt, this now passes a List of
> PartitionPruneResult with an entry for each in
> PlannedStmt.partPruneInfos.

Rebased over 3b2db22fe.
--
Thanks, Amit Langote
EDB: http://www.enterprisedb.com

v23-0001-Move-PartitioPruneInfo-out-of-plan-nodes-into-Pl.patch
Description: Binary data

v23-0002-Optimize-AcquireExecutorLocks-by-locking-only-un.patch
Description: Binary data
RE: Perform streaming logical transactions by background workers and parallel apply
On Wed, Oct 26, 2022 7:19 PM Amit Kapila wrote:
> On Tue, Oct 25, 2022 at 8:38 AM Masahiko Sawada wrote:
> > On Fri, Oct 21, 2022 at 6:32 PM houzj.f...@fujitsu.com wrote:
> >
> > I've started to review this patch.  I tested the v40-0001 patch and
> > have one question:
> >
> > IIUC even when most of the changes in the transaction are filtered
> > out in pgoutput (e.g., by relation filter or row filter), the
> > walsender sends STREAM_START.  This means that the subscriber could
> > end up launching parallel apply workers also for almost empty (and
> > streamed) transactions.  For example, I created three subscriptions
> > each of which subscribes to a different table.  When I loaded a
> > large amount of data into one table, all three (leader) apply
> > workers received START_STREAM and launched their parallel apply
> > workers.
>
> The apply workers will be launched just the first time then we
> maintain a pool so that we don't need to restart them.
>
> > However, two of them finished without applying any data.  I think
> > this behaviour looks problematic since it wastes workers and rather
> > decreases the apply performance if the changes are not large.  Is it
> > worth considering a way to delay launching a parallel apply worker
> > until we find out the amount of changes is actually large?
>
> I think even if changes are less there may not be much difference
> because we have observed that the performance improvement comes from
> not writing to file.
>
> > For example, the leader worker writes the streamed changes to files
> > as usual and launches a parallel worker if the amount of changes
> > exceeds a threshold or the leader receives the second segment.
> > After that, the leader worker switches to send the streamed changes
> > to parallel workers via shm_mq instead of files.
>
> I think writing to file won't be a good idea as that can hamper the
> performance benefit in some cases and not sure if it is worth.

I tried to test some cases where only a small part of the transaction,
or an empty transaction, is sent to the subscriber, to see if using
streaming parallel brings any performance degradation.  The test was
performed ten times, and the average was taken.  The results are as
follows; the details and the script of the test are attached.

10% of rows are sent
--------------------
HEAD     24.4595
patched  18.4545

5% of rows are sent
-------------------
HEAD     21.244
patched  17.9655

0% of rows are sent
-------------------
HEAD     18.0605
patched  17.893

This shows that when only 5% or 10% of rows are sent to the
subscriber, using parallel apply takes less time than HEAD, and even
if all rows are filtered there's no performance degradation.

Regards,
Shi Yu
Re: GUC values - recommended way to declare the C variables?
On Wed, Oct 26, 2022 at 09:14:37PM -0500, Justin Pryzby wrote:
> It seems like you're reviewing the previous version of the patch,
> rather than the one attached to the message you responded to (which
> doesn't have anything to do with GUC_DEFAULT_COMPILE).

It does not seem so as things stand, I have been looking at
v5-0001-GUC-C-variable-sanity-check.patch as posted by Peter here:
https://www.postgresql.org/message-id/cahut+puchjyxitgdtovhvdnjpbivllr49gwvs+8vwnfom4h...@mail.gmail.com

In combination with a two-patch set as posted by you here:
0001-add-DYNAMIC_DEFAULT-for-settings-which-vary-by-.-con.patch
0002-WIP-test-guc-default-values.patch
https://www.postgresql.org/message-id/20221024220544.gj16...@telsasoft.com

These are the latest patch versions posted on their respective threads
I am aware of, and based on the latest updates of each thread it still
looked like there was a dependency between both.  So, is that the case
or not?  If not, sorry if I misunderstood things.

> I don't know what you meant by "make the CF bot happy" (?)

It is in my opinion confusing to see that the v5 posted on this
thread, which was marked as ready for committer as of
https://commitfest.postgresql.org/40/3934/, seems to rely on a
facility that it makes no use of.  Hence it looks to me that this
patch has been posted as-is to allow the CF bot to pass (I would have
posted that as an isolated two-patch set with the first patch
introducing the flag if need be).

Anyway, per my previous comments in my last message of this thread as
of https://www.postgresql.org/message-id/y1nnwftrnl3it...@paquier.xyz,
I don't see a need for DYNAMIC_DEFAULT from the other thread, nor do I
see a need for a style like that:

+/* GUC variable */
+bool update_process_title =
+#ifdef WIN32
+	false;
+#else
+	true;
+#endif

I think that it would be cleaner to use the same approach as
checking_after_flush and similar GUCs with a centralized definition,
rather than spreading such style in two places for each GUC that this
patch touches (aka its declaration and its default value in
guc_tables.c).  In any case, the patch of this thread still needs some
adjustments IMO.
--
Michael
Re: GUC values - recommended way to declare the C variables?
On Thu, Oct 27, 2022 at 11:06:56AM +0900, Michael Paquier wrote:
> On Wed, Oct 26, 2022 at 06:31:56PM +1100, Peter Smith wrote:
> > I re-checked all the GUC C vars which your patch flags as
> > GUC_DEFAULT_COMPILE.  For some of them, where it was not any
> > trouble, I made the C var assignment use the same preprocessor rules
> > as used by guc_tables.  For others (mostly the string ones) I left
> > the GUC C var untouched because the sanity checker function already
> > has a rule not to complain about int GUC C vars which are 0 or
> > string GUC vars which are NULL.
>
> I see.  So you have on this thread an independent patch to make the
> CF bot happy, still depending on the patch posted on [1] to bypass
> the changes with variables whose boot values are
> compilation-dependent.

It seems like you're reviewing the previous version of the patch,
rather than the one attached to the message you responded to (which
doesn't have anything to do with GUC_DEFAULT_COMPILE).

I don't know what you meant by "make the CF bot happy" (?)
--
Justin
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Wed, Oct 26, 2022 at 8:06 PM John Naylor wrote: > > On Mon, Oct 24, 2022 at 12:54 PM Masahiko Sawada > wrote: > > > I've attached updated PoC patches for discussion and cfbot. From the > > previous version, I mainly changed the following things: > > Thank you for the comments! > > * Separate treatment of inner and leaf nodes > > Overall, this looks much better! > > > * Pack both the node kind and node count to an uint16 value. > > For this, I did mention a bitfield earlier as something we "could" do, but it > wasn't clear we should. After looking again at the node types, I must not > have thought through this at all. Storing one byte instead of four for the > full enum is a good step, but saving one more byte usually doesn't buy > anything because of padding, with a few exceptions like this example: > > node4: 4 + 4 + 4*8 = 40 > node4: 5 + 4+(7) + 4*8 = 48 bytes > > Even there, I'd rather not spend the extra cycles to access the members. And > with my idea of decoupling size classes from kind, the variable-sized kinds > will require another byte to store "capacity". Then, even if the kind gets > encoded in a pointer tag, we'll still have 5 bytes in the base type. So I > think we should assume 5 bytes from the start. (Might be 6 temporarily if I > work on size decoupling first). True. I'm going to start with 6 bytes and will consider reducing it to 5 bytes. Encoding the kind in a pointer tag could be tricky given DSA support so currently I'm thinking to pack the node kind and node capacity classes to uint8. > > (Side note, if you have occasion to use bitfields again in the future, C99 > has syntactic support for them, so no need to write your own shifting/masking > code). Thanks! > > > I've not done SIMD part seriously yet. But overall the performance > > seems good so far. If we agree with the current approach, I think we > > can proceed with the verification of decoupling node sizes from node > > kind. And I'll investigate DSA support. > > Sounds good. 
I have some additional comments about v7, and after these are > addressed, we can proceed independently with the above two items. Seeing the > DSA work will also inform me how invasive pointer tagging will be. There will > still be some performance tuning and cosmetic work, but it's getting closer. > I've made some progress on investigating DSA support. I've written draft patch for that and regression tests passed. I'll share it as a separate patch for discussion with v8 radix tree patch. While implementing DSA support, I realized that we may not need to use pointer tagging to distinguish between backend-local address or dsa_pointer. In order to get a backend-local address from dsa_pointer, we need to pass dsa_area like: node = dsa_get_address(tree->dsa, node_dp); As shown above, the dsa area used by the shared radix tree is stored in radix_tree struct, so we can know whether the radix tree is shared or not by checking (tree->dsa == NULL). That is, if it's shared we use a pointer to radix tree node as dsa_pointer, and if not we use a pointer as a backend-local pointer. We don't need to encode something in a pointer. > - > 0001: > > +#ifndef USE_NO_SIMD > +#include "port/pg_bitutils.h" > +#endif > > Leftover from an earlier version? > > +static inline int vector8_find(const Vector8 v, const uint8 c); > +static inline int vector8_find_ge(const Vector8 v, const uint8 c); > > Leftovers, causing compiler warnings. (Also see new variable shadow warning) Will fix. > > +#else /* USE_NO_SIMD */ > + Vector8 r = 0; > + uint8 *rp = (uint8 *) > + > + for (Size i = 0; i < sizeof(Vector8); i++) > + rp[i] = Min(((const uint8 *) )[i], ((const uint8 *) )[i]); > + > + return r; > +#endif > > As I mentioned a couple versions ago, this style is really awkward, and > potential non-SIMD callers will be better off writing their own byte-wise > loop rather than using this API. 
> Especially since the "min" function exists
> only as a workaround for lack of unsigned comparison in (at least) SSE2.
> There is one existing function in this file with that idiom for non-assert
> code (for completeness), but even there, inputs of current interest to us use
> the uint64 algorithm.

Agreed. Will remove the non-SIMD code.

> 0002:
>
> + /* XXX: should not to use vector8_highbit_mask */
> + bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
>
> Hmm?

That's an outdated memo of mine; I'll remove it.

> +/*
> + * Return index of the first element in chunks in the given node that is greater
> + * than or equal to 'key'. Return -1 if there is no such element.
> + */
> +static inline int
> +node_32_search_ge(rt_node_base_32 *node, uint8 chunk)
>
> The caller must now have logic for inserting at the end:
>
> + int insertpos = node_32_search_ge((rt_node_base_32 *) n32, chunk);
> + int16 count = NODE_GET_COUNT(n32);
> +
> + if (insertpos < 0)
> + insertpos = count; /* insert to
Re: GUC values - recommended way to declare the C variables?
On Wed, Oct 26, 2022 at 06:31:56PM +1100, Peter Smith wrote:
> I re-checked all the GUC C vars which your patch flags as
> GUC_DEFAULT_COMPILE. For some of them, where it was not any trouble, I
> made the C var assignment use the same preprocessor rules as used by
> guc_tables. For others (mostly the string ones) I left the GUC C var
> untouched because the sanity checker function already has a rule not
> to complain about int GUC C vars which are 0 or string GUC vars which
> are NULL.

I see. So you have on this thread an independent patch to make the CF bot happy, while still depending on the patch posted in [1] to bypass the changes for variables whose boot values are compilation-dependent. Is it right to believe that the only requirement here is GUC_DEFAULT_COMPILE and not GUC_DEFAULT_INITDB? The former is much more intuitive than the latter.

Still, I see an inconsistency in what you are doing here. sanity_check_GUC_C_var() would need to skip all the GUCs marked as GUC_DEFAULT_COMPILE, meaning that one could still be "fooled by a mismatched value" in these cases. We are talking about a limited set of them, but it seems to me that we have no need for this flag at all once the default GUC values are set with a #define'd value, no? checkpoint_flush_after, bgwriter_flush_after, port and effective_io_concurrency do that, which is why v5-0001-GUC-C-variable-sanity-check.patch does its stuff only for maintenance_io_concurrency, update_process_title, assert_enabled and syslog_facility. I think that it would be simpler to have a default for these last four with a centralized definition, meaning that we would not need GUC_DEFAULT_COMPILE at all, while the validation could be done for basically all the GUCs with default values assigned. In short, this patch has no need to depend on what's posted in [1].

[1]: https://www.postgresql.org/message-id/20221024220544.gj16...@telsasoft.com
--
Michael
Re: Some regression tests for the pg_control_*() functions
On Wed, Oct 26, 2022 at 01:41:12PM +0530, Bharath Rupireddy wrote:
> We will have bigger problems when a backend corrupts the pg_control
> file, no? The bigger problems could be that the server won't come up
> or it behaves abnormally or some other.

Possibly, yes.

> Can't the CRC check detect any of the above corruptions? Do we have
> any evidence of backend corrupting the pg_control file or any of the
> above variables while running regression tests?

It could be possible that the backend writes an incorrect data combination through its APIs, where the CRC is correct but the data is not (say a TLI of 0, as one example).

> If the concern is backend corrupting the pg_control file and CRC check
> can't detect it, then the extra checks (as proposed in the patch) must
> be placed within the core (perhaps before writing/after reading the
> pg_control file), not in regression tests for sure.

Well, that depends on the level of protection you want. Now there are things in place already when it comes to recovery or at startup. Anyway, the recent experience with the 56-bit relfilenode thread is that we don't check the execution of these functions at all, and that's the actual minimal requirement, so I have applied a patch based on count(*) > 0 for now to cover that. I am not sure if any of the checks for the control file fields are valuable; perhaps some are.
--
Michael
Re: confused with name in the pic
On Wed, Oct 26, 2022 at 2:13 AM jack...@gmail.com wrote:
> typedef struct A_Expr
> {
>     pg_node_attr(custom_read_write)
>     NodeTag type;
>     A_Expr_Kind kind;   /* see above */
>     List *name;         /* possibly-qualified name of operator */
>     Node *lexpr;        /* left argument, or NULL if none */
>     Node *rexpr;        /* right argument, or NULL if none */
>     int location;       /* token location, or -1 if unknown */
> } A_Expr;
>
> I run a sql like select a,b from t3 where a > 1; and I get the parseTree
> for selectStmt:
>
> why the name is '-' but not '>'?

Given the general lack of interest in this so far, I'd suggest you put together a minimal test case that includes a simple print-to-log command patched into HEAD, showing the problematic parse tree in its internal form. Posting an image of a custom visualization (produced by custom code that may itself be buggy) against an unspecified server, with no indication of how any of it works, doesn't give enough detail, and there is little reason to think that such an obvious bug could exist. I do agree that your expectation seems quite sound, though I do not have the faintest idea how to actually go about reproducing your result even in the minimal way described above (I know enough to know it is possible, just not where to patch).

David J.
Re: Documentation for building with meson
Hi,

On Wed, Oct 19, 2022 at 7:43 PM Justin Pryzby wrote:
> On Wed, Oct 19, 2022 at 11:35:10AM -0700, samay sharma wrote:
> > Creating a new thread focussed on adding docs for building Postgres with
> > meson. This is a spinoff from the original thread [1] and I've attempted to
> > address all the feedback provided there in the attached patch.
> >
> > Please let me know your thoughts.
>
> It's easier to review rendered documentation.
> I made a rendered copy available here:
> https://api.cirrus-ci.com/v1/artifact/task/6084297781149696/html_docs/html_docs/install-meson.html

Thanks for your review. Attached v2 of the patch here.

> > + Flex and Bison
> > + are needed to build PostgreSQL using
> > + meson. Be sure to get
> > + Flex 2.5.31 or later and
> > + Bison 1.875 or later from your package manager.
> > + Other lex and yacc
> > + programs cannot be used.
>
> These versions need to be updated, see also: 57bab3330:
> - b086a47a270 "Bump minimum version of Bison to 2.3"
> - 8b878bffa8d "Bump minimum version of Flex to 2.5.35"

Changed

> > + will be enabled automatically if the required packages are found.
>
> should refer to files/libraries/headers/dependencies rather than "packages"?

Changed to dependencies

> > + default is false that is to use Readline.
>
> "that is to use" should be parenthesized or separate with commas, like
> | default is false, that is to use Readline.
>
> zlib is mentioned twice, the first being "strongly recommended".
> Is that intended? Also, basebackup can use zlib, too.

Yes, the first is in the requirements section where we just list packages required / recommended. The other mention is in the list of configure options. This is similar to how the documentation looks today for make / autoconf. Added pg_basebackup as a use case too.

> > + If you have the binaries for certain by programs required to build
>
> remove "by"?
Done

> > + Postgres (with or without optional flags) stored at non-standard
> > + paths, you could specify them manually to meson configure. The complete
> > + list of programs for whom this is supported can be found by running
>
> for *which
>
> Actually, I suggest to say:
> |If a program required to build Postgres (with or without optional flags)
> |is stored in a non-standard path, ...

Liked this framing better. Changed.

> > + a build with a different value of these options.
>
> .. with different values ..

Done

> > + the server, it is recommended to use atleast the --buildtype=debug
>
> at least

Done

> > +and it's options in the meson documentation.
>
> its

Done

> Maybe other things should have ?
>
> Git
> Libxml2
> libxslt
> visual studio
> DTrace
> ninja
>
> + Flex and Bison
>
> Maybe these should use ?

I'm unsure of the right protocol for this. I tried to follow the precedent set in the make / autoconf part of the documentation, which uses at certain places and at others. Is there a reference or guidance on which to use where, or is it mostly a case by case decision?

> > + be installed with pip.
>
> Should be ?

Changed.

> This part is redundant with prior text:
> " To use this option, you will need an implementation of the Gettext API. "

Modified.

> > + Enabls use of the Zlib library
>
> typo: Enabls

Fixed.

> > + This option is set to true by default and setting it to false will
>
> change "and" to ";" for spinlocks and atomics?

Done

> > + Debug info is generated but the result is not optimized.
>
> Maybe say the "code" is not optimized ?

Changed

> > + the tests can slow down the server significantly
>
> remove "can"

Done.

> > + You can override the amount of parallel processes used with
>
> s/amount/number/

Done

> > + If you'd like to build with a backend other that ninja
>
> other *than

Fixed.

> > + the GNU C library then you will additionally
>
> library comma

Added

> > +argument. If no srcdir is given Meson will deduce the
>
> given comma

Added

> > + It should be noted that after the initial configure step
>
> step comma

Added

> > +After the installation you can free disk space by removing the built
>
> installation comma

Added

> > +To learn more about running the tests and how to interpret the results
> > +you can refer to the documentation for interpreting test results.
>
> interpret the results comma
> "for interpreting test results" seems redundant.

Changed.

> > + ninja install should work for most cases but if you'd like to use more options
>
> cases comma

Added

> Starting with "Developer Options", this intermixes postgres
> project-specific options like cassert and tap-tests with meson's stuff
> like buildtype and werror. IMO there's too much detail about meson's
> options, which I think is better left to that project's own
> documentation, and postgres docs should include only a brief
Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)
Hi,

On 2022-10-24 14:38:52 -0400, Melanie Plageman wrote:
> > - "repossession" is a very unintuitive name for me. If we want something like
> > it, can't we just name it reuse_failed or such?
>
> Repossession could be called eviction_failed or reuse_failed.
> Do we think we will ever want to use it to count buffers we released
> in other IOContexts (thus making the name eviction_failed better than
> reuse_failed)?

I've a somewhat radical proposal: Let's just not count any of this in the initial version. I think we want something, but clearly it's one of the harder aspects of this patch. Let's get the rest in, and then work on this in isolation.

> Speaking of IOCONTEXT_LOCAL, I was wondering if it is confusing to call
> it IOCONTEXT_LOCAL since it refers to IO done for temporary tables. What
> if, in the future, we want to track other IO done using data in local
> memory?

Fair point. However, I think 'tmp' or 'temp' would be worse, because there are other sources of temporary files that would be worth counting, consider e.g. tuplestore temporary files. 'temptable' isn't good because it's not just tables. 'temprel'? On balance I think local is better, but not sure.

> Also, what if we want to track other IO done using data from shared memory
> that is not in shared buffers? Would IOCONTEXT_SB and IOCONTEXT_TEMP be
> better? Should IOContext literally describe the context of the IO being done
> and there be a separate column which indicates the source of the data for
> the IO? Like wal_buffer, local_buffer, shared_buffer? Then if it is not
> block-oriented, it could be shared_mem, local_mem, or bypass?

Hm. I don't think we'd need _buffer for WAL or such, because there's nothing else.

> If we had another dimension to the matrix "data_src" which, with
> block-oriented IO is equivalent to "buffer type", this could help with
> some of the clarity problems.
> > We could remove the "reused" column and that becomes:
> >
> > IOCONTEXT | DATA_SRC        | IOOP
> > strategy  | strategy_buffer | EVICT
>
> Having data_src and iocontext simplifies the meaning of all io
> operations involving a strategy. Some operations are done on shared
> buffers and some on existing strategy buffers and this would be more
> clear without the addition of special columns for strategies.

-1, I think this just blows up the complexity further, without providing much benefit.

But: Perhaps a somewhat similar idea could be used to address the concerns in the preceding paragraphs. How about the following set of columns:

backend_type:
object: relation, temp_relation[, WAL, tempfiles, ...]
iocontext: buffer_pool, bulkread, bulkwrite, vacuum[, bypass]
read:
written:
extended:
bytes_conversion:
evicted:
reused:
files_synced:
stats_reset:

Greetings,

Andres Freund
Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)
okay, so I realized v35 had an issue where I wasn't counting strategy evictions correctly. fixed in attached v36. This made me wonder if there is actually a way to add a test for evictions (in strategy and shared contexts) that is not flakey.

On Sun, Oct 23, 2022 at 6:48 PM Maciek Sakrejda wrote:
> On Thu, Oct 20, 2022 at 10:31 AM Andres Freund wrote:
> > - "repossession" is a very unintuitive name for me. If we want something like
> > it, can't we just name it reuse_failed or such?
>
> +1, I think "repossessed" is awkward. I think "reuse_failed" works,
> but no strong opinions on an alternate name.

Also, re: repossessed, I can change it to reuse_failed but I do think it is important to give users a way to distinguish between bulkread rejections of dirty buffers and strategies failing to reuse buffers due to concurrent pinning (since the reaction to these two scenarios would likely be different). If we added another column called something like "claim_failed" which counts buffers which we failed to reuse because of concurrent pinning or usage, we could recommend use of this column together with "reuse_failed" to determine the cause of the failed reuses for a bulkread. We could also use "claim_failed" in IOContext shared to provide information on shared buffer contention.

- Melanie

From 0d5fc7da60f6b02259b8dd1d2eab25967cb9a95a Mon Sep 17 00:00:00 2001
From: Melanie Plageman
Date: Thu, 6 Oct 2022 12:23:38 -0400
Subject: [PATCH v36 3/5] Aggregate IO operation stats per BackendType

Stats on IOOps for all IOContexts for a backend are tracked locally. Add functionality for backends to flush these stats to shared memory and accumulate them with those from all other backends, exited and live. Also add reset and snapshot functions used by cumulative stats system for management of these statistics.

The aggregated stats in shared memory could be extended in the future with per-backend stats -- useful for per connection IO statistics and monitoring.
Some BackendTypes will not flush their pending statistics at regular intervals and explicitly call pgstat_flush_io_ops() during the course of normal operations to flush their backend-local IO operation statistics to shared memory in a timely manner. Because not all BackendType, IOOp, IOContext combinations are valid, the validity of the stats is checked before flushing pending stats and before reading in the existing stats file to shared memory.

Author: Melanie Plageman
Reviewed-by: Andres Freund
Reviewed-by: Justin Pryzby
Reviewed-by: Kyotaro Horiguchi
Reviewed-by: Maciek Sakrejda
Reviewed-by: Lukas Fittl
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
 doc/src/sgml/monitoring.sgml                     |   2 +
 src/backend/utils/activity/pgstat.c              |  35
 src/backend/utils/activity/pgstat_bgwriter.c     |   7 +-
 .../utils/activity/pgstat_checkpointer.c         |   7 +-
 src/backend/utils/activity/pgstat_io_ops.c       | 164 ++
 src/backend/utils/activity/pgstat_relation.c     |  15 +-
 src/backend/utils/activity/pgstat_shmem.c        |   4 +
 src/backend/utils/activity/pgstat_wal.c          |   4 +-
 src/backend/utils/adt/pgstatfuncs.c              |   4 +-
 src/include/miscadmin.h                          |   2 +
 src/include/pgstat.h                             |  88 ++
 src/include/utils/pgstat_internal.h              |  36
 src/tools/pgindent/typedefs.list                 |   3 +
 13 files changed, 365 insertions(+), 6 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index e5d622d514..698f274341 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -5390,6 +5390,8 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
         the pg_stat_bgwriter view,
         archiver to reset all the counters shown in
         the pg_stat_archiver view,
+        io to reset all the counters shown in the
+        pg_stat_io view,
         wal to reset all the counters shown in the
         pg_stat_wal view or
         recovery_prefetch to reset all the counters shown

diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index 1b97597f17..4becee9a6c 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -359,6 +359,15 @@ static const PgStat_KindInfo pgstat_kind_infos[PGSTAT_NUM_KINDS] = {
 		.snapshot_cb = pgstat_checkpointer_snapshot_cb,
 	},

+	[PGSTAT_KIND_IOOPS] = {
+		.name = "io_ops",
+
+		.fixed_amount = true,
+
+		.reset_all_cb = pgstat_io_ops_reset_all_cb,
+		.snapshot_cb = pgstat_io_ops_snapshot_cb,
+	},
+
 	[PGSTAT_KIND_SLRU] = {
 		.name = "slru",

@@ -582,6 +591,7 @@ pgstat_report_stat(bool force)
 	/* Don't expend a clock check if nothing to do */
 	if (dlist_is_empty() &&
+		!have_ioopstats &&
 		!have_slrustats &&
 		!pgstat_have_pending_wal())
 	{

@@ -628,6 +638,9 @@ pgstat_report_stat(bool force)
 	/*
Re: Allow file inclusion in pg_hba and pg_ident files
On Wed, Oct 26, 2022 at 03:56:07PM +0900, Michael Paquier wrote:
> So, I have spent a good portion of today looking at what you have
> here, applying 0001 and 0003 while fixing, rebasing and testing the
> whole, discarding 0002 (we could do more for the line number and
> source file in terms of the LOGs reported for a regexec failure).

Thanks!

> Now remains 0004, which is the core of the proposal, and while it
> needs a rebase,

Have you already done a rebase while working on the patch, or are you intending to take care of it, or should I? Let's not both do the work :)

> - The TAP test, which is half the size of the patch in line number.
> Could it be possible to make it more edible, introducing a basic
> infrastructure to check a set of rules in pg_hba.conf and
> pg_ident.conf without the inclusion logic? Checks for error patterns
> (that I agree we strongly lack tests for) look like something we'd
> want to tackle independently of the inclusion logic, and it should be
> built on top of a basic test infra, at least.

I don't mind taking care of that, but before doing so I'd like to have some feedback on whether you're ok with my approach (per my initial email about it at [1]) or if you had some different ideas on how to do it.

[1] https://www.postgresql.org/message-id/20220730080936.atyxodvwlmf2wnoc@jrouhaud
Re: Stack overflow issue
24.08.2022 20:58, Tom Lane writes:
> Nice work! I wonder if you can make the regex crashes reachable by
> reducing the value of max_stack_depth enough that it's hit before
> reaching the "regular expression is too complex" limit.
> regards, tom lane

Hello everyone!

It's been a while since Alexander Lakhin and I published a list of functions that suffer from stack overflow. We are back to tell you more about such places. During our analysis we concluded that some functions can be crashed without changing any parameters, while others can only be crashed after changing some settings.

The first function crashes without any changes:

# CheckAttributeType
(n=6; printf "create domain dint as int; create domain dinta0 as dint[];"; for ((i=1;i<=$n;i++)); do printf "create domain dinta$i as dinta$(( $i - 1 ))[]; "; done; ) | psql
psql -c "create table t(f1 dinta6[]);"

Some of the others crash if we change the "max_locks_per_transaction" parameter:

# findDependentObjects
max_locks_per_transaction = 200
(n=1; printf "create table t (i int); create view v0 as select * from t;"; for ((i=1;i<$n;i++)); do printf "create view v$i as select * from v$(( $i - 1 )); "; done; ) | psql
psql -c "drop table t"

# ATExecDropColumn
max_locks_per_transaction = 300
(n=5; printf "create table t0 (a int, b int); "; for ((i=1;i<=$n;i++)); do printf "create table t$i() inherits(t$(( $i - 1 ))); "; done; printf "alter table t0 drop b;" ) | psql

# ATExecDropConstraint
max_locks_per_transaction = 300
(n=5; printf "create table t0 (a int, b int, constraint bc check (b > 0));"; for ((i=1;i<=$n;i++)); do printf "create table t$i() inherits(t$(( $i - 1 ))); "; done; printf "alter table t0 drop constraint bc;" ) | psql

# ATExecAddColumn
max_locks_per_transaction = 200
(n=5; printf "create table t0 (a int, b int);"; for ((i=1;i<=$n;i++)); do printf "create table t$i() inherits(t$(( $i - 1 ))); "; done; printf "alter table t0 add column c int;" ) | psql

# ATExecAlterConstrRecurse
max_locks_per_transaction = 300
(n=5; printf "create table t(a int primary key); create table pt (a int primary key, foreign key(a) references t) partition by range (a);"; printf "create table pt0 partition of pt for values from (0) to (10) partition by range (a);"; for ((i=1;i<=$n;i++)); do printf "create table pt$i partition of pt$(( $i - 1 )) for values from ($i) to (10) partition by range (a); "; done; printf "alter table pt alter constraint pt_a_fkey deferrable initially deferred;" ) | psql

This is where the fun begins. According to Tom Lane, a decrease in max_stack_depth could lead to new crashes, but it turned out that Alexander was able to find new crashes precisely by increasing this parameter. Also, we had ulimit -s set to the default value of 8MB.

# eval_const_expressions_mutator
max_stack_depth = '7000kB'
(n=1; printf "select 'a' "; for ((i=1;i<$n;i++)); do printf " collate \"C\" "; done; ) | psql

If, like me, you didn't get a crash when Alexander shared his find, then you probably configured your cluster with the -Og optimization flag. While trying to break this function, we concluded that the maximum reachable recursion depth depends on the optimization flag (-O0/-Og): with optimization, each function frame on the stack becomes smaller, so the limit is reached more slowly and the system can withstand more torment. Therefore, this query will only fail if your cluster was built with the -O0 optimization flag.

The crash of the next function depends not only on the optimization flag, but also on a number of other things. While researching, we noticed that postgres enforces a gap of ~400kB between max_stack_depth and ulimit -s. We thought we could hit the max_stack_depth limit and then hit the OS limit as well. Therefore, Alexander wrote a recursive SQL function that eats up the stack within max_stack_depth, including a query that eats up the remaining ~400kB. And this causes a crash.
# executeBoolItem
max_stack_depth = '7600kB'

create function infinite_recurse(i int) returns int as $$
begin
  raise notice 'Level %', i;
  begin
    perform jsonb_path_query('{"a":[1]}'::jsonb,
      ('$.a[*] ? (' || repeat('!(', 4800) || '@ == @' || repeat(')', 4800) || ')')::jsonpath);
  exception when others then
    raise notice 'jsonb_path_query error at level %, %', i, sqlerrm;
  end;
  begin
    select infinite_recurse(i + 1) into i;
  exception when others then
    raise notice 'Max stack depth reached at level %, %', i, sqlerrm;
  end;
  return i;
end;
$$ language plpgsql;

select infinite_recurse(1);

To sum it all up, we have not yet decided on a general approach to such functions. Some functions are definitely subject to stack overflow. Some are definitely not. This can be seen from the code where the recurse flag is passed, or a function that checks the stack is
Re: Reducing duplicativeness of EquivalenceClass-derived clauses
Richard Guo writes: > On Wed, Oct 26, 2022 at 6:09 AM Tom Lane wrote: >> The only thing that I think might be controversial here is that >> I dropped the check for matching operator OID. To preserve that, >> we'd have needed to use get_commutator() in the reverse-match cases, >> which it seemed to me would be a completely unjustified expenditure >> of cycles. The operators we select for freshly-generated clauses >> will certainly always match those of previously-generated clauses. >> Maybe there's a chance that they'd not match those of ec_sources >> clauses (that is, the user-written clauses we started from), but >> if they don't and that actually makes any difference then surely >> we are talking about a buggy opclass definition. > The operator is chosen according to the two given EC members's data > type. Since we are dealing with the same pair of EC members, I think > the operator is always the same one. So it also seems no problem to drop > the check for operator. I wonder if we can even add an assertion if > we've found a RestrictInfo from ec_derives that the operator matches. Yeah, I considered that --- even if somehow an ec_sources entry isn't an exact match, ec_derives ought to be. However, it still didn't seem worth a get_commutator() call. We'd basically be expending cycles to check that select_equality_operator yields the same result with the same inputs as it did before, and that doesn't seem terribly interesting to check. I'm also not sure what's the point of allowing divergence from the requested operator in some but not all paths. I added a bit of instrumentation to count how many times we need to build new join clauses in create_join_clause. In the current core regression tests, I see this change reducing the number of new join clauses built here from 9673 to 5142 (out of 26652 calls). So not quite 50% savings, but pretty close to it. 
That should mean that this change is about a wash just in terms of the code it touches directly: each iteration of the search loops is nearly twice as expensive as before, but we'll only need to do about half as many. So whatever we save downstream is pure gravy. regards, tom lane
Re: Perform streaming logical transactions by background workers and parallel apply
On Tue, Oct 25, 2022 at 8:38 AM Masahiko Sawada wrote:
> On Fri, Oct 21, 2022 at 6:32 PM houzj.f...@fujitsu.com wrote:
>
> I've started to review this patch. I tested v40-0001 patch and have
> one question:
>
> IIUC even when most of the changes in the transaction are filtered out
> in pgoutput (eg., by relation filter or row filter), the walsender
> sends STREAM_START. This means that the subscriber could end up
> launching parallel apply workers also for almost empty (and streamed)
> transactions. For example, I created three subscriptions each of which
> subscribes to a different table. When I loaded a large amount of data
> into one table, all three (leader) apply workers received START_STREAM
> and launched their parallel apply workers.

The apply workers will be launched just the first time; after that we maintain a pool so that we don't need to restart them.

> However, two of them
> finished without applying any data. I think this behaviour looks
> problematic since it wastes workers and rather decreases the apply
> performance if the changes are not large. Is it worth considering a
> way to delay launching a parallel apply worker until we find out the
> amount of changes is actually large?

I think even if the changes are small there may not be much difference, because we have observed that the performance improvement comes from not writing to a file.

> For example, the leader worker
> writes the streamed changes to files as usual and launches a parallel
> worker if the amount of changes exceeds a threshold or the leader
> receives the second segment. After that, the leader worker switches to
> send the streamed changes to parallel workers via shm_mq instead of
> files.

I think writing to a file won't be a good idea, as that can hamper the performance benefit in some cases, and I am not sure it is worth it.

--
With Regards,
Amit Kapila.
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Mon, Oct 24, 2022 at 12:54 PM Masahiko Sawada wrote:
> I've attached updated PoC patches for discussion and cfbot. From the
> previous version, I mainly changed the following things:
>
> * Separate treatment of inner and leaf nodes

Overall, this looks much better!

> * Pack both the node kind and node count to an uint16 value.

For this, I did mention a bitfield earlier as something we "could" do, but it wasn't clear we should. After looking again at the node types, I must not have thought through this at all. Storing one byte instead of four for the full enum is a good step, but saving one more byte usually doesn't buy anything because of padding, with a few exceptions like this example:

node4: 4 + 4 + 4*8 = 40
node4: 5 + 4+(7) + 4*8 = 48 bytes

Even there, I'd rather not spend the extra cycles to access the members. And with my idea of decoupling size classes from kind, the variable-sized kinds will require another byte to store "capacity". Then, even if the kind gets encoded in a pointer tag, we'll still have 5 bytes in the base type. So I think we should assume 5 bytes from the start. (Might be 6 temporarily if I work on size decoupling first).

(Side note, if you have occasion to use bitfields again in the future, C99 has syntactic support for them, so no need to write your own shifting/masking code).

> I've not done SIMD part seriously yet. But overall the performance
> seems good so far. If we agree with the current approach, I think we
> can proceed with the verification of decoupling node sizes from node
> kind. And I'll investigate DSA support.

Sounds good. I have some additional comments about v7, and after these are addressed, we can proceed independently with the above two items. Seeing the DSA work will also inform me how invasive pointer tagging will be. There will still be some performance tuning and cosmetic work, but it's getting closer.

-
0001:

+#ifndef USE_NO_SIMD
+#include "port/pg_bitutils.h"
+#endif

Leftover from an earlier version?
+static inline int vector8_find(const Vector8 v, const uint8 c);
+static inline int vector8_find_ge(const Vector8 v, const uint8 c);

Leftovers, causing compiler warnings. (Also see new variable shadow warning)

+#else /* USE_NO_SIMD */
+ Vector8 r = 0;
+ uint8 *rp = (uint8 *)
+
+ for (Size i = 0; i < sizeof(Vector8); i++)
+ rp[i] = Min(((const uint8 *) )[i], ((const uint8 *) )[i]);
+
+ return r;
+#endif

As I mentioned a couple versions ago, this style is really awkward, and potential non-SIMD callers will be better off writing their own byte-wise loop rather than using this API. Especially since the "min" function exists only as a workaround for lack of unsigned comparison in (at least) SSE2. There is one existing function in this file with that idiom for non-assert code (for completeness), but even there, inputs of current interest to us use the uint64 algorithm.

0002:

+ /* XXX: should not to use vector8_highbit_mask */
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));

Hmm?

+/*
+ * Return index of the first element in chunks in the given node that is greater
+ * than or equal to 'key'. Return -1 if there is no such element.
+ */
+static inline int
+node_32_search_ge(rt_node_base_32 *node, uint8 chunk)

The caller must now have logic for inserting at the end:

+ int insertpos = node_32_search_ge((rt_node_base_32 *) n32, chunk);
+ int16 count = NODE_GET_COUNT(n32);
+
+ if (insertpos < 0)
+ insertpos = count; /* insert to the tail */

It would be a bit more clear if node_*_search_ge() always returns the position we need (see the prototype for example). In fact, these functions are probably better named node*_get_insertpos().

+ if (likely(NODE_HAS_FREE_SLOT(n128)))
+ {
+ node_inner_128_insert(n128, chunk, child);
+ break;
+ }
+
+ /* grow node from 128 to 256 */

We want all the node-growing code to be pushed down to the bottom so that all branches of the hot path are close together.
This provides better locality for the CPU frontend. Looking at the assembly, the above doesn't have the desired effect, so we need to write like this (also see prototype):

if (unlikely( ! has-free-slot))
  grow-node;
else
{
  ...;
  break;
}
/* FALLTHROUGH */

+	/* Descend the tree until a leaf node */
+	while (shift >= 0)
+	{
+		rt_node    *child;
+
+		if (NODE_IS_LEAF(node))
+			break;
+
+		if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+			child = rt_node_add_new_child(tree, parent, node, key);
+
+		Assert(child);
+
+		parent = node;
+		node = child;
+		shift -= RT_NODE_SPAN;
+	}

Note that if we have to call rt_node_add_new_child(), each successive loop iteration must search it and find nothing there (the prototype had a separate function to handle this). Maybe it's not that critical yet, but something to keep in mind as we proceed. Maybe a comment about it to remind us.

+	/* there is no key to delete */
+	if (!rt_node_search_leaf(node,
Re: Reducing duplicativeness of EquivalenceClass-derived clauses
On Wed, Oct 26, 2022 at 6:09 AM Tom Lane wrote: > While fooling with my longstanding outer-join variables changes > (I am making progress on that, honest), I happened to notice that > equivclass.c is leaving some money on the table by generating > redundant RestrictInfo clauses. It already attempts to not generate > the same clause twice, which can save a nontrivial amount of work > because we cache selectivity estimates and so on per-RestrictInfo. > I realized though that it will build and return a clause like > "a.x = b.y" even if it already has "b.y = a.x". This is just > wasteful. It's always been the case that equivclass.c will > produce clauses that are ordered according to its own whims. > Consumers that need the operands in a specific order, such as > index scans or hash joins, are required to commute the clause > to be the way they want it while building the finished plan. > Therefore, it shouldn't matter which order of the operands we > return, and giving back the commutator clause if available could > potentially save as much as half of the selectivity-estimation > work we do with these clauses subsequently. > > Hence, PFA a patch that adjusts create_join_clause() to notice > commuted as well as exact matches among the EquivalenceClass's > existing clauses. This results in a number of changes visible in > regression test cases, but they're all clearly inconsequential. I think there is no problem with this idea, given the operands of EC-derived clauses are commutative, and it seems no one would actually rely on the order of the operands. I can see hashjoin/mergejoin would commute hash/merge joinclauses if needed with get_switched_clauses(). > > The only thing that I think might be controversial here is that > I dropped the check for matching operator OID. To preserve that, > we'd have needed to use get_commutator() in the reverse-match cases, > which it seemed to me would be a completely unjustified expenditure > of cycles. 
> The operators we select for freshly-generated clauses
> will certainly always match those of previously-generated clauses.
> Maybe there's a chance that they'd not match those of ec_sources
> clauses (that is, the user-written clauses we started from), but
> if they don't and that actually makes any difference then surely
> we are talking about a buggy opclass definition.

The operator is chosen according to the data types of the two given EC members. Since we are dealing with the same pair of EC members, I think the operator is always the same one. So it also seems fine to drop the check for the operator. I wonder if we could even add an assertion that the operator matches whenever we find a RestrictInfo in ec_derives.

Thanks
Richard
Re: Have nodeSort.c use datum sorts single-value byref types
On Thu, 29 Sept 2022 at 18:12, David Rowley wrote:
> In the attached patch, I've added a function named
> tuplesort_getdatum_nocopy() which is the same as tuplesort_getdatum()
> only without the datumCopy(). I opted for the new function rather than
> a new parameter in the existing function just to reduce branching and
> additional needless overhead.

Per what was said over on [1], I've adjusted the patch to just add a 'copy' parameter to tuplesort_getdatum() instead of adding the tuplesort_getdatum_nocopy() function.

I also adjusted some code in heapam_index_validate_scan() to pass copy=false to tuplesort_getdatum(). The datum in question here is a TID type, so this really only saves a datumCopy() / pfree on 32-bit systems. I wasn't too interested in speeding 32-bit systems up with this, it was more a case of being able to remove the #ifndef USE_FLOAT8_BYVAL / pfree code.

I think this is a fairly trivial patch, so if nobody objects, I plan to push it in the next few days.

David

[1] https://www.postgresql.org/message-id/65629.1664460603%40sss.pgh.pa.us

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index a3414a76e8..41f1ca65d0 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1879,17 +1879,13 @@ heapam_index_validate_scan(Relation heapRelation,
 			}
 
 			tuplesort_empty = !tuplesort_getdatum(state->tuplesort, true,
-												  &ts_val, &ts_isnull, NULL);
+												  false, &ts_val, &ts_isnull,
+												  NULL);
 			Assert(tuplesort_empty || !ts_isnull);
 			if (!tuplesort_empty)
 			{
 				itemptr_decode(&decoded, DatumGetInt64(ts_val));
 				indexcursor = &decoded;
-
-				/* If int8 is pass-by-ref, free (encoded) TID Datum memory */
-#ifndef USE_FLOAT8_BYVAL
-				pfree(DatumGetPointer(ts_val));
-#endif
 			}
 			else
 			{
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index bcf3b1950b..986f761e87 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -879,7 +879,7 @@ process_ordered_aggregate_single(AggState
*aggstate,
 	 */
 	while (tuplesort_getdatum(pertrans->sortstates[aggstate->current_set],
-							  true, newVal, isNull, &newAbbrevVal))
+							  true, false, newVal, isNull, &newAbbrevVal))
 	{
 		/*
 		 * Clear and select the working context for evaluation of the equality
@@ -900,24 +900,33 @@ process_ordered_aggregate_single(AggState *aggstate,
 								   pertrans->aggCollation,
 								   oldVal, *newVal)
 		{
-			/* equal to prior, so forget this one */
-			if (!pertrans->inputtypeByVal && !*isNull)
-				pfree(DatumGetPointer(*newVal));
+			MemoryContextSwitchTo(oldContext);
+			continue;
 		}
 		else
 		{
 			advance_transition_function(aggstate, pertrans, pergroupstate);
-			/* forget the old value, if any */
-			if (!oldIsNull && !pertrans->inputtypeByVal)
-				pfree(DatumGetPointer(oldVal));
-			/* and remember the new one for subsequent equality checks */
-			oldVal = *newVal;
+
+			MemoryContextSwitchTo(oldContext);
+
+			/*
+			 * Forget the old value, if any, and remember the new one for
+			 * subsequent equality checks.
+			 */
+			if (!pertrans->inputtypeByVal)
+			{
+				if (!oldIsNull)
+					pfree(DatumGetPointer(oldVal));
+				if (!*isNull)
+					oldVal = datumCopy(*newVal, pertrans->inputtypeByVal,
+									   pertrans->inputtypeLen);
+			}
+			else
+				oldVal = *newVal;
 			oldAbbrevVal = newAbbrevVal;
 			oldIsNull = *isNull;
Re: replacing role-level NOINHERIT with a grant-level option
Hello,

Thanks for reviewing. Committed.

Let me return to this topic. After looking at the changes in this patch, I have a suggestion.

The inheritance option for role memberships is important information for knowing whether the role's privileges will be available automatically or whether a switch with a SET ROLE command is required. However, this information cannot be obtained with psql commands, specifically \du or \dg. Previously, you could see the inherit attribute of the role (its absence is shown with \du). Now you have to look in the pg_auth_members system catalog.

My suggestion is to add information about pg_auth_members.inherit_option to the output of \du (\dg). If so, we can also add information about pg_auth_members.admin_option. Right now this information is not available in psql command output either.

I thought about how exactly to represent these options in the \du output, but did not settle on a single solution. I considered the following choices:

1. Add a \du+ command and, for each membership in the role, show the values of the two options. I haven't done a patch yet, but you can imagine the changes like this:

CREATE ROLE alice LOGIN IN ROLE pg_read_all_data;

\du+ alice
                         List of roles
 Role name | Attributes |               Member of
-----------+------------+---------------------------------------
 alice     |            | {pg_read_all_data(admin=f inherit=t)}

It looks long, but for \du+ it's not a problem.

2. I assume that the default values for these options will rarely change. In that case, we can do without \du+ and output only the changed values directly in the \du command.

GRANT pg_read_all_data TO alice WITH INHERIT FALSE;

2a.
\du alice
                     List of roles
 Role name | Attributes |           Member of
-----------+------------+-------------------------------
 alice     |            | {pg_read_all_data(inherit=f)}

2b. Similar to GRANT OPTION, we can use symbols instead of long text (inherit=f) for options.
For example, for the ADMIN OPTION we can use "*" (again similar to GRANT OPTION), and for the missing INHERIT option something else, such as "-":

GRANT pg_read_all_data TO alice WITH ADMIN TRUE;
GRANT pg_write_all_data TO alice WITH INHERIT FALSE;

\du alice
                          List of roles
 Role name | Attributes |                Member of
-----------+------------+------------------------------------------
 alice     |            | {pg_read_all_data*-,pg_write_all_data-}

But I think choices 2a and 2b are too complicated to understand, especially because the two options have different default values. Moreover, the default value for the INHERIT option depends on the value of the INHERIT attribute of the role. So I like the first choice with \du+ better. But perhaps there are other choices as well.

If it's interesting, I'm ready to open a new thread (the commitfest entry for this topic is now closed) and prepare a patch.

--
Pavel Luzanov
Re: Suppressing useless wakeups in walreceiver
On Sun, Oct 16, 2022 at 9:29 AM Nathan Bossart wrote:
>
> Here's a different take. Instead of creating structs and altering function
> signatures, we can just make the wake-up times file-global, and we can skip
> the changes related to reducing the number of calls to
> GetCurrentTimestamp() for now. This results in a far less invasive patch.
> (I still think reducing the number of calls to GetCurrentTimestamp() is
> worthwhile, but I'm beginning to agree with Bharath that it should be
> handled separately.)
>
> Thoughts?

Thanks. I'm wondering if we need to extend the similar approach for logical replication workers in logical/worker.c LogicalRepApplyLoop()?

A comment on the patch: isn't it better to track the soonest wakeup time, i.e. the min of the wakeup[] array (updating the min whenever an array element is updated in WalRcvComputeNextWakeup()), instead of recomputing it every time by looping for NUM_WALRCV_WAKEUPS times? I think this will save us some CPU cycles, because the for loop in which the below code is placed runs for the walreceiver's entire life cycle. We may wrap wakeup[] and min inside a structure for better code organization, or just add min as another static variable alongside wakeup[].

+		/* Find the soonest wakeup time, to limit our nap. */
+		nextWakeup = PG_INT64_MAX;
+		for (int i = 0; i < NUM_WALRCV_WAKEUPS; ++i)
+			nextWakeup = Min(wakeup[i], nextWakeup);
+		nap = Max(0, (nextWakeup - now + 999) / 1000);

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
confused with name in the pic
typedef struct A_Expr
{
	pg_node_attr(custom_read_write)

	NodeTag		type;
	A_Expr_Kind kind;			/* see above */
	List	   *name;			/* possibly-qualified name of operator */
	Node	   *lexpr;			/* left argument, or NULL if none */
	Node	   *rexpr;			/* right argument, or NULL if none */
	int			location;		/* token location, or -1 if unknown */
} A_Expr;

I ran a SQL statement like "select a, b from t3 where a > 1;" and looked at the parse tree for the SelectStmt. Why is the name '-' and not '>'?

jack...@gmail.com
Re: Use LIMIT instead of Unique for DISTINCT when all distinct pathkeys are redundant
On Fri, 14 Oct 2022 at 15:15, Richard Guo wrote:
> The v3 patch looks good to me.

Thank you for looking at that.

One other thought I had about the duplicate "Limit" node in the final plan was that we could make the limit clause an Expr like LEAST(, 1). That way we could ensure we get at most 1 row, but perhaps fewer if the expression given in the LIMIT clause evaluated to 0. This will still work correctly when the existing limit evaluates to NULL.

I'm just not that keen on this idea, as it means having to either edit the parse's limitCount or store the limit details in a new field in PlannerInfo and use that when making the final LimitPath. I'm not sure doing that is worth the extra complexity.

If nobody else has any thoughts on this or the patch in general, then I plan to push it in the next few days.

David
Re: Some regression tests for the pg_control_*() functions
On Wed, Oct 26, 2022 at 12:48 PM Michael Paquier wrote:
>
> On Wed, Oct 26, 2022 at 10:13:29AM +0530, Bharath Rupireddy wrote:
> > +1 for improving the test coverage. Is there a strong reason to
> > validate individual output columns rather than select count(*) > 0
> > from pg_control_*(); sort of tests? If the intention is to validate
> > the pg_controlfile contents, we have pg_controldata to look at and
> > pg_control_*() functions doing crc checks.
>
> And it could be possible that the control file finishes by writing
> some incorrect data due to a bug in the backend.

We will have bigger problems when a backend corrupts the pg_control file, no? The bigger problems could be that the server won't come up, or it behaves abnormally, or something else.

> Adding a count(*) or
> similar to get the number of fields of the function is basically the
> same as checking its execution, still I'd like to think that having a
> minimum set of checks would be kind of nice on top of that. Among all
> the ones I wrote in the patch upthread, the following ones would be in
> my minimalistic list:
> - timeline_id > 0
> - timeline_id >= prev_timeline_id
> - checkpoint_lsn >= redo_lsn
> - data_page_checksum_version >= 0
> - Perhaps the various fields of pg_control_init() using their
> lower-bound values.
> - Perhaps pg_control_version and/or catalog_version_no > NN

Can't the CRC check detect any of the above corruptions? Do we have any evidence of the backend corrupting the pg_control file or any of the above variables while running regression tests? If the concern is the backend corrupting the pg_control file in a way the CRC check can't detect, then the extra checks (as proposed in the patch) must be placed within the core (perhaps before writing/after reading the pg_control file), not in regression tests for sure.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Re: GUC values - recommended way to declare the C variables?
Thanks for the feedback. PSA the v5 patch. On Wed, Oct 26, 2022 at 7:04 AM Justin Pryzby wrote: > > +#ifdef USE_ASSERT_CHECKING > + sanity_check_GUC_C_var(hentry->gucvar); > +#endif > > => You can conditionally define that as an empty function so #ifdefs > aren't needed in the caller: > > void sanity_check_GUC_C_var() > { > #ifdef USE_ASSERT_CHECKING > ... > #endif > } > Fixed as suggested. > But actually, I don't think you should use my patch. You needed to > exclude update_process_title: > > src/backend/utils/misc/ps_status.c:bool update_process_title = true; > ... > src/backend/utils/misc/guc_tables.c-#ifdef WIN32 > src/backend/utils/misc/guc_tables.c-false, > src/backend/utils/misc/guc_tables.c-#else > src/backend/utils/misc/guc_tables.c-true, > src/backend/utils/misc/guc_tables.c-#endif > src/backend/utils/misc/guc_tables.c-NULL, NULL, NULL > > My patch would also exclude the 16 other GUCs with compile-time defaults > from your check. It'd be better not to exclude them; I think the right > solution is to change the C variable initialization to a compile-time > constant: > > #ifdef WIN32 > bool update_process_title = false; > #else > bool update_process_title = true; > #endif > > Or something more indirect like: > > #ifdef WIN32 > #define DEFAULT_PROCESS_TITLE false > #else > #define DEFAULT_PROCESS_TITLE true > #endif > > bool update_process_title = DEFAULT_PROCESS_TITLE; > > I suspect there's not many GUCs that would need to change - this might > be the only one. If this GUC were defined in the inverse (bool > skip_process_title), it wouldn't need special help, either. > I re-checked all the GUC C vars which your patch flags as GUC_DEFAULT_COMPILE. For some of them, where it was not any trouble, I made the C var assignment use the same preprocessor rules as used by guc_tables. 
For others (mostly the string ones) I left the GUC C var untouched because the sanity checker function already has a rule not to complain about int GUC C vars which are 0 or string GUC vars which are NULL.

--
Kind Regards,
Peter Smith.
Fujitsu Australia

v5-0001-GUC-C-variable-sanity-check.patch
Re: Some regression tests for the pg_control_*() functions
On Wed, Oct 26, 2022 at 10:13:29AM +0530, Bharath Rupireddy wrote:
> +1 for improving the test coverage. Is there a strong reason to
> validate individual output columns rather than select count(*) > 0
> from pg_control_*(); sort of tests? If the intention is to validate
> the pg_controlfile contents, we have pg_controldata to look at and
> pg_control_*() functions doing crc checks.

And it could be possible that the control file finishes by writing some incorrect data due to a bug in the backend. Adding a count(*) or similar to get the number of fields of the function is basically the same as checking its execution; still, I'd like to think that having a minimum set of checks would be kind of nice on top of that. Among all the ones I wrote in the patch upthread, the following ones would be in my minimalistic list:
- timeline_id > 0
- timeline_id >= prev_timeline_id
- checkpoint_lsn >= redo_lsn
- data_page_checksum_version >= 0
- Perhaps the various fields of pg_control_init() using their lower-bound values.
- Perhaps pg_control_version and/or catalog_version_no > NN

> If this isn't enough, we
> can have the pg_control_validate() function to do all the necessary
> checks and simplify the tests, no?

There is no function like that. Perhaps you mean to introduce something like that at the C level, but that does not seem necessary to me as long as SQL is able to do the job for the most meaningful parts.

--
Michael
Re: Allow file inclusion in pg_hba and pg_ident files
On Wed, Oct 26, 2022 at 11:19:48AM +0800, Julien Rouhaud wrote:
> That wouldn't be overdoing anymore if we remove the line number / filename from
> the fill_*_line prototypes, right?

So, I have spent a good portion of today looking at what you have here, applying 0001 and 0003 while fixing, rebasing and testing the whole, and discarding 0002 (we could do more for the line number and source file in terms of the LOGs reported for a regexec failure). Now remains 0004, which is the core of the proposal, and while it needs a rebase, I have not been able to spend much time looking at its internals. In order to help with the follow-up steps, I have spotted a few areas that could be done first:
- The diffs in guc.h/c for the introduction of GetDirConfFiles() to get a set of files in a directory (got to think a bit more about ParseConfigFp() when it comes to the keywords, but that comes into play with the parsing logic, which is different).
- The TAP test, which is half the size of the patch in line count. Could it be possible to make it more edible, introducing a basic infrastructure to check a set of rules in pg_hba.conf and pg_ident.conf without the inclusion logic? Checks for error patterns (that I agree we strongly lack tests for) look like something we'd want to tackle independently of the inclusion logic, and they should be built on top of a basic test infra, at least.

--
Michael
Re: Move backup-related code to xlogbackup.c/.h
On Mon, Oct 24, 2022 at 1:00 PM Michael Paquier wrote:
>
> On Wed, Oct 19, 2022 at 09:07:04PM +0530, Bharath Rupireddy wrote:
> > XLogBackupResetRunning() seemed better.

+1 for above function names.

> I see what you are doing here. XLogCtl would still live in xlog.c,
> but we want to have functions that are able to manipulate some of its
> fields.

Right.

> I am not sure to like that much because it introduces a
> circular dependency between xlog.c and xlogbackup.c. As of HEAD,
> xlog.c calls build_backup_content() from xlogbackup.c, which is fine
> as xlog.c is kind of a central piece that feeds on the insert and
> recovery pieces. However your patch makes some code paths of
> xlogbackup.c call routines from xlog.c, and I don't think that we
> should do that.

If you're talking about header file dependency, there's already a header file dependency between them - xlog.c includes xlogbackup.h for build_backup_content() and xlogbackup.c includes xlog.h for wal_segment_size. And, I think the same kind of dependency exists between xlog.c and xlogrecovery.c. Please note that we're trying to reduce the xlog.c file size apart from centralizing backup-related code.

> > I'm okay either way.
> >
> > Please see the attached v8 patch set.
>
> Among all that, CleanupBackupHistory() is different, still it has a
> dependency with some of the archiving pieces..

Is there a problem with that? This function is used solely by backup functions and it happens to use one of the archiving utility functions. Please see the other archiving utility functions being used elsewhere in the code, not only in xlog.c - for instance, KeepFileRestoredFromArchive() and XLogArchiveNotify().

I'm attaching the v9 patch set herewith after rebasing. Please review it further.
--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

From 385f42ebf6b85fd4a0ad6211428f4dd92e53b269 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy
Date: Wed, 26 Oct 2022 05:15:37 +
Subject: [PATCH v9] Add functions for xlogbackup.c to call back into xlog.c

---
 src/backend/access/transam/xlog.c  | 175 -
 src/include/access/xlog.h          |   1 +
 src/include/access/xlog_internal.h |   8 ++
 3 files changed, 131 insertions(+), 53 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 8f10effe3a..34260a2434 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8311,9 +8311,7 @@ do_pg_backup_start(const char *backupidstr, bool fast, List **tablespaces,
 	 * runningBackups, to ensure adequate interlocking against
 	 * XLogInsertRecord().
 	 */
-	WALInsertLockAcquireExclusive();
-	XLogCtl->Insert.runningBackups++;
-	WALInsertLockRelease();
+	XLogBackupSetRunning();
 
 	/*
 	 * Ensure we decrement runningBackups if we fail below. NB -- for this to
@@ -8383,12 +8381,8 @@ do_pg_backup_start(const char *backupidstr, bool fast, List **tablespaces,
 	 * to restore starting from the checkpoint is precisely the REDO
 	 * pointer.
 	 */
-	LWLockAcquire(ControlFileLock, LW_SHARED);
-	state->checkpointloc = ControlFile->checkPoint;
-	state->startpoint = ControlFile->checkPointCopy.redo;
-	state->starttli = ControlFile->checkPointCopy.ThisTimeLineID;
-	checkpointfpw = ControlFile->checkPointCopy.fullPageWrites;
-	LWLockRelease(ControlFileLock);
+	GetCheckpointLocation(&state->checkpointloc, &state->startpoint,
+						  &state->starttli, &checkpointfpw);
 
 	if (backup_started_in_recovery)
 	{
@@ -8399,9 +8393,7 @@ do_pg_backup_start(const char *backupidstr, bool fast, List **tablespaces,
 		 * (i.e., since last restartpoint used as backup starting
 		 * checkpoint) contain full-page writes.
 		 */
-		SpinLockAcquire(&XLogCtl->info_lck);
-		recptr = XLogCtl->lastFpwDisableRecPtr;
-		SpinLockRelease(&XLogCtl->info_lck);
+		recptr = XLogGetLastFPWDisableRecptr();
 
 		if (!checkpointfpw || state->startpoint <= recptr)
 			ereport(ERROR,
@@ -8434,13 +8426,7 @@ do_pg_backup_start(const char *backupidstr, bool fast, List **tablespaces,
 		 * taking a checkpoint right after another is not that expensive
 		 * either because only few buffers have been dirtied yet.
 		 */
-		WALInsertLockAcquireExclusive();
-		if (XLogCtl->Insert.lastBackupStart < state->startpoint)
-		{
-			XLogCtl->Insert.lastBackupStart = state->startpoint;
-			gotUniqueStartpoint = true;
-		}
-		WALInsertLockRelease();
+		gotUniqueStartpoint = XLogBackupSetLastStart(state->startpoint);
 	} while (!gotUniqueStartpoint);
 
 	/*
@@ -8549,6 +8535,15 @@ get_backup_status(void)
 	return sessionBackupState;
 }
 
+/*
+ * Utility routine to reset the session-level status of a backup running.
+ */
+void
+reset_backup_status(void)
+{
+	sessionBackupState = SESSION_BACKUP_NONE;
+}
+
 /*
  * do_pg_backup_stop
  *
@@ -8590,33 +8585,16 @@ do_pg_backup_stop(BackupState *state, bool waitforarchive)
 			 errhint("wal_level must be set to \"replica\" or