Re: Tablesync early exit
On Tue, Apr 5, 2022 at 9:37 AM Peter Smith wrote: > > On Sat, Apr 2, 2022 at 5:17 PM Amit Kapila wrote: > > > > On Fri, Apr 1, 2022 at 1:52 PM Peter Smith wrote: > > > > > > On Wed, Mar 16, 2022 at 4:07 PM Amit Kapila > > > wrote: > > > > > > I think the STATE_CATCHUP guarantees the apply worker must have > > > received (or tried to receive) a message. See the previous answer. > > > > > > > Sorry, I intend to say till the sync worker has received any message. > > The point is that LSN till where the copy has finished might actually > > be later than some of the in-progress transactions on the server. It > > may not be a good idea to blindly skip those changes if the apply > > worker has already received those changes (say via a 'streaming' > > mode). Today, all such changes would be written to the file and > > applied at commit time but tomorrow, we can have an implementation > > where we can apply such changes (via some background worker) by > > skipping changes related to the table for which the table-sync worker > > is in-progress. Now, in such a scenario, unless, we allow the table > > sync worker to process more messages, we will end up losing some > > changes for that particular table. > > > > As per my understanding, this is safe as per the current code but it > > can't be guaranteed for future implementations and the amount of extra > > work is additional work to receive the messages for one transaction. I > > still don't think that it is a good idea to pursue this patch. > > IIUC you are saying that my patch is good today, but it may cause > problems in a hypothetical future if the rest of the replication logic > is implemented differently. > The approach I have alluded to above is already proposed earlier on -hackers [1] to make streaming transactions perform better. So, it is not completely hypothetical. [1] - https://www.postgresql.org/message-id/8eda5118-2dd0-79a1-4fe9-eec7e334de17%40postgrespro.ru -- With Regards, Amit Kapila.
Re: logical decoding and replication of sequences
On Sat, Apr 2, 2022 at 5:47 AM Tomas Vondra wrote: > > On 4/1/22 17:02, Tomas Vondra wrote: > > So, I investigated this a bit more, and I wrote a couple test_decoding > isolation tests (patch attached) demonstrating the issue. Actually, I > should say "issues" because it's a bit worse than you described ... > > The whole problem is in this chunk of code in sequence_decode(): > > > /* Skip the change if already processed (per the snapshot). */ > if (transactional && > !SnapBuildProcessChange(builder, xid, buf->origptr)) > return; > else if (!transactional && >(SnapBuildCurrentState(builder) != SNAPBUILD_CONSISTENT || > SnapBuildXactNeedsSkip(builder, buf->origptr))) > return; > > /* Queue the increment (or send immediately if not transactional). */ > snapshot = SnapBuildGetOrBuildSnapshot(builder, xid); > ReorderBufferQueueSequence(ctx->reorder, xid, snapshot, buf->endptr, > origin_id, target_node, transactional, > xlrec->created, tuplebuf); > > With the script you described, the increment is non-transactional, so we > end up in the second branch, return and thus discard the increment. > > But it's also possible the change is transactional, which can only > trigger the first branch. But it does not, so we start building the > snapshot. But the first thing SnapBuildGetOrBuildSnapshot does is > >Assert(builder->state == SNAPBUILD_CONSISTENT); > > and we're still not in a consistent snapshot, so it just crashes and > burn :-( > > The sequences.spec file has two definitions of s2restart step, one empty > (resulting in non-transactional change), one with ALTER SEQUENCE (which > means the change will be transactional). > > > The really "funny" thing is this is not new code - this is an exact copy > from logicalmsg_decode(), and logical messages have all those issues > too. We may discard some messages, trigger the same Assert, etc. There's > a messages2.spec demonstrating this (s2message step defines whether the > message is transactional or not). 
It seems to me that the Assert in SnapBuildGetOrBuildSnapshot() is wrong. It is required only for non-transactional logical messages. For transactional messages, we decide at commit time whether the snapshot has reached a consistent state and then decide whether to skip the entire transaction. So, a possible fix for the Assert could be to pass an additional parameter 'transactional' to SnapBuildGetOrBuildSnapshot() and then assert only when it is false. I have also checked the development thread for this work, and the Assert appears to have been introduced for the non-transactional case only. See email [1]: this new function and its Assert were for the non-transactional case, and the problem got introduced later while rearranging the code. Now, for the non-transactional cases, I am not sure there is a one-to-one mapping with the sequence case. The way sequences are dealt with on the subscriber side (first we copy the initial data and then replicate the incremental changes) looks more like the way we deal with a table and its incremental changes. There is some commonality with non-transactional messages w.r.t. the case where we want sequence changes to be sent even on rollbacks unless some DDL has happened for them, but looking at the overall solution it doesn't appear that we can treat sequences the same way as messages. I think this is the reason we are facing the other problems w.r.t. syncing sequences to subscribers, including the problem reported by Sawada-san yesterday. Now, the particular case where we won't send a non-transactional logical message unless the snapshot is consistent could be considered intended behavior and perhaps just documented better. I am not very sure about this, as there is no in-core example of how sync for these messages happens, but if someone outside core wants a different behavior and presents the case then we can probably try to enhance it. 
I feel the same is not true for sequences because it can cause the replica (subscriber) to go out of sync with the master (publisher). > So I guess we need to fix both places, perhaps in a similar way. It depends, but I think for logical messages we should do the minimal fix required for the Asserts and probably document the current behavior a bit better, unless we think there is a case for making them behave similarly to sequences. [1] - https://www.postgresql.org/message-id/56D4B3AD.5000207%402ndquadrant.com -- With Regards, Amit Kapila.
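For readers skimming the thread, the skip logic Tomas quoted from sequence_decode(), together with the proposed Assert fix (check consistency only for non-transactional changes), can be modeled in a few lines. This is a toy Python sketch for illustration only; the SNAPBUILD_* states and the function arguments are stand-ins for the C code in snapbuild.c/decode.c, not a PostgreSQL API.

```python
# Toy model of the two skip branches in sequence_decode(), plus the
# proposed fix: assert snapshot consistency only for non-transactional
# changes, since transactional ones are judged at commit time.

SNAPBUILD_START, SNAPBUILD_BUILDING, SNAPBUILD_FULL, SNAPBUILD_CONSISTENT = range(4)

def decode_sequence(builder_state, transactional, already_processed, needs_skip):
    """Return 'skip' or 'queue', mirroring the two branches quoted above."""
    if transactional and already_processed:
        return "skip"            # change already covered by the snapshot
    if not transactional and (builder_state != SNAPBUILD_CONSISTENT or needs_skip):
        return "skip"            # discard: no consistent snapshot yet
    # Proposed fix from the discussion: the consistency assertion applies
    # only to non-transactional changes.  The unconditional Assert in
    # SnapBuildGetOrBuildSnapshot() is what crashes today.
    if not transactional:
        assert builder_state == SNAPBUILD_CONSISTENT
    return "queue"

print(decode_sequence(SNAPBUILD_BUILDING, transactional=False,
                      already_processed=False, needs_skip=False))   # skip
print(decode_sequence(SNAPBUILD_BUILDING, transactional=True,
                      already_processed=False, needs_skip=False))   # queue, no Assert failure
```

With the current code's unconditional Assert, the second call would instead hit the `builder->state == SNAPBUILD_CONSISTENT` assertion and crash, which is the failure mode the isolation tests demonstrate.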
Re: Tablesync early exit
On Sat, Apr 2, 2022 at 5:17 PM Amit Kapila wrote: > > On Fri, Apr 1, 2022 at 1:52 PM Peter Smith wrote: > > > > On Wed, Mar 16, 2022 at 4:07 PM Amit Kapila wrote: > > > > I think the STATE_CATCHUP guarantees the apply worker must have > > received (or tried to receive) a message. See the previous answer. > > > > Sorry, I intend to say till the sync worker has received any message. > The point is that LSN till where the copy has finished might actually > be later than some of the in-progress transactions on the server. It > may not be a good idea to blindly skip those changes if the apply > worker has already received those changes (say via a 'streaming' > mode). Today, all such changes would be written to the file and > applied at commit time but tomorrow, we can have an implementation > where we can apply such changes (via some background worker) by > skipping changes related to the table for which the table-sync worker > is in-progress. Now, in such a scenario, unless, we allow the table > sync worker to process more messages, we will end up losing some > changes for that particular table. > > As per my understanding, this is safe as per the current code but it > can't be guaranteed for future implementations and the amount of extra > work is additional work to receive the messages for one transaction. I > still don't think that it is a good idea to pursue this patch. IIUC you are saying that my patch is good today, but it may cause problems in a hypothetical future if the rest of the replication logic is implemented differently. Anyway, it seems there is no chance of this getting committed, so it is time for me to stop flogging this dead horse. I will remove this from the CF. -- Kind Regards, Peter Smith Fujitsu Australia
Re: Showing I/O timings spent reading/writing temp buffers in EXPLAIN
On Tue, Apr 05, 2022 at 10:40:04AM +0900, Masahiko Sawada wrote: > On Tue, Apr 5, 2022 at 1:31 AM Julien Rouhaud wrote: > > > > Yes. In normal circumstances it shouldn't need a lot of time to do that, but > > I'm not so sure with e.g. network filesystems. I'm not strongly in favor of > > counting it, especially since smgrextend doesn't either. > > Good point. I think that adding a new place to track I/O timing can be > a separate patch so probably we can work on it for PG16 or later. Agreed. > I've attached updated patches, please review them. It looks good to me, just one minor thing in 002:

@@ -183,8 +184,10 @@ typedef struct Counters
 	int64		local_blks_written;	/* # of local disk blocks written */
 	int64		temp_blks_read;		/* # of temp blocks read */
 	int64		temp_blks_written;	/* # of temp blocks written */
-	double		blk_read_time;		/* time spent reading, in msec */
-	double		blk_write_time;		/* time spent writing, in msec */
+	double		blk_read_time;		/* time spent reading blocks, in msec */
+	double		blk_write_time;		/* time spent writing blocks, in msec */
+	double		temp_blk_read_time;	/* time spent reading temp blocks, in msec */
+	double		temp_blk_write_time;	/* time spent writing temp blocks, in msec */

Maybe the comments should say "data file blocks" and "temp file blocks" respectively. This is a minor detail and the rest of the patch looks good to me, so I'm marking the patch as Ready for Committer!
Re: Skipping logical replication transactions on subscriber side
On Tue, Apr 5, 2022 at 10:46 AM Noah Misch wrote: > > On Tue, Apr 05, 2022 at 10:13:06AM +0900, Masahiko Sawada wrote: > > On Tue, Apr 5, 2022 at 9:21 AM Noah Misch wrote: > > > On Mon, Apr 04, 2022 at 06:55:45PM +0900, Masahiko Sawada wrote: > > > > On Mon, Apr 4, 2022 at 3:26 PM Noah Misch wrote: > > > > > On Mon, Apr 04, 2022 at 08:20:08AM +0530, Amit Kapila wrote: > > > > > > How about a comment like: "It has to be kept at 8-byte alignment > > > > > > boundary so as to be accessed directly via C struct as it uses > > > > > > TYPALIGN_DOUBLE for storage which has 4-byte alignment on platforms > > > > > > like AIX."? Can you please suggest a better comment if you don't > > > > > > like > > > > > > this one? > > > > > > > > > > I'd write it like this, though I'm not sure it's an improvement on > > > > > your words: > > > > > > > > > > When ALIGNOF_DOUBLE==4 (e.g. AIX), the C ABI may impose 8-byte > > > > > alignment on > > > > > some of the C types that correspond to TYPALIGN_DOUBLE SQL types. > > > > > To ensure > > > > > catalog C struct layout matches catalog tuple layout, arrange for > > > > > the tuple > > > > > offset of each fixed-width, attalign='d' catalog column to be > > > > > divisible by 8 > > > > > unconditionally. Keep such columns before the first NameData > > > > > column of the > > > > > catalog, since packagers can override NAMEDATALEN to an odd number. > > > > > > > > Thanks! > > > > > > > > > > > > > > The best place for such a comment would be in one of > > > > > src/test/regress/sql/*sanity*.sql, next to a test written to detect > > > > > new > > > > > violations. > > > > > > > > Agreed. > > > > > > > > IIUC in the new test, we would need a new SQL function to calculate > > > > the offset of catalog columns including padding, is that right? Or do > > > > you have an idea to do that by using existing functionality? 
> > > Something like this:
> > >
> > > select
> > >     attrelid::regclass,
> > >     attname,
> > >     array(select typname
> > >           from pg_type t join pg_attribute pa on t.oid = pa.atttypid
> > >           where pa.attrelid = a.attrelid and pa.attnum > 0 and pa.attnum < a.attnum
> > >           order by pa.attnum) AS types_before,
> > >     (select sum(attlen)
> > >      from pg_type t join pg_attribute pa on t.oid = pa.atttypid
> > >      where pa.attrelid = a.attrelid and pa.attnum > 0 and pa.attnum < a.attnum) AS len_before
> > > from pg_attribute a
> > > join pg_class c on c.oid = attrelid
> > > where attalign = 'd' and relkind = 'r' and attnotnull and attlen <> -1
> > > order by attrelid::regclass::text, attnum;
> > >
> > >     attrelid     │   attname    │                types_before                 │ len_before
> > > ─────────────────┼──────────────┼─────────────────────────────────────────────┼────────────
> > >  pg_sequence     │ seqstart     │ {oid,oid}                                   │          8
> > >  pg_sequence     │ seqincrement │ {oid,oid,int8}                              │         16
> > >  pg_sequence     │ seqmax       │ {oid,oid,int8,int8}                         │         24
> > >  pg_sequence     │ seqmin       │ {oid,oid,int8,int8,int8}                    │         32
> > >  pg_sequence     │ seqcache     │ {oid,oid,int8,int8,int8,int8}               │         40
> > >  pg_subscription │ subskiplsn   │ {oid,oid,name,oid,bool,bool,bool,char,bool} │         81
> > > (6 rows)
> > >
> > > That doesn't count padding, but hazardous column changes will cause a diff in
> > > the output.
> >
> > Yes, in this case, we can detect the violated column order even
> > without considering padding. On the other hand, I think this
> > calculation could not detect some patterns of order. For instance,
> > suppose the column order is {oid, bool, bool, oid, bool, bool, oid,
> > int8}, the len_before is 16 but offset of int8 column including
> > padding is 20 on ALIGNOF_DOUBLE==4 environment.
>
> Correct. Feel free to make it more precise. If you do want to add a
> function, it could be a regress.c function rather than an always-installed
> part of PostgreSQL. Again, getting the buildfarm green is a priority; we can
> always add tests later.

Agreed. 
I'll update and submit the patch as soon as possible. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
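To make the padding hazard in Sawada-san's {oid, bool, bool, oid, bool, bool, oid, int8} example concrete, here is a small Python sketch that lays out column offsets the way a C struct would, given per-type size and alignment. The type sizes and alignments are the usual ones from the discussion (oid = 4 bytes, bool = 1 byte, int8 = 8 bytes); this is only an illustration, not the regress.c function being proposed.

```python
# Compute C-struct-style column offsets, rounding each column's offset
# up to its type's alignment.  The int8 column's alignment is taken as a
# parameter so we can compare ALIGNOF_DOUBLE==4 (AIX-like) vs ==8.

SIZES = {"oid": 4, "bool": 1, "int8": 8}

def offsets(cols, alignof_double):
    align = {"oid": 4, "bool": 1, "int8": alignof_double}
    off, result = 0, []
    for t in cols:
        off = -(-off // align[t]) * align[t]   # round up to alignment boundary
        result.append(off)
        off += SIZES[t]
    return result

cols = ["oid", "bool", "bool", "oid", "bool", "bool", "oid", "int8"]
print(sum(SIZES[t] for t in cols[:-1]))      # len_before of the int8 column: 16
print(offsets(cols, alignof_double=4)[-1])   # actual offset with 4-byte double alignment: 20
print(offsets(cols, alignof_double=8)[-1])   # offset with 8-byte double alignment: 24
```

This shows why summing attlen alone (len_before = 16) cannot detect the misalignment: the real tuple offset of the int8 column is 20 on an ALIGNOF_DOUBLE==4 platform, which is not divisible by 8.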
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
On Mon, Apr 4, 2022 at 8:18 PM Andres Freund wrote: > The remaining patch are the warnings in vac_update_relstats(), correct? I > guess one could argue they should be LOG rather than WARNING, but I find the > project stance on that pretty impractical. So warning's ok with me. Right. The reason I used WARNINGs was because it matches vaguely related WARNINGs in vac_update_relstats()'s sibling function, vacuum_set_xid_limits(). > Not sure why you used errmsg_internal()? The usual reason for using errmsg_internal(), I suppose. I tend to do that with corruption related messages on the grounds that they're usually highly obscure issues that are (by definition) never supposed to happen. The only thing that a user can be expected to do with the information from the message is to report it to -bugs, or find some other similar report. -- Peter Geoghegan
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
Hi, On 2022-04-04 19:32:13 -0700, Peter Geoghegan wrote: > On Fri, Apr 1, 2022 at 10:54 AM Peter Geoghegan wrote: > > I also refined the WARNING patch in v15. It now actually issues > > WARNINGs (rather than PANICs, which were just a temporary debugging > > measure in v14). > > Going to commit this remaining patch tomorrow, barring objections. The remaining patch are the warnings in vac_update_relstats(), correct? I guess one could argue they should be LOG rather than WARNING, but I find the project stance on that pretty impractical. So warning's ok with me. Not sure why you used errmsg_internal()? Otherwise LGTM. Greetings, Andres Freund
Re: Window Function "Run Conditions"
On 2022-04-05 12:04:18 +1200, David Rowley wrote: > > This is afaics slightly cheaper than referencing a variable in a slot. > > I guess you must mean cheaper because it means there will be no > EEOP_*_FETCHSOME step? Otherwise it seems a fairly similar amount of > work. That, and slightly fewer indirections for accessing values IIRC.
Re: Handle infinite recursion in logical replication setup
On Tue, Apr 5, 2022 at 12:36 PM Amit Kapila wrote: > > On Tue, Apr 5, 2022 at 6:21 AM Peter Smith wrote: > > > > Here are my comments for the latest patch v6-0001. > > > > (I will post my v6-0002 review comments separately) > > > > PATCH v6-0001 comments > > == > > > > 1.1 General - Option name > > > > I still feel like the option name is not ideal. Unfortunately, this is > > important because any name change would impact lots of these patch > > files and docs, struct members etc. > > > > It was originally called "local_only", but I thought that as a > > SUBSCRIPTION option that was confusing because "local" means local to > > what? Really it is local to the publisher, not local to the > > subscriber, so that name seemed misleading. > > > > Then I suggested "publish_local_only". Although that resolved the > > ambiguity problem, other people thought it seemed odd to have the > > "publish" prefix for a subscription-side option. > > > > So now it is changed again to "subscribe_local_only" -- It's getting > > better but still, it is implied that the "local" means local to the > > publisher except there is nothing in the option name really to convey > > that meaning. IMO we here all understand the meaning of this option > > mostly by familiarity with this discussion thread, but I think a user > > coming to this for the first time will still be confused by the name. > > > > Below are some other name ideas. Maybe they are not improvements, but > > it might help other people to come up with something better. > > > > subscribe_publocal_only = true/false > > origin = pub_only/any > > adjacent_data_only = true/false > > direct_subscriptions_only = true/false > > ... > > > > I think local_only with a description should be okay considering other > current options. Some of the options like slot_name, binary, etc. are > publisher-specific which becomes clear only after reading the > description. I am not sure adding pub*/sub* to it is much helpful. 
So, > +1 to a simple boolean like 'local_only', 'data_local', or something > like that. +1 to use a simple boolean option. Thanks for the examples of the other options which are also really publisher-specific. Seen in that light, and with a clear description, probably the original "local_only" was fine after all. -- Kind Regards, Peter Smith. Fujitsu Australia
Re: Temporary tables versus wraparound... again
So here's an updated patch. I had to add a public method to multixact.c to expose the locally calculated OldestMultiXactId. It's possible we could use something even tighter (like the current next mxid, since we're about to commit) but I didn't see a point in going further and it would have become more complex. I also added a branch in heapam_handler.c in ..._set_new_filenode() for temporary tables. It feels like a free win and it's more consistent. I'm not 100% sure about the tableam abstraction -- it's possible all of this change should have happened somewhere in heapam_handler? I don't think so, but it does feel weird to be touching it and also doing the same thing elsewhere. I think this has addressed all the questions now.

From 2e5d2c47288d27a620af09214c82fd66f61fb605 Mon Sep 17 00:00:00 2001
From: Greg Stark
Date: Thu, 31 Mar 2022 15:48:38 -0400
Subject: [PATCH v6 1/3] Add warnings when old temporary tables are found to
 still be in use during autovacuum.

Long lived sessions using temporary tables are required to vacuum them themselves. For the warning to be useful, modify checkTempNamespaceStatus to return the backend pid using the namespace, so that we can inform the super-user which pid to terminate. Otherwise it's quite tricky to determine as a user. Rename the function to avoid an incompatible ABI break. 
---
 src/backend/access/transam/varsup.c | 12 ---
 src/backend/catalog/namespace.c     |  9 +++--
 src/backend/postmaster/autovacuum.c | 52 ++---
 src/include/catalog/namespace.h     |  4 +--
 4 files changed, 57 insertions(+), 20 deletions(-)

diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index 748120a012..8b29573e9f 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -129,14 +129,16 @@ GetNewTransactionId(bool isSubXact)
 					 errmsg("database is not accepting commands to avoid wraparound data loss in database \"%s\"",
 							oldest_datname),
 					 errhint("Stop the postmaster and vacuum that database in single-user mode.\n"
-							 "You might also need to commit or roll back old prepared transactions, or drop stale replication slots.")));
+							 "You might also need to commit or roll back old prepared transactions,\n"
+							 "drop temporary tables, or drop stale replication slots.")));
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
 					 errmsg("database is not accepting commands to avoid wraparound data loss in database with OID %u",
 							oldest_datoid),
 					 errhint("Stop the postmaster and vacuum that database in single-user mode.\n"
-							 "You might also need to commit or roll back old prepared transactions, or drop stale replication slots.")));
+							 "You might also need to commit or roll back old prepared transactions,\n"
+							 "drop temporary tables, or drop stale replication slots.")));
 	}
 	else if (TransactionIdFollowsOrEquals(xid, xidWarnLimit))
 	{
@@ -149,14 +151,16 @@ GetNewTransactionId(bool isSubXact)
 						oldest_datname,
 						xidWrapLimit - xid),
 				 errhint("To avoid a database shutdown, execute a database-wide VACUUM in that database.\n"
-						 "You might also need to commit or roll back old prepared transactions, or drop stale replication slots.")));
+						 "You might also need to commit or roll back old prepared transactions,\n"
+						 "drop temporary tables, or drop stale replication slots.")));
 		else
 			ereport(WARNING,
 					(errmsg("database with OID %u must be vacuumed within %u transactions",
 							oldest_datoid,
 							xidWrapLimit - xid),
 					 errhint("To avoid a database shutdown, execute a database-wide VACUUM in that database.\n"
-							 "You might also need to commit or roll back old prepared transactions, or drop stale replication slots.")));
+							 "You might also need to commit or roll back old prepared transactions,\n"
+							 "drop temporary tables, or drop stale replication slots.")));
 	}

 	/* Re-acquire lock and start over */

diff --git a/src/backend/catalog/namespace.c b/src/backend/catalog/namespace.c
index fafb9349cc..c1fd3ced95 100644
--- a/src/backend/catalog/namespace.c
+++ b/src/backend/catalog/namespace.c
@@ -3272,15 +3272,18 @@ isOtherTempNamespace(Oid namespaceId)
 /*
  * checkTempNamespaceStatus - is the given namespace owned and actively used
- * by a backend?
+ * by a backend?  Optionally return the pid of the owning backend if there is
+ * one.  Returned pid is only meaningful when TEMP_NAMESPACE_IN_USE but note
+ * below about race conditions.
  *
  * Note: this can be used while scanning relations in pg_class to detect
  * orphaned temporary tables or namespaces with a backend connected to a
  * given database.  The result may be out of date quickly, so the caller
  * must be careful how to handle this information.
+ *
  */
 TempNamespaceStatus
-checkTempNamespaceStatus(Oid namespaceId)
+checkTempNamespaceStatusAndPid(Oid
Re: Handle infinite recursion in logical replication setup
Here are my comments for the latest patch v6-0002. PATCH v6-0002 comments == 2.1 General - should this be an independent patch? In many ways, I think most of this patch is unrelated to the other "local_only" patch (v6-0001). For example, IIUC even in the current HEAD, we could consider it to be a user error if multiple SUBSCRIPTIONS or multiple PUBLICATIONS of the same SUBSCRIPTION are replicating to the same TABLE on the same node and using "copy_data = on". So I think it would be ok to throw an ERROR if such a copy_data clash is detected, and then the user will have to change to use "copy_data = off" for some/all of them to avoid data duplication. The "local_only" option only adds some small logic to this new ERROR, but it's not really a prerequisite at all. e.g. this whole ERROR part of the patch can be a separate thread. ~~~ 2.2 General - can we remove the "force" enum? Now, because I consider the clashing "copy_data = on" ERROR to be a user error, I think that is something that the user can already take care of themselves just using the copy_data = off. I did not really like the modifying of the "copy_data" option from just boolean to some kind of hybrid boolean + "force". a) It smells a bit off to me. IMO replication is supposed to end up with the same (replicated) data on the standby machine but this "force" mode seems to be just helping the user to break that concept and say - "I know what I'm doing, and I don't care if I get lots of duplicated data in the replica table - just let me do it"... b) It also somehow feels like the new "force" was introduced mostly to make the code ERROR handling implementation simpler, rather than to make the life of the end-user better. Yes, if force is removed maybe the copy-clash-detection-code will need to be internally quite more complex than it is now, but that is how it should be, instead of putting extra burden on the user (e.g. by complicating the PG docs and giving them yet more alternatives to configure). 
I think any clashing copy_data options really is a user error, but also I think the current boolean copy_data true/false already gives the user a way to fix it. c) Actually, your new error hint messages are similar to my perspective: They already say "Use CREATE/ALTER SUBSCRIPTION with copy_data = off or force". All I am saying is remove the "force", and the user can still fix the problem just by using "copy_data = off" appropriately. == So (from above) I am not much in favour of the copy_data becoming a hybrid enum and using "force", yet that is what most of this patch is implementing. Anyway, the remainder of my review comments here are for the code in its current form. Maybe if "force" can be removed most of the following comments end up being redundant. == 2.3 Commit message - wording This message is difficult to understand. I think that the long sentence "Now when user is trying..." can be broken into more manageable parts. This part "and throw an error so that user can handle the initial copy data." also seemed a bit vague. ~~~ 2.4 Commit message - more functions "This patch does couple of things:" IIUC, there seems a third thing implemented by this patch but not described by the comment. I think it also adds support for ALTER SUBSCRIPTION SET PUBLICATION WITH (subscribe_local_only) ~~~ 2.5 doc/src/sgml/ref/create_subscription.sgml - wording @@ -161,6 +161,13 @@ CREATE SUBSCRIPTION subscription_namefalse. + + If the tables in the publication were also subscribing to the data in + the publisher from other publishers, it will affect the + CREATE SUBSCRIPTION based on the value specified + for copy_data option. Refer to the + for details. + Is there a simpler way to express all that? SUGGESTION There is some interaction between the option "subscribe_local_only" and the option "copy_data". Refer to the for details. 
~~~ 2.6 doc/src/sgml/ref/create_subscription.sgml - whitespace @@ -213,18 +220,28 @@ CREATE SUBSCRIPTION subscription_name -copy_data (boolean) +copy_data (enum) Specifies whether to copy pre-existing data in the publications - that are being subscribed to when the replication starts. - The default is true. + that are being subscribed to when the replication starts. This + parameter may be either true, + false or force. The default is + true. That last line has trailing whitespace. ~~~ 2.7 doc/src/sgml/ref/create_subscription.sgml - wording + + If the tables in the publication were also subscribing to the data in + the publisher from other publishers, it will affect the + CREATE SUBSCRIPTION based on the value specified + for subscribe_local_only option. Refer to the + for details. + This is similar to review comment #2.5 which I
Re: shared-memory based stats collector - v68
Hi, On 2022-04-04 19:03:13 -0700, David G. Johnston wrote: > > > (if this is true...but given this is an optimization category I'm > > thinking > > > maybe it doesn't actually matter...) > > > > It is true. Not sure what you mean with "optimization category"? > > > > > I mean that distinguishing between stats that are fixed and those that are > variable implies that fixed kinds have a better performance (speed, memory) > characteristic than variable kinds (at least in part due to the presence of > changecount). If fixed kinds did not have a performance benefit then > having the variable kind implementation simply handle fixed kinds as well > (using the common struct header and storage in a hash table) would make the > implementation simpler since all statistics would report through the same > API. Yes, fixed-numbered stats are faster. > Coming back to this: > """ > + /* cluster-scoped object stats having a variable number of entries */ > + PGSTAT_KIND_REPLSLOT = 1, /* per-slot statistics */ > + PGSTAT_KIND_SUBSCRIPTION, /* per-subscription statistics */ > + PGSTAT_KIND_DATABASE, /* database-wide statistics */ (I moved this to 3rd > spot to be closer to the database-scoped options) > + > + /* database-scoped object stats having a variable number of entries */ > + PGSTAT_KIND_RELATION, /* per-table statistics */ > + PGSTAT_KIND_FUNCTION, /* per-function statistics */ > + > + /* cluster-scoped stats having a fixed number of entries */ (maybe these > should go first, the variable following?) > + PGSTAT_KIND_ARCHIVER, > + PGSTAT_KIND_BGWRITER, > + PGSTAT_KIND_CHECKPOINTER, > + PGSTAT_KIND_SLRU, > + PGSTAT_KIND_WAL, > """ > > I see three "KIND_GROUP" categories here: > PGSTAT_KIND_CLUSTER (open to a different word here though...) 
> PGSTAT_KIND_DATABASE (we seem to agree on this above) > PGSTAT_KIND_GLOBAL (already used in the code) > > This single enum can replace the two booleans that, in combination, would > define 4 unique groups (of which only three are interesting - > database+fixed doesn't seem interesting and so is not given a name/value > here). The more I think about it, the less I think a split like that makes sense. The difference between PGSTAT_KIND_CLUSTER / PGSTAT_KIND_DATABASE is tiny. Nearly all code just deals with both together. I think all this is going to achieve is to making code more complicated. There is a *single* non-assert use of accessed_across_databases and now a single assertion involving it. What would having PGSTAT_KIND_CLUSTER and PGSTAT_KIND_DATABASE achieve? Greetings, Andres Freund
Re: Handle infinite recursion in logical replication setup
On Tue, Apr 5, 2022 at 6:21 AM Peter Smith wrote: > > Here are my comments for the latest patch v6-0001. > > (I will post my v6-0002 review comments separately) > > PATCH v6-0001 comments > == > > 1.1 General - Option name > > I still feel like the option name is not ideal. Unfortunately, this is > important because any name change would impact lots of these patch > files and docs, struct members etc. > > It was originally called "local_only", but I thought that as a > SUBSCRIPTION option that was confusing because "local" means local to > what? Really it is local to the publisher, not local to the > subscriber, so that name seemed misleading. > > Then I suggested "publish_local_only". Although that resolved the > ambiguity problem, other people thought it seemed odd to have the > "publish" prefix for a subscription-side option. > > So now it is changed again to "subscribe_local_only" -- It's getting > better but still, it is implied that the "local" means local to the > publisher except there is nothing in the option name really to convey > that meaning. IMO we here all understand the meaning of this option > mostly by familiarity with this discussion thread, but I think a user > coming to this for the first time will still be confused by the name. > > Below are some other name ideas. Maybe they are not improvements, but > it might help other people to come up with something better. > > subscribe_publocal_only = true/false > origin = pub_only/any > adjacent_data_only = true/false > direct_subscriptions_only = true/false > ... > I think local_only with a description should be okay considering other current options. Some of the options like slot_name, binary, etc. are publisher-specific which becomes clear only after reading the description. I am not sure adding pub*/sub* to it is much helpful. So, +1 to a simple boolean like 'local_only', 'data_local', or something like that. -- With Regards, Amit Kapila.
Re: Window Function "Run Conditions"
> > > I'm pretty happy with this now. If anyone wants to have a look at > this, can they do so or let me know they're going to within the next > 24 hours. Otherwise I plan to move into commit mode with it. > > I just came to the office today to double check this patch. I probably can finish it very soon. But if you are willing to commit it sooner, I am totally fine with it. -- Best Regards Andy Fan
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
On Fri, Apr 1, 2022 at 10:54 AM Peter Geoghegan wrote: > I also refined the WARNING patch in v15. It now actually issues > WARNINGs (rather than PANICs, which were just a temporary debugging > measure in v14). Going to commit this remaining patch tomorrow, barring objections. -- Peter Geoghegan
Re: generic plans and "initial" pruning
On Mon, Apr 4, 2022 at 9:55 PM Amit Langote wrote: > On Sun, Apr 3, 2022 at 8:33 PM Alvaro Herrera wrote: > > I think the names ExecCreatePartitionPruneState and > > ExecInitPartitionPruning are too confusingly similar. Maybe the former > > should be renamed to somehow make it clear that it is a subroutine for > > the latter. > > Ah, yes. I've taken out the "Exec" from the former. While at it, maybe it's better to rename ExecInitPruningContext() to InitPartitionPruneContext(), which I've done in the attached updated patch. -- Amit Langote EDB: http://www.enterprisedb.com v10-0001-Some-refactoring-of-runtime-pruning-code.patch Description: Binary data
Re: Lowering the ever-growing heap->pd_lower
On Tue, Feb 15, 2022 at 10:48 AM Matthias van de Meent wrote: > # Truncating lp_array during pruning > === > > The following adversarial load grows the heap relation, but with the > patch the relation keeps its size. The point being that HOT updates > can temporarily inflate the LP array significantly, and this patch can > actively combat that issue while we're waiting for the 2nd pass of > vacuum to arrive. I am sympathetic to the idea that giving the system a more accurate picture of how much free space is available on each heap page is an intrinsic good. This might help us in a few different areas. For example, the FSM cares about relatively small differences in available free space *among* heap pages that are "in competition" in RelationGetBufferForTuple(). Plus we have a heuristic based on PageGetHeapFreeSpace() in heap_page_prune_opt() to consider. We should definitely increase MaxHeapTuplesPerPage before too long, for a variety of reasons that I have talked about in the past. Its current value is 291 on all mainstream platforms, a value that's derived from accidental historic details -- which predate HOT. Obviously an increase in MaxHeapTuplesPerPage is likely to make the problem that the patch proposes to solve worse. I lean towards committing the patch now as work in that direction, in fact. It helps that this patch now seems relatively low risk. -- Peter Geoghegan
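The 291 figure above follows from the heap page geometry. As a back-of-envelope sketch (a hand-written model, not pulled from the headers; it assumes the standard 8 kB BLCKSZ build with 8-byte MAXALIGN and the usual 24/23/4-byte sizes for the page header, tuple header, and line pointer):

```python
# Back-of-envelope check of MaxHeapTuplesPerPage, mirroring the shape of
# the definition in access/htup_details.h. Constants are assumptions for
# a standard 8 kB build with 8-byte MAXALIGN.
BLCKSZ = 8192
SIZE_OF_PAGE_HEADER_DATA = 24   # page header up to the line pointer array
SIZEOF_HEAP_TUPLE_HEADER = 23   # fixed part of a heap tuple header
SIZEOF_ITEM_ID_DATA = 4         # one line pointer (lp array entry)

def maxalign(n: int, alignment: int = 8) -> int:
    """Round n up to the next multiple of the platform MAXALIGN."""
    return (n + alignment - 1) & ~(alignment - 1)

max_heap_tuples_per_page = (
    (BLCKSZ - SIZE_OF_PAGE_HEADER_DATA)
    // (maxalign(SIZEOF_HEAP_TUPLE_HEADER) + SIZEOF_ITEM_ID_DATA)
)
print(max_heap_tuples_per_page)  # 291
```

This shows why the value is an accident of the constants: bump BLCKSZ or shrink the assumed per-tuple overhead and the ceiling moves, which is also why raising MaxHeapTuplesPerPage would let the LP array grow further and make the problem the patch targets worse.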
Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes
On Tue, Apr 5, 2022 at 10:24 AM Thomas Munro wrote: > On Tue, Apr 5, 2022 at 2:18 AM Robert Haas wrote: > > I'm not sure that it really matters, but with the idea that I proposed > > it's possible to "save" a pending writeback, if we notice that we're > > accessing the relation with a proper lock after the barrier arrives > > and before we actually try to do the writeback. With this approach we > > throw them out immediately, so they're just gone. I don't mind that > > because I don't think this will happen often enough to matter, and I > > don't think it will be very expensive when it does, but it's something > > to think about. > > The checkpointer never takes heavyweight locks, so the opportunity > you're describing can't arise. Hmm, oh, you probably meant the buffer interlocking in SyncOneBuffer(). It's true that my most recent patch throws away more requests than it could, by doing the level check at the end of the loop over all buffers instead of adding some kind of DropPendingWritebacks() in the barrier handler. I guess I could find a way to improve that, basically checking the level more often instead of at the end, but I don't know if it's worth it; we're still throwing out an arbitrary percentage of writeback requests.
Re: standby recovery fails (tablespace related) (tentative patch and discussion)
At Mon, 4 Apr 2022 21:14:27 +0530, Dilip Kumar wrote in > On Mon, Apr 4, 2022 at 2:25 PM Kyotaro Horiguchi > wrote: > > > > At Mon, 04 Apr 2022 17:29:48 +0900 (JST), Kyotaro Horiguchi > > wrote in > > > I haven't found how the patch caused creation of a relation file that > > > is to be removed soon. However, I find that v19 patch fails by maybe > > > due to some change in Cluster.pm. It takes a bit more time to check > > > that.. > > > > I was a bit away, of course the wal-logged create database interfares > > with the patch here. But I haven't found that why it stops creating > > database directory under pg_tblspc. > > I did not understand what is the exact problem here, but the database > directory and the version file are created under the default > tablespace of the target database. However, other than the default > tablespace of the database, the database directory will be created > along with the smgrcreate() so that we do not create an unnecessary > directory under the tablespace where we do not have any data to be > copied. Thanks. Yeah, I suspected something like that, but I couldn't find a difference in the code I thought was related; I was wrong. I took the wrong steps while trying to reproduce that state and hit the wrong error message. With the correct steps, I can see that Storage/CREATE creates pg_tblspc/. So, if we create a missing tablespace directory, we have no way other than creating it directly in pg_tblspc, which violates the rule that there shouldn't be a real directory in pg_tblspc (when allow_in_place_tablespaces is false). So, I have the following points in my mind for now. - We create the directory, "since we know it is just a tentative state". - Then, check that there is no directory in pg_tblspc on reaching consistency when allow_in_place_tablespaces is false. - Leave the log_invalid_page() mechanism alone, as applying a differential WAL record to a newly created page that should already have existed always results in a corrupt page.
However, while working on it, I found that recovery faces missing tablespace directories *after* reaching consistency. I'm examining that further. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Run pg_amcheck in 002_pg_upgrade.pl and 027_stream_regress.pl?
On Mon, Apr 04, 2022 at 05:39:58PM -0700, Andres Freund wrote: > On my workstation it takes about 2.39s to run pg_amcheck on a regression > database with all thoroughness options enabled. With -j4 it's 0.62s. > > Without more thorough checking it's 1.24s and 0.30s with -j4. Okay. That sounds like an argument to enable that by default, with parallelism. -- Michael signature.asc Description: PGP signature
Re: New Object Access Type hooks
Mark Dilger writes: > On Apr 4, 2022, at 1:47 PM, Tom Lane wrote: >> Perhaps libpq should be trying harder to make those cases look alike, but >> this test is about server behavior not libpq behavior, so I'm inclined >> to just make it lax. > +1. > I've gotten this test failure only a few times in perhaps the last six > months, so if we narrow the opportunity for test failure without closing it > entirely, we're just making the test failures that much harder to diagnose. Done that way. regards, tom lane
Re: PATCH: add "--config-file=" option to pg_rewind
On Sun, Apr 03, 2022 at 09:31:32PM +0200, Michael Banck wrote: > The patch applies, make is ok, make check is ok, pg_rewind TAP tests > are ok. + appendPQExpBufferStr(postgres_cmd, " --config_file="); + appendShellString(postgres_cmd, config_file); Nit. This is a valid option for the postgres command, but I'd rather use a -c switch instead, mostly as a matter of self-documentation. In the tests, the only difference between the modes "archive_cli" and "archive" is the extra option given to the pg_rewind command, and that's a simple redirect to what pg_rewind would choose by default anyway. A more elegant solution would be to have an extra option in RewindTest::run_pg_rewind(), where any caller of the routine can feed extra options to the pg_rewind command. Now, in this case, we are not going to lose any coverage if the existing "archive" mode is changed so that we move postgresql.conf away from the target data folder and just use --config-file by default, no? I would make the choice of simplicity, by giving up on "archive_cli" and using primary-postgresql.conf.tmp as value for --config-file. There could be an argument for using --config-file for all the modes, as well. -- Michael signature.asc Description: PGP signature
Re: shared-memory based stats collector - v68
On Mon, Apr 4, 2022 at 3:44 PM Andres Freund wrote: > Hi, > > On 2022-04-04 15:24:24 -0700, David G. Johnston wrote: > > Replacing the existing assert(!kind->fixed_amount) with > > assert(!kind->accessed_across_databases) produces the same result as the > > latter presently implies the former. > > I wasn't proposing to replace, but to add... > Right, but it seems redundant to have both when one implies the other. But I'm not hard set against it either, though my idea below makes them both obsolete. > > > Now I start to dislike the behavioral aspect of the attribute and would > > rather just name it: kind->is_cluster_scoped (or something else that is > > descriptive of the stat category itself, not how it is used) > > I'm not in love with the name either. But cluster is just a badly > overloaded > word :(. > > system_wide? Or invert it and say: database_scoped? I think I like the > latter. > > I like database_scoped as well...but see my idea below that makes this obsolete. > > > Then reorganize the Kind documentation to note and emphasize these two > > primary descriptors: > > variable, which can be cluster or database scoped > > fixed, which are cluster scoped by definition > > Hm. There's not actually that much difference between cluster/non-cluster > wide > scope for most of the system. I'm not strongly against, but I'm also not > really seeing the benefit. > Not married to it myself, something to come back to when the dust settles. > > (if this is true...but given this is an optimization category I'm > thinking > > maybe it doesn't actually matter...) > > It is true. Not sure what you mean with "optimization category"? > > I mean that distinguishing between stats that are fixed and those that are variable implies that fixed kinds have a better performance (speed, memory) characteristic than variable kinds (at least in part due to the presence of changecount). 
If fixed kinds did not have a performance benefit then having the variable kind implementation simply handle fixed kinds as well (using the common struct header and storage in a hash table) would make the implementation simpler since all statistics would report through the same API. In that world, variability is simply a possibility that not every actual reporter has to use. That improved performance characteristic is what I meant by "optimization category". I question whether we should be publishing "fixed" and "variable" as concrete properties. I'm not presently against the current choice to do so, but as you say above, I'm also not really seeing the benefit. (goes and looks at all the places that use the fixed_amount field...sparking an idea) Coming back to this: """ + /* cluster-scoped object stats having a variable number of entries */ + PGSTAT_KIND_REPLSLOT = 1, /* per-slot statistics */ + PGSTAT_KIND_SUBSCRIPTION, /* per-subscription statistics */ + PGSTAT_KIND_DATABASE, /* database-wide statistics */ (I moved this to 3rd spot to be closer to the database-scoped options) + + /* database-scoped object stats having a variable number of entries */ + PGSTAT_KIND_RELATION, /* per-table statistics */ + PGSTAT_KIND_FUNCTION, /* per-function statistics */ + + /* cluster-scoped stats having a fixed number of entries */ (maybe these should go first, the variable following?) + PGSTAT_KIND_ARCHIVER, + PGSTAT_KIND_BGWRITER, + PGSTAT_KIND_CHECKPOINTER, + PGSTAT_KIND_SLRU, + PGSTAT_KIND_WAL, """ I see three "KIND_GROUP" categories here: PGSTAT_KIND_CLUSTER (open to a different word here though...) PGSTAT_KIND_DATABASE (we seem to agree on this above) PGSTAT_KIND_GLOBAL (already used in the code) This single enum can replace the two booleans that, in combination, would define 4 unique groups (of which only three are interesting - database+fixed doesn't seem interesting and so is not given a name/value here). 
While the succinctness of the booleans has appeal, the need for half of the booleans to end up being negated quickly tarnishes it. With the three groups, every assertion is positive in nature, indicating which of the three groups are handled by the function. While that is probably a few more characters, it seems like an easier read and is less complicated as it has fewer independent parts. At most you OR two kinds together, which is succinct enough I would think. There are no gaps relative to the existing implementation that defines fixed_amount and accessed_across_databases - every call site using either of them can be transformed mechanically. David J.
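The mechanical transformation described above can be sketched briefly. This is an illustration only: the names PgStatKindGroup and group_for are hypothetical, invented here; only the fixed_amount and accessed_across_databases flags come from the patch under discussion, and the assumption that every fixed kind is cluster-wide is taken from the thread:

```python
from enum import Enum, auto

class PgStatKindGroup(Enum):
    """Hypothetical three-way grouping replacing the two booleans."""
    CLUSTER = auto()    # variable-numbered, visible across databases
    DATABASE = auto()   # variable-numbered, scoped to one database
    GLOBAL = auto()     # fixed-numbered, cluster-wide by definition

def group_for(fixed_amount: bool, accessed_across_databases: bool) -> PgStatKindGroup:
    # fixed + database-scoped is the one uninteresting combination,
    # so it is deliberately not given a name.
    if fixed_amount:
        assert accessed_across_databases, "fixed kinds are cluster-wide"
        return PgStatKindGroup.GLOBAL
    return (PgStatKindGroup.CLUSTER if accessed_across_databases
            else PgStatKindGroup.DATABASE)

# Every call site using either boolean maps mechanically onto a group:
print(group_for(False, True))   # PgStatKindGroup.CLUSTER  (e.g. replslot)
print(group_for(False, False))  # PgStatKindGroup.DATABASE (e.g. relation)
print(group_for(True, True))    # PgStatKindGroup.GLOBAL   (e.g. wal)
```

The point of the sketch is the shape of the argument: four boolean combinations collapse into three named groups, and assertions become positive membership checks rather than negations.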
Re: Handle infinite recursion in logical replication setup
On Mon, Apr 4, 2022 at 6:50 PM Ashutosh Bapat wrote: > > I didn't find this in https://commitfest.postgresql.org/37/. Is this > somewhere in the commitfest? I missed adding it earlier, I have added it to commitfest today: https://commitfest.postgresql.org/38/3610/ Regards, Vignesh
Re: Skipping logical replication transactions on subscriber side
On Tue, Apr 05, 2022 at 10:13:06AM +0900, Masahiko Sawada wrote: > On Tue, Apr 5, 2022 at 9:21 AM Noah Misch wrote: > > On Mon, Apr 04, 2022 at 06:55:45PM +0900, Masahiko Sawada wrote: > > > On Mon, Apr 4, 2022 at 3:26 PM Noah Misch wrote: > > > > On Mon, Apr 04, 2022 at 08:20:08AM +0530, Amit Kapila wrote: > > > > > How about a comment like: "It has to be kept at 8-byte alignment > > > > > boundary so as to be accessed directly via C struct as it uses > > > > > TYPALIGN_DOUBLE for storage which has 4-byte alignment on platforms > > > > > like AIX."? Can you please suggest a better comment if you don't like > > > > > this one? > > > > > > > > I'd write it like this, though I'm not sure it's an improvement on your > > > > words: > > > > > > > > When ALIGNOF_DOUBLE==4 (e.g. AIX), the C ABI may impose 8-byte > > > > alignment on > > > > some of the C types that correspond to TYPALIGN_DOUBLE SQL types. To > > > > ensure > > > > catalog C struct layout matches catalog tuple layout, arrange for the > > > > tuple > > > > offset of each fixed-width, attalign='d' catalog column to be > > > > divisible by 8 > > > > unconditionally. Keep such columns before the first NameData column > > > > of the > > > > catalog, since packagers can override NAMEDATALEN to an odd number. > > > > > > Thanks! > > > > > > > > > > > The best place for such a comment would be in one of > > > > src/test/regress/sql/*sanity*.sql, next to a test written to detect new > > > > violations. > > > > > > Agreed. > > > > > > IIUC in the new test, we would need a new SQL function to calculate > > > the offset of catalog columns including padding, is that right? Or do > > > you have an idea to do that by using existing functionality? 
> > > > Something like this: > > > > select > > attrelid::regclass, > > attname, > > array(select typname > > from pg_type t join pg_attribute pa on t.oid = pa.atttypid > > where pa.attrelid = a.attrelid and pa.attnum > 0 and pa.attnum < > > a.attnum order by pa.attnum) AS types_before, > > (select sum(attlen) > >from pg_type t join pg_attribute pa on t.oid = pa.atttypid > >where pa.attrelid = a.attrelid and pa.attnum > 0 and pa.attnum < > > a.attnum) AS len_before > > from pg_attribute a > > join pg_class c on c.oid = attrelid > > where attalign = 'd' and relkind = 'r' and attnotnull and attlen <> -1 > > order by attrelid::regclass::text, attnum; > > attrelid │ attname│types_before > > │ len_before > > ─┼──┼─┼ > > pg_sequence │ seqstart │ {oid,oid} > > │ 8 > > pg_sequence │ seqincrement │ {oid,oid,int8} > > │ 16 > > pg_sequence │ seqmax │ {oid,oid,int8,int8} > > │ 24 > > pg_sequence │ seqmin │ {oid,oid,int8,int8,int8} > > │ 32 > > pg_sequence │ seqcache │ {oid,oid,int8,int8,int8,int8} > > │ 40 > > pg_subscription │ subskiplsn │ > > {oid,oid,name,oid,bool,bool,bool,char,bool} │ 81 > > (6 rows) > > > > That doesn't count padding, but hazardous column changes will cause a diff > > in > > the output. > > Yes, in this case, we can detect the violated column order even > without considering padding. On the other hand, I think this > calculation could not detect some patterns of order. For instance, > suppose the column order is {oid, bool, bool, oid, bool, bool, oid, > int8}, the len_before is 16 but offset of int8 column including > padding is 20 on ALIGNOF_DOUBLE==4 environment. Correct. Feel free to make it more precise. If you do want to add a function, it could be a regress.c function rather than an always-installed part of PostgreSQL. Again, getting the buildfarm green is a priority; we can always add tests later.
Re: Showing I/O timings spent reading/writing temp buffers in EXPLAIN
On Tue, Apr 5, 2022 at 1:31 AM Julien Rouhaud wrote: > > On Tue, Apr 05, 2022 at 12:51:12AM +0900, Masahiko Sawada wrote: > > On Mon, Apr 4, 2022 at 1:30 PM Julien Rouhaud wrote: > > > > > > Hmm, but AFAICS the json format would be stable as the counters are always > > > shown even if zero. So just doing the json format first and then the text > > > format should also work. Although if you're really unlucky there could > > > be a > > > cache invalidation in between so we could just ignore the text format. > > > But I > > > think we should at least keep a regression test with the json format, > > > with a > > > comment explain why only this one is tested. > > > > Fair point. By commit 7e12256b478 we disabled track_io_timing, but > > probably we can temporarily enable it and run one query with "buffers" > > and "format json" options. > > Yes, enabling it for just this query. It can't really find any problem with > the values themselves but at least the new code path would be partially > executed. > > > > > > > - not really your patch fault I guess, but I see that extendBufFile() > > > isn't > > > handled. There shouldn't be much activity there so maybe it's ok. > > > This is likely because smgr_extend is also not handled, but this one > > > seems > > > much more likely to take quite some time, and therefore should bump the > > > timing counters. > > > > You mean we should include the time for opening files as write time? > > Yes. In normal circumstances it shouldn't need a lot of time to do that, but > I'm not so sure with e.g. network filesystems. I'm not strongly in favor of > counting it, especially since smgrextend doesn't either. Good point. I think that adding a new place to track I/O timing can be a separate patch, so probably we can work on it for PG16 or later. I've attached updated patches, please review them. 
Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/ v4-0001-Track-I-O-timing-for-temp-buffers.patch Description: Binary data v4-0002-pg_stat_statements-Track-I-O-timing-for-temp-bloc.patch Description: Binary data
Re: Add LZ4 compression in pg_dump
On Fri, Apr 01, 2022 at 03:06:40PM +, gkokola...@pm.me wrote: > I understand the itch. Indeed when LZ4 is added as compression method, this > block changes slightly. I went with the minimum amount changed. Please find > in 0001 of the attached this variable renamed as $gzip_program_exist. I > thought > that as prefix it will match better the already used $ENV{GZIP_PROGRAM}. Hmm. I have spent some time on that, and upon review I really think that we should skip the tests marked as dedicated to the gzip compression entirely if the build is not compiled with this option, rather than letting the code run a dump for nothing in some cases, relying on the default to uncompress the contents in others. In the latter case, it happens that we already have some checks like defaults_custom_format, but you already mentioned that. We should also skip the later parts of the tests that rely on the compression program, but only when that command does not exist. This will count for LZ4. > I can see the overlap case. Yet, I understand the test_key as serving > different > purpose, as it is a key of %tests and %full_runs. I do not expect the database > content of the generated dump to change based on which compression method is > used. Contrary to the current LZ4 tests in pg_dump, what we have here is a check for a command-level run and not a data-level check. So what's introduced is a new concept, and we need a new way to control if the tests should be entirely skipped or not, particularly if we finish by not using test_key to make the difference. Perhaps the best way to address that is to have a new keyword in the $runs structure. The attached defines a new compile_option that can be completed later for new compression methods introduced in the tests. So the idea is to mark all the tests related to compression with the same test_key, and the tests can be skipped depending on what compile_option requires. 
> In the attached version, I propose that the compression_cmd is converted into > a hash. It contains two keys, the program and the arguments. Maybe it is > easier > to read than before or than simply grabbing the first element of the array. Splitting the program and its arguments makes sense. At the end I am finishing with the attached. I also saw an overlap with the addition of --jobs for the directory format vs not using the option, so I have removed the case where --jobs was not used in the directory format. -- Michael From b705f77862c9e41a149ce078df638032903bce3d Mon Sep 17 00:00:00 2001 From: Michael Paquier Date: Tue, 5 Apr 2022 10:30:06 +0900 Subject: [PATCH v6] Extend compression coverage for pg_dump, pg_restore --- src/bin/pg_dump/Makefile | 2 + src/bin/pg_dump/t/002_pg_dump.pl | 93 +++- 2 files changed, 93 insertions(+), 2 deletions(-) diff --git a/src/bin/pg_dump/Makefile b/src/bin/pg_dump/Makefile index 302f7e02d6..2f524b09bf 100644 --- a/src/bin/pg_dump/Makefile +++ b/src/bin/pg_dump/Makefile @@ -16,6 +16,8 @@ subdir = src/bin/pg_dump top_builddir = ../../.. include $(top_builddir)/src/Makefile.global +export GZIP_PROGRAM=$(GZIP) + override CPPFLAGS := -I$(libpq_srcdir) $(CPPFLAGS) LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils $(libpq_pgport) diff --git a/src/bin/pg_dump/t/002_pg_dump.pl b/src/bin/pg_dump/t/002_pg_dump.pl index af5d6fa5a3..dda3649373 100644 --- a/src/bin/pg_dump/t/002_pg_dump.pl +++ b/src/bin/pg_dump/t/002_pg_dump.pl @@ -20,12 +20,22 @@ my $tempdir = PostgreSQL::Test::Utils::tempdir; # test_key indicates that a given run should simply use the same # set of like/unlike tests as another run, and which run that is. # +# compile_option indicates if the commands run depend on a compilation +# option, if any. This controls if tests are skipped if a dependency +# is not satisfied. +# # dump_cmd is the pg_dump command to run, which is an array of # the full command and arguments to run. 
Note that this is run # using $node->command_ok(), so the port does not need to be # specified and is pulled from $PGPORT, which is set by the # PostgreSQL::Test::Cluster system. # +# compress_cmd is the utility command for (de)compression, if any. +# Note that this should generally be used on pg_dump's output +# either to generate a text file to run through the tests, or +# to test pg_restore's ability to parse manually compressed files +# that otherwise pg_dump does not compress on its own (e.g. *.toc). +# # restore_cmd is the pg_restore command to run, if any. Note # that this should generally be used when the pg_dump goes to # a non-text file and that the restore can then be used to @@ -54,6 +64,58 @@ my %pgdump_runs = ( "$tempdir/binary_upgrade.dump", ], }, + + # Do not use --no-sync to give test coverage for data sync. + compression_gzip_custom => { + test_key => 'compression', + compile_option => 'gzip', + dump_cmd => [ + 'pg_dump', '--format=custom', +
Re: Skipping logical replication transactions on subscriber side
On Tue, Apr 5, 2022 at 9:21 AM Noah Misch wrote: > > On Mon, Apr 04, 2022 at 06:55:45PM +0900, Masahiko Sawada wrote: > > On Mon, Apr 4, 2022 at 3:26 PM Noah Misch wrote: > > > On Mon, Apr 04, 2022 at 08:20:08AM +0530, Amit Kapila wrote: > > > > How about a comment like: "It has to be kept at 8-byte alignment > > > > boundary so as to be accessed directly via C struct as it uses > > > > TYPALIGN_DOUBLE for storage which has 4-byte alignment on platforms > > > > like AIX."? Can you please suggest a better comment if you don't like > > > > this one? > > > > > > I'd write it like this, though I'm not sure it's an improvement on your > > > words: > > > > > > When ALIGNOF_DOUBLE==4 (e.g. AIX), the C ABI may impose 8-byte > > > alignment on > > > some of the C types that correspond to TYPALIGN_DOUBLE SQL types. To > > > ensure > > > catalog C struct layout matches catalog tuple layout, arrange for the > > > tuple > > > offset of each fixed-width, attalign='d' catalog column to be divisible > > > by 8 > > > unconditionally. Keep such columns before the first NameData column of > > > the > > > catalog, since packagers can override NAMEDATALEN to an odd number. > > > > Thanks! > > > > > > > > The best place for such a comment would be in one of > > > src/test/regress/sql/*sanity*.sql, next to a test written to detect new > > > violations. > > > > Agreed. > > > > IIUC in the new test, we would need a new SQL function to calculate > > the offset of catalog columns including padding, is that right? Or do > > you have an idea to do that by using existing functionality? 
> > Something like this: > > select > attrelid::regclass, > attname, > array(select typname > from pg_type t join pg_attribute pa on t.oid = pa.atttypid > where pa.attrelid = a.attrelid and pa.attnum > 0 and pa.attnum < > a.attnum order by pa.attnum) AS types_before, > (select sum(attlen) >from pg_type t join pg_attribute pa on t.oid = pa.atttypid >where pa.attrelid = a.attrelid and pa.attnum > 0 and pa.attnum < a.attnum) > AS len_before > from pg_attribute a > join pg_class c on c.oid = attrelid > where attalign = 'd' and relkind = 'r' and attnotnull and attlen <> -1 > order by attrelid::regclass::text, attnum; > attrelid │ attname│types_before > │ len_before > ─┼──┼─┼ > pg_sequence │ seqstart │ {oid,oid} > │ 8 > pg_sequence │ seqincrement │ {oid,oid,int8} > │ 16 > pg_sequence │ seqmax │ {oid,oid,int8,int8} > │ 24 > pg_sequence │ seqmin │ {oid,oid,int8,int8,int8} > │ 32 > pg_sequence │ seqcache │ {oid,oid,int8,int8,int8,int8} > │ 40 > pg_subscription │ subskiplsn │ {oid,oid,name,oid,bool,bool,bool,char,bool} > │ 81 > (6 rows) > > That doesn't count padding, but hazardous column changes will cause a diff in > the output. Yes, in this case, we can detect the violated column order even without considering padding. On the other hand, I think this calculation could not detect some patterns of order. For instance, suppose the column order is {oid, bool, bool, oid, bool, bool, oid, int8}, the len_before is 16 but offset of int8 column including padding is 20 on ALIGNOF_DOUBLE==4 environment. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
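The padding effect Sawada-san describes can be sketched numerically. This is a toy model with hand-written attlen/alignment values (assuming ALIGNOF_DOUBLE == 4, as on AIX), not the catalog machinery itself; it only illustrates why summing the preceding attlens ("len_before") can miss a misaligned column:

```python
# Toy model of catalog tuple offsets: with ALIGNOF_DOUBLE == 4 the int8
# column in the example below sits at offset 20, while the sum of the
# preceding attlens is only 16 - so a len_before check alone misses it.
TYPES = {          # name -> (attlen, alignment when ALIGNOF_DOUBLE == 4)
    "oid":  (4, 4),
    "bool": (1, 1),
    "int8": (8, 4),   # TYPALIGN_DOUBLE, but only 4-byte aligned here
}

def align_up(n, alignment):
    """Round n up to the next multiple of alignment."""
    return -(-n // alignment) * alignment

def offset_of_last(columns):
    """Return (sum of preceding attlens, actual padded offset of last col)."""
    offset = 0
    for name in columns[:-1]:
        size, alignment = TYPES[name]
        offset = align_up(offset, alignment) + size
    _, last_alignment = TYPES[columns[-1]]
    len_before = sum(TYPES[c][0] for c in columns[:-1])
    return len_before, align_up(offset, last_alignment)

cols = ["oid", "bool", "bool", "oid", "bool", "bool", "oid", "int8"]
print(offset_of_last(cols))  # (16, 20): len_before misses the padding
```

For a well-packed catalog like pg_sequence the two numbers agree (e.g. ["oid", "oid", "int8"] gives (8, 8), matching the len_before of 8 in the table above), which is why the simpler query still catches most hazardous column changes.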
Re: Handle infinite recursion in logical replication setup
Here are my comments for the latest patch v6-0001. (I will post my v6-0002 review comments separately) PATCH v6-0001 comments == 1.1 General - Option name I still feel like the option name is not ideal. Unfortunately, this is important because any name change would impact lots of these patch files and docs, struct members etc. It was originally called "local_only", but I thought that as a SUBSCRIPTION option that was confusing because "local" means local to what? Really it is local to the publisher, not local to the subscriber, so that name seemed misleading. Then I suggested "publish_local_only". Although that resolved the ambiguity problem, other people thought it seemed odd to have the "publish" prefix for a subscription-side option. So now it is changed again to "subscribe_local_only" -- It's getting better but still, it is implied that the "local" means local to the publisher except there is nothing in the option name really to convey that meaning. IMO we here all understand the meaning of this option mostly by familiarity with this discussion thread, but I think a user coming to this for the first time will still be confused by the name. Below are some other name ideas. Maybe they are not improvements, but it might help other people to come up with something better. subscribe_publocal_only = true/false origin = pub_only/any adjacent_data_only = true/false direct_subscriptions_only = true/false ... (FYI, the remainder of these review comments will assume the option is still called "subscribe_local_only") ~~~ 1.2 General - inconsistent members and args IMO the struct members and args should also be named for close consistency with whatever the option name is. Currently the option is called "subscribe_local_only". So I think the members/args would be better to be called "local_only" instead of "only_local". 
~~~ 1.3 Commit message - wrong option name The commit message refers to the option name as "publish_local_only" instead of the option name that is currently implemented. ~~~ 1.4 Commit message - wording The wording seems a bit off. Below is suggested simpler wording which I AFAIK conveys the same information. BEFORE Add an option publish_local_only which will subscribe only to the locally generated data in the publisher node. If subscriber is created with this option, publisher will skip publishing the data that was subscribed from other nodes. It can be created using following syntax: ex: CREATE SUBSCRIPTION sub1 CONNECTION 'dbname =postgres port=' PUBLICATION pub1 with (publish_local_only = on); SUGGESTION This patch adds a new SUBSCRIPTION boolean option "subscribe_local_only". The default is false. When a SUBSCRIPTION is created with this option enabled, the publisher will only publish data that originated at the publisher node. Usage: CREATE SUBSCRIPTION sub1 CONNECTION 'dbname =postgres port=' PUBLICATION pub1 with (subscribe_local_only = true); ~~~ 1.5 doc/src/sgml/ref/create_subscription.sgml - "generated" changes. + + Specifies whether the subscription will request the publisher to send + locally generated changes or both the locally generated changes and + the replicated changes that was generated from other nodes. The + default is false. + For some reason, it seemed a bit strange to me to use the term "generated" changes. Maybe better to refer to the origin of changes? SUGGESTION Specifies whether the publisher should send only changes that originated locally at the publisher node, or send any publisher node changes regardless of their origin. The default is false. 
~~~ 1.6 src/backend/replication/pgoutput/pgoutput.c - LOGICALREP_PROTO_TWOPHASE_VERSION_NUM @@ -496,6 +509,12 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt, else ctx->twophase_opt_given = true; + if (data->only_local && data->protocol_version < LOGICALREP_PROTO_TWOPHASE_VERSION_NUM) + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("requested proto_version=%d does not support subscribe_local_only, need %d or higher", + data->protocol_version, LOGICALREP_PROTO_TWOPHASE_VERSION_NUM))); I thought this code should not be using LOGICALREP_PROTO_TWOPHASE_VERSION_NUM. Shouldn't there be some newly introduced constant like LOGICALREP_PROTO_LOCALONLY_VERSION_NUM which you will use here? ~~~ 1.7 src/bin/pg_dump/pg_dump.c - 15 @@ -4451,11 +4452,13 @@ getSubscriptions(Archive *fout) if (fout->remoteVersion >= 15) appendPQExpBufferStr(query, " s.subtwophasestate,\n" - " s.subdisableonerr\n"); + " s.subdisableonerr,\n" + " s.sublocal\n"); else appendPQExpBuffer(query, " '%c' AS subtwophasestate,\n" - " false AS subdisableonerr\n", + " false AS subdisableonerr,\n" + " false AS s.sublocal\n", LOGICALREP_TWOPHASE_STATE_DISABLED); I think this local_only feature is unlikely to get into the PG15 release, so this code should be split out into
Re: Run pg_amcheck in 002_pg_upgrade.pl and 027_stream_regress.pl?
Hi, On 2022-04-05 08:46:06 +0900, Michael Paquier wrote: > On Sun, Apr 03, 2022 at 11:53:03AM -0700, Andres Freund wrote: > > It seems $subject would have a chance of catching some of these bugs, as > > well > > as exposing amcheck to a database with a bit more varied content? > > Makes sense to me to extend that. > > > Depending on the cost it might make sense to do this optionally, via > > PG_TEST_EXTRA? > > Yes, it would be good to check the difference in run-time before > introducing more. A logical dump of the regression database is no > more than 15MB if I recall correctly, so my guess is that most of the > runtime is still going to be eaten by the run of pg_regress. On my workstation it takes about 2.39s to run pg_amcheck on a regression database with all thoroughness options enabled. With -j4 it's 0.62s. Without more thorough checking it's 1.24s and 0.30s with -j4. Greetings, Andres Freund
Re: Extensible Rmgr for Table AMs
On Mon, 2022-04-04 at 11:15 +0530, Bharath Rupireddy wrote: > 1) Why can't rmid be chosen by the extensions in sequential order > from > (129 till 255), say, to start with a columnar extension can choose > 129, another extension can register 130 and so on right? I'm not sure what you mean by "chosen by the extensions in sequential order". If you mean: (a) that it would depend on the library load order, and that postgres would assign the ID; then that won't work because the rmgr IDs need to be stable across restarts and config changes (b) that there should be a convention where people reserve IDs on the wiki page in sequential order; then that's fine with me > 2) RM_MAX_ID is now UINT8_MAX - do we need to make it configurable? > or > do we think that only a few (127) custom rmgrs can exist? 127 should be plenty for quite a long time. If we need more, we can come up with a different solution (which is OK because the WAL format changes in new major versions). > 3) Do we need to talk about extensible rmgrs and > https://wiki.postgresql.org/wiki/ExtensibleRmgr in documentation > somewhere? This will help extension developers to refer when needed. We don't have user-facing documentation for every extension hook, and I don't think it would be a good idea to document this one (at least in the user-facing docs). Otherwise, we'd have to document the whole resource manager interface, and that seems like way too much of the internals. > 4) Do we need to change this description? > > The default value of this setting is the empty string, which > disables > the feature. It can be set to all to > check all > records, or to a comma-separated list of resource managers to > check > only records originating from those resource > managers. Currently, > the supported resource managers are heap, > heap2, btree, > hash, > gin, gist, > sequence, > spgist, brin, and > generic. Only superusers can change this setting. 
Yes, you're right, I updated the docs for pg_waldump and the wal_consistency_checking GUC. > 5) Do we need to add a PGDLLIMPORT qualifier to RegisterCustomRmgr() > (possibly for RmgrStartup, RmgrTable, RmgrCleanup as well?) so that > the Windows versions of extensions can use this feature? That seems to only be required for variables, not functions. I don't think we want to mark RmgrTable this way, because it's not intended to be accessed by extensions directly. It's accessed by GetRmgr(), which is 'static inline', but that's also not intended to be used by extensions. > 6) Should the below strcmp be pg_strcmp? The custom rmgrs can still use > rm_name with different cases and the below code fails to catch it? > + if (!strcmp(existing_rmgr->rm_name, rmgr->rm_name)) > + ereport(PANIC, > + (errmsg("custom rmgr \"%s\" has the same name as builtin rmgr", > + existing_rmgr->rm_name))); There are already a lot of places where users can choose identifiers that differ only in uppercase/lowercase. I don't see a reason to make an exception here. > 7) What's the intention of the below code? It seems like below it > checks if there's already a rmgr with the given name (RM_MAX_ID). Had > it been RM_MAX_BUILTIN_ID instead of RM_MAX_ID, the error message > does make sense. Is the intention here to not have duplicate rmgrs in > the entire RM_MAX_ID(both builtin and custom)? Thank you. I redid a lot of the error messages and checked them out to make sure they are helpful. > 8) Per https://wiki.postgresql.org/wiki/ExtensibleRmgr, 128 is > reserved, can we have it as a macro? Or something like > (RM_MAX_ID/2+1) It's already defined as RM_EXPERIMENTAL_ID. Is that what you meant or did I misunderstand? I don't see the point in defining it as RM_MAX_ID/2+1. > 9) Thinking if there's a way to test the code, maybe exposing > RegisterCustomRmgr as a function? 
I think we need to have a dummy > extension that will be used for testing this kind of patches/features > but that's a separate discussion IMO. It's already exposed as a C function in xlog_internal.h. You mean as a SQL function? I don't think that makes sense. It can only be called while processing shared_preload_libraries (on server startup) anyway. Other changes this version: * implemented Andres's proposed performance change[1] * fix wal_consistency_checking * general cleanup Regards, Jeff Davis [1] https://postgr.es/m/20220404043337.ocjnni7hknjsi...@alap3.anarazel.de From 7f6acaf93d279baa464ae3ba2be173a01d969402 Mon Sep 17 00:00:00 2001 From: Jeff Davis Date: Sat, 6 Nov 2021 13:01:38 -0700 Subject: [PATCH] Extensible rmgr. Allow extensions to specify a new custom rmgr, which allows specialized WAL. This is meant to be used by a custom Table Access Method, which would not otherwise be able to offer logical decoding/replication. It may also be used by new Index Access Methods. Prior to this commit, only Generic WAL was available, which offers support for recovery and
Re: Skipping logical replication transactions on subscriber side
On Mon, Apr 04, 2022 at 06:55:45PM +0900, Masahiko Sawada wrote: > On Mon, Apr 4, 2022 at 3:26 PM Noah Misch wrote: > > On Mon, Apr 04, 2022 at 08:20:08AM +0530, Amit Kapila wrote: > > > How about a comment like: "It has to be kept at 8-byte alignment > > > boundary so as to be accessed directly via C struct as it uses > > > TYPALIGN_DOUBLE for storage which has 4-byte alignment on platforms > > > like AIX."? Can you please suggest a better comment if you don't like > > > this one? > > > > I'd write it like this, though I'm not sure it's an improvement on your > > words: > > > > When ALIGNOF_DOUBLE==4 (e.g. AIX), the C ABI may impose 8-byte alignment > > on > > some of the C types that correspond to TYPALIGN_DOUBLE SQL types. To > > ensure > > catalog C struct layout matches catalog tuple layout, arrange for the > > tuple > > offset of each fixed-width, attalign='d' catalog column to be divisible > > by 8 > > unconditionally. Keep such columns before the first NameData column of > > the > > catalog, since packagers can override NAMEDATALEN to an odd number. > > Thanks! > > > > > The best place for such a comment would be in one of > > src/test/regress/sql/*sanity*.sql, next to a test written to detect new > > violations. > > Agreed. > > IIUC in the new test, we would need a new SQL function to calculate > the offset of catalog columns including padding, is that right? Or do > you have an idea to do that by using existing functionality? 
Something like this:

select attrelid::regclass, attname,
       array(select typname
             from pg_type t join pg_attribute pa on t.oid = pa.atttypid
             where pa.attrelid = a.attrelid
               and pa.attnum > 0 and pa.attnum < a.attnum
             order by pa.attnum) AS types_before,
       (select sum(attlen)
        from pg_type t join pg_attribute pa on t.oid = pa.atttypid
        where pa.attrelid = a.attrelid
          and pa.attnum > 0 and pa.attnum < a.attnum) AS len_before
from pg_attribute a join pg_class c on c.oid = attrelid
where attalign = 'd' and relkind = 'r' and attnotnull and attlen <> -1
order by attrelid::regclass::text, attnum;

    attrelid     │   attname    │                types_before                 │ len_before
─────────────────┼──────────────┼─────────────────────────────────────────────┼────────────
 pg_sequence     │ seqstart     │ {oid,oid}                                   │          8
 pg_sequence     │ seqincrement │ {oid,oid,int8}                              │         16
 pg_sequence     │ seqmax       │ {oid,oid,int8,int8}                         │         24
 pg_sequence     │ seqmin      │ {oid,oid,int8,int8,int8}                    │         32
 pg_sequence     │ seqcache     │ {oid,oid,int8,int8,int8,int8}               │         40
 pg_subscription │ subskiplsn   │ {oid,oid,name,oid,bool,bool,bool,char,bool} │         81
(6 rows)

That doesn't count padding, but hazardous column changes will cause a diff in the output.
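To see why counting raw bytes is only an approximation, here is a small standalone sketch (Python; the type sizes/alignments are illustrative assumptions, not values read from pg_type) of an offset computation that does account for C-style alignment padding:

```python
# Sketch: compute tuple offsets with alignment padding, then verify that
# every fixed-width column requiring TYPALIGN_DOUBLE lands on an 8-byte
# boundary.  The sizes and alignments below are assumptions for a typical
# 64-bit platform, not authoritative catalog data.
TYPES = {  # name: (size, required alignment)
    "oid":  (4, 4),
    "bool": (1, 1),
    "char": (1, 1),
    "int8": (8, 8),
    "name": (64, 1),   # NAMEDATALEN defaults to 64; packagers may change it
}

def column_offsets(columns):
    """Return {colname: offset} applying C-style alignment padding."""
    offsets, off = {}, 0
    for colname, typname in columns:
        size, align = TYPES[typname]
        off = (off + align - 1) // align * align   # round up to alignment
        offsets[colname] = off
        off += size
    return offsets

# Layout mimicking pg_sequence: two oids followed by five int8 columns.
seq = [("seqrelid", "oid"), ("seqtypid", "oid"),
       ("seqstart", "int8"), ("seqincrement", "int8"),
       ("seqmax", "int8"), ("seqmin", "int8"), ("seqcache", "int8")]
offs = column_offsets(seq)
assert all(offs[c] % 8 == 0 for c, t in seq if t == "int8")
```

On a platform where ALIGNOF_DOUBLE is 4 instead of 8, the padding step would round to 4 and the check would catch misaligned columns, which is the hazard the proposed sanity test guards against.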
Re: Granting SET and ALTER SYSTE privileges for GUCs
> On Apr 4, 2022, at 5:12 PM, Tom Lane wrote: > > Wrote it already, no need for you to do it. Thanks! — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Granting SET and ALTER SYSTE privileges for GUCs
Mark Dilger writes: > On Apr 4, 2022, at 2:26 PM, Tom Lane wrote: >> So I think that instead of what you've got here, we should >> (1) apply the map_old_guc_names[] mapping, which is constant >> (for any one PG release anyway) >> (2) smash to lower case >> (3) verify validity per valid_variable_name. >> >> This also simplifies life on the lookup side, where it's sufficient >> to do steps (1) and (2) before performing a catalog search. >> >> Thoughts? > That sounds right. Do you already have something like that coded, or would > you like me to post a patch? Wrote it already, no need for you to do it. regards, tom lane
Re: Extensible Rmgr for Table AMs
On Sun, 2022-04-03 at 21:33 -0700, Andres Freund wrote: > I still think the easiest and fastest would be to just make RmgrTable > longer, > and not const. When registering, copy the caller provided struct into > the > respective RmgrData element. Yes, we'd waste a bit of space at the > end of the > array, but it's typically not going to be touched and thus not be > backed by > "actual" memory. Sounds good to me. I tried to break down the performance between these approaches and didn't get a clear signal, so I'll go with your intuition here. Posting new patch in response to Bharath's review, which will include this change. Note that GetRmgr() also has an unlikely branch where it tests for the validity of the rmgr before using it, so that it can throw a nice error message if someone forgot to include the module in shared_preload_libraries. I expect this will be highly predictable and therefore not a problem. Regards, Jeff Davis
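As a toy model (not PostgreSQL code), the registration scheme being discussed — a fixed-length table where registering a custom rmgr copies the caller-provided descriptor into its slot, with duplicate IDs and names rejected — could look like this; the names and the 128/255 boundaries mirror the thread's terminology, not actual symbols:

```python
# Toy model of an extensible rmgr table: a fixed-size array sized for all
# possible IDs, where registration copies the caller's descriptor into the
# respective slot (as with assigning into RmgrData elements).
RM_MAX_ID = 255
rmgr_table = [None] * (RM_MAX_ID + 1)

def register_custom_rmgr(rmid, rmgr):
    if not (128 <= rmid <= RM_MAX_ID):
        raise ValueError("custom rmgr IDs must be above the builtin range")
    for existing in filter(None, rmgr_table):
        if existing["name"] == rmgr["name"]:
            raise ValueError(f"rmgr \"{rmgr['name']}\" already registered")
    if rmgr_table[rmid] is not None:
        raise ValueError(f"rmgr ID {rmid} already in use")
    rmgr_table[rmid] = dict(rmgr)   # copy into the table's slot

register_custom_rmgr(129, {"name": "columnar", "redo": lambda rec: None})
```

Because IDs must stay stable across restarts and config changes, the ID is supplied by the caller (reserved by convention) rather than assigned in load order, matching option (b) above.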
Re: Window Function "Run Conditions"
Thanks for having a look at this. On Wed, 30 Mar 2022 at 11:16, Andres Freund wrote: > On 2022-03-29 15:11:52 +1300, David Rowley wrote: > > One thing which I'm not sure about with the patch is how I'm handling > > the evaluation of the runcondition in nodeWindowAgg.c. Instead of > > having ExecQual() evaluate an OpExpr such as "row_number() over (...) > > <= 10", I'm replacing the WindowFunc with the Var in the targetlist > > that corresponds to the given WindowFunc. This saves having to double > > evaluate the WindowFunc. Instead, the value of the Var can be taken > > directly from the slot. I don't know of anywhere else we do things > > quite like that. The runcondition is slightly similar to HAVING > > clauses, but HAVING clauses don't work this way. > > Don't HAVING clauses actually work pretty similar? Yes, they don't have a Var, > but for expression evaluation purposes an Aggref is nearly the same as a plain > Var: > > EEO_CASE(EEOP_INNER_VAR) > { > int attnum = op->d.var.attnum; > > /* > * Since we already extracted all referenced columns from the > * tuple with a FETCHSOME step, we can just grab the value > * directly out of the slot's decomposed-data arrays. But let's > * have an Assert to check that that did happen. > */ > Assert(attnum >= 0 && attnum < innerslot->tts_nvalid); > *op->resvalue = innerslot->tts_values[attnum]; > *op->resnull = innerslot->tts_isnull[attnum]; > > EEO_NEXT(); > } > vs > EEO_CASE(EEOP_AGGREF) > { > /* > * Returns a Datum whose value is the precomputed aggregate value > * found in the given expression context. > */ > int aggno = op->d.aggref.aggno; > > Assert(econtext->ecxt_aggvalues != NULL); > > *op->resvalue = econtext->ecxt_aggvalues[aggno]; > *op->resnull = econtext->ecxt_aggnulls[aggno]; > > EEO_NEXT(); > } > > specifically we don't re-evaluate expressions? Thanks for highlighting the similarities. I'm feeling better about the choice now. 
I've made another pass over the patch and updated a few comments and made a small code change to delay the initialisation of a variable. I'm pretty happy with this now. If anyone wants to have a look at this, can they do so or let me know they're going to within the next 24 hours. Otherwise I plan to move into commit mode with it. > This is afaics slightly cheaper than referencing a variable in a slot. I guess you must mean cheaper because it means there will be no EEOP_*_FETCHSOME step? Otherwise it seems a fairly similar amount of work. David From f5dab05872f0f06c05fae1ee2024285b89de8c44 Mon Sep 17 00:00:00 2001 From: David Rowley Date: Tue, 5 Apr 2022 12:00:40 +1200 Subject: [PATCH v5] Teach planner and executor about monotonic window funcs Window functions such as row_number() always return a value higher than the one previously returned in any given partition. If a query were to only request the first few row numbers, then traditionally we would continue evaluating the WindowAgg node until all tuples are exhausted. However, if someone, say, only wanted the rows with row number <= 10, then we could just stop once we get number 11. Here we implement means to do just that. This is done by way of adding a pg_proc.prosupport function to various of the built-in window functions and adding supporting code to allow the support function to inform the planner if the function is monotonically increasing, monotonically decreasing, both or neither. The planner is then able to make use of that information and possibly allow the executor to short-circuit execution by way of adding a "run condition" to the WindowAgg to instruct it to run while this condition is true and stop when it becomes false. When there are multiple WindowAgg nodes to evaluate then this complicates the situation as if we were to stop execution on a lower-level WindowAgg then an upper-level WindowAgg may not receive all of the tuples it should. This may lead to incorrect query results. 
To get around this problem all non-top-level WindowAggs go into "passthrough" mode when their runcondition is no longer true. This means that they continue to pull tuples from their subnode but no longer evaluate their window functions. Only the top-level WindowAgg node may stop when the runcondition is no longer true. Here we add prosupport functions to allow this to work for; row_number(), rank(), dense_rank(), count(*) and count(expr). It appears technically possible to do the same for min() and max(), however, it seems unlikely to be useful enough, so that's not done here. Author: David Rowley Reviewed-by: Andy Fan, Zhihong Yu Discussion: https://postgr.es/m/caaphdvqvp3at8++yf8ij06sdcoo1s_b2yoat9d4nf+mobzs...@mail.gmail.com ---
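The top-level-stop versus pass-through behaviour described in the commit message can be sketched as a toy model (Python, no relation to the actual executor code): the window value is monotonically increasing, so once the run condition turns false a top-level node may stop pulling tuples outright, while a lower-level node must keep emitting tuples without evaluating its window functions.

```python
# Toy model of a WindowAgg "run condition" on a monotonically increasing
# window function (row_number-like).  When "rn <= limit" becomes false:
#   - a top-level node stops pulling from its subnode entirely;
#   - a non-top-level node enters passthrough mode: tuples still flow to
#     upper nodes, but window-function evaluation is skipped.
def window_rows(source, limit, top_level):
    rn = 0
    passthrough = False
    for row in source:
        if passthrough:
            yield (row, None)          # tuple flows on; no evaluation
            continue
        rn += 1                        # monotonically increasing value
        if rn > limit:                 # run condition is now false...
            if top_level:
                return                 # ...top level may stop outright
            passthrough = True         # ...lower level must keep pulling
            yield (row, None)
        else:
            yield (row, rn)

assert len(list(window_rows(range(100), 10, top_level=True))) == 10
assert len(list(window_rows(range(100), 10, top_level=False))) == 100
```

The second assertion shows why passthrough matters: an upper-level WindowAgg still receives all 100 tuples, so its own results cannot be corrupted by the lower node's early cutoff.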
Re: Granting SET and ALTER SYSTE privileges for GUCs
> On Apr 4, 2022, at 2:26 PM, Tom Lane wrote: > > Thanks. As I'm working through this, I'm kind of inclined to drop > the has_parameter_privilege() variants that take an OID as object > identifier. This gets back to the fact that I don't think > pg_parameter_acl OIDs have any outside use; we wouldn't even have > them except that we need a way to track their role dependencies > in pg_shdepend. AFAICS users will only ever be interested in > looking up a GUC by name. Any objections? None. > Another thought here is that I see you're expending some code > to store the canonical name of a GUC in pg_parameter_acl, but > I think that's probably going too far. In particular, for the > legacy mixed-case names like "DateStyle", what ends up in the > table is the mixed-case name, which seems problematic. It would > definitely be problematic if an extension used such a name, > because we might or might not be aware of the idiosyncratic > casing at the time a GRANT is issued. I'm thinking that we > really want to avoid looking up custom GUCs at all during GRANT, > because that can't do anything except create hazards. Yikes. It took a few tries to see what you mean. Yes, if the GRANT happens before the LOAD, that can have bad consequences. So I agree something should be changed. > So I think that instead of what you've got here, we should > (1) apply the map_old_guc_names[] mapping, which is constant >(for any one PG release anyway) > (2) smash to lower case > (3) verify validity per valid_variable_name. > > This also simplifies life on the lookup side, where it's sufficient > to do steps (1) and (2) before performing a catalog search. > > Thoughts? That sounds right. Do you already have something like that coded, or would you like me to post a patch? — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
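The three canonicalization steps listed above can be sketched as follows (Python; the sample legacy-name mapping entry and the identifier rule are illustrative stand-ins modeled on guc.c's map_old_guc_names[] and valid_variable_name(), not copies of them):

```python
import re

# Sketch of GRANT-time parameter-name canonicalization:
#   (1) apply the constant legacy-name mapping,
#   (2) smash to lower case,
#   (3) verify validity as a (possibly qualified) variable name.
MAP_OLD_GUC_NAMES = {"sort_mem": "work_mem"}   # assumed legacy alias

def canonicalize_guc_name(name):
    name = MAP_OLD_GUC_NAMES.get(name.lower(), name)  # (1) legacy mapping
    name = name.lower()                               # (2) smash case
    # (3) validity: dotted identifiers of letters/digits/underscores
    if not re.fullmatch(r"[a-z_][a-z0-9_]*(\.[a-z_][a-z0-9_]*)*", name):
        raise ValueError(f"invalid parameter name: {name}")
    return name

assert canonicalize_guc_name("DateStyle") == "datestyle"
```

Because no lookup of the live GUC is performed, a GRANT on an extension's custom GUC behaves the same whether or not the extension has been loaded yet, which is the hazard discussed above.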
Re: Run pg_amcheck in 002_pg_upgrade.pl and 027_stream_regress.pl?
On Sun, Apr 03, 2022 at 11:53:03AM -0700, Andres Freund wrote: > It seems $subject would have a chance of catching some of these bugs, as well > as exposing amcheck to a database with a bit more varied content? Makes sense to me to extend that. > Depending on the cost it might make sense to do this optionally, via > PG_TEST_EXTRA? Yes, it would be good to check the difference in run-time before introducing more. A logical dump of the regression database is no more than 15MB if I recall correctly, so my guess is that most of the runtime is still going to be eaten by the run of pg_regress. -- Michael
Re: [HACKERS] WIP aPatch: Pgbench Serialization and deadlock errors
Hi Fabien, > Hello Tatsuo-san, > >> interval_start num_transactions sum_latency sum_latency_2 min_latency >> max_latency >> sum_lag sum_lag_2 min_lag max_lag skipped >> failures serialization_failures deadlock_failures retried retries > > I would suggest to reorder the last chunk to: > >... retried retries failures serfail dlfail > > because I intend to add connection failures handling at some point, > and it would make more sense to add the corresponding count at the end > with other fails. Ok, I have adjusted the patch. V2 patch attached. Best regards, -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese:http://www.sraoss.co.jp diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml index ebdb4b3f46..d1818ff316 100644 --- a/doc/src/sgml/ref/pgbench.sgml +++ b/doc/src/sgml/ref/pgbench.sgml @@ -2401,7 +2401,9 @@ END; format is used for the log files: -interval_start num_transactions sum_latency sum_latency_2 min_latency max_latency { failures | serialization_failures deadlock_failures } sum_lag sum_lag_2 min_lag max_lag skipped retried retries +interval_start num_transactions sum_latency sum_latency_2 min_latency max_latency +sum_lag sum_lag_2 min_lag max_lag skipped +retried retries failures serialization_failures deadlock_failures where @@ -2417,41 +2419,55 @@ END; and max_latency is the maximum latency within the interval, failures is the number of transactions that ended - with a failed SQL command within the interval. If you use option - --failures-detailed, instead of the sum of all failed - transactions you will get more detailed statistics for the failed - transactions grouped by the following types: - serialization_failures is the number of - transactions that got a serialization error and were not retried after this, - deadlock_failures is the number of transactions - that got a deadlock error and were not retried after this. + with a failed SQL command within the interval. 
+ + The next fields, sum_lag, sum_lag_2, min_lag, - and max_lag, are only present if the --rate - option is used. + and max_lag, are only meaningful if the --rate + option is used. Otherwise, they are all 0.0. They provide statistics about the time each transaction had to wait for the previous one to finish, i.e., the difference between each transaction's scheduled start time and the time it actually started. The next field, skipped, - is only present if the --latency-limit option is used, too. + is only meaningful if the --latency-limit option is used, too. Otherwise it is 0. It counts the number of transactions skipped because they would have started too late. - The retried and retries - fields are present only if the --max-tries option is not - equal to 1. They report the number of retried transactions and the sum of all - retries after serialization or deadlock errors within the interval. - Each transaction is counted in the interval when it was committed. + + + The retried + and retries fields are only meaningful if + the --max-tries option is not equal to 1. Otherwise they + are 0. They report the number of retried transactions and the sum of all + retries after serialization or deadlock errors within the interval. Each + transaction is counted in the interval when it was committed. + + + failures is the sum of all failed transactions. + If --failures-detailed is specified, instead of the sum of + all failed transactions you will get more detailed statistics for the + failed transactions grouped by the following types: + serialization_failures is the number of + transactions that got a serialization error and were not retried after this, + deadlock_failures is the number of transactions + that got a deadlock error and were not retried after this. + If --failures-detailed is not + specified, serialization_failures + and deadlock_failures are always 0. 
- Here is some example output: + Here is some example output with following options: -1345828501 5601 1542744 483552416 61 2573 0 -1345828503 7884 1979812 565806736 60 1479 0 -1345828505 7208 1979422 567277552 59 1391 0 -1345828507 7685 1980268 569784714 60 1398 0 -1345828509 7073 1979779 573489941 236 1411 0 - +pgbench --aggregate-interval=10 --time=20 --client=10 --log --rate=1000 +--latency-limit=10 --failures-detailed --max-tries=10 test + + + +1649114136 5815 27552565 177846919143 1078 21716 2756787 7264696105 0 9661 0 7854 31472 4022 4022 0 +1649114146 5958 28460110 182785513108 1083 20391 2539395 6411761497 0 7268 0 8127 32595 4101 4101 0 + + Notice that while the plain (unaggregated) log file shows which script diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c index acf3e56413..4d4b979e4f 100644 --- a/src/bin/pgbench/pgbench.c +++ b/src/bin/pgbench/pgbench.c @@ -4494,6 +4494,17 @@ doLog(TState *thread,
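The reordered per-interval log format discussed in this thread can be checked with a small parser (Python; the parser itself is illustrative, not part of the patch — only the column order and the sample line come from the documentation change above):

```python
# Sketch: split one pgbench --aggregate-interval log line into named
# fields, using the reordered column list from the patch (retried/retries
# before the failure counts, failure counts last).
FIELDS = ["interval_start", "num_transactions", "sum_latency",
          "sum_latency_2", "min_latency", "max_latency",
          "sum_lag", "sum_lag_2", "min_lag", "max_lag", "skipped",
          "retried", "retries", "failures",
          "serialization_failures", "deadlock_failures"]

def parse_aggregate_line(line):
    values = [int(v) for v in line.split()]
    assert len(values) == len(FIELDS), "unexpected column count"
    return dict(zip(FIELDS, values))

# Sample interval taken from the documentation example above.
row = parse_aggregate_line(
    "1649114136 5815 27552565 177846919143 1078 21716 "
    "2756787 7264696105 0 9661 0 7854 31472 4022 4022 0")
assert row["num_transactions"] == 5815 and row["deadlock_failures"] == 0
```

With the fixed column order, consumers no longer need to know which of --rate, --latency-limit, --failures-detailed, or --max-tries were in effect to locate a field: unused fields are simply zero.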
Re: shared-memory based stats collector - v67
Hi, On 2022-03-23 10:42:03 -0700, Andres Freund wrote: > On 2022-03-23 17:27:50 +0900, Kyotaro Horiguchi wrote: > > + /* > > +* When starting with crash recovery, reset pgstat data - it might not > > be > > +* valid. Otherwise restore pgstat data. It's safe to do this here, > > +* because postmaster will not yet have started any other processes > > +* > > +* TODO: With a bit of extra work we could just start with a pgstat file > > +* associated with the checkpoint redo location we're starting from. > > +*/ > > + if (ControlFile->state == DB_SHUTDOWNED || > > + ControlFile->state == DB_SHUTDOWNED_IN_RECOVERY) > > + pgstat_restore_stats(); > > + else > > + pgstat_discard_stats(); > > + > > > > Before there, InitWalRecovery changes the state to > > DB_IN_ARCHIVE_RECOVERY if it was either DB_SHUTDOWNED or > > DB_IN_PRODUCTION. So the stat seems like always discarded on standby. > > Hm. I thought it worked at some point. I guess there's a reason this commit is > a separate commit marked WIP ;) FWIW, it had gotten broken by commit be1c00ab13a7c2c9299d60cb5a9d285c40e2506c Author: Heikki Linnakangas Date: 2022-02-16 09:22:44 +0200 Move code around in StartupXLOG(). because that moved the spot where ControlFile->state = DB_IN_CRASH_RECOVERY is set to an earlier location. Greetings, Andres Freund
Re: pushdown of joinquals beyond group by/distinct on
On Tue, 5 Apr 2022 at 07:40, Arne Roland wrote: > can someone point out to me, why we don't consider pushdowns of the joinqual > for these queries beyond the distinct on? > > When the qual matches the distinct clause, it should be possible to generate > both parametrized and non parametrized subplans for the same query. The same > should hold true for aggregates, if the group by clause matches. Is there any > specific reason we aren't doing that already? Your example shows that it's not always beneficial to pushdown such quals. In all cases where we currently consider qual pushdowns, we do so without any costing. This is done fairly early in planning before we have any visibility as to if it would be useful or not. With your example case, if we unconditionally rewrote the subquery to be laterally joined and pushed the condition into the subquery then we could slow down a bunch of cases as the planner would be forced into using a parameterized nested loop. I don't really see how we could properly cost this short of performing the join search twice. The join search is often the most costly part of planning. When you consider that there might be many quals to push and/or many subqueries to do this to, the number of times we'd need to perform the join search might explode fairly quickly. That wouldn't be great for queries where there are many join-levels to search. It might be possible if we could come up with some heuristics earlier in planning to determine if it's going to be a useful transformation to make. However, that seems fairly difficult in the absence of any cardinality estimations. David
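The cost concern above can be put in rough numbers with a back-of-envelope model (Python; the counting is deliberately simplified — it considers only left-deep join orders and treats each pushdown candidate as one full extra join search):

```python
from math import factorial

# Back-of-envelope model: even counting only left-deep join orders
# (n! of them for n relations), re-running the entire join search once
# per pushdown candidate multiplies an already-exponential cost.
def join_orders_examined(n_rels, n_candidates):
    orders_per_search = factorial(n_rels)   # left-deep orders only
    return (1 + n_candidates) * orders_per_search

assert join_orders_examined(8, 0) == 40320        # one ordinary search
assert join_orders_examined(8, 3) == 4 * 40320    # three extra searches
```

This is why a cheap up-front heuristic would be needed: paying the full join-search price once per candidate transformation quickly dominates planning time on queries with many join levels.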
Re: shared-memory based stats collector - v68
Hi, On 2022-04-04 15:24:24 -0700, David G. Johnston wrote: > Replacing the existing assert(!kind->fixed_amount) with > assert(!kind->accessed_across_databases) produces the same result as the > later presently implies the former. I wasn't proposing to replace, but to add... > Now I start to dislike the behavioral aspect of the attribute and would > rather just name it: kind->is_cluster_scoped (or something else that is > descriptive of the stat category itself, not how it is used) I'm not in love with the name either. But cluster is just a badly overloaded word :(. system_wide? Or invert it and say: database_scoped? I think I like the latter. > Then reorganize the Kind documentation to note and emphasize these two > primary descriptors: > variable, which can be cluster or database scoped > fixed, which are cluster scoped by definition Hm. There's not actually that much difference between cluster/non-cluster wide scope for most of the system. I'm not strongly against, but I'm also not really seeing the benefit. > (if this is true...but given this is an optimization category I'm thinking > maybe it doesn't actually matter...) It is true. Not sure what you mean with "optimization category"? Greetings, Andres Freund
Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes
On Tue, Apr 5, 2022 at 2:18 AM Robert Haas wrote: > On Fri, Apr 1, 2022 at 5:03 PM Thomas Munro wrote: > > Another idea would be to call a new function DropPendingWritebacks(), > > and also tell all the SMgrRelation objects to close all their internal > > state (ie the fds + per-segment objects) but not free the main > > SMgrRelationData object, and for good measure also invalidate the > > small amount of cached state (which hasn't been mentioned in this > > thread, but it kinda bothers me that that state currently survives, so > > it was one unspoken reason I liked the smgrcloseall() idea). Since > > DropPendingWritebacks() is in a rather different module, perhaps if we > > were to do that we'd want to rename PROCSIGNAL_BARRIER_SMGRRELEASE to > > something else, because it's not really a SMGR-only thing anymore. > > I'm not sure that it really matters, but with the idea that I proposed > it's possible to "save" a pending writeback, if we notice that we're > accessing the relation with a proper lock after the barrier arrives > and before we actually try to do the writeback. With this approach we > throw them out immediately, so they're just gone. I don't mind that > because I don't think this will happen often enough to matter, and I > don't think it will be very expensive when it does, but it's something > to think about. The checkpointer never takes heavyweight locks, so the opportunity you're describing can't arise. This stuff happens entirely within the scope of a call to BufferSync(), which is called by CheckPointBuffer(). BufferSync() does WritebackContextInit() to set up the object that accumulates the writeback requests, and then runs for a while performing a checkpoint (usually spread out over time), and then at the end does IssuePendingWritebacks(). A CFI() can be processed any time up until the IssuePendingWritebacks(), but not during IssuePendingWritebacks(). That's the size of the gap we need to cover. I still like the patch I posted most recently. 
Note that it's not quite as I described above: there is no DropPendingWritebacks(), because that turned out to be impossible to implement in a way that the barrier handler could call -- it needs access to the writeback context. Instead I had to expose a counter so that IssuePendingWritebacks() could detect whether any smgrreleaseall() calls had happened while the writeback context was being filled up with writeback requests, by comparing with the level recorded therein.
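The counter-comparison scheme described above can be sketched as a toy model (Python, no relation to the actual bufmgr/smgr code): a global counter is bumped by each release-all, snapshotted into the writeback context when it starts accumulating requests, and compared again when the requests are about to be issued.

```python
# Toy model: if any smgrreleaseall() happened while the writeback context
# was being filled, the pending writeback requests are discarded instead
# of being issued against possibly-recycled file descriptors.
smgr_release_count = 0

def smgrreleaseall():
    global smgr_release_count
    smgr_release_count += 1        # barrier handler closed all smgr fds

class WritebackContext:
    def __init__(self):
        self.counter_at_init = smgr_release_count   # level recorded therein
        self.pending = []

    def remember(self, request):
        self.pending.append(request)

    def issue_pending(self):
        if smgr_release_count != self.counter_at_init:
            self.pending.clear()   # fds released mid-fill: drop requests
            return 0
        issued = len(self.pending)
        self.pending.clear()
        return issued

wb = WritebackContext()
wb.remember("writeback for block 1")
smgrreleaseall()                   # barrier arrives before issuing
assert wb.issue_pending() == 0     # requests dropped, not issued
```

As noted above, dropping the requests outright loses a few writebacks in the rare case a barrier arrives mid-checkpoint, but it keeps the barrier handler free of any dependency on the writeback context itself.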
Re: shared-memory based stats collector - v68
On Mon, Apr 4, 2022 at 2:54 PM Andres Freund wrote: > > > As the existing function only handles functions and relations why not > just > > perform a specific Kind check for them? Generalizing to assert on > whether > > or not the function works on fixed or variable Kinds seems beyond its > > present state. Or could it be used, as-is, for databases, replication > > slots, and subscriptions today, and we just haven't migrated those areas > to > > use the now generalized function? > > It couldn't quite be used for those, because it really only makes sense for > objects "within a database", because it wants to reset the timestamp of the > pg_stat_database row too (I don't like that behaviour as-is, but that's the > topic of another thread as you know...). > > It will work for other per-database stats though, once we have them. > > > > Even then, unless we do expand the > > definition of the this publicly facing function is seems better to > > precisely define what it requires as an input Kind by checking for > RELATION > > or FUNCTION specifically. > > I don't see a benefit in adding a restriction on it that we'd just have to > lift again? > > How about adding a > Assert(!pgstat_kind_info_for(kind)->accessed_across_databases) > > and extending the function comment to say that it's used for per-database > stats and that it resets both the passed-in stats object as well as > pg_stat_database? > > I could live with adding that, but... Replacing the existing assert(!kind->fixed_amount) with assert(!kind->accessed_across_databases) produces the same result as the later presently implies the former. 
Now I start to dislike the behavioral aspect of the attribute and would rather just name it: kind->is_cluster_scoped (or something else that is descriptive of the stat category itself, not how it is used) Then reorganize the Kind documentation to note and emphasize these two primary descriptors: variable, which can be cluster or database scoped fixed, which are cluster scoped by definition (if this is true...but given this is an optimization category I'm thinking maybe it doesn't actually matter...) + /* cluster-scoped object stats having a variable number of entries */ + PGSTAT_KIND_REPLSLOT = 1, /* per-slot statistics */ + PGSTAT_KIND_SUBSCRIPTION, /* per-subscription statistics */ + PGSTAT_KIND_DATABASE, /* database-wide statistics */ (I moved this to 3rd spot to be closer to the database-scoped options) + + /* database-scoped object stats having a variable number of entries */ + PGSTAT_KIND_RELATION, /* per-table statistics */ + PGSTAT_KIND_FUNCTION, /* per-function statistics */ + + /* cluster-scoped stats having a fixed number of entries */ (maybe these should go first, the variable following?) + PGSTAT_KIND_ARCHIVER, + PGSTAT_KIND_BGWRITER, + PGSTAT_KIND_CHECKPOINTER, + PGSTAT_KIND_SLRU, + PGSTAT_KIND_WAL, David J.
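The categorization sketched above can be written out as data to see whether the proposed assertion holds (Python; the attribute names such as "database_scoped" follow the naming under discussion in this thread and are proposals, not committed API):

```python
# The proposed grouping of stats kinds: variable-numbered kinds may be
# cluster- or database-scoped; fixed-numbered kinds are cluster-scoped
# by definition.
STATS_KINDS = {
    # variable-numbered, cluster-scoped
    "replslot":     {"fixed_amount": False, "database_scoped": False},
    "subscription": {"fixed_amount": False, "database_scoped": False},
    "database":     {"fixed_amount": False, "database_scoped": False},
    # variable-numbered, database-scoped
    "relation":     {"fixed_amount": False, "database_scoped": True},
    "function":     {"fixed_amount": False, "database_scoped": True},
    # fixed-numbered (cluster-scoped by definition)
    "archiver":     {"fixed_amount": True,  "database_scoped": False},
    "bgwriter":     {"fixed_amount": True,  "database_scoped": False},
    "checkpointer": {"fixed_amount": True,  "database_scoped": False},
    "slru":         {"fixed_amount": True,  "database_scoped": False},
    "wal":          {"fixed_amount": True,  "database_scoped": False},
}

# The point made above: database-scoped implies variable-numbered, so
# asserting on scope is at least as strong as asserting !fixed_amount.
assert all(not k["fixed_amount"]
           for k in STATS_KINDS.values() if k["database_scoped"])
```

Laid out this way, it is visible why replacing assert(!kind->fixed_amount) with a scope-based assertion produces the same result today: the only database-scoped kinds (relation, function) are both variable-numbered.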
Re: Granting SET and ALTER SYSTEM privileges for GUCs
On 4/4/22 17:26, Tom Lane wrote: > Mark Dilger writes: >> [ v15 patch ] > Thanks. As I'm working through this, I'm kind of inclined to drop > the has_parameter_privilege() variants that take an OID as object > identifier. This gets back to the fact that I don't think > pg_parameter_acl OIDs have any outside use; we wouldn't even have > them except that we need a way to track their role dependencies > in pg_shdepend. AFAICS users will only ever be interested in > looking up a GUC by name. Any objections? Not from me. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Re: shared-memory based stats collector - v68
Hi, On 2022-04-04 14:25:57 -0700, David G. Johnston wrote: > > You mentioned this as a restriction above - I'm not seeing it as such? I'd > > like to write out stats more often in the future (e.g. in the > > checkpointer), > > but then it'd not be written out with this function... > > > > > Yeah, the idea only really works if you can implement "last one out, shut > off the lights". I think I was subconsciously wanting this to work that > way, but the existing process is good. Preserving stats more than we do today (the patch doesn't really affect that) will require a good chunk more work. My idea for it is that we'd write the file out as part of a checkpoint / restartpoint, with a name including the redo-lsn. Then when recovery starts, it can use the stats file associated with that to start from. Then we'd lose at most 1 checkpoint's worth of stats during a crash, not more. There are a few non-trivial corner cases to solve, around stats objects getting dropped concurrently with creating that serialized snapshot. Solvable, but not trivial. > > > + * I also am unsure, off the top of my head, whether both replication > > > slots and subscriptions, > > > + * which are fixed, can be reset singly (today, and/or whether this > > patch > > > enables that capability) > > > + */ > > > > FWIW, neither are implemented as fixed amount stats. > > > That was a typo, I meant to write variable. My point was that of these 5 > kinds that will pass the assertion test only 2 of them are actually handled > by the function today. > > + PGSTAT_KIND_DATABASE = 1, /* database-wide statistics */ > + PGSTAT_KIND_RELATION, /* per-table statistics */ > + PGSTAT_KIND_FUNCTION, /* per-function statistics */ > + PGSTAT_KIND_REPLSLOT, /* per-slot statistics */ > + PGSTAT_KIND_SUBSCRIPTION, /* per-subscription statistics */ > As the existing function only handles functions and relations why not just > perform a specific Kind check for them?
Generalizing to assert on whether > or not the function works on fixed or variable Kinds seems beyond its > present state. Or could it be used, as-is, for databases, replication > slots, and subscriptions today, and we just haven't migrated those areas to > use the now generalized function? It couldn't quite be used for those, because it really only makes sense for objects "within a database", because it wants to reset the timestamp of the pg_stat_database row too (I don't like that behaviour as-is, but that's the topic of another thread as you know...). It will work for other per-database stats though, once we have them. > Even then, unless we do expand the > definition of this publicly facing function it seems better to > precisely define what it requires as an input Kind by checking for RELATION > or FUNCTION specifically. I don't see a benefit in adding a restriction on it that we'd just have to lift again? How about adding a Assert(!pgstat_kind_info_for(kind)->accessed_across_databases) and extending the function comment to say that it's used for per-database stats and that it resets both the passed-in stats object as well as pg_stat_database? Greetings, Andres Freund
Re: psql - add SHOW_ALL_RESULTS option
On 02.04.22 15:26, Fabien COELHO wrote: Again, after the SendQuery refactoring extraction. I'm doing this locally, so don't feel obliged to send more of these. ;-) Good for me :-) This has been committed. I reduced some of your stylistic changes in order to keep the surface area of this complicated patch small. We can apply some of those later if you are interested. Right now, let's let it settle a bit.
Re: Granting SET and ALTER SYSTEM privileges for GUCs
Mark Dilger writes: > [ v15 patch ] Thanks. As I'm working through this, I'm kind of inclined to drop the has_parameter_privilege() variants that take an OID as object identifier. This gets back to the fact that I don't think pg_parameter_acl OIDs have any outside use; we wouldn't even have them except that we need a way to track their role dependencies in pg_shdepend. AFAICS users will only ever be interested in looking up a GUC by name. Any objections? Another thought here is that I see you're expending some code to store the canonical name of a GUC in pg_parameter_acl, but I think that's probably going too far. In particular, for the legacy mixed-case names like "DateStyle", what ends up in the table is the mixed-case name, which seems problematic. It would definitely be problematic if an extension used such a name, because we might or might not be aware of the idiosyncratic casing at the time a GRANT is issued. I'm thinking that we really want to avoid looking up custom GUCs at all during GRANT, because that can't do anything except create hazards. So I think that instead of what you've got here, we should (1) apply the map_old_guc_names[] mapping, which is constant (for any one PG release anyway) (2) smash to lower case (3) verify validity per valid_variable_name. This also simplifies life on the lookup side, where it's sufficient to do steps (1) and (2) before performing a catalog search. Thoughts? regards, tom lane
Re: shared-memory based stats collector - v68
On Mon, Apr 4, 2022 at 2:06 PM Andres Freund wrote: > > > My first encounter with pg_stat_exists_stat() didn't draw my attention as > > being problematic so I'd say we just stick with it. As a SQL user > reading: > > WHERE exists (...) is somewhat natural; using "have" or back-to-back > > stat_stat is less appealing. > > There are a number of other *_exists functions, albeit not within > pg_stat_*. Like jsonb_exists. Perhaps just pg_stat_exists()? > > Works for me. > A way to get back to the old behaviour seems good, > and the function idea doesn't provide that. > Makes sense. > (merged the typos that I hadn't already fixed based on Justin / Thomas' > feedback) > > > > @@ -371,7 +371,9 @@ pgstat_discard_stats(void) > > /* > > * pgstat_before_server_shutdown() needs to be called by exactly one > > process > > * during regular server shutdowns. Otherwise all stats will be lost. > > - * > > + * XXX: What bad things happen if this is invoked by more than one > process? > > + * I'd presume stats are not actually lost in that case. Can we just > > 'no-op' > > + * subsequent calls and say "at least once at shutdown, as late as > > possible" > > What's the reason behind this question? There really shouldn't be a second > call (and there's only a single callsite). As is you'd get an assertion > failure about things already having been shut down. > Mostly OCD I guess, "exactly one" has two failure modes - zero, and > 1; and the "Otherwise" only covers the zero mode. > I don't think we want to relax that, because in all the realistic scenarios > that I can think of that'd open us up to losing stats that were generated > after the first writeout of the stats data. > You mentioned this as a restriction above - I'm not seeing it as such? I'd > like to write out stats more often in the future (e.g. in the > checkpointer), > but then it'd not be written out with this function... > > Yeah, the idea only really works if you can implement "last one out, shut off the lights".
I think I was subconsciously wanting this to work that way, but the existing process is good. > > > @@ -654,6 +656,14 @@ pgstat_reset_single_counter(PgStat_Kind kind, Oid > > objoid) > > > > Assert(!pgstat_kind_info_for(kind)->fixed_amount); > > > > + /* > > + * More of a conceptual observation here - the fact that something is > > fixed does not imply > > + * that it is not fixed at a value greater than zero and thus could have > > single subentries > > + * that could be addressed. > > pgstat_reset_single_counter() is a pre-existing function (with a > pre-existing > name, but adapted signature in the patch), it's currently only used for > functions and relation stats. > > > > + * I also am unsure, off the top of my head, whether both replication > > slots and subscriptions, > > + * which are fixed, can be reset singly (today, and/or whether this > patch > > enables that capability) > > + */ > > FWIW, neither are implemented as fixed amount stats. That was a typo, I meant to write variable. My point was that of these 5 kinds that will pass the assertion test only 2 of them are actually handled by the function today. + PGSTAT_KIND_DATABASE = 1, /* database-wide statistics */ + PGSTAT_KIND_RELATION, /* per-table statistics */ + PGSTAT_KIND_FUNCTION, /* per-function statistics */ + PGSTAT_KIND_REPLSLOT, /* per-slot statistics */ + PGSTAT_KIND_SUBSCRIPTION, /* per-subscription statistics */ > There's one example of fixed amount stats that can be reset more > granularly, > namely slru. That can be done via pg_stat_reset_slru(). > > Right, hence the conceptual disconnect. It doesn't affect the implementation, everything is working just fine, but is something to ponder for future maintainers getting up to speed here. As the existing function only handles functions and relations why not just perform a specific Kind check for them? Generalizing to assert on whether or not the function works on fixed or variable Kinds seems beyond its present state. 
Or could it be used, as-is, for databases, replication slots, and subscriptions today, and we just haven't migrated those areas to use the now generalized function? Even then, unless we do expand the definition of this publicly facing function it seems better to precisely define what it requires as an input Kind by checking for RELATION or FUNCTION specifically. David J.
Re: shared-memory based stats collector - v68
Hi, On 2022-04-04 13:45:40 -0700, David G. Johnston wrote: > I didn't take the time to fixup all the various odd typos in the general > code comments; none of them reduced comprehension appreciably. I may do so > when/if I do another pass. Cool. > My first encounter with pg_stat_exists_stat() didn't draw my attention as > being problematic so I'd say we just stick with it. As a SQL user reading: > WHERE exists (...) is somewhat natural; using "have" or back-to-back > stat_stat is less appealing. There are a number of other *_exists functions, albeit not within pg_stat_*. Like jsonb_exists. Perhaps just pg_stat_exists()? > I would suggest we do away with stats_fetch_consistency "snapshot" mode and > instead add a function that can be called that would accomplish the same > thing but in "cache" mode. Future iterations of that function could accept > patterns, allowing for something between "one" and "everything". I don't want to do that. We had a lot of discussion around what consistency model we want, and Tom was adamant that there needs to be a mode that behaves like the current consistency model (which is what snapshot behaves like, with very minor differences). A way to get back to the old behaviour seems good, and the function idea doesn't provide that. (merged the typos that I hadn't already fixed based on Justin / Thomas' feedback) > @@ -371,7 +371,9 @@ pgstat_discard_stats(void) > /* > * pgstat_before_server_shutdown() needs to be called by exactly one > process > * during regular server shutdowns. Otherwise all stats will be lost. > - * > + * XXX: What bad things happen if this is invoked by more than one process? > + * I'd presume stats are not actually lost in that case. Can we just > 'no-op' > + * subsequent calls and say "at least once at shutdown, as late as > possible" What's the reason behind this question? There really shouldn't be a second call (and there's only a single callsite). 
As is you'd get an assertion failure about things already having been shut down. I don't think we want to relax that, because in all the realistic scenarios that I can think of that'd open us up to losing stats that were generated after the first writeout of the stats data. You mentioned this as a restriction above - I'm not seeing it as such? I'd like to write out stats more often in the future (e.g. in the checkpointer), but then it'd not be written out with this function... > @@ -654,6 +656,14 @@ pgstat_reset_single_counter(PgStat_Kind kind, Oid > objoid) > > Assert(!pgstat_kind_info_for(kind)->fixed_amount); > > + /* > + * More of a conceptual observation here - the fact that something is > fixed does not imply > + * that it is not fixed at a value greater than zero and thus could have > single subentries > + * that could be addressed. pgstat_reset_single_counter() is a pre-existing function (with a pre-existing name, but adapted signature in the patch), it's currently only used for functions and relation stats. > + * I also am unsure, off the top of my head, whether both replication > slots and subscriptions, > + * which are fixed, can be reset singly (today, and/or whether this patch > enables that capability) > + */ FWIW, neither are implemented as fixed amount stats. There's afaics no limit at all for the number of existing subscriptions (although some would either need to be disabled or you'd get errors). While there is a limit on the number of slots, that's a configurable limit. So replication slot stats are also implemented as variable amount stats (that used to be different, wasn't nice). There's one example of fixed amount stats that can be reset more granularly, namely slru. That can be done via pg_stat_reset_slru(). Thanks, Andres
Re: New Object Access Type hooks
> On Apr 4, 2022, at 1:47 PM, Tom Lane wrote: > > Yeah, it's plausible to get a failure on either the write or read side > depending on timing. > > Perhaps libpq should be trying harder to make those cases look alike, but > this test is about server behavior not libpq behavior, so I'm inclined > to just make it lax. +1. I've gotten this test failure only a few times in perhaps the last six months, so if we narrow the opportunity for test failure without closing it entirely, we're just making the test failures that much harder to diagnose. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: New Object Access Type hooks
Mark Dilger writes: >> On Apr 4, 2022, at 12:05 PM, Tom Lane wrote: >> So scratch that. Maybe we'd better add "could not send data to server" >> to the regex? > If it fails in pqsecure_raw_write(), you get either "server closed the > connection unexpectedly" or "could not send data to server". Do we need to > support pgtls_write() or pg_GSS_write(), which have different error messages? Don't see why, since this test sets up a new cluster in which neither is enabled. > Is it possible that pgFlush will call pqSendSome which calls pqReadData > before trying to write anything, and get back a "could not receive data from > server" from pqsecure_raw_read()? Yeah, it's plausible to get a failure on either the write or read side depending on timing. Perhaps libpq should be trying harder to make those cases look alike, but this test is about server behavior not libpq behavior, so I'm inclined to just make it lax. regards, tom lane
Re: shared-memory based stats collector - v68
On Sun, Apr 3, 2022 at 9:16 PM Andres Freund wrote: > > Please take a look! > > I didn't take the time to fixup all the various odd typos in the general code comments; none of them reduced comprehension appreciably. I may do so when/if I do another pass. I did skim over the entire patch set and, FWIW, found it to be quite understandable. I don't have the experience to comment on the lower-level details like locking and such but the medium picture stuff makes sense to me both as a user and a developer. I did leave a couple of comments about parts that at least piqued my interest (reset single stats) or seemed like an undesirable restriction that was under addressed (before server shutdown called exactly once). I agree with Thomas's observation regarding PGSTAT_KIND_LAST. I also think that leaving it starting at 1 makes sense - maybe just fix the name and comment to better reflect its actual usage in core. I concur also with changing usages of " / " to ", or" My first encounter with pg_stat_exists_stat() didn't draw my attention as being problematic so I'd say we just stick with it. As a SQL user reading: WHERE exists (...) is somewhat natural; using "have" or back-to-back stat_stat is less appealing. I would suggest we do away with stats_fetch_consistency "snapshot" mode and instead add a function that can be called that would accomplish the same thing but in "cache" mode. Future iterations of that function could accept patterns, allowing for something between "one" and "everything". I'm also not an immediate fan of "fetch_consistency"; with the function suggestion it is basically "cache" and "no-cache" so maybe: stats_use_transaction_cache ? (haven't thought hard or long on this one...) 
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml index 22d0a1e491..e889c11d9e 100644 --- a/doc/src/sgml/monitoring.sgml +++ b/doc/src/sgml/monitoring.sgml @@ -2123,7 +2123,7 @@ postgres 27093 0.0 0.0 30096 2752 ?Ss 11:34 0:00 postgres: ser PgStatsData - Waiting fo shared memory stats data access + Waiting for shared memory stats data access SerializableXactHash diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c index 2689d0962c..bc7bdf8064 100644 --- a/src/backend/tcop/postgres.c +++ b/src/backend/tcop/postgres.c @@ -4469,7 +4469,7 @@ PostgresMain(const char *dbname, const char *username) /* * (4) turn off the idle-in-transaction, idle-session and - * idle-state-update timeouts if active. We do this before step (5) so + * idle-stats-update timeouts if active. We do this before step (5) so * that any last-moment timeout is certain to be detected in step (5). * * At most one of these timeouts will be active, so there's no need to diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c index dbd55a065d..370638b33b 100644 --- a/src/backend/utils/activity/pgstat.c +++ b/src/backend/utils/activity/pgstat.c @@ -5,7 +5,7 @@ * Provides the infrastructure to collect and access cumulative statistics, * e.g. per-table access statistics, of all backends in shared memory. * - * Most statistics updates are first first accumulated locally in each process + * Most statistics updates are first accumulated locally in each process * as pending entries, then later flushed to shared memory (just after commit, * or by idle-timeout). * @@ -371,7 +371,9 @@ pgstat_discard_stats(void) /* * pgstat_before_server_shutdown() needs to be called by exactly one process * during regular server shutdowns. Otherwise all stats will be lost. - * + * XXX: What bad things happen if this is invoked by more than one process? + * I'd presume stats are not actually lost in that case. 
Can we just 'no-op' + * subsequent calls and say "at least once at shutdown, as late as possible" * We currently only write out stats for proc_exit(0). We might want to change * that at some point... But right now pgstat_discard_stats() would be called * during the start after a disorderly shutdown, anyway. @@ -654,6 +656,14 @@ pgstat_reset_single_counter(PgStat_Kind kind, Oid objoid) Assert(!pgstat_kind_info_for(kind)->fixed_amount); + /* + * More of a conceptual observation here - the fact that something is fixed does not imply + * that it is not fixed at a value greater than zero and thus could have single subentries + * that could be addressed. + * I also am unsure, off the top of my head, whether both replication slots and subscriptions, + * which are fixed, can be reset singly (today, and/or whether this patch enables that capability) + */ + /* Set the reset timestamp for the whole database */ pgstat_reset_database_timestamp(MyDatabaseId, ts); David J.
Re: Mingw task for Cirrus CI
Hi, On 2022-03-30 17:26:18 -0700, Andres Freund wrote: > Hi, > > On 2022-03-22 19:00:42 +0300, Melih Mutlu wrote: > > Rebased it. > > I also removed the temp installation task and > > used NoDefaultCurrentDirectoryInExePath env variable instead. > > Hm. But you're still using a separate build directory, from what I can see? > The NoDefaultCurrentDirectoryInExePath thing should only have an effect when > not using a separate build directory, no? Melih chatted with me about making this work. Turns out it doesn't readily - pg_ctl still fails. The reason that NoDefaultCurrentDirectoryInExePath doesn't suffice to get a working in-tree build is that we break it ourselves: int find_my_exec(const char *argv0, char *retpath) ... #ifdef WIN32 /* Win32 checks the current directory first for names without slashes */ join_path_components(retpath, cwd, argv0); if (validate_exec(retpath) == 0) return resolve_symlinks(retpath); #endif So even if windows doesn't actually use the current path, and the current pg_ctl process isn't the one from the current directory, we *still* return that. Gah. Maybe we should just use GetModuleFileName()? But even after that the tests don't work. Commands started via IPC::Run do, but when using system() they don't. Looks like Perl parses the path itself :(. - Andres
Re: New Object Access Type hooks
> On Apr 4, 2022, at 12:05 PM, Tom Lane wrote: > > I wrote: >> The "terminating connection" warning absolutely should get through, > > ... oh, no, that's not guaranteed at all, since it's sent from quickdie(). > So scratch that. Maybe we'd better add "could not send data to server" > to the regex? If it fails in pqsecure_raw_write(), you get either "server closed the connection unexpectedly" or "could not send data to server". Do we need to support pgtls_write() or pg_GSS_write(), which have different error messages? Can anybody run the tests with TLS or GSS enabled? I assume the test framework prevents this, but I didn't check too closely. Is it possible that pgFlush will call pqSendSome which calls pqReadData before trying to write anything, and get back a "could not receive data from server" from pqsecure_raw_read()? It's a bit hard to prove to myself which paths might be followed through this code. Thoughts? — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Pluggable toaster
On Mon, Apr 4, 2022 at 4:05 PM Nikita Malakhov wrote: > - Is 'git apply' not a valid way to apply such patches? I have found that it never works. This case is no exception: [rhaas pgsql]$ git apply ~/Downloads/1_toaster_interface_v4.patch /Users/rhaas/Downloads/1_toaster_interface_v4.patch:253: trailing whitespace. toasterapi.o /Users/rhaas/Downloads/1_toaster_interface_v4.patch:1276: trailing whitespace. { /Users/rhaas/Downloads/1_toaster_interface_v4.patch:1294: trailing whitespace. * CREATE TOASTER name HANDLER handler_name /Users/rhaas/Downloads/1_toaster_interface_v4.patch:2261: trailing whitespace. * va_toasterdata could contain varatt_external structure for old Toast /Users/rhaas/Downloads/1_toaster_interface_v4.patch:3047: trailing whitespace. SELECT attnum, attname, atttypid, attstorage, tsrname error: patch failed: src/backend/commands/tablecmds.c:42 error: src/backend/commands/tablecmds.c: patch does not apply error: patch failed: src/backend/commands/tablecmds.c:943 error: src/backend/commands/tablecmds.c: patch does not apply error: patch failed: src/backend/commands/tablecmds.c:973 error: src/backend/commands/tablecmds.c: patch does not apply error: patch failed: src/backend/commands/tablecmds.c:44 error: src/backend/commands/tablecmds.c: patch does not apply I would really encourage you to use 'git format-patch' to generate a stack of patches. But there is no point in reposting 30+ patches that haven't been properly refactored into separate chunks. You need to maintain a branch, periodically rebased over master, with some probably-small number of patches on it, each of which is a logically independent patch with its own commit message, its own clear purpose, etc. And then generate patches to post from there using 'git format-patch'. Look into using 'git rebase -i --autosquash' and 'git commit --fixup' to maintain the branch, if you're not already familiar with those things. 
Also, it is a really good idea when you post the patch set to include in the email a clear description of the overall purpose of the patch set and what each patch does toward that goal. e.g. "The overall goal of this patch set is to support faster-than-light travel. Currently, PostgreSQL does not know anything about the speed of light, so 0001 adds some code for speed-of-light detection. Building on this, 0002 adds general support for disabling physical laws of the universe. Then, 0003 makes use of this support to disable specifically the speed of light." Perhaps you want a little more text than that for each patch, depending on the situation, but this gives you the idea, I hope. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Pluggable toaster
Hi, Thank you very much for your review, I would have liked to get it much earlier. I'm currently working on cleaning up code, and would try to comment as much as possible. This patch set is really a large set of functionality, it was divided into 4 logically complete parts, but anyway these parts contain a lot of changes themselves. - Yes, you're right, new syntax is added. I'm also working on the documentation part for it; - Patches actually consist of a lot of minor commits. As I see we have to squash them to provide parts as 2-3 main commits without any unnecessary garbage; - Is 'git apply' not a valid way to apply such patches? We'll try to straighten these issues out asap. Thank you! On Mon, Apr 4, 2022 at 7:40 PM Robert Haas wrote: > On Mon, Apr 4, 2022 at 12:13 PM Nikita Malakhov wrote: > > I'm really sorry. Messed up some code while merging rebased branches > with previous (v1) > > patches issued in December and haven't noticed that seg fault because of > corrupted code > > while running check-world. > > I've fixed messed code in Dummy Toaster package and checked twice - > all's correct now, > > patches are applied correctly and tests are clean. > > Thank you for pointing out the error and for your patience! > > Hi, > > This patch set doesn't seem anywhere close to committable to me. For > example: > > - It apparently adds a new command called CREATE TOASTER but that > command doesn't seem to be documented anywhere. > > - contrib/dummy_toaster is added in patch 1 with no implementation > and code comments that say "Bloom index utilities" and then those > comments are fixed and an implementation is added in later patches. > > - What is supposedly patch 1 is actually 32 patch files concatenated > together. It doesn't apply properly either with 'patch' or 'git am' so > I don't understand what we would even do with this. > > - None of these patches have appropriate commit messages.
> > - Many of these patches add 'XXX' comments or errors or other > obviously unfinished bits of code. Some of this may be cleaned up by > later patches but it's hard to tell because right now one can't even > apply the patch set properly. Even if that were possible, it's the job > of the person submitting the patch to organize the patch into > independent, separately committable chunks that are self-contained and > have good comments and good commit messages for each one. > > I don't think based on the status of this patch set that it's even > possible to provide useful feedback on the design at this point, never > mind getting anything committed. > > -- > Robert Haas > EDB: http://www.enterprisedb.com > -- Regards, Nikita Malakhov Postgres Professional https://postgrespro.ru/
pushdown of joinquals beyond group by/distinct on
Hi, can someone point out to me why we don't consider pushdowns of the joinqual for these queries beyond the distinct on? When the qual matches the distinct clause, it should be possible to generate both parameterized and non-parameterized subplans for the same query. The same should hold true for aggregates, if the group by clause matches. Is there any specific reason we aren't doing that already? Regards Arne Attachment: distinct_on_lateral.sql
Re: Higher level questions around shared memory stats
Hi, On 2022-03-30 14:08:41 -0700, Andres Freund wrote: > On 2022-03-30 14:42:23 -0400, Robert Haas wrote: > > On Tue, Mar 29, 2022 at 5:01 PM Andres Freund wrote: > > > I can think of these different times: > > > > > > - Last time stats were removed due to starting up in crash recovery > > > - Last time stats were created from scratch, because no stats data file > > > was > > > present at startup > > > - Last time stats were thrown away due to corruption > > > - Last time a subset of stats were reset using one of the pg_reset* > > > functions > > > > > > Makes sense? > > > > Yes. Possibly that last could be broken into two: when all stats were > > last reset, when some stats were last reset. > > Believe it or not, we don't currently have a function to reset all stats. We > should definitely add that though, because the invocation to reset all stats > gets more ridiculous^Wcomplicated with each release. I assume we'd want all of these to be persistent (until the next time stats are lost, of course)? We currently have the following SQL functions showing reset times: pg_stat_get_bgwriter_stat_reset_time pg_stat_get_db_stat_reset_time and kind of related: pg_stat_get_snapshot_timestamp There's also a few functions showing the last time something has happened, like pg_stat_get_last_analyze_time(). Trying to fit those patterns we could add functions like: pg_stat_get_stats_last_cleared_recovery_time pg_stat_get_stats_last_cleared_corrupted_time pg_stat_get_stats_last_not_present_time pg_stat_get_stats_last_partially_reset_time Not great, but I don't immediately have a better idea. Maybe it'd look better as a view / SRF? pg_stat_stats / pg_stat_get_stats()? Potential column names: - cleared_recovery_time - cleared_corrupted_time - not_present_time - partially_reset_time It's not a lot of time to code these up. However, it's late in cycle, and they're not suddenly needed due to shared memory stats, so I have a few doubts about adding them now? Greetings, Andres Freund
Re: Pluggable toaster
Thanks for the review Robert. I think that gives some pretty actionable advice on how to improve the patch and it doesn't seem likely to get much more in this cycle. I'll mark the patch Returned with Feedback. Hope to see it come back with improvements in the next release.
Re: shared-memory based stats collector - v68
Hi, On 2022-04-03 21:15:16 -0700, Andres Freund wrote: > - collect who reviewed earlier revisions I found reviews by - Tomas Vondra - Arthur Zakirov - Antonin Houska There's also reviews by Fujii and Alvaro, but afaics just for parts that were separately committed. Greetings, Andres Freund
Re: New Object Access Type hooks
I wrote: > The "terminating connection" warning absolutely should get through, ... oh, no, that's not guaranteed at all, since it's sent from quickdie(). So scratch that. Maybe we'd better add "could not send data to server" to the regex? regards, tom lane
Re: New Object Access Type hooks
Mark Dilger writes: > # Running: pg_ctl kill QUIT 2083 > ok 4 - killed process with SIGQUIT > # pump_until: process terminated unexpectedly when searching for > "(?^m:WARNING: terminating connection because of crash of another server > process|server closed the connection unexpectedly|connection to server was > lost)" with stream: "psql::9: WARNING: terminating connection because > of unexpected SIGQUIT signal > # psql::9: could not send data to server: Socket is not connected > # " > not ok 5 - psql query died successfully after SIGQUIT And there we have it: the test wasn't updated for the new backend message spelling, and we're seeing a different frontend behavior. Evidently the backend is dying before we're able to send the "SELECT 1;" to it. I'm not quite sure whether it's a libpq bug that it doesn't produce the "connection to server was lost" message here, but in any case I suspect that we shouldn't be checking for the second and third regex alternatives. The "terminating connection" warning absolutely should get through, and if it doesn't we want to know about it. So my proposal for a fix is to change the regex to be just "WARNING: terminating connection because of unexpected SIGQUIT signal". regards, tom lane
Re: New Object Access Type hooks
> On Apr 4, 2022, at 10:44 AM, Mark Dilger wrote: > > I'm writing a parallel test just for this. Will get back to you. Ok, that experiment didn't accomplish anything, beyond refreshing my memory regarding Cluster.pm preferring sockets over ports. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Run pg_amcheck in 002_pg_upgrade.pl and 027_stream_regress.pl?
On Mon, Apr 4, 2022 at 2:16 PM Andres Freund wrote: > On 2022-04-04 10:02:37 -0400, Robert Haas wrote: > > It does a good job, I think, checking all the things that a human being > > could potentially spot just by looking at an individual page. > > I think there's a few more things that'd be good to check. For example amcheck > doesn't verify that HOT chains are reasonable, which can often be spotted > looking at an individual page. Which is a bit unfortunate, given how many bugs > we had in that area. > > Stuff to check around that: > - target of redirect has HEAP_ONLY_TUPLE, HEAP_UPDATED set > - In a valid ctid chain within a page (i.e. xmax = xmin): > - tuples have HEAP_UPDATED set > - HEAP_ONLY_TUPLE / HEAP_HOT_UPDATED matches across chains elements > > I think it'd also be good to check for things like visible tuples following > invisible ones. Interesting. *takes notes* -- Robert Haas EDB: http://www.enterprisedb.com
Re: JSON constructors and window functions
On 4/4/22 11:43, Tom Lane wrote: > Andrew Dunstan writes: >> On 4/3/22 22:46, Andrew Dunstan wrote: >>> On 4/3/22 20:11, Andres Freund wrote: I don't think you're allowed to free stuff in a finalfunc - we might reuse the transition state for further calls to the aggregate. >>> Doh! Of course! I'll fix it in the morning. Thanks. >> I've committed a fix for this. I didn't find something to clean out the >> hash table, so I just removed the 'hash_destroy' and left it at that. >> All the test I did came back with expected results. >> Maybe a hash_reset() is something worth having assuming it's possible? I >> note that simplehash has a reset function. > But removing the hash entries would be just as much of a problem > wouldn't it? > > Yes, quite possibly. It looks from some experimentation as though, unlike my naive preconception, it doesn't process each frame again from the beginning, so losing the hash entries could indeed be an issue here. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Re: JSON constructors and window functions
On 4/4/22 12:33, Andres Freund wrote: > Hi, > > On 2022-04-04 11:54:23 -0400, Greg Stark wrote: >> Are we missing regression tests using these functions as window functions? > So far, yes. > > ISTM that 4eb97988796 should have at least included the crashing statement as > a test... The statement can be simpler too: > > SELECT json_objectagg(k : v with unique keys) OVER (ORDER BY k) FROM (VALUES > (1,1), (2,2)) a(k,v); > > is sufficient to trigger the crash for me, without even using asan (after > reverting the bugfix, of course). > I will add some regression tests. >> Hm. I suppose it's possible to write a general purpose regression test >> that loops over all aggregate functions and runs them as window >> functions and aggregates over the same data sets and compares results. >> At least for the case of aggregate functions with a single parameter >> belonging to a chosen set of data types. > I was wondering about that too. Hardest part would be to come up with values > to pass to the aggregates. > > I don't think it'd help in this case though, since it depends on special case > grammar stuff to even be reached. json_objectagg(k : v with unique > keys). "Normal" use of aggregates can't even reach the problematic path > afaics. > It can, as Jaime's original post showed. But on further consideration I'm thinking this area needs some rework. ISTM that it might be a whole lot simpler and comprehensible to generate the json first without bothering about null values or duplicate keys and then in the finalizer check for null values to be skipped and duplicate keys. That way we would need to keep far less state for the aggregation functions, although it might be marginally less efficient. Thoughts? cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
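The restructuring Andrew proposes (accumulate everything during transition calls, and defer NULL-skipping and duplicate-key checking to the finalizer) can be sketched abstractly. This is a Python toy model, not the SQL/JSON implementation; the function names and option flags are invented for illustration:

```python
import json

def json_objectagg_transfn(state, key, value):
    """Transition step: just accumulate pairs, keep no extra bookkeeping."""
    if state is None:
        state = []
    state.append((key, value))
    return state

def json_objectagg_finalfn(state, absent_on_null=False, with_unique_keys=False):
    """Finalizer: apply NULL-skipping and key-uniqueness checks in one pass,
    without mutating or freeing the accumulated state."""
    pairs = state or []
    if absent_on_null:
        pairs = [(k, v) for k, v in pairs if v is not None]
    if with_unique_keys:
        seen = set()
        for k, _ in pairs:
            if k in seen:
                raise ValueError(f"duplicate JSON object key value: {k!r}")
            seen.add(k)
    # Simplification: a dict keeps the last occurrence of a duplicate key;
    # the real json type preserves duplicates unless uniqueness is enforced.
    return json.dumps(dict(pairs))
```

Because the finalizer neither mutates nor frees the accumulated state, it can be called repeatedly over the same state, which is what repeated per-row finalization in window-function evaluation requires.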
Re: Run pg_amcheck in 002_pg_upgrade.pl and 027_stream_regress.pl?
Hi, On 2022-04-04 10:02:37 -0400, Robert Haas wrote: > It does a good job, I think, checking all the things that a human being > could potentially spot just by looking at an individual page. I think there's a few more things that'd be good to check. For example amcheck doesn't verify that HOT chains are reasonable, which can often be spotted looking at an individual page. Which is a bit unfortunate, given how many bugs we had in that area. Stuff to check around that: - target of redirect has HEAP_ONLY_TUPLE, HEAP_UPDATED set - In a valid ctid chain within a page (i.e. xmax = xmin): - tuples have HEAP_UPDATED set - HEAP_ONLY_TUPLE / HEAP_HOT_UPDATED matches across chains elements I think it'd also be good to check for things like visible tuples following invisible ones. Greetings, Andres Freund
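The ctid-chain checks in Andres's list can be expressed as a toy validator. This is an illustrative Python model only: the tuple stub and boolean flags stand in for heap tuple infomask bits, it ignores redirect targets, and it is not amcheck code.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class HeapTupleStub:
    """Simplified stand-in for a heap tuple; flags mirror infomask bits."""
    xmin: int
    xmax: Optional[int]   # None = not updated/deleted
    heap_only: bool       # HEAP_ONLY_TUPLE
    hot_updated: bool     # HEAP_HOT_UPDATED
    updated: bool         # HEAP_UPDATED

def check_hot_chain(chain: List[HeapTupleStub]) -> List[str]:
    """Report violations of the in-page chain invariants described above."""
    problems = []
    for i in range(len(chain) - 1):
        cur, nxt = chain[i], chain[i + 1]
        # Only a valid ctid link (updater's xmax == successor's xmin)
        # makes the two tuples a chain element pair.
        if cur.xmax != nxt.xmin:
            continue
        if not nxt.updated:
            problems.append(f"tuple {i+1}: successor lacks HEAP_UPDATED")
        # A member followed by another in-page member should be HOT-updated,
        # and every non-first member should be heap-only.
        if not cur.hot_updated:
            problems.append(f"tuple {i}: updated in-chain but HEAP_HOT_UPDATED not set")
        if not nxt.heap_only:
            problems.append(f"tuple {i+1}: chain successor lacks HEAP_ONLY_TUPLE")
    return problems
```

A real implementation would also need the redirect-target check and visibility ordering that the message mentions; the point here is only that these invariants are per-page and cheaply checkable.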
Re: New Object Access Type hooks
> On Apr 4, 2022, at 11:07 AM, Tom Lane wrote: > > I was hoping to see regress_log_013_crash_restart, though. regress_log_013_crash_restart Description: Binary data — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: New Object Access Type hooks
Mark Dilger writes: > The test logs are attached. Server log looks as-expected: 2022-04-04 09:11:51.087 PDT [2084] 013_crash_restart.pl LOG: statement: SELECT pg_sleep(3600); 2022-04-04 09:11:51.094 PDT [2083] 013_crash_restart.pl WARNING: terminating connection because of unexpected SIGQUIT signal 2022-04-04 09:11:51.095 PDT [2070] LOG: server process (PID 2083) exited with exit code 2 2022-04-04 09:11:51.095 PDT [2070] DETAIL: Failed process was running: INSERT INTO alive VALUES($$in-progress-before-sigquit$$) RETURNING status; 2022-04-04 09:11:51.095 PDT [2070] LOG: terminating any other active server processes I was hoping to see regress_log_013_crash_restart, though. regards, tom lane
Re: New Object Access Type hooks
> On Apr 4, 2022, at 10:51 AM, Tom Lane wrote: > > Oh, I'm barking up the wrong tree. This test must have been run > against HEAD between 6da65a3f9 (23 Feb) and 2beb4acff (31 Mar), when > pump_until indeed didn't print this essential information :-( > > If you just got this failure, could you look in the log to > see if there's a pump_until report? I was running `make -j12 check-world` against my local patched version of master: commit 80399fa5f208c4acd4ec194c47e534ba8dd3ae7c (HEAD -> 0001) Author: Mark Dilger Date: Mon Mar 28 13:35:11 2022 -0700 Allow grant and revoke of privileges on parameters Add new SET and ALTER SYSTEM privileges for configuration parameters (GUCs), and a new catalog, pg_parameter_acl, for tracking grants of these privileges. The privilege to SET a parameter marked USERSET is inherent in that parameter's marking and cannot be revoked. This saves cycles when performing SET operations, as looking up privileges in the catalog can be skipped. If we find that administrators need to revoke SET privilege on a particular variable from public, that variable can be redefined in future releases as SUSET with a default grant of SET to PUBLIC issued. commit 4eb9798879680dcc0e3ebb301cf6f925dfa69422 (origin/master, origin/HEAD, master) Author: Andrew Dunstan Date: Mon Apr 4 10:12:30 2022 -0400 Avoid freeing objects during json aggregate finalization Commit f4fb45d15c tried to free memory during aggregate finalization. This cause issues, particularly when used as a window function, so stop doing that. Per complaint by Jaime Casanova and diagnosis by Andres Freund Discussion: https://postgr.es/m/YkfeMNYRCGhySKyg@ahch-to The test logs are attached. 013_crash_restart_primary.log Description: Binary data — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: shared-memory based stats collector - v68
Hi, On 2022-04-05 01:16:04 +1200, Thomas Munro wrote: > On Mon, Apr 4, 2022 at 4:16 PM Andres Freund wrote: > > Please take a look! > > A few superficial comments: > > > [PATCH v68 01/31] pgstat: consistent function header formatting. > > [PATCH v68 02/31] pgstat: remove some superflous comments from pgstat.h. > > +1 Planning to commit these after making another coffee and proof reading them some more. > > [PATCH v68 03/31] dshash: revise sequential scan support. > > Logic looks good. That is, > lock-0-and-ensure_valid_bucket_pointers()-only-once makes sense. Just > some comment trivia: > > + * dshash_seq_term needs to be called when a scan finished. The caller may > + * delete returned elements midst of a scan by using dshash_delete_current() > + * if exclusive = true. > > s/scan finished/scan is finished/ > s/midst of/during/ (or /in the middle of/, ...) > > > [PATCH v68 04/31] dsm: allow use in single user mode. > > LGTM. > + Assert(IsUnderPostmaster || !IsPostmasterEnvironment); > (Not this patch's fault, but I wish we had a more explicit way to say "am > single user".) Agreed. > > [PATCH v68 05/31] pgstat: stats collector references in comments > > LGTM. I could think of some alternative suggested names for this subsystem, > but don't think it would be helpful at this juncture so I will refrain :-) Heh. I did start a thread about it a while ago :) > > [PATCH v68 08/31] pgstat: introduce PgStat_Kind enum. > > +#define PGSTAT_KIND_FIRST PGSTAT_KIND_DATABASE > +#define PGSTAT_KIND_LAST PGSTAT_KIND_WAL > +#define PGSTAT_NUM_KINDS (PGSTAT_KIND_LAST + 1) > > It's a little confusing that PGSTAT_NUM_KINDS isn't really the number of > kinds, > because there is no kind 0. For the two users of it... maybe just use > pgstat_kind_infos[] = {...}, and > global_valid[PGSTAT_KIND_LAST + 1]? Maybe the whole justification for not defining an invalid kind is moot now. There's not a single switch covering all kinds of stats left, and I hope that we don't introduce one again... 
> > [PATCH v68 10/31] pgstat: scaffolding for transactional stats creation / > > drop. > > + /* > +* Dropping the statistics for objects that dropped transactionally itself > +* needs to be transactional. ... > > Hard to parse. How about: "Objects are dropped transactionally, so > related statistics need to be dropped transactionally too." Not all objects are dropped transactionally. But I agree it reads awkwardly. I now, incorporating feedback from Justin as well, rephrased it to: /* * Statistics for transactionally dropped objects need to be * transactionally dropped as well. Collect the stats dropped in the * current (sub-)transaction and only execute the stats drop when we know * if the transaction commits/aborts. To handle replicas and crashes, * stats drops are included in commit / abort records. */ A few too many "drop"s in there, but maybe that's unavoidable. > +/* > + * Whenever the for a dropped stats entry could not be freed (because > + * backends still have references), this is incremented, causing backends > + * to run pgstat_gc_entry_refs(), allowing that memory to be reclaimed. > + */ > +pg_atomic_uint64 gc_count; > > Whenever the ...? * Whenever statistics for dropped objects could not be freed - because * backends still have references - the dropping backend calls * pgstat_request_entry_refs_gc() incrementing this counter. Eventually * that causes backends to run pgstat_gc_entry_refs(), allowing memory to * be reclaimed. > Would it be better to call this variable gc_request_count? Agreed. > + * Initialize refcount to 1, marking it as valid / not tdroped. The entry > > s/tdroped/dropped/ > > + * further if a longer lived references is needed. > > s/references/reference/ Fixed. > +/* > + * There are legitimate cases where the old stats entry might not > + * yet have been dropped by the time it's reused. The easiest > case > + * are replication slot stats. But oid wraparound can lead to > + * other cases as well. 
We just reset the stats to their plain > + * state. > + */ > +shheader = pgstat_reinit_entry(kind, shhashent); > > This whole comment is repeated in pgstat_reinit_entry and its caller. I guess I felt as indecisive about where to place it between the two locations when I wrote it as I do now. Left it at the callsite for now. > +/* > + * XXX: Might be worth adding some frobbing of the allocation before > + * freeing, to make it easier to detect use-after-free style bugs. > + */ > +dsa_free(pgStatLocal.dsa, pdsa); > > FWIW dsa_free() clobbers memory in assert builds, just like pfree(). Oh. I could swear I saw that not being the case a while ago. But clearly it is the case. Removed. > +static Size
Re: New Object Access Type hooks
I wrote: > Is our CI setup failing to capture stderr from TAP tests?? Oh, I'm barking up the wrong tree. This test must have been run against HEAD between 6da65a3f9 (23 Feb) and 2beb4acff (31 Mar), when pump_until indeed didn't print this essential information :-( If you just got this failure, could you look in the log to see if there's a pump_until report? regards, tom lane
Re: New Object Access Type hooks
> On Apr 4, 2022, at 10:41 AM, Tom Lane wrote: > > Is our CI setup failing to capture stderr from TAP tests?? I'm looking into the way our TAP test infrastructure assigns port numbers to nodes, and whether that is reasonable during parallel test runs with nodes stopping and starting again. On casual inspection, that doesn't seem ok, because the Cluster.pm logic to make sure nodes get unique ports doesn't seem to be thinking about other parallel tests running. It will notice if another node is already bound to the port, but if another node has been killed and not yet restarted, won't things go off the rails? I'm writing a parallel test just for this. Will get back to you. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
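The race Mark is worried about is the classic check-then-use problem: a port observed free at check time can be taken by someone else before the node binds it. A minimal illustration in Python (not Cluster.pm logic; behavior assumes typical Unix SO_REUSEADDR defaults):

```python
import socket

def port_is_free(port):
    """Check-then-use: a successful probe only describes *this instant*;
    another process can grab the port before the caller uses it."""
    with socket.socket() as s:
        try:
            s.bind(("127.0.0.1", port))
            return True
        except OSError:
            return False

def reserve_port():
    """Safer pattern: bind an ephemeral port and keep the socket open until
    the port is handed to its real user, so nobody else can take it."""
    s = socket.socket()
    s.bind(("127.0.0.1", 0))
    return s, s.getsockname()[1]
```

A killed-but-not-yet-restarted node is exactly the window where the probe succeeds for a port that is still "owned" by another test, which is why holding a reservation (rather than probing) is the robust approach.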
Re: New Object Access Type hooks
I wrote: > Usually the test would succeed anyway because of matching the > second or third regex alternative, but I wonder if there is > some other spelling of libpq's complaint that shows up > occasionally. It'd be nice if we could see the contents of > $killme_stderr upon failure. OK, now I'm confused, because pump_until is very clearly *trying* to report exactly that: if (not $proc->pumpable()) { diag("pump_until: process terminated unexpectedly when searching for \"$until\" with stream: \"$$stream\""); return 0; } and if I intentionally break the regex then I do see this output when running the test by hand: # Running: pg_ctl kill QUIT 1922645 ok 4 - killed process with SIGQUIT # pump_until: process terminated unexpectedly when searching for "(?^m:WARNING: terminating connection because of crash of another server process|server closed the connection foounexpectedly|connection to server was lost)" with stream: "psql::9: WARNING: terminating connection because of unexpected SIGQUIT signal # psql::9: server closed the connection unexpectedly # This probably means the server terminated abnormally # before or while processing the request. # " not ok 5 - psql query died successfully after SIGQUIT Is our CI setup failing to capture stderr from TAP tests?? regards, tom lane
Re: Postgres restart in the middle of exclusive backup and the presence of backup_label file
On 4/4/22 11:42 AM, Nathan Bossart wrote: I noticed a couple of other things that can be removed. Since we no longer wait on exclusive backup mode during smart shutdown, we can change connsAllowed (in postmaster.c) to a boolean and remove CAC_SUPERUSER. We can also remove a couple of related notes in the documentation. I've done all this in the attached patch. These changes look good to me. IMV it is a real bonus how much the state machine has been simplified. I've also run this patch through the pgbackrest regression tests without any problems. Regards, -- -David da...@pgmasters.net
Re: New Object Access Type hooks
Mark Dilger writes: > I just got a crash in this test again. Are you still interested? I still > have the logs. No core file appears to have been generated. > The test failure is > not ok 5 - psql query died successfully after SIGQUIT Hmm ... I can see one problem with that test: ok( pump_until( $killme, $psql_timeout, \$killme_stderr, qr/WARNING: terminating connection because of crash of another server process|server closed the connection unexpectedly|connection to server was lost/m ), "psql query died successfully after SIGQUIT"); It's been a little while since that message looked like that. Nowadays you get WARNING: terminating connection because of unexpected SIGQUIT signal server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. Usually the test would succeed anyway because of matching the second or third regex alternative, but I wonder if there is some other spelling of libpq's complaint that shows up occasionally. It'd be nice if we could see the contents of $killme_stderr upon failure. regards, tom lane
Re: A test for replay of regression tests
I wrote: > I think the fundamental instability here is that this TAP test is > setting shared_buffers small enough to allow the syncscan logic > to kick in where it does not in normal testing. Maybe we should > just disable syncscan in this test script? Did that, we'll see how much it helps. regards, tom lane
Re: Pluggable toaster
On Mon, Apr 4, 2022 at 12:13 PM Nikita Malakhov wrote: > I'm really sorry. Messed up some code while merging rebased branches with > previous (v1) > patches issued in December and haven't noticed that seg fault because of > corrupted code > while running check-world. > I've fixed messed code in Dummy Toaster package and checked twice - all's > correct now, > patches are applied correctly and tests are clean. > Thank you for pointing out the error and for your patience! Hi, This patch set doesn't seem anywhere close to committable to me. For example: - It apparently adds a new command called CREATE TOASTER but that command doesn't seem to be documented anywhere. - contrib/dummy_toaster is added in patch 1 with no implementation and code comments that say "Bloom index utilities" and then those comments are fixed and an implementation is added in later patches. - What is supposedly patch 1 is actually 32 patch files concatenated together. It doesn't apply properly either with 'patch' or 'git am' so I don't understand what we would even do with this. - None of these patches have appropriate commit messages. - Many of these patches add 'XXX' comments or errors or other obviously unfinished bits of code. Some of this may be cleaned up by later patches but it's hard to tell because right now one can't even apply the patch set properly. Even if that were possible, it's the job of the person submitting the patch to organize the patch into independent, separately committable chunks that are self-contained and have good comments and good commit messages for each one. I don't think based on the status of this patch set that it's even possible to provide useful feedback on the design at this point, never mind getting anything committed. -- Robert Haas EDB: http://www.enterprisedb.com

Re: JSON constructors and window functions
Hi, On 2022-04-04 11:54:23 -0400, Greg Stark wrote: > Are we missing regression tests using these functions as window functions? So far, yes. ISTM that 4eb97988796 should have at least included the crashing statement as a test... The statement can be simpler too: SELECT json_objectagg(k : v with unique keys) OVER (ORDER BY k) FROM (VALUES (1,1), (2,2)) a(k,v); is sufficient to trigger the crash for me, without even using asan (after reverting the bugfix, of course). > Hm. I suppose it's possible to write a general purpose regression test > that loops over all aggregate functions and runs them as window > functions and aggregates over the same data sets and compares results. > At least for the case of aggregate functions with a single parameter > belonging to a chosen set of data types. I was wondering about that too. Hardest part would be to come up with values to pass to the aggregates. I don't think it'd help in this case though, since it depends on special case grammar stuff to even be reached. json_objectagg(k : v with unique keys). "Normal" use of aggregates can't even reach the problematic path afaics. Greetings, Andres Freund
Re: Showing I/O timings spent reading/writing temp buffers in EXPLAIN
On Tue, Apr 05, 2022 at 12:51:12AM +0900, Masahiko Sawada wrote: > On Mon, Apr 4, 2022 at 1:30 PM Julien Rouhaud wrote: > > > > Hmm, but AFAICS the json format would be stable as the counters are always > > shown even if zero. So just doing the json format first and then the text > > format should also work. Although if you're really unlucky there could be a > > cache invalidation in between so we could just ignore the text format. But > > I > > think we should at least keep a regression test with the json format, with a > > comment explain why only this one is tested. > > Fair point. By commit 7e12256b478 we disabled track_io_timing, but > probably we can temporarily enable it and run one query with "buffers" > and "format json" options. Yes, enabling it for just this query. It can't really find any problem with the values themselves but at least the new code path would be partially executed. > > > > - not really your patch fault I guess, but I see that extendBufFile() isn't > > handled. There shouldn't be much activity there so maybe it's ok. > > This is likely because smgr_extend is also not handled, but this one seems > > much more likely to take quite some time, and therefore should bump the > > timing counters. > > You mean we should include the time for opening files as write time? Yes. In normal circumstances it shouldn't need a lot of time to do that, but I'm not so sure with e.g. network filesystems. I'm not strongly in favor of counting it, especially since smgrextend doesn't either. > IIUC smgrextend() writes data while extending file whereas > extendBufFile() doesn't do that but just opens a new file. Note that smgrextend can also call register_dirty_segment(), which can also take some time, so we can't just assume that all the time there is IO time.
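The pattern under discussion (bump a timing counter around an I/O operation, gated by a track_io_timing-style switch) can be sketched generically. This is an illustrative Python model; the counter names merely echo the proposed pg_stat_statements columns:

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass

track_io_timing = True   # stand-in for the GUC of the same name

@dataclass
class BufFileStats:
    temp_blk_read_time: float = 0.0
    temp_blk_write_time: float = 0.0

@contextmanager
def io_timer(stats, counter):
    """Accumulate elapsed wall time into one counter, only when enabled.

    Per the discussion above, the timed region should cover just the I/O
    call itself; bookkeeping such as register_dirty_segment() should stay
    outside it so it is not misattributed as write time."""
    if not track_io_timing:
        yield
        return
    start = time.perf_counter()
    try:
        yield
    finally:
        setattr(stats, counter,
                getattr(stats, counter) + (time.perf_counter() - start))
```

Usage would be, for instance, `with io_timer(stats, "temp_blk_write_time"): f.write(buf)`, leaving open the policy question raised here of whether file-open/extend time belongs inside or outside the timed region.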
Re: Run pg_amcheck in 002_pg_upgrade.pl and 027_stream_regress.pl?
> On Apr 4, 2022, at 9:27 AM, Peter Geoghegan wrote: > > I'd really like it if amcheck had HOT chain verification. That's the > other area where catching bugs passively with assertions and whatnot > is clearly not good enough. I agree, and was hoping to get around to this in the postgres 15 development cycle. Alas, that did not happen. Worse, I have several other projects that will keep me busy for the next few months, at least. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Run pg_amcheck in 002_pg_upgrade.pl and 027_stream_regress.pl?
On Mon, Apr 4, 2022 at 7:02 AM Robert Haas wrote: > Yeah, I was very excited about verify_heapam(). There is a lot more > stuff that we could check, but a lot of those things would be much > more expensive to check. If anything I understated the value of verify_heapam() with this kind of work before. Better to show just how valuable it is using an example. Let's introduce a fairly blatant bug to vacuumlazy.c. This change makes VACUUM fail to account for the fact that skipping a skippable range with an all-visible page makes it unsafe to advance relfrozenxid: --- a/src/backend/access/heap/vacuumlazy.c +++ b/src/backend/access/heap/vacuumlazy.c @@ -1371,8 +1371,6 @@ lazy_scan_skip(LVRelState *vacrel, Buffer *vmbuffer, BlockNumber next_block, else { *skipping_current_range = true; -if (skipsallvis) -vacrel->skippedallvis = true; } return next_unskippable_block; If I run "make check-world", the tests all pass! But when I run pg_amcheck against an affected "regression" database, it will complain about relfrozenxid related corruption in several different tables. > It does a good job, I think, checking all the > things that a human being could potentially spot just by looking at an > individual page. I love the idea of using it in regression testing in > more places. It might find bugs in amcheck, which would be good, but I > think it's even more likely to help us find bugs in other code. I'd really like it if amcheck had HOT chain verification. That's the other area where catching bugs passively with assertions and whatnot is clearly not good enough. -- Peter Geoghegan
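The invariant the deleted lines enforce can be restated compactly: relfrozenxid may only be advanced if VACUUM skipped no page that was all-visible but not all-frozen. A toy restatement follows (illustrative Python, not the lazy_scan_skip() logic itself; visibility-map bits are modeled as booleans, and the `skipped_allvis` flag mirrors `skippedallvis` in the diff above):

```python
def can_advance_relfrozenxid(pages, skipped_blocks):
    """Toy model of the safety rule: a skipped page that is all-visible but
    not all-frozen may still hold unfrozen XIDs, so advancing relfrozenxid
    past them would be wrong.

    'pages' maps block number -> (all_visible, all_frozen)."""
    skipped_allvis = False
    for blkno in skipped_blocks:
        all_visible, all_frozen = pages[blkno]
        if all_visible and not all_frozen:
            skipped_allvis = True   # the flag the buggy diff fails to set
    return not skipped_allvis
```

With the flag-setting removed, as in the diff, the function above would always return True, which is precisely the corruption pg_amcheck then reports as relfrozenxid inconsistencies.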
Re: JSON constructors and window functions
Hi, On 2022-04-04 11:43:31 -0400, Tom Lane wrote: > Andrew Dunstan writes: > > On 4/3/22 22:46, Andrew Dunstan wrote: > >> On 4/3/22 20:11, Andres Freund wrote: > >>> I don't think you're allowed to free stuff in a finalfunc - we might > >>> reuse the > >>> transition state for further calls to the aggregate. > > >> Doh! Of course! I'll fix it in the morning. Thanks. > > > I've committed a fix for this. I didn't find something to clean out the > > hash table, so I just removed the 'hash_destroy' and left it at that. > > All the test I did came back with expected results. > > Maybe a hash_reset() is something worth having assuming it's possible? I > > note that simplehash has a reset function. > > But removing the hash entries would be just as much of a problem > wouldn't it? I think so. I guess we could mark it as FINALFUNC_MODIFY = READ_WRITE. But I don't see a reason why it'd be needed here. Is it a problem that skipped_keys is reset in the finalfunc? I don't know how these functions work. So far I don't understand why JsonUniqueBuilderState->skipped_keys is long lived... Greetings, Andres Freund
Re: New Object Access Type hooks
> On Mar 21, 2022, at 10:03 PM, Thomas Munro wrote: > > On Fri, Mar 18, 2022 at 4:22 PM Mark Dilger > wrote: >> (FYI, I got a test failure from src/test/recovery/t/013_crash_restart.pl >> when testing v1-0001. I'm not sure yet what that is about.) > > Doesn't look like 0001 has anything to do with that... Are you on a > Mac? Did it look like this recent failure from CI? > > https://cirrus-ci.com/task/4686108033286144 > https://api.cirrus-ci.com/v1/artifact/task/4686108033286144/log/src/test/recovery/tmp_check/log/regress_log_013_crash_restart > https://api.cirrus-ci.com/v1/artifact/task/4686108033286144/log/src/test/recovery/tmp_check/log/013_crash_restart_primary.log > > I have no idea what is going on there, but searching for discussion > brought me here... I just got a crash in this test again. Are you still interested? I still have the logs. No core file appears to have been generated. The test failure is not ok 5 - psql query died successfully after SIGQUIT — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: JSON constructors and window functions
Are we missing regression tests using these functions as window functions? Hm. I suppose it's possible to write a general purpose regression test that loops over all aggregate functions and runs them as window functions and aggregates over the same data sets and compares results. At least for the case of aggregate functions with a single parameter belonging to a chosen set of data types.
Re: Showing I/O timings spent reading/writing temp buffers in EXPLAIN
On Mon, Apr 4, 2022 at 1:30 PM Julien Rouhaud wrote: > > Hi, > > On Tue, Mar 01, 2022 at 11:35:32AM +0900, Masahiko Sawada wrote: > > On Wed, Jan 19, 2022 at 5:52 PM Julien Rouhaud wrote: > > > > > > It seems that the regression tests aren't entirely stable, per cfbot: > > > https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/36/3298. > > > > > > The failures look like: > > > > > > diff -U3 > > > /tmp/cirrus-ci-build/src/test/recovery/../regress/expected/explain.out > > > /tmp/cirrus-ci-build/src/test/recovery/tmp_check/results/explain.out > > > --- > > > /tmp/cirrus-ci-build/src/test/recovery/../regress/expected/explain.out > > > 2022-01-19 03:50:37.087931908 + > > > +++ /tmp/cirrus-ci-build/src/test/recovery/tmp_check/results/explain.out > > > 2022-01-19 03:57:41.013616137 + > > > @@ -512,9 +512,10 @@ > > > I/O Timings: temp read=N.N write=N.N > > > -> Function Scan on generate_series (cost=N.N..N.N rows=N width=N) > > > (actual time=N.N..N.N rows=N loops=N) > > > I/O Timings: temp read=N.N write=N.N > > > + I/O Timings: shared/local read=N.N > > > Planning Time: N.N ms > > > Execution Time: N.N ms > > > -(6 rows) > > > +(7 rows) > > > > > > select explain_filter('explain (analyze, buffers, format json) select > > > count(*) from generate_series(1, 10)'); > > > explain_filter > > > > > > > > > I don't see any obvious error in the code, so I'm wondering if it's simply > > > the result of populating the cache with generate_series() info. > > > > Thank you for reporting. > > > > You're right, the regression test with track_io_timing = on is not > > entirely stable because we may read catalog data during planning time, > > resulting in an additional line in the EXPLAIN output. I've removed > > the regression tests. > > Hmm, but AFAICS the json format would be stable as the counters are always > shown even if zero. So just doing the json format first and then the text > format should also work. 
Although if you're really unlucky there could be a > cache invalidation in between so we could just ignore the text format. But I > think we should at least keep a regression test with the json format, with a > comment explain why only this one is tested. Fair point. By commit 7e12256b478 we disabled track_io_timing, but probably we can temporarily enable it and run one query with "buffers" and "format json" options. > > > I've attached updated patches. I've included an improvement of > > pg_stat_statement support to support temp I/O timing. > > Great! > > Some other considerations: > > - should we update pg_stat_statements documentation (and comments) for > blk_(read|write)_time to mention that it's for *data file* blocks rather > than > simply blocks? It seems obvious now that we'd have > temp_blk_(read|write)_time, but still. This part should probably be > backpatched. Agreed. > > - not really your patch fault I guess, but I see that extendBufFile() isn't > handled. There shouldn't be much activity there so maybe it's ok. > This is likely because smgr_extend is also not handled, but this one seems > much more likely to take quite some time, and therefore should bump the > timing counters. You mean we should include the time for opening files as write time? IIUC smgrextend() writes data while extending file whereas extendBufFile() doesn't do that but just opens a new file. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
Re: Granting SET and ALTER SYSTEM privileges for GUCs
> On Apr 4, 2022, at 8:36 AM, Tom Lane wrote: > > Mark Dilger writes: >> If we want to backtrack to v8, that's fine. I can rebase that, port >> some of the other changes from v14 to it, and repost it as v15. > > Are you working on that? I've set aside time this week to hopefully > get this over the finish line, but I don't want to find out that > I've been duplicating your effort. Yes, I expect to be posting the latest in maybe an hour? I believe the latest patch (just reviewing, adjusting code comments, etc.) that I'm preparing to post has all the changes we've discussed, aside from your parameterId renaming. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: standby recovery fails (tablespace related) (tentative patch and discussion)
On Mon, Apr 4, 2022 at 2:25 PM Kyotaro Horiguchi wrote: > > At Mon, 04 Apr 2022 17:29:48 +0900 (JST), Kyotaro Horiguchi > wrote in > > I haven't found how the patch caused creation of a relation file that > > is to be removed soon. However, I find that v19 patch fails by maybe > > due to some change in Cluster.pm. It takes a bit more time to check > > that.. > > I was a bit away, of course the wal-logged create database interfares > with the patch here. But I haven't found that why it stops creating > database directory under pg_tblspc. I did not understand what is the exact problem here, but the database directory and the version file are created under the default tablespace of the target database. However, other than the default tablespace of the database, the database directory will be created along with the smgrcreate() so that we do not create an unnecessary directory under the tablespace where we do not have any data to be copied. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: JSON constructors and window functions
Andrew Dunstan writes:
> On 4/3/22 22:46, Andrew Dunstan wrote:
>> On 4/3/22 20:11, Andres Freund wrote:
>>> I don't think you're allowed to free stuff in a finalfunc - we might
>>> reuse the transition state for further calls to the aggregate.
>> Doh! Of course! I'll fix it in the morning. Thanks.

> I've committed a fix for this. I didn't find something to clean out the
> hash table, so I just removed the 'hash_destroy' and left it at that.
> All the tests I did came back with expected results.

> Maybe a hash_reset() is something worth having, assuming it's possible? I
> note that simplehash has a reset function.

But removing the hash entries would be just as much of a problem, wouldn't it?

regards, tom lane
Re: Postgres restart in the middle of exclusive backup and the presence of backup_label file
I noticed a couple of other things that can be removed. Since we no longer wait on exclusive backup mode during smart shutdown, we can change connsAllowed (in postmaster.c) to a boolean and remove CAC_SUPERUSER. We can also remove a couple of related notes in the documentation. I've done all this in the attached patch. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From 5d79d79472252207eea5b7dcc52010736da10296 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Wed, 1 Dec 2021 23:50:49 + Subject: [PATCH v10 1/1] remove exclusive backup mode --- doc/src/sgml/backup.sgml | 230 +--- doc/src/sgml/func.sgml| 111 +--- doc/src/sgml/high-availability.sgml | 6 +- doc/src/sgml/monitoring.sgml | 4 +- doc/src/sgml/ref/pg_ctl-ref.sgml | 6 +- doc/src/sgml/ref/pgupgrade.sgml | 2 +- doc/src/sgml/runtime.sgml | 8 +- src/backend/access/transam/xlog.c | 493 ++ src/backend/access/transam/xlogfuncs.c| 253 ++--- src/backend/access/transam/xlogrecovery.c | 2 +- src/backend/catalog/system_functions.sql | 18 +- src/backend/postmaster/postmaster.c | 75 +-- src/backend/replication/basebackup.c | 20 +- src/backend/utils/init/postinit.c | 18 - src/bin/pg_basebackup/t/010_pg_basebackup.pl | 4 + src/bin/pg_ctl/pg_ctl.c | 30 -- src/bin/pg_rewind/filemap.c | 6 +- src/include/access/xlog.h | 7 +- src/include/catalog/pg_control.h | 2 +- src/include/catalog/pg_proc.dat | 28 +- src/include/libpq/libpq-be.h | 3 +- src/include/miscadmin.h | 4 - src/test/perl/PostgreSQL/Test/Cluster.pm | 56 +- .../t/010_logical_decoding_timelines.pl | 4 +- 24 files changed, 213 insertions(+), 1177 deletions(-) diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml index 0d69851bb1..5b7139c7df 100644 --- a/doc/src/sgml/backup.sgml +++ b/doc/src/sgml/backup.sgml @@ -857,16 +857,8 @@ test ! -f /mnt/server/archivedir/000100A90065 cp pg_wal/0 sequence, and that the success of a step is verified before proceeding to the next step. 
- -Low level base backups can be made in a non-exclusive or an exclusive -way. The non-exclusive method is recommended and the exclusive one is -deprecated and will eventually be removed. - - - -Making a Non-Exclusive Low-Level Backup - A non-exclusive low level backup is one that allows other + A low level backup allows other concurrent backups to be running (both those started using the same backup API and those started using ). @@ -881,23 +873,23 @@ test ! -f /mnt/server/archivedir/000100A90065 cp pg_wal/0 Connect to the server (it does not matter which database) as a user with - rights to run pg_start_backup (superuser, or a user who has been granted + rights to run pg_backup_start (superuser, or a user who has been granted EXECUTE on the function) and issue the command: -SELECT pg_start_backup('label', false, false); +SELECT pg_backup_start(label => 'label', fast => false); where label is any string you want to use to uniquely identify this backup operation. The connection - calling pg_start_backup must be maintained until the end of + calling pg_backup_start must be maintained until the end of the backup, or the backup will be automatically aborted. - By default, pg_start_backup can take a long time to finish. - This is because it performs a checkpoint, and the I/O - required for the checkpoint will be spread out over a significant - period of time, by default half your inter-checkpoint interval - (see the configuration parameter + By default, pg_backup_start can take a long time to finish. + This is because it waits for the next checkpoint to complete, and the I/O + required for the checkpoint might be spread out over a significant + period of time (see the configuration parameters + and ). This is usually what you want, because it minimizes the impact on query processing. If you want to start the backup as soon as @@ -905,10 +897,6 @@ SELECT pg_start_backup('label', false, false); issue an immediate checkpoint using as much I/O as available. 
- - The third parameter being false tells - pg_start_backup to initiate a non-exclusive base backup. - @@ -926,7 +914,7 @@ SELECT pg_start_backup('label', false, false); In the same connection as before, issue the command: -SELECT * FROM pg_stop_backup(false, true); +SELECT * FROM pg_backup_stop(wait_for_archive => true); This terminates backup mode. On a primary, it
Re: PATCH: add "--config-file=" option to pg_rewind
Just reposting the previous patches to straighten out the cfbot. --- /usr/lib/python3/dist-packages/patroni/postgresql/__init__.py 2022-02-18 13:16:15.0 + +++ __init__.py 2022-04-03 19:17:29.952665383 + @@ -798,7 +798,8 @@ return True def get_guc_value(self, name): -cmd = [self.pgcommand('postgres'), '-D', self._data_dir, '-C', name] +cmd = [self.pgcommand('postgres'), '-D', self._data_dir, '-C', name, + '--config-file={}'.format(self.config.postgresql_conf)] try: data = subprocess.check_output(cmd) if data: --- /usr/lib/python3/dist-packages/patroni/postgresql/rewind.py 2022-02-18 13:16:15.0 + +++ rewind.py 2022-04-03 19:21:14.479726127 + @@ -314,6 +314,7 @@ cmd = [self._postgresql.pgcommand('pg_rewind')] if self._postgresql.major_version >= 13 and restore_command: cmd.append('--restore-target-wal') +cmd.append('--config-file={0}/postgresql.conf'.format(self._postgresql.config._config_dir)) cmd.extend(['-D', self._postgresql.data_dir, '--source-server', dsn]) while True: doc/src/sgml/ref/pg_rewind.sgml | 12 + src/bin/pg_rewind/pg_rewind.c | 24 - src/bin/pg_rewind/t/001_basic.pl | 1 + src/bin/pg_rewind/t/RewindTest.pm | 105 +++--- 4 files changed, 101 insertions(+), 41 deletions(-) diff --git a/doc/src/sgml/ref/pg_rewind.sgml b/doc/src/sgml/ref/pg_rewind.sgml index 33e6bb64ad..62fcb71825 100644 --- a/doc/src/sgml/ref/pg_rewind.sgml +++ b/doc/src/sgml/ref/pg_rewind.sgml @@ -241,6 +241,18 @@ PostgreSQL documentation + + --config-file=filepath + + +When using the -c / --restore-target-wal option, the restore_command +is extracted from the configuration of the target cluster. If the postgresql.conf +of that cluster is not in the target data directory (or if you want to use an alternative config), +use this option to provide a (relative or absolute) path to the postgresql.conf to be used. 
+ + + + --debug diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c index b39b5c1aac..2e83c7ee8e 100644 --- a/src/bin/pg_rewind/pg_rewind.c +++ b/src/bin/pg_rewind/pg_rewind.c @@ -61,6 +61,7 @@ char *datadir_target = NULL; char *datadir_source = NULL; char *connstr_source = NULL; char *restore_command = NULL; +char *config_file = NULL; static bool debug = false; bool showprogress = false; @@ -87,6 +88,8 @@ usage(const char *progname) printf(_("Options:\n")); printf(_(" -c, --restore-target-wal use restore_command in target configuration to\n" " retrieve WAL files from archives\n")); + printf(_(" --config-file=FILE path to postgresql.conf if it resides outside\n")); + printf(_(" the target data directory (for -c)\n")); printf(_(" -D, --target-pgdata=DIRECTORY existing data directory to modify\n")); printf(_(" --source-pgdata=DIRECTORY source data directory to synchronize with\n")); printf(_(" --source-server=CONNSTRsource server to synchronize with\n")); @@ -114,13 +117,14 @@ main(int argc, char **argv) {"write-recovery-conf", no_argument, NULL, 'R'}, {"source-pgdata", required_argument, NULL, 1}, {"source-server", required_argument, NULL, 2}, + {"debug", no_argument, NULL, 3}, {"no-ensure-shutdown", no_argument, NULL, 4}, + {"config-file", required_argument, NULL, 5}, {"version", no_argument, NULL, 'V'}, {"restore-target-wal", no_argument, NULL, 'c'}, {"dry-run", no_argument, NULL, 'n'}, {"no-sync", no_argument, NULL, 'N'}, {"progress", no_argument, NULL, 'P'}, - {"debug", no_argument, NULL, 3}, {NULL, 0, NULL, 0} }; int option_index; @@ -205,6 +209,10 @@ main(int argc, char **argv) case 4: no_ensure_shutdown = true; break; + + case 5: +config_file = pg_strdup(optarg); +break; } } @@ -1061,6 +1069,13 @@ getRestoreCommand(const char *argv0) appendPQExpBufferStr(postgres_cmd, " -D "); appendShellString(postgres_cmd, datadir_target); + /* add --config_file switch only if requested */ + if (config_file != NULL) + { + 
appendPQExpBufferStr(postgres_cmd, " --config_file="); + appendShellString(postgres_cmd, config_file); + } + /* add -C switch, for restore_command */ appendPQExpBufferStr(postgres_cmd, " -C restore_command"); @@ -1139,6 +1154,13 @@ ensureCleanShutdown(const char *argv0) appendPQExpBufferStr(postgres_cmd, " --single -F -D "); appendShellString(postgres_cmd, datadir_target); + /* add --config_file switch only if requested */ + if (config_file != NULL) + { + appendPQExpBufferStr(postgres_cmd, " --config_file="); + appendShellString(postgres_cmd, config_file); + } + /* finish with the database name, and a properly quoted
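For callers that keep postgresql.conf outside the data directory (the Patroni hunks above are one example), the new option slots into command construction straightforwardly. Here is a hedged Python sketch of that pattern — the helper name and paths are invented for illustration, and `--config-file` assumes the patch above is applied:

```python
# Hypothetical sketch of how a caller such as an HA manager could build a
# pg_rewind invocation once the --config-file option exists.  All names
# and paths here are illustrative assumptions, not a real API.

def build_rewind_command(data_dir, source_dsn, config_dir=None,
                         restore_target_wal=True):
    cmd = ["pg_rewind"]
    if restore_target_wal:
        cmd.append("--restore-target-wal")
        if config_dir is not None:
            # postgresql.conf lives outside the data directory (e.g. on
            # Debian-style layouts), so point pg_rewind at it explicitly.
            cmd.append("--config-file={0}/postgresql.conf".format(config_dir))
    cmd.extend(["-D", data_dir, "--source-server", source_dsn])
    return cmd

print(build_rewind_command("/var/lib/postgresql/14/main",
                           "host=primary dbname=postgres",
                           config_dir="/etc/postgresql/14/main"))
```

Without the option, pg_rewind's `-C restore_command` probe fails when the target data directory holds no postgresql.conf, which is exactly the case the patch addresses.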
Re: Granting SET and ALTER SYSTEM privileges for GUCs
Mark Dilger writes: > If we want to backtrack to v8, that's fine. I can rebase that, port > some of the other changes from v14 to it, and repost it as v15. Are you working on that? I've set aside time this week to hopefully get this over the finish line, but I don't want to find out that I've been duplicating your effort. regards, tom lane
Re: Add non-blocking version of PQcancel
Here is what I consider the final version of this patch. I don't have any changes planned myself (except for ones that come up during review). Things that changed since the previous iteration:

1. postgres_fdw now uses the non-blocking cancellation API (including a test).
2. Added some extra sleeps to the cancellation test, to remove random failures on FreeBSD.

Attachment: 0002-Add-non-blocking-version-of-PQcancel.patch
Re: JSON constructors and window functions
On 4/3/22 22:46, Andrew Dunstan wrote:
> On 4/3/22 20:11, Andres Freund wrote:
>> Hi,
>>
>> On 2022-04-03 18:56:39 -0400, Andrew Dunstan wrote:
>>> Haven't found the issue yet :-( It happens on the second call for the
>>> partition to json_check_unique_key().
>>>
>>> Here's a more idiomatic and self-contained query that triggers the problem.
>>>
>>> select json_objectagg('10' : ref_0.level2 with unique keys)
>>> over (partition by ref_0.parent_no order by ref_0.level2)
>>> from (values (1::int,1::int),(1,2),(2,1),(2,2)) as ref_0(parent_no,level2);
>> The hash was created in a context that's already freed.
>>
> [...]
>>
>> I don't think you're allowed to free stuff in a finalfunc - we might
>> reuse the transition state for further calls to the aggregate.
>>
> Doh! Of course! I'll fix it in the morning. Thanks.
>

I've committed a fix for this. I didn't find something to clean out the hash table, so I just removed the 'hash_destroy' and left it at that. All the tests I did came back with expected results.

Maybe a hash_reset() is something worth having, assuming it's possible? I note that simplehash has a reset function.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
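The constraint Andres points out is general to aggregate implementations: once a final function has run, the executor may keep feeding rows to the same transition state (as window functions do within a partition), so the final function must treat the state as read-only. A minimal sketch in Python — purely illustrative, not PostgreSQL internals:

```python
# Illustrative sketch (not PostgreSQL code): a window aggregate calls the
# final function for each row, then keeps updating the SAME transition
# state.  A final function that destroys the state (like the removed
# hash_destroy) breaks every later call.

def trans(state, key):
    """Transition function: remember keys seen so far."""
    state.setdefault("keys", set()).add(key)
    return state

def final_ok(state):
    """Correct final function: read-only with respect to the state."""
    return len(state.get("keys", set()))

def final_bad(state):
    """Buggy final function: 'frees' the transition state."""
    n = len(state.get("keys", set()))
    state.clear()  # analogous to calling hash_destroy in the finalfunc
    return n

s = {}
trans(s, "a")
assert final_ok(s) == 1   # row 1 of the partition
trans(s, "b")
assert final_ok(s) == 2   # row 2 still sees the accumulated state

s = {}
trans(s, "a")
assert final_bad(s) == 1  # looks fine on the first call...
trans(s, "b")
assert final_bad(s) == 1  # ...but history is gone: wrong result
```

This is also why a hypothetical hash_reset() in the finalfunc would misbehave just as Tom notes in the reply: clearing the entries discards state the next transition call still depends on.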
Re: basebackup/lz4 crash
On Thu, Mar 31, 2022 at 3:19 AM Dipesh Pandit wrote: > Patch attached. Committed. Thanks. -- Robert Haas EDB: http://www.enterprisedb.com
Re: use has_privs_of_role() for pg_hba.conf
On Mon, Apr 04, 2022 at 09:36:13AM -0400, Joshua Brindle wrote: > Good catch, I think this is a logical followup to the previous > has_privs_of_role patch. > > Reviewed and +1 Thanks! I created a commitfest entry for this: https://commitfest.postgresql.org/38/3609/ -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Postgres restart in the middle of exclusive backup and the presence of backup_label file
On Mon, Apr 04, 2022 at 09:56:26AM -0400, David Steele wrote: > Minor typo in the docs: > > + * capable of doing an online backup, but exclude then just in case. > > Should be: > > capable of doing an online backup, but exclude them just in case. fixed -- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From 8960a15630290e49af4906a6951eced563efce7d Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Wed, 1 Dec 2021 23:50:49 + Subject: [PATCH v9 1/1] remove exclusive backup mode --- doc/src/sgml/backup.sgml | 230 +--- doc/src/sgml/func.sgml| 111 +--- doc/src/sgml/high-availability.sgml | 6 +- doc/src/sgml/monitoring.sgml | 4 +- doc/src/sgml/ref/pgupgrade.sgml | 2 +- src/backend/access/transam/xlog.c | 493 ++ src/backend/access/transam/xlogfuncs.c| 253 ++--- src/backend/access/transam/xlogrecovery.c | 2 +- src/backend/catalog/system_functions.sql | 18 +- src/backend/postmaster/postmaster.c | 46 +- src/backend/replication/basebackup.c | 20 +- src/bin/pg_basebackup/t/010_pg_basebackup.pl | 4 + src/bin/pg_ctl/pg_ctl.c | 30 -- src/bin/pg_rewind/filemap.c | 6 +- src/include/access/xlog.h | 7 +- src/include/catalog/pg_control.h | 2 +- src/include/catalog/pg_proc.dat | 28 +- src/include/miscadmin.h | 4 - src/test/perl/PostgreSQL/Test/Cluster.pm | 56 +- .../t/010_logical_decoding_timelines.pl | 4 +- 20 files changed, 199 insertions(+), 1127 deletions(-) diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml index 0d69851bb1..5b7139c7df 100644 --- a/doc/src/sgml/backup.sgml +++ b/doc/src/sgml/backup.sgml @@ -857,16 +857,8 @@ test ! -f /mnt/server/archivedir/000100A90065 cp pg_wal/0 sequence, and that the success of a step is verified before proceeding to the next step. - -Low level base backups can be made in a non-exclusive or an exclusive -way. The non-exclusive method is recommended and the exclusive one is -deprecated and will eventually be removed. 
- - - -Making a Non-Exclusive Low-Level Backup - A non-exclusive low level backup is one that allows other + A low level backup allows other concurrent backups to be running (both those started using the same backup API and those started using ). @@ -881,23 +873,23 @@ test ! -f /mnt/server/archivedir/000100A90065 cp pg_wal/0 Connect to the server (it does not matter which database) as a user with - rights to run pg_start_backup (superuser, or a user who has been granted + rights to run pg_backup_start (superuser, or a user who has been granted EXECUTE on the function) and issue the command: -SELECT pg_start_backup('label', false, false); +SELECT pg_backup_start(label => 'label', fast => false); where label is any string you want to use to uniquely identify this backup operation. The connection - calling pg_start_backup must be maintained until the end of + calling pg_backup_start must be maintained until the end of the backup, or the backup will be automatically aborted. - By default, pg_start_backup can take a long time to finish. - This is because it performs a checkpoint, and the I/O - required for the checkpoint will be spread out over a significant - period of time, by default half your inter-checkpoint interval - (see the configuration parameter + By default, pg_backup_start can take a long time to finish. + This is because it waits for the next checkpoint to complete, and the I/O + required for the checkpoint might be spread out over a significant + period of time (see the configuration parameters + and ). This is usually what you want, because it minimizes the impact on query processing. If you want to start the backup as soon as @@ -905,10 +897,6 @@ SELECT pg_start_backup('label', false, false); issue an immediate checkpoint using as much I/O as available. - - The third parameter being false tells - pg_start_backup to initiate a non-exclusive base backup. 
- @@ -926,7 +914,7 @@ SELECT pg_start_backup('label', false, false); In the same connection as before, issue the command: -SELECT * FROM pg_stop_backup(false, true); +SELECT * FROM pg_backup_stop(wait_for_archive => true); This terminates backup mode. On a primary, it also performs an automatic switch to the next WAL segment. On a standby, it is not possible to @@ -937,7 +925,7 @@ SELECT * FROM pg_stop_backup(false, true); ready to archive. - The pg_stop_backup will return one row with three + pg_backup_stop will return
Re: wrong fds used for relfilenodes after pg_upgrade relfilenode changes
On Fri, Apr 1, 2022 at 5:03 PM Thomas Munro wrote: > Another idea would be to call a new function DropPendingWritebacks(), > and also tell all the SMgrRelation objects to close all their internal > state (ie the fds + per-segment objects) but not free the main > SMgrRelationData object, and for good measure also invalidate the > small amount of cached state (which hasn't been mentioned in this > thread, but it kinda bothers me that that state currently survives, so > it was one unspoken reason I liked the smgrcloseall() idea). Since > DropPendingWritebacks() is in a rather different module, perhaps if we > were to do that we'd want to rename PROCSIGNAL_BARRIER_SMGRRELEASE to > something else, because it's not really a SMGR-only thing anymore. I'm not sure that it really matters, but with the idea that I proposed it's possible to "save" a pending writeback, if we notice that we're accessing the relation with a proper lock after the barrier arrives and before we actually try to do the writeback. With this approach we throw them out immediately, so they're just gone. I don't mind that because I don't think this will happen often enough to matter, and I don't think it will be very expensive when it does, but it's something to think about. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Run pg_amcheck in 002_pg_upgrade.pl and 027_stream_regress.pl?
On Sun, Apr 3, 2022 at 10:10 PM Peter Geoghegan wrote: > I meant to tell the authors of verify_heapam() (also CC'd) that it > really helped with my recent VACUUM project. While the assertions that > I wrote in vacuumlazy.c might catch certain bugs like this, > verify_heapam() is much more effective in practice. Yeah, I was very excited about verify_heapam(). There is a lot more stuff that we could check, but a lot of those things would be much more expensive to check. It does a good job, I think, checking all the things that a human being could potentially spot just by looking at an individual page. I love the idea of using it in regression testing in more places. It might find bugs in amcheck, which would be good, but I think it's even more likely to help us find bugs in other code. -- Robert Haas EDB: http://www.enterprisedb.com