Re: Release SPI plans for referential integrity with DISCARD ALL
Hi Corey,

Thank you for sharing.

> Amit's patch is now available in this thread [1]. I'm curious if it has any
> effect on your memory pressure issue.

I just found that thread. I'll check the patch.

--
Best regards,
Yuzuko Hosoya
NTT Open Source Software Center
Re: simplifying foreign key/RI checks
On Tue, Jan 19, 2021 at 12:00 PM Zhihong Yu wrote:
> +            if (mapped_partkey_attnums[i] == pk_attnums[j])
> +            {
> +                partkey_vals[i] = pk_vals[j];
> +                partkey_isnull[i] = pk_nulls[j] == 'n' ? true : false;
> +                i++;
> +                break;
>
> The way counter (i) is incremented is out of my expectation.
> In the rare case, where some i doesn't have corresponding pk_attnums[j],
> wouldn't there be a dead loop ?
>
> I think the goal of adding the assertion should be not to loop infinitely
> even if the invariant is not satisfied.
>
> I guess a counter other than i would be better for this purpose.

I have done that in v3.  Thanks.

--
Amit Langote
EDB: http://www.enterprisedb.com
Re: simplifying foreign key/RI checks
On Tue, Jan 19, 2021 at 3:46 PM Corey Huinker wrote:
> v2 patch applies and passes make check and make check-world. Perhaps, given
> the missing break at line 418 without any tests failing, we could add
> another regression test if we're into 100% code path coverage. As it is, I
> think the compiler warning was a sufficient alert.

Thanks for the review.  I will look into checking the coverage.

> The code is easy to read, and the comments touch on the major points of
> what complexities arise from partitioned tables.
>
> A somewhat pedantic complaint I have brought up off-list is that this patch
> continues the pattern of the variable and function names making the
> assumption that the foreign key is referencing the primary key of the
> referenced table. Foreign key constraints need only reference a unique
> index; it doesn't have to be the primary key. Granted, that unique index is
> behaving exactly as a primary key would, so conceptually it is very
> similar, but keeping with the existing naming (pk_rel, pk_type, etc.) can
> lead a developer to think that it would be just as correct to find the
> referenced relation and get the primary key index from there, which would
> not always be correct. This patch correctly grabs the index from the
> constraint itself, so no problem there.

I decided not to deviate from the pk_ terminology so that the new code
doesn't look too different from the other code in the file.  Although, I
guess we can at least call the main function ri_ReferencedKeyExists() instead
of ri_PrimaryKeyExists(), so I've changed that.

> I like that this patch changes the absolute minimum of the code in order to
> get a very significant performance benefit. It does so in a way that should
> reduce resource pressure found in other places [1]. This will in turn
> reduce the performance penalty of "doing the right thing" in terms of
> defining enforced foreign keys.  It seems to get a clearer performance
> boost than was achieved with previous efforts at statement level triggers.
>
> [1] https://www.postgresql.org/message-id/cakkq508z6r5e3jdqhfpwszsajlpho3oyyoamfesaupto5vg...@mail.gmail.com

Thanks.  I hadn't noticed [1] before today, but after looking it over, it
seems that what is being proposed there can still be of use.  As long as SPI
is used in ri_triggers.c, it makes sense to consider any tweaks addressing
its negative impact, especially if they are not very invasive.  There's this
patch too from last month: https://commitfest.postgresql.org/32/2930/

> This patch completely sidesteps the DELETE case, which has more insidious
> performance implications, but is also far less common, and whose solution
> will likely be very different.

Yeah, we should continue looking into ways to make referenced-side RI checks
less bloated.

I've attached the updated patch.

--
Amit Langote
EDB: http://www.enterprisedb.com

Attachments:
v3-0002-Avoid-using-SPI-for-some-RI-checks.patch
v3-0001-Export-get_partition_for_tuple.patch
Re: search_plan_tree(): handling of non-leaf CustomScanState nodes causes segfault
Hi,

On 18.01.21 23:42, Tom Lane wrote:
> David Geier writes:
>> On 18.01.21 19:46, Tom Lane wrote:
>>> Hm.  I agree that we shouldn't simply assume that ss_currentRelation
>>> isn't null.  However, we cannot make search_plan_tree() descend through
>>> non-leaf CustomScan nodes, because we don't know what processing is
>>> involved there.  We need to find a scan that is guaranteed to return
>>> rows that are one-to-one with the cursor output.  This is why the
>>> function doesn't descend through join or aggregation nodes, and I see
>>> no argument by which we should assume we know more about what a
>>> customscan node will do than we know about those.
>>
>> That makes sense. Thanks for the explanation.
>
> OK, cool.  I was afraid you'd argue that you really needed your CustomScan
> node to be transparent in such cases.  We could imagine inventing an
> additional custom-scan-provider callback to embed the necessary knowledge,
> but I'd rather not add the complexity until someone has a use-case.

I was thinking about that. Generally, having such a possibility would be very
useful. Unfortunately, I believe that in my specific case even having such
functionality wouldn't allow the query to work with CURRENT OF, because my
CSP behaves similarly to a materialize node. My understanding is that the
only supported plans are those consisting of nodes which guarantee that the
tuple returned by the plan is the unmodified tuple a scan leaf node is
currently positioned on?

Still, if there's interest I would be happy to draft a patch. Instead of a
separate CSP callback, we could also provide an additional flag like
CUSTOMPATH_SUPPORT_CURRENT_OF. The advantage of the callback would be that we
could delay the decision until execution time, where potentially more
information is available.

I updated the patch to match your proposal.

WFM, will push in a bit.

regards, tom lane

Best regards,
David
Swarm64
RE: Parallel INSERT (INTO ... SELECT ...)
From: Amit Kapila
> Tsunakawa-San, does this address your concerns around locking the
> target relation in the required cases? It would be good to test but I
> don't see any problems in the scenarios you mentioned.

Thank you, understood.  RevalidateCachedQuery() does parse analysis; that's
the trick.

Regards
Takayuki Tsunakawa
Re: Release SPI plans for referential integrity with DISCARD ALL
On Wed, Jan 13, 2021 at 1:03 PM Corey Huinker wrote:
>> In addition to that, a following case would be solved with this approach:
>> When many processes are referencing many tables defined foreign key
>> constraints thoroughly, a huge amount of memory will be consumed
>> regardless of whether referenced tables are partitioned or not.
>>
>> Attached the patch. Any thoughts?
>
> Amit Langote has done some great work at eliminating SPI from
> INSERT/UPDATE triggers entirely, thus reducing the number of cached plans
> considerably.
>
> I think he was hoping to have a patch formalized this week, if time
> allowed.
>
> It doesn't have DELETE triggers in it, so this patch might still have good
> value for deletes on a commonly used enumeration table.
>
> However, our efforts might be better focused on eliminating SPI from
> delete triggers as well, an admittedly harder task.

Amit's patch is now available in this thread [1]. I'm curious if it has any
effect on your memory pressure issue.

[1] https://www.postgresql.org/message-id/ca+hiwqgkfjfydeq5vhph6eqpkjsbfpddy+j-kxyfepqedts...@mail.gmail.com
Re: ResourceOwner refactoring
On Mon, Jan 18, 2021 at 02:22:33PM +0200, Heikki Linnakangas wrote:
> diff --git a/src/common/cryptohash_openssl.c b/src/common/cryptohash_openssl.c
> index 551ec392b60..642e72d8c0f 100644
> --- a/src/common/cryptohash_openssl.c
> +++ b/src/common/cryptohash_openssl.c
[...]
> +/* ResourceOwner callbacks to hold JitContexts */

Slight copy-paste error here.

>     /*
>      * Ensure, while the spinlock's not yet held, that there's a free
> -     * refcount entry.
> +     * refcount entry and that the resoure owner has room to remember the
> +     * pin.

s/resoure/resource/.
--
Michael
Re: simplifying foreign key/RI checks
On Mon, Jan 18, 2021 at 9:45 PM Amit Langote wrote:
> On Tue, Jan 19, 2021 at 2:47 AM Zhihong Yu wrote:
> >
> > Hi,
> > I was looking at this statement:
> >
> > insert into f select generate_series(1, 2000000, 2);
> >
> > Since certain generated values (the second half) are not in table p,
> > wouldn't insertion for those values fail ?
> > I tried a scaled down version (1000th) of your example:
> >
> > yugabyte=# insert into f select generate_series(1, 2000, 2);
> > ERROR:  insert or update on table "f" violates foreign key constraint "f_a_fkey"
> > DETAIL:  Key (a)=(1001) is not present in table "p".
>
> Sorry, a wrong copy-paste by me.  Try this:
>
> create table p (a numeric primary key);
> insert into p select generate_series(1, 2000000);
> create table f (a bigint references p);
>
> -- Unpatched
> insert into f select generate_series(1, 2000000, 2);
> INSERT 0 1000000
> Time: 6527.652 ms (00:06.528)
>
> update f set a = a + 1;
> UPDATE 1000000
> Time: 8108.310 ms (00:08.108)
>
> -- Patched:
> insert into f select generate_series(1, 2000000, 2);
> INSERT 0 1000000
> Time: 3312.193 ms (00:03.312)
>
> update f set a = a + 1;
> UPDATE 1000000
> Time: 4292.807 ms (00:04.293)
>
> > For v1-0002-Avoid-using-SPI-for-some-RI-checks.patch :
> >
> > +        * Collect partition key values from the unique key.
> >
> > At the end of the nested loop, should there be an assertion that
> > partkey->partnatts partition key values have been found ?
> > This can be done by using a counter (initialized to 0) which is
> > incremented when a match is found by the inner loop.
>
> I've updated the patch to add the Assert.  Thanks for taking a look.
>
> --
> Amit Langote
> EDB: http://www.enterprisedb.com

v2 patch applies and passes make check and make check-world. Perhaps, given
the missing break at line 418 without any tests failing, we could add another
regression test if we're into 100% code path coverage. As it is, I think the
compiler warning was a sufficient alert.
The code is easy to read, and the comments touch on the major points of what
complexities arise from partitioned tables.

A somewhat pedantic complaint I have brought up off-list is that this patch
continues the pattern of the variable and function names making the
assumption that the foreign key is referencing the primary key of the
referenced table. Foreign key constraints need only reference a unique index;
it doesn't have to be the primary key. Granted, that unique index is behaving
exactly as a primary key would, so conceptually it is very similar, but
keeping with the existing naming (pk_rel, pk_type, etc.) can lead a developer
to think that it would be just as correct to find the referenced relation and
get the primary key index from there, which would not always be correct. This
patch correctly grabs the index from the constraint itself, so no problem
there.

I like that this patch changes the absolute minimum of the code in order to
get a very significant performance benefit. It does so in a way that should
reduce resource pressure found in other places [1]. This will in turn reduce
the performance penalty of "doing the right thing" in terms of defining
enforced foreign keys. It seems to get a clearer performance boost than was
achieved with previous efforts at statement level triggers.

This patch completely sidesteps the DELETE case, which has more insidious
performance implications, but is also far less common, and whose solution
will likely be very different.

[1] https://www.postgresql.org/message-id/cakkq508z6r5e3jdqhfpwszsajlpho3oyyoamfesaupto5vg...@mail.gmail.com
Re: [PATCH] ProcessInterrupts_hook
On Tue, 19 Jan 2021 at 12:44, Craig Ringer wrote:
>
>>> We're about halfway there already, see 7e784d1dc.  I didn't do the
>>> other half because it wasn't necessary to the problem, but exposing
>>> the shutdown state more fully seems reasonable.
>
> Excellent, I'll take a look. Thanks.

That looks very handy already. Extending it to be set before SIGTERM too
would be handy.

My suggestion, which I'm happy to post in patch form if you think it's
reasonable:

* Change QuitSignalReason to ExitSignalReason to cover both SIGTERM (fast)
  and SIGQUIT (immediate)
* Rename PMQUIT_FOR_STOP to PMEXIT_IMMEDIATE_SHUTDOWN
* Add enumeration values PMEXIT_SMART_SHUTDOWN and PMEXIT_FAST_SHUTDOWN
* For a fast shutdown, in pmdie()'s SIGINT case call
  SetExitSignalReason(PMEXIT_FAST_SHUTDOWN), so that when
  PostmasterStateMachine() calls SignalSomeChildren(SIGTERM, ...) in
  response to PM_STOP_BACKENDS, the reason is already available.
* For smart shutdown, in pmdie()'s SIGTERM case call
  SetExitSignalReason(PMEXIT_SMART_SHUTDOWN) and set the latch of every live
  backend. There isn't any appropriate PROCSIG unless we want to overload
  PROCSIG_WALSND_INIT_STOPPING (ick), but I think it'd generally be
  sufficient to check GetExitSignalReason() in backend main loops.

The fast shutdown case seems like a no-brainer extension of your existing
patch. I'm not entirely sure about the smart shutdown case. I don't want to
add a ProcSignal slot just for this, and the info isn't urgent anyway.

I think that by checking for postmaster shutdown in the backend main loop
we'd be able to support eager termination of idle backends on smart shutdown
(immediately, or after an idle grace period), which is something I've wanted
for quite some time. It shouldn't be significantly expensive, especially in
the idle loop.

Thoughts?

(Also I want a hook in PostgresMain's idle loop for things like this.)
Re: Is Recovery actually paused?
On Tue, Jan 19, 2021 at 10:30 AM Kyotaro Horiguchi wrote:
>
> At Tue, 19 Jan 2021 11:41:18 +0900, Yugo NAGATA wrote in
> > On Sun, 17 Jan 2021 11:33:52 +0530
> > Dilip Kumar wrote:
> > > > > > > I think the current pg_is_wal_replay_paused() already has
> > > > > > > another purpose; this waits for recovery to actually get
> > > > > > > paused. If we want to limit this API's purpose only to return
> > > > > > > the pause state, it seems better to fix this to return the
> > > > > > > actual state at the cost of lacking backward compatibility. If
> > > > > > > we want to know whether pause is requested, we may add a new
> > > > > > > API like pg_is_wal_replay_pause_requested(). Also, if we want
> > > > > > > to wait for recovery to actually get paused, we may add an
> > > > > > > option to pg_wal_replay_pause() for this purpose.
> > > > > > >
> > > > > > > However, this might be bikeshedding. If anyone doesn't care
> > > > > > > that pg_is_wal_replay_paused() can make the user wait for a
> > > > > > > long time, I don't care either.
> > > > > >
> > > > > > I don't think that it will be blocked ever, because
> > > > > > pg_wal_replay_pause is sending the WakeupRecovery(), which means
> > > > > > the recovery process will not be stuck on waiting for the WAL.
> > > >
> > > > Yes, there is no stuck on waiting for the WAL. However, it can be
> > > > stuck during resolving a recovery conflict. The process could wait
> > > > for max_standby_streaming_delay or max_standby_archive_delay at most
> > > > before recovery gets completely paused.
> > >
> > > Okay, I agree that it is possible, so for handling this we have a
> > > couple of options
> > > 1. pg_is_wal_replay_paused(), interface will wait for recovery to
> > > actually get paused, but the user has an option to cancel that.  So I
> > > agree that there is currently no option to just know that recovery
> > > pause is requested without waiting for it to actually get paused if it
> > > is requested.  So one option is we can provide another interface as
> > > you mentioned, pg_is_wal_replay_pause_requested(), which can just
> > > return the request status.  I am not sure how useful it is.
> >
> > If it is acceptable that pg_is_wal_replay_paused() makes users wait,
> > I'm ok with the current interface. I don't feel the need for
> > pg_is_wal_replay_pause_requested().
>
> FWIW, the name "pg_is_wal_replay_paused" is suggesting "to know
> whether recovery is paused or not at present" and it would be
> surprising to see it wait for the recovery to actually pause by
> default.
>
> I think there are no functions that wait for some situation, at least
> for now. If we wanted to wait for some condition to be met, we would
> loop over check-and-wait using plpgsql.
>
> If you desire to wait for replication to pause by a function, I would
> do that by adding a parameter to the function.
>
> pg_is_wal_replay_paused(OPTIONAL bool wait_for_pause)

This seems to be a fair point to me. So I will add an option to the API, and
if that is passed true then we will wait for recovery to get paused;
otherwise, this will just return true if the pause is requested, same as the
current behavior.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
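For reference, Horiguchi-san's "loop over check-and-wait using plpgsql" pattern is something a user could already write today, roughly as follows. This is an illustrative sketch only: the 0.1 s polling interval is an arbitrary choice, and note that without the proposed changes the function reflects the pause *request*, not necessarily that recovery has actually stopped applying WAL.

```sql
-- Hypothetical check-and-wait loop on a standby (sketch, not tested docs):
DO $$
BEGIN
  LOOP
    EXIT WHEN pg_is_wal_replay_paused();
    PERFORM pg_sleep(0.1);   -- arbitrary polling interval
  END LOOP;
END
$$;
```

The proposed `wait_for_pause` parameter would fold this loop into the server, where it can wait on the recovery state directly instead of polling.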
Re: Additional Chapter for Tutorial - arch-dev.sgml
On 18.01.21 15:13, Heikki Linnakangas wrote:
> On 20/11/2020 23:52, Erik Rijkers wrote:
>> (smallish) Changes to arch-dev.sgml
>
> This looks good to me. One little complaint:
>
> @@ -125,7 +122,7 @@
>      use a supervisor process (also master process) that spawns a new
>      server process every time a connection is requested. This supervisor
> -    process is called postgres and listens at a
> +    process is called postgres (formerly 'postmaster') and listens at a
>      specified TCP/IP port for incoming connections. Whenever a request
>      for a connection is detected the postgres process
>      spawns a new server process. The server tasks
>
> I believe we still call it the postmaster process. We renamed the binary a
> long time ago (commit 5266f221a2), and the above text was changed as part
> of that commit. I think that was a mistake, and this should say simply:
>
>     ... This supervisor process is called postmaster and ...
>
> like it did before we renamed the binary.
>
> Barring objections, I'll commit this with that change (as attached).
>
> - Heikki

I fear that the patch 'Additional chapter for Tutorial' grows beyond
manageable limits. It has been running for nearly one year, its size of
228 kB is very large, and many people have made significant contributions.
But a commit seems to be in the far distance. Having said that, I'm pleased
with Heikki's proposal to split changes in the existing file 'arch-dev.sgml'
from the rest of the patch and commit them separately.

But I have some concerns with the chapter '51.2. How Connections Are
Established'. It uses central terms like 'client process', 'server process',
'supervisor process', 'master process', 'server tasks', 'backend (server)',
'frontend (client)', 'server', and 'client'. Some months ago, we cleaned up
this terminology in the new chapter 'glossary'. As long as it leads to
readable text, we should use the glossary terms instead of the current ones.
And we should include some links to the glossary.

I propose to start a new thread which contains only changes to
'arch-dev.sgml'. In pgsql-hackers or in the pgsql-docs list? Initiated by
Heikki or by me?

--
Jürgen Purtz
RE: Global snapshots
Hello, Andrey-san, all,

Based on the request at HighGo's sharding meeting, I'm re-sending the
information on Commitment Ordering that could be used for global visibility.
The patents have already expired.

--
Has anyone examined the following Multiversion Commitment Ordering (MVCO)?
Although I haven't understood it yet, it insists that no concurrency control
information, including timestamps, needs to be exchanged among the cluster
nodes. I'd appreciate it if someone could give an opinion.

Commitment Ordering Based Distributed Concurrency Control for Bridging Single
and Multi Version Resources. Proceedings of the Third IEEE International
Workshop on Research Issues on Data Engineering: Interoperability in
Multidatabase Systems (RIDE-IMS), Vienna, Austria, pp. 189-198, April 1993.
(also DEC-TR 853, July 1992)
https://ieeexplore.ieee.org/document/281924?arnumber=281924

The author of the above paper, Yoav Raz, seems to have had a strong passion,
at least until 2011, for making people believe in the mightiness of
Commitment Ordering (CO) for global serializability. However, he complains
(sadly) that almost all researchers ignore his theory, as written on his
following site and the Wikipedia page for Commitment Ordering. Does anyone
know why CO is ignored?

--
* Or, maybe we can use the following Commitment Ordering, which doesn't
require a timestamp or any other information to be transferred among the
cluster nodes. However, this seems to have to track the order of read and
write operations among concurrent transactions to ensure the correct commit
order, so I'm not sure about the performance. The MVCO paper seems to present
the information we need, but I haven't understood it well yet (it's
difficult.) Could anybody kindly interpret this?

Commitment ordering (CO) - yoavraz2
https://sites.google.com/site/yoavraz2/the_principle_of_co

--
Could you please try interpreting MVCO and see if we have any hope in this?
This doesn't fit in my small brain. I'll catch up with understanding it when
I have time.

MVCO - Technical report - IEEE RIDE-IMS 93 (PDF; revised version of DEC-TR 853)
https://sites.google.com/site/yoavraz2/MVCO-WDE.pdf

MVCO is a multiversion member of the Commitment Ordering algorithms described
below:

Commitment ordering (CO) - yoavraz2
https://sites.google.com/site/yoavraz2/the_principle_of_co

Commitment ordering - Wikipedia
https://en.wikipedia.org/wiki/Commitment_ordering

Related patents are as follows. The last one is MVCO.

US5504900A - Commitment ordering for guaranteeing serializability across
distributed transactions
https://patents.google.com/patent/US5504900A/en?oq=US5504900

US5504899A - Guaranteeing global serializability by applying commitment
ordering selectively to global transactions
https://patents.google.com/patent/US5504899A/en?oq=US5504899

US5701480A - Distributed multi-version commitment ordering protocols for
guaranteeing serializability during transaction processing
https://patents.google.com/patent/US5701480A/en?oq=US5701480

Regards
Takayuki Tsunakawa
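Since the principle is hard to extract from the papers, here is a toy sketch of the basic single-version CO enforcement rule, written as my own illustration (it is not taken from Raz's papers, is not MVCO, and has nothing to do with PostgreSQL code): the scheduler tracks only the local order of conflicting operations, and when a transaction commits, any still-undecided transaction whose conflicting operation came earlier is aborted, so the global commit order automatically agrees with the conflict order without exchanging timestamps. It also shows concretely the per-operation conflict tracking whose overhead is questioned above.

```python
class COScheduler:
    """Toy single-version commitment-ordering (CO) enforcer (illustrative)."""

    def __init__(self):
        self.history = {}     # item -> list of (txn, 'r'|'w') in arrival order
        self.preceders = {}   # txn -> txns whose conflicting op came earlier
        self.active = set()
        self.log = []         # decision log: ('commit'|'abort', txn)

    def begin(self, t):
        self.active.add(t)
        self.preceders[t] = set()

    def access(self, t, item, op):
        # Two operations on the same item conflict if at least one writes.
        for t2, op2 in self.history.get(item, []):
            if t2 != t and 'w' in (op, op2):
                self.preceders[t].add(t2)
        self.history.setdefault(item, []).append((t, op))

    def commit(self, t):
        # CO rule: a preceder that has not yet committed would otherwise
        # commit *after* t, violating the conflict order -- abort it.
        for t2 in sorted(self.preceders[t]):
            if t2 in self.active:
                self.active.discard(t2)
                self.log.append(('abort', t2))
        self.active.discard(t)
        self.log.append(('commit', t))


# T1's write of x conflicts with (and precedes) T2's read of x,
# but T2 asks to commit first -- so the enforcer aborts T1.
s = COScheduler()
s.begin('T1')
s.begin('T2')
s.access('T1', 'x', 'w')
s.access('T2', 'x', 'r')
s.commit('T2')
print(s.log)   # -> [('abort', 'T1'), ('commit', 'T2')]
```

In a distributed setting each node runs this rule locally against the atomic-commit votes, which is the sense in which CO needs no extra concurrency-control information on the wire; the cost, as noted above, is tracking every read-write conflict.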
Re: simplifying foreign key/RI checks
On Tue, Jan 19, 2021 at 2:56 PM Corey Huinker wrote:
>> In file included from /home/japin/Codes/postgresql/Debug/../src/include/postgres.h:47:0,
>>                  from /home/japin/Codes/postgresql/Debug/../src/backend/utils/adt/ri_triggers.c:24:
>> /home/japin/Codes/postgresql/Debug/../src/backend/utils/adt/ri_triggers.c: In function ‘ri_PrimaryKeyExists’:
>> /home/japin/Codes/postgresql/Debug/../src/include/utils/elog.h:134:5: warning: this statement may fall through [-Wimplicit-fallthrough=]
>>    do { \
>>    ^
>> /home/japin/Codes/postgresql/Debug/../src/include/utils/elog.h:156:2: note: in expansion of macro ‘ereport_domain’
>>   ereport_domain(elevel, TEXTDOMAIN, __VA_ARGS__)
>>   ^~
>> /home/japin/Codes/postgresql/Debug/../src/include/utils/elog.h:229:2: note: in expansion of macro ‘ereport’
>>   ereport(elevel, errmsg_internal(__VA_ARGS__))
>>   ^~~
>> /home/japin/Codes/postgresql/Debug/../src/backend/utils/adt/ri_triggers.c:417:5: note: in expansion of macro ‘elog’
>>      elog(ERROR, "unexpected table_tuple_lock status: %u", res);
>>      ^~~~
>> /home/japin/Codes/postgresql/Debug/../src/backend/utils/adt/ri_triggers.c:419:4: note: here
>>     default:
>>     ^~~

Thanks, will fix it.

--
Amit Langote
EDB: http://www.enterprisedb.com
Re: POC: postgres_fdw insert batching
On Tue, Jan 19, 2021 at 2:06 PM tsunakawa.ta...@fujitsu.com wrote:
> From: Amit Langote
> > I apologize in advance for being maybe overly pedantic, but I noticed
> > that, in ExecInitModifyTable(), you decided to place the call outside
> > the loop that goes over resultRelations (shown below), although my
> > intent was to ask to place it next to the BeginForeignModify() in that
> > loop.
>
> Actually, I tried to do it (adding the GetModifyBatchSize() call after
> BeginForeignModify()), but it failed.  Because
> postgresfdwGetModifyBatchSize() wants to know if RETURNING is specified,
> and ResultRelInfo->projectReturning is created after the above part.
> Considering the context where GetModifyBatchSize() implementations may want
> to know the environment, I placed the call as late as possible in the
> initialization phase.  As for the future(?) multi-target DML statements, I
> think we can change this together with other many(?) parts that assume a
> single target table.

Okay, sometime later then.

I wasn't sure if bringing it up here would be appropriate, but there's a
patch by me to refactor ModifyTable result relation allocation that will have
to remember to move this code along to an appropriate place [1].

Thanks for the tip about the dependency on how RETURNING is handled. I will
remember it when rebasing my patch over this.

--
Amit Langote
EDB: http://www.enterprisedb.com

[1] https://commitfest.postgresql.org/31/2621/
How to expose session vs txn lock info in pg_locks view?
Presently there doesn't seem to be a way to tell whether a lock is
session-level or transaction-level in the pg_locks view.

I was expecting this to be a quick patch, but the comment on the definition
of PROCLOCKTAG in lock.h notes that shmem state for heavyweight locks does
not track whether the lock is session-level or txn-level. That explains why
it's not already exposed in pg_locks.

AFAICS it'd be necessary to expand PROCLOCK to expose this in shmem. Probably
by adding a small bitfield where bit 0 is set if there's a txn-level lock and
bit 1 is set if there's a session-level lock. But I'm not convinced that
expanding PROCLOCK is justifiable for this. sizeof(PROCLOCK) is 64 on a
typical x64 machine. Adding anything to it increases it to 72 bytes:

(gdb) ptype /o struct PROCLOCK
/* offset    |  size */  type = struct PROCLOCK {
/*    0      |    16 */    PROCLOCKTAG tag;
/*   16      |     8 */    PGPROC *groupLeader;
/*   24      |     4 */    LOCKMASK holdMask;
/*   28      |     4 */    LOCKMASK releaseMask;
/*   32      |    16 */    SHM_QUEUE lockLink;
/*   48      |    16 */    SHM_QUEUE procLink;
/*   64      |     1 */    unsigned char locktypes;
/* XXX 7-byte padding  */
/* total size (bytes):   72 */
}

Going over 64 sets off possible alarm bells about cache line sizing to me,
but maybe it's not that critical? It'd also require
(8 * max_locks_per_xact * (MaxBackends + max_prepared_xacts)) extra shmem
space; that could land up being 128k on a default setup and a couple of
megabytes on a big system. Not huge, but not insignificant if it's hot data.

It's frustrating to be unable to tell the difference between session-level
and txn-level locks in diagnostic output. And the deadlock detector has no
way to tell the difference when selecting a victim for a deadlock abort - it'd
probably make sense to prefer to send a deadlock abort to txn-only lockers.

But I'm not sure I see a sensible way to add the info - PROCLOCK is already
free of any padding, and I wouldn't want to use hacks like pointer-tagging.

Thoughts anyone?
Re: Parallel INSERT (INTO ... SELECT ...)
On Tue, Jan 19, 2021 at 10:32 AM Greg Nancarrow wrote:
>
> On Tue, Jan 19, 2021 at 1:37 PM tsunakawa.ta...@fujitsu.com wrote:
> >
> > > The table has ALREADY been locked (by the caller) during the
> > > parse/analyze phase.
> >
> > Isn't there any case where planning is done but parse analysis is not
> > done immediately before?  e.g.
> >
> > * Alteration of objects invalidates cached query plans, and the next
> > execution of the plan rebuilds it.  (But it seems that parse analysis
> > is done in this case in plancache.c.)
> >
> > * Execute a prepared statement with a different parameter value, which
> > builds a new custom plan or a generic plan.
>
> I don't know, but since NoLock is used in other parts of the planner,
> I'd expect those to fail if such cases existed.

I think I know how, for both the above cases, we ensure that the locks are
acquired before we reach the planner. It seems we will call GetCachedPlan
during these scenarios, which will acquire the required locks in
RevalidateCachedQuery, both when the cached plan is invalid and when it is
valid. So, we should be fine even when the custom/generic plan needs to be
formed due to a different parameter.

> > Is the cached query plan invalidated when some alteration is done to
> > change the parallel safety, such as adding a trigger with a
> > parallel-unsafe trigger action?
>
> Needs to be tested,

Yeah, it would be good to test it, but I think even if the plan is
invalidated, we will reacquire the required locks as mentioned above.

Tsunakawa-San, does this address your concerns around locking the target
relation in the required cases? It would be good to test but I don't see any
problems in the scenarios you mentioned.

--
With Regards,
Amit Kapila.
Re: [PATCH 1/1] Initial mach based shared memory support.
James Hilliard writes:
> from my understanding due to mach semantics a number of the sanity checks
> we are doing for sysv shmem are probably unnecessary when using mach
> API's due to the mach port task bindings (mach_task_self()) effectively
> ensuring our maps are already task bound and won't conflict with other
> tasks.

Really?  If so, this whole patch is pretty much dead in the water, because
the fact that sysv shmem is globally accessible is exactly why we use it
(well, that and the fact that you can find out how many processes are
attached to it).

It's been a long time since we cared about sysv shmem as a source of shared
storage.  What we really use it for is as a form of mutual exclusion, i.e.
being able to detect whether any live children remain from a dead postmaster.
That is, PGSharedMemoryIsInUse is not some minor ancillary check, it's the
main reason this module exists at all.  If POSIX-style mmap'd shmem had the
same capability we'd likely have dropped sysv altogether long ago.

I've occasionally wondered if we should take another look at file locking as
a mechanism for ensuring only one postmaster+children process group can
access a data directory.  Back in the day it was too untrustworthy, but
maybe that has changed.

regards, tom lane
Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit
On 2021/01/19 9:53, Hou, Zhijie wrote:
>>>> +1 to add it after "dropped (Note )", how about as follows with slight
>>>> changes?
>>>>
>>>>     dropped (Note that server name of an invalid connection can be NULL
>>>>     if the server is dropped), and then such .
>>>
>>> Yes, I like this one.
>>
>> One question is; is "should" or "is" better than "can" in this case,
>> because the server name of an invalid connection is always NULL when its
>> server is dropped?  I think
>>
>>     "dropped (Note that server name of an invalid connection will be NULL
>>     if the server is dropped), and then such ."
>
> Sounds good to me. So patch attached.
>
> +1

Thanks! I pushed the patch.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
Re: CheckpointLock needed in CreateCheckPoint()?
On Mon, Jan 18, 2021 at 02:36:48PM -0500, Robert Haas wrote:
> Here's a patch to remove CheckpointLock completely.  For
> ProcessInterrupts() to do anything, one of the following things would
> have to be true:
>
> [...]
>
> So I don't see any problem with just ripping this out entirely, but
> I'd like to know if anybody else does.

Agreed, +1.
--
Michael
Re: simplifying foreign key/RI checks
> In file included from /home/japin/Codes/postgresql/Debug/../src/include/postgres.h:47:0,
>                  from /home/japin/Codes/postgresql/Debug/../src/backend/utils/adt/ri_triggers.c:24:
> /home/japin/Codes/postgresql/Debug/../src/backend/utils/adt/ri_triggers.c: In function ‘ri_PrimaryKeyExists’:
> /home/japin/Codes/postgresql/Debug/../src/include/utils/elog.h:134:5: warning: this statement may fall through [-Wimplicit-fallthrough=]
>    do { \
>    ^
> /home/japin/Codes/postgresql/Debug/../src/include/utils/elog.h:156:2: note: in expansion of macro ‘ereport_domain’
>   ereport_domain(elevel, TEXTDOMAIN, __VA_ARGS__)
>   ^~
> /home/japin/Codes/postgresql/Debug/../src/include/utils/elog.h:229:2: note: in expansion of macro ‘ereport’
>   ereport(elevel, errmsg_internal(__VA_ARGS__))
>   ^~~
> /home/japin/Codes/postgresql/Debug/../src/backend/utils/adt/ri_triggers.c:417:5: note: in expansion of macro ‘elog’
>      elog(ERROR, "unexpected table_tuple_lock status: %u", res);
>      ^~~~
> /home/japin/Codes/postgresql/Debug/../src/backend/utils/adt/ri_triggers.c:419:4: note: here
>     default:
>     ^~~
>
> --
> Regards,
> Japin Li.
> ChengDu WenWu Information Technology Co.,Ltd.

I also get this warning. Adding a "break;" at line 418 resolves the warning.
Re: [patch] Help information for pg_dump
On Tue, Jan 19, 2021 at 9:07 AM Zhang, Jie wrote:
>
> Hi all
>
> After executing command [pg_dump -?], some help information is as follows.
>
> pg_dump -?
> -
>   -N, --exclude-schema=PATTERN do NOT dump the specified schema(s) ※
>   -T, --exclude-table=PATTERN  do NOT dump the specified table(s) ※
>   -x, --no-privileges          do not dump privileges (grant/revoke)
>   --exclude-table-data=PATTERN do NOT dump data for the specified table(s) ※
>   --no-comments                do not dump comments
>   --no-publications            do not dump publications
>   --no-security-labels         do not dump security label assignments
>   --no-subscriptions           do not dump subscriptions
>   --no-synchronized-snapshots  do not use synchronized snapshots in parallel jobs
>   --no-tablespaces             do not dump tablespace assignments
>   --no-unlogged-table-data     do not dump unlogged table data
>
> I think it would be better to change [do NOT dump] to [do not dump].
>
> Here is a patch.

+1. Looks like SQL keywords are mentioned in capital letters in both pg_dump and pg_dumpall cases, so changing "do NOT" to "do not" seems okay to me. Patch LGTM.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
Re: [PATCH 1/1] Initial mach based shared memory support.
On Mon, Jan 18, 2021 at 5:29 PM Tom Lane wrote:
> James Hilliard writes:
> > OSX implements sysv shmem support via a mach wrapper, however the mach
> > sysv shmem wrapper has some severe restrictions that prevent us from
> > allocating enough memory segments in some cases.
> > ...
> > In order to avoid hitting these limits we can bypass the wrapper layer
> > and just use mach directly.
>
> I wanted to review this, but it's impossible because the kernel calls
> you're using have you've-got-to-be-kidding documentation like this:
>
> https://developer.apple.com/documentation/kernel/1402504-mach_vm_page_info?language=objc

FYI, there's a good chance I'm not using some of these info APIs correctly. From my understanding of mach semantics, a number of the sanity checks we do for sysv shmem are probably unnecessary when using the mach APIs directly: the mach port task bindings (mach_task_self()) effectively ensure our maps are already task-bound and won't conflict with other tasks.

For our use case, the mach vm API appears superior to the shmem API. Besides not being subject to the sysv wrapper limits, it is natively task-bound, whereas normal sysv shared memory maps appear to be system-wide global maps and thus require many key-conflict checks that should be entirely unnecessary with task-bound mach vm maps. For example, PGSharedMemoryIsInUse() should, as I understand it, always return false in the mach vm version, since the memory is already task-scoped.

I'm not 100% sure here, as I'm not very familiar with all the ways Postgres uses shared memory. Is it true that PostgreSQL shared memory is strictly used by child processes fork()'d from the parent postmaster? If so, then as long as we ensure all our map operations are bound to the parent postmaster's mach_task_self() port, we should never conflict with another postmaster process tree's memory maps.
In this first proof of concept I tried to largely copy the existing sysv shmem semantics, as I'm largely unfamiliar with PostgreSQL internals, but I suspect we can safely remove much of the map-key conflict-checking code entirely.

Here are a few additional references I found which may be helpful.

This one is more oriented toward kernel programming, but much of the information appears relevant to the userspace interface as well:
https://developer.apple.com/library/archive/documentation/Darwin/Conceptual/KernelProgramming/vm/vm.html#//apple_ref/doc/uid/TP3905-CH210-TPXREF109

This book also gives an overview of the mach userspace interface; I don't think it's a fully complete API reference, but it covers some of the basic use cases:
https://flylib.com/books/en/3.126.1.89/1/

Apparently the Hurd kernel uses a virtual memory interface based on the mach kernel, and it has more detailed API docs. Note that these tend not to have the "mach_" prefix but appear to function largely the same:
https://www.gnu.org/software/hurd/gnumach-doc/Virtual-Memory-Interface.html

There is additional documentation related to the virtual memory interface from the OSFMK kernel (which, as I understand it, Apple combined with XNU). Like the Hurd docs, these tend not to have the "mach_" prefix:
http://web.mit.edu/darwin/src/modules/xnu/osfmk/man/

> Google finds the source code too, but that's equally bereft of useful
> documentation. So I wonder where you obtained the information necessary
> to write this patch.
>
> I did notice however that mach_shmem.c seems to be 95% copy-and-paste from
> sysv_shmem.c, including a lot of comments that don't feel particularly
> true or relevant here. I wonder whether we need a separate file as
> opposed to a few #ifdef's around the kernel calls.
>
> regards, tom lane
Re: Odd, intermittent failure in contrib/pageinspect
On Mon, Jan 18, 2021 at 05:47:40PM -0500, Tom Lane wrote: > Right, then we don't need any strange theories about autovacuum, > just bad timing luck. whelk does seem pretty slow, so it's not > much of a stretch to imagine that it's more susceptible to this > corner case than faster machines. > > So, do we have any other tests that are invoking a manual vacuum > and assuming it won't skip any pages? By this theory, they'd > all be failures waiting to happen. That looks possible by looking at the code around lazy_scan_heap(), but that's narrow. check_heap.sql and heap_surgery.sql have one VACUUM FREEZE each and it seems to me that we had better be sure that no pages are skipped for their cases? The duplicated query result looks to be an oversight from 58b4cb3 when the thing got rewritten, so it can just go away. Good catch. -- Michael signature.asc Description: PGP signature
Re: Parallel INSERT (INTO ... SELECT ...)
On Tue, Jan 19, 2021 at 9:19 AM Greg Nancarrow wrote: > > On Tue, Jan 19, 2021 at 2:03 PM Amit Kapila wrote: > > > > > > > > > > You have not raised a WARNING for the second case. > > > > > > The same checks in current Postgres code also don't raise a WARNING > > > for that case, so I'm just being consistent with existing Postgres > > > code (which itself isn't consistent for those two cases). > > > > > > > Search for the string "too few entries in indexprs list" and you will > > find a lot of places in code raising ERROR for the same condition. > > > > Yes, but raising an ERROR stops processing (not just logs an error > message). Raising a WARNING logs a warning message and continues > processing. It's a big difference. > So, for the added parallel-safety-checking code, it was suggested by > Vignesh (and agreed by me) that, for these rare and highly unlikely > conditions, it would be best not to just copy the error-handling code > verbatim from other cases in the Postgres code (as I had originally > done) and just stop processing dead with an error, but to instead > return PARALLEL_UNSAFE, so that processing continues as it would for > current non-parallel processing, which would most likely error-out > anyway along the current error-handling checks and paths when those > bad attributes/fields are referenced. > I will add some Asserts() and don't mind adding a WARNING message for > the 2nd case. > If you really feel strongly about this, I can just restore the > original code, which will stop dead with an ERROR in the middle of > parallel-safety checking should one of these rare conditions ever > occur. > I am expecting that either we raise a WARNING and return parallel_unsafe for all such checks (shouldn't reach cases) in the patch or simply raise an ERROR as we do in other parts of the patch. I personally prefer the latter alternative but I am fine with the former one as well. -- With Regards, Amit Kapila.
Re: Parallel INSERT (INTO ... SELECT ...)
On Mon, Jan 18, 2021 at 3:50 PM Greg Nancarrow wrote: > > On Mon, Jan 18, 2021 at 8:10 PM Amit Kapila wrote: > > > > > 1) > > > > > > >Here, it seems we are accessing the relation descriptor without any > > > >lock on the table which is dangerous considering the table can be > > > >modified in a parallel session. Is there a reason why you think this > > > >is safe? Did you check anywhere else such a coding pattern? > > > > > > Yes, there's a very good reason and I certainly have checked for the > > > same coding pattern elsewhere, and not just randomly decided that > > > locking can be ignored. > > > The table has ALREADY been locked (by the caller) during the > > > parse/analyze phase. > > > > > > > Fair enough. I suggest adding a comment saying the same so that the > > reader doesn't get confused about the same. > > > > OK, I'll add a comment. > > > > (This is not the case for a partition, in which case the patch code > > > uses AccessShareLock, as you will see). > > > > Okay, but I see you release the lock on partition rel immediately > > after checking parallel-safety. What if a user added some > > parallel-unsafe constraint (via Alter Table) after that check? > > > > > > > I'm not sure. But there would be a similar concern for current > Parallel SELECT functionality, right? > I don't think so because, for Selects, we take locks on the required partitions and don't release them immediately. We do parallel safety checks after acquiring those locks. From the code perspective, we lock individual partitions via expand_inherited_rtentry->expand_partitioned_rtentry and then check parallel-safety at a later point via set_append_rel_size->set_rel_consider_parallel. Also, I am not sure if there is anything we check for Selects at each partition relation level that can be changed by a concurrent session. Do you have a different understanding? Similarly, we do retain locks on indexes, see get_relation_info, which we are not doing in the patch. 
> My recollection is that ALTER TABLE obtains an exclusive lock on the > table which it retains until the end of the transaction, so that will > result in blocking at certain points, during parallel-checks and > execution, but there may still be a window. > Once the Select has acquired locks in the above code path, I don't think Alter for a particular partition would be able to proceed unless those locks are non-conflicting. -- With Regards, Amit Kapila.
RE: POC: postgres_fdw insert batching
From: Amit Langote
> I apologize in advance for being maybe overly pedantic, but I noticed
> that, in ExecInitModifyTable(), you decided to place the call outside
> the loop that goes over resultRelations (shown below), although my
> intent was to ask to place it next to the BeginForeignModify() in that
> loop.

Actually, I tried to do that (adding the GetModifyBatchSize() call after BeginForeignModify()), but it failed, because postgresfdwGetModifyBatchSize() wants to know whether RETURNING is specified, and ResultRelInfo->projectReturning is created after the above part. Considering that GetModifyBatchSize() implementations may want to know the environment, I placed the call as late as possible in the initialization phase.

As for the future(?) multi-target DML statements, I think we can change this together with the other many(?) parts that assume a single target table.

Regards
Takayuki Tsunakawa
Re: Parallel INSERT (INTO ... SELECT ...)
On Tue, Jan 19, 2021 at 1:37 PM tsunakawa.ta...@fujitsu.com wrote: > > > The table has ALREADY been locked (by the caller) during the > > parse/analyze phase. > > Isn't there any case where planning is done but parse analysis is not done > immediately before? e.g. > > * Alteration of objects invalidates cached query plans, and the next > execution of the plan rebuilds it. (But it seems that parse analysis is done > in this case in plancache.c.) > > * Execute a prepared statement with a different parameter value, which builds > a new custom plan or a generic plan. > I don't know, but since NoLock is used in other parts of the planner, I'd expect those to fail if such cases existed. > Is the cached query plan invalidated when some alteration is done to change > the parallel safety, such as adding a trigger with a parallel-unsafe trigger > action? > Needs to be tested, but I'd expect the cached plan to get invalidated in this case - surely the same potential issue exists in Postgres for the current Parallel SELECT functionality - for example, for a column with a default value that is an expression (which could be altered from being parallel-safe to parallel-unsafe). Regards, Greg Nancarrow Fujitsu Australia
Re: Is Recovery actually paused?
At Tue, 19 Jan 2021 11:41:18 +0900, Yugo NAGATA wrote in
> On Sun, 17 Jan 2021 11:33:52 +0530
> Dilip Kumar wrote:
> > > > > > I think the current pg_is_wal_replay_paused() already has another purpose;
> > > > > > this waits for recovery to actually get paused. If we want to limit this API's
> > > > > > purpose only to return the pause state, it seems better to fix this to return
> > > > > > the actual state at the cost of lacking backward compatibility. If we want
> > > > > > to know whether a pause is requested, we may add a new API like
> > > > > > pg_is_wal_replay_pause_requested(). Also, if we want to wait for recovery to actually
> > > > > > get paused, we may add an option to pg_wal_replay_pause() for this purpose.
> > > > > >
> > > > > > However, this might be bikeshedding. If anyone doesn't care that
> > > > > > pg_is_wal_replay_paused() can make users wait for a long time, I don't care either.
> > > > >
> > > > > I don't think that it will ever be blocked, because
> > > > > pg_wal_replay_pause is sending the WakeupRecovery() which means the
> > > > > recovery process will not be stuck on waiting for the WAL.
> > >
> > > Yes, it does not get stuck waiting for the WAL. However, it can be stuck while resolving
> > > a recovery conflict. The process could wait for max_standby_streaming_delay or
> > > max_standby_archive_delay at most before recovery gets completely paused.
> >
> > Okay, I agree that it is possible, so for handling this we have a
> > couple of options
> > 1. pg_is_wal_replay_paused(), interface will wait for recovery to
> > actually get paused, but the user has an option to cancel that. So I
> > agree that there is currently no option to just know that a recovery
> > pause is requested without waiting for it to actually get paused if it
> > is requested.
> > So one option is we can provide another interface, as
> > you mentioned, pg_is_wal_replay_pause_requested(), which can just
> > return the request status. I am not sure how useful it is.
>
> If it is acceptable that pg_is_wal_replay_paused() makes users wait,
> I'm ok with the current interface. I don't feel the need for
> pg_is_wal_replay_pause_requested().

FWIW, the name "pg_is_wal_replay_paused" suggests "to know whether recovery is paused or not at present", and it would be surprising to see it wait for recovery to actually pause by default. I think there are no functions that wait for some situation, at least for now. If we wanted to wait for some condition to be met, we would loop over check-and-wait using plpgsql. If you desire a function that waits for replication to pause, I would do that by adding a parameter to the function:

pg_is_wal_replay_paused(OPTIONAL bool wait_for_pause)

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
Re: POC: postgres_fdw insert batching
Tsunakawa-san, On Tue, Jan 19, 2021 at 12:50 PM tsunakawa.ta...@fujitsu.com wrote: > From: Tomas Vondra > > OK. Can you prepare a final patch, squashing all the commits into a > > single one, and perhaps use the function in create_foreign_modify? > > Attached, including the message fix pointed by Zaihong-san. Thanks for adopting my suggestions regarding GetForeignModifyBatchSize(). I apologize in advance for being maybe overly pedantic, but I noticed that, in ExecInitModifyTable(), you decided to place the call outside the loop that goes over resultRelations (shown below), although my intent was to ask to place it next to the BeginForeignModify() in that loop. resultRelInfo = mtstate->resultRelInfo; i = 0; forboth(l, node->resultRelations, l1, node->plans) { ... /* Also let FDWs init themselves for foreign-table result rels */ if (!resultRelInfo->ri_usesFdwDirectModify && resultRelInfo->ri_FdwRoutine != NULL && resultRelInfo->ri_FdwRoutine->BeginForeignModify != NULL) { List *fdw_private = (List *) list_nth(node->fdwPrivLists, i); resultRelInfo->ri_FdwRoutine->BeginForeignModify(mtstate, resultRelInfo, fdw_private, i, eflags); } Maybe it's fine today because we only care about inserts and there's always only one entry in the resultRelations list in that case, but that may not remain the case in the future. -- Amit Langote EDB: http://www.enterprisedb.com
Re: Wrong usage of RelationNeedsWAL
At Mon, 18 Jan 2021 17:30:22 +0900 (JST), Kyotaro Horiguchi wrote in
> At Sun, 17 Jan 2021 23:02:18 -0800, Noah Misch wrote in
> > On Sun, Jan 17, 2021 at 10:36:31PM -0800, Noah Misch wrote:
> > > I wrote the above based on the "PageGetLSN(page) > (snapshot)->lsn" check in
> > > TestForOldSnapshot(). If the LSN isn't important, what else explains
> > > RelationAllowsEarlyPruning() checking RelationNeedsWAL()?
> >
> > Thinking about it more, some RelationAllowsEarlyPruning() callers need
> > different treatment. Above, I was writing about the case of deciding whether
> > to do early pruning. The other RelationAllowsEarlyPruning() call sites are
> > deciding whether the relation might be lacking old data. For that case, we
> > should check RELPERSISTENCE_PERMANENT, not RelationNeedsWAL(). Otherwise, we
> > could fail to report an old-snapshot error in a case like this:
> >
A> > setup: CREATE TABLE t AS SELECT ...;
B> > xact1: BEGIN ISOLATION LEVEL REPEATABLE READ; SELECT 1; -- take snapshot
C> > xact2: DELETE FROM t;
D> > (plenty of time passes)
E> > xact3: SELECT count(*) FROM t; -- early pruning
F> > xact1: SAVEPOINT q; SELECT count(*) FROM t; ROLLBACK TO q; -- "snapshot too old"
G> > xact1: ALTER TABLE t SET TABLESPACE something; -- start skipping WAL
H> > xact1: SELECT count(*) FROM t; -- no error, wanted "snapshot too old"
> >
> > Is that plausible?
>
> Thank you for the consideration and yes. But I get "snapshot too old"
> from the last query with the patched version. Maybe I'm missing
> something. I'm going to investigate the case.

Ah. I took that the reverse way (caught by the discussion on page LSN). OK, the "current" code works that way, so we need to perform the check the new way in RelationAllowsEarlyPruning. So, in short, the answer to the last question regarding the reference side is yes.
I understand that you are suggesting that at least TransactionIdLimitedForOldSnapshots should follow not only relation persistence but also RelationNeedsWAL, based on the discussion of the check on the LSN in TestForOldSnapshot().

I still don't think that the LSN in a WAL-skipping-created relfilenode does any harm. ALTER TABLE SET TABLESPACE always makes a dead copy of every block (except checksum), including the page LSN, regardless of wal_level. In the scenario above, the last SELECT at H correctly reads the page LSN set by E and then copied by G, which is larger than the snapshot LSN taken at B, so it doesn't take the fast-return path before the actual check performed by RelationAllowsEarlyPruning.

As a result, RelationAllowsEarlyPruning is still changed to check only the persistence of the table in this version. (In other words, all the callers of the function work based on the same criteria.)

- Removed RelationIsWalLogged, which I forgot to remove in the last version.
- Enclosed the parameter of RelationAllowsEarlyPruning.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

>From d688abbc04459b11bc2801f3c7f1955a86ef7a51 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi
Date: Mon, 18 Jan 2021 14:47:21 +0900
Subject: [PATCH v3] Do not use RelationNeedsWAL to identify relation persistence

RelationNeedsWAL() may return false for a permanent relation when
wal_level=minimal and the relation is created or truncated in the
current transaction. Directly examine relpersistence instead of using
the function to know relation persistence.
---
 src/backend/catalog/pg_publication.c       |  2 +-
 src/backend/optimizer/util/plancat.c       |  3 ++-
 src/include/utils/rel.h                    |  2 +-
 src/include/utils/snapmgr.h                |  2 +-
 src/test/subscription/t/001_rep_changes.pl | 20 +++-
 5 files changed, 24 insertions(+), 5 deletions(-)

diff --git a/src/backend/catalog/pg_publication.c b/src/backend/catalog/pg_publication.c
index 5f8e1c64e1..84d2efcfd2 100644
--- a/src/backend/catalog/pg_publication.c
+++ b/src/backend/catalog/pg_publication.c
@@ -67,7 +67,7 @@ check_publication_add_relation(Relation targetrel)
 				 errdetail("System tables cannot be added to publications.")));
 
 	/* UNLOGGED and TEMP relations cannot be part of publication. */
-	if (!RelationNeedsWAL(targetrel))
+	if (targetrel->rd_rel->relpersistence != RELPERSISTENCE_PERMANENT)
 		ereport(ERROR,
 				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 				 errmsg("table \"%s\" cannot be replicated",
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index da322b453e..177e6e336a 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -126,7 +126,8 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
 	relation = table_open(relationObjectId, NoLock);
 
 	/* Temporary and unlogged relations are inaccessible during recovery. */
-	if (!RelationNeedsWAL(relation) && RecoveryInProgress())
+	if (relation->rd_rel->relpersistence != RELPERSISTENCE_PERMANENT &&
+		RecoveryInProgress())
 		ereport(ERROR,
Re: [PATCH] ProcessInterrupts_hook
On Tue, 19 Jan 2021 at 02:01, Robert Haas wrote:
> On Mon, Jan 18, 2021 at 11:56 AM Tom Lane wrote:
> > > I've wanted this in the past, too, so +1 from me.
> >
> > I dunno, this seems pretty scary and easily abusable. There's not all
> > that much that can be done safely in ProcessInterrupts(), and we should
> > not be encouraging extensions to think they can add random processing
> > there.
>
> We've had this disagreement before about other things, and I just
> don't agree. If somebody uses a hook for something wildly unsafe, that
> will break their stuff, not ours.

Generally, yeah. And we already have no shortage of hooks with plenty of error or abuse potential and few safeguards. I'd argue that in C code any external code is inherently unsafe anyway. So it mainly comes down to whether the hook actively encourages unsafe actions without providing commensurate benefits, and whether there's a better/safer way to achieve the same thing. That's not to say I endorse adding hooks for random purposes in random places.

> In particular, if it's
> impossible to use a particular hook in a reasonably safe way, that's a
> sign that the hook is badly-designed and that we should not have it.

Yep. Agreed. Any hook can be abused or written incorrectly, from simple fmgr loadable functions right on up. The argument that a hook could be abused would apply just as well to exposing pqsignal() itself to extensions. Probably more so. Also to anything like ProcessUtility_hook.

> > We're about halfway there already, see 7e784d1dc. I didn't do the
> > other half because it wasn't necessary to the problem, but exposing
> > the shutdown state more fully seems reasonable.

Excellent, I'll take a look. Thanks.
Re: simplifying foreign key/RI checks
út 19. 1. 2021 v 3:08 odesílatel Amit Langote napsal: > On Tue, Jan 19, 2021 at 3:01 AM Pavel Stehule > wrote: > > po 18. 1. 2021 v 13:40 odesílatel Amit Langote > napsal: > >> I started with the check that's performed when inserting into or > >> updating the referencing table to confirm that the new row points to a > >> valid row in the referenced relation. The corresponding SQL is this: > >> > >> SELECT 1 FROM pk_rel x WHERE x.pkey = $1 FOR KEY SHARE OF x > >> > >> $1 is the value of the foreign key of the new row. If the query > >> returns a row, all good. Thanks to SPI, or its use of plan caching, > >> the query is re-planned only a handful of times before making a > >> generic plan that is then saved and reused, which looks like this: > >> > >> QUERY PLAN > >> -- > >> LockRows > >>-> Index Scan using pk_pkey on pk x > >> Index Cond: (a = $1) > >> (3 rows) > > > > > > What is performance when the referenced table is small? - a lot of > codebooks are small between 1000 to 10K rows. > > I see the same ~2x improvement. > > create table p (a numeric primary key); > insert into p select generate_series(1, 1000); > create table f (a bigint references p); > > Unpatched: > > insert into f select i%1000+1 from generate_series(1, 100) i; > INSERT 0 100 > Time: 5461.377 ms (00:05.461) > > > Patched: > > insert into f select i%1000+1 from generate_series(1, 100) i; > INSERT 0 100 > Time: 2357.440 ms (00:02.357) > > That's expected because the overhead of using SPI to check the PK > table, which the patch gets rid of, is the same no matter the size of > the index to be scanned. > It looks very well. Regards Pavel > -- > Amit Langote > EDB: http://www.enterprisedb.com >
RE: POC: postgres_fdw insert batching
From: Tomas Vondra
> OK. Can you prepare a final patch, squashing all the commits into a
> single one, and perhaps use the function in create_foreign_modify?

Attached, including the message fix pointed out by Zhihong-san.

Regards
Takayuki Tsunakawa

v12-0001-Add-bulk-insert-for-foreign-tables.patch Description: v12-0001-Add-bulk-insert-for-foreign-tables.patch
Re: Parallel INSERT (INTO ... SELECT ...)
On Tue, Jan 19, 2021 at 2:03 PM Amit Kapila wrote: > > > > > > > You have not raised a WARNING for the second case. > > > > The same checks in current Postgres code also don't raise a WARNING > > for that case, so I'm just being consistent with existing Postgres > > code (which itself isn't consistent for those two cases). > > > > Search for the string "too few entries in indexprs list" and you will > find a lot of places in code raising ERROR for the same condition. > Yes, but raising an ERROR stops processing (not just logs an error message). Raising a WARNING logs a warning message and continues processing. It's a big difference. So, for the added parallel-safety-checking code, it was suggested by Vignesh (and agreed by me) that, for these rare and highly unlikely conditions, it would be best not to just copy the error-handling code verbatim from other cases in the Postgres code (as I had originally done) and just stop processing dead with an error, but to instead return PARALLEL_UNSAFE, so that processing continues as it would for current non-parallel processing, which would most likely error-out anyway along the current error-handling checks and paths when those bad attributes/fields are referenced. I will add some Asserts() and don't mind adding a WARNING message for the 2nd case. If you really feel strongly about this, I can just restore the original code, which will stop dead with an ERROR in the middle of parallel-safety checking should one of these rare conditions ever occur. Regards, Greg Nancarrow Fujitsu Australia
Re: Cache relation sizes?
Hi Thomas

I want to share a patch with you: I changed the replacement algorithm from FIFO to a simple LRU.

Buzhen

0001-update-fifo-to-lru-to-sweep-a-valid-cache.patch Description: Binary data
[patch] Help information for pg_dump
Hi all

After executing command [pg_dump -?], some help information is as follows.

pg_dump -?
-
  -N, --exclude-schema=PATTERN do NOT dump the specified schema(s) ※
  -T, --exclude-table=PATTERN  do NOT dump the specified table(s) ※
  -x, --no-privileges          do not dump privileges (grant/revoke)
  --exclude-table-data=PATTERN do NOT dump data for the specified table(s) ※
  --no-comments                do not dump comments
  --no-publications            do not dump publications
  --no-security-labels         do not dump security label assignments
  --no-subscriptions           do not dump subscriptions
  --no-synchronized-snapshots  do not use synchronized snapshots in parallel jobs
  --no-tablespaces             do not dump tablespace assignments
  --no-unlogged-table-data     do not dump unlogged table data

I think it would be better to change [do NOT dump] to [do not dump].

Here is a patch.

Best Regards!

pg_dump.patch Description: pg_dump.patch
Re: simplifying foreign key/RI checks
On Tue, 19 Jan 2021 at 10:45, Amit Langote wrote: > On Tue, Jan 19, 2021 at 2:47 AM Zhihong Yu wrote: >> >> Hi, >> I was looking at this statement: >> >> insert into f select generate_series(1, 200, 2); >> >> Since certain generated values (the second half) are not in table p, >> wouldn't insertion for those values fail ? >> I tried a scaled down version (1000th) of your example: >> >> yugabyte=# insert into f select generate_series(1, 2000, 2); >> ERROR: insert or update on table "f" violates foreign key constraint >> "f_a_fkey" >> DETAIL: Key (a)=(1001) is not present in table "p". > > Sorry, a wrong copy-paste by me. Try this: > > create table p (a numeric primary key); > insert into p select generate_series(1, 200); > create table f (a bigint references p); > > -- Unpatched > insert into f select generate_series(1, 200, 2); > INSERT 0 100 > Time: 6527.652 ms (00:06.528) > > update f set a = a + 1; > UPDATE 100 > Time: 8108.310 ms (00:08.108) > > -- Patched: > insert into f select generate_series(1, 200, 2); > INSERT 0 100 > Time: 3312.193 ms (00:03.312) > > update f set a = a + 1; > UPDATE 100 > Time: 4292.807 ms (00:04.293) > >> For v1-0002-Avoid-using-SPI-for-some-RI-checks.patch : >> >> +* Collect partition key values from the unique key. >> >> At the end of the nested loop, should there be an assertion that >> partkey->partnatts partition key values have been found ? >> This can be done by using a counter (initialized to 0) which is incremented >> when a match is found by the inner loop. > > I've updated the patch to add the Assert. Thanks for taking a look. 
After applying the v2 patches, here are some warnings:

In file included from
/home/japin/Codes/postgresql/Debug/../src/include/postgres.h:47:0,
from
/home/japin/Codes/postgresql/Debug/../src/backend/utils/adt/ri_triggers.c:24:
/home/japin/Codes/postgresql/Debug/../src/backend/utils/adt/ri_triggers.c:
In function ‘ri_PrimaryKeyExists’:
/home/japin/Codes/postgresql/Debug/../src/include/utils/elog.h:134:5:
warning: this statement may fall through [-Wimplicit-fallthrough=]
  do { \
     ^
/home/japin/Codes/postgresql/Debug/../src/include/utils/elog.h:156:2:
note: in expansion of macro ‘ereport_domain’
  ereport_domain(elevel, TEXTDOMAIN, __VA_ARGS__)
  ^~
/home/japin/Codes/postgresql/Debug/../src/include/utils/elog.h:229:2:
note: in expansion of macro ‘ereport’
  ereport(elevel, errmsg_internal(__VA_ARGS__))
  ^~~
/home/japin/Codes/postgresql/Debug/../src/backend/utils/adt/ri_triggers.c:417:5:
note: in expansion of macro ‘elog’
  elog(ERROR, "unexpected table_tuple_lock status: %u", res);
  ^~~~
/home/japin/Codes/postgresql/Debug/../src/backend/utils/adt/ri_triggers.c:419:4:
note: here
  default:
  ^~~

--
Regards,
Japin Li.
ChengDu WenWu Information Technology Co.,Ltd.
Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit
On Mon, Jan 18, 2021 at 9:11 PM Fujii Masao wrote:
> > Attaching v12 patch set. 0001 is for postgres_fdw_disconnect()
> > function, 0002 is for keep_connections GUC and 0003 is for
> > keep_connection server level option.
>
> Thanks!
>
> > Please review it further.
>
> + server = GetForeignServerByName(servername, true);
> +
> + if (!server)
> + ereport(ERROR,
> + (errcode(ERRCODE_CONNECTION_DOES_NOT_EXIST),
> + errmsg("foreign server \"%s\" does not exist", servername)));
>
> ISTM we can simplify this code as follows.
>
> server = GetForeignServerByName(servername, false);

Done.

> + hash_seq_init(, ConnectionHash);
> + while ((entry = (ConnCacheEntry *) hash_seq_search()))
>
> When the server name is specified, even if its connection is successfully
> closed, postgres_fdw_disconnect() scans through all the entries to check
> whether there are active connections. But if "result" is true and
> active_conn_exists is true, we can get out of this loop to avoid unnecessary
> scans.

My initial thought was that it's possible to have two entries with the same foreign server name but with different user mappings; it looks like that's not possible. I tried associating a foreign server with two different user mappings [1]: the cache entry initially gets associated with the user mapping that comes first in pg_user_mappings, and if that user mapping is dropped, the cache entry gets invalidated, so next time the second user mapping is used. Since there's no way we can have two cache entries with the same foreign server name, we can get out of the loop when we find the cache entry matching the given server. I made the changes.

[1]
postgres=# select * from pg_user_mappings;
 umid  | srvid |  srvname  | umuser | usename | umoptions
-------+-------+-----------+--------+---------+-----------
 16395 | 16394 | loopback1 |     10 | bharath |           -> cache entry is initially made with this user mapping.
16399 | 16394 | loopback1 | 0 | public | -> if the above user mapping is dropped, then the cache entry is made with this user mapping. > + /* > +* Destroy the cache if we discarded all active connections i.e. if > there > +* is no single active connection, which we can know while scanning > the > +* cached entries in the above loop. Destroying the cache is better > than to > +* keep it in the memory with all inactive entries in it to save some > +* memory. Cache can get initialized on the subsequent queries to > foreign > +* server. > > How much memory is assumed to be saved by destroying the cache in > many cases? I'm not sure if it's really worth destroying the cache to save > the memory. I removed the cache destroying code, if somebody complains in future(after the feature commit), we can really revisit then. > + a warning is issued and false is returned. > false > + is returned when there are no open connections. When there are some > open > + connections, but there is no connection for the given foreign server, > + then false is returned. When no foreign server > exists > + with the given name, an error is emitted. Example usage of the > function: > > When a non-existent server name is specified, postgres_fdw_disconnect() > emits an error if there is at least one open connection, but just returns > false otherwise. At least for me, this behavior looks inconsitent and strange. > In that case, IMO the function always should emit an error. Done. Attaching v13 patch set, please review it further. With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com From 0731c6244ac228818916d62cc51ea1434178c5be Mon Sep 17 00:00:00 2001 From: Bharath Rupireddy Date: Tue, 19 Jan 2021 08:23:55 +0530 Subject: [PATCH v13] postgres_fdw function to discard cached connections This patch introduces a new function postgres_fdw_disconnect(). When called with a foreign server name, it discards the associated connection with the server. 
When called without any argument, it discards all the existing cached connections. --- contrib/postgres_fdw/connection.c | 147 ++ .../postgres_fdw/expected/postgres_fdw.out| 93 +++ .../postgres_fdw/postgres_fdw--1.0--1.1.sql | 10 ++ contrib/postgres_fdw/sql/postgres_fdw.sql | 35 + doc/src/sgml/postgres-fdw.sgml| 59 +++ 5 files changed, 344 insertions(+) diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c index a1404cb6bb..287a047c80 100644 --- a/contrib/postgres_fdw/connection.c +++ b/contrib/postgres_fdw/connection.c @@ -80,6 +80,7 @@ static bool xact_got_connection = false; * SQL functions */ PG_FUNCTION_INFO_V1(postgres_fdw_get_connections); +PG_FUNCTION_INFO_V1(postgres_fdw_disconnect); /* prototypes
Re: Is Recovery actually paused?
On Tue, 19 Jan 2021 at 8:12 AM, Yugo NAGATA wrote: > On Sun, 17 Jan 2021 11:33:52 +0530 > Dilip Kumar wrote: > > > On Thu, Jan 14, 2021 at 6:49 PM Yugo NAGATA wrote: > > > > > > On Wed, 13 Jan 2021 17:49:43 +0530 > > > Dilip Kumar wrote: > > > > > > > On Wed, Jan 13, 2021 at 3:35 PM Dilip Kumar > wrote: > > > > > > > > > > On Wed, Jan 13, 2021 at 3:27 PM Yugo NAGATA > wrote: > > > > > > > > > > > > On Thu, 10 Dec 2020 11:25:23 +0530 > > > > > > Dilip Kumar wrote: > > > > > > > > > > > > > > > However, I wonder users don't expect > pg_is_wal_replay_paused to wait. > > > > > > > > > Especially, if max_standby_streaming_delay is -1, this > will be blocked forever, > > > > > > > > > although this setting may not be usual. In addition, some > users may set > > > > > > > > > recovery_min_apply_delay for a large. If such users call > pg_is_wal_replay_paused, > > > > > > > > > it could wait for a long time. > > > > > > > > > > > > > > > > > > At least, I think we need some descriptions on document to > explain > > > > > > > > > pg_is_wal_replay_paused could wait while a time. > > > > > > > > > > > > > > > > Ok > > > > > > > > > > > > > > Fixed this, added some comments in .sgml as well as in > function header > > > > > > > > > > > > Thank you for fixing this. > > > > > > > > > > > > Also, is it better to fix the description of pg_wal_replay_pause > from > > > > > > "Pauses recovery." to "Request to pause recovery." in according > with > > > > > > pg_is_wal_replay_paused? > > > > > > > > > > Okay > > > > > > > > > > > > > > > > > > > > Also, how about adding a new boolean argument to > pg_is_wal_replay_paused to > > > > > > > > > control whether this waits for recovery to get paused or > not? By setting its > > > > > > > > > default value to true or false, users can use the old > format for calling this > > > > > > > > > and the backward compatibility can be maintained. 
> > > > > > > > > > > > > > > > So basically, if the wait_recovery_pause flag is false then > we will > > > > > > > > immediately return true if the pause is requested? I agree > that it is > > > > > > > > good to have an API to know whether the recovery pause is > requested or > > > > > > > > not but I am not sure is it good idea to make this API serve > both the > > > > > > > > purpose? Anyone else have any thoughts on this? > > > > > > > > > > > > > > > > > > > > I think the current pg_is_wal_replay_paused() already has > another purpose; > > > > > > this waits recovery to actually get paused. If we want to limit > this API's > > > > > > purpose only to return the pause state, it seems better to fix > this to return > > > > > > the actual state at the cost of lacking the backward > compatibility. If we want > > > > > > to know whether pause is requested, we may add a new API like > > > > > > pg_is_wal_replay_paluse_requeseted(). Also, if we want to wait > recovery to actually > > > > > > get paused, we may add an option to pg_wal_replay_pause() for > this purpose. > > > > > > > > > > > > However, this might be a bikeshedding. If anyone don't care that > > > > > > pg_is_wal_replay_paused() can make user wait for a long time, I > don't care either. > > > > > > > > > > I don't think that it will be blocked ever, because > > > > > pg_wal_replay_pause is sending the WakeupRecovery() which means the > > > > > recovery process will not be stuck on waiting for the WAL. > > > > > > Yes, there is no stuck on waiting for the WAL. However, it can be > stuck during resolving > > > a recovery conflict. The process could wait for > max_standby_streaming_delay or > > > max_standby_archive_delay at most before recovery get completely > paused. > > > > Okay, I agree that it is possible so for handling this we have a > > couple of options > > 1. pg_is_wal_replay_paused(), interface will wait for recovery to > > actually get paused, but user have an option to cancel that. 
So I > > agree that there is currently no option to just know that recovery > > pause is requested without waiting for its actually get paused if it > > is requested. So one option is we can provide an another interface as > > you mentioned pg_is_wal_replay_paluse_requeseted(), which can just > > return the request status. I am not sure how useful it is. > > If it is acceptable that pg_is_wal_replay_paused() makes users wait, > I'm ok for the current interface. I don't feel the need of > pg_is_wal_replay_paluse_requeseted(). > > > > > 2. Pass an option to pg_is_wal_replay_paused whether to wait for > > recovery to actually get paused or not. > > > > 3. Pass an option to pg_wal_replay_pause(), whether to wait for > > recovery pause or just request and return. > > > > I like the option 1, any other opinion on this? > > > > > Also, it could wait for recovery_min_apply_delay if it has a valid > value. It is possible > > > that a user set this parameter to a large value, so it could wait for > a long time. However, > > > this will be avoided by calling recoveryPausesHere() or >
Re: Parallel INSERT (INTO ... SELECT ...)
On Mon, Jan 18, 2021 at 3:50 PM Greg Nancarrow wrote: > > On Mon, Jan 18, 2021 at 8:10 PM Amit Kapila wrote: > > > > > >It is not clear why the above two are shouldn't happen cases and if so > > > >why we want to treat them as unsafe. Ideally, there should be an > > > >Assert if these can't happen but it is difficult to decide without > > > >knowing why you have considered them unsafe? > > > > > > The checks being done here for "should never happen" cases are THE > > > SAME as other parts of the Postgres code. > > > For example, search Postgres code for "null conbin" and you'll find 6 > > > other places in the Postgres code which actually ERROR out if conbin > > > (binary representation of the constraint) in a pg_constraint tuple is > > > found to be null. > > > The cases that you point out in the patch used to also error out in > > > the same way, but Vignesh suggested changing them to just return > > > parallel-unsafe instead of erroring-out, which I agree with. > > > > > > > You have not raised a WARNING for the second case. > > The same checks in current Postgres code also don't raise a WARNING > for that case, so I'm just being consistent with existing Postgres > code (which itself isn't consistent for those two cases). > Search for the string "too few entries in indexprs list" and you will find a lot of places in code raising ERROR for the same condition. -- With Regards, Amit Kapila.
Re: list of extended statistics on psql
Hi,

The above query is simple enough that we would do better to use the following query:

# This query works on PG10 or later
SELECT es.stxnamespace::pg_catalog.regnamespace::text AS "Schema", es.stxname AS "Name", pg_catalog.format('%s FROM %s', (SELECT pg_catalog.string_agg(pg_catalog.quote_ident(a.attname),', ') FROM pg_catalog.unnest(es.stxkeys) s(attnum) JOIN pg_catalog.pg_attribute a ON (es.stxrelid = a.attrelid AND a.attnum = s.attnum AND NOT a.attisdropped)), es.stxrelid::regclass) AS "Definition", CASE WHEN 'd' = any(es.stxkind) THEN 'defined' END AS "Ndistinct", CASE WHEN 'f' = any(es.stxkind) THEN 'defined' END AS "Dependencies", CASE WHEN 'm' = any(es.stxkind) THEN 'defined' END AS "MCV" FROM pg_catalog.pg_statistic_ext es ORDER BY 1, 2;

 Schema |    Name    |    Definition    | Ndistinct | Dependencies |   MCV
--------+------------+------------------+-----------+--------------+---------
 public | hoge1_ext  | a, b FROM hoge1  | defined   | defined      | defined
 public | hoge_t_ext | a, b FROM hoge_t | defined   | defined      | defined
(2 rows)

I'm going to create the WIP patch to use the above query. Any comments welcome. :-D

The attached patch is a WIP patch. The changes are:
- Use pg_statistic_ext only
- Remove these statuses: "required" and "built"
- Add new status: "defined"
- Remove the size columns
- Fix document

I'll create and send the regression test in the next patch if there is no objection. Is that okay?

Regards,
Tatsuro Yamada

diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml index 221a967bfe..36a79d9e3f 100644 --- a/doc/src/sgml/ref/psql-ref.sgml +++ b/doc/src/sgml/ref/psql-ref.sgml @@ -1918,6 +1918,26 @@ testdb= + + +\dX [ pattern ] + + +Lists extended statistics. +If pattern +is specified, only those extended statistics whose names match the +pattern are listed. + + + +The column of the kind of extended stats (e.g. Ndistinct) shows its status. +NULL means that it doesn't exist. "defined" means that it is declared. 
+You can use pg_stats_ext if you'd like to know whether +ANALYZE was run and statistics are available to the +planner. + + + \dy[+] [ pattern ] diff --git a/src/bin/psql/command.c b/src/bin/psql/command.c index 303e7c3ad8..c98e3d31d0 100644 --- a/src/bin/psql/command.c +++ b/src/bin/psql/command.c @@ -928,6 +928,9 @@ exec_command_d(PsqlScanState scan_state, bool active_branch, const char *cmd) else success = listExtensions(pattern); break; + case 'X': /* Extended Statistics */ + success = listExtendedStats(pattern); + break; case 'y': /* Event Triggers */ success = listEventTriggers(pattern, show_verbose); break; diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c index caf97563f4..899fe5d85c 100644 --- a/src/bin/psql/describe.c +++ b/src/bin/psql/describe.c @@ -4392,6 +4392,89 @@ listEventTriggers(const char *pattern, bool verbose) return true; } +/* + * \dX + * + * Describes extended statistics. + */ +bool +listExtendedStats(const char *pattern) +{ + PQExpBufferData buf; + PGresult *res; + printQueryOpt myopt = pset.popt; + + if (pset.sversion < 10) + { + charsverbuf[32]; + + pg_log_error("The server (version %s) does not support extended statistics.", +formatPGVersionNumber(pset.sversion, false, + sverbuf, sizeof(sverbuf))); + return true; + } + + initPQExpBuffer(); + printfPQExpBuffer(, + "SELECT \n" + "es.stxnamespace::pg_catalog.regnamespace::text AS \"%s\", \n" + "es.stxname AS \"%s\", \n" + "pg_catalog.format('%%s FROM %%s', \n" + " (SELECT pg_catalog.string_agg(pg_catalog.quote_ident(a.attname),', ') \n" + " FROM pg_catalog.unnest(es.stxkeys) s(attnum) \n" + " JOIN pg_catalog.pg_attribute a \n" + " ON (es.stxrelid = a.attrelid \n" + " AND a.attnum = s.attnum \n" +
Re: simplifying foreign key/RI checks
Thanks for the quick response. + if (mapped_partkey_attnums[i] == pk_attnums[j]) + { + partkey_vals[i] = pk_vals[j]; + partkey_isnull[i] = pk_nulls[j] == 'n' ? true : false; + i++; + break; The way counter (i) is incremented is out of my expectation. In the rare case, where some i doesn't have corresponding pk_attnums[j], wouldn't there be a dead loop ? I think the goal of adding the assertion should be not loop infinitely even if the invariant is not satisfied. I guess a counter other than i would be better for this purpose. Cheers On Mon, Jan 18, 2021 at 6:45 PM Amit Langote wrote: > On Tue, Jan 19, 2021 at 2:47 AM Zhihong Yu wrote: > > > > Hi, > > I was looking at this statement: > > > > insert into f select generate_series(1, 200, 2); > > > > Since certain generated values (the second half) are not in table p, > wouldn't insertion for those values fail ? > > I tried a scaled down version (1000th) of your example: > > > > yugabyte=# insert into f select generate_series(1, 2000, 2); > > ERROR: insert or update on table "f" violates foreign key constraint > "f_a_fkey" > > DETAIL: Key (a)=(1001) is not present in table "p". > > Sorry, a wrong copy-paste by me. Try this: > > create table p (a numeric primary key); > insert into p select generate_series(1, 200); > create table f (a bigint references p); > > -- Unpatched > insert into f select generate_series(1, 200, 2); > INSERT 0 100 > Time: 6527.652 ms (00:06.528) > > update f set a = a + 1; > UPDATE 100 > Time: 8108.310 ms (00:08.108) > > -- Patched: > insert into f select generate_series(1, 200, 2); > INSERT 0 100 > Time: 3312.193 ms (00:03.312) > > update f set a = a + 1; > UPDATE 100 > Time: 4292.807 ms (00:04.293) > > > For v1-0002-Avoid-using-SPI-for-some-RI-checks.patch : > > > > +* Collect partition key values from the unique key. > > > > At the end of the nested loop, should there be an assertion that > partkey->partnatts partition key values have been found ? 
> > This can be done by using a counter (initialized to 0) which is > incremented when a match is found by the inner loop. > > I've updated the patch to add the Assert. Thanks for taking a look. > > -- > Amit Langote > EDB: http://www.enterprisedb.com >
Re: list of extended statistics on psql
On 1/19/21 1:44 AM, Tatsuro Yamada wrote:

Hi Tomas,

As for how to deal with this, I can think of about four ways:

1) simplify the command to only print information from pg_statistic_ext (so no information about which stats are built or sizes)

2) extend pg_stats_ext with necessary information (e.g. sizes)

3) create a new system view, with necessary information (so that pg_stats_ext does not need to be modified)

4) add functions returning the necessary information, possibly only for statistics the user can access (similarly to what pg_stats_ext does)

Options 2-4 have the obvious disadvantage that this won't work on older releases (we can't add views or functions there). So I'm leaning towards #1 even if that means we have to remove some of the details. We can consider adding that for new releases, though.

Thanks for the useful advice. I'll go with option 1).

The following query is created by using pg_stats_ext instead of pg_statistic_ext and pg_statistic_ext_data. However, I was confused about writing the part of the query that calculates MCV size because there are four columns related to MCV: most_common_vals, most_common_val_nulls, most_common_freqs, and most_common_base_freqs. Currently, I don't know how to calculate the size of the MCV list by using those four columns. Thoughts? :-)

Well, my suggestion was to use pg_statistic_ext, because that lists all statistics, while pg_stats_ext is filtering statistics depending on access privileges. I think that's more appropriate for \dX; the contents should not change depending on the user.

Also, let me clarify - with option (1) we'd not show the sizes at all. The size of the formatted statistics may be very different from the on-disk representation, so I see no point in showing it in \dX. We might show other stats (e.g. number of MCV items, or the fraction of data represented by the MCV list), but the user can inspect pg_stats_ext if needed. 
What we might do is to show those stats when a superuser is running this command, but I'm not sure that's a good idea (or how difficult would it be to implement). Thanks for clarifying. I see that your suggestion was to use pg_statistic_ext, not pg_stats_ext. And we don't need the size of stats. If that's the case, we also can't get the status of stats since PG12 or later because we can't use pg_statistic_ext_data, as you know. Therefore, it would be better to replace the query with the old query that I sent five months ago like this: # the old query SELECT stxnamespace::pg_catalog.regnamespace AS "Schema", stxrelid::pg_catalog.regclass AS "Table", stxname AS "Name", (SELECT pg_catalog.string_agg(pg_catalog.quote_ident(attname),', ') FROM pg_catalog.unnest(stxkeys) s(attnum) JOIN pg_catalog.pg_attribute a ON (stxrelid = a.attrelid AND a.attnum = s.attnum AND NOT attisdropped)) AS "Columns", 'd' = any(stxkind) AS "Ndistinct", 'f' = any(stxkind) AS "Dependencies", 'm' = any(stxkind) AS "MCV" FROM pg_catalog.pg_statistic_ext stat ORDER BY 1,2; Schema | Table | Name | Columns | Ndistinct | Dependencies | MCV +++-+---+--+- public | hoge1 | hoge1_ext | a, b | t | t | t public | hoge_t | hoge_t_ext | a, b | t | t | t (2 rows) The above query is so simple so that we would better to use the following query: # This query works on PG10 or later SELECT es.stxnamespace::pg_catalog.regnamespace::text AS "Schema", es.stxname AS "Name", pg_catalog.format('%s FROM %s', (SELECT pg_catalog.string_agg(pg_catalog.quote_ident(a.attname),', ') FROM pg_catalog.unnest(es.stxkeys) s(attnum) JOIN pg_catalog.pg_attribute a ON (es.stxrelid = a.attrelid AND a.attnum = s.attnum AND NOT a.attisdropped)), es.stxrelid::regclass) AS "Definition", CASE WHEN 'd' = any(es.stxkind) THEN 'defined' END AS "Ndistinct", CASE WHEN 'f' = any(es.stxkind) THEN 'defined' END AS "Dependencies", CASE WHEN 'm' = any(es.stxkind) THEN 'defined' END AS "MCV" FROM pg_catalog.pg_statistic_ext es ORDER BY 1, 2; Schema | 
Name    |    Definition    | Ndistinct | Dependencies |   MCV
--------+------------+------------------+-----------+--------------+---------
 public | hoge1_ext  | a, b FROM hoge1  | defined   | defined      | defined
 public | hoge_t_ext | a, b FROM hoge_t | defined   | defined      | defined
(2 rows)

I'm going to create the WIP patch to use the above query. Any comments welcome. :-D

Yes, I think using this simpler query makes sense. If we decide we need something more elaborate, we can improve it in future PostgreSQL versions (after adding a view/function to core), but I'd leave that as work for the future.

Apologies for all the extra work - I hadn't realized this flaw when pushing
Re: simplifying foreign key/RI checks
On Tue, Jan 19, 2021 at 2:47 AM Zhihong Yu wrote: > > Hi, > I was looking at this statement: > > insert into f select generate_series(1, 200, 2); > > Since certain generated values (the second half) are not in table p, wouldn't > insertion for those values fail ? > I tried a scaled down version (1000th) of your example: > > yugabyte=# insert into f select generate_series(1, 2000, 2); > ERROR: insert or update on table "f" violates foreign key constraint > "f_a_fkey" > DETAIL: Key (a)=(1001) is not present in table "p". Sorry, a wrong copy-paste by me. Try this: create table p (a numeric primary key); insert into p select generate_series(1, 200); create table f (a bigint references p); -- Unpatched insert into f select generate_series(1, 200, 2); INSERT 0 100 Time: 6527.652 ms (00:06.528) update f set a = a + 1; UPDATE 100 Time: 8108.310 ms (00:08.108) -- Patched: insert into f select generate_series(1, 200, 2); INSERT 0 100 Time: 3312.193 ms (00:03.312) update f set a = a + 1; UPDATE 100 Time: 4292.807 ms (00:04.293) > For v1-0002-Avoid-using-SPI-for-some-RI-checks.patch : > > +* Collect partition key values from the unique key. > > At the end of the nested loop, should there be an assertion that > partkey->partnatts partition key values have been found ? > This can be done by using a counter (initialized to 0) which is incremented > when a match is found by the inner loop. I've updated the patch to add the Assert. Thanks for taking a look. -- Amit Langote EDB: http://www.enterprisedb.com v2-0001-Export-get_partition_for_tuple.patch Description: Binary data v2-0002-Avoid-using-SPI-for-some-RI-checks.patch Description: Binary data
configurable the threshold for warning due to run out of transaction ID
Hi,

PostgreSQL has a feature to warn about running out of transaction IDs. The following message is an example.

```
2021-01-19 10:59:27 JST [client backend] WARNING: database "postgres" must be vacuumed within xxx transactions
2021-01-19 10:59:27 JST [client backend] HINT: To avoid a database shutdown, execute a database-wide VACUUM in that database. You might also need to commit or roll back old prepared transactions, or drop stale replication slots.
```

But the threshold for the warning is not configurable. The value is hard-coded to 40M.

```
varsup.c

/*
 * We'll start complaining loudly when we get within 40M transactions of
 * data loss. This is kind of arbitrary, but if you let your gas gauge
 * get down to 2% of full, would you be looking for the next gas station?
 * We need to be fairly liberal about this number because there are lots
 * of scenarios where most transactions are done by automatic clients that
 * won't pay attention to warnings. (No, we're not gonna make this
 * configurable. If you know enough to configure it, you know enough to
 * not get in this kind of trouble in the first place.)
 */
```

I think it would be useful to make the threshold for this warning configurable, like the "checkpoint_warning" parameter. Actually, when a user's workload was too write-heavy, there was a case where we wanted to get the warning message earlier.

I understand that there is another way to handle it: for example, monitoring the frozen transaction ID by executing the following query and checking whether a custom threshold is exceeded.

```
SELECT max(age(datfrozenxid)) FROM pg_database;
```

But I think warning in the server log is a simpler way. I would like to know your opinion. If it's useful, I'll make patches.

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION
Re: Is Recovery actually paused?
On Sun, 17 Jan 2021 11:33:52 +0530 Dilip Kumar wrote: > On Thu, Jan 14, 2021 at 6:49 PM Yugo NAGATA wrote: > > > > On Wed, 13 Jan 2021 17:49:43 +0530 > > Dilip Kumar wrote: > > > > > On Wed, Jan 13, 2021 at 3:35 PM Dilip Kumar wrote: > > > > > > > > On Wed, Jan 13, 2021 at 3:27 PM Yugo NAGATA wrote: > > > > > > > > > > On Thu, 10 Dec 2020 11:25:23 +0530 > > > > > Dilip Kumar wrote: > > > > > > > > > > > > > However, I wonder users don't expect pg_is_wal_replay_paused to > > > > > > > > wait. > > > > > > > > Especially, if max_standby_streaming_delay is -1, this will be > > > > > > > > blocked forever, > > > > > > > > although this setting may not be usual. In addition, some users > > > > > > > > may set > > > > > > > > recovery_min_apply_delay for a large. If such users call > > > > > > > > pg_is_wal_replay_paused, > > > > > > > > it could wait for a long time. > > > > > > > > > > > > > > > > At least, I think we need some descriptions on document to > > > > > > > > explain > > > > > > > > pg_is_wal_replay_paused could wait while a time. > > > > > > > > > > > > > > Ok > > > > > > > > > > > > Fixed this, added some comments in .sgml as well as in function > > > > > > header > > > > > > > > > > Thank you for fixing this. > > > > > > > > > > Also, is it better to fix the description of pg_wal_replay_pause from > > > > > "Pauses recovery." to "Request to pause recovery." in according with > > > > > pg_is_wal_replay_paused? > > > > > > > > Okay > > > > > > > > > > > > > > > > > Also, how about adding a new boolean argument to > > > > > > > > pg_is_wal_replay_paused to > > > > > > > > control whether this waits for recovery to get paused or not? > > > > > > > > By setting its > > > > > > > > default value to true or false, users can use the old format > > > > > > > > for calling this > > > > > > > > and the backward compatibility can be maintained. 
> > > > > > > > > > > > > > So basically, if the wait_recovery_pause flag is false then we > > > > > > > will > > > > > > > immediately return true if the pause is requested? I agree that > > > > > > > it is > > > > > > > good to have an API to know whether the recovery pause is > > > > > > > requested or > > > > > > > not but I am not sure is it good idea to make this API serve both > > > > > > > the > > > > > > > purpose? Anyone else have any thoughts on this? > > > > > > > > > > > > > > > > > I think the current pg_is_wal_replay_paused() already has another > > > > > purpose; > > > > > this waits recovery to actually get paused. If we want to limit this > > > > > API's > > > > > purpose only to return the pause state, it seems better to fix this > > > > > to return > > > > > the actual state at the cost of lacking the backward compatibility. > > > > > If we want > > > > > to know whether pause is requested, we may add a new API like > > > > > pg_is_wal_replay_paluse_requeseted(). Also, if we want to wait > > > > > recovery to actually > > > > > get paused, we may add an option to pg_wal_replay_pause() for this > > > > > purpose. > > > > > > > > > > However, this might be a bikeshedding. If anyone don't care that > > > > > pg_is_wal_replay_paused() can make user wait for a long time, I don't > > > > > care either. > > > > > > > > I don't think that it will be blocked ever, because > > > > pg_wal_replay_pause is sending the WakeupRecovery() which means the > > > > recovery process will not be stuck on waiting for the WAL. > > > > Yes, there is no stuck on waiting for the WAL. However, it can be stuck > > during resolving > > a recovery conflict. The process could wait for max_standby_streaming_delay > > or > > max_standby_archive_delay at most before recovery get completely paused. > > Okay, I agree that it is possible so for handling this we have a > couple of options > 1. 
pg_is_wal_replay_paused(), interface will wait for recovery to > actually get paused, but user have an option to cancel that. So I > agree that there is currently no option to just know that recovery > pause is requested without waiting for its actually get paused if it > is requested. So one option is we can provide an another interface as > you mentioned pg_is_wal_replay_paluse_requeseted(), which can just > return the request status. I am not sure how useful it is.

If it is acceptable that pg_is_wal_replay_paused() makes users wait, I'm OK with the current interface. I don't see the need for pg_is_wal_replay_pause_requested().

> > 2. Pass an option to pg_is_wal_replay_paused whether to wait for > recovery to actually get paused or not. > > 3. Pass an option to pg_wal_replay_pause(), whether to wait for > recovery pause or just request and return. > > I like the option 1, any other opinion on this? > > > Also, it could wait for recovery_min_apply_delay if it has a valid value. > > It is possible > > that a user set this parameter to a large value, so it could wait for a > > long time. However, > > this will be avoided by calling recoveryPausesHere()
RE: Parallel INSERT (INTO ... SELECT ...)
From: Greg Nancarrow > On Fri, Jan 15, 2021 at 7:39 PM Amit Kapila wrote: > >Here, it seems we are accessing the relation descriptor without any > >lock on the table which is dangerous considering the table can be > >modified in a parallel session. Is there a reason why you think this > >is safe? Did you check anywhere else such a coding pattern? > > Yes, there's a very good reason and I certainly have checked for the > same coding pattern elsewhere, and not just randomly decided that > locking can be ignored. > The table has ALREADY been locked (by the caller) during the > parse/analyze phase. Isn't there any case where planning is done but parse analysis is not done immediately before? e.g. * Alteration of objects invalidates cached query plans, and the next execution of the plan rebuilds it. (But it seems that parse analysis is done in this case in plancache.c.) * Execute a prepared statement with a different parameter value, which builds a new custom plan or a generic plan. Is the cached query plan invalidated when some alteration is done to change the parallel safety, such as adding a trigger with a parallel-unsafe trigger action? Regards Takayuki Tsunakawa
Re: POC: postgres_fdw insert batching
On 1/19/21 2:28 AM, tsunakawa.ta...@fujitsu.com wrote: From: Tomas Vondra I took a look at this - there's a bit of bitrot due to 708d165ddb92c, so attached is a rebased patch (0001) fixing that. 0002 adds a couple comments and minor tweaks 0003 addresses a couple shortcomings related to explain - we haven't been showing the batch size for EXPLAIN (VERBOSE), because there'd be no FdwState, so this tries to fix that. Furthermore, there were no tests for EXPLAIN output with batch size, so I added a couple. Thank you, good additions. They all look good. Only one point: I think the code for retrieving batch_size in create_foreign_modify() can be replaced with a call to the new function in 0003. OK. Can you prepare a final patch, squashing all the commits into a single one, and perhaps use the function in create_foreign_modify? regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: pgindent for worker.c
On Tue, Jan 19, 2021 at 6:53 AM Peter Smith wrote: > > PSA a trivial patch just to pgindent the file > src/backend/replication/logical/worker.c > > (I am modifying this file in a separate patch, but every time I used > pgindent for my own code I would keep seeing these existing format > problems). > Sorry for the inconvenience. This seems to be a leftover from my commit 0926e96c49, so I will take care of this. I think we need to change this file in the upcoming patches for logical replication of 2PC so, I'll push this change separately. -- With Regards, Amit Kapila.
Re: [PATCH 1/1] Initial mach based shared memory support.
On Mon, Jan 18, 2021 at 5:29 PM Tom Lane wrote: > > James Hilliard writes: > > OSX implements sysv shmem support via a mach wrapper, however the mach > > sysv shmem wrapper has some severe restrictions that prevent us from > > allocating enough memory segments in some cases. > > ... > > In order to avoid hitting these limits we can bypass the wrapper layer > > and just use mach directly. > > I wanted to review this, but it's impossible because the kernel calls > you're using have you've-got-to-be-kidding documentation like this: > > https://developer.apple.com/documentation/kernel/1402504-mach_vm_page_info?language=objc Yeah, the documentation situation with Apple tends to be some degree of terrible in general. Ultimately I was referencing the sysv shmem wrapper code itself a lot: https://github.com/apple/darwin-xnu/blob/master/bsd/kern/sysv_shm.c I also used the browser shmem implementations as references: https://github.com/chromium/chromium/blob/master/base/memory/platform_shared_memory_region_mac.cc https://github.com/mozilla/gecko-dev/blob/master/ipc/glue/SharedMemoryBasic_mach.mm > > Google finds the source code too, but that's equally bereft of useful > documentation. So I wonder where you obtained the information necessary > to write this patch. Mostly trial and error while using the above references. > > I did notice however that mach_shmem.c seems to be 95% copy-and-paste from > sysv_shmem.c, including a lot of comments that don't feel particularly > true or relevant here. I wonder whether we need a separate file as > opposed to a few #ifdef's around the kernel calls. Well I basically started with copying sysv_shmem.c and then started replacing the sysv shmem calls with their mach equivalents. I kept it as a separate file since the mach calls use different semantics and that's also how the other projects seem to separate out their mach shared memory handlers. 
I left a number of the original comments since I wasn't entirely sure what best to replace them with. > > regards, tom lane
Re: simplifying foreign key/RI checks
On Tue, Jan 19, 2021 at 3:01 AM Pavel Stehule wrote:
> On Mon, Jan 18, 2021 at 13:40, Amit Langote wrote:
>> I started with the check that's performed when inserting into or
>> updating the referencing table to confirm that the new row points to a
>> valid row in the referenced relation. The corresponding SQL is this:
>>
>> SELECT 1 FROM pk_rel x WHERE x.pkey = $1 FOR KEY SHARE OF x
>>
>> $1 is the value of the foreign key of the new row. If the query
>> returns a row, all good. Thanks to SPI, or rather its use of plan
>> caching, the query is re-planned only a handful of times before making
>> a generic plan that is then saved and reused, which looks like this:
>>
>>                  QUERY PLAN
>> --------------------------------------------
>>  LockRows
>>    ->  Index Scan using pk_pkey on pk x
>>          Index Cond: (a = $1)
>> (3 rows)
>
> What is performance when the referenced table is small? - a lot of
> codebooks are small, between 1000 and 10K rows.

I see the same ~2x improvement.

create table p (a numeric primary key);
insert into p select generate_series(1, 1000);
create table f (a bigint references p);

Unpatched:

insert into f select i%1000+1 from generate_series(1, 100) i;
INSERT 0 100
Time: 5461.377 ms (00:05.461)

Patched:

insert into f select i%1000+1 from generate_series(1, 100) i;
INSERT 0 100
Time: 2357.440 ms (00:02.357)

That's expected, because the overhead of using SPI to check the PK table,
which the patch gets rid of, is the same no matter the size of the index
to be scanned.

--
Amit Langote
EDB: http://www.enterprisedb.com
Re: psql \df choose functions by their arguments
On Fri, Jan 15, 2021 at 1:46 Greg Sabino Mullane wrote:
> Thanks for the feedback: new version v5 (attached) has int8, plus the
> suggested code formatting.
>
> Cheers,
> Greg

Thanks for the update. In my preceding mail I meant that we should add
int2, int4, and int8 for completeness (apologies, I was a bit unclear
there), as AFAICS that covers all the aliases, even if these three are
less widely used.

FWIW, one place where these do get used in substantial numbers is the
regression tests themselves:

$ for L in 2 4 8; do git grep int$L src/test/regress/ | wc -l; done
544
2332
1353

Regards

Ian Barwick

--
EnterpriseDB: https://www.enterprisedb.com
Re: PoC/WIP: Extended statistics on expressions
On 1/18/21 4:48 PM, Dean Rasheed wrote:

Looking through extended_stats.c, I found a corner case that can lead to
a seg-fault:

CREATE TABLE foo();
CREATE STATISTICS s ON (1) FROM foo;
ANALYSE foo;

This crashes in lookup_var_attr_stats(), because it isn't expecting
nvacatts to be 0. I can't think of any case where building stats on a
table with no analysable columns is useful, so it should probably just
exit early in that case.

In BuildRelationExtStatistics(), it looks like min_attrs should be
declared assert-only.

In evaluate_expressions():

+   /* set the pointers */
+   result = (ExprInfo *) ptr;
+   ptr += sizeof(ExprInfo);

I think that should probably have a MAXALIGN().

Thanks, I'll fix all of that.

A slightly bigger issue that I don't like is the way it assigns attribute
numbers for expressions starting from MaxHeapAttributeNumber+1, so the
first expression has an attnum of 1601. That leads to pretty inefficient
use of Bitmapsets, since most tables only contain a handful of columns,
and there's a large block of unused space in the middle of the Bitmapset.

An alternative approach might be to use regular attnums for columns and
use negative indexes -1, -2, -3, ... for expressions in the stored stats.
Then when storing and retrieving attnums from Bitmapsets, it could just
offset by STATS_MAX_DIMENSIONS (8) to avoid negative values in the
Bitmapsets, since there can't be more than that many expressions (just
like other code stores system attributes using
FirstLowInvalidHeapAttributeNumber). That would be a somewhat bigger
change, but hopefully fairly mechanical, and then some code like
add_expressions_to_attributes() would go away.

Well, I tried this but unfortunately it's not that simple. We still need
to build the bitmaps, so I don't think add_expression_to_attributes can
just be removed. I mean, we need to do the offsetting somewhere, even if
we change how we do it.
But the main issue is that in some cases the number of expressions is not
really limited by STATS_MAX_DIMENSIONS - for example when applying
functional dependencies, we "merge" multiple statistics, so we may end up
with more expressions. So we can't just use STATS_MAX_DIMENSIONS.

Also, if we offset regular attnums by STATS_MAX_DIMENSIONS, that inverts
the order of processing (so far we've assumed expressions are after
regular attnums). So the changes are more extensive - I tried doing that
anyway, and I'm still struggling with crashes and regression failures.

Of course, that doesn't mean we shouldn't do it, but it's far from
mechanical. (Some of that is probably a sign this code needs a bit more
work to polish.) But I wonder if it'd be easier to just calculate the
actual max attnum and then use it instead of MaxHeapAttributeNumber ...

Looking at the new view pg_stats_ext_exprs, I noticed that it fails to
show expressions until the statistics have been built. For example:

CREATE TABLE foo(a int, b int);
CREATE STATISTICS s ON (a+b), (a*b) FROM foo;
SELECT statistics_name, tablename, expr, n_distinct FROM pg_stats_ext_exprs;

 statistics_name | tablename | expr | n_distinct
-----------------+-----------+------+------------
 s               | foo       |      |
(1 row)

but after populating and analysing the table, this becomes:

 statistics_name | tablename |  expr   | n_distinct
-----------------+-----------+---------+------------
 s               | foo       | (a + b) |         11
 s               | foo       | (a * b) |         11
(2 rows)

I think it should show the expressions even before the stats have been
built.

Another issue is that it returns rows for non-expression stats as well.
For example:

CREATE TABLE foo(a int, b int);
CREATE STATISTICS s ON a, b FROM foo;
SELECT statistics_name, tablename, expr, n_distinct FROM pg_stats_ext_exprs;

 statistics_name | tablename | expr | n_distinct
-----------------+-----------+------+------------
 s               | foo       |      |
(1 row)

and those values will never be populated, since they're not expressions,
so I would expect them to not be shown in the view.
So basically, instead of

+ LEFT JOIN LATERAL (
+     SELECT
+         *
+     FROM (
+         SELECT
+             unnest(pg_get_statisticsobjdef_expressions(s.oid)) AS expr,
+             unnest(sd.stxdexpr)::pg_statistic AS a
+     ) x
+ ) stat ON sd.stxdexpr IS NOT NULL;

perhaps just

+ JOIN LATERAL (
+     SELECT
+         *
+     FROM (
+         SELECT
+             unnest(pg_get_statisticsobjdef_expressions(s.oid)) AS expr,
+             unnest(sd.stxdexpr)::pg_statistic AS a
+     ) x
+ ) stat ON true;

Will fix.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
RE: POC: postgres_fdw insert batching
Tomas-san, Zhihong-san,

From: Zhihong Yu
> +       if (batch_size <= 0)
> +               ereport(ERROR,
> +                               (errcode(ERRCODE_SYNTAX_ERROR),
> +                                errmsg("%s requires a non-negative integer value",
>
> It seems the message doesn't match the check w.r.t. the batch size of 0.

Ah, "non-negative" should be "positive". The message for the existing
fetch_size should be fixed too. Tomas-san, could you include this as well?
I'm sorry to trouble you.

> +       int numInserted = numSlots;
>
> Since numInserted is filled by ExecForeignBatchInsert(), the
> initialization can be done with 0.

No, the code is correct, since the batch function requires the number of
rows to insert as input.

Regards
Takayuki Tsunakawa
Paint some PG_USED_FOR_ASSERTS_ONLY in inline functions of ilist.h and bufpage.h
Hi all,

The following functions in ilist.h and bufpage.h use some arguments only
in assertions:
- dlist_next_node
- dlist_prev_node
- slist_has_next
- slist_next_node
- PageValidateSpecialPointer

Without PG_USED_FOR_ASSERTS_ONLY, this can lead to compilation warnings
when not using assertions, and one example of that is plpgsql_check. We
don't have examples on HEAD where PG_USED_FOR_ASSERTS_ONLY is used on
arguments, but that looks to work properly with gcc.

Thoughts?
--
Michael

diff --git a/src/include/lib/ilist.h b/src/include/lib/ilist.h
index aa196428ed..0c5e1d375f 100644
--- a/src/include/lib/ilist.h
+++ b/src/include/lib/ilist.h
@@ -418,7 +418,8 @@ dlist_has_prev(dlist_head *head, dlist_node *node)
  * Return the next node in the list (there must be one).
  */
 static inline dlist_node *
-dlist_next_node(dlist_head *head, dlist_node *node)
+dlist_next_node(dlist_head *head PG_USED_FOR_ASSERTS_ONLY,
+				dlist_node *node)
 {
 	Assert(dlist_has_next(head, node));
 	return node->next;
@@ -428,7 +429,8 @@ dlist_next_node(dlist_head *head, dlist_node *node)
  * Return previous node in the list (there must be one).
  */
 static inline dlist_node *
-dlist_prev_node(dlist_head *head, dlist_node *node)
+dlist_prev_node(dlist_head *head PG_USED_FOR_ASSERTS_ONLY,
+				dlist_node *node)
 {
 	Assert(dlist_has_prev(head, node));
 	return node->prev;
@@ -608,7 +610,8 @@ slist_pop_head_node(slist_head *head)
  * Check whether 'node' has a following node.
  */
 static inline bool
-slist_has_next(slist_head *head, slist_node *node)
+slist_has_next(slist_head *head PG_USED_FOR_ASSERTS_ONLY,
+			   slist_node *node)
 {
 	slist_check(head);

@@ -619,7 +622,8 @@ slist_has_next(slist_head *head, slist_node *node)
  * Return the next node in the list (there must be one).
  */
 static inline slist_node *
-slist_next_node(slist_head *head, slist_node *node)
+slist_next_node(slist_head *head PG_USED_FOR_ASSERTS_ONLY,
+				slist_node *node)
 {
 	Assert(slist_has_next(head, node));
 	return node->next;
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index 359b749f7f..85414a53d6 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -310,7 +310,7 @@ typedef PageHeaderData *PageHeader;
  * specifics from the macro failure within this function.
  */
 static inline bool
-PageValidateSpecialPointer(Page page)
+PageValidateSpecialPointer(Page page PG_USED_FOR_ASSERTS_ONLY)
 {
 	Assert(PageIsValid(page));
 	Assert(((PageHeader) (page))->pd_special <= BLCKSZ);

signature.asc
Description: PGP signature
RE: POC: postgres_fdw insert batching
From: Tomas Vondra > I took a look at this - there's a bit of bitrot due to 708d165ddb92c, so > attached is > a rebased patch (0001) fixing that. > > 0002 adds a couple comments and minor tweaks > > 0003 addresses a couple shortcomings related to explain - we haven't been > showing the batch size for EXPLAIN (VERBOSE), because there'd be no > FdwState, so this tries to fix that. Furthermore, there were no tests for > EXPLAIN > output with batch size, so I added a couple. Thank you, good additions. They all look good. Only one point: I think the code for retrieving batch_size in create_foreign_modify() can be replaced with a call to the new function in 0003. God bless us. Regards Takayuki Tsunakawa
pgindent for worker.c
PSA a trivial patch just to pgindent the file src/backend/replication/logical/worker.c (I am modifying this file in a separate patch, but every time I used pgindent for my own code I would keep seeing these existing format problems). Kind Regards, Peter Smith. Fujitsu Australia pgindent_worker.patch Description: Binary data
RE: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit
> >>> +1 to add it after "dropped (Note )", how about as follows > >>> with slight changes? > >>> > >>> dropped (Note that server name of an invalid connection can be NULL > >>> if the server is dropped), and then such . > >> > >> Yes, I like this one. One question is; "should" or "is" is better > >> than "can" in this case because the server name of invalid connection > >> is always NULL when its server is dropped? > > > > I think "dropped (Note that server name of an invalid connection will > > be NULL if the server is dropped), and then such ." > > Sounds good to me. So patch attached. +1 Best regards, houzj
Re: list of extended statistics on psql
Hi Tomas,

As for how to deal with this, I can think of about four ways:

1) simplify the command to only print information from pg_statistic_ext
   (so no information about which stats are built, or sizes)

2) extend pg_stats_ext with the necessary information (e.g. sizes)

3) create a new system view with the necessary information (so that
   pg_stats_ext does not need to be modified)

4) add functions returning the necessary information, possibly only for
   statistics the user can access (similarly to what pg_stats_ext does)

Options 2-4 have the obvious disadvantage that this won't work on older
releases (we can't add views or functions there). So I'm leaning towards
#1, even if that means we have to remove some of the details. We can
consider adding that for new releases, though.

Thanks for the useful advice. I'll go with option 1). The following query
was created using pg_stats_ext instead of pg_statistic_ext and
pg_statistic_ext_data. However, I was confused about how to write the
part of the query that calculates the MCV size, because there are four
columns related to MCV: most_common_vals, most_common_val_nulls,
most_common_freqs, and most_common_base_freqs. Currently, I don't know
how to calculate the size of the MCV list using those four columns.
Thoughts? :-)

Well, my suggestion was to use pg_statistic_ext, because that lists all
statistics, while pg_stats_ext filters statistics depending on access
privileges. I think that's more appropriate for \dX; the contents should
not change depending on the user.

Also, let me clarify - with option (1) we'd not show the sizes at all.
The size of the formatted statistics may be very different from the
on-disk representation, so I see no point in showing it in \dX. We might
show other stats (e.g. number of MCV items, or the fraction of data
represented by the MCV list), but the user can inspect pg_stats_ext if
needed.

What we might do is to show those stats when a superuser is running this
command, but I'm not sure that's a good idea (or how difficult it would
be to implement).

Thanks for clarifying. I see that your suggestion was to use
pg_statistic_ext, not pg_stats_ext, and that we don't need the size of
the stats. In that case, we also can't get the status of the stats on
PG12 or later, because we can't use pg_statistic_ext_data, as you know.
Therefore, it would be better to replace the query with the old query
that I sent five months ago, like this:

# the old query
SELECT
    stxnamespace::pg_catalog.regnamespace AS "Schema",
    stxrelid::pg_catalog.regclass AS "Table",
    stxname AS "Name",
    (SELECT pg_catalog.string_agg(pg_catalog.quote_ident(attname), ', ')
     FROM pg_catalog.unnest(stxkeys) s(attnum)
     JOIN pg_catalog.pg_attribute a
       ON (stxrelid = a.attrelid AND a.attnum = s.attnum
           AND NOT attisdropped)) AS "Columns",
    'd' = any(stxkind) AS "Ndistinct",
    'f' = any(stxkind) AS "Dependencies",
    'm' = any(stxkind) AS "MCV"
FROM pg_catalog.pg_statistic_ext stat
ORDER BY 1, 2;

 Schema | Table  |    Name    | Columns | Ndistinct | Dependencies | MCV
--------+--------+------------+---------+-----------+--------------+-----
 public | hoge1  | hoge1_ext  | a, b    | t         | t            | t
 public | hoge_t | hoge_t_ext | a, b    | t         | t            | t
(2 rows)

The above query is quite simple, but we would do better to use the
following query:

# This query works on PG10 or later
SELECT
    es.stxnamespace::pg_catalog.regnamespace::text AS "Schema",
    es.stxname AS "Name",
    pg_catalog.format('%s FROM %s',
      (SELECT pg_catalog.string_agg(pg_catalog.quote_ident(a.attname), ', ')
       FROM pg_catalog.unnest(es.stxkeys) s(attnum)
       JOIN pg_catalog.pg_attribute a
         ON (es.stxrelid = a.attrelid AND a.attnum = s.attnum
             AND NOT a.attisdropped)),
      es.stxrelid::regclass) AS "Definition",
    CASE WHEN 'd' = any(es.stxkind) THEN 'defined' END AS "Ndistinct",
    CASE WHEN 'f' = any(es.stxkind) THEN 'defined' END AS "Dependencies",
    CASE WHEN 'm' = any(es.stxkind) THEN 'defined' END AS "MCV"
FROM pg_catalog.pg_statistic_ext es
ORDER BY 1, 2;

 Schema |    Name    |    Definition    | Ndistinct | Dependencies |   MCV
--------+------------+------------------+-----------+--------------+---------
 public | hoge1_ext  | a, b FROM hoge1  | defined   | defined      | defined
 public | hoge_t_ext | a, b FROM hoge_t | defined   | defined      | defined
(2 rows)

I'm going to create a WIP patch to use the above query. Any comments
welcome. :-D

Thanks,
Tatsuro Yamada
Re: [PATCH 1/1] Initial mach based shared memory support.
James Hilliard writes: > OSX implements sysv shmem support via a mach wrapper, however the mach > sysv shmem wrapper has some severe restrictions that prevent us from > allocating enough memory segments in some cases. > ... > In order to avoid hitting these limits we can bypass the wrapper layer > and just use mach directly. I wanted to review this, but it's impossible because the kernel calls you're using have you've-got-to-be-kidding documentation like this: https://developer.apple.com/documentation/kernel/1402504-mach_vm_page_info?language=objc Google finds the source code too, but that's equally bereft of useful documentation. So I wonder where you obtained the information necessary to write this patch. I did notice however that mach_shmem.c seems to be 95% copy-and-paste from sysv_shmem.c, including a lot of comments that don't feel particularly true or relevant here. I wonder whether we need a separate file as opposed to a few #ifdef's around the kernel calls. regards, tom lane
Re: Is it worth accepting multiple CRLs?
At Fri, 15 Jan 2021 08:56:27 +0100, Peter Eisentraut wrote in
> On 2020-08-31 11:03, Kyotaro Horiguchi wrote:
> > At Tue, 18 Aug 2020 16:43:47 +0900 (JST), Kyotaro Horiguchi
> > wrote in
> >> Thank you very much. I'll do that after some polishing.
> >>
> >> A near-by discussion about OpenSSL 3.0 conflicts with this but it's
> >> easy to follow.
> > Rebased. Fixed bogus tests and a strange tentative API change of
> > SSLServer.pm. Corrected a (maybe) spelling mistake. I'm going to
> > register this to the coming CF.
>
> Other systems that offer both a CRL file and a CRL directory usually
> specify those using two separate configuration settings. Examples:
>
> https://dev.mysql.com/doc/refman/5.7/en/server-system-variables.html#sysvar_ssl_crlpath
> https://httpd.apache.org/docs/current/mod/mod_ssl.html#sslcarevocationpath
>
> These are then presumably both passed to X509_STORE_load_locations(),
> which supports specifying a file and directory concurrently.
>
> I think that would be a preferable approach. In practical terms, it
> would allow a user to introduce the directory method gradually without
> having to convert the existing CRL file at the same time.

Thank you for the information. The only reason for sharing the same
variable for both the file and the directory was to avoid adding a
variable just for this purpose. I'll post a new version that adds a new
GUC, ssl_crl_path.

By the way, we could do the same thing for the CA file/dir, but I
personally think the benefit of specify-by-directory for CA files is far
smaller than for CRL files, so I'm not going to do this for CA files for
now.

regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Re: popcount
On Mon, Jan 18, 2021 at 10:34:10AM -0500, Tom Lane wrote:
> Peter Eisentraut writes:
> > [ assorted nits ]

fixed, I think.

> At the level of bikeshedding ... I quite dislike using the name "popcount"
> for these functions. I'm aware that some C compilers provide primitives
> of that name, but I wouldn't expect a SQL programmer to know that;
> without that context the name seems pretty random and unintuitive.
> Moreover, it invites confusion with SQL's use of "pop" to abbreviate
> "population" in the statistical aggregates, such as var_pop().
>
> Perhaps something along the lines of count_ones() or count_set_bits()
> would be more apropos.

Done that way.

Best,
David.
--
David Fetter http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

>From 92f3d5bf1718b1b52a5a4d792f5c75e0bcc7fb74 Mon Sep 17 00:00:00 2001
From: David Fetter
Date: Wed, 30 Dec 2020 02:51:46 -0800
Subject: [PATCH v4] popcount
To: hackers
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="2.29.2"

This is a multi-part message in MIME format.
--2.29.2
Content-Type: text/plain; charset=UTF-8; format=fixed
Content-Transfer-Encoding: 8bit

Now it's accessible to SQL for the BIT VARYING and BYTEA types.

diff --git doc/src/sgml/func.sgml doc/src/sgml/func.sgml
index aa99665e2e..7626edeee7 100644
--- doc/src/sgml/func.sgml
+++ doc/src/sgml/func.sgml
@@ -4030,6 +4030,23 @@ SELECT format('Testing %3$s, %2$s, %s', 'one', 'two', 'three');
+
+       count_set_bits
+
+        count_set_bits ( bytes bytea )
+        bigint
+
+        Counts the number of bits set in a binary string.
+
+        count_set_bits('\xdeadbeef'::bytea)
+        24
+
@@ -4830,6 +4847,23 @@ SELECT format('Testing %3$s, %2$s, %s', 'one', 'two', 'three');
+
+       count_set_bits
+
+        count_set_bits ( bits bit )
+        bigint
+
+        Counts the bits set in a bit string.
+
+        count_set_bits(B'101010101010101010')
+        9
+
@@ -4869,6 +4903,7 @@ SELECT format('Testing %3$s, %2$s, %s', 'one', 'two', 'three');
 101010001010101010
+
diff --git src/include/catalog/pg_proc.dat src/include/catalog/pg_proc.dat
index b5f52d4e4a..416ee0c2fb 100644
--- src/include/catalog/pg_proc.dat
+++ src/include/catalog/pg_proc.dat
@@ -1446,6 +1446,9 @@
 { oid => '752', descr => 'substitute portion of string',
   proname => 'overlay', prorettype => 'bytea',
   proargtypes => 'bytea bytea int4', prosrc => 'byteaoverlay_no_len' },
+{ oid => '8436', descr => 'count set bits',
+  proname => 'count_set_bits', prorettype => 'int8', proargtypes => 'bytea',
+  prosrc => 'byteapopcount'},
 { oid => '725',
   proname => 'dist_pl', prorettype => 'float8',
   proargtypes => 'point line',
@@ -3865,6 +3868,9 @@
 { oid => '3033', descr => 'set bit',
   proname => 'set_bit', prorettype => 'bit', proargtypes => 'bit int4 int4',
   prosrc => 'bitsetbit' },
+{ oid => '8435', descr => 'count set bits',
+  proname => 'count_set_bits', prorettype => 'int8', proargtypes => 'bit',
+  prosrc => 'bitpopcount'},

 # for macaddr type support
 { oid => '436', descr => 'I/O',
diff --git src/backend/utils/adt/varbit.c src/backend/utils/adt/varbit.c
index 2235866244..c9c6c73422 100644
--- src/backend/utils/adt/varbit.c
+++ src/backend/utils/adt/varbit.c
@@ -36,6 +36,7 @@
 #include "libpq/pqformat.h"
 #include "nodes/nodeFuncs.h"
 #include "nodes/supportnodes.h"
+#include "port/pg_bitutils.h"
 #include "utils/array.h"
 #include "utils/builtins.h"
 #include "utils/varbit.h"
@@ -1878,3 +1879,30 @@ bitgetbit(PG_FUNCTION_ARGS)
 	else
 		PG_RETURN_INT32(0);
 }
+
+/*
+ * bitpopcount
+ *
+ * Returns the number of bits set in a bit string.
+ *
+ */
+Datum
+bitpopcount(PG_FUNCTION_ARGS)
+{
+	/* There's really no chance of an overflow here because
+	 * to get to INT64_MAX set bits, an object would have to be
+	 * an exbibyte long, exceeding what PostgreSQL can currently
+	 * store by a factor of 2^28
+	 */
+	int64		popcount;
+	VarBit	   *arg1 = PG_GETARG_VARBIT_P(0);
+	bits8	   *p;
+	int			len;
+
+	p = VARBITS(arg1);
+	len = (VARBITLEN(arg1) + BITS_PER_BYTE - 1) / BITS_PER_BYTE;
+
+	popcount = pg_popcount((char *) p, len);
+
+	PG_RETURN_INT64(popcount);
+}
diff --git src/backend/utils/adt/varlena.c src/backend/utils/adt/varlena.c
index 479ed9ae54..3f1179a0e8 100644
--- src/backend/utils/adt/varlena.c
+++ src/backend/utils/adt/varlena.c
@@ -3439,6 +3439,22 @@ bytea_overlay(bytea *t1, bytea *t2, int sp, int sl)
 	return result;
 }
+
+/*
+ * popcount
+ */
+Datum
+byteapopcount(PG_FUNCTION_ARGS)
+{
+	bytea	   *t1 = PG_GETARG_BYTEA_PP(0);
+	int			len;
+	int64		result;
+
+	len = VARSIZE_ANY_EXHDR(t1);
+	result =
Re: Key management with tests
On Mon, Jan 18, 2021 at 04:38:47PM -0500, Robert Haas wrote: > To me, it wouldn't make sense to commit a full README for a TDE > feature that we don't have yet with a key management patch, but the > way that they'll interact with each other has to be clear. The > doc/database-encryption.sgml file that Bruce included in the patch is > a decent start on explaining the design, though I think it needs more > work and more details, perhaps including some of the things Andres > mentioned. Sure. > To be honest, after reading over that SGML documentation a bit, I'm > somewhat skeptical about whether it really makes sense to think about > committing the key management part separately. It seems to have no use > independent of the main feature, and it in fact embeds very specific For usefulness, it does enable passphrase prompting for the TLS private key. > details of how the main feature is expected to work. For example, the > documentation says that key #0 will be used for data files, and key #1 > for WAL. There seems to be no suggestion that the key management > portion of this can be used to manage encryption keys generally for > whatever purposes someone might have; it's all about the needs of a > particular TDE implementation. Typically, we would not commit We originally were going to have SQL-level keys, but many felt they weren't useful. > something like that separately, or only once the main patch was done, > with the two commits occurring in a relatively short time period. > Otherwise, as Bruce already noted, we can end up with something that > is documented and visible to users but doesn't actually work yet. Yep, that is the risk. > Some more specific comments on data-encryption.sgml: > > * The documentation explains that the purpose of having a WAL key > separate from the data file key is so that the data file keys can > "eventually" be rotated. 
It's not clear whether this means that we
> might eventually have that feature or that we might eventually be able
> to rotate, after failing over. If this kind of thing is possible,
> we'll eventually need documentation on how to do it.

I have clarified that by saying "future release".

> * The reasons for using a DEK and a KEK are not explained. I realize
> it's not an uncommon practice and that other systems do it, but I
> think a few sentences of explanation wouldn't be a bad idea. Even if
> we are supposing that hackers who want to have input into this feature
> have to be knowledgeable about cryptography, I don't think we can
> reasonably suppose that for users.

I added a little about that in the docs.

> * "For example" is at one point followed by a period rather than a
> colon or comma.

Fixed.

> * In the "Internals" subsection, the last sentence doesn't seem to be
> grammatical. I wonder if it's missing the word "or".

Fixed.

> * The part about integrity-checking keys on startup isn't clear. It
> makes it sound like we still have a copy of the KEK lying around
> someplace against which we can compare, which I assume is not the case
> since it would be really insecure.

I reworded that entire section. See if it is better now.

> * I think it's going to be pretty important that we can easily switch
> to other cryptographic algorithms as they are discovered, so I don't
> like the fact that this is tied specifically to AES. (In fact,
> kmgr_utils.h makes it sound like we're specifically locked into
> AES256, but that contradicts the documentation, so I guess there's
> some clarification needed here about what exactly KMGR_CLUSTER_KEY_LEN
> is doing.) As far as possible we should try to make this generic, like
> supporting any cipher that SSL has which has property X.
It seems
> relatively inevitable that every currently popular cryptographic
> algorithm will at some point in the future be judged weak and
> worthless, just as has already happened with MD5 and some variants of
> SHA, both of which used to be considered state of the art. It seems
> equally inevitable that new and stronger algorithms will continue to
> be devised, and we'll want to adopt those easily.

That is a nifty idea. Right now I just pass the integer length around, and
store it in pg_control, but if we define macros, we can easily abstract
this and allow for new methods. If others like that, I will start on it
now.

> I'm not sure to what extent this is a serious flaw in the patch and to
> what extent it's just a matter of tweaking the wording of some things,
> but I think this is actually an extremely critical design point where
> we had better be certain we've got it right. Few things would be
> sadder than to get a TDE feature and then have to rip it out again
> because it couldn't be upgraded to work with newer crypto algorithms
> with reasonable effort.

Yep.

> Notes on other parts of the documentation:
>
> * The documentation for initdb -K doesn't list the valid values of the
> parameter, only the default. Probably we should be specifying an
> algorithm

Fixed.
Re: NOT VALID for Unique Indexes
On Fri, 15 Jan 2021 at 00:22, Simon Riggs wrote:
> As you may be aware the NOT VALID qualifier currently only applies to
> CHECK and FK constraints, but not yet to unique indexes. I have had
> customer requests to change that.
>
> It's a reasonably common requirement to be able to change an index
> to/from a unique index, i.e. Unique -> NonUnique or NonUnique to
> Unique. Previously, it was easy enough to do that using a catalog
> update, but with security concerns and the fact that the optimizer
> uses the uniqueness to optimize queries means that there is a gap in
> our support. We obviously need to scan the index to see if it actually
> can be marked as unique.
>
> In terms of locking we need to exclude writes while we add uniqueness,
> so scanning the index to check it is unique would cause problems. So
> we need to do the same thing as we do with other constraint types: add
> the constraint NOT VALID in one transaction and then later validate it
> in a separate transaction (if ever).
>
> I present a WIP patch to show it's a small patch to change Uniqueness
> for an index, with docs and tests.
>
> ALTER INDEX SET [NOT] UNIQUE [NOT VALID]
> ALTER INDEX VALIDATE UNIQUE
>
> It doesn't do the index validation scan (yet), but I wanted to check
> acceptability, syntax and requirements before I do that.
>
> I can also add similar syntax for UNIQUE and PK constraints.
>
> Thoughts please?

Great! I have some questions.

1. In the patch, you add a new attribute named "induniquevalid" in
pg_index; however, there is already an "indisvalid" in pg_index. Can we
use "indisvalid"?

2. Foreign key and CHECK constraints are validated by using

ALTER TABLE .. ADD table_constraint [ NOT VALID ]
ALTER TABLE .. VALIDATE CONSTRAINT constraint_name

Should we implement unique index valid/not valid the same way as foreign
key and CHECK constraints?

3. If we use this syntax to validate/invalidate unique indexes, should we
also support other constraints, such as foreign key and CHECK
constraints?

--
Regards,
Japin Li.
ChengDu WenWu Information Technology Co.,Ltd.
Re: [WIP] UNNEST(REFCURSOR): allowing SELECT to consume data from a REFCURSOR
This is an interesting feature, but it seems that the author has
abandoned development. What happens now? Will this be postponed from
commitfest to commitfest and never be taken over by anyone?

Massimo.

On Fri, Mar 6, 2020 at 22:36 Dent John wrote:
>
> On 22 Feb 2020, at 10:38, Dent John wrote:
> >
> >> On 18 Feb 2020, at 03:03, Thomas Munro wrote:
> >>
> >> From the trivialities department, I see a bunch of warnings about
> >> local declaration placement (we're still using C90 rules for those by
> >> project policy):
> >>
> >> […]
> >
> > […]
>
> My bad. I missed one declaration.
>
> Another patch attached.
>
> d.
>
Re: Key management with tests
I met with Bruce and Stephen this afternoon to discuss the feedback we
received so far (prior to Robert's note, which I haven't fully digested
yet) on this patch. Here is what we plan to do:

1) Bruce is going to gather all the details from the Wiki and build a
README for the TDE Key Management patch. In addition, it will include
details about the implementation, the data structures involved, the locks
that are taken, and the general technical implementation approach.

2) Stephen is going to write up the overall design of TDE. Between these
two patches, we hope to cover what Andres is asking for and what Robert
is asking for in his reply on this thread. Stephen's documentation patch
will also make reference to Neil Chen's TDE prototype for making use of
this Key Management patch to encrypt and decrypt heap pages as well as
index pages.
https://www.postgresql.org/message-id/CAA3qoJ=qto5jcsbjqfdbt9ikux9xkmc5bxcrd7ryse+xsme...@mail.gmail.com

3) Tom will work to find somebody who will sign up as a reviewer upon the
next submission of this patch (somebody who is not an author).

Could we get feedback on whether this feels like enough to get this patch
(which will include just the Key Management portion of TDE) to a state
where it can be reviewed and, assuming the review issues are resolved
with consensus, be committed?

On Mon, Jan 18, 2021 at 2:00 PM Andres Freund wrote:
>
> On 2021-01-18 13:58:20 -0500, Bruce Momjian wrote:
> > On Mon, Jan 18, 2021 at 09:42:54AM -0800, Andres Freund wrote:
> > > Personally, but I admit that there's legitimate reasons to differ on
> > > that note, I don't think it's reasonable for a feature this invasive to
> > > commit preliminary patches without the major subsequent patches being in
> > > a shape that allows reviewing the whole picture.
> >
> > OK, if that is a requirement, I can't help anymore since there are
> > already complaints that the patch is too large to review, even if broken
> > into pieces.
Please let me know what the community decides.

> Those aren't conflicting demands. Having later patches around to
> validate the design of earlier patches doesn't necessitate that the
> later patches need to be reviewed at the same time.

-- Thomas John Kincaid
Re: Odd, intermittent failure in contrib/pageinspect
Alvaro Herrera writes: > On 2021-Jan-18, Tom Lane wrote: >> [ thinks for a bit... ] Does the checkpointer pin pages it's writing >> out? I guess it'd have to ... > It does, per SyncOneBuffer(), called from BufferSync(), called from > CheckPointBuffers(). Right, then we don't need any strange theories about autovacuum, just bad timing luck. whelk does seem pretty slow, so it's not much of a stretch to imagine that it's more susceptible to this corner case than faster machines. So, do we have any other tests that are invoking a manual vacuum and assuming it won't skip any pages? By this theory, they'd all be failures waiting to happen. regards, tom lane
Re: search_plan_tree(): handling of non-leaf CustomScanState nodes causes segfault
David Geier writes: > On 18.01.21 19:46, Tom Lane wrote: >> Hm. I agree that we shouldn't simply assume that ss_currentRelation >> isn't null. However, we cannot make search_plan_tree() descend >> through non-leaf CustomScan nodes, because we don't know what processing >> is involved there. We need to find a scan that is guaranteed to return >> rows that are one-to-one with the cursor output. This is why the function >> doesn't descend through join or aggregation nodes, and I see no argument >> by which we should assume we know more about what a customscan node will >> do than we know about those. > That makes sense. Thanks for the explanation. OK, cool. I was afraid you'd argue that you really needed your CustomScan node to be transparent in such cases. We could imagine inventing an additional custom-scan-provider callback to embed the necessary knowledge, but I'd rather not add the complexity until someone has a use-case. > I updated the patch to match your proposal. WFM, will push in a bit. regards, tom lane
Re: Odd, intermittent failure in contrib/pageinspect
On 2021-Jan-18, Tom Lane wrote: > Right. If that's the explanation, then adding DISABLE_PAGE_SKIPPING > to the test's VACUUM options should fix it. However, to believe that > theory you have to have some reason to think that some other process > might have the page pinned. What would that be? test1 only has one > small tuple in it, so it doesn't seem credible that autovacuum or > autoanalyze would have fired on it. I guess the machine would have to be pretty constrained. (It takes almost seven minutes to go through the pg_upgrade test, so it does seem slow.) > [ thinks for a bit... ] Does the checkpointer pin pages it's writing > out? I guess it'd have to ... It does, per SyncOneBuffer(), called from BufferSync(), called from CheckPointBuffers(). -- Álvaro Herrera  39°49'30"S 73°17'W
Re: Odd, intermittent failure in contrib/pageinspect
Alvaro Herrera writes: > On 2021-Jan-18, Tom Lane wrote: >> Searching the buildfarm logs turned up exactly one previous occurrence, >> also on whelk [2]. So I'm not sure what to make of it. Could the >> immediately preceding VACUUM FREEZE command have silently skipped this >> page for some reason? That'd be a bug I should think. > Hmm, doesn't vacuum skip pages when they are pinned? I don't think > VACUUM FREEZE would be treated especially -- only "aggressive" > wraparound would be an exception, IIRC. Right. If that's the explanation, then adding DISABLE_PAGE_SKIPPING to the test's VACUUM options should fix it. However, to believe that theory you have to have some reason to think that some other process might have the page pinned. What would that be? test1 only has one small tuple in it, so it doesn't seem credible that autovacuum or autoanalyze would have fired on it. [ thinks for a bit... ] Does the checkpointer pin pages it's writing out? I guess it'd have to ... regards, tom lane
Re: Odd, intermittent failure in contrib/pageinspect
On 2021-Jan-18, Tom Lane wrote: > Searching the buildfarm logs turned up exactly one previous occurrence, > also on whelk [2]. So I'm not sure what to make of it. Could the > immediately preceding VACUUM FREEZE command have silently skipped this > page for some reason? That'd be a bug I should think. Hmm, doesn't vacuum skip pages when they are pinned? I don't think VACUUM FREEZE would be treated especially -- only "aggressive" wraparound would be an exception, IIRC. This would reflect in the relfrozenxid for the table after vacuum, but I'm not sure if there's a decent way to make the regression tests reflect that. > Also, not really a bug, but why is this test script running exactly > the same query twice in a row? If that's of value, and not just a > copy-and-paste error, the comments sure don't explain why. But what > it looks like is that these queries were different when first added, > and then 58b4cb30a5b made them the same when it probably should have > removed one. Agreed. -- Álvaro Herrera  39°49'30"S 73°17'W
Re: CheckpointLock needed in CreateCheckPoint()?
Robert Haas writes:
> On Mon, Jan 18, 2021 at 3:25 PM Tom Lane wrote:
>> If memory serves, the reason for the lock was that the CHECKPOINT
>> command used to run the checkpointing code directly in the calling
>> backend, so we needed it to keep more than one process from doing
>> that at once. AFAICS, it's no longer possible for more than one
>> process to try to run that code concurrently, so we shouldn't need
>> the lock anymore.

> Interesting. I think that must have been a *very* long time ago.

Yeah. Digging further, it looks like I oversimplified things above: we once launched special background-worker-like processes for checkpoints, and there could be more than one at a time. That seems to have changed here:

Author: Tom Lane
Branch: master
Release: REL8_0_BR [076a055ac] 2004-05-29 22:48:23 +

    Separate out bgwriter code into a logically separate module, rather than being random pieces of other files. Give bgwriter responsibility for all checkpoint activity (other than a post-recovery checkpoint); so this child process absorbs the functionality of the former transient checkpoint and shutdown subprocesses.

At the time I thought the CheckpointLock had become totally pro forma, but there are later references to having to prevent a checkpoint from running concurrently with a restartpoint, which was originally done within the startup/WAL-replay process. It looks like that changed here:

Author: Heikki Linnakangas
Branch: master
Release: REL8_4_BR [7e48b77b1] 2009-06-25 21:36:00 +

    Fix some serious bugs in archive recovery, now that bgwriter is active during it: When bgwriter is active, the startup process can't perform mdsync() correctly because it won't see the fsync requests accumulated in bgwriter's private pendingOpsTable. Therefore make bgwriter responsible for the end-of-recovery checkpoint as well, when it's active.

(The modern checkpointer process wasn't split out of bgwriter until 806a2aee3 of 2011-11-01.)

regards, tom lane
Odd, intermittent failure in contrib/pageinspect
whelk failed today [1] with this surprising symptom:

--- snip ---
diff -w -U3 C:/buildfarm/buildenv/HEAD/pgsql.build/contrib/pageinspect/expected/page.out C:/buildfarm/buildenv/HEAD/pgsql.build/contrib/pageinspect/results/page.out
--- C:/buildfarm/buildenv/HEAD/pgsql.build/contrib/pageinspect/expected/page.out	2020-03-08 09:00:35.036254700 +0100
+++ C:/buildfarm/buildenv/HEAD/pgsql.build/contrib/pageinspect/results/page.out	2021-01-18 22:10:10.889655500 +0100
@@ -90,8 +90,8 @@
 FROM heap_page_items(get_raw_page('test1', 0)),
      LATERAL heap_tuple_infomask_flags(t_infomask, t_infomask2);
  t_infomask | t_infomask2 |                         raw_flags                          |   combined_flags
-------------+-------------+------------------------------------------------------------+--------------------
-       2816 |           2 | {HEAP_XMIN_COMMITTED,HEAP_XMIN_INVALID,HEAP_XMAX_INVALID} | {HEAP_XMIN_FROZEN}
+------------+-------------+------------------------------------------+----------------
+       2304 |           2 | {HEAP_XMIN_COMMITTED,HEAP_XMAX_INVALID}  | {}
 (1 row)
 -- output the decoded flag HEAP_XMIN_FROZEN instead
@@ -99,8 +99,8 @@
 FROM heap_page_items(get_raw_page('test1', 0)),
      LATERAL heap_tuple_infomask_flags(t_infomask, t_infomask2);
  t_infomask | t_infomask2 |                         raw_flags                          |   combined_flags
-------------+-------------+------------------------------------------------------------+--------------------
-       2816 |           2 | {HEAP_XMIN_COMMITTED,HEAP_XMIN_INVALID,HEAP_XMAX_INVALID} | {HEAP_XMIN_FROZEN}
+------------+-------------+------------------------------------------+----------------
+       2304 |           2 | {HEAP_XMIN_COMMITTED,HEAP_XMAX_INVALID}  | {}
 (1 row)
 -- tests for decoding of combined flags
--- snip ---

Searching the buildfarm logs turned up exactly one previous occurrence, also on whelk [2]. So I'm not sure what to make of it. Could the immediately preceding VACUUM FREEZE command have silently skipped this page for some reason? That'd be a bug I should think.

Also, not really a bug, but why is this test script running exactly the same query twice in a row? If that's of value, and not just a copy-and-paste error, the comments sure don't explain why. But what it looks like is that these queries were different when first added, and then 58b4cb30a5b made them the same when it probably should have removed one.
regards, tom lane [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=whelk&dt=2021-01-18%2020%3A42%3A13 [2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=whelk&dt=2020-04-17%2023%3A42%3A10
Re: CheckpointLock needed in CreateCheckPoint()?
On Mon, Jan 18, 2021 at 3:25 PM Tom Lane wrote: > If memory serves, the reason for the lock was that the CHECKPOINT > command used to run the checkpointing code directly in the calling > backend, so we needed it to keep more than one process from doing > that at once. AFAICS, it's no longer possible for more than one > process to try to run that code concurrently, so we shouldn't need > the lock anymore. Interesting. I think that must have been a *very* long time ago. Perhaps 076a055acf3c55314de267c62b03191586d79cf6 from 2004? -- Robert Haas EDB: http://www.enterprisedb.com
Re: Key management with tests
On Mon, Jan 18, 2021 at 2:00 PM Tom Kincaid wrote: > Some of the missing things you mention above are about the design of > the TDE feature in general. However, this patch is about Key Management, > which is going to be part of the larger TDE feature. So it feels as though > there is the need for a general design document about the overall > vision / approach for TDE and a specific design doc. for Key > Management. Is it appropriate to include both of those in the same > patch? To me, it wouldn't make sense to commit a full README for a TDE feature that we don't have yet with a key management patch, but the way that they'll interact with each other has to be clear. The doc/database-encryption.sgml file that Bruce included in the patch is a decent start on explaining the design, though I think it needs more work and more details, perhaps including some of the things Andres mentioned. To be honest, after reading over that SGML documentation a bit, I'm somewhat skeptical about whether it really makes sense to think about committing the key management part separately. It seems to have no use independent of the main feature, and it in fact embeds very specific details of how the main feature is expected to work. For example, the documentation says that key #0 will be used for data files, and key #1 for WAL. There seems to be no suggestion that the key management portion of this can be used to manage encryption keys generally for whatever purposes someone might have; it's all about the needs of a particular TDE implementation. Typically, we would not commit something like that separately, or only once the main patch was done, with the two commits occurring in a relatively short time period. Otherwise, as Bruce already noted, we can end up with something that is documented and visible to users but doesn't actually work yet.
Some more specific comments on data-encryption.sgml:

* The documentation explains that the purpose of having a WAL key separate from the data file key is so that the data file keys can "eventually" be rotated. It's not clear whether this means that we might eventually have that feature or that we might eventually be able to rotate, after failing over. If this kind of thing is possible, we'll eventually need documentation on how to do it.

* The reasons for using a DEK and a KEK are not explained. I realize it's not an uncommon practice and that other systems do it, but I think a few sentences of explanation wouldn't be a bad idea. Even if we are supposing that hackers who want to have input into this feature have to be knowledgeable about cryptography, I don't think we can reasonably suppose that for users.

* "For example" is at one point followed by a period rather than a colon or comma.

* In the "Internals" subsection, the last sentence doesn't seem to be grammatical. I wonder if it's missing the word "or".

* The part about integrity-checking keys on startup isn't clear. It makes it sound like we still have a copy of the KEK lying around someplace against which we can compare, which I assume is not the case since it would be really insecure.

* I think it's going to be pretty important that we can easily switch to other cryptographic algorithms as they are discovered, so I don't like the fact that this is tied specifically to AES. (In fact, kmgr_utils.h makes it sound like we're specifically locked into AES256, but that contradicts the documentation, so I guess there's some clarification needed here about what exactly KMGR_CLUSTER_KEY_LEN is doing.) As far as possible we should try to make this generic, like supporting any cipher that SSL has which has property X.
It seems relatively inevitable that every currently popular cryptographic algorithm will at some point in the future be judged weak and worthless, just as has already happened with MD5 and some variants of SHA, both of which used to be considered state of the art. It seems equally inevitable that new and stronger algorithms will continue to be devised, and we'll want to adopt those easily. I'm not sure to what extent this is a serious flaw in the patch and to what extent it's just a matter of tweaking the wording of some things, but I think this is actually an extremely critical design point where we had better be certain we've got it right. Few things would be sadder than to get a TDE feature and then have to rip it out again because it couldn't be upgraded to work with newer crypto algorithms with reasonable effort.

Notes on other parts of the documentation:

* The documentation for initdb -K doesn't list the valid values of the parameter, only the default. Probably we should be specifying an algorithm here and not just a bit count. Otherwise, like I say above, what happens when AES gives way to something else? It'd be easy to say -K BFT256 instead of -K AES256, but if AES is assumed and it's no longer what we want then we have problems. This kind of thing probably needs to be cleaned up in a bunch of places.

* I don't see the
Re: Deleting older versions in unique indexes to avoid page splits
On Mon, Jan 18, 2021 at 1:10 PM Victor Yegorov wrote: > I must admit, that it's a bit difficult to understand you here (at least for > me). > > I assume that by "bet" you mean flagged tuple, that we marked as such > (should we implement the suggested case). > As heapam will give up early in case there are no deletable tuples, why do > say, > that "bet" is no longer low? At the end, we still decide between page split > (and > index bloat) vs a beneficial space cleanup. Well, as I said, there are various ways in which our inferences (say the ones in nbtdedup.c) are likely to be wrong. You understand this already. For example, obviously if there are two duplicate index tuples pointing to the same heap page then it's unlikely that both will be deletable, and there is even a fair chance that neither will be (for many reasons). I think that it's important to justify why we use stuff like that to drive our decisions -- the reasoning really matters. It's very much not like the usual optimization problem thing. It's a tricky thing to discuss. I don't assume that I understand all workloads, or how I might introduce regressions. It follows that I should be extremely conservative about imposing new costs here. It's good that we currently know of no workloads that the patch is likely to regress, but absence of evidence isn't evidence of absence. I personally will never vote for a theoretical risk with only a theoretical benefit. And right now that's what the idea of doing bottom-up deletions in more marginal cases (the page flag thing) looks like. -- Peter Geoghegan
Re: search_plan_tree(): handling of non-leaf CustomScanState nodes causes segfault
Hi,

On 18.01.21 19:46, Tom Lane wrote:

> David Geier writes:
>> search_plan_tree() assumes that CustomScanState::ScanState::ss_currentRelation is never NULL. In my understanding that only holds for CustomScanState nodes which are at the bottom of the plan and actually read from a relation. CustomScanState nodes which are not at the bottom don't have ss_currentRelation set. I believe for such nodes, instead search_plan_tree() should recurse into CustomScanState::custom_ps.
>
> Hm. I agree that we shouldn't simply assume that ss_currentRelation isn't null. However, we cannot make search_plan_tree() descend through non-leaf CustomScan nodes, because we don't know what processing is involved there. We need to find a scan that is guaranteed to return rows that are one-to-one with the cursor output. This is why the function doesn't descend through join or aggregation nodes, and I see no argument by which we should assume we know more about what a customscan node will do than we know about those.

That makes sense. Thanks for the explanation.

> So I'm inclined to think a suitable fix is just
>
> -        if (RelationGetRelid(sstate->ss_currentRelation) == table_oid)
> +        if (sstate->ss_currentRelation &&
> +            RelationGetRelid(sstate->ss_currentRelation) == table_oid)
>              result = sstate;
>
> regards, tom lane

I updated the patch to match your proposal.

Best regards,
David
Swarm64

diff --git a/src/backend/executor/execCurrent.c b/src/backend/executor/execCurrent.c
index c7f909241b..fb3cf9537c 100644
--- a/src/backend/executor/execCurrent.c
+++ b/src/backend/executor/execCurrent.c
@@ -330,7 +330,8 @@ search_plan_tree(PlanState *node, Oid table_oid,
         {
             ScanState  *sstate = (ScanState *) node;
 
-            if (RelationGetRelid(sstate->ss_currentRelation) == table_oid)
+            if (sstate->ss_currentRelation &&
+                RelationGetRelid(sstate->ss_currentRelation) == table_oid)
                 result = sstate;
             break;
         }
Re: search_plan_tree(): handling of non-leaf CustomScanState nodes causes segfault
Zhihong Yu writes: > I was thinking that, if sstate->ss_currentRelation is null for the other > cases, that would be a bug. > An assertion can be added for the cases ending with T_TidScanState. Maybe, but there are surely a lot of other places that would crash in such a case --- places far more often traversed than search_plan_tree. I do not see any value in complicating search_plan_tree for that. regards, tom lane
Re: Deleting older versions in unique indexes to avoid page splits
On Mon, Jan 18, 2021 at 21:43, Peter Geoghegan wrote:
> In the end, I couldn't justify imposing this cost on anything other
> than a non-HOT updater, which is what the flag proposal would require
> me to do -- then it's not 100% clear that the relative cost of each
> "bet" placed in heapam.c really is extremely low (we can no longer say
> for sure that we have very little to lose, given the certain
> alternative that is a version churn page split).

I must admit that it's a bit difficult to understand you here (at least for me). I assume that by "bet" you mean a flagged tuple that we marked as such (should we implement the suggested case). As heapam will give up early in case there are no deletable tuples, why do you say that the "bet" is no longer low? At the end, we still decide between a page split (and index bloat) vs a beneficial space cleanup.

> The fact is that "version chains in indexes" still cannot get very
> long. Plus there are other subtle ways in which it's unlikely to be a
> problem (e.g. the LP_DEAD deletion stuff also got much better,
> deduplication can combine with bottom-up deletion so that we'll
> trigger a useful bottom-up deletion pass sooner or later without much
> downside, and possibly other things I haven't even thought of).

I agree with this, except for "version chains" not being long. It all really depends on the data distribution. It's perfectly common to find indexes supporting FK constraints on highly skewed sets, with 80% of the index belonging to a single value (say, a huge business customer vs several thousands of one-time buyers).

-- Victor Yegorov
Re: search_plan_tree(): handling of non-leaf CustomScanState nodes causes segfault
Hi, Tom: I was thinking that, if sstate->ss_currentRelation is null for the other cases, that would be a bug. An assertion can be added for the cases ending with T_TidScanState. Though, the null sstate->ss_currentRelation would surface immediately (apart from assertion). So I omitted the assertion in the diff. Cheers On Mon, Jan 18, 2021 at 12:16 PM Tom Lane wrote: > Zhihong Yu writes: > > It seems sstate->ss_currentRelation being null can only > > occur for T_ForeignScanState and T_CustomScanState. > > What about the following change ? > > Seems like more code for no very good reason. > > regards, tom lane >
Re: Narrow the scope of the variable outputstr in logicalrep_write_tuple
japin writes: > I find that the outputstr variable in logicalrep_write_tuple() is only used in > the `else` branch, so I think we can narrow its scope, just like the variable > outputbytes in the `if` branch (for readability). Agreed, done. For context, I'm not usually in favor of making one-off stylistic improvements: the benefit seldom outweighs the risk of creating merge hazards for future back-patching. But in this case, the code involved is mostly new in v14, so improving it now doesn't cost anything in that way. regards, tom lane
Re: Add primary keys to system catalogs
On Mon, Jan 18, 2021, at 19:33, Tom Lane wrote: > On second thought, a catalog is overkill; it'd only be useful if the data > could change after initdb, which this data surely cannot. The right way > to expose such info to SQL is with a set-returning function reading a > constant table in the C code, a la pg_get_keywords(). That way takes a > whole lot less infrastructure and is just as useful to SQL code. +1
Re: truncating timestamps on arbitrary intervals
On Mon, Nov 23, 2020 at 1:44 PM John Naylor wrote: > > On Thu, Nov 12, 2020 at 9:56 AM Peter Eisentraut < peter.eisentr...@enterprisedb.com> wrote: > > - After reading the discussion a few times, I'm not so sure anymore > > whether making this a cousin of date_trunc is the right way to go. As > > you mentioned, there are some behaviors specific to date_trunc that > > don't really make sense in date_trunc_interval, and maybe we'll have > > more of those. For v10, I simplified the behavior by decoupling it from date_trunc() and putting in some restrictions as discussed earlier. It's much simpler now. It could be argued that it goes too far in that direction, but it's easy to reason about and we can put back some features as we see fit. > > Also, date_trunc_interval isn't exactly a handy name. > > Maybe something to think about. It's obviously fairly straightforward > > to change it. > > Effectively, it puts timestamps into bins, so maybe date_bin() or something like that? For v10 I went with date_bin() so we can see how that looks. > > - There were various issues with the stride interval having months and > > years. I'm not sure we even need that. It could be omitted unless you > > are confident that your implementation is now sufficient. > > Months and years were a bit tricky, so I'd be happy to leave that out if there is not much demand for it. date_trunc() already has quarters, decades, centuries, and millennia. I removed months and years for this version, but that can be reconsidered of course. The logic is really simple now. > > - Also, negative intervals could be prohibited, but I suppose that > > matters less. I didn't go this far, but probably should before long. > > - Then again, I'm thinking that maybe we should make the origin > > mandatory. Otherwise, the default answers when having strides larger > > than a day are entirely arbitrary, e.g., I've tried this and like the resulting simplification.
> > - I'm wondering whether we need the date_trunc_interval(interval, > > timestamptz, timezone) variant. Isn't that the same as > > date_trunc_interval(foo AT ZONE xyz, value)? > > I based this on 600b04d6b5ef6 for date_trunc(), whose message states: > > date_trunc(field, timestamptz, zone_name) > > is the same as > > date_trunc(field, timestamptz at time zone zone_name) at time zone zone_name > > so without the shorthand, you need to specify the timezone twice, once for the calculation, and once for the output. In light of making the origin mandatory, it no longer makes sense to have a time zone parameter, since we can specify the time zone on the origin; and if desired on the output as well. -- John Naylor EDB: http://www.enterprisedb.com v10-datetrunc-interval.patch Description: Binary data
Re: Deleting older versions in unique indexes to avoid page splits
On Mon, Jan 18, 2021 at 6:11 AM Victor Yegorov wrote: > If I understand you correctly, you suggest to track _all_ the hints that came > from the executor for possible non-HOT logical duplicates somewhere on > the page. And when we hit the no-space case, we'll check not only the last > item being hinted, but all items on the page, which makes it more probable > to kick in and do something. > Also, I'm not sure where to put it. We've deprecated the BTP_HAS_GARBAGE > flag, maybe it can be reused for this purpose? There actually was a similar flag (named BTP_UNCHANGED_UPDATE and later BTP_HAS_DUPS) that appeared in earlier versions of the patch (several versions in total, up to and including v6). This was never discussed on the list because I assumed that that wouldn't be helpful (I was already writing too many overlong e-mails). It was unsettled in my mind at the time, so it didn't make sense to start discussing. I changed my mind at some point, and so it never came up until now, when Amit raised the question. Looking back on my personal notes, I am reminded that I debated this exact question with myself at length. The argument for keeping the nbtree flag (i.e. what Amit is arguing for now) involved an isolated example that seems very similar to the much more recent example from Amit (that he arrived at independently). I am at least sympathetic to this view of things, even now. Let me go into why I changed my mind after a couple of months of on and off deliberation. 
In general, the highly speculative nature of the extra work that heapam.c does for index deleters in the bottom-up caller case can only be justified because the cost is paid by non-HOT updaters that are definitely about to split the page just to fit another version, and because we only risk wasting one heap page access at any given point of the entire process (the latter point about risk/cost is not 100% true, because you have additional fixed CPU costs and stuff like that, but it's at least true in spirit). We can justify "gambling" like this only because the game is rigged in our favor to an *absurd* degree: there are many ways to win big (relative to the likely version churn page split baseline case), and we only have to put down relatively small "bets" at any given point -- heapam.c will give up everything when it encounters one whole heap page that lacks a single deletable entry, no matter what the reason is. The algorithm *appears* to behave very intelligently when seen from afar, but is in fact stupid and opportunistic when you look at it up close. It's possible to be so permissive about the upside benefit by also being very conservative about the downside cost. Almost all of our individual inferences can be wrong, and yet we still win in the end. And the effect is robust over time. You could say that it is an organic approach: it adapts to the workload, rather than trying to make the workload adapt to it. This is less radical than you'd think -- in some sense this is how B-Trees have always worked. In the end, I couldn't justify imposing this cost on anything other than a non-HOT updater, which is what the flag proposal would require me to do -- then it's not 100% clear that the relative cost of each "bet" placed in heapam.c really is extremely low (we can no longer say for sure that we have very little to lose, given the certain alternative that is a version churn page split). The fact is that "version chains in indexes" still cannot get very long. 
Plus there are other subtle ways in which it's unlikely to be a problem (e.g. the LP_DEAD deletion stuff also got much better, deduplication can combine with bottom-up deletion so that we'll trigger a useful bottom-up deletion pass sooner or later without much downside, and possibly other things I haven't even thought of). It's possible that I have been too conservative. I admit that my decision on this point is at least a little arbitrary, but I stand by it for now. I cannot feel too bad about theoretically leaving some gains on the table, given that we're only talking about deleting a group of related versions a little later than we would otherwise, and only in some circumstances, and in a way that seems likely to be imperceptible to users. I reserve the right to change my mind about it, but for now it doesn't look like an absurdly good deal, and those are the kind of deals that it makes sense to focus on here. I am very happy about the fact that it is relatively easy to understand why the worst case for bottom-up index deletion cannot be that bad. -- Peter Geoghegan
[PATCH] Full support for index LP_DEAD hint bits on standby
Hello, hackers.

[ABSTRACT]

Execution of queries on hot standby is one of the most popular ways to scale application workload. Most modern Postgres installations have two standby nodes for high-availability support, so utilization of the replicas' CPU seems to be a reasonable idea. At the same time, some queries (index scans) could be much slower on a hot standby than on the primary. It happens because the LP_DEAD index hint bits mechanism is ignored in index scans during recovery. It is done for reasons, of course [1]:

 * We do this because the xmin on the primary node could easily be
 * later than the xmin on the standby node, so that what the primary
 * thinks is killed is supposed to be visible on standby. So for correct
 * MVCC for queries during recovery we must ignore these hints and check
 * all tuples.

Also, according to [2] and cases like [3], it seems to be a good idea to support "ignore_killed_tuples" on standby. The goal of this patch is to provide full support for index hint bits on hot standby. The mechanism should be based on well-tested functionality and not cause a lot of recovery conflicts. This thread is the continuation (and partly a copy-paste) of the old previous one [4].

[PROBLEM]

The standby itself can set and read hint bits during recovery. Such bits are even correct according to standby visibility rules. But the problem here is full-page-write WAL records coming from the primary. Such WAL records could bring invalid (according to the standby's xmin) hint bits. So, if we could be sure the scan doesn't see any invalid hint bit from the primary - the problem is solved. And we would even be able to allow the standby to set its LP_DEAD bits itself. The idea is simple: let WAL log hint bits before FPW somehow. It could cause a lot of additional logs, however... But there are ways to avoid it: 1) Send only one `latestRemovedXid` of all tuples marked as dead during a page scan. 2) Remember the latest sent `latestRemovedXid` in shared memory.
And optimistically skip WAL records with older xid values [5]. Such WAL records would cause a lot of recovery conflicts on standbys. But we could be tricky here - let's use hint bits only if hot_standby_feedback is enabled and effective on standby. If HSF is effective - then conflicts are not possible. If HSF is off - then the standby ignores both hint bits and the additional conflict resolution. The major thing here is that HSF is just an optimization and has nothing to do with MVCC correctness.

[DETAILS]

The patch introduces a new WAL record (named XLOG_INDEX_HINT_BITS_HORIZON) to define a horizon of xmin required for a standby's snapshot to use LP_DEAD bits for an index scan. `table_index_fetch_tuple` now returns a `latest_removed_xid` value in addition to `all_dead`. This value is used to advance `killedLatestRemovedXid` at the time of updating `killedItems` (see `IndexHintBitAdvanceLatestRemovedXid`). The primary sends the value of `killedLatestRemovedXid` in XLOG_INDEX_HINT_BITS_HORIZON before it marks the page dirty after setting LP_DEAD bits on the index page (by calling `MarkBufferDirtyIndexHint`). The new WAL record is always sent before a possible FPW. It is required to send such a record only if its `latestRemovedXid` is newer than the one sent before for the current database (see `LogIndexHintBitsHorizonIfNeeded`).

There is a new flag in the PGPROC structure - `indexIgnoreKilledTuples`. If the flag is set to true - standby queries are going to use LP_DEAD bits in index scans. In such a case the snapshot is required to satisfy the new horizon pushed by XLOG_INDEX_HINT_BITS_HORIZON records. It is safe to set `indexIgnoreKilledTuples` to any value from the perspective of correctness. But a `true` value could cause a recovery conflict. It is just some kind of compromise - use LP_DEAD bits but be aware of XLOG_INDEX_HINT_BITS_HORIZON, or vice versa. What is the way to make the right decision about this compromise?
It is pretty simple: if `hot_standby_feedback` is on and the primary has confirmed that the feedback was received, then set `indexIgnoreKilledTuples` (see `GetSnapshotIndexIgnoreKilledTuples`). While feedback is working as expected, the query will never be canceled by XLOG_INDEX_HINT_BITS_HORIZON.

To support cascading standby setups (with a possible break of the feedback chain in the middle), an additional byte was added to the keep-alive message of the feedback protocol. This byte is used to make sure our xmin is honored by the primary (see `sender_propagates_feedback_to_primary`). Also, the WAL sender now always sends a keep-alive after receiving a feedback message.

So, this way, it is safe to use LP_DEAD bits received from the primary when we want to. And, as a result, it is safe to set LP_DEAD bits on standby. Even if:
* the primary changes vacuum_defer_cleanup_age
* the standby is restarted
* the standby is promoted to primary
* a base backup is taken from the standby
* the standby is serving queries during recovery
- nothing could go wrong here. Because `HeapTupleIsSurelyDead` (and
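As an aside, the "send only if newer" rule attributed above to `LogIndexHintBitsHorizonIfNeeded` can be illustrated with a toy model. This is a simplified sketch, not the patch's actual code: it assumes a single shared horizon per database and uses plain integer comparison (real transaction IDs wrap around and would need TransactionIdFollows):

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;
#define InvalidTransactionId ((TransactionId) 0)

/* Simplified stand-in for the latest horizon already logged for a database. */
static TransactionId latest_logged_horizon = InvalidTransactionId;

/*
 * Return true if a new XLOG_INDEX_HINT_BITS_HORIZON record would be
 * emitted, i.e. only when the new latestRemovedXid advances past the
 * horizon that was already sent.  Otherwise the record is skipped,
 * keeping the extra WAL volume low.
 */
static bool
log_horizon_if_needed(TransactionId latest_removed_xid)
{
    if (latest_removed_xid == InvalidTransactionId)
        return false;           /* nothing was killed, nothing to log */
    if (latest_logged_horizon != InvalidTransactionId &&
        latest_removed_xid <= latest_logged_horizon)
        return false;           /* an equal or newer horizon was already sent */
    latest_logged_horizon = latest_removed_xid; /* real code would WAL-log here */
    return true;
}
```

The point of the sketch is only that repeated kills covered by an already-logged horizon generate no new WAL records.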
Re: PG vs LLVM 12 on seawasp, next round
Hello Alvaro,

> > The "no such file" error seems more like a machine local issue to me. I'll look into it when I have time, which may take some time. Hopefully over the holidays.
>
> This is still happening ... Any chance you can have a look at it?

Indeed. I'll try to look (again) into it soon.

I had a look but did not find anything obvious in the short time frame I had. The last two months were a little overworked for me, so I let quite a few things slip. If you want to disable the animal as Tom suggests, do as you want.

-- Fabien.
Re: {CREATE INDEX, REINDEX} CONCURRENTLY improvements
On 2021-Jan-18, Matthias van de Meent wrote:

> On Fri, 15 Jan 2021 at 15:29, Álvaro Herrera wrote:
> Would this not need to be the following? Right now, it resets a
> potentially older h->catalog_oldest_nonremovable (which is set in the
> PROC_IN_SAFE_IC branch).
>
> > + if (statusFlags & PROC_IN_SAFE_IC)
> > +     h->catalog_oldest_nonremovable =
> > +         TransactionIdOlder(h->catalog_oldest_nonremovable, xmin);
> > + else
> > + {
> > +     h->data_oldest_nonremovable =
> > +         TransactionIdOlder(h->data_oldest_nonremovable, xmin);
> > +     h->catalog_oldest_nonremovable =
> > +         TransactionIdOlder(h->catalog_oldest_nonremovable, xmin);
> > + }

Oops, you're right.

-- Álvaro Herrera Valdivia, Chile
Re: {CREATE INDEX, REINDEX} CONCURRENTLY improvements
On 2021-Jan-18, Matthias van de Meent wrote: > Example: > > 1.) RI starts > 2.) PHASE 2: filling the index: > 2.1.) scanning the heap (live tuple is cached) > < tuple is deleted > < last transaction other than RI commits, only snapshot of RI exists > < vacuum drops the tuple, and cannot remove it from the new index > because this new index is not yet populated. > 2.2.) sorting tuples > 2.3.) index filled with tuples, incl. deleted tuple > 3.) PHASE 3: wait for transactions > 4.) PHASE 4: validate does not remove the tuple from the index, > because it is not built to do so: it will only insert new tuples. > Tuples that are marked for deletion are removed from the index only > through VACUUM (and optimistic ALL_DEAD detection). > > According to my limited knowledge of RI, it requires VACUUM to not run > on the table during the initial index build process (which is > currently guaranteed through the use of a snapshot). VACUUM cannot run concurrently with CIC or RI in a table -- both acquire ShareUpdateExclusiveLock, which conflicts with itself, so this cannot occur. I do wonder if the problem you suggest (or something similar) can occur via HOT pruning, though. -- Álvaro Herrera Valdivia, Chile
Re: CheckpointLock needed in CreateCheckPoint()?
Robert Haas writes: > Here's a patch to remove CheckpointLock completely. ... > So I don't see any problem with just ripping this out entirely, but > I'd like to know if anybody else does. If memory serves, the reason for the lock was that the CHECKPOINT command used to run the checkpointing code directly in the calling backend, so we needed it to keep more than one process from doing that at once. AFAICS, it's no longer possible for more than one process to try to run that code concurrently, so we shouldn't need the lock anymore. regards, tom lane
Re: search_plan_tree(): handling of non-leaf CustomScanState nodes causes segfault
Zhihong Yu writes: > It seems sstate->ss_currentRelation being null can only > occur for T_ForeignScanState and T_CustomScanState. > What about the following change ? Seems like more code for no very good reason. regards, tom lane
Re: {CREATE INDEX, REINDEX} CONCURRENTLY improvements
On Fri, 15 Jan 2021 at 15:29, Álvaro Herrera wrote:
>
> So one last remaining improvement was to have VACUUM ignore processes
> doing CIC and RC to compute the Xid horizon of tuples to remove. I
> think we can do something simple like the attached patch.

Regarding the patch:

> + if (statusFlags & PROC_IN_SAFE_IC)
> +     h->catalog_oldest_nonremovable =
> +         TransactionIdOlder(h->catalog_oldest_nonremovable, xmin);
> + else
> +     h->data_oldest_nonremovable = h->catalog_oldest_nonremovable =
> +         TransactionIdOlder(h->data_oldest_nonremovable, xmin);

Would this not need to be the following? Right now, it resets a
potentially older h->catalog_oldest_nonremovable (which is set in the
PROC_IN_SAFE_IC branch).

> + if (statusFlags & PROC_IN_SAFE_IC)
> +     h->catalog_oldest_nonremovable =
> +         TransactionIdOlder(h->catalog_oldest_nonremovable, xmin);
> + else
> + {
> +     h->data_oldest_nonremovable =
> +         TransactionIdOlder(h->data_oldest_nonremovable, xmin);
> +     h->catalog_oldest_nonremovable =
> +         TransactionIdOlder(h->catalog_oldest_nonremovable, xmin);
> + }

Prior to reading the rest of my response: I do not understand the intricacies of the VACUUM process, nor the systems of db snapshots, so it's reasonably possible I misunderstand this. But would this patch not allow for tuples to be created, deleted and vacuumed away from a table whilst rebuilding an index on that table, resulting in invalid pointers in the index?

Example:
1.) RI starts
2.) PHASE 2: filling the index:
2.1.) scanning the heap (live tuple is cached)
  < tuple is deleted
  < last transaction other than RI commits, only snapshot of RI exists
  < vacuum drops the tuple, and cannot remove it from the new index
    because this new index is not yet populated.
2.2.) sorting tuples
2.3.) index filled with tuples, incl. deleted tuple
3.) PHASE 3: wait for transactions
4.) PHASE 4: validate does not remove the tuple from the index, because it is not built to do so: it will only insert new tuples.
Tuples that are marked for deletion are removed from the index only through VACUUM (and optimistic ALL_DEAD detection). According to my limited knowledge of RI, it requires VACUUM to not run on the table during the initial index build process (which is currently guaranteed through the use of a snapshot). Regards, Matthias van de Meent
Re: [PATCH] Add support for leading/trailing bytea trim()ing
"Joel Jacobson" writes: > On Fri, Dec 4, 2020, at 22:02, Tom Lane wrote: >> (Maybe the existing ltrim/rtrim descrs are also like this, but if so >> I'd change them too.) > They weren't, but I think the description for the bytea functions > can be improved to have a more precise description > if we take inspiration from the the text functions. Yeah, I agree with making the bytea descriptions look like the text ones. Pushed with minor additional doc fixes. regards, tom lane
Re: Deleting older versions in unique indexes to avoid page splits
On Sun, Jan 17, 2021 at 10:44 PM Amit Kapila wrote:
> With the above example, it seems like it would also help when this is not
> true.

I'll respond to your remarks here separately, in a later email.

> I am not sure what I proposed fits here but the bottom-up sounds like
> we are starting from the leaf level and move upwards to root level
> which I think is not true here.

I guess that that's understandable, because it is true that B-Trees are maintained in a bottom-up fashion. However, it's also true that you can have top-down and bottom-up approaches in query optimizers, and many other things (it's even something that is used to describe governance models). The whole point of the term "bottom-up" is to suggest that bottom-up deletion complements top-down cleanup by VACUUM. I think that this design embodies certain principles that can be generalized to other areas, such as heap pruning.

I recall that Robert Haas hated the name deduplication. I'm not about to argue that my choice of "deduplication" was objectively better than whatever he would have preferred. The same thing applies here - I more or less chose a novel name because the concept is itself novel (unlike deduplication). Reasonable people can disagree about what exact name might have been better, but it's not like we're going to arrive at something that everybody can be happy with. And whatever we arrive at probably won't matter that much - the vast majority of users will never need to know what either thing is. They may be important things, but that doesn't mean that many people will ever think about them (in fact I specifically hope that they don't ever need to think about them).

> How is that working? Can heapam.c someway indicate additional
> tuples (extra to what has been marked/sent by the index AM as promising) as
> deletable? I see in heap_index_delete_tuples that we mark the status
> of the passed tuples as deletable (by setting the knowndeletable flag for
> them).
The bottom-up nbtree caller of heap_index_delete_tuples()/table_index_delete_tuples() (not to be confused with the simple/LP_DEAD heap_index_delete_tuples() caller) always provides heapam.c with a complete picture of the index page, in the sense that it exhaustively has a delstate.deltids entry for each and every TID on the page, no matter what. This is the case even though in practice there is usually no possible way to check even 20% of the deltids entries within heapam.c.

In general, the goal during a bottom-up pass is *not* to maximize expected utility (i.e. the number of deleted index tuples/space freed/whatever), exactly. The goal is to lower the variance across related calls, so that we'll typically manage to free a fair number of index tuples when we need to. In general it is much better for heapam.c to make its decisions based on 2 or 3 good reasons rather than just 1 excellent reason. And so heapam.c applies a power-of-two bucketing scheme, never truly giving too much weight to what nbtree tells it about duplicates. See the comments above bottomup_nblocksfavorable(), and the bottomup_sort_and_shrink() comments (both are in heapam.c).

-- Peter Geoghegan
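The power-of-two bucketing idea mentioned above can be illustrated with a toy model. This is a hypothetical sketch, not the actual heapam.c code: the "weight" of a heap block during a bottom-up pass is taken as its number of promising TIDs rounded down to a power of two, so a block with 31 duplicates lands in the same bucket as one with 16, and nbtree's raw duplicate counts never dominate the decision:

```c
/*
 * Toy model of power-of-two bucketing: round a promising-TID count
 * down to the nearest power of two, so that blocks with similar
 * counts compare as equal and no single block gets outsized weight.
 */
static unsigned
bucket_weight(unsigned ntids)
{
    unsigned w = 1;

    if (ntids == 0)
        return 0;
    while (w * 2 <= ntids)
        w *= 2;                 /* largest power of two <= ntids */
    return w;
}
```

Under this model, sorting blocks by bucket_weight() rather than by raw count is what lowers the variance: several "good" blocks tie with one "excellent" block instead of losing to it.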
Re: CheckpointLock needed in CreateCheckPoint()?
On Thu, Jan 14, 2021 at 10:21 AM Robert Haas wrote:
> Yeah, I think this lock is useless. In fact, I think it's harmful,
> because it makes large sections of the checkpointer code, basically
> all of it really, run with interrupts held off for no reason.
>
> It's not impossible that removing the lock could reveal bugs
> elsewhere: suppose we have segments of code that actually aren't safe
> to interrupt, but because of this LWLock, it never happens. But, if
> that happens to be true, I think we should just view those cases as
> bugs to be fixed.
>
> One thing that blunts the impact of this quite a bit is that the
> checkpointer doesn't rely much on CHECK_FOR_INTERRUPTS() in the
> first place. It has its own machinery: HandleCheckpointerInterrupts().
> Possibly that's something we should be looking to refactor somehow as
> well.

Here's a patch to remove CheckpointLock completely.

For ProcessInterrupts() to do anything, one of the following things would have to be true:

1. ProcDiePending = true. Normally this is set by die(), but the checkpointer does not use die(). Perhaps someday it should, or maybe not, but that's an issue for another day.
2. ClientConnectionLost = true. This gets set in pqcomm.c's internal_flush(), but the checkpointer has no client connection.
3. DoingCommandRead = true. Can't happen; only set in PostgresMain().
4. QueryCancelPending = true. Can be set by recovery conflicts, but those shouldn't happen in the checkpointer. Normally set by StatementCancelHandler, which the checkpointer doesn't use.
5. IdleInTransactionSessionTimeoutPending = true. Set up by InitPostgres(), which is not used by auxiliary processes.
6. ProcSignalBarrierPending = true. This is the case we actually care about allowing for the ASRO patch, so if this occurs, it's good.
7. ParallelMessagePending = true. The checkpointer definitely never starts parallel workers.

So I don't see any problem with just ripping this out entirely, but I'd like to know if anybody else does.
-- Robert Haas EDB: http://www.enterprisedb.com v1-0001-Remove-CheckpointLock.patch Description: Binary data
Re: search_plan_tree(): handling of non-leaf CustomScanState nodes causes segfault
Hi,

It seems sstate->ss_currentRelation being null can only occur for T_ForeignScanState and T_CustomScanState. What about the following change?

Cheers

diff --git a/src/backend/executor/execCurrent.c b/src/backend/executor/execCurrent.c
index 0852bb9cec..56e31951d1 100644
--- a/src/backend/executor/execCurrent.c
+++ b/src/backend/executor/execCurrent.c
@@ -325,12 +325,21 @@ search_plan_tree(PlanState *node, Oid table_oid,
 		case T_IndexOnlyScanState:
 		case T_BitmapHeapScanState:
 		case T_TidScanState:
+			{
+				ScanState  *sstate = (ScanState *) node;
+
+				if (RelationGetRelid(sstate->ss_currentRelation) == table_oid)
+					result = sstate;
+				break;
+			}
+
 		case T_ForeignScanState:
 		case T_CustomScanState:
 			{
 				ScanState  *sstate = (ScanState *) node;
 
-				if (RelationGetRelid(sstate->ss_currentRelation) == table_oid)
+				if (sstate->ss_currentRelation &&
+					RelationGetRelid(sstate->ss_currentRelation) == table_oid)
 					result = sstate;
 				break;
 			}

On Mon, Jan 18, 2021 at 10:46 AM Tom Lane wrote:
> David Geier writes:
> > search_plan_tree() assumes that
> > CustomScanState::ScanState::ss_currentRelation is never NULL. In my
> > understanding that only holds for CustomScanState nodes which are at the
> > bottom of the plan and actually read from a relation. CustomScanState
> > nodes which are not at the bottom don't have ss_currentRelation set. I
> > believe for such nodes, instead search_plan_tree() should recurse into
> > CustomScanState::custom_ps.
>
> Hm. I agree that we shouldn't simply assume that ss_currentRelation
> isn't null. However, we cannot make search_plan_tree() descend
> through non-leaf CustomScan nodes, because we don't know what processing
> is involved there. We need to find a scan that is guaranteed to return
> rows that are one-to-one with the cursor output. This is why the function
> doesn't descend through join or aggregation nodes, and I see no argument
> by which we should assume we know more about what a customscan node will
> do than we know about those.
> > So I'm inclined to think a suitable fix is just > > - if (RelationGetRelid(sstate->ss_currentRelation) == > table_oid) > + if (sstate->ss_currentRelation && > + RelationGetRelid(sstate->ss_currentRelation) == > table_oid) > result = sstate; > > regards, tom lane > > >
Re: Key management with tests
On 2021-01-18 13:58:20 -0500, Bruce Momjian wrote:
> On Mon, Jan 18, 2021 at 09:42:54AM -0800, Andres Freund wrote:
> > Personally, but I admit that there's legitimate reasons to differ on
> > that note, I don't think it's reasonable for a feature this invasive to
> > commit preliminary patches without the major subsequent patches being in
> > a shape that allows reviewing the whole picture.
>
> OK, if that is a requirement, I can't help anymore since there are
> already complaints that the patch is too large to review, even if broken
> into pieces. Please let me know what the community decides.

Those aren't conflicting demands. Having later patches around to validate the design of earlier patches doesn't necessitate that the later patches be reviewed at the same time.
Re: Key management with tests
> > I have to admit I was kind of baffled that the wiki page wasn't
> > sufficient, because it is one of the longest Postgres feature
> > explanations I have seen, but I now think the missing part is tying
> > the wiki contents to the code implementation. If that is it, please
> > confirm. If it is something else, also explain.
>
> I don't think the wiki right now covers what's needed. The "Overview",
> "Threat model" and "Scope of TDE" are a start, but beyond that it's
> missing a bunch of things. And it's not in the source tree (we'll soon
> have multiple versions of postgres with increasing levels of TDE
> features; the wiki doesn't help with that)

Thanks, the versioning issue makes sense for the design document needing to be part of the source tree.

As I was reading the README for the patch Amit referenced, and as I am going through this patch, I feel the desire to incorporate diagrams. Are design diagrams ever incorporated in the source tree as part of the design description of a feature? If not, any concerns about doing that? I think that is likely where I can contribute the most.

> Missing:
> - talks about cluster wide encryption being simpler, without mentioning
>   what it's being compared to, and what makes it simpler
> - no differentiation from file system / block level encryption
> - there's no explanation of which/why specific crypto primitives were
>   chosen, what the tradeoffs are
> - no explanation which keys exist, stored where
> - the key management patch introduces new files, not documented
> - there's new types of lock files, possibility of interrupted
>   operations, ... - no documentation of what that means
> - there's no documentation what "key wrapping" actually precisely is,
>   what the danger of the two-tier model is, ...
> - are there dangers in not encrypting zero pages etc?
> - ...

Some of the missing things you mention above are about the design of the TDE feature in general.
However, this patch is about Key Management, which is going to be part of the larger TDE feature. So it feels as though there is a need for a general design document about the overall vision / approach for TDE and a specific design document for Key Management. Is it appropriate to include both of those in the same patch? Something along the lines of: here is the overall design of TDE, and here is how the Key Management portion is designed and implemented. I guess in that case, follow-on patches for TDE could refer to the overall design described in this patch.

> Personally, but I admit that there's legitimate reasons to differ on
> that note, I don't think it's reasonable for a feature this invasive to
> commit preliminary patches without the major subsequent patches being in
> a shape that allows reviewing the whole picture.
>
> Greetings,
>
> Andres Freund

-- Thomas John Kincaid
Re: Key management with tests
On Mon, Jan 18, 2021 at 09:42:54AM -0800, Andres Freund wrote: > Personally, but I admit that there's legitimate reasons to differ on > that note, I don't think it's reasonable for a feature this invasive to > commit preliminary patches without the major subsequent patches being in > a shape that allows reviewing the whole picture. OK, if that is a requirement, I can't help anymore since there are already complaints that the patch is too large to review, even if broken into pieces. Please let me know what the community decides. -- Bruce Momjian https://momjian.us EnterpriseDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
Re: WIP: System Versioned Temporal Table
On Mon, Jan 18, 2021 at 1:43 AM Vik Fearing wrote:
>
> This is not good, and I see that DROP SYSTEM VERSIONING also removes
> these columns which is even worse. Please read the standard that you
> are trying to implement!

The standard states that the function of ALTER TABLE ADD SYSTEM VERSIONING is to "Alter a regular persistent base table to a system-versioned table", and a system-versioned table is described in the standard by two generated stored constraint columns and implemented as such.

> I will do a more thorough review of the functionalities in this patch
> (not necessarily the code) this week.

Please do

regards
Surafel
Re: search_plan_tree(): handling of non-leaf CustomScanState nodes causes segfault
David Geier writes: > search_plan_tree() assumes that > CustomScanState::ScanState::ss_currentRelation is never NULL. In my > understanding that only holds for CustomScanState nodes which are at the > bottom of the plan and actually read from a relation. CustomScanState > nodes which are not at the bottom don't have ss_currentRelation set. I > believe for such nodes, instead search_plan_tree() should recurse into > CustomScanState::custom_ps. Hm. I agree that we shouldn't simply assume that ss_currentRelation isn't null. However, we cannot make search_plan_tree() descend through non-leaf CustomScan nodes, because we don't know what processing is involved there. We need to find a scan that is guaranteed to return rows that are one-to-one with the cursor output. This is why the function doesn't descend through join or aggregation nodes, and I see no argument by which we should assume we know more about what a customscan node will do than we know about those. So I'm inclined to think a suitable fix is just - if (RelationGetRelid(sstate->ss_currentRelation) == table_oid) + if (sstate->ss_currentRelation && + RelationGetRelid(sstate->ss_currentRelation) == table_oid) result = sstate; regards, tom lane