Re: Different compression methods for FPI
On 15.06.21 07:37, Michael Paquier wrote:
>> Actually, I was just thinking that default yes/no/on/off stuff maybe
>> should be defined to mean "lz4" rather than meaning pglz for
>> "backwards compatibility".
>>
>> +1
>
> I am not sure that we have any reasons to be that aggressive about
> this one either, and this would mean that wal_compression=on implies a
> different method depending on the build options.  I would just stick
> with the past, careful practice that we have to use a default
> backward-compatible value as default, while being able to use a new
> option.

If we think this new thing is strictly better than the old thing, then
why not make it the default.  What would be the gain from sticking to
the old default?

The point that the default would depend on build options is a valid
one.  I'm not sure whether it's important enough by itself.
Re: Different compression methods for FPI
On Tue, Jun 15, 2021 at 08:08:54AM +0300, Heikki Linnakangas wrote:
> On 15/06/2021 06:42, Justin Pryzby wrote:
>> Why ?  This is WAL, not table data.  WAL depends on the major version,
>> so I think wal_compression could provide a different set of
>> compression methods at every major release?

That may finish by being annoying to the user, but perhaps you are
right that we could have more flexibility here.  That does not change
the fact that we'd better choose something wisely, able to stick around
for a couple of years at least, rather than revisiting this choice
every year.

>> Actually, I was just thinking that default yes/no/on/off stuff maybe
>> should be defined to mean "lz4" rather than meaning pglz for
>> "backwards compatibility".
>
> +1

I am not sure that we have any reasons to be that aggressive about this
one either, and this would mean that wal_compression=on implies a
different method depending on the build options.  I would just stick
with the past, careful practice that we have to use a default
backward-compatible value as default, while being able to use a new
option.
--
Michael
Re: Use singular number when appropriate
On Tue, Jun 15, 2021 at 04:59:24AM +, David Fetter wrote:
> Hi,
>
> I thought about using the dual, but wasn't sure how many languages
> support it.

I don't think that you can assume that appending something will work in
all languages.  Note that it also doesn't always work in English (e.g.
this/these), as seen in this inconsistent change:

-_(" %d of %d tests failed, %d of these failures ignored. "),
+_(" %d of %d test%s failed, %d of these failures ignored. "),
Re: Error on pgbench logs
On Sun, Jun 13, 2021 at 03:07:51AM +0900, Yugo NAGATA wrote:
> It will not work if the transaction is skipped, in which case latency
> is 0.0.  It would work if we check also "skipped" as below.
>
> +  if (!logged && !skipped && latency == 0.0)
>
> However, it still might not work if the latency is so small that we
> could observe latency == 0.0.  I observed this when I used a script
> that contained only a meta command.  This is not usual and would be a
> corner case, though.

Hmm.  I am not sure to completely follow the idea here.  It would be
good to make this code less confusing than it is now.

> /* log aggregated but not yet reported transactions */
> doLog(thread, state, , false, 0, 0);
> + logAgg(thread->logfile, );
>
> I think we don't have to call doLog() before logAgg().  If we call
> doLog(), we will count an extra transaction that is not actually
> processed because accumStats() is called in this.

Yes, calling both is weird.  Is using logAgg() directly in this context
actually right when it comes to sample_rate?  We may not log anything
on HEAD if sample_rate is enabled, but we would finish by logging
something all the time with this patch.  If I am following this code
correctly, we don't care about accumStats() in the code path of a
thread we are done with, right?
--
Michael
Re: Use singular number when appropriate
On 15/06/2021 07:59, David Fetter wrote:
> Hi,
>
> I thought about using the dual, but wasn't sure how many languages
> support it.
>
> 	if (fail_count == 0 && fail_ignore_count == 0)
> 		snprintf(buf, sizeof(buf),
> 				 _(" %s %d test%s passed. "),
> 				 success_count == 1 ? "The" : "All",
> 				 success_count,
> 				 success_count == 1 ? "" : "s");

Constructing sentences like that is bad practice for translations.  See
https://www.gnu.org/software/gettext/manual/html_node/Plural-forms.html.

- Heikki

Re: Different compression methods for FPI
On 15/06/2021 06:42, Justin Pryzby wrote:
> On Tue, Jun 15, 2021 at 11:39:24AM +0900, Michael Paquier wrote:
>> On Mon, Jun 14, 2021 at 08:42:08PM -0500, Justin Pryzby wrote:
>>> It's USERSET following your own suggestion (which is a good suggestion):
>>>
>>>> +	{"wal_compression_method", PGC_SIGHUP, WAL_SETTINGS,
>>>> +		gettext_noop("Set the method used to compress full page images in the WAL."),
>>>> +		NULL
>>>> +	},
>>>> +	&wal_compression_method,
>>>> +	WAL_COMPRESSION_PGLZ, wal_compression_options,
>>>> +	NULL, NULL, NULL
>>>>
>>>> Any reason to not make that user-settable?  If you merge that with
>>>> wal_compression, that's not an issue.
>>
>> Hmm, yeah.  This can be read as using PGC_USERSET.  With the second
>> part of my sentence, I think that I imply to use PGC_SUSET and be
>> consistent with wal_compression, but I don't recall my mood from one
>> month ago :)  Sorry for the confusion.
>
> Hold on - we're all confused (and I'm to blame).  The patch is
> changing the existing wal_compression GUC, rather than adding
> wal_compression_method.  It's still SUSET, but in earlier messages, I
> called it USERSET, twice.
>
> See prior discussion on the security aspect:
> https://www.postgresql.org/message-id/55269915.1000309%40iki.fi.
> Adding different compression algorithms doesn't change anything from a
> security point of view AFAICS.
>
> The goal of the patch is to give options, and the overhead of adding
> both zlib and lz4 is low.  zlib gives good compression at some CPU
> cost and may be preferable for (some) DWs, and lz4 is almost certainly
> better (than pglz) for OLTPs.
>
>> Anything will be better than pglz.  I am rather confident in that.
>> What I am wondering is if we need to eat more bits than necessary for
>> the WAL record format, because we will end up supporting it until the
>> end of times.
>
> Why ?  This is WAL, not table data.  WAL depends on the major version,
> so I think wal_compression could provide a different set of
> compression methods at every major release?
>
> Actually, I was just thinking that default yes/no/on/off stuff maybe
> should be defined to mean "lz4" rather than meaning pglz for
> "backwards compatibility".

+1

- Heikki
Use singular number when appropriate
Hi,

I thought about using the dual, but wasn't sure how many languages
support it.

Best,
David.
--
David Fetter http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

From 1fd595576c5972d8604adf77d1959008daebacb0 Mon Sep 17 00:00:00 2001
From: David Fetter
Date: Mon, 14 Jun 2021 21:51:20 -0700
Subject: [PATCH v1] Singular number when appropriate

---
 src/test/regress/pg_regress.c | 20 +---
 1 file changed, 13 insertions(+), 7 deletions(-)

diff --git src/test/regress/pg_regress.c src/test/regress/pg_regress.c
index 05296f7ee1..d6f9e829e8 100644
--- src/test/regress/pg_regress.c
+++ src/test/regress/pg_regress.c
@@ -2641,25 +2641,31 @@ regression_main(int argc, char *argv[],
 	 */
 	if (fail_count == 0 && fail_ignore_count == 0)
 		snprintf(buf, sizeof(buf),
-				 _(" All %d tests passed. "),
-				 success_count);
+				 _(" %s %d test%s passed. "),
+				 success_count == 1 ? "The" : "All",
+				 success_count,
+				 success_count == 1 ? "" : "s");
 	else if (fail_count == 0)	/* fail_count=0, fail_ignore_count>0 */
 		snprintf(buf, sizeof(buf),
-				 _(" %d of %d tests passed, %d failed test(s) ignored. "),
+				 _(" %d of %d test%s passed, %d failed test%s ignored. "),
 				 success_count,
 				 success_count + fail_ignore_count,
-				 fail_ignore_count);
+				 success_count == 1 ? "" : "s",
+				 fail_ignore_count,
+				 fail_ignore_count == 1 ? "" : "s");
 	else if (fail_ignore_count == 0)	/* fail_count>0 && fail_ignore_count=0 */
 		snprintf(buf, sizeof(buf),
-				 _(" %d of %d tests failed. "),
+				 _(" %d of %d test%s failed. "),
 				 fail_count,
-				 success_count + fail_count);
+				 success_count + fail_count,
+				 fail_count == 1 ? "" : "s");
 	else	/* fail_count>0 && fail_ignore_count>0 */
 		snprintf(buf, sizeof(buf),
-				 _(" %d of %d tests failed, %d of these failures ignored. "),
+				 _(" %d of %d test%s failed, %d of these failures ignored. "),
 				 fail_count + fail_ignore_count,
 				 success_count + fail_count + fail_ignore_count,
+				 fail_count + fail_ignore_count == 1 ? "" : "s",
 				 fail_ignore_count);

 	putchar('\n');
--
2.31.1
Re: locking [user] catalog tables vs 2pc vs logical rep
On Mon, Jun 14, 2021 at 5:33 PM osumi.takami...@fujitsu.com wrote:
>
> On Friday, June 11, 2021 2:13 PM vignesh C wrote:
> > Thanks for the updated patch:
> > Few comments:
> > 1) We have used Reordering and Clustering for the same command, we
> > could rephrase similarly in both places.
> >
> > +
> > +Reordering pg_class by CLUSTER
> > +command in a transaction.
> > +
> >
> > +
> > +Clustering pg_trigger and decoding PREPARE
> > +TRANSACTION, if any published table have a trigger and any
> > +operations that will be decoded are conducted.
> > +
> > +
> >
> > 2) Here user_catalog_table should be user catalog table
> > +
> > +Executing TRUNCATE on user_catalog_table
> > in a transaction.
> > +
>
> Thanks for your review.
>
> Attached the patch-set that addressed those two comments.
> I also fixed the commit message a bit in the 2PC specific patch to
> HEAD.  No other changes.
>
> Please check.

Thanks for the updated patches, the patch applies cleanly in all
branches.  Please add a commitfest entry for this, so that we don't
miss it.

Regards,
Vignesh
Re: RFC: Logging plan of the running query
On Mon, Jun 14, 2021 at 5:48 PM torikoshia wrote:
> > 1) We could just say "Requests to log query plan of the presently
> > running query of a given backend along with an untruncated query
> > string in the server logs."  Instead of
> > +They will be logged at LOG message level
> > and
> > +will appear in the server log based on the log
> > +configuration set (See <xref linkend="runtime-config-logging"/>
>
> Actually this explanation is almost the same as that of
> pg_log_backend_memory_contexts().
> Do you think we should change both of them?
> I think it may be too detailed but accurate.

I withdraw my comment.  We can keep the explanation similar to
pg_log_backend_memory_contexts as it was agreed upon and committed
text.  If the wordings are similar, then it will be easier for users to
understand the documentation.

> > 5) Instead of just showing the true return value of the function
> > pg_log_current_plan in the sql test, which just shows that the
> > signal is sent, but it doesn't mean that the backend has processed
> > that signal and logged the plan.  I think we can add the test using
> > the TAP framework.
>
> I once made a tap test for pg_log_backend_memory_contexts(), but it
> seemed an overkill and we agreed that it was appropriate just ensuring
> the function working, as in the discussion below.
>
> https://www.postgresql.org/message-id/bbecd722d4f8e261b350186ac4bf68a7%40oss.nttdata.com

Okay.  I withdraw my comment.

> > 6) Do we unnecessarily need to signal the processes such as
> > autovacuum launcher/workers, logical replication launcher/workers,
> > wal senders, wal receivers and so on, only to emit a "PID %d is not
> > executing queries now" message?  Moreover, those processes will be
> > waiting in loops for timeouts to occur, then as soon as they wake up
> > do they need to process this extra uninformative signal?
> > We could choose to not signal those processes at all, which might or
> > might not be possible.  Otherwise, we could just emit messages like
> > " process cannot run a query" in ProcessInterrupts.
>
> Agreed.
>
> However it needs to distinguish backends which can execute queries
> from other processes such as the autovacuum launcher; I haven't come
> up with easy ways to do so.  I'm going to think about it.

I'm not sure if there is any information in the shared memory
accessible to all the backends/sessions that can say a PID is an
autovacuum launcher/worker, logical replication launcher/worker or any
other background or parallel worker.  If we were to invent a new
mechanism just for addressing the above comment, I would rather choose
to not do that, as it seems like an overkill.  We can leave it up to
the user whether or not to unnecessarily signal those processes, which
are bound to print the "PID XXX is not executing queries now" message.

> > 11) What happens if pg_log_current_plan is called for a parallel
> > worker?
>
> AFAIU a parallel worker doesn't have an ActivePortal, so it would
> always emit the message 'PID %d is not executing queries now'.
> As in 6), it would be better to distinguish the parallel worker from a
> normal backend.

As I said above, I think it will be a bit tough to do.  If done, it
seems like an overkill.

With Regards,
Bharath Rupireddy.
Re: Different compression methods for FPI
On Tue, Jun 15, 2021 at 11:39:24AM +0900, Michael Paquier wrote:
> On Mon, Jun 14, 2021 at 08:42:08PM -0500, Justin Pryzby wrote:
>> It's USERSET following your own suggestion (which is a good suggestion):
>>
>>> +	{"wal_compression_method", PGC_SIGHUP, WAL_SETTINGS,
>>> +		gettext_noop("Set the method used to compress full page images in the WAL."),
>>> +		NULL
>>> +	},
>>> +	&wal_compression_method,
>>> +	WAL_COMPRESSION_PGLZ, wal_compression_options,
>>> +	NULL, NULL, NULL
>>>
>>> Any reason to not make that user-settable?  If you merge that with
>>> wal_compression, that's not an issue.
>
> Hmm, yeah.  This can be read as using PGC_USERSET.  With the second
> part of my sentence, I think that I imply to use PGC_SUSET and be
> consistent with wal_compression, but I don't recall my mood from one
> month ago :)  Sorry for the confusion.

Hold on - we're all confused (and I'm to blame).  The patch is changing
the existing wal_compression GUC, rather than adding
wal_compression_method.  It's still SUSET, but in earlier messages, I
called it USERSET, twice.

> You cannot do cross-checks for GUCs in their assign hooks or even rely
> on the order of those parameters, but you can do that in some backend
> code paths.  A recent discussion on the matter is for example what led
> to 79dfa8a for the GUCs controlling the min/max SSL protocols allowed.

Thank you - this is what I was remembering.

>> The goal of the patch is to give options, and the overhead of adding
>> both zlib and lz4 is low.  zlib gives good compression at some CPU
>> cost and may be preferable for (some) DWs, and lz4 is almost
>> certainly better (than pglz) for OLTPs.
>
> Anything will be better than pglz.  I am rather confident in that.
>
> What I am wondering is if we need to eat more bits than necessary for
> the WAL record format, because we will end up supporting it until the
> end of times.

Why ?  This is WAL, not table data.  WAL depends on the major version,
so I think wal_compression could provide a different set of compression
methods at every major release?

Actually, I was just thinking that default yes/no/on/off stuff maybe
should be defined to mean "lz4" rather than meaning pglz for "backwards
compatibility".

> hear here, there are many cases that we may care about depending on
> how much CPU one is ready to pay in order to get more compression,
> knowing that there are no magic solutions for something that's cheap
> in CPU with a very good compression ratio or we could just go with
> that.  So it seems to me that there is still an argument for adding
> only one new compression method with a good range of levels, able to
> support the range of cases we'd care about:
> - High compression ratio but high CPU cost.
> - Low compression ratio but low CPU cost.

I think zlib is too expensive and lz4 doesn't get enough compression,
so neither supports both cases.  In a sample of 1, zlib-1 is ~35%
slower than lz4 and writes half as much.  I think zstd could support
both cases; however, I still see it as this patch's job to provide
options, rather than to definitively conclude which compression
algorithm is going to work best for everyone's data and application.

--
Justin
Re: PG 14 release notes, first draft
On Tue, Jun 15, 2021 at 11:49:21AM +0900, Masahiko Sawada wrote:
> On Tue, Jun 15, 2021 at 10:36 AM Bruce Momjian wrote:
>> OK, but I need more information on how users will see a difference
>> based on this commit:

+1.  That would be good to have in the release notes.

> I think that since with this commit the server on Windows can handle a
> file over 4GB, COPY FROM loading data from an over-4GB file and
> pg_dump dumping a large table work now.

Segment files or WAL files larger than 4GB also gain from that.
Anything for which we may finish by doing a stat() benefits from this
change if running on Windows.  For pg_dump, a workaround in PG <= 13
was to use --no-sync, as the stat() failure came from files with a size
larger than 4GB.  That's rather sad, as that means sacrificing
durability for more usability :(
--
Michael
Improving the isolationtester: fewer failures, less delay
This is a followup to the conversation at [1], in which we speculated
about constraining the isolationtester's behavior by annotating the
specfiles, in order to eliminate common buildfarm failures such as [2]
and reduce the need to use long delays to stabilize the test results.

I've spent a couple days hacking on this idea, and I think it has
worked out really well.  On my machine, the time needed for "make
installcheck" in src/test/isolation drops from ~93 seconds to ~26
seconds, as a result of removing all the multiple-second delays we used
before.  Also, while I'm not fool enough to claim that this will reduce
the rate of bogus failures to zero, I do think it addresses all the
repeating failures we've seen lately.

In the credit-where-credit-is-due department, this owes some
inspiration to the patch Asim Praveen offered at [3], though this takes
the idea a good bit further.

This is still WIP to some extent, as I've not spent much time looking
at specfiles other than the ones with big delays; there may be
additional improvements possible in some places.  Also, I've not
worried about whether the tests pass in serializable mode, since we
have problems there already [4].  But this seemed like a good point at
which to solicit feedback and see what the cfbot thinks of it.

regards, tom lane

[1] https://www.postgresql.org/message-id/flat/2527507.1598237598%40sss.pgh.pa.us
[2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=anole=2021-06-13%2016%3A31%3A57
[3] https://www.postgresql.org/message-id/F8DC434A-9141-451C-857F-148CCA1D42AD%40vmware.com
[4] https://www.postgresql.org/message-id/324309.1623722988%40sss.pgh.pa.us

diff --git a/src/test/isolation/README b/src/test/isolation/README
index 6ae7152325..35b40f8345 100644
--- a/src/test/isolation/README
+++ b/src/test/isolation/README
@@ -57,11 +57,16 @@ Test specification
 ==

 Each isolation test is defined by a specification file, stored in the specs
-subdirectory.  A test specification consists of four parts, in this order:
+subdirectory.  A test specification defines some SQL "steps", groups them
+into "sessions" (where all the steps of one session will be run in the
+same backend), and specifies the "permutations" or orderings of the steps
+that are to be run.
+
+A test specification consists of four parts, in this order:

 setup { }

-  The given SQL block is executed once, in one session only, before running
+  The given SQL block is executed once (per permutation) before running
   the test.  Create any test tables or other required objects here.  This
   part is optional.  Multiple setup blocks are allowed if needed; each is
   run separately, in the given order.  (The reason for allowing multiple
@@ -81,8 +86,8 @@ session ""
   session is executed in its own connection.  A session part consists
   of three parts: setup, teardown and one or more "steps".  The per-session
   setup and teardown parts have the same syntax as the per-test setup and
-  teardown described above, but they are executed in each session.  The
-  setup part typically contains a "BEGIN" command to begin a transaction.
+  teardown described above, but they are executed in each session.  The setup
+  part might, for example, contain a "BEGIN" command to begin a transaction.

 Each step has the syntax

@@ -101,7 +106,8 @@ permutation "" ...
   order).  Note that the list of steps in a manually specified
   "permutation" line doesn't actually have to be a permutation of the
   available steps; it could for instance repeat some steps more than once,
-  or leave others out.
+  or leave others out.  Also, each step name can be annotated with some
+  parenthesized markers, which are described below.

 Lines beginning with a # are considered comments.

@@ -110,7 +116,8 @@ specified in the spec file, or automatically generated), the isolation
 tester runs the main setup part, then per-session setup parts, then
 the selected session steps, then per-session teardown, then the main
 teardown script.  Each selected step is sent to the connection associated
-with its session.
+with its session.  The main setup and teardown scripts are run in a
+separate "control" session.


 Support for blocking commands

@@ -129,3 +136,64 @@ tests take a very long time to run, and they serve no useful testing purpose.
 Note that isolationtester recognizes that a command has blocked by looking
 to see if it is shown as waiting in the pg_locks view; therefore, only
 blocks on heavyweight locks will be detected.
+
+
+Dealing with race conditions
+
+
+In some cases, the isolationtester's output for a test script may vary
+due to timing issues.  One way to deal with that is to create variant
+expected-files, which follow the usual PG convention that variants for
+foo.spec are named foo_1.out, foo_2.out, etc.  However, this method is
+discouraged since the extra files are a nuisance for maintenance.
+Instead, it's usually
Re: Isolation tests vs. SERIALIZABLE isolation level
On Tue, Jun 15, 2021 at 2:09 PM Tom Lane wrote:
> * Do we still care about that policy?
> * If so, who's going to fix the above-listed problems?
> * Should we try to get some routine testing of this case in place?

I wondered the same in commit 37929599 (the same problem for
src/test/regress, which now passes, but only in master, not the back
branches).  I doubt it will find real bugs very often, and I doubt many
people would enjoy the slowdown if it were always on, but it might make
sense to have something like PG_TEST_EXTRA that can be used to run the
tests at all three levels, and then turn that on in a few strategic
places like CI and a BF animal or two.
Re: PG 14 release notes, first draft
On Tue, Jun 15, 2021 at 10:36 AM Bruce Momjian wrote:
>
> On Tue, Jun 15, 2021 at 10:06:49AM +0900, Masahiko Sawada wrote:
> > On Mon, May 10, 2021 at 3:03 PM Bruce Momjian wrote:
> > >
> > > I have committed the first draft of the PG 14 release notes.  You
> > > can see the most current build of them here:
> > >
> > >     https://momjian.us/pgsql_docs/release-14.html
> >
> > It might have already been discussed but I think we should mention
> > commit bed90759f in the release note.  The commit seems like a bug
> > fix but it is not back-patched to the older versions, at least for
> > now.  There is a discussion[1] that we will revisit that a couple of
> > months after 14 is released, so as there is some feedback from the
> > field with this change.
>
> OK, but I need more information on how users will see a difference
> based on this commit:

I think that since with this commit the server on Windows can handle a
file over 4GB, COPY FROM loading data from an over-4GB file and pg_dump
dumping a large table work now.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
Re: Duplicate history file?
On Tue, Jun 15, 2021 at 10:20:37AM +0900, Kyotaro Horiguchi wrote:
>
> Actually there's large room for losing data with cp.  Yes, we would
> need additional redundancy of storage and periodical integrity
> inspection of the storage and archives, and maybe copies at other
> sites on the other side of the Earth.  But they are too much for some
> kinds of users.  They have the right and responsibility to decide how
> durable/reliable their archive needs to be.  (Putting aside some
> hardware/geological requirements :p)

Note that most of those considerations are orthogonal to what a proper
archive_command should be responsible for.  Yes, users are responsible
for deciding whether they want a valid and durable backup or not, but
we should assume a sensible default behavior, which is a valid and
durable archive_command.  We don't document a default fsync = off with
a later recommendation explaining why you shouldn't do that, and I
think it should be the same for the archive_command.

The problem with the current documentation is that many users will just
blindly copy/paste whatever is in the documentation without reading any
further.  As an example, a few hours ago some French user on the French
bulletin board said that he fixed his "postmaster.pid already exists"
error with a pg_resetxlog -f, referring to some article explaining how
to start postgres in case of "PANIC: could not locate a valid
checkpoint record".  Arguably that article didn't bother to document
what the implications of executing pg_resetxlog are, but given that the
user's original problem had literally nothing to do with what was
documented, I really doubt that it would have changed anything.

> If we mandate some characteristics on the archive_command, we should
> take them into core.

I agree.

> I remember I saw some discussions on archive command on this line but
> I think it had ended at the point something like that "we cannot
> design one-fits-all interface conforming to the requirements" or
> something (sorry, I don't remember the details..)

I also agree, but this problem is solved by making archive_command
customisable.  Providing something that can reliably work in some
general and limited cases would be a huge improvement.

> Well, barman used rsync/ssh in its documentation in the past and now
> looks like providing barman-wal-archive, so it seems that you're right
> on that point.  So, do we recommend them in our documentation?  (I'm
> not sure they actually conform to the requirements, though..)

We could maybe bless some third party backup solutions, but this will
probably lead to a lot more discussions, so it's better to discuss that
in a different thread.  Note that I don't have a deep knowledge of any
of those tools so I don't have an opinion.

> If we write an example with a pseudo tool name, requiring some
> characteristics on the tool, then not telling about the minimal tools,
> I think that it is equivalent to saying that we are inhibiting certain
> users from using archive_command even if they really don't want such a
> level of durability.

I have already seen customers complaining about losing backups because
their archive_command didn't ensure that the copy was durable.  Some
users may not care about losing their backups in such a case, but I
really think that the majority of users expect a backup to be valid,
durable and everything, without even thinking that it may not be the
case.  It should be the default behavior, not optional.
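[Editor's note: to make the durability point concrete, here is a minimal sketch of an archiving routine that only reports success once the copy is on stable storage. It is an illustration of the properties discussed above, not a blessed recommendation; the function name and paths are invented, and `sync FILE` assumes GNU coreutils >= 8.24.]

```shell
# Copy a WAL segment into an archive directory durably: never overwrite
# an existing archive file, copy to a temporary name, flush the data to
# disk, rename into place, then flush the directory so the rename itself
# survives a crash.  Returns non-zero on any failure, which is what
# tells PostgreSQL the segment must be retried.
archive_wal() {
    src="$1"; destdir="$2"
    dest="$destdir/$(basename "$src")"

    [ ! -e "$dest" ] || return 1      # never silently overwrite an archive
    cp "$src" "$dest.tmp" || return 1
    sync "$dest.tmp" || return 1      # flush file data (coreutils sync FILE)
    mv "$dest.tmp" "$dest" || return 1
    sync "$destdir"                   # flush the directory entry
}

# Hypothetical wiring, via a wrapper script taking %p and %f:
#   archive_command = '/usr/local/bin/archive-wal.sh %p /var/lib/pgarchive'
```

Compare this with the documented bare `cp %p ...%f`, which returns success as soon as the data reaches the page cache, can silently overwrite an existing segment, and can leave a half-copied file that a restore will later trip over.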
Re: Teaching users how they can get the most out of HOT in Postgres 14
On Mon, Jun 14, 2021 at 5:23 PM Michael Paquier wrote:
> > *Why* does it have to work at the system level?  I don't understand
> > what you mean about the system level.
>
> I mean that you lack a GUC that allows to enforce to *not* use this
> optimization for all relations, for all processes.

You've just explained what a GUC is.  This is not helpful.

> > Why is this fundamentally different to those two things?
>
> Because the situation looks completely different to me here.  TRUNCATE
> is thought as an option to be able to avoid an exclusive lock when
> truncating the relation file size at the end of VACUUM.  More
> importantly, the default of TRUNCATE is *false*, meaning that we are
> never going to skip the truncation unless one specifies it at the
> relation level.

Maybe it looks different to you because that's not actually true;
VACUUM *does* skip the truncation when it feels like it, regardless of
the value of the reloption.  In general there is no strict guarantee of
truncation ever happening -- see lazy_truncate_heap().

Again: Why is this fundamentally different?

> Here, what we have is a decision that is enforced to happen by
> default, all the time, with the user not knowing about it.  If there
> is a bug or an issue with it, users, based on your proposal, would be
> forced to change it for each *relation*.  If they miss some of those
> relations, they may still run into problems without knowing about it.
> The change of default behavior and having no way to control it in
> a simple way look incompatible to me.

You've just explained what a reloption is.  Again, this is not helping.

> Perhaps.  What I am really scared about is that you are assuming that
> enforcing this decision will be *always* fine.

I very clearly and repeatedly said that there was uncertainty about
causing issues in rare real-world cases.  Are you always 100% certain
that your code has no bugs before you commit it?  Should I now add a
GUC for every single feature that I commit?

You are just asserting that we must need to add a GUC, without giving
any real reasons -- you're just giving generic reasons that work just
as well in most situations.  I'm baffled by this.

> What I am trying to say here is that it *may not* be fine for
> everybody, and that there should be an easy way to turn it off if that
> proves to be a problem.

As I said, I think that the reloption is both necessary and sufficient.
A GUC is a heavyweight solution that seems quite unnecessary.

> I don't quite see how that's an implementation problem, we have
> already many reloptions that are controlled with GUCs if the
> reloptions have no default.

I never said that there was an implementation problem with a GUC.  Just
that it was unnecessary, and not consistent with existing practice.

Does anyone else have an opinion on this?  Of course I can easily add a
GUC.  But I won't do so in the absence of any real argument in favor of
it.

> I think that a more careful implementation choice would have been to
> turn this optimization off by default, while having an option to allow
> one to turn it on at will.

You have yet to say anything about the implementation.

--
Peter Geoghegan
Re: Different compression methods for FPI
On Mon, Jun 14, 2021 at 08:42:08PM -0500, Justin Pryzby wrote: > On Tue, Jun 15, 2021 at 09:50:41AM +0900, Michael Paquier wrote: >> + {"wal_compression_method", PGC_SIGHUP, WAL_SETTINGS, >> + gettext_noop("Set the method used to compress full page images >> in the WAL."), >> + NULL >> + }, >> + _compression_method, >> + WAL_COMPRESSION_PGLZ, wal_compression_options, >> + NULL, NULL, NULL >> Any reason to not make that user-settable? If you merge that with >> wal_compression, that's not an issue. Hmm, yeah. This can be read as using PGC_USERSET. With the second part of my sentence, I think that I imply to use PGC_SUSET and be consistent with wal_compression, but I don't recall my mood from one month ago :) Sorry for the confusion. > I don't see how restricting it to superusers would mitigate the hazard at all: > If the local admin enables wal compression, then every user's data will be > compressed, and the degree of compression indicatates a bit about their data, > no matter whether it's pglz or lz4. I would vote for having some consistency with wal_compression. Perhaps we could even revisit c2e5f4d, but I'd rather not do that. >> The compression level may be better if specified with a different >> GUC. That's less parsing to have within the GUC machinery. > > I'm not sure about that - then there's an interdependency between GUCs. > If zlib range is 1..9, and zstd is -50..10, then you may have to set the > compression level first, to avoid an error. I believe there's a previous > discussion about inter-dependent GUCs, and maybe a commit fixing a problem > they > caused. But I cannot find it offhand. You cannot do cross-checks for GUCs in their assign hooks or even rely in the order of those parameters, but you can do that in some backend code paths. A recent discussion on the matter is for example what led to 79dfa8a for the GUCs controlling the min/max SSL protocols allowed. 
>> seems to me that if we can get the same amount of compression and CPU >> usage just by tweaking the compression level, there is no need to >> support more than one extra compression algorithm, easing the life of >> packagers and users. > > I don't think it eases it for packagers, since I anticipate the initial patch > would support {none/pglz/lz4/zlib}. I anticipate that zstd may not be in > pg15. Yes, without zstd we have all the infra to track the dependencies. > The goal of the patch is to give options, and the overhead of adding both zlib > and lz4 is low. zlib gives good compression at some CPU cost and may be > preferable for (some) DWs, and lz4 is almost certainly better (than pglz) for > OLTPs. Anything will be better than pglz. I am rather confident in that. What I am wondering is if we need to eat more bits than necessary for the WAL record format, because we will end up supporting it until the end of time. We may have twenty years from now a better solution than what has been integrated, and we may not care much about 1 extra byte for a WAL record at this point, or perhaps we will. From what I hear here, there are many cases that we may care about depending on how much CPU one is ready to pay in order to get more compression, knowing that there are no magic solutions for something that's cheap in CPU with a very good compression ratio, or we could just go with that. So it seems to me that there is still an argument for adding only one new compression method with a good range of levels, able to support the range of cases we'd care about:
- High compression ratio but high CPU cost.
- Low compression ratio but low CPU cost.
So we could also take a decision here based on the range of (compression, CPU) an algorithm is able to cover. -- Michael signature.asc Description: PGP signature
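The constraint mentioned above -- interdependent GUCs cannot be cross-checked in their assign hooks or rely on assignment order, only in backend code paths -- can be sketched as follows. This is an illustration in Python, not PostgreSQL code; the zlib and zstd level ranges are the ones quoted in the message, and the lz4 range and all function names are assumptions.

```python
# Per-method level ranges. zlib 1..9 and zstd -50..10 are the ranges quoted
# upthread; the pglz and lz4 entries are illustrative assumptions.
LEVEL_RANGES = {
    "pglz": (1, 1),
    "lz4": (1, 12),
    "zlib": (1, 9),
    "zstd": (-50, 10),
}

def check_level_guc(level):
    """Assign-hook-style check: this setting is validated in isolation,
    so the only safe rule is one that holds for every possible method."""
    return -50 <= level <= 12

def compression_setup(method, level):
    """Backend-code-path check: both settings are visible here, so the
    pair can be cross-checked right before compression is actually used,
    which is where an invalid combination can be reported cleanly."""
    lo, hi = LEVEL_RANGES[method]
    if not lo <= level <= hi:
        raise ValueError(f"level {level} out of range {lo}..{hi} for {method}")
    return (method, level)
```

The point of the split is that `check_level_guc` can accept values that are only valid for *some* method, while `compression_setup` rejects the invalid pair at use time -- the same shape as the 79dfa8a fix for the min/max SSL protocol GUCs.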
Isolation tests vs. SERIALIZABLE isolation level
In the past people have tried to ensure that the isolation tests would pass regardless of the prevailing default_transaction_isolation setting. (That was sort of the point, in fact, for the earliest tests using that infrastructure.) This seems to have been forgotten about lately, as all of these tests fail with default_transaction_isolation = serializable:

test detach-partition-concurrently-1 ... FAILED      504 ms
test detach-partition-concurrently-3 ... FAILED     2224 ms
test detach-partition-concurrently-4 ... FAILED     1600 ms
test fk-partitioned-2                ... FAILED      133 ms
test lock-update-delete              ... FAILED      538 ms
test tuplelock-update                ... FAILED    10223 ms
test tuplelock-upgrade-no-deadlock   ... FAILED      664 ms
test tuplelock-partition             ... FAILED       49 ms

(drop-index-concurrently-1 also failed until just now, but I resurrected its variant expected-file.) So:

* Do we still care about that policy?
* If so, who's going to fix the above-listed problems?
* Should we try to get some routine testing of this case in place?

regards, tom lane
Re: Different compression methods for FPI
On Tue, Jun 15, 2021 at 09:50:41AM +0900, Michael Paquier wrote: > On Sun, Jun 13, 2021 at 08:24:12PM -0500, Justin Pryzby wrote: > > I think it's more nuanced than just finding the algorithm with the least CPU > > use. The GUC is PGC_USERSET, and it's possible that a data-loading process > > might want to use zlib for better compress ratio, but an interactive OLTP > > process might want to use lz4 or no compression for better responsiveness. > > It seems to me that this should be a PGC_SUSET, at least? We've had > our share of problems with assumptions behind data leaks depending on > data compressibility (see ssl_compression and the kind). It's USERSET following your own suggestion (which is a good suggestion): On Mon, May 17, 2021 at 04:44:11PM +0900, Michael Paquier wrote: > + {"wal_compression_method", PGC_SIGHUP, WAL_SETTINGS, > + gettext_noop("Set the method used to compress full page images in > the WAL."), > + NULL > + }, > + _compression_method, > + WAL_COMPRESSION_PGLZ, wal_compression_options, > + NULL, NULL, NULL > Any reason to not make that user-settable? If you merge that with > wal_compression, that's not an issue. I don't see how restricting it to superusers would mitigate the hazard at all: If the local admin enables wal compression, then every user's data will be compressed, and the degree of compression indicates a bit about their data, no matter whether it's pglz or lz4. It's probably true without compression, too - the fraction of FPW might reveal their usage patterns. > > In this patch series, I added compression information to the errcontext from > > xlog_block_info(), and allow specifying compression levels like zlib-2. > > I'll > > rearrange that commit earlier if we decide that's desirable to include. > > The compression level may be better if specified with a different > GUC. That's less parsing to have within the GUC machinery. I'm not sure about that - then there's an interdependency between GUCs.
If zlib range is 1..9, and zstd is -50..10, then you may have to set the compression level first, to avoid an error. I believe there's a previous discussion about inter-dependent GUCs, and maybe a commit fixing a problem they caused. But I cannot find it offhand. > seems to me that if we can get the same amount of compression and CPU > usage just by tweaking the compression level, there is no need to > support more than one extra compression algorithm, easing the life of > packagers and users. I don't think it eases it for packagers, since I anticipate the initial patch would support {none/pglz/lz4/zlib}. I anticipate that zstd may not be in pg15. The goal of the patch is to give options, and the overhead of adding both zlib and lz4 is low. zlib gives good compression at some CPU cost and may be preferable for (some) DWs, and lz4 is almost certainly better (than pglz) for OLTPs. -- Justin
Re: Delegating superuser tasks to new security roles
On 2021-06-14 23:53, Mark Dilger wrote: On Jun 14, 2021, at 5:51 AM, torikoshia wrote: Thanks for working on this topic, I appreciate it! Thank you for taking a look! BTW, do these patches enable non-superusers to create users with BYPASSRLS? No, I did not break out the ability to create such users. Since I failed to apply the patches and didn't test them, I may have overlooked something, but I didn't find the corresponding code. Do you believe that functionality should be added? I have not thought much about that issue. I just noticed it because I was looking into operations that can only be done by superusers. It might be somewhat inconvenient for PostgreSQL service providers that don't give users superuser privileges, but at least I don't have a specific demand for it. Regards, -- Atsushi Torikoshi NTT DATA CORPORATION
Re: Duplicate history file?
At Fri, 11 Jun 2021 16:08:33 +0900, Michael Paquier wrote in > On Fri, Jun 11, 2021 at 03:32:28PM +0900, Kyotaro Horiguchi wrote: > > I think cp can be an example as far as we explain the limitations. (On > > the other hand "test !-f" cannot since it actually prevents server > > from working correctly.) > > Disagreed. I think that we should not try to change this area until > we can document a reliable solution, and a simple "cp" is not that. Isn't removing cp from the documentation a change in this area? I basically agree not to change anything, but the current example "test ! -f && cp .." and the relevant description have been known to be problematic in a certain situation.

- Do we leave it alone, ignoring the possible problem?
- Just add a description about "the problem"?
- Just remove "test ! -f" and the relevant description?
- Remove "test ! -f" and rewrite the relevant description?
(- Or keep "test ! -f" and rewrite the relevant description?)
- Write the full (known) requirements and use a pseudo tool-name in the example?
- Provide a minimal implementation of the command?
- Recommend some external tools (that we can guarantee conform to the requirements)?
- Not recommend any tools?

> Hmm. A simple command that could be used as reference is for example > "dd" that flushes the file by itself, or we could just revisit the > discussions about having a pg_copy command, or we could document a > small utility in perl that does the job. I think we should do that if pg_copy conforms to the mandatory requirements, but maybe that's for the future. Showing a minimal implementation in perl looks good. However, my main point here is "what should we do for now?", not an ideal solution. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
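The "small utility" idea mentioned above could look like the following sketch (Python here rather than perl; the function name, the pre-existence check mirroring the documented "test ! -f", and the error handling are all illustrative assumptions, not a proposed pg_copy). The requirements it tries to capture are the ones a plain cp misses: flush the data to storage, make the file appear atomically so a concurrent restore_command never sees a half-written copy, and persist the rename itself.

```python
import os
import shutil

def archive_wal(src, dstdir):
    """Durable, atomic archive copy -- a sketch of the requirements
    discussed for archive_command, not a production tool."""
    dst = os.path.join(dstdir, os.path.basename(src))
    if os.path.exists(dst):
        # Mimics the "test ! -f" part of the documented example.
        raise FileExistsError(f"{dst} is already archived")
    tmp = dst + ".tmp"
    with open(src, "rb") as fin, open(tmp, "wb") as fout:
        shutil.copyfileobj(fin, fout)
        fout.flush()
        os.fsync(fout.fileno())      # data reaches stable storage
    os.rename(tmp, dst)              # atomic within a filesystem:
                                     # readers see all of it or none of it
    dirfd = os.open(dstdir, os.O_RDONLY)
    try:
        os.fsync(dirfd)              # persist the rename itself
    finally:
        os.close(dirfd)
```

Even this sketch leaves questions open (the existence check is racy against concurrent archivers, and a leftover `.tmp` file from a crash needs cleanup), which illustrates why documenting the requirements is harder than it first looks.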
Re: PG 14 release notes, first draft
On Tue, Jun 15, 2021 at 10:06:49AM +0900, Masahiko Sawada wrote: > On Mon, May 10, 2021 at 3:03 PM Bruce Momjian wrote: > > > > I have committed the first draft of the PG 14 release notes. You can > > see the most current build of them here: > > > > https://momjian.us/pgsql_docs/release-14.html > > > > It might have already been discussed but I think we should mention > commit bed90759f in the release note. The commit seems like a bug fix > but it is not back-patched to the older versions, at least for now. There > is a discussion[1] that we will revisit that a couple of months after > 14 is released, so that there is some feedback from the field on this > change. OK, but I need more information on how users will see a difference based on this commit:

commit bed90759fc
Author: Tom Lane
Date: Fri Oct 9 16:20:12 2020 -0400

    Fix our Windows stat() emulation to handle file sizes > 4GB.

    Hack things so that our idea of "struct stat" is equivalent to Windows' struct __stat64, allowing it to have a wide enough st_size field. Instead of relying on native stat(), use GetFileInformationByHandle(). This avoids a number of issues with Microsoft's multiple and rather slipshod emulations of stat(). We still need to jump through hoops to deal with ERROR_DELETE_PENDING, though :-(

    Pull the relevant support code out of dirmod.c and put it into its own file, win32stat.c.

    Still TODO: do we need to do something different with lstat(), rather than treating it identically to stat()?

    Juan José Santamaría Flecha, reviewed by Emil Iggland; based on prior work by Michael Paquier, Sergey Zubkovsky, and others

    Discussion: https://postgr.es/m/1803D792815FC24D871C00D17AE95905CF5099@g01jpexmbkw24
    Discussion: https://postgr.es/m/15858-9572469fd3b73...@postgresql.org

-- Bruce Momjian https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic
Hi, > @@ -4032,6 +4039,24 @@ GlobalVisTestShouldUpdate(GlobalVisState *state) > static void > GlobalVisUpdateApply(ComputeXidHorizonsResult *horizons) > { > + /* assert non-decreasing nature of horizons */ > + Assert(FullTransactionIdFollowsOrEquals( > +FullXidRelativeTo(horizons->latest_completed, > + > horizons->shared_oldest_nonremovable), > +GlobalVisSharedRels.maybe_needed)); > + Assert(FullTransactionIdFollowsOrEquals( > +FullXidRelativeTo(horizons->latest_completed, > + > horizons->catalog_oldest_nonremovable), > +GlobalVisCatalogRels.maybe_needed)); > + Assert(FullTransactionIdFollowsOrEquals( > +FullXidRelativeTo(horizons->latest_completed, > + > horizons->data_oldest_nonremovable), > +GlobalVisDataRels.maybe_needed)); > + Assert(FullTransactionIdFollowsOrEquals( > +FullXidRelativeTo(horizons->latest_completed, > + > horizons->temp_oldest_nonremovable), > +GlobalVisTempRels.maybe_needed)); > + > GlobalVisSharedRels.maybe_needed = > FullXidRelativeTo(horizons->latest_completed, > > horizons->shared_oldest_nonremovable); Thinking more about it, I don't think these are correct. See the following comment in procarray.c: * Note: despite the above, it's possible for the calculated values to move * backwards on repeated calls. The calculated values are conservative, so * that anything older is definitely not considered as running by anyone * anymore, but the exact values calculated depend on a number of things. For * example, if there are no transactions running in the current database, the * horizon for normal tables will be latestCompletedXid. If a transaction * begins after that, its xmin will include in-progress transactions in other * databases that started earlier, so another call will return a lower value. * Nonetheless it is safe to vacuum a table in the current database with the * first result. 
There are also replication-related effects: a walsender * process can set its xmin based on transactions that are no longer running * on the primary but are still being replayed on the standby, thus possibly * making the values go backwards. In this case there is a possibility that * we lose data that the standby would like to have, but unless the standby * uses a replication slot to make its xmin persistent there is little we can * do about that --- data is only protected if the walsender runs continuously * while queries are executed on the standby. (The Hot Standby code deals * with such cases by failing standby queries that needed to access * already-removed data, so there's no integrity bug.) The computed values * are also adjusted with vacuum_defer_cleanup_age, so increasing that setting * on the fly is another easy way to make horizons move backwards, with no * consequences for data integrity. Greetings, Andres Freund
Re: Duplicate history file?
At Fri, 11 Jun 2021 15:18:03 +0800, Julien Rouhaud wrote in > On Fri, Jun 11, 2021 at 03:32:28PM +0900, Kyotaro Horiguchi wrote: > I disagree, cp is probably the worst command that can be used for this > purpose. > On top on the previous problems already mentioned, you also have the fact that > the copy isn't atomic. It means that any concurrent restore_command (or > anything that would consume the archived files) will happily process a half > copied WAL file, and in case of any error during the copy you end up with a > file for which you don't know if it contains valid data or not. I don't see > any case where you would actually want to use that, unless maybe if you want > to > benchmark how long it takes before you lose some data. Actually there's large room for losing data with cp. Yes, we would need additional redundancy of storage, periodical integrity inspection of the storage and archives, and maybe copies at other sites on the other side of the Earth. But those are too much for some kinds of users. They have the right and responsibility to decide how durable/reliable their archive needs to be. (Putting aside some hardware/geological requirements :p) If we mandate some characteristics on the archive_command, we should take them into core. I remember I saw some discussions on archive_command along this line, but I think they ended at a point something like "we cannot design a one-size-fits-all interface conforming to the requirements" (sorry, I don't remember the details..) > I don't know, I'm assuming that barman also provides one, such as wal-e and > wal-g (assuming that the distant providers do their part of the job > correctly). Well, barman used rsync/ssh in its documentation in the past and now looks like it provides barman-wal-archive, so it seems that you're right on that point. So, do we recommend them in our documentation? (I'm not sure they actually conform to the requirements, though..) > Maybe there are other tools too.
But as long as we don't document what > exactly > are the requirements, it's not really a surprise that most people don't > implement them. I strongly agree to describe the requirements. My point is that if all of them are really mandatory, it is mandatory for us to officially provide, or at least recommend, the minimal implementation(s) that covers all of them. If we recommended some external tools, that would mean that we ensure that those tools meet the requirements. If we write an example with a pseudo tool name, requiring some characteristics of the tool, but do not tell about the minimal tools, I think that is equivalent to inhibiting certain users from using archive_command even if they really don't want such a level of durability. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: PG 14 release notes, first draft
On Mon, May 10, 2021 at 3:03 PM Bruce Momjian wrote: > > I have committed the first draft of the PG 14 release notes. You can > see the most current build of them here: > > https://momjian.us/pgsql_docs/release-14.html > It might have already been discussed but I think we should mention commit bed90759f in the release note. The commit seems like a bug fix but it is not back-patched to the older versions, at least for now. There is a discussion[1] that we will revisit that a couple of months after 14 is released, so that there is some feedback from the field on this change. Regards, [1] https://www.postgresql.org/message-id/ycszix2a2ilsv...@paquier.xyz -- Masahiko Sawada EDB: https://www.enterprisedb.com/
Re: Different compression methods for FPI
Hi, On 2021-06-15 09:50:41 +0900, Michael Paquier wrote: > On Sun, Jun 13, 2021 at 08:24:12PM -0500, Justin Pryzby wrote: > > I think it's more nuanced than just finding the algorithm with the least CPU > > use. The GUC is PGC_USERSET, and it's possible that a data-loading process > > might want to use zlib for better compress ratio, but an interactive OLTP > > process might want to use lz4 or no compression for better responsiveness. > > It seems to me that this should be a PGC_SUSET, at least? We've had > our share of problems with assumptions behind data leaks depending on > data compressibility (see ssl_compression and the kind). -1. Currently wal_compression has too much overhead for some workloads, but not for others. Disallowing normal users to set it would break cases where it's set for users that can tolerate the perf impact (which I have done at least). And the scenarios where it can leak information that couldn't otherwise be leaked already don't seem all that realistic? Greetings, Andres Freund
Re: Different compression methods for FPI
On Sun, Jun 13, 2021 at 08:24:12PM -0500, Justin Pryzby wrote: > I think it's more nuanced than just finding the algorithm with the least CPU > use. The GUC is PGC_USERSET, and it's possible that a data-loading process > might want to use zlib for better compress ratio, but an interactive OLTP > process might want to use lz4 or no compression for better responsiveness. It seems to me that this should be a PGC_SUSET, at least? We've had our share of problems with assumptions behind data leaks depending on data compressibility (see ssl_compression and the kind). > In this patch series, I added compression information to the errcontext from > xlog_block_info(), and allow specifying compression levels like zlib-2. I'll > rearrange that commit earlier if we decide that's desirable to include. The compression level may be better if specified with a different GUC. That's less parsing to have within the GUC machinery. So, how does the compression level influence those numbers? The level of compression used by LZ4 here is the fastest-CPU/least-compression, same for zlib and zstd? Could we get some data with getrusage()? It seems to me that if we can get the same amount of compression and CPU usage just by tweaking the compression level, there is no need to support more than one extra compression algorithm, easing the life of packagers and users. -- Michael signature.asc Description: PGP signature
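Michael's question -- how much the compression *level* alone changes the size/CPU picture -- can be illustrated with a small stdlib sketch. This uses zlib only, on a synthetic repetitive 8kB "page"; it is a toy illustration of the level/ratio trade-off being discussed, not a substitute for getrusage() numbers on real full-page images.

```python
import time
import zlib

def compare_levels(payload, levels=(1, 6, 9)):
    """Compress the same payload at several zlib levels and return
    (level, compressed size, elapsed seconds) tuples."""
    results = []
    for lvl in levels:
        start = time.perf_counter()
        out = zlib.compress(payload, lvl)
        results.append((lvl, len(out), time.perf_counter() - start))
    return results

# An 8kB "page" with heavy repetition, vaguely like a heap page full of
# similar rows; real FPIs will compress very differently.
page = (b"tuple-header|some-column-data|" * 300)[:8192]

for lvl, size, secs in compare_levels(page):
    print(f"zlib level {lvl}: {size} bytes in {secs:.6f}s")
```

On highly repetitive data like this the levels barely differ, which is part of the argument above: whether one algorithm with a level range can cover the whole (compression, CPU) space is an empirical question that needs measurements on real WAL.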
Re: Skipping logical replication transactions on subscriber side
On Wed, Jun 2, 2021 at 3:07 PM Amit Kapila wrote: > > On Tue, Jun 1, 2021 at 9:05 PM Peter Eisentraut > wrote: > > > > On 01.06.21 06:01, Amit Kapila wrote: > > > But, won't that be costly in cases where we have errors in the > > > processing of very large transactions? The subscription has to process all > > > the data before it gets an error. I think we can even imagine this > > > feature to be extended to use commitLSN as a skip candidate in which > > > case we can even avoid getting the data of that transaction from the > > > publisher. So if this information is persistent, the user can even set > > > the skip identifier after the restart before the publisher can send > > > all the data. > > > > At least in current practice, skipping parts of the logical replication > > stream on the subscriber is a rare, emergency-level operation when > > something that shouldn't have happened happened. So it doesn't really > > matter how costly it is. It's not going to be more costly than the > > error happening in the first place. All you'd need is one shared memory > > slot per subscription to store a xid to skip. > > > > Leaving aside the performance point, how can we do that by just storing > the skip identifier (XID/commitLSN) in shared memory? How will the apply > worker know about that information after restart? Do you expect the > user to set it again? If so, I think users might not like that. Also, > how will we prohibit users from giving some identifier other than for > failed transactions, and if users provide that, what should be our > action? Without that, if users provide the XID of some in-progress > transaction, we might need to do more work (rollback) than just > skipping it. I think the simplest solution would be to have a fixed-size array in shared memory to store information about skipping transactions on the particular subscription. Given that this feature is meant to be a repair tool for emergency cases, 32 or 64 entries seem enough.
That information should be visible to users via a system view, and each entry is cleared once the worker has skipped the transaction. We would also need to clear the entry if the meta information of the subscription, such as conninfo and slot name, has been changed. The worker reads that information at least when starting logical replication. The worker receives changes from the publication and checks whether the transaction should be skipped when starting to apply those changes. If so, the worker skips applying all changes of the transaction and removes stream files if they exist. Regarding the point of how to check whether the XID specified by the user is valid, I guess it's not easy to do that since XIDs sent from the publisher are in random order. Considering the use case of this tool, the situation seems like the logical replication gets stuck due to a problem transaction and the worker repeatedly restarts and raises an error. So I guess it also would be a good idea that the user can specify to skip the first transaction (or first N transactions) since the subscription starts logical replication. It's less flexible but seems enough to solve such a situation, and it doesn't have the problem of validating the XID. If functionality like letting the subscriber know the oldest XID that could possibly be sent is useful also for other purposes, it would also be a good idea to implement it, but I'm not sure about other use cases. Anyway, it seems to me that we need to consider the user interface first, especially how and what the user specifies as the transaction to skip. My current feeling is that specifying an XID is intuitive and flexible, but the user needs two steps: check the XID and then specify it, and there is a risk that the user mistakenly specifies a wrong XID. On the other hand, the idea of specifying to skip the first transaction doesn't require the user to check and specify an XID but is less flexible, and "the first" transaction might be ambiguous for the user.
Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
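The fixed-size skip table proposed in the message above could be sketched as follows. This is an illustration of the semantics being discussed (set by the user, checked by the apply worker, cleared after the skip or on subscription changes), not code from any patch; the class, its keying by subscription OID, and the 64-entry cap are all assumptions.

```python
MAX_SKIP_ENTRIES = 64  # "32 or 64 entries seem enough" per the proposal

class SkipXidTable:
    """Sketch of a fixed-size, shared skip list for logical replication:
    the user adds XIDs, the apply worker consults them, and entries are
    cleared once a transaction has been skipped."""

    def __init__(self):
        self.entries = {}  # subscription oid -> set of XIDs to skip

    def add(self, suboid, xid):
        if sum(len(s) for s in self.entries.values()) >= MAX_SKIP_ENTRIES:
            raise RuntimeError("skip table full")
        self.entries.setdefault(suboid, set()).add(xid)

    def should_skip(self, suboid, xid):
        # Checked by the worker when it starts applying a transaction.
        return xid in self.entries.get(suboid, set())

    def clear_after_skip(self, suboid, xid):
        # Entry is one-shot: removed once the transaction was skipped.
        self.entries.get(suboid, set()).discard(xid)

    def invalidate_subscription(self, suboid):
        # conninfo/slot-name change: drop all pending skips for safety.
        self.entries.pop(suboid, None)
```

Note what the sketch deliberately cannot do, matching the concern in the thread: nothing validates that an added XID corresponds to a failed transaction, which is exactly why the "skip the first transaction" alternative is attractive.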
Re: A new function to wait for the backend exit after termination
On Mon, Jun 14, 2021 at 07:34:59PM +0530, Bharath Rupireddy wrote: > Thanks. +1 to remove the pg_wait_for_backend_termination function. The > patch basically looks good to me. I'm attaching an updated patch. I > corrected a minor typo in the commit message, took docs and code > comment changes suggested by Justin Pryzby. Pushed as two commits. I used your log message typo fix. Here were the diffs in your v2 and not in an earlier patch:

> -+ Add an optional wait parameter to
> ++ Add an optional timeout parameter to

> -+ int timeout; /* milliseconds */
> ++ int timeout; /* milliseconds */

pgindent chooses a third option, so I ran pgindent instead of using this. > Please note that release notes still have the headline "Add functions > to wait for backend termination" of the original commit that added > pg_wait_for_backend_termination. With the removal of it, I'm not quite > sure if we retain the commit message or tweak it to something like > "Add optional timeout parameter to pg_terminate_backend". > That part is supposed to mirror "git log --pretty=format:%s", no matter what happens later. The next set of release note updates might add my latest commit (5f1df62) to this SGML comment, on another line.
Re: Teaching users how they can get the most out of HOT in Postgres 14
On Fri, Jun 11, 2021 at 02:46:20PM -0700, Peter Geoghegan wrote: > On Thu, Jun 3, 2021 at 11:15 PM Michael Paquier wrote: >> I have read through the patch, and I am surprised to see that this >> only makes possible to control the optimization at relation level. >> The origin of the complaints is that this index cleanup optimization >> has been introduced as a new rule that gets enforced at *system* >> level, so I think that we should have an equivalent with a GUC to >> control the behavior for the whole system. > > *Why* does it have to work at the system level? I don't understand > what you mean about the system level. I mean that you lack a GUC that allows one to enforce *not* using this optimization for all relations, for all processes. > As Masahiko pointed out, adding a GUC isn't what we've done in other > similar cases -- that's how DISABLE_PAGE_SKIPPING works, which was a > defensive option that seems similar enough to what we want to add now. > To give another example, the TRUNCATE VACUUM option (or the related > reloption) can be used to disable relation truncation, a behavior that > sometimes causes *big* issues in production. The truncate behavior is > determined dynamically in most situations -- which is another > similarity to the optimization we've added here. > Why is this fundamentally different to those two things? Because the situation looks completely different to me here. TRUNCATE is thought of as an option to be able to avoid an exclusive lock when truncating the relation file size at the end of VACUUM. More importantly, the default of TRUNCATE is *false*, meaning that we are never going to skip the truncation unless one specifies it at the relation level. Here, what we have is a decision that is enforced to happen by default, all the time, with the user not knowing about it. If there is a bug or an issue with it, users, based on your proposal, would be forced to change it for each *relation*.
If they miss some of those relations, they may still run into problems without knowing about it. The change of default behavior and having no way to control it in a simple way look incompatible to me. >> With what you are >> presenting here, one could only disable the optimization for each >> relation, one-by-one. If this optimization proves to be a problem, >> it's just going to be harder to users to go through all the relations >> and re-tune autovacuum. Am I missing something? > > Why would you expect autovacuum to run even when the optimization is > unavailable (e.g. with Postgres 13)? After all, the specifics of when > the bypass optimization kicks in make it very unlikely that ANALYZE > will ever be able to notice enough dead tuples to trigger an > autovacuum (barring antiwraparound and insert-driven autovacuums). > > There will probably be very few LP_DEAD items remaining. Occasionally > there will be somewhat more LP_DEAD items, that happen to be > concentrated in less than 2% of the table's blocks -- but block-based > sampling by ANALYZE is likely to miss most of them and underestimate > the total number. The sampling approach taken by acquire_sample_rows() > ensures this with larger tables. With small tables the chances of the > optimization kicking in are very low, unless perhaps fillfactor has > been tuned very aggressively. > > There has never been a guarantee that autovacuum will be triggered > (and do index vacuuming) in cases that have very few LP_DEAD items, no > matter how the system has been tuned. The main reason why making the > optimization behavior controllable is for the VACUUM command. > Principally for hackers. I can imagine myself using the VACUUM option > to disable the optimization when I was interested in testing VACUUM or > space utilization in some specific, narrow way. 
> > Of course it's true that there is still some uncertainty about the > optimization harming production workloads -- that is to be expected > with an enhancement like this one. But there is still no actual > example or test case that shows the optimization doing the wrong > thing, or anything like it. Anything is possible, but I am not > expecting there to be even one user complaint about the feature. > Naturally I don't want to add something as heavyweight as a GUC, given > all that. Perhaps. What I am really scared about is that you are assuming that enforcing this decision will *always* be fine. What I am trying to say here is that it *may not* be fine for everybody, and that there should be an easy way to turn it off if that proves to be a problem. I don't quite see how that's an implementation problem; we already have many reloptions that are controlled with GUCs if the reloptions have no default. I think that a more careful implementation choice would have been to turn this optimization off by default, while having an option to allow one to turn it on at will. -- Michael signature.asc Description: PGP signature
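For readers following the thread, the shape of the heuristic being debated is roughly the following. The 2% page threshold is the figure mentioned in the discussion above; the function name, the zero-items shortcut, and everything else is a simplified sketch, not the actual vacuumlazy.c logic (which also caps the absolute number of dead items and never bypasses a failsafe vacuum).

```python
BYPASS_THRESHOLD_PAGES = 0.02  # "less than 2% of the table's blocks"

def should_bypass_index_vacuuming(rel_pages, lpdead_item_pages):
    """Sketch: skip the index-vacuuming passes when LP_DEAD items are
    confined to a tiny fraction of the heap, since a later VACUUM will
    pick them up cheaply. Simplified from the real PG 14 heuristic."""
    if lpdead_item_pages == 0:
        return True  # nothing for the indexes to do at all
    return lpdead_item_pages < rel_pages * BYPASS_THRESHOLD_PAGES
```

Framed this way, the disagreement above is not about the formula but about who can turn it off: per-relation only (reloption/VACUUM option), or also globally via a GUC.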
Re: Signed vs Unsigned (take 2) (src/backend/storage/ipc/procarray.c)
I took it a step further.

Transactions HEAD patched
1000220710586781 1014616710388685 100489191059 10065764,333 10436275 3,55021946687555

TPS HEAD patched
33469,016009  35399,010472
33950,624679  34733,252336
33639,8429    34578,495043
33686,494529  34903,585950
3,48700968070122 3,55%

Is it worth touching procarray.c for real?

With msvc 64 bits, the asm generated:
HEAD     213.731 bytes procarray.asm
patched  212.035 bytes procarray.asm

Patch attached.

regards, Ranier Vilela improve_procarray.patch Description: Binary data
Re: [PATCH] Fix select from wrong table array_op_test
On 2021-06-11T19:00:41+0300, Heikki Linnakangas wrote: > We already have these same queries in the 'arrays' test against the > 'array_op_test' table, though. It sure looks like a copy-paste error to me > as well. That's reason enough, but another reason is that I don't think GIN_CAT_NULL_KEY is covered without this change.
Re: Support for NSS as a libpq TLS backend
On Fri, 2021-05-28 at 11:04 +0200, Daniel Gustafsson wrote:
> Attached is a rebase to keep bitrot at bay.

I get a failure during one of the CRL directory tests due to a missing database -- it looks like the Makefile is missing an entry. (I'm dusting off my build after a few months away, so I don't know if this latest rebase introduced it or not.) Attached is a quick patch; does it work on your machine?

--Jacob

From 7a7b8904ef22212190bb988fab1ef696fe1a59de Mon Sep 17 00:00:00 2001
From: Jacob Champion
Date: Mon, 14 Jun 2021 15:04:26 -0700
Subject: [PATCH] test/ssl: fix NSS server-side CRL test

Make sure the database is created during `make nssfiles`, and expect a
revocation failure message.
---
 src/test/ssl/Makefile          | 2 ++
 src/test/ssl/t/001_ssltests.pl | 2 +-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/src/test/ssl/Makefile b/src/test/ssl/Makefile
index 14ca1f8bf3..557cbe223f 100644
--- a/src/test/ssl/Makefile
+++ b/src/test/ssl/Makefile
@@ -45,6 +45,7 @@ NSSFILES := ssl/nss/client_ca.crt.db \
 	ssl/nss/client-revoked.crt__client-revoked.key.db \
 	ssl/nss/server-cn-only.crt__server-password.key.db \
 	ssl/nss/server-cn-only.crt__server-cn-only.key.db \
+	ssl/nss/server-cn-only.crt__server-cn-only.key.crldir.db \
 	ssl/nss/root.crl \
 	ssl/nss/server.crl \
 	ssl/nss/client.crl \
@@ -167,6 +168,7 @@ ssl/nss/server-cn-only.crt__server-cn-only.key.db: ssl/server-cn-only.crt ssl/se
 	pk12util -i ssl/nss/server-cn-only.pfx -d "sql:$@" -W ''

 ssl/nss/server-cn-only.crt__server-cn-only.key.crldir.db: ssl/nss/server-cn-only.crt__server-cn-only.key.db
+	cp -R $< $@
 	for c in $(shell ls ssl/root+client-crldir) ; do \
 		echo $${c} ; \
 		openssl crl -in ssl/root+client-crldir/$${c} -outform der -out ssl/nss/$${c} ; \
diff --git a/src/test/ssl/t/001_ssltests.pl b/src/test/ssl/t/001_ssltests.pl
index 4105a67b94..aec99e7bf6 100644
--- a/src/test/ssl/t/001_ssltests.pl
+++ b/src/test/ssl/t/001_ssltests.pl
@@ -664,7 +664,7 @@ $node->connect_fails(
 	"$common_connstr user=ssltestuser sslcert=ssl/client-revoked.crt sslkey=ssl/client-revoked_tmp.key ssldatabase=ssl/nss/client-revoked.crt__client-revoked.key.db",
 	"certificate authorization fails with revoked client cert with server-side CRL directory",
 	expected_stderr =>
-	qr/SSL error: sslv3 alert certificate revoked|SSL error: Encountered end of file/);
+	qr/SSL error: sslv3 alert certificate revoked|SSL peer rejected your certificate as revoked/);

 # clean up
--
2.25.1
Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic
Hi,

On 2021-06-14 11:53:47 +0200, Matthias van de Meent wrote:
> On Thu, 10 Jun 2021 at 19:43, Peter Geoghegan wrote:
> > On Thu, Jun 10, 2021 at 10:29 AM Matthias van de Meent wrote:
> > > I see one exit for HEAPTUPLE_DEAD on a potentially recently committed
> > > xvac (?), and we might also check against recently committed
> > > transactions if xmin == xmax, although apparently that is not
> > > implemented right now.
> >
> > I don't follow. Perhaps you can produce a test case?
>
> If you were to delete a tuple in the same transaction that you create
> it (without checkpoints / subtransactions), I would assume that this
> would allow us to vacuum the tuple, as the only snapshot that could
> see the tuple must commit or roll back.

Right now we do not do so, but I think we talked about adding such logic a couple of times.

I think a more robust assertion than aborted-ness could be to assert that repeated retries are not allowed to have the same "oldest xid" as a previous retry. With oldest xid being the older of xmin/xmax?

Greetings,

Andres Freund
Re: [Proposal] Add accumulated statistics for wait event
Hi,

On 2021-06-14 23:20:47 +0200, Jehan-Guillaume de Rorthais wrote:
> > On 2021-06-14 16:10:32 +0200, Jehan-Guillaume de Rorthais wrote:
> > > In the patch in attachment, I tried to fix this by using kind of an
> > > internal hook for pgstat_report_wait_start and pgstat_report_wait_end.
> > > This allows to "instrument" wait events only when required, on the fly,
> > > dynamically.
> >
> > That's *far worse*. You're adding an indirect function call. Which requires
> > loading a global variable and then a far call to a different function.
> > You're changing a path that's ~2 instructions with minimal dependencies
> > (and no branches, i.e. fully out-of-order executable) to something on the
> > order of ~15 instructions with plenty of dependencies and at least two
> > branches (call, ret).
>
> Oh, I didn't realize it would affect all queries, even when
> log_statement_stats was off. Thank you for your explanation.

Maybe I'm just misunderstanding what you were doing? As far as I can tell your patch changed pgstat_report_wait_start() to be an indirect function call - right? Then yes, this adds overhead to everything.

You *could* add a pgstat_report_wait_(start|end)_with_time() or such and only use that in places that won't have a high frequency. But I just don't quite see the use-case for that.

Greetings,

Andres Freund
Re: Question about StartLogicalReplication() error path
On Mon, 2021-06-14 at 13:13 -0400, Robert Haas wrote:
> I'm happy to hear other opinions, but I think I would be inclined to
> vote in favor of #1 and/or #2 but against #3.

What about upgrading it to, say, LOG? It seems like it would happen pretty infrequently, and in the event something strange happens, it might help rule out some possibilities.

Regards,

Jeff Davis
Re: [Proposal] Add accumulated statistics for wait event
Hi,

On Mon, 14 Jun 2021 11:27:21 -0700 Andres Freund wrote:
> On 2021-06-14 16:10:32 +0200, Jehan-Guillaume de Rorthais wrote:
> > In the patch in attachment, I tried to fix this by using kind of an internal
> > hook for pgstat_report_wait_start and pgstat_report_wait_end. This allows to
> > "instrument" wait events only when required, on the fly, dynamically.
>
> That's *far worse*. You're adding an indirect function call. Which requires
> loading a global variable and then a far call to a different function. You're
> changing a path that's ~2 instructions with minimal dependencies (and no
> branches, i.e. fully out-of-order executable) to something on the order of ~15
> instructions with plenty of dependencies and at least two branches (call, ret).

Oh, I didn't realize it would affect all queries, even when log_statement_stats was off. Thank you for your explanation.

> I doubt there's a path towards this feature without adding the necessary
> infrastructure to hot-patch the code - which is obviously quite a
> substantial project.

Right. Sadly, this kind of project is far above what I can do. So I suppose it's a dead end for me. I'll study if/how the sampling approach can be done dynamically.

Thank you,
Re: a path towards replacing GEQO with something better
On Mon, Jun 14, 2021 at 06:10:28PM +0200, Tomas Vondra wrote: > On 6/14/21 1:16 PM, John Naylor wrote: > > On Sun, Jun 13, 2021 at 9:50 AM Tomas Vondra > > mailto:tomas.von...@enterprisedb.com>> > > wrote: > > > >> > 2) We can still keep GEQO around (with some huge limit by default) for a > >> > few years as an escape hatch, while we refine the replacement. If there > >> > is some bug that prevents finding a plan, we can emit a WARNING and fall > >> > back to GEQO. Or if a user encounters a regression in a big query, they > >> > can lower the limit to restore the plan they had in an earlier version. > >> > > >> > >> Not sure. Keeping significant amounts of code may not be free - both for > >> maintenance and new features. It'd be a bit sad if someone proposed > >> improvements to join planning, but had to do 2x the work to support it > >> in both the DP and GEQO branches, or risk incompatibility. > > > > Looking back again at the commit history, we did modify geqo to support > > partial paths and partition-wise join, so that's a fair concern. > > Right. I think the question is how complex those changes were. If it was > mostly mechanical, it's not a big deal and we can keep both, but if it > requires deeper knowledge of the GEQO inner workings it may be an issue > (planner changes are already challenging enough). The random plan nature of GEQO, along with its "cliff", make it something I would be glad to get rid of if we can get an improved approach to large planning needs. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Re: Failure in subscription test 004_sync.pl
On 2021-Jun-14, Masahiko Sawada wrote: > The condition should be the opposite; we should raise the error when > 'nowait' is true. I think this is the cause of the test failure. Even > if DROP SUBSCRIPTION tries to drop the slot with the WAIT option, we > don't wait but raise the error. Hi, thanks for diagnosing this and producing a patch! I ended up turning the condition around, so that all three stanzas still test "!nowait"; which seems a bit easier to follow. TBH I'm quite troubled by the fact that this test only failed once on each animal; they all had a lot of successful runs after that. I wonder if this is because coverage is insufficient, or is it just bad luck. I also wonder if this bug is what caused the random failures in the test case I tried to add. I should look at that some more now ... -- Álvaro Herrera Valdivia, Chile Al principio era UNIX, y UNIX habló y dijo: "Hello world\n". No dijo "Hello New Jersey\n", ni "Hello USA\n".
Re: PG 14 release notes, first draft
On Mon, Jun 14, 2021 at 01:29:42PM -0400, John Naylor wrote: > Hi Bruce, > > For this item: > > > > > Allow BRIN indexes to use bloom filters > (Tomas Vondra) > > > > This allows bloom indexes to be used effectively with data that > is not physically localized in the heap. > > > > The text implies that this affects contrib/bloom. I think it should be "This > allows BRIN indexes...". Ah, I see your point. Updated text is: Allow BRIN indexes to use bloom filters (Tomas Vondra) This allows BRIN indexes to be used effectively with data that is not physically localized in the heap. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Re: [PATCH] expand the units that pg_size_pretty supports on output
> I don't see the need to extend the unit to YB. > What use case do you have in mind? Practical or no, I saw no reason not to support all defined units. I assume we’ll get to a need sooner or later. :) David
Re: PG 14 release notes, first draft
On Mon, Jun 14, 2021 at 06:57:41PM +0200, Tomas Vondra wrote: > Sorry it took me a while to look at the release notes. I have one > suggestion regarding this item: > > > Allow logical replication to stream long in-progress transactions to > subscribers (Tomas Vondra, Dilip Kumar, Amit Kapila, Ajin Cherian, > Nikhil Sontakke, Stas Kelvich) > > AFAICS the authors are generally ordered by how much they contributed to > the feature. In that case I'd move myself down the list - certainly > after Dilip and Amit, perhaps after Ajin. While I posted the original > patch, most of the work after that to get it committed was done by > those two/three people. OK, I moved you after Ajin. Sometimes it isn't clear how much of an original patch was modified by later authors. FYI, the most recent PG 14 relnote doc build is at: https://momjian.us/pgsql_docs/release-14.html -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Re: PG 14 release notes, first draft
On Mon, Jun 14, 2021 at 11:37:58AM -0500, Justin Pryzby wrote: > Some more: > > | VACUUM now has a PROCESS_TOAST which can be set to false to disable TOAST > processing, and vacuumdb has a --no-process-toast option. > has a process_toast *option Agreed. > | Previously, if the object already exists, EXPLAIN would fail. > already existed Fixed. > | Function pg_stat_reset_replication_slot() resets slot statistics. > *The function. But maybe it should be omitted. OK, I went with "The function". > | New options are read-only, primary, standby, and prefer-standby. > *The new options Agreed. > | Allow reindexdb to change the tablespace of the new index (Michael Paquier) > | This is done by specifying --tablespace. > I think this should be merged with the corresponding server feature, like > this one: > |Add ability to skip vacuuming of TOAST tables (Nathan Bossart) > |VACUUM now has a PROCESS_TOAST which can be set to false to disable TOAST > processing, and vacuumdb has a --no-process-toast option. > > Or, the client-side option could be omitted. This is distinguished from > vacuumdb --no-index-cleanup and --no-truncate, for which the server support > was > added in v12, and the client support was essentially an omission. I am inclined to mention reindexdb because we mention the SQL command option in the text. Here is the updated text: Allow REINDEX to change the tablespace of the new index (Alexey Kondratov, Michael Paquier, Justin Pryzby) This is done by specifying a TABLESPACE clause. A --tablespace option was also added to reindexdb to control this. > | Add documentation for the factorial() function (Peter Eisentraut) > | With the removal of the ! operator in this release, factorial() is the only > built-in way to compute a factorial. > Could be omitted or collapsed into the other item. I know Tom thinks that > it's unnecessary to document changes to documentation. Uh, I think we need both items.
We are removing a feature and asking people to use an existing feature that was previously undocumented. I think having two items makes it clear that the feature existed in previous releases. I just tried merging them into one item and there were just too many changes for it to be clear. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Re: Race condition in recovery?
On 6/14/21 3:32 PM, Andrew Dunstan wrote:
> On 6/14/21 1:50 PM, Andrew Dunstan wrote:
>> On 6/14/21 1:11 PM, Robert Haas wrote:
>>> On Mon, Jun 14, 2021 at 12:56 PM Andrew Dunstan wrote:
>>>> $^X is not at all broken. The explanation here is pretty simple - the
>>>> argument to perl2host is meant to be a directory. If we're going to
>>>> accommodate plain files then we have some more work to do in TestLib.
>>> This explanation seems to contradict the documentation in TestLib.pm,
>>> which makes no mention of any such restriction.
>> Here's a snippet:
>>
>> sub perl2host
>> {
>>     my ($subject) = @_;
>>     ...
>>     if (chdir $subject)
>>
>> Last time I looked you can't chdir to anything except a directory.
>
> Actually, I take it back, it does work for a file. I'll change it. I
> probably did this when something else wasn't working.

So, will you feel happier with this applied? I haven't tested it yet but I'm confident it will work.

diff --git a/src/test/recovery/t/025_stuck_on_old_timeline.pl b/src/test/recovery/t/025_stuck_on_old_timeline.pl
index e4e58cb8ab..3e19bc4c50 100644
--- a/src/test/recovery/t/025_stuck_on_old_timeline.pl
+++ b/src/test/recovery/t/025_stuck_on_old_timeline.pl
@@ -24,11 +24,11 @@ my $node_primary = get_new_node('primary');
 # the timeline history file reaches the archive but before any of the WAL files
 # get there.
 $node_primary->init(allows_streaming => 1, has_archiving => 1);
-my $perlbin = $^X;
-if ($^O eq 'msys')
-{
-	$perlbin = TestLib::perl2host(dirname($^X)) . '\\' . basename($^X);
-}
+
+# Note: consistent use of forward slashes here avoids any escaping problems
+# that arise from use of backslashes. That means we need to double-quote all
+# the paths in the archive_command
+my $perlbin = TestLib::perl2host($^X);
 $perlbin =~ s!\\!/!g if $TestLib::windows_os;
 my $archivedir_primary = $node_primary->archive_dir;
 $archivedir_primary =~ s!\\!/!g if $TestLib::windows_os;
@@ -36,6 +36,8 @@ $node_primary->append_conf('postgresql.conf', qq(
 archive_command = '"$perlbin" "$FindBin::RealBin/cp_history_files" "%p" "$archivedir_primary/%f"'
 wal_keep_size=128MB
 ));
+# make sure that Msys perl doesn't complain about difficulty in setting locale
+# when called this way.
 local $ENV{PERL_BADLANG}=0;
 $node_primary->start;

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Re: Race condition in recovery?
On 6/14/21 1:50 PM, Andrew Dunstan wrote: > On 6/14/21 1:11 PM, Robert Haas wrote: >> On Mon, Jun 14, 2021 at 12:56 PM Andrew Dunstan wrote: >>> $^X is not at all broken. >>> >>> The explanation here is pretty simple - the argument to perl2host is >>> meant to be a directory. If we're going to accomodate plain files then >>> we have some more work to do in TestLib. >> This explanation seems to contradict the documentation in TestLib.pm, >> which makes no mention of any such restriction. > > Heres a snippet: > > > sub perl2host > { > my ($subject) = @_; > ... > if (chdir $subject) > > > Last time I looked you can't chdir to anything except a directory. Actually, I take it back, it does work for a file. I'll change it. I probably did this when something else wasn't working. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Re: Race condition in recovery?
On Mon, Jun 14, 2021 at 1:50 PM Andrew Dunstan wrote: > Heres a snippet: > > sub perl2host > { > my ($subject) = @_; > ... > if (chdir $subject) > > Last time I looked you can't chdir to anything except a directory. OK, but like I said, you can't tell that from the documentation. The documentation says: "Translate a virtual file name to a host file name. Currently, this is a no-op except for the case of Perl=msys and host=mingw32. The subject need not exist, but its parent or grandparent directory must exist unless cygpath is available." If you look just at that, there's nothing that would lead you to believe that it has to be a directory name. > I was trying to get the buildfarm green again. There have been plenty of > times when small patches directly for such fixes have been committed > directly. And that's the only circumstance when I do. I wasn't intending to criticize your work on this. I really appreciate it, in fact, as I also said to you off-list. I do think that there were some small things in those patches where a little bit of quick discussion might have been useful: e.g. should the archive_command change have gone in in the first place? Do we need any comments to explain the fixes? But it's not like it's a big deal either. I'm certainly not disagreeing with the goodness of making the buildfarm green as expediently as possible. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Position of ClientAuthentication hook
On Mon, Jun 14, 2021 at 8:51 AM Rafia Sabih wrote: > I have a doubt regarding the positioning of clientAuthentication hook > in function ClientAuthentication. Particularly, in case of hba errors, > i.e. cases uaReject or uaImplicitReject it errors out, leading to no > calls to any functions attached to the authentication hook. Can't we > process the hook function first and then error out...? Maybe. One potential problem is that if the hook errors out, the original error would be lost and only the error thrown by the hook would be logged or visible to the client. Whether or not that's a problem depends, I suppose, on what you're trying to do with the hook. -- Robert Haas EDB: http://www.enterprisedb.com
Re: MultiXact\SLRU buffers configuration
On Mon, 14 Jun 2021 at 15:07, Andrey Borodin wrote:
> PFA patch implementing this idea.

I benchmarked the v17 patches. Testing was done on a 96-core machine, with PGDATA completely placed in tmpfs. PostgreSQL was built with CFLAGS -O2.

for-update pgbench script:

\set aid random_zipfian(1, 100, 2)
begin;
select :aid from pgbench_accounts where aid = :aid for update;
update pgbench_accounts set abalance = abalance + 1 where aid = :aid;
update pgbench_accounts set abalance = abalance * 2 where aid = :aid;
update pgbench_accounts set abalance = abalance - 2 where aid = :aid;
end;

Before each test the sample data was filled with "pgbench -i -s 100"; each test was run 3 times for 1 hour each.

The benchmark results are presented with changing multixact_members_buffers and multixact_offsets_buffers (1:2 respectively):

settings                           tps
multixact_members_buffers_64Kb     693.2
multixact_members_buffers_128Kb    691.4
multixact_members_buffers_192Kb    696.3
multixact_members_buffers_256Kb    694.4
multixact_members_buffers_320Kb    692.3
multixact_members_buffers_448Kb    693.7
multixact_members_buffers_512Kb    693.3
vanilla                            676.1

Best regards,
Dmitry Vasiliev.
Re: PG 14 release notes, first draft
On Fri, Jun 11, 2021 at 10:45:51PM -0500, Justin Pryzby wrote: > | Add Set Server Name Indication (SNI) for SSL connection packets (Peter > Eisentraut) > Remove "Set" Fixed. > | Reduce the default value of vacuum_cost_page_miss from 10 milliseconds to 2 > (Peter Geoghegan) > Peter mentioned that this should not say "milliseconds" (but maybe the page > I'm > looking at is old). It is old. It is now: Reduce the default value of vacuum_cost_page_miss to better reflect current hardware capabilities (Peter Geoghegan) > | Cause vacuum operations to be aggressive if the table is near xid or > multixact wraparound (Masahiko Sawada, Peter Geoghegan) > Say "become aggressive" ? Updated text: Cause vacuum operations to be more aggressive if the table is near xid or multixact wraparound (Masahiko Sawada, Peter Geoghegan) > | Allow the arbitrary collations of partition boundary values (Tom Lane) > Remove "the" Agreed, removed. > | Generate WAL invalidations message during command completion when using > logical replication (Dilip Kumar, Tomas Vondra, Amit Kapila) > invalidation messages Fixed. > | Add support for infinity and -infinity values to the numeric data type > (Tom Lane) > "-infinity" has markup but not "infinity" ? Fixed. > > | Allow vacuum to deallocate space reserved by trailing unused heap line > pointers (Matthias van de Meent, Peter Geoghegan) > say "reclaim space" ? OK, new text is: Allow vacuum to reclaim space used by unused trailing heap line pointers (Matthias van de Meent, Peter Geoghegan) -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Re: [Proposal] Add accumulated statistics for wait event
Hi,

On 2021-06-14 11:27:21 -0700, Andres Freund wrote:
> On 2021-06-14 16:10:32 +0200, Jehan-Guillaume de Rorthais wrote:
> > In the patch in attachment, I tried to fix this by using kind of an internal
> > hook for pgstat_report_wait_start and pgstat_report_wait_end. This allows to
> > "instrument" wait events only when required, on the fly, dynamically.
>
> That's *far worse*. You're adding an indirect function call. Which requires
> loading a global variable and then a far call to a different function. You're
> changing a path that's ~2 instructions with minimal dependencies (and no
> branches, i.e. fully out-of-order executable) to something on the order of ~15
> instructions with plenty of dependencies and at least two branches (call, ret).

In the case at hand it might even be worse, because the external function call will require registers to be spilled for the function call. Right now wait events "use" two registers (one for the wait event, one for my_wait_event_info), but otherwise don't cause additional spilling. With your change you'd see register spill/reload around both wait start and end.

Greetings,

Andres Freund
Re: [Proposal] Add accumulated statistics for wait event
Hi,

On 2021-06-14 16:10:32 +0200, Jehan-Guillaume de Rorthais wrote:
> In the patch in attachment, I tried to fix this by using kind of an internal
> hook for pgstat_report_wait_start and pgstat_report_wait_end. This allows to
> "instrument" wait events only when required, on the fly, dynamically.

That's *far worse*. You're adding an indirect function call. Which requires loading a global variable and then a far call to a different function. You're changing a path that's ~2 instructions with minimal dependencies (and no branches, i.e. fully out-of-order executable) to something on the order of ~15 instructions with plenty of dependencies and at least two branches (call, ret).

I doubt there's a path towards this feature without adding the necessary infrastructure to hot-patch the code - which is obviously quite a substantial project.

Greetings,

Andres Freund
Re: Race condition in recovery?
On 6/14/21 1:11 PM, Robert Haas wrote:
> On Mon, Jun 14, 2021 at 12:56 PM Andrew Dunstan wrote:
>> $^X is not at all broken.
>>
>> The explanation here is pretty simple - the argument to perl2host is
>> meant to be a directory. If we're going to accommodate plain files then
>> we have some more work to do in TestLib.
> This explanation seems to contradict the documentation in TestLib.pm,
> which makes no mention of any such restriction.

Here's a snippet:

sub perl2host
{
    my ($subject) = @_;
    ...
    if (chdir $subject)

Last time I looked you can't chdir to anything except a directory.

>>> +local $ENV{PERL_BADLANG}=0;
>>>
>>> Similarly here. There's not a single other reference to PERL_BADLANG
>>> in the repository, so if we need this one here, there should be a
>>> comment explaining why this is different from all the places where we
>>> don't need it.
>> Here's why this is different: this is the only place that we invoke the
>> msys perl in this way (i.e. from a non-msys aware environment - the
>> binaries we build are not msys-aware). We need to do that if for no
>> other reason than that it might well be the only perl available. Doing
>> so makes it complain loudly about missing locale info. Setting this
>> variable makes it shut up. I can add a comment on that if you like.
> Yes, please, but perhaps you'd like to post patches for discussion
> first instead of committing directly.

I was trying to get the buildfarm green again. There have been plenty of times when small patches for such fixes have been committed directly. And that's the only circumstance when I do.

cheers

andrew
Re: using extended statistics to improve join estimates
Hi,

Here's a slightly improved / cleaned up version of the PoC patch, removing a bunch of XXX and FIXMEs, adding comments, etc. The approach is sound in principle, I think, although there's still a bunch of things to address:

1) statext_compare_mcvs only really deals with equijoins / inner joins at the moment, as it's based on eqjoinsel_inner. It's probably desirable to add support for additional join types (inequality and outer joins).

2) Some of the steps are performed multiple times - e.g. matching base restrictions to statistics, etc. Those probably can be cached somehow, to reduce the overhead.

3) The logic of picking the statistics to apply is somewhat simplistic, and maybe could be improved in some way. OTOH the number of candidate statistics is likely low, so this is not a big issue.

4) statext_compare_mcvs is based on eqjoinsel_inner and makes a bunch of assumptions similar to the original, but some of those assumptions may be wrong in the multi-column case, particularly when working with a subset of columns. For example (ndistinct - size(MCV)) may not be the number of distinct combinations outside the MCV, when ignoring some columns. Same for nullfract, and so on. I'm not sure we can do much more than pick some reasonable approximation.

5) There are no regression tests at the moment. Clearly a gap.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c
index d263ecf082..dca1e7d34e 100644
--- a/src/backend/optimizer/path/clausesel.c
+++ b/src/backend/optimizer/path/clausesel.c
@@ -157,6 +157,23 @@ clauselist_selectivity_ext(PlannerInfo *root,
 											&estimatedclauses, false);
 	}
 
+	/*
+	 * Try applying extended statistics to joins. There's not much we can
+	 * do to detect when this makes sense, but we can check that there are
+	 * join clauses, and that at least some of the rels have stats.
+	 *
+	 * XXX Isn't this mutually exclusive with the preceding block which
+	 * calculates estimates for a single relation?
+	 */
+	if (use_extended_stats &&
+		statext_try_join_estimates(root, clauses, varRelid, jointype, sjinfo,
+								   &estimatedclauses))
+	{
+		s1 *= statext_clauselist_join_selectivity(root, clauses, varRelid,
+												  jointype, sjinfo,
+												  &estimatedclauses);
+	}
+
 	/*
 	 * Apply normal selectivity estimates for remaining clauses. We'll be
 	 * careful to skip any clauses which were already estimated above.
diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c
index b05e818ba9..d4cbbee785 100644
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -30,6 +30,7 @@
 #include "nodes/nodeFuncs.h"
 #include "optimizer/clauses.h"
 #include "optimizer/optimizer.h"
+#include "optimizer/pathnode.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
 #include "statistics/extended_stats_internal.h"
@@ -101,6 +102,16 @@
 static StatsBuildData *make_build_data(Relation onerel, StatExtEntry *stat,
 									   int numrows, HeapTuple *rows,
 									   VacAttrStats **stats, int stattarget);
+static bool stat_covers_expressions(StatisticExtInfo *stat, List *exprs,
+									Bitmapset **expr_idxs);
+
+static List *statext_mcv_get_conditions(PlannerInfo *root,
+										RelOptInfo *rel,
+										StatisticExtInfo *info);
+
+static bool *statext_mcv_eval_conditions(PlannerInfo *root, RelOptInfo *rel,
+										 StatisticExtInfo *stat, MCVList *mcv,
+										 Selectivity *sel);
 
 /*
  * Compute requested extended stats, using the rows sampled for the plain
@@ -1154,6 +1165,89 @@ has_stats_of_kind(List *stats, char requiredkind)
 	return false;
 }
 
+/*
+ * find_matching_mcv
+ *		Search for a MCV covering all the attributes.
+ *
+ * XXX Should consider both attnums and expressions. Also should consider
+ * additional restrictinfos as conditions (but only as optional).
+ *
+ * XXX The requirement that all the attributes need to be covered might be
+ * too strong. Maybe we could relax it a bit, and search for MCVs (on both
+ * sides of the join) with the largest overlap. But we don't really expect
+ * many candidate MCVs, so this simple approach seems sufficient.
+ */
+StatisticExtInfo *
+find_matching_mcv(PlannerInfo *root, RelOptInfo *rel, Bitmapset *attnums, List *exprs)
+{
+	ListCell   *l;
+	StatisticExtInfo *mcv = NULL;
+	List	   *stats = rel->statlist;
+
+	foreach(l, stats)
+	{
+		StatisticExtInfo *stat = (StatisticExtInfo *) lfirst(l);
+		List	   *conditions1 = NIL,
+				   *conditions2 = NIL;
+
+		/* We only care about MCV statistics here. */
+		if (stat->kind != STATS_EXT_MCV)
+			continue;
+
+		/*
+		 * Ignore MCVs not covering all the attributes/expressions.
+		 *
+		 * XXX Maybe we shouldn't be so strict and consider only partial
+		 * matches for join clauses too?
+		 */
+		if (!bms_is_subset(attnums, stat->keys) ||
+			!stat_covers_expressions(stat, exprs,
Re: PG 14 release notes, first draft
Hi Bruce, For this item: Allow BRIN indexes to use bloom filters (Tomas Vondra) This allows bloom indexes to be used effectively with data that is not physically localized in the heap. The text implies that this affects contrib/bloom. I think it should be "This allows BRIN indexes...". -- John Naylor EDB: http://www.enterprisedb.com
Re: recent failures on lorikeet
I wrote: > What that'd have to imply is that get_ps_display() messed up, > returning a bad pointer or a bad length. > A platform-specific problem in get_ps_display() seems plausible > enough. The apparent connection to a concurrent VACUUM FULL seems > pretty hard to explain that way ... but maybe that's a mirage. If I understand correctly that you're only seeing this in v13 and HEAD, then it seems like bf68b79e5 (Refactor ps_status.c API) deserves a hard look. regards, tom lane
Re: recent failures on lorikeet
Andrew Dunstan writes:
> The line in lmgr.c is where the process title gets changed to "waiting".
> I recently stopped setting process title on this animal on REL_13_STABLE
> and its similar errors have largely gone away.

Oooh, that certainly seems like a smoking gun.

> I can do the same on
> HEAD. But it does make me wonder what the heck has changed to make this
> code fragile.

So what we've got there is

    old_status = get_ps_display(&len);
    new_status = (char *) palloc(len + 8 + 1);
    memcpy(new_status, old_status, len);
    strcpy(new_status + len, " waiting");
    set_ps_display(new_status);
    new_status[len] = '\0'; /* truncate off " waiting" */

Line 1831 is the strcpy, but it seems entirely impossible that that could fail, unless palloc has shirked its job. I'm thinking that the crash is really in the memcpy --- looking at the other lines in your trace, fingering the line after the call seems common. What that'd have to imply is that get_ps_display() messed up, returning a bad pointer or a bad length.

A platform-specific problem in get_ps_display() seems plausible enough. The apparent connection to a concurrent VACUUM FULL seems pretty hard to explain that way ... but maybe that's a mirage.

regards, tom lane
Re: Question about StartLogicalReplication() error path
On Mon, Jun 14, 2021 at 12:50 PM Jeff Davis wrote: > It seems that there's not much agreement in a behavior change here. I > suggest one or more of the following: > > 1. change the logical rep protocol docs to match the current behavior > a. also briefly explain in the docs why it's different from > physical replication (which does always start at the provided LSN as > far as I can tell) > > 2. Change the comment to add something like "Starting at a different > LSN than requested might not catch certain kinds of client errors. > Clients should be careful to check confirmed_flush_lsn if starting at > the requested LSN is required." > > 3. upgrade DEBUG1 message to a WARNING > > Can I get agreement on any of the above suggestions? I'm happy to hear other opinions, but I think I would be inclined to vote in favor of #1 and/or #2 but against #3. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Race condition in recovery?
On Mon, Jun 14, 2021 at 12:56 PM Andrew Dunstan wrote: > $^X is not at all broken. > > The explanation here is pretty simple - the argument to perl2host is > meant to be a directory. If we're going to accomodate plain files then > we have some more work to do in TestLib. This explanation seems to contradict the documentation in TestLib.pm, which makes no mention of any such restriction. > > +local $ENV{PERL_BADLANG}=0; > > > > Similarly here. There's not a single other reference to PERL_BADLANG > > in the repository, so if we need this one here, there should be a > > comment explaining why this is different from all the places where we > > don't need it. > > Here's why this is different: this is the only place that we invoke the > msys perl in this way (i.e. from a non-msys aware environment - the > binaries we build are not msys-aware). We need to do that if for no > other reason than that it might well be the only perl available. Doing > so makes it complain loudly about missing locale info. Setting this > variable makes it shut up. I can add a comment on that if you like. Yes, please, but perhaps you'd like to post patches for discussion first instead of committing directly. > Part of the trouble is that I've been living and breathing some of these > issues so much recently that I forget that what might be fairly obvious > to me isn't to others. I assure you there is not the faintest touch of > pixy dust involved. Every pixie with whom I've spoken today says otherwise! -- Robert Haas EDB: http://www.enterprisedb.com
Re: PG 14 release notes, first draft
On 5/10/21 8:03 AM, Bruce Momjian wrote: > I have committed the first draft of the PG 14 release notes. You can > see the most current build of them here: > > https://momjian.us/pgsql_docs/release-14.html > > I need clarification on many items, and the document still needs its > items properly ordered, and markup added. I also expect a lot of > feedback. > > I plan to work on completing this document this coming week in > preparation for beta next week. > Sorry it took me a while to look at the release notes. I have one suggestion regarding this item: Allow logical replication to stream long in-progress transactions to subscribers (Tomas Vondra, Dilip Kumar, Amit Kapila, Ajin Cherian, Nikhil Sontakke, Stas Kelvich) AFAICS the authors are generally ordered by how much they contributed to the feature. In that case I'd move myself down the list - certainly after Dilip and Amit, perhaps after Ajin. While I posted the original patch, most of the work after that to get it committed was done by those two/three people. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Race condition in recovery?
On 6/14/21 11:52 AM, Robert Haas wrote: > On Sat, Jun 12, 2021 at 10:20 AM Tom Lane wrote: >> Andrew Dunstan writes: >>> I have pushed a fix, tested on a replica of fairywren/drongo, >> This bit seems a bit random: >> >> # WAL segment, this is enough to guarantee that the history file was >> # archived. >> my $archive_wait_query = >> - "SELECT '$walfile_to_be_archived' <= last_archived_wal FROM >> pg_stat_archiver;"; >> + "SELECT coalesce('$walfile_to_be_archived' <= last_archived_wal, false) " >> . >> + "FROM pg_stat_archiver"; >> $node_standby->poll_query_until('postgres', $archive_wait_query) >>or die "Timed out while waiting for WAL segment to be archived"; >> my $last_archived_wal_file = $walfile_to_be_archived; >> >> I wonder whether that is a workaround for the poll_query_until bug >> I proposed to fix at [1]. This has been reverted. > I found that a bit random too, but it wasn't the only part of the > patch I found a bit random. Like, what can this possibly be doing? > > +if ($^O eq 'msys') > +{ > +$perlbin = TestLib::perl2host(dirname($^X)) . '\\' . basename($^X); > +} > > The idea here is apparently that on msys, the directory name that is > part of $^X needs to be passed through perl2host, but the file name > that is part of the same $^X needs to NOT be passed through perl2host. > Is $^X really that broken? If so, I think some comments are in order. $^X is not at all broken. The explanation here is pretty simple - the argument to perl2host is meant to be a directory. If we're going to accommodate plain files then we have some more work to do in TestLib. > +local $ENV{PERL_BADLANG}=0; > > Similarly here. There's not a single other reference to PERL_BADLANG > in the repository, so if we need this one here, there should be a > comment explaining why this is different from all the places where we > don't need it. Here's why this is different: this is the only place that we invoke the msys perl in this way (i.e. 
from a non-msys aware environment - the binaries we build are not msys-aware). We need to do that if for no other reason than that it might well be the only perl available. Doing so makes it complain loudly about missing locale info. Setting this variable makes it shut up. I can add a comment on that if you like. > On those occasions when I commit TAP test cases, I do try to think > about whether they are going to be portable, but I find these kinds of > changes indistinguishable from magic. Part of the trouble is that I've been living and breathing some of these issues so much recently that I forget that what might be fairly obvious to me isn't to others. I assure you there is not the faintest touch of pixy dust involved. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Re: Question about StartLogicalReplication() error path
On Sat, 2021-06-12 at 16:17 +0530, Amit Kapila wrote: > AFAIU, currently, in such a case, the subscriber (client) won't > advance the flush location (confirmed_flush_lsn). So, we won't lose > any data. I think you are talking about the official Logical Replication specifically, rather than an arbitrary client that's using the logical replication protocol based on the protocol docs. It seems that there's not much agreement in a behavior change here. I suggest one or more of the following: 1. change the logical rep protocol docs to match the current behavior a. also briefly explain in the docs why it's different from physical replication (which does always start at the provided LSN as far as I can tell) 2. Change the comment to add something like "Starting at a different LSN than requested might not catch certain kinds of client errors. Clients should be careful to check confirmed_flush_lsn if starting at the requested LSN is required." 3. upgrade DEBUG1 message to a WARNING Can I get agreement on any of the above suggestions? Regards, Jeff Davis
Re: array_cat anycompatible change is breaking xversion upgrade tests (v14 release notes)
On Fri, Jun 11, 2021 at 09:40:07PM -0400, Bruce Momjian wrote: > On Fri, Jun 11, 2021 at 09:17:46PM -0400, Tom Lane wrote: > > Bruce Momjian writes: > > > OK, I used some of your ideas and tried for something more general; > > > patch attached. > > > > I think it's a good idea to mention custom aggregates and operators > > specifically, as otherwise people will look at this and have little > > idea what you're on about. I just want wording like "such as custom > > aggregates and operators", in case somebody has done some other creative > > thing that breaks. > > Agreed, updated patch attached. Patch applied. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Re: A new function to wait for the backend exit after termination
On Mon, Jun 14, 2021 at 09:40:27AM -0700, Noah Misch wrote: > > > Agreed, these strings generally give less detail. I can revert that to > > > the > > > v13 wording, 'terminate a server process'. ... > > Maybe you'd also update the release notes. > > What part of the notes did you expect to change that the patch did not change? Sorry, I didn't notice that your patch already adjusted the v14 notes. Note that Bharath also corrected your commit message to say "unable *to*", and revert the verbose pg_proc.dat descr change. Thanks, -- Justin
Re: recent failures on lorikeet
On 6/14/21 12:33 PM, Andrew Dunstan wrote: > > The line in lmgr.c is where the process title gets changed to "waiting". > I recently stopped setting process title on this animal on REL_13_STABLE > and its similar errors have largely gone away. I can do the same on > HEAD. But it does make me wonder what the heck has changed to make this > code fragile. Of course I meant the line (1831) in lock.c. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Re: A new function to wait for the backend exit after termination
On Sat, Jun 12, 2021 at 01:27:43PM -0500, Justin Pryzby wrote: > On Sat, Jun 12, 2021 at 08:21:39AM -0700, Noah Misch wrote: > > On Sat, Jun 12, 2021 at 12:12:12AM -0500, Justin Pryzby wrote: > > > Even if it's not removed, the descriptions should be cleaned up. > > > > > > | src/include/catalog/pg_proc.dat- descr => 'terminate a backend process > > > and if timeout is specified, wait for its exit or until timeout occurs', > > > => I think doesn't need to change or mention the optional timeout at all > > > > Agreed, these strings generally give less detail. I can revert that to the > > v13 wording, 'terminate a server process'. > > Maybe you'd also update the release notes. What part of the notes did you expect to change that the patch did not change? > I suggest some edits from the remaining parts of the original patch. These look good. I will push them when I push the other part.
Re: PG 14 release notes, first draft
On Fri, Jun 11, 2021 at 10:45:51PM -0500, Justin Pryzby wrote: > | Add Set Server Name Indication (SNI) for SSL connection packets (Peter > Eisentraut) > Remove "Set" > > | Reduce the default value of vacuum_cost_page_miss from 10 milliseconds to 2 > (Peter Geoghegan) > Peter mentioned that this should not say "milliseconds" (but maybe the page > I'm > looking at is old). > > | Cause vacuum operations to be aggressive if the table is near xid or > multixact wraparound (Masahiko Sawada, Peter Geoghegan) > Say "become aggressive" ? > > | Allow the arbitrary collations of partition boundary values (Tom Lane) > Remove "the" > > | Generate WAL invalidations message during command completion when using > logical replication (Dilip Kumar, Tomas Vondra, Amit Kapila) > invalidation messages > > | Add support for infinity and -infinity values to the numeric data type > (Tom Lane) > "-infinity" has markup but not "infinity" ? > > | Allow vacuum to deallocate space reserved by trailing unused heap line > pointers (Matthias van de Meent, Peter Geoghegan) > say "reclaim space" ? Some more: | VACUUM now has a PROCESS_TOAST which can be set to false to disable TOAST processing, and vacuumdb has a --no-process-toast option. has a process_toast *option | Previously, if the object already exists, EXPLAIN would fail. already existed | Function pg_stat_reset_replication_slot() resets slot statistics. *The function. But maybe it should be omitted. | New options are read-only, primary, standby, and prefer-standby. *The new options | Allow reindexdb to change the tablespace of the new index (Michael Paquier) | This is done by specifying --tablespace. I think this should be merged with the corresponding server feature, like this one: |Add ability to skip vacuuming of TOAST tables (Nathan Bossart) |VACUUM now has a PROCESS_TOAST which can be set to false to disable TOAST processing, and vacuumdb has a --no-process-toast option. Or, the client-side option could be omitted. 
This is distinguished from vacuumdb --no-index-cleanup and --no-truncate, for which the server support was added in v12, and the client support was essentially an omission. | Add documentation for the factorial() function (Peter Eisentraut) | With the removal of the ! operator in this release, factorial() is the only built-in way to compute a factorial. Could be omitted or collapsed into the other item. I know Tom thinks that it's unnecessary to document changes to documentation. -- Justin
Re: recent failures on lorikeet
On 6/14/21 9:39 AM, Tom Lane wrote: > Andrew Dunstan writes: >> I've been looking at the recent spate of intermittent failures on my >> Cygwin animal lorikeet. Most of them look something like this, where >> there's 'VACUUM FULL pg_class' and an almost simultaneous "CREATE TABLE' >> which fails. > Do you have any idea what "exit code 127" signifies on that platform? > (BTW, not all of them look like that; many are reported as plain > segfaults.) I hadn't spotted the association with a concurrent "VACUUM > FULL pg_class" before, that does seem interesting. > >> Getting stack traces in this platform can be very difficult. I'm going >> to try forcing complete serialization of the regression tests >> (MAX_CONNECTIONS=1) to see if the problem goes away. Any other >> suggestions might be useful. Note that we're not getting the same issue >> on REL_13_STABLE, where the same group of tests run together (inherit >> typed_table, vacuum) > If it does go away, that'd be interesting, but I don't see how it gets > us any closer to a fix. Seems like a stack trace is a necessity to > narrow it down. > > Some have given stack traces and some not, not sure why. The one from June 13 has this:

backtrace
?? ??:0
WaitOnLock src/backend/storage/lmgr/lock.c:1831
LockAcquireExtended src/backend/storage/lmgr/lock.c:1119
LockRelationOid src/backend/storage/lmgr/lmgr.c:135
relation_open src/backend/access/common/relation.c:59
table_open src/backend/access/table/table.c:43
ScanPgRelation src/backend/utils/cache/relcache.c:322
RelationBuildDesc src/backend/utils/cache/relcache.c:1039
RelationIdGetRelation src/backend/utils/cache/relcache.c:2045
relation_open src/backend/access/common/relation.c:59
table_open src/backend/access/table/table.c:43
ExecInitPartitionInfo src/backend/executor/execPartition.c:510
ExecPrepareTupleRouting src/backend/executor/nodeModifyTable.c:2311
ExecModifyTable src/backend/executor/nodeModifyTable.c:2559
ExecutePlan src/backend/executor/execMain.c:1557

The line in lmgr.c is where the process title gets changed to "waiting". I recently stopped setting process title on this animal on REL_13_STABLE and its similar errors have largely gone away. I can do the same on HEAD. But it does make me wonder what the heck has changed to make this code fragile. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Re: Use extended statistics to estimate (Var op Var) clauses
On 6/14/21 5:36 PM, Mark Dilger wrote: > > >> On Jun 13, 2021, at 1:28 PM, Tomas Vondra >> wrote: >> >> Here is a slightly updated version of the patch > > Thanks for taking this up again! > > Applying the new test cases from your patch, multiple estimates have > gotten better. That looks good. I wrote a few extra test cases and > saw no change, which is fine. I was looking for regressions where > the estimates are now worse than before. Do you expect there to be > any such cases? > Not really. These clauses could not be estimated before, we generally used default estimates for them. So WHERE a = b would use 0.5%, while WHERE a < b would use 33%, and so on. OTOH it depends on the accuracy of the extended statistics - particularly the MCV list (what fraction of the data it covers, etc.). So it's possible the default estimate is very accurate by chance, and the MCV list represents only a tiny fraction of the data. Then the new estimate could be worse. Consider for example this:

create table t (a int, b int);
insert into t select 100, 100 from generate_series(1,5000) s(i);
insert into t select i, i+1 from generate_series(1,995000) s(i);

This has exactly 0.5% of rows with (a=b). Without extended stats it's perfect:

explain analyze select * from t where a = b;
Seq Scan on t (cost=0.00..16925.00 rows=5000 width=8) (actual time=0.064..159.928 rows=5000 loops=1)

while with statistics it gets worse:

create statistics s (mcv) on a, b from t;
analyze t;
Seq Scan on t (cost=0.00..16925.00 rows=9810 width=8) (actual time=0.059..160.467 rows=5000 loops=1)

It's not terrible, although we could construct worse examples. But the same issue applies to other clauses, not just to these new ones. And it relies on the regular estimation producing a better estimate by chance. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
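The 0.5% and 33% defaults Tomas mentions correspond to PostgreSQL's DEFAULT_EQ_SEL and DEFAULT_INEQ_SEL constants. As a standalone sketch (the helper function names are invented for illustration; the real computation lives in the planner's selectivity code), this is how those defaults turn into the row counts in the example above:

```c
#include <assert.h>

/*
 * Fallback selectivities used when no usable statistics exist; the
 * values match PostgreSQL's DEFAULT_EQ_SEL and DEFAULT_INEQ_SEL.
 */
#define DEFAULT_EQ_SEL   0.005
#define DEFAULT_INEQ_SEL 0.3333333333333333

/* Estimated row count for WHERE a = b with no extended statistics. */
static long
default_eq_rows(long nrows)
{
    return (long) (nrows * DEFAULT_EQ_SEL + 0.5);
}

/* Estimated row count for WHERE a < b with no extended statistics. */
static long
default_ineq_rows(long nrows)
{
    return (long) (nrows * DEFAULT_INEQ_SEL + 0.5);
}
```

For the 1,000,000-row table above, the a = b default gives exactly 5000 rows, which is why the pre-statistics estimate happens to be perfect.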
Re: a path towards replacing GEQO with something better
On 6/14/21 1:16 PM, John Naylor wrote: > On Sun, Jun 13, 2021 at 9:50 AM Tomas Vondra > mailto:tomas.von...@enterprisedb.com>> > wrote: > >> > 2) We can still keep GEQO around (with some huge limit by default) for a >> > few years as an escape hatch, while we refine the replacement. If there >> > is some bug that prevents finding a plan, we can emit a WARNING and fall >> > back to GEQO. Or if a user encounters a regression in a big query, they >> > can lower the limit to restore the plan they had in an earlier version. >> > >> >> Not sure. Keeping significant amounts of code may not be free - both for >> maintenance and new features. It'd be a bit sad if someone proposed >> improvements to join planning, but had to do 2x the work to support it >> in both the DP and GEQO branches, or risk incompatibility. > > Looking back again at the commit history, we did modify geqo to support > partial paths and partition-wise join, so that's a fair concern. Right. I think the question is how complex those changes were. If it was mostly mechanical, it's not a big deal and we can keep both, but if it requires deeper knowledge of the GEQO inner workings it may be an issue (planner changes are already challenging enough). > My concern is the risk of plan regressions after an upgrade, even if > for a small number of cases. > I don't know. My impression/experience with GEQO is that getting a good join order for queries with many joins is often a matter of luck, and the risk of getting a poor plan just forces me to increase geqo_threshold or disable it altogether. Or just rephrase the query to nudge the planner to use a good join order (those large queries are often star or chain shaped, so it's not that difficult). So I'm not sure "GEQO accidentally produces a better plan for this one query" is a good argument to keep it around. We should probably evaluate the overall behavior, and then make a decision. 
FWIW I think we're facing this very dilemma for every optimizer change, to some extent. Every time we make the planner smarter by adding a new plan variant or heuristics, we're also increasing the opportunity for errors. And every time we look (or should look) at average behavior and worst case behavior ... >> OTOH maybe this concern is unfounded in practice - I don't think we've >> done very many big changes to geqo in the last few years. > > Yeah, I get the feeling that it's already de facto obsolete, and we > could make it a policy not to consider improvements aside from bug fixes > where it can't find a valid plan, or forced API changes. Which I guess > is another way of saying "deprecated". > > (I briefly considered turning it into a contrib module, but that seems > like the worst of both worlds.) > True. I'm fine with deprecating / not improving geqo. What would worry me is incompatibility, i.e. if geqo could not support some features. I'm thinking of storage engines in MySQL not supporting some features, leading to a mine field for users. Producing poor plans is fine, IMO. >> This reminds me the proposals to have a GUC that'd determine how much >> effort should the planner invest into various optimizations. For OLTP it >> might be quite low, for large OLAP queries it'd be economical to spend >> more time trying some more expensive optimizations. >> >> The challenge of course is how / in what units / to define the budget, >> so that it's meaningful and understandable for users. Not sure if >> "number of join rels generated" will be clear enough for users. But it >> seems good enough for PoC / development, and hopefully people won't have >> to tweak it very often. > > I'm also in favor of having some type of "planner effort" or "OLTP to > OLAP spectrum" guc, but I'm not yet sure whether it'd be better to have > it separate or to couple the joinrel budget to it. 
If we go that route, > I imagine there'll be many things that planner_effort changes that we > don't want to give a separate knob for. And, I hope with graceful > degradation and a good enough heuristic search, it won't be necessary to > change in most cases. > Yeah. I'm really worried about having a zillion separate GUC knobs for tiny parts of the code. That's impossible to tune, and it also exposes details about the implementation. And some people just can't resist touching all the available options ;-) The thing I like about JIT tunables is that it's specified in "cost" which the users are fairly familiar with. Having another GUC with an entirely different unit is not great. But as I said - it seems perfectly fine for PoC / development, and we can revisit that later. Adding some sort of "planner effort" or multiple optimization passes is a huge project on its own. >> I haven't read the [Kossmann & Stocker, 2000] paper yet, but the >> [Neumann, 2018] paper seems to build on it, and it seems to work with >> much larger subtrees of the join tree than k=5. > > Right, in particular it builds on "IDP-2" from Kossmann & Stocker. Okay, > so
Re: Transactions involving multiple postgres foreign servers, take 2
On Sun, Jun 13, 2021 at 10:04 PM tsunakawa.ta...@fujitsu.com wrote: > I know sending a commit request may get an error from various underlying > functions, but we're talking about the client side, not the Postgres's server > side that could unexpectedly ereport(ERROR) somewhere. So, the new FDW > commit routine won't lose control and can return an error code as its return > value. For instance, the FDW commit routine for DBMS-X would typically be: > > int > DBMSXCommit(...) > { > int ret; > > /* extract info from the argument to pass to xa_commit() */ > > ret = DBMSX_xa_commit(...); > /* This is the actual commit function which is exposed to the app > server (e.g. Tuxedo) through the xa_commit() interface */ > > /* map xa_commit() return values to the corresponding return values > of the FDW commit routine */ > switch (ret) > { > case XA_RMERR: > ret = ...; > break; > ... > } > > return ret; > } Well, we're talking about running this commit routine from within CommitTransaction(), right? So I think it is in fact running in the server. And if that's so, then you have to worry about how to make it respond to interrupts. You can't just call some functions DBMSX_xa_commit() and wait for infinite time for it to return. Look at pgfdw_get_result() for an example of what real code to do this looks like. > So, we need to design how commit behaves from the user's perspective. That's > the functional design. We should figure out what's the desirable response of > commit first, and then see if we can implement it or have to compromise in > some way. I think we can reference the X/Open TX standard and/or JTS (Java > Transaction Service) specification (I haven't had a chance to read them yet, > though.) Just in case we can't find the requested commit behavior in the > volcano case from those specifications, ... 
(I'm hesitant to say this because > it may be hard,) it's desirable to follow representative products such as > Tuxedo and GlassFish (the reference implementation of Java EE specs.) Honestly, I am not quite sure what any specification has to say about this. We're talking about what happens when a user does something with a foreign table and then types COMMIT. That's all about providing a set of behaviors that are consistent with how PostgreSQL works in other situations. You can't negotiate away the requirement to handle errors in a way that works with PostgreSQL's infrastructure, or the requirement that any lengthy operation handle interrupts properly, by appealing to a specification. > Concurrent transactions are serialized at the resolver. I heard that the > current patch handles 2PC like this: the TM (transaction manager in Postgres > core) requests prepare to the resolver, the resolver sends prepare to the > remote server and wait for reply, the TM gets back control from the resolver, > TM requests commit to the resolver, the resolver sends commit to the remote > server and wait for reply, and TM gets back control. The resolver handles > one transaction at a time. That sounds more like a limitation of the present implementation than a fundamental problem. We shouldn't reject the idea of having a resolver process handle this just because the initial implementation might be slow. If there's no fundamental problem with the idea, parallelism and concurrency can be improved in separate patches at a later time. It's much more important at this stage to reject ideas that are not theoretically sound. -- Robert Haas EDB: http://www.enterprisedb.com
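Robert's pointer to pgfdw_get_result() is about shape, not API: a commit routine run inside CommitTransaction() must not block indefinitely inside a foreign library call, but should poll the remote operation and give interrupts a chance between attempts. A standalone sketch of that pattern (the result codes, callback type, and fake poll function are all invented for illustration; real server code would use WaitLatch() and CHECK_FOR_INTERRUPTS() instead of this simulation):

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative result codes for a nonblocking remote-commit attempt. */
typedef enum { RC_WOULDBLOCK, RC_OK, RC_ERROR } rc_t;

typedef rc_t (*poll_fn) (void *state);

/*
 * Sketch of an interruptible commit: poll the remote operation instead of
 * blocking inside it, and check for interrupts between attempts.
 */
static rc_t
commit_with_interrupts(poll_fn poll, void *state, volatile bool *interrupted)
{
    for (;;)
    {
        if (*interrupted)
            return RC_ERROR;    /* real code: process the interrupt, abort cleanly */

        rc_t        rc = poll(state);

        if (rc != RC_WOULDBLOCK)
            return rc;
        /* real code: WaitLatch() on socket readiness rather than spinning */
    }
}

/* A fake remote that needs three polls before the commit completes. */
static rc_t
fake_poll(void *state)
{
    int        *calls = (int *) state;

    return (++(*calls) < 3) ? RC_WOULDBLOCK : RC_OK;
}
```

The key property is that control returns to the loop between polls, so a cancel request cannot leave the backend stuck inside DBMSX_xa_commit() forever.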
Re: Race condition in recovery?
On Sat, Jun 12, 2021 at 10:20 AM Tom Lane wrote: > Andrew Dunstan writes: > > I have pushed a fix, tested on a replica of fairywren/drongo, > > This bit seems a bit random: > > # WAL segment, this is enough to guarantee that the history file was > # archived. > my $archive_wait_query = > - "SELECT '$walfile_to_be_archived' <= last_archived_wal FROM > pg_stat_archiver;"; > + "SELECT coalesce('$walfile_to_be_archived' <= last_archived_wal, false) " . > + "FROM pg_stat_archiver"; > $node_standby->poll_query_until('postgres', $archive_wait_query) >or die "Timed out while waiting for WAL segment to be archived"; > my $last_archived_wal_file = $walfile_to_be_archived; > > I wonder whether that is a workaround for the poll_query_until bug > I proposed to fix at [1]. I found that a bit random too, but it wasn't the only part of the patch I found a bit random. Like, what can this possibly be doing? +if ($^O eq 'msys') +{ +$perlbin = TestLib::perl2host(dirname($^X)) . '\\' . basename($^X); +} The idea here is apparently that on msys, the directory name that is part of $^X needs to be passed through perl2host, but the file name that is part of the same $^X needs to NOT be passed through perl2host. Is $^X really that broken? If so, I think some comments are in order. +local $ENV{PERL_BADLANG}=0; Similarly here. There's not a single other reference to PERL_BADLANG in the repository, so if we need this one here, there should be a comment explaining why this is different from all the places where we don't need it. On those occasions when I commit TAP test cases, I do try to think about whether they are going to be portable, but I find these kinds of changes indistinguishable from magic. -- Robert Haas EDB: http://www.enterprisedb.com
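The coalesce() in the quoted hunk matters because, before any segment has been archived, pg_stat_archiver.last_archived_wal is NULL, and a comparison against NULL yields NULL rather than false. A standalone sketch of the three-valued logic involved (the enum and helper names are invented; this is not PostgreSQL code):

```c
#include <assert.h>
#include <string.h>

/* SQL-style three-valued boolean: comparisons with NULL yield NULL. */
typedef enum { B_FALSE, B_TRUE, B_NULL } bool3;

/* "'$walfile' <= last_archived_wal", where last_archived_wal may be NULL. */
static bool3
le_nullable(const char *walfile, const char *last_archived)
{
    if (last_archived == NULL)
        return B_NULL;
    return strcmp(walfile, last_archived) <= 0 ? B_TRUE : B_FALSE;
}

/* coalesce(x, false): force a definite 'f' instead of a NULL result. */
static bool3
coalesce_false(bool3 x)
{
    return (x == B_NULL) ? B_FALSE : x;
}
```

With the coalesce, the polled query always returns a real boolean, which is what a poll-until-true loop can safely test.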
Re: [bug?] Missed parallel safety checks, and wrong parallel safety
On Mon, Jun 14, 2021 at 2:32 AM Amit Kapila wrote: > Why do you think we don't need to check index AM functions? Say we > have an index expression that uses function and if its parallel safety > is changed then probably that also impacts whether we can do insert in > parallel. Because otherwise, we will end up executing some parallel > unsafe function in parallel mode during index insertion. I'm not saying that we don't need to check index expressions. I agree that we need to check those. The index AM functions are things like btint4cmp(). I don't think that a function like that should ever be parallel-unsafe. > Yeah, this could be one idea but I think even if we use pg_proc OID, > we still need to check all the rel cache entries to find which one > contains the invalidated OID and that could be expensive. I wonder > can't we directly find the relation involved and register invalidation > for the same? We are able to find the relation to which trigger > function is associated during drop function via findDependentObjects > by scanning pg_depend. Assuming, we are able to find the relation for > trigger function by scanning pg_depend, what kinds of problems do we > envision in registering the invalidation for the same? I don't think that finding the relation involved and registering an invalidation for the same will work properly. Suppose there is a concurrently-running transaction which has created a new table and attached a trigger function to it. You can't see any of the catalog entries for that relation yet, so you can't invalidate it, but invalidation needs to happen. Even if you used some snapshot that can see those catalog entries before they are committed, I doubt it fixes the race condition. 
You can't hold any lock on that relation, because the creating transaction holds AccessExclusiveLock, but the whole invalidation mechanism is built around the assumption that the sender puts messages into the shared queue first and then releases locks, while the receiver first acquires a conflicting lock and then processes messages from the queue. Without locks, that synchronization algorithm can't work reliably. As a consequence of all that, I believe that, not just in this particular case but in general, the invalidation message needs to describe the thing that has actually changed, rather than any derived property. We can make invalidations that say "some function's parallel-safety flag has changed" or "this particular function's parallel-safety flag has changed" or "this particular function has changed in some way" (this one, we have already), but anything that involves trying to figure out what the consequences of such a change might be and saying "hey, you, please update XYZ because I changed something somewhere that could affect that" is not going to be correct. > I think we probably need to worry about the additional cost to find > dependent objects and if there are any race conditions in doing so as > pointed out by Tom in his email [1]. The concern related to cost could > be addressed by your idea of registering such an invalidation only > when the user changes the parallel safety of the function which we > don't expect to be a frequent operation. Now, here the race condition, > I could think of could be that by the time we change parallel-safety > (say making it unsafe) of a function, some of the other sessions might > have already started processing insert on a relation where that > function is associated via trigger or check constraint in which case > there could be a problem. I think to avoid that we need to acquire an > Exclusive lock on the relation as we are doing in Rename Policy kind > of operations. 
Well, the big issue here is that we don't actually lock functions while they are in use. So there's absolutely nothing that prevents a function from being altered in any arbitrary way, or even dropped, while code that uses it is running. I don't really know what happens in practice if you do that sort of thing: can you get the same query to run with one function definition for the first part of execution and some other definition for the rest of execution? I tend to doubt it, because I suspect we cache the function definition at some point. If that's the case, caching the parallel-safety marking at the same point seems OK too, or at least no worse than what we're doing already. But on the other hand if it is possible for a query's notion of the function definition to shift while the query is in flight, then this is just another example of that and no worse than any other. Instead of changing the parallel-safety flag, somebody could redefine the function so that it divides by zero or produces a syntax error and kaboom, running queries break. Either way, I don't see what the big deal is. As long as we make the handling of parallel-safety consistent with other ways the function could be concurrently redefined, it won't suck any more than the current system already does, or in any fundamentally
Re: Use extended statistics to estimate (Var op Var) clauses
> On Jun 13, 2021, at 1:28 PM, Tomas Vondra > wrote: > > Here is a slightly updated version of the patch Thanks for taking this up again! Applying the new test cases from your patch, multiple estimates have gotten better. That looks good. I wrote a few extra test cases and saw no change, which is fine. I was looking for regressions where the estimates are now worse than before. Do you expect there to be any such cases? — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Delegating superuser tasks to new security roles
> On Jun 14, 2021, at 5:51 AM, torikoshia wrote: > > Thanks for working on this topic, I appreciate it! Thank you for taking a look! > BTW, do these patches enable non-superusers to create user with > bypassrls? No, I did not break out the ability to create such users. > Since I failed to apply the patches and didn't test them, > I may have overlooked something but I didn't find the > corresponding codes. Do you believe that functionality should be added? I have not thought much about that issue. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Replication protocol doc fix
On Fri, Jun 11, 2021 at 6:12 PM Jeff Davis wrote: > On Fri, 2021-06-11 at 16:57 -0400, Robert Haas wrote: > > My impression was that CopyBoth can be initiated either way, > > The docs say: "In either physical replication or logical replication > walsender mode, only the simple query protocol can be used." Is there > some way to initiate CopyBoth outside of walsender? Currently, no, or at least not to my knowledge. I just meant that there seems to be nothing in the protocol specification which prevents CopyBothResponse from being sent in response to a query sent using the extended protocol. > > but if > > you use the extended query protocol, then the result is a hopeless > > mess, because the protocol is badly designed: > > > https://www.postgresql.org/message-id/ca+tgmoa4ea+cpxaigqmebp9xisvd3ze9dbvnbzevx9ucmiw...@mail.gmail.com > > It seems like you're saying that CopyIn and CopyBoth are both equally > broken in extended query mode. Is that right? Yeah. > > But I think you're correct in saying that the discard-until-Sync > > behavior only happens if the extended query protocol is used, so I > > agree that the current text is wrong. > > Should we just document how CopyBoth works with the simple query > protocol, or should we make it match the CopyIn docs? I think it would make sense to make it match the CopyIn docs. Possibly the CopyOut docs should be made more similar as well. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Failure in subscription test 004_sync.pl
On 2021-Jun-14, Amit Kapila wrote:
> On Mon, Jun 14, 2021 at 10:41 AM Masahiko Sawada wrote:
> > The condition should be the opposite; we should raise the error when
> > 'nowait' is true. I think this is the cause of the test failure. Even
> > if DROP SUBSCRIPTION tries to drop the slot with the WAIT option, we
> > don't wait but raise the error.
>
> Yes, this should fix the recent buildfarm failures. Alvaro, would you
> like to take care of this?

Ugh, thanks for CCing me. Yes, I'll get this fixed ASAP.

--
Álvaro Herrera        Valdivia, Chile
"Take the book your religion considers the right one to find the prayer
that brings peace to your soul. Then reboot the computer and see if it
works" (Carlos Duclós)
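The polarity fix Sawada-san describes — raise only when 'nowait' is true — can be sketched in isolation. This is an illustrative sketch with hypothetical names, not the actual slot-drop code:

```python
def wait_until_free():
    # Stand-in for blocking until the slot's owner releases it.
    pass

def acquire_slot(busy, nowait):
    # Correct polarity per the fix: error out only when the caller asked
    # not to wait (nowait=True) and the slot is in use; otherwise block.
    # The reported bug had this condition inverted, erroring on the
    # waiting path instead.
    if busy:
        if nowait:
            raise RuntimeError("replication slot is active for another process")
        wait_until_free()
    return "acquired"
```

With the inverted test, a DROP SUBSCRIPTION that asked to wait would have raised immediately, which matches the buildfarm symptom.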
Re: [Proposal] Add accumulated statistics for wait event
Hi Andres, hi all,

First, thank you for your feedback!

Please find attached a patch implementing accumulated wait event stats from the backend point of view only. As I wrote when I reviewed and rebased the existing patch, I was uncomfortable with the global approach. I still volunteer to work on/review the original approach if required. See below for comments and some more explanations about what I think might be improvements over the previous patch.

On Fri, 11 Jun 2021 12:18:07 -0700 Andres Freund wrote:
> On 2021-06-05 00:53:44 +0200, Jehan-Guillaume de Rorthais wrote:
> > From 88c2779679c5c9625ca5348eec0543daab5ccab4 Mon Sep 17 00:00:00 2001
> > From: Jehan-Guillaume de Rorthais
> > Date: Tue, 1 Jun 2021 13:25:57 +0200
> > Subject: [PATCH 1/2] Add pg_stat_waitaccum view.
> >
> > pg_stat_waitaccum shows counts and duration of each wait events.
> > Each backend/backgrounds counts and measures the time of wait event
> > in every pgstat_report_wait_start and pgstat_report_wait_end. They
> > store those info into their local variables and send to Statistics
> > Collector. We can get those info via Statistics Collector.
> >
> > For reducing overhead, I implemented statistic hash instead of
> > dynamic hash. I also implemented track_wait_timing which
> > determines wait event duration is collected or not.
>
> I object to adding this overhead. The whole selling point for wait
> events was that they are low overhead. I since spent time reducing the
> overhead further, because even just the branches for checking if
> track_activity is enabled are measurable (225a22b19ed).

Agreed. The previous patch I rebased was meant to review it and reopen this discussion; I even added a small FIXME in pgstat_report_wait_end and pgstat_report_wait_start about your work:

  //FIXME: recent patch to speed up this call.

In the attached patch, I tried to fix this by using a kind of internal hook for pgstat_report_wait_start and pgstat_report_wait_end.
This allows wait events to be "instrumented" only when required, on the fly, dynamically. Moreover, I replaced the hash structure with a simple static array for faster access.

> > From ddb1adc5cd9acc9bc9de16d0cf057124b09fe1e3 Mon Sep 17 00:00:00 2001
> > From: Jehan-Guillaume de Rorthais
> > Date: Fri, 4 Jun 2021 18:14:51 +0200
> > Subject: [PATCH 2/2] [POC] Change measuring method of wait event time
> >  from INSTR_TIME to rdtsc.
> >
> > This patch changes measuring method of wait event time from INSTR_TIME
> > (which uses gettimeofday or clock_gettime) to rdtsc. This might reduce
> > the overhead of measuring overhead.
> >
> > Any supports like changing clock cycle to actual time or error handling
> > are not currently implemented.
>
> rdtsc is a serializing (*) instruction - that's the expensive part. On linux
> clock_gettime() doesn't actually need a syscall. While the vdso call
> implies a bit of overhead over a raw rdtsc, it's a relatively small part
> of the overhead. See
> https://www.postgresql.org/message-id/20200612232810.f46nbqkdhbutzqdg%40alap3.anarazel.de

I chose to remove the whole rdtsc part from my patch, as it wasn't clear how much faster it is compared to the simpler vDSO functions, or how to accurately derive a human-readable time from it.

About my take on $subject: for the sake of simplicity of this PoC, I added the instrumentation to log_statement_stats. Despite the per-query context of the reported log, these really are accumulated stats. The patch also updates pg_stat_get_waitaccum() to be able to report the accumulated wait events from your interactive or batch session.

So, using my previous fe-time demo client, you can test it with:

  PGOPTIONS="--log_statement_stats=on" ./fe-time 100

From the logs, I now have (notice the last line):

LOG:  duration: 3837.194 ms  statement: SELECT * FROM pgbench_accounts
LOG:  QUERY STATISTICS
DETAIL:  ! system usage stats:
!       0.087444 s user, 0.002106 s system, 3.837202 s elapsed
!       [0.087444 s user, 0.003974 s system total]
!       25860 kB max resident size
!       0/0 [0/0] filesystem blocks in/out
!       0/303 [0/697] page faults/reclaims, 0 [0] swaps
!       0 [0] signals rcvd, 0/0 [0/0] messages rcvd/sent
!       4/18 [5/18] voluntary/involuntary context switches
!       Client/ClientWrite 4 calls, 3747102 us elapsed

Using pgbench scale factor 10, the copy query for pgbench_accounts looks like:

LOG:  duration: 2388.081 ms  statement: copy pgbench_accounts from stdin
LOG:  QUERY STATISTICS
DETAIL:  ! system usage stats:
!       1.373756 s user, 0.252860 s system, 2.388100 s elapsed
!       [1.397015 s user, 0.264951 s system total]
!       37788 kB max resident size
!       0/641584 [0/642056] filesystem blocks in/out
!       194/4147 [195/4728] page faults/reclaims, 0 [0] swaps
!       0 [0] signals rcvd, 0/0 [0/0] messages rcvd/sent
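Andres's point in this thread — that clock_gettime() on Linux is typically served from the vDSO rather than a real syscall — can be eyeballed from userspace. A rough micro-timing sketch via Python's wrapper of the same libc call (Unix-only; absolute numbers are machine-dependent assumptions, not measurements from the patch):

```python
import time

# time.clock_gettime() wraps the libc clock_gettime() discussed above.
# On Linux, CLOCK_MONOTONIC reads usually go through the vDSO, so each
# call costs tens of nanoseconds instead of a full syscall round-trip.
N = 100_000
start = time.clock_gettime(time.CLOCK_MONOTONIC)
for _ in range(N):
    time.clock_gettime(time.CLOCK_MONOTONIC)
end = time.clock_gettime(time.CLOCK_MONOTONIC)
per_call_ns = (end - start) / N * 1e9
print(f"~{per_call_ns:.0f} ns per clock_gettime call")
```

A raw rdtsc would still be cheaper per read, but as noted above it needs cycle-to-time conversion and careful handling to be usable.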
Re: A new function to wait for the backend exit after termination
On Sat, Jun 12, 2021 at 10:07 AM Noah Misch wrote:
> On Fri, Jun 11, 2021 at 08:54:08PM -0500, Justin Pryzby wrote:
> > On Sat, Jun 05, 2021 at 12:08:01PM -0700, Noah Misch wrote:
> > > My preference is to remove pg_wait_for_backend_termination(). The
> > > use case that prompted this thread used pg_terminate_backend(pid, 18);
> > > it doesn't need pg_wait_for_backend_termination().
> >
> > Is this an Opened Issue ?
>
> An Open Item? Not really, since there's no objective defect. Nonetheless,
> the attached is what I'd like to use.

Thanks. +1 to remove the pg_wait_for_backend_termination function. The patch basically looks good to me. I'm attaching an updated patch: I corrected a minor typo in the commit message and took the docs and code comment changes suggested by Justin Pryzby.

Please note that the release notes still have the headline "Add functions to wait for backend termination" from the original commit that added pg_wait_for_backend_termination. With its removal, I'm not quite sure whether we retain that commit message or tweak it to something like "Add optional timeout parameter to pg_terminate_backend".

With Regards,
Bharath Rupireddy.

Attachment: v2-0001-Remove-pg_wait_for_backend_termination-function.patch
pg_dumpall --exclude-database case folding
[discussion transferred from psql-performance]

Summary: pg_dumpall and pg_dump fold non-quoted command-line patterns to lower case

Tom Lane writes:

Andrew Dunstan writes:
> On 6/10/21 2:23 PM, Andrew Dunstan wrote:
>> Ouch. That looks like a plain old bug. Let's fix it. IIRC I just used
>> the same logic that we use for pg_dump's --exclude-* options, so we need
>> to check if they have similar issues.

> Peter Eisentraut has pointed out to me that this is documented, albeit a
> bit obscurely for pg_dumpall. But it is visible on the pg_dump page.

Hmm.

> Nevertheless, it's a bit of a POLA violation as we've seen above, and
> I'd like to get it fixed, if there's agreement, both for this pg_dumpall
> option and for pg_dump's pattern matching options.

+1, but the -performance list isn't really where to hold that discussion. Please start a thread on -hackers.

regards, tom lane
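The behavior summarized above follows SQL identifier rules: outside double quotes the pattern is folded to lower case, the way the server folds an unquoted identifier. A simplified sketch of that folding rule — not the actual processSQLNamePattern implementation, which also handles regex metacharacters and doubled quotes:

```python
def fold_pattern(pattern: str) -> str:
    # Letters outside double quotes are downcased, matching how an
    # unquoted SQL identifier is folded; inside "..." case is preserved
    # and the quote characters themselves are consumed.
    out = []
    in_quotes = False
    for ch in pattern:
        if ch == '"':
            in_quotes = not in_quotes
        else:
            out.append(ch if in_quotes else ch.lower())
    return ''.join(out)
```

So `--exclude-database=TESTDB` behaves like `testdb`, while `--exclude-database='"TESTDB"'` keeps the case — the POLA violation under discussion.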
Re: recent failures on lorikeet
Andrew Dunstan writes:
> I've been looking at the recent spate of intermittent failures on my
> Cygwin animal lorikeet. Most of them look something like this, where
> there's 'VACUUM FULL pg_class' and an almost simultaneous 'CREATE TABLE'
> which fails.

Do you have any idea what "exit code 127" signifies on that platform?

(BTW, not all of them look like that; many are reported as plain segfaults.)

I hadn't spotted the association with a concurrent "VACUUM FULL pg_class" before; that does seem interesting.

> Getting stack traces on this platform can be very difficult. I'm going
> to try forcing complete serialization of the regression tests
> (MAX_CONNECTIONS=1) to see if the problem goes away. Any other
> suggestions might be useful. Note that we're not getting the same issue
> on REL_13_STABLE, where the same group of tests runs together (inherit,
> typed_table, vacuum).

If it does go away, that'd be interesting, but I don't see how it gets us any closer to a fix. Seems like a stack trace is a necessity to narrow it down.

regards, tom lane
Re: unnesting multirange data types
On Mon, Jun 14, 2021 at 3:50 PM Jonathan S. Katz wrote:
> On 6/13/21 5:18 PM, Alexander Korotkov wrote:
> >> "Expands an array into a set of rows. The array's elements are read out
> >> in storage order."
> >>
> >> If we tweaked the multirange "unnest" function, we could change it to:
> >>
> >> +
> >> +Expands a multirange into a set of rows.
> >> +The ranges are read out in storage order (ascending).
> >> +
> >>
> >> to match the array "unnest" function docs, or
> >>
> >> +
> >> +Expands a multirange into a set of rows that each
> >> +contain an individual range.
> >> +The ranges are read out in storage order (ascending).
> >> +
> >>
> >> to be a bit more specific. However, I think this is also bordering on
> >> overengineering the text, given there has been a lack of feedback on the
> >> "unnest" array function description being confusing.
> >
> > I think it's not necessary to talk about rows here. Our
> > documentation already has a number of examples where we
> > describe a set of returned values without speaking about rows, including:
> > json_array_elements, json_array_elements_text, json_object_keys,
> > pg_listening_channels, pg_tablespace_databases...
>
> I do agree -- my main point was that I don't think we need to change
> anything. I proposed alternatives just to show some other ways of
> looking at it. But as I mentioned, at this point I think it's
> overengineering the text.
>
> If folks are good with the method + code, I think this is ready.

Cool, thank you for the summary. I'll wait for two days since I've published the last revision of the patch [1] (that comes tomorrow), and push it if no new issues arise.

Links
1. https://www.postgresql.org/message-id/CAPpHfdvG%3DJR5kqmZx7KvTyVgtQePX0QJ09TO1y3sN73WOfJf1Q%40mail.gmail.com

--
Regards,
Alexander Korotkov
Re: contrib/pg_visibility fails regression under CLOBBER_CACHE_ALWAYS
On Mon, Jun 7, 2021 at 10:01 PM Tomas Vondra wrote:
>
> On 6/7/21 2:11 PM, Masahiko Sawada wrote:
> > On Mon, Jun 7, 2021 at 4:30 PM Masahiko Sawada wrote:
> >>
> >> On Mon, Jun 7, 2021 at 11:15 AM Tom Lane wrote:
> >>>
> >>> husky just reported $SUBJECT:
> >>>
> >>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=husky=2021-06-05%2013%3A42%3A17
> >>>
> >>> and I find I can reproduce that locally:
> >>>
> >>> diff -U3 /home/postgres/pgsql/contrib/pg_visibility/expected/pg_visibility.out /home/postgres/pgsql/contrib/pg_visibility/results/pg_visibility.out
> >>> --- /home/postgres/pgsql/contrib/pg_visibility/expected/pg_visibility.out  2021-01-20 11:12:24.854346717 -0500
> >>> +++ /home/postgres/pgsql/contrib/pg_visibility/results/pg_visibility.out   2021-06-06 22:12:07.948890104 -0400
> >>> @@ -215,7 +215,8 @@
> >>>   0 | f | f
> >>>   1 | f | f
> >>>   2 | t | t
> >>> -(3 rows)
> >>> + 3 | t | t
> >>> +(4 rows)
> >>>
> >>>  select * from pg_check_frozen('copyfreeze');
> >>>   t_ctid
> >>> @@ -235,7 +236,8 @@
> >>>   0 | t | t
> >>>   1 | f | f
> >>>   2 | t | t
> >>> -(3 rows)
> >>> + 3 | t | t
> >>> +(4 rows)
> >>>
> >>>  select * from pg_check_frozen('copyfreeze');
> >>>   t_ctid
> >>>
> >>> The test cases that are failing date back to January (7db0cd2145f),
> >>> so I think this is some side-effect of a recent commit, but I have
> >>> no idea which one.
> >>
> >> It seems like the recent revert (8e03eb92e9a) is relevant.
> >>
> >> After committing 7db0cd2145f we had the same regression test failure
> >> in January[1]. Then we fixed that issue by 39b66a91b. But since we
> >> recently reverted most of 39b66a91b, the same issue happened again.
> >
> > So the cause of this failure seems the same as before. The failed test is:
> >
> > begin;
> > truncate copyfreeze;
> > copy copyfreeze from stdin freeze;
> > 1       '1'
> > 2       '2'
> > 3       '3'
> > 4       '4'
> > 5       '5'
> > \.
> > copy copyfreeze from stdin;
> > 6       '6'
> > \.
> > copy copyfreeze from stdin freeze;
> > 7       '7'
> > 8       '8'
> > 9       '9'
> > 10      '10'
> > 11      '11'
> > 12      '12'
> > \.
> > commit;
> >
> > If the target block cache is invalidated before the third COPY, we
> > will start to insert the frozen tuple into a new page, resulting in
> > adding two blocks in total during the third COPY. I think we still
> > need the following part of the reverted code so that we don't leave
> > the page partially empty after relcache invalidation:
> >
> > --- a/src/backend/access/heap/hio.c
> > +++ b/src/backend/access/heap/hio.c
> > @@ -407,19 +407,19 @@ RelationGetBufferForTuple(Relation relation, Size len,
> >        * target.
> >        */
> >       targetBlock = GetPageWithFreeSpace(relation, targetFreeSpace);
> > -     }
> >
> > -     /*
> > -      * If the FSM knows nothing of the rel, try the last page before we give
> > -      * up and extend. This avoids one-tuple-per-page syndrome during
> > -      * bootstrapping or in a recently-started system.
> > -      */
> > -     if (targetBlock == InvalidBlockNumber)
> > -     {
> > -             BlockNumber nblocks = RelationGetNumberOfBlocks(relation);
> > +             /*
> > +              * If the FSM knows nothing of the rel, try the last page before we
> > +              * give up and extend. This avoids one-tuple-per-page syndrome during
> > +              * bootstrapping or in a recently-started system.
> > +              */
> > +             if (targetBlock == InvalidBlockNumber)
> > +             {
> > +                     BlockNumber nblocks = RelationGetNumberOfBlocks(relation);
> >
> > -     if (nblocks > 0)
> > -             targetBlock = nblocks - 1;
> > +                     if (nblocks > 0)
> > +                             targetBlock = nblocks - 1;
> > +             }
> >       }
> >
> > Attached the patch that brings back the above change.
>
> Thanks for the analysis! I think you're right - this bit should have
> been kept. Partial reverts are tricky :-(
>
> I'll get this fixed / pushed later today, after a bit more testing. I'd
> swear I ran tests with CCA, but it's possible I skipped contrib.

I had missed this mail. Thank you for pushing the fix!

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
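The fallback that the restored hunk implements — consult the FSM first, then try the relation's last existing page before extending — can be sketched in miniature. The simplified signature is hypothetical; the real logic lives in RelationGetBufferForTuple:

```python
def choose_target_block(fsm_suggestion, nblocks):
    # Mirrors the hunk quoted above: if the FSM knows nothing of the rel
    # (no suggestion), try the last existing page before giving up and
    # extending. This avoids one-tuple-per-page syndrome, and here it keeps
    # COPY FREEZE from abandoning a partially filled page after a relcache
    # invalidation discards the cached target block.
    if fsm_suggestion is not None:
        return fsm_suggestion
    if nblocks > 0:
        return nblocks - 1   # last existing page
    return None              # caller must extend the relation
```

Without the fallback, the third COPY in the failing test starts on a fresh page, which is why the visibility-map output gained an extra block.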
Position of ClientAuthentication hook
Hello hackers,

I have a doubt regarding the positioning of the ClientAuthentication hook in the function ClientAuthentication. In particular, in the hba error cases, i.e. uaReject or uaImplicitReject, it errors out before the hook is reached, so no functions attached to the authentication hook are called. Can't we process the hook function first and then error out?

--
Regards,
Rafia Sabih
Re: Delegating superuser tasks to new security roles
On 2021-05-26 05:33, Mark Dilger wrote:
> On May 13, 2021, at 12:30 PM, Mark Dilger wrote:
>> On May 13, 2021, at 12:18 PM, Jacob Champion wrote:
>>> On Thu, 2021-05-13 at 11:42 -0700, Mark Dilger wrote:
>>>> The distinction that Theme+Security would make is that capabilities
>>>> can be categorized by the area of the system:
>>>>   -- planner
>>>>   -- replication
>>>>   -- logging
>>>> ... but also by the security implications of what is being done:
>>>>   -- host
>>>>   -- schema
>>>>   -- network
>>>
>>> Since the "security" buckets are being used for both proposals -- how
>>> would you deal with overlap between them? When a GUC gives you enough
>>> host access to bleed into the schema and network domains, does it get
>>> all three attributes assigned to it, and thus require membership in
>>> all three roles?
>>
>> Yeah, from a security standpoint, pg_host_admin basically gives
>> everything away. I doubt service providers would give the "host" or
>> "network" security to their tenants, but they would probably consider
>> giving "schema" security to the tenants.
>>
>>> (Thanks, by the way, for this thread -- I think a "capability system"
>>> for superuser access is a great idea.)
>>
>> I am happy to work on this, and appreciate feedback.
>
> Please find attached five new patches, each intended to reduce the
> number of administrative tasks that require superuser privileges.
>
> v3-0001 adds a new pg_logical_replication role with permission to
> manage publications and subscriptions.
>
> v3-0002 adds a new pg_host_security role with permission to manage
> extensions, event triggers and tablespaces.
>
> v3-0003 adds a new pg_network_security role with permission to manage
> foreign servers and data wrappers.
>
> v3-0004 adds a new pg_database_security role with permission to
> perform many actions that would otherwise require superuser, so long
> as those actions do not compromise the security of the host or
> network. This role, along with pg_logical_replication, is intended to
> be safe to delegate to the tenant of a database provided as a service.
>
> v3-0005 associates all GUC variables with security roles and allows
> both SET and ALTER SYSTEM SET on those variables by users belonging to
> the necessary security role(s). This patch extends the significance of
> the pg_host_security, pg_network_security, and pg_database_security
> roles added in the previous patches, as those roles are associated
> with GUC variables that implicate the same security concerns.
>
> These patches likely still need some adjustment, as there are a large
> number of security-relevant permission decisions in here which some
> hackers may debate, but I think these are mature enough to solicit
> feedback. I admit right upfront that the regression tests
> guc_priv_admin and guc_priv_tenant in v3-0005 could be made to cover a
> subset of GUC variables rather than the full set of them, but I'm
> delaying pruning them down until I know if the rest of the patches are
> basically acceptable.

Thanks for working on this topic, I appreciate it!

BTW, do these patches enable non-superusers to create a user with bypassrls? Since I failed to apply the patches and didn't test them, I may have overlooked something, but I didn't find the corresponding code.

Regards,

--
Atsushi Torikoshi
NTT DATA CORPORATION
Re: unnesting multirange data types
On 6/13/21 5:18 PM, Alexander Korotkov wrote:
>> "Expands an array into a set of rows. The array's elements are read out
>> in storage order."
>>
>> If we tweaked the multirange "unnest" function, we could change it to:
>>
>> +
>> +Expands a multirange into a set of rows.
>> +The ranges are read out in storage order (ascending).
>> +
>>
>> to match the array "unnest" function docs, or
>>
>> +
>> +Expands a multirange into a set of rows that each
>> +contain an individual range.
>> +The ranges are read out in storage order (ascending).
>> +
>>
>> to be a bit more specific. However, I think this is also bordering on
>> overengineering the text, given there has been a lack of feedback on the
>> "unnest" array function description being confusing.
>
> I think it's not necessary to talk about rows here. Our
> documentation already has a number of examples where we
> describe a set of returned values without speaking about rows, including:
> json_array_elements, json_array_elements_text, json_object_keys,
> pg_listening_channels, pg_tablespace_databases...

I do agree -- my main point was that I don't think we need to change anything. I proposed alternatives just to show some other ways of looking at it. But as I mentioned, at this point I think it's overengineering the text.

If folks are good with the method + code, I think this is ready.

Thanks,

Jonathan
Re: Fdw batch insert error out when set batch_size > 65535
On Mon, Jun 14, 2021, 5:33 PM Bharath Rupireddy <bharath.rupireddyforpostg...@gmail.com> wrote:
> Okay. Here are the readings on my dev system:
> 1) on master with the existing test case with inserting 70K rows:
>    4263200 ms (71.05 min)
> 2) with Tomas's patch with the test case modified with 1500 table
>    columns and 50 rows (the majority of the time, ~30 min, it took in the
>    SELECT create_batch_tables(1500); statement; I measured this time
>    manually looking at the start and end time of the statement):
>    6649312 ms (110.8 min)
> 3) with my patch with the test case modified with 1600 table columns and
>    41 rows: 4003007 ms (66.71 min)
> 4) on master without the test case at all: 3770722 ms (62.84 min)

I forgot to mention one thing: I ran the above tests with CLOBBER_CACHE_ALWAYS.

Regards,
Bharath Rupireddy.
recent failures on lorikeet
I've been looking at the recent spate of intermittent failures on my Cygwin animal lorikeet. Most of them look something like this, where there's 'VACUUM FULL pg_class' and an almost simultaneous 'CREATE TABLE' which fails:

2021-06-14 05:04:00.220 EDT [60c71b7f.e8bf:60] pg_regress/vacuum LOG:  statement: VACUUM FULL pg_class;
2021-06-14 05:04:00.222 EDT [60c71b80.e8c0:7] pg_regress/typed_table LOG:  statement: CREATE TABLE persons OF person_type;
2021-06-14 05:04:00.232 EDT [60c71b80.e8c1:3] pg_regress/inherit LOG:  statement: CREATE TABLE a (aa TEXT);
*** starting debugger for pid 59584, tid 9640
2021-06-14 05:04:14.678 EDT [60c71b53.e780:4] LOG:  server process (PID 59584) exited with exit code 127
2021-06-14 05:04:14.678 EDT [60c71b53.e780:5] DETAIL:  Failed process was running: CREATE TABLE persons OF person_type;
2021-06-14 05:04:14.678 EDT [60c71b53.e780:6] LOG:  terminating any other active server processes

Getting stack traces on this platform can be very difficult. I'm going to try forcing complete serialization of the regression tests (MAX_CONNECTIONS=1) to see if the problem goes away. Any other suggestions might be useful. Note that we're not getting the same issue on REL_13_STABLE, where the same group of tests runs together (inherit, typed_table, vacuum).

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
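On the "exit code 127" in the log above: in POSIX shells that status conventionally means "command not found", though for a crashed Cygwin backend it may simply be the raw wait status rather than anything meaningful. A quick illustration of the shell convention (an aside, not a diagnosis):

```python
import subprocess

# POSIX shells reserve exit status 127 for "command not found"; a backend
# exiting with 127, as in the log above, could instead be an artifact of
# how Cygwin reports the wait status of a crashed process.
proc = subprocess.run(["sh", "-c", "definitely_not_a_real_command_xyz"],
                      capture_output=True)
print(proc.returncode)
```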
Re: RFC: Logging plan of the running query
On 2021-06-11 01:20, Bharath Rupireddy wrote:

Thanks for your review!

On Wed, Jun 9, 2021 at 1:14 PM torikoshia wrote:
>> Updated the patch.
>
> Thanks for the patch. Here are some comments on the v3 patch:
>
> 1) We could just say "Requests to log query plan of the presently
> running query of a given backend along with an untruncated query string
> in the server logs." instead of:
>
>   +They will be logged at LOG message level and
>   +will appear in the server log based on the log
>   +configuration set (See linkend="runtime-config-logging"/>

Actually, this explanation is almost the same as that of pg_log_backend_memory_contexts(). Do you think we should change both of them? I think it may be too detailed, but accurate.

> 2) It's better to do the below; for reference see how pg_backend_pid,
> pg_terminate_backend, pg_relpages and so on are used in the tests:
>
>   SELECT pg_log_current_plan(pg_backend_pid());
>
> rather than using the function in the FROM clause:
>
>   SELECT * FROM pg_log_current_plan(pg_backend_pid());
>
> If okay, also change it for pg_log_backend_memory_contexts.

Agreed.

> 3) Since most of the code looks the same in pg_log_backend_memory_contexts
> and pg_log_current_plan, I think we can have a common function, something
> like below:

Agreed. I'll do some refactoring.
> bool
> SendProcSignalForLogInfo(ProcSignalReason reason)
> {
>     Assert(reason == PROCSIG_LOG_MEMORY_CONTEXT ||
>            reason == PROCSIG_LOG_CURRENT_PLAN);
>
>     if (!superuser())
>     {
>         if (reason == PROCSIG_LOG_MEMORY_CONTEXT)
>             errmsg("must be a superuser to log memory contexts")
>         else if (reason == PROCSIG_LOG_CURRENT_PLAN)
>             errmsg("must be a superuser to log plan of the running query")
>     }
>
>     if (SendProcSignal(pid, reason, proc->backendId) < 0)
>     {
>     }
> }
>
> Then we could just do:
>
> Datum
> pg_log_backend_memory_contexts(PG_FUNCTION_ARGS)
> {
>     PG_RETURN_BOOL(SendProcSignalForLogInfo(PROCSIG_LOG_MEMORY_CONTEXT));
> }
>
> Datum
> pg_log_current_plan(PG_FUNCTION_ARGS)
> {
>     PG_RETURN_BOOL(SendProcSignalForLogInfo(PROCSIG_LOG_CURRENT_PLAN));
> }
>
> We can have SendProcSignalForLogInfo defined in procsignal.c and
> procsignal.h.
>
> 4) I think we can have a sample function usage, how it returns a true
> value, and how the plan looks for a simple query (select 1 or some other
> simple/complex generic query, or simply
> select pg_log_current_plan(pg_backend_pid());) in the documentation,
> much like pg_log_backend_memory_contexts.

+1.

> 5) Instead of just showing the true return value of the function
> pg_log_current_plan in the sql test, which just shows that the signal is
> sent but not that the backend has processed that signal and logged the
> plan, I think we can add a test using the TAP framework.

I once made a TAP test for pg_log_backend_memory_contexts(), but it seemed overkill and we agreed that just ensuring the function works was appropriate, as in the discussion below:

https://www.postgresql.org/message-id/bbecd722d4f8e261b350186ac4bf68a7%40oss.nttdata.com

> 6) Do we unnecessarily need to signal processes such as the autovacuum
> launcher/workers, logical replication launcher/workers, wal senders, wal
> receivers and so on, only to emit a "PID %d is not executing queries now"
> message? Moreover, those processes will be waiting in loops for timeouts
> to occur; as soon as they wake up, do they need to process this extra
> uninformative signal? We could choose to not signal those processes at
> all, which might or might not be possible. Otherwise, we could just emit
> messages like "process cannot run a query" in ProcessInterrupts.

Agreed. However, it needs to distinguish backends which can execute queries from other processes such as the autovacuum launcher, and I haven't come up with an easy way to do so. I'm going to think about it.

> 7) Instead of:
>     (errmsg("logging plan of the running query on PID %d\n%s",
> how about:
>     (errmsg("plan of the query running on backend with PID %d is:\n%s",

+1.

> 8) Instead of errmsg("PID %d is not executing queries now"), how about
> errmsg("Backend with PID %d is not running a query")?

+1.

> 9) We could just do:
>
> void
> ProcessLogCurrentPlanInterrupt(void)
> {
>     ExplainState *es;
>
>     LogCurrentPlanPending = false;
>
>     if (!ActivePortal || !ActivePortal->queryDesc)
>         errmsg("PID %d is not executing queries now");
>
>     es = NewExplainState();
>     ExplainQueryText();
>     ExplainPrintPlan();
>
> 10) How about renaming the function pg_log_current_plan to
> pg_log_query_plan or pg_log_current_query_plan?

+1.

> 11) What happens if pg_log_current_plan is called for a parallel worker?

AFAIU a parallel worker doesn't have an ActivePortal, so it would always emit the message 'PID %d is not executing queries now'. As in 6), it would be better to distinguish parallel workers from normal backends.

Regards,

--
Atsushi Torikoshi
NTT DATA CORPORATION
Re: BUG #15293: Stored Procedure Triggered by Logical Replication is Unable to use Notification Events
Hi all!

I can still reproduce it on version 14beta1. I adapted a patch I found in this thread: https://github.com/seregayoga/postgres/commit/338bc33f2cf77edde7c45bfdfb9f39a92ec57eb8 . It solved this bug for me (tested with a simple Go program using https://github.com/lib/pq ). It would be nice to have this bug fixed. I'm not so familiar with the postgres code base, but I would be glad to help with testing.

> Monday, June 14, 2021 12:39 PM +03:00 from Daniel Danzberger <dan...@dd-wrt.com>:
>
> On Fri, 2019-06-28 at 17:22 -0400, Alvaro Herrera wrote:
>> On 2019-Feb-22, Robert Welin wrote:
>>
>>> I have reproduced this bug on PostgreSQL version 10.7 and 11.2
>>> using the steps described in Michael Powers' original report. The
>>> issue also still seems to be present even with the patch provided
>>> by Sergei Kornilov.
>>>
>>> Are there plans to address this issue any time soon or is there
>>> some way I can assist in fixing it? It would be great to have
>>> notifications from logical replication.
>>
>> This issue seems largely forgotten about :-(
>
> Are there any news on this issue? I can reproduce this bug up to now
> with version 12.
>
> --
> Regards
>
> Daniel Danzberger
> embeDD GmbH, Alter Postplatz 2, CH-6370 Stans

Best regards,
Sergey Fedchenko.
Re: Fdw batch insert error out when set batch_size > 65535
On Sun, Jun 13, 2021 at 9:28 PM Tomas Vondra wrote:
> On 6/13/21 5:25 PM, Bharath Rupireddy wrote:
> > On Sun, Jun 13, 2021 at 6:10 AM Alvaro Herrera wrote:
> >>
> >> On 2021-Jun-12, Tomas Vondra wrote:
> >>
> >>> There's one caveat, though - for regular builds the slowdown is pretty
> >>> much eliminated. But with valgrind it's still considerably slower. For
> >>> postgres_fdw the "make check" used to take ~5 minutes for me, now it
> >>> takes >1h. And yes, this is entirely due to the new test case which is
> >>> generating / inserting 70k rows. So maybe the test case is not worth it
> >>> after all, and we should get rid of it.
> >>
> >> Hmm, what if the table is made 1600 columns wide -- would inserting 41
> >> rows be sufficient to trigger the problem case? If it does, maybe it
> >> would reduce the runtime for valgrind/cache-clobber animals enough that
> >> it's no longer a concern.
> >
> > Yeah, that's a good idea. PSA patch that creates the table of 1600
> > columns and inserts 41 rows into the foreign table. If the batch_size
> > adjustment fix isn't there, we will hit the error. On my dev system,
> > postgres_fdw contrib regression test execution times with and without
> > the attached patch are 4.5 sec and 5.7 sec respectively.
>
> But we're discussing cases with valgrind and/or CLOBBER_CACHE_ALWAYS.

Okay. Here are the readings on my dev system:
1) on master with the existing test case inserting 70K rows: 4263200 ms (71.05 min)
2) with Tomas's patch with the test case modified to 1500 table columns and 50 rows (the majority of the time, ~30 min, was spent in the SELECT create_batch_tables(1500); statement; I measured this manually from the start and end time of the statement): 6649312 ms (110.8 min)
3) with my patch with the test case modified to 1600 table columns and 41 rows: 4003007 ms (66.71 min)
4) on master without the test case at all: 3770722 ms (62.84 min)

With Regards,
Bharath Rupireddy.
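For context on why 1600 columns times 41 rows trips the unfixed code: the protocol's Bind message carries a 16-bit parameter count, so columns times batch size must stay within 65535 (libpq's PQ_QUERY_PARAM_MAX_LIMIT). A sketch of the clamp the batch_size adjustment applies — the function name here is hypothetical, not the actual postgres_fdw code:

```python
PQ_QUERY_PARAM_MAX_LIMIT = 65535  # 16-bit parameter count on the wire

def effective_batch_size(requested, ncolumns):
    # Each batched row binds one parameter per column, so the batch is
    # clamped to keep the total parameter count within the protocol limit.
    if ncolumns <= 0:
        return requested
    return max(1, min(requested, PQ_QUERY_PARAM_MAX_LIMIT // ncolumns))
```

With 1600 columns the clamp yields 40, which is why inserting 41 rows exercises the adjustment while keeping the test tiny.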
RE: locking [user] catalog tables vs 2pc vs logical rep
On Friday, June 11, 2021 2:13 PM vignesh C wrote:
> Thanks for the updated patch:
> Few comments:
> 1) We have used Reordering and Clustering for the same command; we could
> rephrase similarly in both places.
>
>   +Reordering pg_class by CLUSTER
>   +command in a transaction.
>
>   +Clustering pg_trigger and decoding PREPARE
>   +TRANSACTION, if any published table have a trigger and any
>   +operations that will be decoded are conducted.
>
> 2) Here user_catalog_table should be user catalog table
>
>   +Executing TRUNCATE on user_catalog_table
>   +in a transaction.

Thanks for your review. Attached the patch set that addressed those two comments. I also fixed the commit message a bit in the 2PC-specific patch for HEAD. No other changes. Please check.

Best Regards,
Takamichi Osumi

Attachments:
HEAD_deadlock_documentation_of_logical_decoding_v06.patch
HEAD_with_2PC_deadlock_documentation_of_logical_decoding_v06.patch
PG10_deadlock_documentation_of_logical_decoding_v06.patch
PG11_deadlock_documentation_of_logical_decoding_v06.patch
PG12_deadlock_documentation_of_logical_decoding_v06.patch
PG13_deadlock_documentation_of_logical_decoding_v06.patch
PG96_deadlock_documentation_of_logical_decoding_v06.patch
Fwd: BUG #17056: Segmentation fault on altering the type of the foreign table column with a default
reposting to -hackers to get more eyeballs. Summary: only RELKIND_RELATION type relations should have attributes with atthasmissing/attmissingval Forwarded Message Subject:Re: BUG #17056: Segmentation fault on altering the type of the foreign table column with a default Date: Sat, 12 Jun 2021 21:59:24 -0400 From: Andrew Dunstan To: Andres Freund CC: Tom Lane , exclus...@gmail.com, pgsql-b...@lists.postgresql.org On 6/12/21 5:50 PM, Andres Freund wrote: > Hi, > > On 2021-06-12 17:40:38 -0400, Andrew Dunstan wrote: >> Ok, I think the attached is the least we need to do. Given this I >> haven't been able to induce a crash even when the catalog is hacked with >> bogus missing values on a foreign table. But I'm not 100% convinced I >> have fixed all the places that need to be fixed. > Hm. There's a few places that look at atthasmissing and just assume that > there's corresponding information about the missing field. And as far as > I can see the proposed changes in RelationBuildTupleDesc() don't unset > atthasmissing, they just prevent the constraint part of the tuple desc > from being filled. Wouldn't this cause problems if we reach code like > Yes, you're right. This version should take care of things better. Thanks for looking. 
cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c index afa830d924..bcf80d865f 100644 --- a/src/backend/catalog/heap.c +++ b/src/backend/catalog/heap.c @@ -2161,6 +2161,13 @@ SetAttrMissing(Oid relid, char *attname, char *value) /* lock the table the attribute belongs to */ tablerel = table_open(relid, AccessExclusiveLock); + /* Don't do anything unless it's a RELKIND type relation */ + if (tablerel->rd_rel->relkind != RELKIND_RELATION) + { + table_close(tablerel, AccessExclusiveLock); + return; + } + /* Lock the attribute row and get the data */ attrrel = table_open(AttributeRelationId, RowExclusiveLock); atttup = SearchSysCacheAttName(relid, attname); @@ -2287,7 +2294,8 @@ StoreAttrDefault(Relation rel, AttrNumber attnum, valuesAtt[Anum_pg_attribute_atthasdef - 1] = true; replacesAtt[Anum_pg_attribute_atthasdef - 1] = true; - if (add_column_mode && !attgenerated) + if (rel->rd_rel->relkind == RELKIND_RELATION && add_column_mode && + !attgenerated) { expr2 = expression_planner(expr2); estate = CreateExecutorState(); diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c index 028e8ac46b..2b18eb89bf 100644 --- a/src/backend/commands/tablecmds.c +++ b/src/backend/commands/tablecmds.c @@ -12215,9 +12215,11 @@ ATExecAlterColumnType(AlteredTableInfo *tab, Relation rel, /* * Here we go --- change the recorded column type and collation. (Note * heapTup is a copy of the syscache entry, so okay to scribble on.) First - * fix up the missing value if any. + * fix up the missing value if any. There shouldn't be any missing values + * for anything except RELKIND_RELATION relations, but if there are, ignore + * them. 
*/ - if (attTup->atthasmissing) + if (rel->rd_rel->relkind == RELKIND_RELATION && attTup->atthasmissing) { Datum missingval; bool missingNull; diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c index fd05615e76..588f9f0067 100644 --- a/src/backend/utils/cache/relcache.c +++ b/src/backend/utils/cache/relcache.c @@ -559,6 +559,15 @@ RelationBuildTupleDesc(Relation relation) elog(ERROR, "invalid attribute number %d for relation \"%s\"", attp->attnum, RelationGetRelationName(relation)); + /* + * Fix atthasmissing flag - it's only for RELKIND relations. Others + * should not have missing values set, but there may be some left from + * before when we placed that check, so this code defensively ignores + * such values. + */ + if (relation->rd_rel->relkind != RELKIND_RELATION) + attp->atthasmissing = false; + memcpy(TupleDescAttr(relation->rd_att, attnum - 1), attp, ATTRIBUTE_FIXED_PART_SIZE);
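The relcache hunk above encodes a simple invariant: atthasmissing is meaningful only for plain tables (RELKIND_RELATION), and a stray flag on any other relkind is ignored rather than trusted. A toy model of that defensive step, in Python for brevity (the relkind letters match the catalog; everything else is a hypothetical stand-in for RelationBuildTupleDesc):

```python
RELKIND_RELATION = "r"       # plain table
RELKIND_FOREIGN_TABLE = "f"  # foreign table

def load_attribute(relkind: str, att: dict) -> dict:
    """Mirror of the defensive step: only plain relations may carry
    fast-default missing values, so a stray atthasmissing flag on any
    other relkind is cleared instead of being dereferenced later
    (which is what produced the segfault)."""
    if relkind != RELKIND_RELATION:
        att = dict(att, atthasmissing=False)
    return att

# A corrupted/legacy catalog row claiming a foreign-table column has
# a missing value is defused at tupledesc-build time:
att = load_attribute(RELKIND_FOREIGN_TABLE, {"attname": "a", "atthasmissing": True})
print(att["atthasmissing"])  # -> False
```

The same invariant is enforced on the write side (SetAttrMissing, StoreAttrDefault), so the relcache clearing is a belt-and-braces defense for rows created before the check existed.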
Re: a path towards replacing GEQO with something better
On Sun, Jun 13, 2021 at 9:50 AM Tomas Vondra wrote: > > 2) We can still keep GEQO around (with some huge limit by default) for a > > few years as an escape hatch, while we refine the replacement. If there > > is some bug that prevents finding a plan, we can emit a WARNING and fall > > back to GEQO. Or if a user encounters a regression in a big query, they > > can lower the limit to restore the plan they had in an earlier version. > > > > Not sure. Keeping significant amounts of code may not be free - both for > maintenance and new features. It'd be a bit sad if someone proposed > improvements to join planning, but had to do 2x the work to support it > in both the DP and GEQO branches, or risk incompatibility. Looking back again at the commit history, we did modify geqo to support partial paths and partition-wise join, so that's a fair concern. My concern is the risk of plan regressions after an upgrade, even if for a small number of cases. > OTOH maybe this concern is unfounded in practice - I don't think we've > done very many big changes to geqo in the last few years. Yeah, I get the feeling that it's already de facto obsolete, and we could make it a policy not to consider improvements aside from bug fixes where it can't find a valid plan, or forced API changes. Which I guess is another way of saying "deprecated". (I briefly considered turning it into a contrib module, but that seems like the worst of both worlds.) > This reminds me the proposals to have a GUC that'd determine how much > effort should the planner invest into various optimizations. For OLTP it > might be quite low, for large OLAP queries it'd be economical to spend > more time trying some more expensive optimizations. > > The challenge of course is how / in what units / to define the budget, > so that it's meaningful and understandable for users. Not sure if > "number of join rels generated" will be clear enough for users. 
But it > seems good enough for PoC / development, and hopefully people won't have > to tweak it very often. I'm also in favor of having some type of "planner effort" or "OLTP to OLAP spectrum" guc, but I'm not yet sure whether it'd be better to have it separate or to couple the joinrel budget to it. If we go that route, I imagine there'll be many things that planner_effort changes that we don't want to give a separate knob for. And, I hope with graceful degradation and a good enough heuristic search, it won't be necessary to change in most cases. > I haven't read the [Kossmann & Stocker, 2000] paper yet, but the > [Neumann, 2018] paper seems to build on it, and it seems to work with > much larger subtrees of the join tree than k=5. Right, in particular it builds on "IDP-2" from Kossmann & Stocker. Okay, so Neumann's favorite algorithm stack "Adaptive" is complex, and I believe you are referring to cases where they can iteratively improve up to 100 rels at a time because of linearization. That's a separate algorithm (IKKBZ) that complicates the cost model and also cannot have outer joins. If it has outer joins, they use regular DP on subsets of size up to 10. It's not substantively different from IDP-2, and that's the one I'd like to try to gracefully fall back to. Or something similar. > What I find fairly interesting is the section in [Neumann, 2018] about > cardinality estimates, and quality of query plans when the estimates are > off. The last paragraph in the 6.5 section essentially says that despite > poor estimates, the proposed algorithm performs better than the simple > (and cheap) heuristics. I'm not sure what to think about that, because > my "intuitive" understanding is that the more elaborate the planning is, > the more errors it can make when the estimates are off. Yeah, I'm not sure this part is very useful and seems almost like an afterthought. 
In table 3, all those poor examples are "pure" greedy algorithms and don't have iterative refinement added, so it kind of makes sense that poor estimates would hurt them more. But they don't compare those with *just* a refinement step added. I also don't know how realistic their "estimate fuzzing" is. -- John Naylor EDB: http://www.enterprisedb.com
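To make "number of join rels generated" concrete as a budget unit, here is a toy counter for a Selinger-style dynamic-programming enumeration of an n-way join. It assumes a clique join graph (every pair joinable), counts subsets rather than paths, and has none of the real planner's pruning — purely illustrative:

```python
from math import comb

def count_joinrels(n: int) -> int:
    """Count the join relations a DP join planner generates for an
    n-way join over a clique query graph: every subset of two or more
    base relations becomes one joinrel. Toy model only -- it ignores
    paths, costs, physical properties, and non-clique join graphs."""
    return sum(comb(n, size) for size in range(2, n + 1))

# The count is 2^n - n - 1, which is why a fixed effort budget (and a
# graceful fallback once it is exceeded) is attractive for large n.
for n in (4, 8, 12):
    print(n, count_joinrels(n))  # -> 4 11, 8 247, 12 4083
```

A "planner effort" GUC could then be expressed as a cap on this count, triggering the fallback (IDP-2-style partitioned DP or similar) once the cap is hit, rather than switching algorithms at a fixed relation count the way geqo_threshold does.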
Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic
On Thu, 10 Jun 2021 at 19:43, Peter Geoghegan wrote:
>
> On Thu, Jun 10, 2021 at 10:29 AM Matthias van de Meent wrote:
> > I see one exit for HEAPTUPLE_DEAD on a potentially recently committed
> > xvac (?), and we might also check against recently committed
> > transactions if xmin == xmax, although apparently that is not
> > implemented right now.
>
> I don't follow. Perhaps you can produce a test case?

If you were to delete a tuple in the same transaction that you created it (without checkpoints / subtransactions), I would assume that this would allow us to vacuum the tuple, as the only snapshot that could see the tuple must commit or roll back. In any case, inside the transaction the tuple is not visible anymore, and outside the transaction the tuple will never be seen. That being the case, any such tuple that has xmin == xmax should be vacuumable at any time, except that you might want to wait for the transaction to have committed / rolled back to prevent any race conditions with (delayed) index insertions.

Example:

BEGIN;
INSERT INTO tab VALUES (1);
DELETE FROM tab;
-- At this point, the tuple inserted cannot be seen in any
-- current or future snapshot, and could thus be vacuumed.
COMMIT;

Because I am not yet quite well versed in the xid assignment and heapam deletion subsystems, it could very well be that either this case is impossible to reach, or that the heapam tuple delete logic already applies this at tuple delete time.

With regards,

Matthias van de Meent
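The argument above can be stated as a tiny predicate. This is a toy model only — it deliberately ignores subtransactions, combo CIDs, and multixacts, which is exactly where the "I may be missing something" caveat bites (a subtransaction can give xmin == xmax where the insert commits but the delete aborts):

```python
def vacuumable_xmin_eq_xmax(xmin: int, xmax: int, running_xids: set) -> bool:
    """Toy model of the proposed shortcut: if the same (top-level)
    transaction both inserted and deleted the tuple, then once that
    transaction is no longer in progress, no snapshot past or future
    can see the tuple -- it is dead whether the xid committed or
    aborted. Waiting for the xid to finish avoids the race with
    delayed index insertions mentioned above."""
    return xmin == xmax and xmin not in running_xids

print(vacuumable_xmin_eq_xmax(100, 100, running_xids={200}))  # -> True
print(vacuumable_xmin_eq_xmax(100, 100, running_xids={100}))  # -> False (xact still running)
print(vacuumable_xmin_eq_xmax(100, 105, running_xids={200}))  # -> False (normal rules apply)
```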
Re: Different compression methods for FPI
Thanks for benchmarks, Justin!

> On 14 June 2021, at 06:24, Justin Pryzby wrote:
>
> The GUC is PGC_USERSET

Oh, wow, that's neat. I did not realize that we can tune this for each individual client connection. Cool!

> pglz writes ~half as much, but takes twice as long as uncompressed:
> |Time: 3362.912 ms (00:03.363)
> |wal_bytes| 11644224
>
> zlib writes ~4x less than uncompressed, and is still much faster than pglz:
> |Time: 2167.474 ms (00:02.167)
> |wal_bytes| 5611653
>
> lz4 is as fast as uncompressed, and writes a bit more than pglz:
> |Time: 1612.874 ms (00:01.613)
> |wal_bytes| 12397123
>
> zstd(6) is slower than lz4, but compresses better than anything but zlib:
> |Time: 1808.881 ms (00:01.809)
> |wal_bytes| 6395993

I was wrong about zlib: it does have a point on the Pareto frontier, at least for this test. Thanks!

Best regards,
Andrey Borodin.
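The shape of this speed-versus-ratio trade-off is easy to reproduce with standard-library codecs. lz4 and zstd are not in Python's standard library, so the sketch below uses zlib at different levels as a stand-in for the spectrum; the payload is a made-up page-like buffer with some redundancy, not a real full-page image:

```python
import time
import zlib

# A page-like payload with realistic redundancy (stand-in for an FPI).
payload = (b"heap tuple header " * 64 + bytes(range(256))) * 16

for level in (1, 6, 9):  # level 1 ~ "fast codec", 9 ~ "best ratio"
    t0 = time.perf_counter()
    compressed = zlib.compress(payload, level)
    elapsed_ms = (time.perf_counter() - t0) * 1e3
    print(f"level={level} ratio={len(payload) / len(compressed):.2f} "
          f"time={elapsed_ms:.2f} ms")
```

Higher levels buy ratio with CPU time, which mirrors why a fast codec like lz4 keeps up with uncompressed WAL while zlib/zstd shrink it further at a latency cost.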
Re: GiST operator class for bool
> On 8 June 2021, at 19:53, Emre Hasegeli wrote:
>
>> But the patch that you propose does not support the sorting build added in PG14.
>
> It looks like the change to btree_gist is not committed yet. I'll
> rebase my patch once it's committed.

The changes to GiST are committed. There will be no need to rebase anyway :)

> It was a long thread. I couldn't read all of it. Though, the last
> patches felt to me like a part of what's already been committed.
> Shouldn't they also be committed to version 14?

Well, yeah, it would be cool to have the GiST build and the gist_btree support for it together, but there were many parts and we did not finish it before the feature freeze.

Best regards,
Andrey Borodin.
Re: pgbench bug candidate: negative "initial connection time"
Hmmm. Possibly. Another option could be not to report anything after some errors. I'm not sure, because it would depend on the use case. I guess the command returned an error status as well.

I did not know any use cases and decisions, but I vote to report nothing when an error occurs. I would prefer to abort the thread whose connection got an error and report results for the other threads, as is handled when doConnect fails in the CSTATE_START_TX state.

It is unclear to me whether it makes much sense to report performance when things go wrong. At least when a one-connection-per-client bench is run, ISTM that it should not proceed, because the bench could not even start as prescribed. When connections break while the bench has already started, maybe it makes more sense to proceed, although I guess that reattempting connections would also make sense in such a case.

In this case, we have to set the state to CSTATE_ABORTED before going to 'done', as fixed in the attached patch, in order to ensure that the exit status is 2 and the result reports "pgbench: fatal: Run was aborted; the above results are incomplete."

Hmmm. I agree that at least reporting that there was an issue is a good idea.

Otherwise, if we want pgbench to exit immediately when a connection error occurs, we have to call exit(1) to ensure the exit code is 1, of course. Anyway, it is wrong that the current pgbench exits successfully with exit code 0 when doConnect fails.

Indeed, I can only agree on this one.

-- Fabien.
Re: SQLSTATE for replication connection failures
On Sat, Jun 12, 2021 at 9:12 PM Tom Lane wrote: > > So far as I can find, just about everyplace that deals with replication > connections has slipshod error reporting. An example from worker.c is > > LogRepWorkerWalRcvConn = walrcv_connect(MySubscription->conninfo, > true, > MySubscription->name, ); > if (LogRepWorkerWalRcvConn == NULL) > ereport(ERROR, > (errmsg("could not connect to the publisher: %s", err))); > > Because of the lack of any errcode() call, this failure will be reported > as XX000 ERRCODE_INTERNAL_ERROR, which is surely not appropriate. > worker.c is in good company though, because EVERY caller of walrcv_connect > is equally slipshod. > > Shall we just use ERRCODE_CONNECTION_FAILURE for these failures, or > would it be better to invent another SQLSTATE code? Arguably, > ERRCODE_CONNECTION_FAILURE is meant for failures of client connections; > but on the other hand, a replication connection is a sort of client. > Your reasoning sounds good to me. So, +1 for using ERRCODE_CONNECTION_FAILURE. -- With Regards, Amit Kapila.
Re: Fix around conn_duration in pgbench
Hello Yugo-san,

TState has a field called "conn_duration" which is, the comment says, "cumulated connection and deconnection delays". This value is summed over threads and reported as "average connection time" under -C/--connect. If this option is not specified, the value is never used.

Yep.

However, I found that conn_duration is calculated even when -C/--connect is not specified, which is wasteful. So we can remove this code, as fixed in the attached patch.

I'm fine with the implied code simplification, but it deserves a comment.

In addition, disconnection delays are not cumulated even under -C/--connect, despite being mentioned in the comment. I also fixed this in the attached patch.

I'm fine with that, even if it only concerns is_connect.

I've updated the command to work whether now is initially set or not. I'm unsure whether closing a pg connection actually waits for anything, so the impact is probably rather small anyway. It cannot be usefully measured when !is_connect, because threads do that whenever they feel like it, whereas on connection we can use barriers to have something which makes sense.

Also, there is the issue of connection failures: the attached version adds an error message and exits for initial connections, consistently. This is not done with is_connect, though, and I'm unsure what we should really do.
-- Fabien.diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c index d7479925cb..9d2abbfb68 100644 --- a/src/bin/pgbench/pgbench.c +++ b/src/bin/pgbench/pgbench.c @@ -3171,6 +3171,7 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg) if ((st->con = doConnect()) == NULL) { + /* as the bench is already running, we do not abort */ pg_log_error("client %d aborted while establishing connection", st->id); st->state = CSTATE_ABORTED; break; @@ -3536,8 +3537,12 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg) if (is_connect) { + pg_time_usec_t start = now; + + pg_time_now_lazy(); finishCon(st); - now = 0; + now = pg_time_now(); + thread->conn_duration += now - start; } if ((st->cnt >= nxacts && duration <= 0) || timer_exceeded) @@ -3561,6 +3566,7 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg) */ case CSTATE_ABORTED: case CSTATE_FINISHED: +/* per-thread last disconnection time is not measured */ finishCon(st); return; } @@ -4405,7 +4411,10 @@ runInitSteps(const char *initialize_steps) initPQExpBuffer(); if ((con = doConnect()) == NULL) + { + pg_log_fatal("connection for initialization failed"); exit(1); + } setup_cancel_handler(NULL); SetCancelConn(con); @@ -6332,7 +6341,10 @@ main(int argc, char **argv) /* opening connection... 
*/ con = doConnect(); if (con == NULL) + { + pg_log_fatal("setup connection failed"); exit(1); + } pg_log_debug("pghost: %s pgport: %s nclients: %d %s: %d dbName: %s", PQhost(con), PQport(con), nclients, @@ -6548,7 +6560,7 @@ threadRun(void *arg) if (thread->logfile == NULL) { pg_log_fatal("could not open logfile \"%s\": %m", logpath); - goto done; + exit(1); } } @@ -6561,6 +6573,7 @@ threadRun(void *arg) thread_start = pg_time_now(); thread->started_time = thread_start; + thread->conn_duration = 0; last_report = thread_start; next_report = last_report + (int64) 100 * progress; @@ -6572,26 +6585,13 @@ threadRun(void *arg) { if ((state[i].con = doConnect()) == NULL) { -/* - * On connection failure, we meet the barrier here in place of - * GO before proceeding to the "done" path which will cleanup, - * so as to avoid locking the process. - * - * It is unclear whether it is worth doing anything rather - * than coldly exiting with an error message. - */ -THREAD_BARRIER_WAIT(); -goto done; +/* coldly abort on connection failure */ +pg_log_fatal("cannot create connection for thread %d client %d", + thread->tid, i); +exit(1); } } - - /* compute connection delay */ - thread->conn_duration = pg_time_now() - thread->started_time; - } - else - { - /* no connection delay to record */ - thread->conn_duration = 0; + /* connection delay is measured globally between the barriers */ } /* GO */
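What the patch measures can be sketched independently of pgbench: a per-thread accumulator that adds both the connection and the disconnection delay, read from a monotonic clock around each operation. All names below are hypothetical stand-ins for TState and finishCon; a real client would open a database connection instead of sleeping:

```python
import time

class ThreadState:
    """Toy stand-in for pgbench's TState: conn_duration cumulates both
    connection and disconnection delays, as the comment promises."""
    def __init__(self):
        self.conn_duration = 0.0

def run_with_connection(thread, connect, disconnect, work):
    t0 = time.monotonic()
    con = connect()
    thread.conn_duration += time.monotonic() - t0   # connection delay

    work(con)

    t0 = time.monotonic()
    disconnect(con)
    thread.conn_duration += time.monotonic() - t0   # disconnection delay too

t = ThreadState()
run_with_connection(t,
                    connect=lambda: time.sleep(0.01) or object(),  # fake 10 ms connect
                    disconnect=lambda con: time.sleep(0.005),      # fake 5 ms close
                    work=lambda con: None)
print(t.conn_duration >= 0.01)  # -> True
```

The subtlety the patch addresses is the same one as here: `now` must be refreshed lazily before the disconnect (pg_time_now_lazy), otherwise a stale timestamp under- or over-counts the delay.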
Re: Failure in subscription test 004_sync.pl
On Mon, Jun 14, 2021 at 10:41 AM Masahiko Sawada wrote: > > > > > I think it is showing a race condition issue in the code. In > > DropSubscription, we first stop the worker that is receiving the WAL, > > and then in a separate connection with the publisher, it tries to drop > > the slot which leads to this error. The reason is that walsender is > > still active as we just wait for wal receiver (or apply worker) to > > stop. Normally, as soon as the apply worker is stopped the walsender > > detects it and exits but in this case, it took some time to exit, and > > in the meantime, we tried to drop the slot which is still in use by > > walsender. > > There might be possible. > > That's weird since DROP SUBSCRIPTION executes DROP_REPLICATION_SLOT > command with WAIT option. I found a bug that is possibly an oversight > of commit 1632ea4368. > .. > > The condition should be the opposite; we should raise the error when > 'nowait' is true. I think this is the cause of the test failure. Even > if DROP SUBSCRIPTION tries to drop the slot with the WAIT option, we > don't wait but raise the error. > > Attached a small patch fixes it. > Yes, this should fix the recent buildfarm failures. Alvaro, would you like to take care of this? -- With Regards, Amit Kapila.
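The inverted check described above is easy to model: acquiring a slot that is in use should raise an error only in nowait mode, and wait for release otherwise, which is what DROP SUBSCRIPTION's DROP_REPLICATION_SLOT ... WAIT expects. A toy sketch of the intended decision (hypothetical names; the real logic lives in ReplicationSlotAcquire after commit 1632ea4368):

```python
def acquire_slot(active: bool, nowait: bool) -> str:
    """Toy model of the fixed condition: error out on an active slot
    only when the caller asked not to wait; otherwise block until the
    slot is released. The reported bug had this test inverted, so WAIT
    callers got the error meant for NOWAIT callers."""
    if not active:
        return "acquired"
    if nowait:
        raise RuntimeError("replication slot is active for another backend")
    return "waited-then-acquired"

print(acquire_slot(active=True, nowait=False))  # -> waited-then-acquired
```

With the condition inverted, the WAIT path raised the error instead of sleeping on the condition variable, producing the intermittent 004_sync.pl failure when the walsender was slow to exit.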