Re: make the stats collector shutdown without writing the statsfiles if the immediate shutdown is requested.
On 2021/03/23 11:40, Fujii Masao wrote:
> On 2021/03/23 9:05, Masahiro Ikeda wrote:
>> Yes. I attached the v5 patch based on the v3 patch.
>> I renamed SignalHandlerForUnsafeExit() and fixed the following comment.
>
> Thanks for updating the patch!
>
> When the startup process exits because of recovery_target_action=shutdown,
> reaper() calls TerminateChildren(SIGTERM). This function sends SIGTERM to
> the stats collector. Currently the stats collector ignores SIGTERM, but with
> the patch it exits normally. This change of behavior might be problematic.
>
> That is, TerminateChildren(SIGTERM) sends SIGTERM to various processes.
> But currently the stats collector and checkpointer don't exit even when
> SIGTERM arrives, because they ignore SIGTERM. After several processes
> other than the stats collector and checkpointer have exited on SIGTERM,
> PostmasterStateMachine() and reaper() make the checkpointer exit and then
> the stats collector. The shutdown terminates the processes in this order.
>
> On the other hand, with the patch, the stats collector exits on SIGTERM
> before the checkpointer does. This is not the normal order in which
> processes exit during shutdown.
>
> To address this issue, one idea is to use SIGUSR2 for normal exit of the
> stats collector, instead of SIGTERM. If we do this,
> TerminateChildren(SIGTERM) cannot terminate the stats collector. Thoughts?
>
> If we adopt this idea, a detailed comment about why SIGUSR2 is used
> needs to be added.

Thanks for your comments. I agree with your concern and suggestion.
Additionally, we need to consider the system shutdown cycle too, as the
comment in CheckpointerMain() says:

```
 * Note: we deliberately ignore SIGTERM, because during a standard Unix
 * system shutdown cycle, init will SIGTERM all processes at once. We
 * want to wait for the backends to exit, whereupon the postmaster will
 * tell us it's okay to shut down (via SIGUSR2)
```

I changed the signal from SIGTERM to SIGUSR2 and added comments explaining
why SIGUSR2 is used.
(v6-0001-pgstat_avoid_writing_on_sigquit.patch)

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

diff --git a/src/backend/postmaster/interrupt.c b/src/backend/postmaster/interrupt.c
index dd9136a942..e09a7b3cad 100644
--- a/src/backend/postmaster/interrupt.c
+++ b/src/backend/postmaster/interrupt.c
@@ -64,9 +64,27 @@ SignalHandlerForConfigReload(SIGNAL_ARGS)
 }
 
 /*
- * Simple signal handler for exiting quickly as if due to a crash.
+ * Simple signal handler for processes which have not yet touched or do not
+ * touch shared memory to exit quickly.
  *
- * Normally, this would be used for handling SIGQUIT.
+ * If processes already touched shared memory, call
+ * SignalHandlerForCrashExit() because shared memory may be corrupted.
+ */
+void
+SignalHandlerForNonCrashExit(SIGNAL_ARGS)
+{
+	/*
+	 * Since we don't touch shared memory, we can just pull the plug and exit
+	 * without running any atexit handlers.
+	 */
+	_exit(1);
+}
+
+/*
+ * Simple signal handler for processes which have touched shared memory to
+ * exit quickly.
+ *
+ * Normally, this would be used for handling SIGQUIT as if due to a crash.
  */
 void
 SignalHandlerForCrashExit(SIGNAL_ARGS)
@@ -93,9 +111,8 @@ SignalHandlerForCrashExit(SIGNAL_ARGS)
  * shut down and exit.
  *
  * Typically, this handler would be used for SIGTERM, but some processes use
- * other signals. In particular, the checkpointer exits on SIGUSR2, the
- * stats collector on SIGQUIT, and the WAL writer exits on either SIGINT
- * or SIGTERM.
+ * other signals. In particular, the checkpointer and the stats collector exit
+ * on SIGUSR2, and the WAL writer exits on either SIGINT or SIGTERM.
  *
  * ShutdownRequestPending should be checked at a convenient place within the
  * main loop, or else the main loop should call HandleMainLoopInterrupts.
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index b7af7c2707..53b84eda56 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1,7 +1,21 @@
 /* ----------
  * pgstat.c
  *
- *	All the statistics collector stuff hacked up in one big, ugly file.
+ *	All the statistics collector stuff hacked up in one big, ugly file.
+ *
+ *	The statistics collector is started by the postmaster as soon as the
+ *	startup subprocess finishes.  It remains alive until the postmaster
+ *	commands it to terminate.  Normal termination is by SIGUSR2 after the
+ *	checkpointer must exit(0), which instructs the statistics collector
+ *	to save the final statistics to reuse at next startup and then exit(0).
+ *	Emergency termination is by SIGQUIT; like any backend, the statistics
+ *	collector will exit quickly without saving the final statistics.  It's
+ *	ok because the startup process will remove all statistics at next
+ *	startup after emergency termination.
+ *
+ *	Because the statistics collector doesn't touch shared memory, even if
+ *	the statistics collector exits unexpectedly, the
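The shutdown ordering discussed in this thread (a process ignores SIGTERM so that a blanket SIGTERM, whether from TerminateChildren() or from init during a Unix system shutdown, cannot kill it out of order, and it exits cleanly only on SIGUSR2) can be sketched as a standalone POSIX C program. All names below are illustrative; this is a toy model, not PostgreSQL code:

```c
#include <assert.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Child's SIGUSR2 handler: the "normal exit" path. */
static void
handle_sigusr2(int signo)
{
    (void) signo;
    _exit(0);                   /* clean, immediate exit */
}

/* Spawn a child that ignores SIGTERM and exits 0 only on SIGUSR2. */
static pid_t
spawn_collector_like_child(void)
{
    pid_t       pid = fork();

    if (pid == 0)
    {
        signal(SIGTERM, SIG_IGN);        /* survive a blanket SIGTERM */
        signal(SIGUSR2, handle_sigusr2); /* "ok to shut down" signal */
        for (;;)
            pause();
    }
    return pid;
}

/* Drive the sequence: SIGTERM is ignored, SIGUSR2 triggers clean exit. */
static int
terminate_in_order(pid_t pid)
{
    int         status;

    kill(pid, SIGTERM);                  /* ignored: child must survive */
    usleep(100 * 1000);                  /* give the signal time to land */
    assert(waitpid(pid, &status, WNOHANG) == 0);    /* still running */

    kill(pid, SIGUSR2);                  /* now the child exits normally */
    waitpid(pid, &status, 0);
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The parent plays the role of the postmaster here: it confirms the child ignores SIGTERM, then tells it to shut down cleanly with SIGUSR2, mirroring the order the thread wants to preserve.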
Re: PostmasterIsAlive() in recovery (non-USE_POSTMASTER_DEATH_SIGNAL builds)
On 2021/03/23 10:52, Thomas Munro wrote:
> On Tue, Mar 23, 2021 at 2:44 PM Fujii Masao wrote:
>> I found the 0001 patch was committed in de829ddf23, which added the new
>> wait event WalrcvExit. This name seems inconsistent with other wait
>> events. I'm thinking it's better to rename it to WalReceiverExit.
>> Thoughts? Patch attached.
>
> Agreed, your names are better. Thanks.

Thanks! I will commit the patch.

Regards,
--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
Re: Nicer error when connecting to standby with hot_standby=off
On 2021/03/23 3:59, James Coleman wrote:
> Are you saying we should only change the message for a single case: the
> case where we'd otherwise allow connections but EnableHotStandby is false?

No. Let me clarify my opinion.

At PM_STARTUP, "the database system is starting up" should be logged whatever
the setting of hot_standby is. This is the same as the original behavior.
During crash recovery, this message is output. Also, at archive recovery or
on a standby server, this message is logged until the startup process sends
PMSIGNAL_RECOVERY_STARTED.

At PM_RECOVERY, originally "the database system is starting up" was logged
whatever the setting of hot_standby was. My opinion is the same as our
consensus: "the database system is not accepting connections" and "Hot
standby mode is disabled." are logged if hot_standby is disabled; "the
database system is not accepting connections" and "Consistent recovery state
has not been yet reached." are logged if hot_standby is enabled.

After the consistent recovery state is reached, if hot_standby is disabled,
the postmaster state is still PM_RECOVERY, so "Hot standby mode is disabled."
is still logged in this case. This also differs from the original behavior.
If hot_standby is enabled, read-only connections can be accepted because the
consistent state has been reached, so no message needs to be logged.

Therefore, what we have not yet reached consensus on is which message should
be logged at PM_STARTUP. I think it's better to log "the database system is
starting up" in that case, for the reasons I explained upthread. Thoughts?

Regards,
--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
Re: a verbose option for autovacuum
On Tue, Mar 23, 2021 at 1:31 PM Michael Paquier wrote:
> On Mon, Mar 22, 2021 at 12:17:37PM +0900, Masahiko Sawada wrote:
>> I've updated the patch. I saved the index names at the beginning of
>> heap_vacuum_rel() for autovacuum logging, and added indstats and
>> nindexes to LVRelStats. Some functions still have nindexes as a
>> function argument, but that seems to make sense since it corresponds to
>> the list of index relations (*Irel). Please review the patch.
>
> Going back to that, the structure of the static APIs in this file makes
> the whole logic a bit hard to follow, but the whole set of changes you
> have done here makes sense. It took me a moment to recall and
> understand why it is safe to free *stats at the end of
> vacuum_one_index() and whether the index stats array actually points to
> the DSM segment correctly within the shared stats.
>
> I think that there is more consolidation possible within LVRelStats,
> but let's leave that for another day, if there is any need for such a
> move.

While studying your patch (v5-index_stat_log.patch) I found we can polish the
parallel vacuum code in some places. I'll try that another day.

> To keep it short. Sold.

Thank you!

Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
Re: Change default of checkpoint_completion_target
On Mon, Mar 22, 2021 at 01:11:00PM -0400, Stephen Frost wrote:
> Unless there's anything further on this, I'll plan to commit it tomorrow
> or Wednesday.

Cool, looks fine to me. This version of the patch has forgotten to update
one spot:

    src/backend/postmaster/checkpointer.c:double CheckPointCompletionTarget = 0.5;
--
Michael

signature.asc
Description: PGP signature
Re: proposal - psql - use pager for \watch command
On Tue, Mar 23, 2021 at 0:35, Thomas Munro wrote:
> On Sun, Mar 21, 2021 at 11:43 PM Pavel Stehule wrote:
>> On Sat, Mar 20, 2021 at 23:45, Thomas Munro wrote:
>>> Thoughts? I put my changes into a separate patch for clarity, but
>>> they need some more tidying up.
>>
>> yes, your solution is much better.
>
> Hmm, there was a problem with it though: it blocked ^C while running
> the query, which is bad. I fixed that. I did some polishing of the
> code and some editing on the documentation and comments. I disabled
> the feature completely on Windows, because it seems unlikely that
> we'll be able to know if it even works, in this cycle.
>
> - output = PageOutput(158, pager ? &(pset.popt.topt) : NULL);
> + output = PageOutput(160, pager ? &(pset.popt.topt) : NULL);
>
> What is that change for?

This is correct: it is the number of printed rows, which is used to decide
whether to use a pager for help. There are two new rows, so the number is
correctly incremented by 2.

Pavel
Re: Proposal: Save user's original authenticated identity for logging
On Mon, Mar 22, 2021 at 06:22:52PM +0100, Magnus Hagander wrote:
> The 0002/0001/whateveritisaftertherebase is tracked over at
> https://www.postgresql.org/message-id/flat/92e70110-9273-d93c-5913-0bccb6562...@dunslane.net
> isn't it? I've assumed the expectation is to have that one committed
> from that thread, and then rebase using that.

Independent and useful pieces could just be extracted and applied separately
where it makes sense. I am not sure if that's the case here, so I'll do a
patch_to_review++.
--
Michael
Re: replication cleanup code incorrect way to use of HTAB HASH_REMOVE ?
On Mon, Mar 22, 2021 at 7:57 AM Amit Kapila wrote:
> On Mon, Mar 22, 2021 at 3:20 AM Peter Smith wrote:
>> On Sun, Mar 21, 2021 at 8:54 PM Amit Kapila wrote:
>>> On Sat, Mar 20, 2021 at 12:54 PM Peter Smith wrote:
>>>> PSA my patch to correct this by firstly doing a HASH_FIND, then only
>>>> doing the HASH_REMOVE after we've finished using the ent.
>>>
>>> Why can't we keep using HASH_REMOVE as it is, but get the output (entry
>>> found or not) in the last parameter of the hash_search API and then
>>> perform the Assert based on that? See similar usage in reorderbuffer.c
>>> and rewriteheap.c.
>>
>> Changing the Assert doesn't do anything to fix the problem as
>> described, i.e. the dereferencing of ent after the HASH_REMOVE.
>>
>> The real problem isn't the Assert. It's all those other usages of ent
>> disobeying the API rule: "(NB: in the case of the REMOVE action, the
>> result is a dangling pointer that shouldn't be dereferenced!)"
>
> Right, that is a problem. I see that your patch will fix it. Thanks.

Pushed your patch.

--
With Regards,
Amit Kapila.
Re: Proposal: Save user's original authenticated identity for logging
On Mon, Mar 22, 2021 at 07:17:26PM +0000, Jacob Champion wrote:
> v8's test_access lost the in-order log search from v7; I've added it
> back in v9. The increased resistance to entropy seems worth the few
> extra lines. Thoughts?

I am not really sure that we need to bother about the ordering of the
entries here, as long as we check for all of them within the same fragment
of the log file, so I would just go down to the simplest solution that I
posted upthread, which is enough to make sure that the verbosity is
protected. That's what we do elsewhere, like with command_checks_all() and
such.
--
Michael
Re: [CLOBBER_CACHE]Server crashed with segfault 11 while executing clusterdb
On Tue, Mar 23, 2021 at 10:08 AM Amul Sul wrote:
> On Mon, Mar 22, 2021 at 3:03 PM Amit Langote wrote:
>> On Mon, Mar 22, 2021 at 5:26 PM Amul Sul wrote:
>>> In heapam_relation_copy_for_cluster(), begin_heap_rewrite() sets
>>> rwstate->rs_new_rel->rd_smgr correctly, but on the next line
>>> tuplesort_begin_cluster() gets called, which causes a system cache
>>> invalidation that, with the CCA setting, wipes out
>>> rwstate->rs_new_rel->rd_smgr; it isn't restored for the subsequent
>>> operations, causing a segmentation fault.
>>>
>>> Calling RelationOpenSmgr() before calling smgrimmedsync() in
>>> end_heap_rewrite() would fix the failure. Did the same in the attached
>>> patch.
>>
>> That makes sense. I see a few commits in the git history adding
>> RelationOpenSmgr() before a smgr* operation, whenever such a problem
>> would have been discovered: 4942ee656ac, afa8f1971ae, bf347c60bdd7,
>> for example.
>
> Thanks for the confirmation.
>
>> I do wonder if there are still other smgr* operations in the source
>> code that are preceded by operations that would invalidate the
>> SMgrRelation that those smgr* operations would be called with. For
>> example, the smgrnblocks() in gistBuildCallback() may get done too
>> late relative to a corresponding RelationOpenSmgr() on the index
>> relation.
>
> I did the check for gistBuildCallback() by adding Assert(index->rd_smgr)
> before smgrnblocks() with the CCA setting and didn't see any problem
> there.
>
> I think the easiest way to find such cases is to run the regression suite
> with a CCA build; there is no guarantee that regression will hit all
> smgr* operations, but it might hit most of them.

Sure, will give a regression run with CCA enabled.

> Regards,
> Amul

Regards,
Neha Sharma
Re: Handling of opckeytype / CREATE OPERATOR CLASS (bug?)
Tomas Vondra writes:
> On 3/23/21 3:13 AM, Tom Lane wrote:
>> Hm. Both catalogs.sgml and pg_opclass.h say specifically that
>> opckeytype should be zero if it's to be the same as the input
>> column type. I don't think just dropping the enforcement of that
>> is the right answer.
>
> Yeah, that's possible. I was mostly just demonstrating the difference in
> behavior. Maybe the right fix is to fix the catalog contents and then
> tweak the AM code, or something.

Digging in our git history, the rule about zero opckeytype dates to 2001
(f933766ba), which precedes our invention of polymorphic types in 2003
(somewhere around 730840c9b). So I'm pretty sure that that was a poor man's
substitute for polymorphic opclasses, which we failed to clean up entirely
after we got real polymorphic opclasses.

Now, I'd be in favor of cleaning that up and just using standard
polymorphism rules throughout. But (without having studied the code) it
looks like the immediate issue is that something in the BRIN code is
unfamiliar with the rule for zero opckeytype. It might be a noticeably
smaller patch to fix that than to get rid of the convention about zero.

			regards, tom lane
RE: [POC] Fast COPY FROM command for the table with foreign partitions
From: Justin Pryzby
> On Mon, Mar 22, 2021 at 08:18:56PM -0700, Zhihong Yu wrote:
>> with data_dest_cb callback. It is used for send text representation of a
>> tuple to a custom destination.
>>
>> send text -> sending text
>
> I would say "It is used to send the text representation ..."

I took Justin-san's suggestion. (It feels like I'm in a junior English
class...)

> It's actually a pointer:
> src/include/commands/copy.h:typedef struct CopyToStateData *CopyToState;
>
> There's many data structures like this, where a structure is typedefed
> with a "Data" suffix and the pointer is typedefed without the "Data"

Yes. Thank you for the good explanation, Justin-san.

Regards
Takayuki Tsunakawa

v22-0001-Fast-COPY-FROM-into-the-foreign-or-sharded-table.patch
Description: v22-0001-Fast-COPY-FROM-into-the-foreign-or-sharded-table.patch
Re: Type of wait events WalReceiverWaitStart and WalSenderWaitForWAL
(I finally get to catch up here..)

At Mon, 22 Mar 2021 13:59:47 +0900, Fujii Masao wrote in
> On 2021/03/22 12:01, Kyotaro Horiguchi wrote:
>> Mmm. I agree that it waits for WAL in most cases, but still the WAL-wait
>> is activity for me, because it is not waiting to be cued by someone, but
>> waiting for new WAL to come to perform its main purpose. If it's an IPC,
>> shouldn't all waits other than pure sleep fall into IPC? (I was confused
>> by the comment of WalSndWait, which doesn't state that it is waiting for
>> a latch..)
>> Another point I'd like to raise is that the client_wait case should be
>> distinct from the WAL-wait, since it is a significant sign of what is
>> happening.
>> So I propose two changes here.
>> a. Rewrite the comment of WalSndWait so that it states that it is "also
>>    waiting for latch-set".
>
> +1

Cool.

>> b. Split the event into two different events.
>>    - WalSndWait(wakeEvents, sleeptime, WAIT_EVENT_WAL_SENDER_MAIN);
>>    + WalSndWait(wakeEvents, sleeptime,
>>    +            pq_is_send_pending() ? WAIT_EVENT_WAL_SENDER_WRITE_DATA :
>>    +            WAIT_EVENT_WAL_SENDER_MAIN);
>>    And _WRITE_DATA as client_wait and _SENDER_MAIN as activity.
>>    What do you think about this?
>
> I'm ok with this. What about the attached patch
> (WalSenderWriteData.patch)?

Yeah, that is better. I'm fine with it as a whole.

+ * Overwrite wait_event with WAIT_EVENT_WAL_SENDER_WRITE_DATA
+ * if we have pending data in the output buffer and are waiting to write
+ * data to a client.

Since the function doesn't check for that directly, I'd like to write it as
follows:

    Overwrite wait_event with WAIT_EVENT_WAL_SENDER_WRITE_DATA if the
    caller asked to wait for WL_SOCKET_WRITEABLE, which means that we
    have pending data in the output buffer and are waiting to write
    data to a client.

>> Yes. The WAIT_EVENT_WAL_SENDER_WAIT_WAL is equivalent to
>> WAIT_EVENT_WAL_SENDER_MAIN as a function. So I think it should be in
>> the same category as WAIT_EVENT_WAL_SENDER_MAIN. And like 1 above,
>> the wait_client case should be distinct from the _MAIN event.
>
> +1. What about the attached patch (WalSenderWaitForWAL.patch)?

Looks good to me. Thanks.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
Re: [CLOBBER_CACHE]Server crashed with segfault 11 while executing clusterdb
On Mon, Mar 22, 2021 at 3:03 PM Amit Langote wrote:
> On Mon, Mar 22, 2021 at 5:26 PM Amul Sul wrote:
>> In heapam_relation_copy_for_cluster(), begin_heap_rewrite() sets
>> rwstate->rs_new_rel->rd_smgr correctly, but on the next line
>> tuplesort_begin_cluster() gets called, which causes a system cache
>> invalidation that, with the CCA setting, wipes out
>> rwstate->rs_new_rel->rd_smgr; it isn't restored for the subsequent
>> operations, causing a segmentation fault.
>>
>> Calling RelationOpenSmgr() before calling smgrimmedsync() in
>> end_heap_rewrite() would fix the failure. Did the same in the attached
>> patch.
>
> That makes sense. I see a few commits in the git history adding
> RelationOpenSmgr() before a smgr* operation, whenever such a problem
> would have been discovered: 4942ee656ac, afa8f1971ae, bf347c60bdd7,
> for example.

Thanks for the confirmation.

> I do wonder if there are still other smgr* operations in the source
> code that are preceded by operations that would invalidate the
> SMgrRelation that those smgr* operations would be called with. For
> example, the smgrnblocks() in gistBuildCallback() may get done too
> late relative to a corresponding RelationOpenSmgr() on the index
> relation.

I did the check for gistBuildCallback() by adding Assert(index->rd_smgr)
before smgrnblocks() with the CCA setting and didn't see any problem there.

I think the easiest way to find such cases is to run the regression suite
with a CCA build; there is no guarantee that regression will hit all smgr*
operations, but it might hit most of them.

Regards,
Amul
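The failure mode described above (a cached low-level handle wiped by cache invalidation between where it is set and where it is used) and the re-open-before-use fix can be illustrated with a toy model. `ToyRel` and the `toy_*` names are made up for illustration; they are not the PostgreSQL API:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a Relation whose smgr handle can be invalidated. */
typedef struct ToyRel
{
    int     handle;     /* stand-in for rd_smgr */
    bool    open;
} ToyRel;

/* Cache invalidation wipes the cached handle, as CCA does aggressively. */
static void
toy_invalidate(ToyRel *rel)
{
    rel->open = false;
    rel->handle = -1;
}

/* Analogue of RelationOpenSmgr(): a cheap, idempotent re-open. */
static void
toy_open_smgr(ToyRel *rel)
{
    if (!rel->open)
    {
        rel->handle = 42;       /* pretend we reopened the storage manager */
        rel->open = true;
    }
}

/* Analogue of smgrimmedsync(): requires a valid handle. */
static int
toy_immedsync(ToyRel *rel)
{
    assert(rel->open);          /* in the real code, a NULL rd_smgr here
                                 * is the segfault */
    return rel->handle;
}

/* The fix: always re-open right before the smgr operation. */
static int
safe_sync(ToyRel *rel)
{
    toy_open_smgr(rel);
    return toy_immedsync(rel);
}
```

The point of the pattern is that the re-open is idempotent and cheap, so it is always safe to call immediately before the operation that needs the handle, no matter what invalidations happened in between.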
Re: a verbose option for autovacuum
On Mon, Mar 22, 2021 at 12:17:37PM +0900, Masahiko Sawada wrote:
> I've updated the patch. I saved the index names at the beginning of
> heap_vacuum_rel() for autovacuum logging, and added indstats and
> nindexes to LVRelStats. Some functions still have nindexes as a
> function argument, but that seems to make sense since it corresponds to
> the list of index relations (*Irel). Please review the patch.

Going back to that, the structure of the static APIs in this file makes the
whole logic a bit hard to follow, but the whole set of changes you have done
here makes sense. It took me a moment to recall and understand why it is
safe to free *stats at the end of vacuum_one_index() and whether the index
stats array actually points to the DSM segment correctly within the shared
stats.

I think that there is more consolidation possible within LVRelStats, but
let's leave that for another day, if there is any need for such a move.

To keep it short: sold.
--
Michael
Re: proposal - psql - use pager for \watch command
On Mon, Mar 22, 2021 at 22:07, Thomas Munro wrote:
> On Tue, Mar 23, 2021 at 1:53 AM Pavel Stehule wrote:
>> On Mon, Mar 22, 2021 at 13:13, Thomas Munro wrote:
>>> The problem is that Apple's /dev/tty device is defective, and doesn't
>>> work in poll(). It always returns immediately with revents=POLLNVAL,
>>> but pspg assumes that data is ready and tries to read the keyboard and
>>> then blocks until I press a key. This seems to fix it:
>>>
>>> +#ifndef __APPLE__
>>> +	/* macOS can't use poll() on /dev/tty */
>>> 	state.tty = fopen("/dev/tty", "r+");
>>> +#endif
>>> 	if (!state.tty)
>>> 		state.tty = fopen(ttyname(fileno(stdout)), "r");
>>
>> it is hell.
>
> Heh. I've recently spent many, many hours trying to make AIO work on
> macOS, and nothing surprises me anymore. BTW I found something from
> years ago on the 'net that fits with my observation about /dev/tty:
>
> https://www.mail-archive.com/bug-gnulib@gnu.org/msg00296.html
>
> Curious, which other OS did you put that fallback case in for? I'm a
> little confused about why it works, so I'm not sure if it's the best
> possible change, but I'm not planning to dig any further now, too many
> patches, not enough time :-)

Unfortunately, I have no exact evidence. My original implementation was very
primitive:

    if (freopen("/dev/tty", "rw", stdin) == NULL)
    {
        fprintf(stderr, "cannot to reopen stdin: %s\n", strerror(errno));
        exit(1);
    }

Some people reported problems, but I don't know whether those issues were
related to the tty or to freopen. In some discussion I found a workaround
reusing stdout and stderr, and that works well, but I have no feedback about
the fallback cases. And because this strategy is used by the "less" pager
too, I expect that it is a common and widely used workaround.

I remember there were problems with cygwin and some unix platforms (maybe
very old ones); there was a problem with deeper nesting, something like
screen -> psql -> pspg. pspg worked when run directly, but it didn't work
from psql. Probably the implementation of pty was not fully correct
somewhere.

>> Please, can you verify this fix?
>
> It works perfectly for me on a macOS 11.2 system with that change,
> repainting the screen exactly when it should. I'm happy about that
> because (1) it means I can confirm that the proposed change to psql is
> working correctly on the 3 Unixes I have access to, and (2) I am sure
> that a lot of Mac users will appreciate being able to use super-duper
> \watch mode when this ships (a high percentage of PostgreSQL users I
> know use a Mac as their client machine).

Thank you for verification. I fixed it in the master branch.

>>> A minor problem is that on macOS, _GNU_SOURCE doesn't seem to imply
>>> NCURSES_WIDECHAR, so I suspect Unicode will be broken unless you
>>> manually add -DNCURSES_WIDECHAR=1, though I didn't check.
>>
>> It is possible -
>>
>> can you run "pspg --version"
>
> Looks like I misunderstood: it is showing "with wide char support",
> it's just that the "num" is 0 rather than 1. I'm not planning to
> investigate that any further now, but I checked that it can show the
> output of SELECT 'špeĉiäl chârãçtérs' correctly.

It is the job of ncursesw; pspg sends data to ncurses in the original format
and only plays some games with attributes.

Pavel
Re: UPDATE ... SET (single_column) = row_constructor is a bit broken from V10 906bfcad7ba7c
On Tue, 23 Mar 2021 at 02:43, Justin Pryzby wrote:
> On Mon, Mar 22, 2021 at 02:10:49PM +0530, Mahendra Singh Thalor wrote:
>> Hi Hackers,
>>
>> Commit 906bfcad7ba7c has improved the handling of "UPDATE ... SET
>> (column_list) = row_constructor", but it has broken some cases that
>> worked prior to this commit. After this commit, the query
>> "DO UPDATE SET (t1_col)" gives an error, whereas it worked fine earlier.
>
> See prior discussions:
>
> https://www.postgresql.org/message-id/flat/20170719174507.GA19616%40telsasoft.com
> https://www.postgresql.org/message-id/flat/camjna7cdlzpcs0xnrpkvqmj6vb6g3eh8cygp9zbjxdpfftz...@mail.gmail.com
> https://www.postgresql.org/message-id/flat/87sh5rs74y@news-spur.riddles.org.uk
> https://git.postgresql.org/gitweb/?p=postgresql.git=commit=86182b18957b8f9e8045d55b137aeef7c9af9916

Thanks, Justin. Sorry, my mistake: I opened a new thread without checking
the prior discussions. I will take care of this next time.

--
Thanks and Regards
Mahendra Singh Thalor
EnterpriseDB: http://www.enterprisedb.com
Re: Handling of opckeytype / CREATE OPERATOR CLASS (bug?)
On 3/23/21 3:13 AM, Tom Lane wrote:
> Tomas Vondra writes:
>> while working on the new BRIN opclasses in [1], I stumbled on something
>> I think is actually a bug in how CREATE OPERATOR CLASS handles the
>> storage type.
>
> Hm. Both catalogs.sgml and pg_opclass.h say specifically that
> opckeytype should be zero if it's to be the same as the input
> column type. I don't think just dropping the enforcement of that
> is the right answer.

Yeah, that's possible. I was mostly just demonstrating the difference in
behavior. Maybe the right fix is to fix the catalog contents and then tweak
the AM code, or something.

> I don't recall for sure what-all might depend on that. I suspect
> that it's mostly for the benefit of polymorphic opclasses, so
> maybe the right thing is to say that the opckeytype can be
> polymorphic if opcintype is, and then we resolve it as per
> the usual polymorphism rules.

I did an experiment: I fixed all the opclasses violating the rule by
removing the opckeytype, and ran make check. The only cases causing issues
were cidr and int4range. Not that it proves anything.

> In any case, it's fairly suspicious that the only opclasses
> violating the existing rule are johnny-come-lately BRIN opclasses.

Right, that seems suspicious.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: New IndexAM API controlling index vacuum strategies
On Mon, Mar 22, 2021 at 8:33 PM Peter Geoghegan wrote:
> More concretely, maybe the new GUC is forced to be 1.05 of
> vacuum_freeze_table_age. Of course that scheme is a bit arbitrary --
> but so is the existing 0.95 scheme.

I meant to write 1.05 of autovacuum_vacuum_max_age. So just as
vacuum_freeze_table_age cannot really be greater than 0.95 *
autovacuum_vacuum_max_age, your new GUC cannot really be less than 1.05 *
autovacuum_vacuum_max_age.
--
Peter Geoghegan
Re: New IndexAM API controlling index vacuum strategies
On Mon, Mar 22, 2021 at 8:28 PM Peter Geoghegan wrote:
> You seem to be concerned about a similar contradiction. In fact it's a
> *very* similar contradiction, because this new GUC is naturally a
> "sibling GUC" of both vacuum_freeze_table_age and
> autovacuum_vacuum_max_age (the "units" are the same, though the
> behavior that each GUC triggers is different -- but
> vacuum_freeze_table_age and autovacuum_vacuum_max_age are both already
> *similar and different* in the same way). So perhaps the solution
> should be similar -- silently interpret the setting of the new GUC to
> resolve the contradiction.

More concretely, maybe the new GUC is forced to be 1.05 of
vacuum_freeze_table_age. Of course that scheme is a bit arbitrary -- but so
is the existing 0.95 scheme.

There may be some value in picking a scheme that "advertises" that all three
GUCs are symmetrical, or at least related -- all three divide up the table's
XID space.
--
Peter Geoghegan
Re: Disable WAL logging to speed up data loading
On Mon, 2021-03-22 at 13:22 -0400, Stephen Frost wrote:
> * Laurenz Albe (laurenz.a...@cybertec.at) wrote:
>> On Mon, 2021-03-22 at 11:05 -0400, Stephen Frost wrote:
>>>> Perhaps allowing to set unlogged tables to logged ones without
>>>> writing WAL is a more elegant way to do that, but I cannot see how
>>>> that would be any more crash safe than this patch. Besides, the
>>>> feature doesn't exist yet.
>>>
>>> I'm not suggesting it's somehow more crash safe- but it's at least very
>>> clear what happens in such a case, to wit: the entire table is cleared
>>> on crash recovery.
>>
>> I didn't look at the patch, but are you saying that after changing the
>> table to LOGGED, it somehow remembers that it is not crash safe and is
>> emptied if there is a crash before the next checkpoint?
>
> I'm not sure where the confusion is, but certainly once the table has
> been turned into LOGGED and that's been committed then it should be
> crash safe, even under 'minimal' with the optimization to avoid actually
> copying the entire table into the WAL. I don't see any particular
> reason why that isn't possible to do.

The confusion is probably caused by my ignorance.

If the actual data files of the table are forced out to disk on COMMIT, that
should indeed work. And if you fsync a file that has not been changed since
it was last persisted, that should be cheap.

So if I got that right, I agree with you that that would be a much better
way to achieve the goal than wal_level = 'none'. The only drawback is that
we don't have that feature yet...

Yours,
Laurenz Albe
Re: New IndexAM API controlling index vacuum strategies
On Mon, Mar 22, 2021 at 6:41 PM Masahiko Sawada wrote: > But we're not sure when the next anti-wraparound vacuum will take > place. Since the table is already vacuumed by a non-aggressive vacuum > with disabling index cleanup, an autovacuum will process the table > when the table gets modified enough or the table's relfrozenxid gets > older than autovacuum_vacuum_max_age. If the new threshold, probably a > new GUC, is much lower than autovacuum_vacuum_max_age and > vacuum_freeze_table_age, the table is continuously vacuumed without > advancing relfrozenxid, leading to unnecessarily index bloat. Given > the new threshold is for emergency purposes (i.g., advancing > relfrozenxid faster), I think it might be better to use > vacuum_freeze_table_age as the lower bound of the new threshold. What > do you think? As you know, when the user sets vacuum_freeze_table_age to a value that is greater than the value of autovacuum_vacuum_max_age, the two GUCs have values that are contradictory. This contradiction is resolved inside vacuum_set_xid_limits(), which knows that it should "interpret" the value of vacuum_freeze_table_age as (autovacuum_vacuum_max_age * 0.95) to paper-over the user's error. This 0.95 behavior is documented in the user docs, though it happens silently. You seem to be concerned about a similar contradiction. In fact it's *very* similar contradiction, because this new GUC is naturally a "sibling GUC" of both vacuum_freeze_table_age and autovacuum_vacuum_max_age (the "units" are the same, though the behavior that each GUC triggers is different -- but vacuum_freeze_table_age and autovacuum_vacuum_max_age are both already *similar and different* in the same way). So perhaps the solution should be similar -- silently interpret the setting of the new GUC to resolve the contradiction. (Maybe I should say "these two new GUCs"? MultiXact variant might be needed...) This approach has the following advantages: * It follows precedent. 
* It establishes that the new GUC is a logical extension of the existing vacuum_freeze_table_age and autovacuum_freeze_max_age GUCs. * The default value for the new GUC will be so much higher (say 1.8 billion XIDs) than even the default of autovacuum_freeze_max_age that it won't disrupt anybody's existing postgresql.conf setup. * For the same reason (the big space between autovacuum_freeze_max_age and the new GUC with default settings), you can almost set the new GUC without needing to know about autovacuum_freeze_max_age. * The overall behavior isn't actually restrictive/paternalistic. That is, if you know what you're doing (say you're testing the feature) you can reduce all 3 sibling GUCs to 0 and get the testing behavior that you desire. What do you think? -- Peter Geoghegan
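[Editorial sketch] The silent clamping behavior discussed above can be modeled as a small self-contained function. This is a simplified model of what vacuum_set_xid_limits() does, not the actual PostgreSQL source; the function name and plain int arguments are invented for the example, and integer arithmetic stands in for the real code's `* 0.95`.

```c
/*
 * Simplified model (not PostgreSQL source) of how a contradictory
 * vacuum_freeze_table_age setting is silently resolved: the effective
 * value is capped at 95% of autovacuum_freeze_max_age.
 */
static int
effective_freeze_table_age(int vacuum_freeze_table_age,
                           int autovacuum_freeze_max_age)
{
    /* 95% cap, computed with integer math to keep the sketch exact */
    int cap = autovacuum_freeze_max_age - autovacuum_freeze_max_age / 20;

    return vacuum_freeze_table_age < cap ? vacuum_freeze_table_age : cap;
}
```

The same silent-clamp rule could then apply to the proposed emergency-threshold GUC, using autovacuum_freeze_max_age as the lower bound of its effective value.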
Re: [POC] Fast COPY FROM command for the table with foreign partitions
On Mon, Mar 22, 2021 at 08:18:56PM -0700, Zhihong Yu wrote: > with data_dest_cb callback. It is used for send text representation of a > tuple to a custom destination. > > send text -> sending text I would say "It is used to send the text representation ..." > struct PgFdwModifyState *aux_fmstate; /* foreign-insert state, if > * created */ > + CopyToState cstate; /* foreign COPY state, if used */ > > Since foreign COPY is optional, should cstate be a pointer ? That would be > in line with aux_fmstate. It's actually a pointer: src/include/commands/copy.h:typedef struct CopyToStateData *CopyToState; There are many data structures like this, where a structure is typedefed with a "Data" suffix and the pointer is typedefed without the "Data". -- Justin
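[Editorial sketch] The naming convention Justin describes can be illustrated with a tiny self-contained example. The names mirror copy.h, but the struct member below is invented for the sketch; the real CopyToStateData is of course much larger.

```c
#include <stddef.h>

/*
 * PostgreSQL convention: the struct is typedef'ed with a "Data" suffix,
 * and the pointer type drops the suffix. So "CopyToState cstate" is
 * already a pointer, matching aux_fmstate's style.
 */
typedef struct CopyToStateData
{
    int dummy_field;            /* placeholder member for the sketch */
} CopyToStateData;

typedef struct CopyToStateData *CopyToState;    /* pointer type, no "Data" */

static int
copytostate_is_pointer(void)
{
    /* CopyToState is a pointer type, so NULL assignment is legal */
    CopyToState cstate = NULL;

    return cstate == NULL && sizeof(CopyToState) == sizeof(void *);
}
```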
Re: shared memory stats: high level design decisions: consistency, dropping
On Sun, 21 Mar 2021 at 18:16, Stephen Frost wrote: > > Greetings, > > * Tom Lane (t...@sss.pgh.pa.us) wrote: > > I also believe that the snapshotting behavior has advantages in terms > > of being able to perform multiple successive queries and get consistent > > results from them. Only the most trivial sorts of analysis don't need > > that. > > > > In short, what you are proposing sounds absolutely disastrous for > > usability of the stats views, and I for one will not sign off on it > > being acceptable. > > > > I do think we could relax the consistency guarantees a little bit, > > perhaps along the lines of only caching view rows that have already > > been read, rather than grabbing everything up front. But we can't > > just toss the snapshot concept out the window. It'd be like deciding > > that nobody needs MVCC, or even any sort of repeatable read. > > This isn't the same use-case as traditional tables or relational > concepts in general- there aren't any foreign keys for the fields that > would actually be changing across these accesses to the shared memory > stats- we're talking about gross stats numbers like the number of > inserts into a table, not an employee_id column. In short, I don't > agree that this is a fair comparison. I use these stats quite a bit and do lots of slicing and dicing with them. I don't think it's as bad as Tom says but I also don't think we can be quite as loosey-goosey as I think Andres or Stephen might be proposing either (though I note that they haven't said they don't want any consistency at all). The cases where consistency really matters for me are when I'm doing math involving more than one statistic. Typically that's ratios. E.g. with pg_stat_*_tables I routinely divide seq_tup_read by seq_scan or idx_tup_* by idx_scans. I also often look at the ratio between n_tup_upd and n_tup_hot_upd.
And no, it doesn't help that these are often large numbers after a long time because I'm actually working with the first derivative of these numbers using snapshots or a time series database. So if you have the seq_tup_read incremented but not seq_scan incremented you could get a wildly incorrect calculation of "tup read per seq scan" which actually matters. I don't think I've ever done math across stats for different objects. I mean, I've plotted them together and looked at which was higher but I don't think that's affected by some plots having peaks slightly out of sync with the other. I suppose you could look at the ratio of access patterns between two tables and know that they're only ever accessed by a single code path at the same time and therefore the ratios would be meaningful. But I don't think users would be surprised to find they're not consistent that way either. So I think we need to ensure that at least all the values for a single row representing a single object are consistent. Or at least that there's *some* correct way to retrieve a consistent row and that the standard views use that. I don't think we need to guarantee that every possible plan will always be consistent even if you access the row multiple times in a self-join or use the lookup function on individual columns separately. -- greg
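[Editorial sketch] The failure mode Greg describes can be shown with a small self-contained model of the rate math (hypothetical numbers, not from the thread; field names borrowed from pg_stat_user_tables). Deriving "tuples read per sequential scan" from two snapshots only gives a sane answer if each snapshot's row is internally consistent; otherwise the denominator delta can be zero while the numerator delta is large.

```c
/*
 * Two snapshots of one table's counters, as a monitoring job would
 * collect them for first-derivative (rate) calculations.
 */
typedef struct TableStatsSnapshot
{
    long seq_scan;              /* number of sequential scans */
    long seq_tup_read;          /* tuples read by those scans */
} TableStatsSnapshot;

/* Tuples read per scan over the interval, or -1 if no new scans. */
static double
tup_per_scan(TableStatsSnapshot before, TableStatsSnapshot after)
{
    long dscan = after.seq_scan - before.seq_scan;
    long dtup = after.seq_tup_read - before.seq_tup_read;

    if (dscan <= 0)
        return -1;              /* guard against the inconsistent-read case */
    return (double) dtup / dscan;
}
```

With a consistent per-row read the ratio is meaningful; if seq_tup_read has advanced but seq_scan has not (the torn read Greg worries about), the denominator delta is zero and the "tup read per seq scan" figure is garbage unless guarded.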
Re: [POC] Fast COPY FROM command for the table with foreign partitions
Hi, In the description: with data_dest_cb callback. It is used for send text representation of a tuple to a custom destination. send text -> sending text struct PgFdwModifyState *aux_fmstate; /* foreign-insert state, if * created */ + CopyToState cstate; /* foreign COPY state, if used */ Since foreign COPY is optional, should cstate be a pointer ? That would be in line with aux_fmstate. Cheers On Mon, Mar 22, 2021 at 7:02 PM tsunakawa.ta...@fujitsu.com < tsunakawa.ta...@fujitsu.com> wrote: > From: Andrey Lepikhov > > Macros _() at the postgresExecForeignCopy routine: > > if (PQputCopyEnd(conn, OK ? NULL : _("canceled by server")) <= 0) > > > > uses gettext. Under linux it is compiled ok, because (as i understood) > > uses standard implementation of gettext: > > objdump -t contrib/postgres_fdw/postgres_fdw.so | grep 'gettext' > > gettext@@GLIBC_2.2.5 > > > > but in MacOS (and maybe somewhere else) we need to explicitly link > > libintl library in the Makefile: > > SHLIB_LINK += $(filter -lintl, $(LIBS) > > > > Also, we may not use gettext at all in this part of the code. > > I'm afraid so, because no extension in contrib/ has po/ directory. I just > removed _() and rebased the patch on HEAD. > > > Regards > TakayukiTsunakawa > > >
RE: Disable WAL logging to speed up data loading
From: Stephen Frost > First- what are you expecting would actually happen during crash recovery in > this specific case with your proposed new WAL level? ... > I'm not suggesting it's somehow more crash safe- but it's at least very clear > what happens in such a case, to wit: the entire table is cleared on crash > recovery. As Laurenz-san kindly replied, the database server refuses to start with a clear message. So, it's similarly very clear what happens. The user will never unknowingly resume operation with possibly corrupt data. > We're talking about two different ways to accomplish essentially the same > thing- one which introduces a new WAL level, vs. one which adds an > optimization for a WAL level we already have. That the second is more elegant > is more-or-less entirely the point I'm making here, so it seems pretty > relevant. So, I understood the point boils down to elegance. Could I ask what makes you feel ALTER TABLE UNLOGGED/LOGGED is (more) elegant? I'm purely asking as a user. (I don't want to digress, but if we consider the number of options for wal_level as an issue, I feel it's not elegant to have separate "replica" and "logical".) > Under the proposed 'none', you basically have to throw out the entire cluster > on > a crash, all because you don't want to use 'UNLOGGED' when you created the > tables you want to load data into, or 'TRUNCATE' them in the transaction where > you start the data load, either of which gives us enough indication and which > we have infrastructure around dealing with in the event of a crash during the > load without everything else having to be tossed and everything restored from > a > backup. That's both a better user experience from the perspective of having > fewer WAL levels to understand and from just a general administration > perspective so you don't have to go all the way back to a backup to bring the > system back up. 
The elegance of wal_level = none is that the user doesn't have to remember to add ALTER TABLE to the data loading job when they add load target tables/partitions. If they build and use their own (shell) scripts to load data, that won't be a burden or be forgotten. But what would they have to do when they use ETL tools like Talend, Pentaho, and Informatica Power Center? Do those tools allow users to add custom processing like ALTER TABLE to the data loading job steps for each table? (AFAIK, not.) wal_level = none is convenient and attractive for users who can back up and restore the entire database instantly with a storage or filesystem snapshot feature. Regards TakayukiTsunakawa
Re: make the stats collector shutdown without writing the statsfiles if the immediate shutdown is requested.
On 2021/03/23 9:05, Masahiro Ikeda wrote: Yes. I attached the v5 patch based on v3 patch. I renamed SignalHandlerForUnsafeExit() and fixed the following comment. Thanks for updating the patch! When the startup process exits because of recovery_target_action=shutdown, reaper() calls TerminateChildren(SIGTERM). This function sends SIGTERM to the stats collector. Currently the stats collector ignores SIGTERM, but with the patch it exits normally. This change of behavior might be problematic. That is, TerminateChildren(SIGTERM) sends SIGTERM to various processes. But currently the stats collector and checkpointer don't exit even when SIGTERM arrives because they ignore SIGTERM. After several processes other than the stats collector and checkpointer exit by SIGTERM, PostmasterStateMachine() and reaper() make the checkpointer exit and then the stats collector exit. The shutdown terminates the processes in this order. On the other hand, with the patch, the stats collector exits by SIGTERM before the checkpointer exits. This is not the normal order in which processes exit during shutdown. To address this issue, one idea is to use SIGUSR2 for normal exit of the stats collector, instead of SIGTERM. If we do this, TerminateChildren(SIGTERM) cannot terminate the stats collector. Thought? If we adopt this idea, a detailed comment about why SIGUSR2 is used for that needs to be added. Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
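[Editorial sketch] The signal arrangement the thread settles on can be modeled in a few lines of standalone C. This is not PostgreSQL code: the real collector installs handlers with pqsignal() and exits from its main loop after checking a flag; here plain signal()/raise() and a return value stand in for that, just to show why ignoring SIGTERM while reserving SIGUSR2 preserves the shutdown ordering.

```c
#include <signal.h>

static volatile sig_atomic_t got_sigusr2 = 0;

static void
handle_sigusr2(int signum)
{
    (void) signum;
    got_sigusr2 = 1;            /* real collector: trigger normal exit */
}

static int
demo_signal_setup(void)
{
    /*
     * Model of the patched stats collector: SIGTERM is ignored (during a
     * Unix system shutdown, init SIGTERMs everything at once, and
     * TerminateChildren(SIGTERM) must not kill the collector early);
     * SIGUSR2 is the postmaster's explicit "okay to shut down" message.
     */
    signal(SIGTERM, SIG_IGN);
    signal(SIGUSR2, handle_sigusr2);

    raise(SIGTERM);             /* ignored: process keeps running */
    raise(SIGUSR2);             /* delivered: exit flag gets set */

    return got_sigusr2;
}
```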
Re: shared-memory based stats collector
At Fri, 19 Mar 2021 14:27:38 -0700, Andres Freund wrote in > Hi, > > On 2021-03-10 20:26:56 -0800, Andres Freund wrote: > > > + * We expose this shared entry now. You might think that the > > > entry > > > + * can be removed by a concurrent backend, but since we are > > > creating > > > + * an stats entry, the object actually exists and used in the > > > upper > > > + * layer. Such an object cannot be dropped until the first > > > vacuum > > > + * after the current transaction ends. > > > + */ > > > + dshash_release_lock(pgStatSharedHash, shhashent); > > > > I don't think you can safely release the lock before you incremented the > > refcount? What if, once the lock is released, somebody looks up that > > entry, increments the refcount, and decrements it again? It'll see a > > refcount of 0 at the end and decide to free the memory. Then the code > > below will access already freed / reused memory, no? > > Yep, it's not even particularly hard to hit: > > S0: CREATE TABLE a_table(); > S0: INSERT INTO a_table(); > S0: disconnect > S1: set a breakpoint to just after the dshash_release_lock(), with an > if objid == a_table_oid > S1: SELECT pg_stat_get_live_tuples('a_table'::regclass); > (will break at above breakpoint, without having incremented the > refcount yet) > S2: DROP TABLE a_table; > S2: VACUUM pg_class; > > At that point S2's call to pgstat_vacuum_stat() will find the shared > stats entry for a_table, delete the entry from the shared hash table, > see that the stats data has a zero refcount, and free it. Once S1 wakes > up it'll use already freed (and potentially since reused) memory. Sorry for the delay. You're right. I actually see a permanent block when continuing to run S1 after the vacuum. That happens at LWLockRelease on a freed block. Moving the refcount bump before the dshash_release_lock() call fixes that. One issue with doing that was that get_stat_entry() had a path for the case where pgStatCacheContext is not available, which is already dead code.
After the early lock release is removed, the comment is no longer needed either. While working on this, I noticed that the previous diff v56-57-func-diff.txt was slightly stale (missing a later bug fix). So the attached contains a fix on the amendment path. Please find the attached. regards. -- Kyotaro Horiguchi NTT Open Source Software Center diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c index 3f546afe6a..3e2b90e92b 100644 --- a/src/backend/postmaster/pgstat.c +++ b/src/backend/postmaster/pgstat.c @@ -884,6 +884,8 @@ get_stat_entry(PgStatTypes type, Oid dbid, Oid objid, bool nowait, bool create, create, nowait, create, &shfound); if (shhashent) { + bool lofound; + if (create && !shfound) { /* Create new stats entry. */ @@ -900,38 +902,27 @@ get_stat_entry(PgStatTypes type, Oid dbid, Oid objid, bool nowait, bool create, else shheader = dsa_get_address(area, shhashent->body); - /* -* We expose this shared entry now. You might think that the entry -* can be removed by a concurrent backend, but since we are creating -* an stats entry, the object actually exists and used in the upper -* layer. Such an object cannot be dropped until the first vacuum -* after the current transaction ends. -*/ + pg_atomic_add_fetch_u32(&shheader->refcount, 1); dshash_release_lock(pgStatSharedHash, shhashent); - /* register to local hash if possible */ - if (pgStatEntHash || pgStatCacheContext) + /* Create local hash if not yet */ + if (pgStatEntHash == NULL) { - bool lofound; + Assert(pgStatCacheContext); - if (pgStatEntHash == NULL) - { - pgStatEntHash = - pgstat_localhash_create(pgStatCacheContext, - PGSTAT_TABLE_HASH_SIZE, NULL); - pgStatEntHashAge = - pg_atomic_read_u64(&StatsShmem->gc_count); - } - - lohashent = - pgstat_localhash_insert(pgStatEntHash, key, &lofound); - - Assert(!lofound); - lohashent->body = shheader; - lohashent->dsapointer = shhashent->body; - - pg_atomic_add_fetch_u32(&shheader->refcount, 1); + pgStatEntHash = +
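[Editorial sketch] The locking rule behind the fix above -- pin the entry before letting go of the hash lock -- can be shown with a toy model. This uses pthreads and C11 atomics rather than PostgreSQL's dshash/LWLock machinery, and all names are invented; only the ordering is the point.

```c
#include <pthread.h>
#include <stdatomic.h>

/* Toy stand-in for a shared stats entry protected by a hash lock. */
typedef struct SharedEntry
{
    atomic_int refcount;        /* freed by "vacuum" when it reaches 0 */
} SharedEntry;

static pthread_mutex_t hash_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * Correct ordering: bump the refcount while still holding the lock, so
 * a concurrent cleanup that finds refcount == 0 cannot free the entry
 * between our lock release and our first use of it.
 */
static int
lookup_entry(SharedEntry *entry)
{
    pthread_mutex_lock(&hash_lock);
    atomic_fetch_add(&entry->refcount, 1);  /* pin first ... */
    pthread_mutex_unlock(&hash_lock);       /* ... then release the lock */
    return atomic_load(&entry->refcount);
}
```

Releasing the lock before the increment reopens exactly the window Andres's S0/S1/S2 scenario exploits.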
Re: Key management with tests
On Mon, Mar 22, 2021 at 08:38:37PM -0400, Bruce Momjian wrote: > > This particular patch (introducing the RelationIsPermanent() macro) > > seems like it'd be a nice thing to commit independently of the rest, > > reducing the size of this patch set..? > > Committed as suggested. Also, I have written a short presentation on where I think we are with cluster file encryption: https://momjian.us/main/writings/pgsql/cfe.pdf -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Re: Handling of opckeytype / CREATE OPERATOR CLASS (bug?)
Tomas Vondra writes: > while working on the new BRIN opclasses in [1], I stumbled on something > I think is actually a bug in how CREATE OPERATOR CLASS handles the > storage type. Hm. Both catalogs.sgml and pg_opclass.h say specifically that opckeytype should be zero if it's to be the same as the input column type. I don't think just dropping the enforcement of that is the right answer. I don't recall for sure what-all might depend on that. I suspect that it's mostly for the benefit of polymorphic opclasses, so maybe the right thing is to say that the opckeytype can be polymorphic if opcintype is, and then we resolve it as per the usual polymorphism rules. In any case, it's fairly suspicious that the only opclasses violating the existing rule are johnny-come-lately BRIN opclasses. regards, tom lane
RE: [PATCH] Tracking statements entry timestamp in pg_stat_statements
Dear Andrei, I think the idea is good because the pg_stat_statements_info view cannot distinguish whether a specific statement has been deallocated or not. But calling GetCurrentTimestamp() multiple times may cause performance issues. How about adding a configuration parameter for controlling this feature? Or do we not need to worry about that? > + if (api_version >= PGSS_V1_9) > + { > + values[i++] = TimestampTzGetDatum(first_seen); > + } I think the braces {} are not needed here. Best Regards, Hayato Kuroda FUJITSU LIMITED
RE: [POC] Fast COPY FROM command for the table with foreign partitions
From: Andrey Lepikhov > Macros _() at the postgresExecForeignCopy routine: > if (PQputCopyEnd(conn, OK ? NULL : _("canceled by server")) <= 0) > > uses gettext. Under linux it is compiled ok, because (as i understood) > uses standard implementation of gettext: > objdump -t contrib/postgres_fdw/postgres_fdw.so | grep 'gettext' > gettext@@GLIBC_2.2.5 > > but in MacOS (and maybe somewhere else) we need to explicitly link > libintl library in the Makefile: > SHLIB_LINK += $(filter -lintl, $(LIBS) > > Also, we may not use gettext at all in this part of the code. I'm afraid so, because no extension in contrib/ has po/ directory. I just removed _() and rebased the patch on HEAD. Regards TakayukiTsunakawa v21-0001-Fast-COPY-FROM-into-the-foreign-or-sharded-table.patch
Re: PostmasterIsAlive() in recovery (non-USE_POST_MASTER_DEATH_SIGNAL builds)
On Tue, Mar 23, 2021 at 2:44 PM Fujii Masao wrote: > I found 0001 patch was committed in de829ddf23, and which added new > wait event WalrcvExit. This name seems not consistent with other wait > events. I'm thinking it's better to rename it to WalReceiverExit. Thought? > Patch attached. Agreed, your names are better. Thanks.
Re: PostmasterIsAlive() in recovery (non-USE_POST_MASTER_DEATH_SIGNAL builds)
On 2021/03/02 10:10, Thomas Munro wrote: On Tue, Mar 2, 2021 at 12:00 AM Thomas Munro wrote: On Mon, Nov 16, 2020 at 8:56 PM Michael Paquier wrote: No objections with the two changes from pg_usleep() to WaitLatch() so they could be applied separately first. I thought about committing that first part, and got as far as splitting the patch into two (see attached), but then I re-read Fujii-san's message about the speed of promotion and realised that we really should have something like a condition variable for walRcvState changes. I'll look into that. Here's an experimental attempt at that, though I'm not sure if it's the right approach. Of course it's not necessary to use condition variables here: we could use recoveryWakeupLatch, because we're not in any doubt about who needs to be woken up. But then you could get constant wakeups while recovery is paused, unless you also suppressed that somehow. You could use the startup process's procLatch, advertised in shmem, but that's almost a condition variable. With a condition variable, you get to name it something like walRcvStateChanged, and then the programming rule is very clear: if you change walRcvState, you need to broadcast that fact (and you don't have to worry about who might be interested). One question I haven't got to the bottom of: is it a problem for the startup process that CVs use CHECK_FOR_INTERRUPTS()? I found that the 0001 patch was committed in de829ddf23, which added the new wait event WalrcvExit. This name is not consistent with other wait events. I'm thinking it's better to rename it to WalReceiverExit. Thought? Patch attached.
Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml index 19540206f9..43c07da20e 100644 --- a/doc/src/sgml/monitoring.sgml +++ b/doc/src/sgml/monitoring.sgml @@ -1763,8 +1763,8 @@ postgres 27093 0.0 0.0 30096 2752 ?Ss 11:34 0:00 postgres: ser replication. - WalrcvExit - Waiting for the walreceiver to exit. + WalReceiverExit + Waiting for the WAL receiver to exit. WalReceiverWaitStart diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c index b7af7c2707..60f45ccc4e 100644 --- a/src/backend/postmaster/pgstat.c +++ b/src/backend/postmaster/pgstat.c @@ -4121,8 +4121,8 @@ pgstat_get_wait_ipc(WaitEventIPC w) case WAIT_EVENT_SYNC_REP: event_name = "SyncRep"; break; - case WAIT_EVENT_WALRCV_EXIT: - event_name = "WalrcvExit"; + case WAIT_EVENT_WAL_RECEIVER_EXIT: + event_name = "WalReceiverExit"; break; case WAIT_EVENT_WAL_RECEIVER_WAIT_START: event_name = "WalReceiverWaitStart"; diff --git a/src/backend/replication/walreceiverfuncs.c b/src/backend/replication/walreceiverfuncs.c index fff6c54c45..6f0acbfdef 100644 --- a/src/backend/replication/walreceiverfuncs.c +++ b/src/backend/replication/walreceiverfuncs.c @@ -224,7 +224,7 @@ ShutdownWalRcv(void) ConditionVariablePrepareToSleep(&WalRcv->walRcvStoppedCV); while (WalRcvRunning()) ConditionVariableSleep(&WalRcv->walRcvStoppedCV, - WAIT_EVENT_WALRCV_EXIT); + WAIT_EVENT_WAL_RECEIVER_EXIT); ConditionVariableCancelSleep(); } diff --git a/src/include/pgstat.h b/src/include/pgstat.h index 2c82313550..87672e6f30 100644 --- a/src/include/pgstat.h +++ b/src/include/pgstat.h @@ -1008,7 +1008,7 @@ typedef enum WAIT_EVENT_REPLICATION_SLOT_DROP, WAIT_EVENT_SAFE_SNAPSHOT, WAIT_EVENT_SYNC_REP, - WAIT_EVENT_WALRCV_EXIT, + WAIT_EVENT_WAL_RECEIVER_EXIT, WAIT_EVENT_WAL_RECEIVER_WAIT_START, WAIT_EVENT_XACT_GROUP_UPDATE } WaitEventIPC;
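[Editorial sketch] The walRcvStateChanged idea from the thread -- name the condition variable after the state change and have every writer broadcast it -- looks roughly like this with plain pthreads. It is a self-contained model: PostgreSQL's ConditionVariable API differs, and the names here are invented.

```c
#include <pthread.h>

enum { WALRCV_RUNNING, WALRCV_STOPPED };

static int walrcv_state = WALRCV_RUNNING;
static pthread_mutex_t state_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t walrcv_state_changed = PTHREAD_COND_INITIALIZER;

/* The "walreceiver": any change to walrcv_state must be broadcast. */
static void *
walreceiver_main(void *arg)
{
    (void) arg;
    pthread_mutex_lock(&state_lock);
    walrcv_state = WALRCV_STOPPED;                  /* the state change ... */
    pthread_cond_broadcast(&walrcv_state_changed);  /* ... is broadcast */
    pthread_mutex_unlock(&state_lock);
    return NULL;
}

/* The "startup process": loop on the predicate, not on the wakeup. */
static int
shutdown_walrcv(void)
{
    pthread_t walreceiver;

    pthread_create(&walreceiver, NULL, walreceiver_main, NULL);

    pthread_mutex_lock(&state_lock);
    while (walrcv_state != WALRCV_STOPPED)
        pthread_cond_wait(&walrcv_state_changed, &state_lock);
    pthread_mutex_unlock(&state_lock);

    pthread_join(walreceiver, NULL);
    return walrcv_state;
}
```

Because waiters re-check the predicate under the lock, the broadcast can happen before or after the waiter arrives without losing the wakeup -- which is what makes the "broadcast on every state change" rule so simple to follow.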
Re: New IndexAM API controlling index vacuum strategies
On Fri, Mar 19, 2021 at 3:36 AM Peter Geoghegan wrote: > > On Thu, Mar 18, 2021 at 3:32 AM Masahiko Sawada wrote: > > If we have the constant threshold of 1 billion transactions, a vacuum > > operation might not be an anti-wraparound vacuum and even not be an > > aggressive vacuum, depending on autovacuum_freeze_max_age value. Given > > the purpose of skipping index vacuuming in this case, I think it > > doesn't make sense to have non-aggressive vacuum skip index vacuuming > > since it might not be able to advance relfrozenxid. If we have a > > constant threshold, 2 billion transactions, maximum value of > > autovacuum_freeze_max_age, seems to work. > > I like the idea of not making the behavior a special thing that only > happens with a certain variety of VACUUM operation (non-aggressive or > anti-wraparound VACUUMs). Just having a very high threshold should be > enough. > > Even if we're not going to be able to advance relfrozenxid, we'll > still finish much earlier and let a new anti-wraparound vacuum take > place that will do that -- and will be able to reuse much of the work > of the original VACUUM. Of course this anti-wraparound vacuum will > also skip index vacuuming from the start (whereas the first VACUUM may > well have done some index vacuuming before deciding to end index > vacuuming to hurry with finishing). But we're not sure when the next anti-wraparound vacuum will take place. Since the table is already vacuumed by a non-aggressive vacuum with index cleanup disabled, an autovacuum will process the table when the table gets modified enough or the table's relfrozenxid gets older than autovacuum_freeze_max_age. If the new threshold, probably a new GUC, is much lower than autovacuum_freeze_max_age and vacuum_freeze_table_age, the table is continuously vacuumed without advancing relfrozenxid, leading to unnecessary index bloat.
Given that the new threshold is for emergency purposes (i.e., advancing relfrozenxid faster), I think it might be better to use vacuum_freeze_table_age as the lower bound of the new threshold. What do you think? Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
Handling of opckeytype / CREATE OPERATOR CLASS (bug?)
Hi, while working on the new BRIN opclasses in [1], I stumbled on something I think is actually a bug in how CREATE OPERATOR CLASS handles the storage type. If you look at built-in opclasses, there's a bunch of cases where (opcintype == opckeytype), for example the BRIN opclasses for inet data type: test=# select oid, opcname, opcfamily, opcintype, opckeytype from pg_opclass where opcintype = 869 order by oid; oid |opcname| opcfamily | opcintype | opckeytype ---+---+---+---+ 10105 | inet_minmax_ops | 4075 | 869 |869 10106 | inet_inclusion_ops| 4102 | 869 |869 The fact that opckeytype is set is important, because this allows the opclass to handle data with cidr data type - there's no opclass for cidr, but we can do this: create table t (a cidr); insert into t values ('127.0.0.1'); insert into t values ('127.0.0.2'); insert into t values ('127.0.0.3'); create index on t using brin (a inet_minmax_ops); This seems to work fine. Now, if you manually update the opckeytype to 0, you'll get this: test=# update pg_opclass set opckeytype = 0 where oid = 10105; UPDATE 1 test=# create index on t using brin (a inet_minmax_ops); ERROR: missing operator 1(650,650) in opfamily 4075 Unfortunately, it turns out it's impossible to re-create this opclass using CREATE OPERATOR CLASS, because opclasscmds.c does this: /* Just drop the spec if same as column datatype */ if (storageoid == typeoid) storageoid = InvalidOid; So the storageoid is reset to 0. This only works for the built-in opclasses because those are injected directly into the catalogs. Either CREATE OPERATOR CLASS is not considering something before resetting the storageoid, or maybe it should be reset for all opclasses (including the built-in ones) and the code that's using it should restore it in those cases (AFAICS opckeytype=0 means it's the same as opcintype). Attached is a PoC patch doing the first thing - this does the trick for me, but I'm not 100% sure it's the right fix.
[1] https://www.postgresql.org/message-id/54b47e66-bd8a-d44a-2090-fd4b2f49abe6%40enterprisedb.com -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company >From 4ee5e2bbd0c44c82d1604836db5352f3e0e0fbf2 Mon Sep 17 00:00:00 2001 From: Tomas Vondra Date: Mon, 22 Mar 2021 20:36:59 +0100 Subject: [PATCH 1/5] fixup opclass storage type --- src/backend/commands/opclasscmds.c | 5 + 1 file changed, 1 insertion(+), 4 deletions(-) diff --git a/src/backend/commands/opclasscmds.c b/src/backend/commands/opclasscmds.c index fad39e2b75..78b2d69782 100644 --- a/src/backend/commands/opclasscmds.c +++ b/src/backend/commands/opclasscmds.c @@ -576,10 +576,7 @@ DefineOpClass(CreateOpClassStmt *stmt) */ if (OidIsValid(storageoid)) { - /* Just drop the spec if same as column datatype */ - if (storageoid == typeoid) - storageoid = InvalidOid; - else if (!amstorage) + if ((storageoid != typeoid) && (!amstorage)) ereport(ERROR, (errcode(ERRCODE_INVALID_OBJECT_DEFINITION), errmsg("storage type cannot be different from data type for access method \"%s\"", -- 2.30.2
Re: Type of wait events WalReceiverWaitStart and WalSenderWaitForWAL
On 2021/03/22 13:59, Fujii Masao wrote: Ok, so barring any objection, I will commit the patch that I posted upthread. Pushed! I'm waiting for the other two patches to be reviewed :) Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
Re: Wrong statistics for size of XLOG_SWITCH during pg_waldump.
On 2021/03/22 17:49, Kyotaro Horiguchi wrote: At Mon, 22 Mar 2021 14:07:43 +0900, Fujii Masao wrote in On 2021/03/22 14:03, shinya11.k...@nttdata.com wrote: Barring any objection, I will commit this. I think it's good except for a typo "thoes four bits" Thanks for the review! Attached is the updated version of the patch. Thanks for picking it up. LGTM and applies cleanly. Pushed. Thanks! Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?
On Mon, Mar 22, 2021 at 05:17:15PM -0700, Zhihong Yu wrote: > Hi, > For queryjumble.c : > > + * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group > > The year should be updated. > Same with queryjumble.h Thanks, fixed. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Re: create table like: ACCESS METHOD
On Tue, Jan 19, 2021 at 03:03:31PM -0600, Justin Pryzby wrote: > On Wed, Dec 30, 2020 at 12:33:56PM +, Simon Riggs wrote: > > There are no tests for the new functionality, please could you add some? > > Did you look at the most recent patch? > > +CREATE ACCESS METHOD heapdup TYPE TABLE HANDLER heap_tableam_handler; > +CREATE TABLE likeam() USING heapdup; > +CREATE TABLE likeamlike(LIKE likeam INCLUDING ALL); > > > > Also, I just realized that Dilip's toast compression patch adds "INCLUDING > COMPRESSION", which is stored in pg_am. That's an implementation detail of > that patch, but it's not intuitive that "including access method" wouldn't > include the compression stored there. So I think this should use "INCLUDING > TABLE ACCESS METHOD" not just ACCESS METHOD. Since the TOAST patch ended up not using access methods after all, I renamed this back to "like ACCESS METHOD" (without table). For now, I left TableLikeOption un-alphabetized. -- Justin >From 2da8104a39f65f576d0f221c288502d27211a209 Mon Sep 17 00:00:00 2001 From: Justin Pryzby Date: Sun, 15 Nov 2020 16:54:53 -0600 Subject: [PATCH v4] create table (like .. 
including ACCESS METHOD) --- doc/src/sgml/ref/create_table.sgml | 12 +++- src/backend/parser/gram.y | 1 + src/backend/parser/parse_utilcmd.c | 10 ++ src/include/nodes/parsenodes.h | 1 + src/test/regress/expected/create_table_like.out | 12 src/test/regress/sql/create_table_like.sql | 8 6 files changed, 43 insertions(+), 1 deletion(-) diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml index c6c248f1e9..dd92fd0e9a 100644 --- a/doc/src/sgml/ref/create_table.sgml +++ b/doc/src/sgml/ref/create_table.sgml @@ -87,7 +87,7 @@ class="parameter">referential_action ] [ ON UPDATE and like_option is: -{ INCLUDING | EXCLUDING } { COMMENTS | CONSTRAINTS | DEFAULTS | GENERATED | IDENTITY | INDEXES | STATISTICS | STORAGE | ALL } +{ INCLUDING | EXCLUDING } { ACCESS METHOD | COMMENTS | CONSTRAINTS | DEFAULTS | GENERATED | IDENTITY | INDEXES | STATISTICS | STORAGE | ALL } and partition_bound_spec is: @@ -613,6 +613,16 @@ WITH ( MODULUS numeric_literal, REM available options are: + +INCLUDING ACCESS METHOD + + + The table's access method will be copied. By default, the + default_table_access_method is used. 
+ + + + INCLUDING COMMENTS diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y index bc43641ffe..383e0671af 100644 --- a/src/backend/parser/gram.y +++ b/src/backend/parser/gram.y @@ -3741,6 +3741,7 @@ TableLikeOption: | STATISTICS { $$ = CREATE_TABLE_LIKE_STATISTICS; } | STORAGE { $$ = CREATE_TABLE_LIKE_STORAGE; } | COMPRESSION { $$ = CREATE_TABLE_LIKE_COMPRESSION; } +| ACCESS METHOD { $$ = CREATE_TABLE_LIKE_ACCESS_METHOD; } | ALL{ $$ = CREATE_TABLE_LIKE_ALL; } ; diff --git a/src/backend/parser/parse_utilcmd.c b/src/backend/parser/parse_utilcmd.c index aa6c19adad..07e18fa62f 100644 --- a/src/backend/parser/parse_utilcmd.c +++ b/src/backend/parser/parse_utilcmd.c @@ -97,6 +97,7 @@ typedef struct bool ispartitioned; /* true if table is partitioned */ PartitionBoundSpec *partbound; /* transformed FOR VALUES */ bool ofType; /* true if statement contains OF typename */ + char *accessMethod; /* table access method */ } CreateStmtContext; /* State shared by transformCreateSchemaStmt and its subroutines */ @@ -253,6 +254,7 @@ transformCreateStmt(CreateStmt *stmt, const char *queryString) cxt.ispartitioned = stmt->partspec != NULL; cxt.partbound = stmt->partbound; cxt.ofType = (stmt->ofTypename != NULL); + cxt.accessMethod = NULL; Assert(!stmt->ofTypename || !stmt->inhRelations); /* grammar enforces */ @@ -347,6 +349,9 @@ transformCreateStmt(CreateStmt *stmt, const char *queryString) stmt->tableElts = cxt.columns; stmt->constraints = cxt.ckconstraints; + if (cxt.accessMethod != NULL) + stmt->accessMethod = cxt.accessMethod; + result = lappend(cxt.blist, stmt); result = list_concat(result, cxt.alist); result = list_concat(result, save_alist); @@ -1137,6 +1142,11 @@ transformTableLikeClause(CreateStmtContext *cxt, TableLikeClause *table_like_cla cxt->likeclauses = lappend(cxt->likeclauses, table_like_clause); } + /* ACCESS METHOD doesn't apply and isn't copied for partitioned tables */ + if ((table_like_clause->options & CREATE_TABLE_LIKE_ACCESS_METHOD) != 0 && 
+ !cxt->ispartitioned) + cxt->accessMethod = get_am_name(relation->rd_rel->relam); + /* * We may copy extended statistics if requested, since the representation * of CreateStatsStmt doesn't
Re: Key management with tests
On Thu, Mar 18, 2021 at 11:31:34AM -0400, Stephen Frost wrote: > > src/backend/access/gist/gistutil.c | 2 +- > > src/backend/access/heap/heapam_handler.c | 2 +- > > src/backend/catalog/pg_publication.c | 2 +- > > src/backend/commands/tablecmds.c | 10 +- > > src/backend/optimizer/util/plancat.c | 3 +-- > > src/backend/utils/cache/relcache.c | 2 +- > > src/include/utils/rel.h | 10 -- > > src/include/utils/snapmgr.h | 3 +-- > > 8 files changed, 19 insertions(+), 15 deletions(-) > > This particular patch (introducing the RelationIsPermanent() macro) > seems like it'd be a nice thing to commit independently of the rest, > reducing the size of this patch set..? Committed as suggested. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Re: shared-memory based stats collector
Hi, On 2021-03-22 16:17:37 -0700, Andres Freund wrote: > and to reduce unnecessary diff noise This patch has just tons of stuff like: -/* - * Calculate function call usage and update stat counters. - * Called by the executor after invoking a function. +/* -- + * pgstat_end_function_usage() - * - * In the case of a set-returning function that runs in value-per-call mode, - * we will see multiple pgstat_init_function_usage/pgstat_end_function_usage - * calls for what the user considers a single call of the function. The - * finalize flag should be TRUE on the last call. + * Calculate function call usage and update stat counters. + * Called by the executor after invoking a function. + * + * In the case of a set-returning function that runs in value-per-call mode, + * we will see multiple pgstat_init_function_usage/pgstat_end_function_usage + * calls for what the user considers a single call of the function. The + * finalize flag should be TRUE on the last call. + * -- and typedef struct PgStat_StatTabEntry { -Oid tableid; +/* Persistent data follow */ +TimestampTz vacuum_timestamp; /* user initiated vacuum */ +TimestampTz autovac_vacuum_timestamp; /* autovacuum initiated */ +TimestampTz analyze_timestamp; /* user initiated */ +TimestampTz autovac_analyze_timestamp; /* autovacuum initiated */ PgStat_Counter numscans; @@ -773,103 +352,31 @@ typedef struct PgStat_StatTabEntry PgStat_Counter blocks_fetched; PgStat_Counter blocks_hit; -TimestampTz vacuum_timestamp; /* user initiated vacuum */ PgStat_Counter vacuum_count; -TimestampTz autovac_vacuum_timestamp; /* autovacuum initiated */ PgStat_Counter autovac_vacuum_count; -TimestampTz analyze_timestamp; /* user initiated */ PgStat_Counter analyze_count; -TimestampTz autovac_analyze_timestamp; /* autovacuum initiated */ PgStat_Counter autovac_analyze_count; } PgStat_StatTabEntry; and changes like s/PgStat_WalStats/PgStat_Wal/. I can't really recognize what the pattern of what was changed and what not is? 
Mixing this into a 300kb patch really adds a bunch of work. Greetings, Andres Freund
Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?
Hi, For queryjumble.c : + * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group The year should be updated. Same with queryjumble.h Cheers On Mon, Mar 22, 2021 at 2:56 PM Bruce Momjian wrote: > On Sat, Mar 20, 2021 at 02:12:34PM +0800, Julien Rouhaud wrote: > > On Fri, Mar 19, 2021 at 12:53:18PM -0400, Bruce Momjian wrote: > > > > > > Well, given we don't really want to support multiple query id types > > > being generated or displayed, the "error out" above should fix it. > > > > > > Let's do this --- tell extensions to error out if the query id is > > > already set, either by compute_query_id or another extension. If an > > > extension wants to generate its own query id and store is internal to > > > the extension, that is fine, but the server-displayed query id should > be > > > generated once and never overwritten by an extension. > > > > Agreed, this will ensure that you won't dynamically change the queryid > source. > > > > We should also document that changing it requires a restart and calling > > pg_stat_statements_reset() afterwards. > > > > v19 adds some changes, plus extra documentation for pg_stat_statements > about > > the requirement for a queryid to be calculated, and a note that all > documented > > details only apply for in-core source. I'm not sure if this is still > the best > > place to document those details anymore though. > > OK, after reading the entire thread, I don't think there are any > remaining open issues with this patch and I think this is ready for > committing. I have adjusted the doc section of the patches, attached. > I have marked myself as committer in the commitfest app and hope to > apply it in the next few days based on feedback. > > -- > Bruce Momjian https://momjian.us > EDB https://enterprisedb.com > > If only the physical world exists, free will is an illusion. > >
Re: make the stats collector shutdown without writing the statsfiles if the immediate shutdown is requested.
On 2021/03/22 23:59, Fujii Masao wrote: > > > On 2021/03/20 13:40, Masahiro Ikeda wrote: >> Sorry, there is no evidence we should call it. >> I thought that calling FreeWaitEventSet() after using >> CreateWaitEventSet() was simply good manners, and the performance impact of >> calling it seems very small. >> >> (Please let me know if my understanding is wrong.) >> The reason I called it a matter of manners is that there is no problem >> even without calling FreeWaitEventSet() before the process exits, regardless >> of fast or immediate shutdown. Since the "WaitEventSet *wes" is a process-local >> resource, >> it will be released by the OS even if FreeWaitEventSet() is not called. >> >> I'm now changing my mind: skipping it is better because the code can be >> simpler, and >> it's unimportant for the above reason, especially when an immediate shutdown >> is >> requested, which is itself bad manners. > > Thanks for explaining this! So you're thinking that the v3 patch should be chosen? > Probably some review comments I posted upthread need to be applied to > the v3 patch, e.g., the rename of the SignalHandlerForUnsafeExit() function. Yes. I attached the v5 patch based on the v3 patch. I renamed SignalHandlerForUnsafeExit() and fixed the following comment. > "SIGTERM or SIGQUIT of our parent postmaster" should be > "SIGTERM, SIGQUIT, or detect ungraceful death of our parent > postmaster"? >> BTW, SysLoggerMain() creates a WaitEventSet, but it doesn't call >> FreeWaitEventSet(). >> Is it better to call FreeWaitEventSet() before exiting there too? > > Yes, if that could cause an actual issue. Otherwise I don't have a strong > reason to do that for now. OK. AFAIK, this doesn't lead to critical problems like a memory leak.
Regards, -- Masahiro Ikeda NTT DATA CORPORATION diff --git a/src/backend/postmaster/interrupt.c b/src/backend/postmaster/interrupt.c index dd9136a942..8621212eda 100644 --- a/src/backend/postmaster/interrupt.c +++ b/src/backend/postmaster/interrupt.c @@ -64,9 +64,27 @@ SignalHandlerForConfigReload(SIGNAL_ARGS) } /* - * Simple signal handler for exiting quickly as if due to a crash. + * Simple signal handler for processes which have not yet touched or do not + * touch shared memory to exit quickly. * - * Normally, this would be used for handling SIGQUIT. + * If processes already touched shared memory, call + * SignalHandlerForCrashExit() because shared memory may be corrupted. + */ +void +SignalHandlerForNonCrashExit(SIGNAL_ARGS) +{ + /* + * Since we don't touch shared memory, we can just pull the plug and exit + * without running any atexit handlers. + */ + _exit(1); +} + +/* + * Simple signal handler for processes which have touched shared memory to + * exit quickly. + * + * Normally, this would be used for handling SIGQUIT as if due to a crash. */ void SignalHandlerForCrashExit(SIGNAL_ARGS) @@ -93,9 +111,8 @@ SignalHandlerForCrashExit(SIGNAL_ARGS) * shut down and exit. * * Typically, this handler would be used for SIGTERM, but some processes use - * other signals. In particular, the checkpointer exits on SIGUSR2, the - * stats collector on SIGQUIT, and the WAL writer exits on either SIGINT - * or SIGTERM. + * other signals. In particular, the checkpointer exits on SIGUSR2 and the + * WAL writer exits on either SIGINT or SIGTERM. * * ShutdownRequestPending should be checked at a convenient place within the * main loop, or else the main loop should call HandleMainLoopInterrupts. 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c index 208a33692f..44c471937b 100644 --- a/src/backend/postmaster/pgstat.c +++ b/src/backend/postmaster/pgstat.c @@ -724,6 +724,7 @@ pgstat_reset_remove_files(const char *directory) snprintf(fname, sizeof(fname), "%s/%s", directory, entry->d_name); + elog(DEBUG2, "removing stats file \"%s\"", fname); unlink(fname); } FreeDir(dir); @@ -4821,13 +4822,19 @@ PgstatCollectorMain(int argc, char *argv[]) /* * Ignore all signals usually bound to some action in the postmaster, - * except SIGHUP and SIGQUIT. Note we don't need a SIGUSR1 handler to - * support latch operations, because we only use a local latch. + * except SIGHUP, SIGTERM and SIGQUIT. Note we don't need a SIGUSR1 + * handler to support latch operations, because we only use a local latch. */ pqsignal(SIGHUP, SignalHandlerForConfigReload); pqsignal(SIGINT, SIG_IGN); - pqsignal(SIGTERM, SIG_IGN); - pqsignal(SIGQUIT, SignalHandlerForShutdownRequest); + pqsignal(SIGTERM, SignalHandlerForShutdownRequest); + + /* + * If SIGQUIT is received, exit quickly without doing any additional work, + * for example writing stats files. We arrange to do _exit(1) because the + * stats collector doesn't touch shared memory. + */ + pqsignal(SIGQUIT, SignalHandlerForNonCrashExit); pqsignal(SIGALRM, SIG_IGN); pqsignal(SIGPIPE, SIG_IGN); pqsignal(SIGUSR1, SIG_IGN); @@ -4852,8 +4859,8 @@ PgstatCollectorMain(int argc, char *argv[]) AddWaitEventToSet(wes,
Re: [HACKERS] Custom compression methods
Justin Pryzby writes: > On Mon, Mar 22, 2021 at 03:47:58PM -0400, Robert Haas wrote: >> Are you sure you're looking at the patch I sent, >> toast-compression-guc-rmh.patch? I can't help wondering if you applied >> it to a dirty source tree or got the wrong file or something, because >> otherwise I don't understand why you're seeing things that I'm not >> seeing. > I'm guessing Tom read this hunk as being changes to > check_default_toast_compression() rather than removing the function ? Yeah, after looking closer, the diff looks like check_default_toast_compression is being modified in-place, whereas actually it's getting replaced by CompressionNameToMethod which does something entirely different. I'd also not looked closely enough at where NO_LZ4_SUPPORT() was being moved to. My apologies --- I can only plead -ENOCAFFEINE. regards, tom lane
Re: [HACKERS] logical decoding of two-phase transactions
On Mon, Mar 22, 2021 at 11:51 PM Amit Kapila wrote: > > I have incorporated all your changes and additionally made a few more > changes: (a) got rid of LogicalRepBeginPrepareData and instead used > LogicalRepPreparedTxnData, (b) made a number of changes in comments > and docs, (c) ran pgindent, (d) modified tests to use the standard > wait_for_catchup function and removed a few tests to reduce the time and > to keep the regression tests reliable. I checked all v65* / v66* differences and found only two trivial comment typos. PSA patches to fix those. Kind Regards, Peter Smith. Fujitsu Australia v66-0001-Feedback-apply_handle_rollback_prepared-typo-in-.patch Description: Binary data v66-0002-Feedback-AlterSubscription-typo-in-comment.patch Description: Binary data
Re: proposal - psql - use pager for \watch command
On Sun, Mar 21, 2021 at 11:43 PM Pavel Stehule wrote: > On Sat, 20 Mar 2021 at 23:45, Thomas Munro > wrote: >> Thoughts? I put my changes into a separate patch for clarity, but >> they need some more tidying up. > > yes, your solution is much better. Hmm, there was a problem with it though: it blocked ^C while running the query, which is bad. I fixed that. I did some polishing of the code and some editing on the documentation and comments. I disabled the feature completely on Windows, because it seems unlikely that we'll be able to know if it even works, in this cycle. - output = PageOutput(158, pager ? &(pset.popt.topt) : NULL); + output = PageOutput(160, pager ? &(pset.popt.topt) : NULL); What is that change for? From cf1d23b92e9a7a97d60abdf4c5a8fbe9d9092788 Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Sat, 20 Mar 2021 21:54:27 +1300 Subject: [PATCH v4] Add PSQL_WATCH_PAGER for psql's \watch command. Allow a pager to be used by the \watch command, which runs a query periodically. It works but isn't very useful with traditional pagers like "less", so use a different environment variable to control it. Pagers that understand the output format of psql, such as the popular open source tool "pspg" (written by the main author of this patch), can use this to show the results in a user-friendly format. Example: PSQL_WATCH_PAGER="pspg --stream". To make \watch react quickly when the user quits the pager or presses ^C, and also to increase the accuracy of its timing and decrease the rate of useless context switches, change the main loop of the \watch command to use sigwait() rather than a sleeping/polling loop, on Unix. Supported on Unix only for now (like pspg).
Author: Pavel Stehule Author: Thomas Munro Discussion: https://postgr.es/m/CAFj8pRBfzUUPz-3gN5oAzto9SDuRSq-TQPfXU_P6h0L7hO%2BEhg%40mail.gmail.com --- doc/src/sgml/ref/psql-ref.sgml | 28 +++ src/bin/psql/command.c | 133 ++--- src/bin/psql/common.c | 11 ++- src/bin/psql/common.h | 2 +- src/bin/psql/help.c| 6 +- src/bin/psql/startup.c | 14 6 files changed, 179 insertions(+), 15 deletions(-) diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml index 01ec9b8b0a..dfa7bf17a9 100644 --- a/doc/src/sgml/ref/psql-ref.sgml +++ b/doc/src/sgml/ref/psql-ref.sgml @@ -2993,6 +2993,16 @@ lo_import 152801 (such as more) is used. + + When using the \watch command to execute a query + repeatedly, the environment variable PSQL_WATCH_PAGER + is used to find the pager program instead, on Unix systems. This is + configured separately because it may confuse traditional pagers, but + can be used to send output to tools that understand + psql's output format (such as + pspg --stream). + + When the pager option is off, the pager program is not used. When the pager option is @@ -4663,6 +4673,24 @@ PSQL_EDITOR_LINENUMBER_ARG='--line ' + +PSQL_WATCH_PAGER + + + + When a query is executed repeatedly with the \watch + command, a pager is not used by default. This behavior can be changed + by setting PSQL_WATCH_PAGER to a pager command, on Unix + systems. The pspg pager (not part of + PostgreSQL but available in many open source + software distributions) can display the output of + \watch if started with the option + --stream. 
+ + + + + PSQLRC
diff --git a/src/bin/psql/command.c b/src/bin/psql/command.c index 838f74..bfdc5a8121 100644 --- a/src/bin/psql/command.c +++ b/src/bin/psql/command.c
@@ -13,6 +13,7 @@
 #include <time.h>
 #ifndef WIN32
 #include <sys/stat.h>			/* for stat() */
+#include <sys/time.h>			/* for setitimer() */
 #include <fcntl.h>				/* open() flags */
 #include <unistd.h>				/* for geteuid(), getpid(), stat() */
 #else
@@ -4787,8 +4788,17 @@ do_watch(PQExpBuffer query_buf, double sleep)
 	const char *strftime_fmt;
 	const char *user_title;
 	char	   *title;
+	const char *pagerprog = NULL;
+	FILE	   *pagerpipe = NULL;
 	int			title_len;
 	int			res = 0;
+#ifndef WIN32
+	sigset_t	sigalrm_sigchld_sigint;
+	sigset_t	sigalrm_sigchld;
+	sigset_t	sigint;
+	struct itimerval interval;
+	bool		done = false;
+#endif

 	if (!query_buf || query_buf->len <= 0)
 	{
@@ -4796,6 +4806,58 @@ do_watch(PQExpBuffer query_buf, double sleep)
 		return false;
 	}

+#ifndef WIN32
+	sigemptyset(&sigalrm_sigchld_sigint);
+	sigaddset(&sigalrm_sigchld_sigint, SIGCHLD);
+	sigaddset(&sigalrm_sigchld_sigint, SIGALRM);
+	sigaddset(&sigalrm_sigchld_sigint, SIGINT);
+
+	sigemptyset(&sigalrm_sigchld);
+	sigaddset(&sigalrm_sigchld, SIGCHLD);
+	sigaddset(&sigalrm_sigchld, SIGALRM);
+
+	sigemptyset(&sigint);
+	sigaddset(&sigint, SIGINT);
+
+	/*
+	 * Block SIGALRM and SIGCHLD before we start the timer and the pager (if
+	 * configured), to avoid races.  sigwait() will receive them.
+	 */
+	sigprocmask(SIG_BLOCK, &sigalrm_sigchld, NULL);
+
Re: pg_upgrade failing for 200+ million Large Objects
On 3/22/21 5:36 PM, Zhihong Yu wrote: Hi, w.r.t. pg_upgrade_improvements.v2.diff. + blobBatchCount = 0; + blobInXact = false; The count and bool flag are always reset in tandem. It seems variable blobInXact is not needed. You are right. I will fix that. Thanks, Jan -- Jan Wieck Principal Database Engineer Amazon Web Services
Re: Allow an alias to be attached directly to a JOIN ... USING
Peter Eisentraut writes: > I think Tom's input on the guts of this patch would be most valuable, > since it intersects a lot with the parse namespace refactoring he did. I really didn't like the way you'd done that :-(. My primary complaint is that any one ParseNamespaceItem can describe only one table alias, but here we have the potential for two aliases associated with the same join: select * from (t1 join t2 using(a) as tu) tx; Admittedly that's not hugely useful since tx hides the tu alias, but it should behave in a sane fashion. (BTW, after reading the SQL spec again along the way to reviewing this, I am wondering if hiding the lower aliases is really what we want; though it may be decades too late to change that.) However, ParseNamespaceItem as it stands needs some help for this. It has a wired-in assumption that p_rte->eref describes the table and column aliases exposed by the nsitem. 0001 below fixes this by creating a separate p_names field in an nsitem. (There are some comments in 0001 referencing JOIN USING aliases, but no actual code for the feature.) That saves one indirection in common code paths, so it's possibly a win on its own. Then 0002 is your patch rebased onto that infrastructure, and with some cleanup of my own. One thing I ran into is that a whole-row Var for the JOIN USING alias did the wrong thing. It should have only the common columns, but we were getting all the join columns in examples such as the row_to_json() test case I added. This is difficult to fix given the existing whole-row Var infrastructure, unless we want to make a separate RTE for the JOIN USING alias, which I think is overkill. What I did about this was to make transformWholeRowRef produce a ROW() construct --- which is something that a whole-row Var for a join would be turned into by the planner anyway. 
I think this is semantically OK since the USING construct has already nailed down the number and types of the join's common columns; there's no prospect of those changing underneath a stored view query. It's slightly ugly because the ROW() construct will be visible in a decompiled view instead of "tu.*" like you wrote originally, but I'm willing to live with that. Speaking of decompiled views, I feel like ruleutils.c could do with a little more work to teach it that these aliases are available. Right now, it resorts to ugly workarounds:

regression=# create table t1 (a int, b int, c int);
CREATE TABLE
regression=# create table t2 (a int, x int, y int);
CREATE TABLE
regression=# create view vvv as select tj.a, t1.b from t1 full join t2 using(a) as tj, t1 as tx;
CREATE VIEW
regression=# \d+ vvv
                           View "public.vvv"
 Column |  Type   | Collation | Nullable | Default | Storage | Description 
--------+---------+-----------+----------+---------+---------+-------------
 a      | integer |           |          |         | plain   | 
 b      | integer |           |          |         | plain   | 
View definition:
 SELECT a,
    t1.b
   FROM t1
     FULL JOIN t2 USING (a) AS tj,
    t1 tx(a_1, b, c);

That's not wrong, but it could likely be done better if ruleutils realized it could use the tj alias to reference the column, instead of having to force unqualified "a" to be a globally unique name. I ran out of steam to look into that, though, and it's probably something that could be improved later. One other cosmetic thing is that this:

regression=# select tu.* from (t1 join t2 using(a) as tu) tx;
ERROR:  missing FROM-clause entry for table "tu"
LINE 1: select tu.* from (t1 join t2 using(a) as tu) tx;
               ^

is a relatively dumb error message, compared to

regression=# select t1.* from (t1 join t2 using(a) as tu) tx;
ERROR:  invalid reference to FROM-clause entry for table "t1"
LINE 1: select t1.* from (t1 join t2 using(a) as tu) tx;
               ^
HINT:  There is an entry for table "t1", but it cannot be referenced from this part of the query.
I didn't look into why that isn't working, but maybe errorMissingRTE needs to trawl all of the ParseNamespaceItems not just the RTEs. Anyway, since these remaining gripes are cosmetic, I'll mark this RFC. regards, tom lane diff --git a/src/backend/parser/parse_clause.c b/src/backend/parser/parse_clause.c index bdf8ec46e2..5dfea46021 100644 --- a/src/backend/parser/parse_clause.c +++ b/src/backend/parser/parse_clause.c @@ -1217,9 +1217,9 @@ transformFromClauseItem(ParseState *pstate, Node *n, * input column numbers more easily. */ l_nscolumns = l_nsitem->p_nscolumns; - l_colnames = l_nsitem->p_rte->eref->colnames; + l_colnames = l_nsitem->p_names->colnames; r_nscolumns = r_nsitem->p_nscolumns; - r_colnames = r_nsitem->p_rte->eref->colnames; + r_colnames = r_nsitem->p_names->colnames; /* * Natural join does not explicitly specify columns; must generate @@ -1469,7 +1469,7 @@ transformFromClauseItem(ParseState *pstate, Node *n, * Now that we know the join
Re: shared-memory based stats collector
Hi, On 2021-03-22 12:02:39 +0900, Kyotaro Horiguchi wrote: > At Mon, 22 Mar 2021 09:55:59 +0900 (JST), Kyotaro Horiguchi > wrote in > > At Thu, 18 Mar 2021 01:47:20 -0700, Andres Freund > > wrote in > > > Hi, > > > > > > On 2021-03-18 16:56:02 +0900, Kyotaro Horiguchi wrote: > > > > At Tue, 16 Mar 2021 10:27:55 +0900 (JST), Kyotaro Horiguchi > > > > wrote in > > > > Rebased and fixed two bugs. The received comments are not yet addressed in this > > > > version. > > > > > > Since I am heavily editing the code, could you submit "functional" > > > changes (as opposed to fixing rebase issues) as incremental patches? > > > > Oh.. please wait for.. a moment, maybe. > > This is that. Thanks! That change shouldn't be necessary on my branch - I did something to fix this kind of problem too. I decided that there's no point in doing hash table lookups for the database: It's not going to change in the life of a backend. So there are now two static "pending" entries: One for the current DB, one for the shared DB. There's only one place that needed to change, pgstat_report_checksum_failures_in_db(), which now reports the changes directly instead of going via pending. I suspect we should actually do that with a number of other DB-specific functions. Things like recovery conflicts, deadlocks, checksum failures imo should really not be delayed till later. And you should never have enough of them to make contention a concern. You can see a somewhat sensible list of changes from your v52 at https://github.com/anarazel/postgres/compare/master...shmstat-before-split-2021-03-22 (I did fix some of the damage from the rebase in a non-incremental way, of course) My branch: https://github.com/anarazel/postgres/tree/shmstat It would be cool if you could check whether there are any relevant things between v52 and v56 that I should include. I think a lot of the concerns I had with the patch are addressed at the end of my series of changes. Please let me know what you think.
My next step is going to be to squash all my changes into the base patch, and try to extract all the things that I think can be independently committed, and to reduce unnecessary diff noise. Once that's done I plan to post that series to the list. TODO: - explain the design at the top of pgstat.c - figure out a way to deal with the different demands on stats consistency / efficiency - see how hard it'd be to not need collect_oids() - split pgstat.c - consider removing PgStatTypes and replacing it with the oid of the table the type of stats reside in. So PGSTAT_TYPE_DB would be DatabaseRelationId, PGSTAT_TYPE_TABLE would be RelationRelationId, ... I think that'd make the system more cleanly extensible going forward? - I'm not yet happy with the naming schemes in use in pgstat.c. I feel like I improved it a bunch, but it's not yet there. - the replication slot stuff isn't quite right in my branch - I still don't quite like the reset_offset stuff - I wonder if we can find something better there. And if not, whether we can deduplicate the code between functions like pgstat_fetch_stat_checkpointer() and pgstat_report_checkpointer(). At the very least it'll need a lot better comments. - bunch of FIXMEs / XXXs Greetings, Andres Freund
Re: [HACKERS] Custom compression methods
On Mon, Mar 22, 2021 at 03:47:58PM -0400, Robert Haas wrote: > On Mon, Mar 22, 2021 at 2:10 PM Tom Lane wrote: > > Robert Haas writes: > > > I think this is significantly cleaner than what we have now, and I > > > also prefer it to your proposal. > > > > +1 in general. However, I suspect that you did not try to compile > > this without --with-lz4, because if you had you'd have noticed the > > other uses of NO_LZ4_SUPPORT() that you broke. I think you need > > to leave that macro where it is. > > You're correct that I hadn't tried this without --with-lz4, but I did > grep for other uses of NO_LZ4_SUPPORT() and found none. I also just > tried it without --with-lz4 just now, and it worked fine. > > > Also, it's not nice for GUC check > > functions to throw ereport(ERROR); we prefer the caller to be able > > to decide if it's a hard error or not. That usage should be using > > GUC_check_errdetail() or a cousin, so it can't share the macro anyway. > > I agree that these are valid points about GUC check functions in > general, but the patch I sent adds 0 GUC check functions and removes > 1, and it didn't do the stuff you describe here anyway. > > Are you sure you're looking at the patch I sent, > toast-compression-guc-rmh.patch? I can't help wondering if you applied > it to a dirty source tree or got the wrong file or something, because > otherwise I don't understand why you're seeing things that I'm not > seeing. I'm guessing Tom read this hunk as being changes to check_default_toast_compression() rather than removing the function ? - * Validate a new value for the default_toast_compression GUC. + * CompressionNameToMethod - Get compression method from compression name + * + * Search in the available built-in methods. If the compression not found + * in the built-in methods then return InvalidCompressionMethod. 
*/ -bool -check_default_toast_compression(char **newval, void **extra, GucSource source) +char +CompressionNameToMethod(const char *compression) { - if (**newval == '\0') + if (strcmp(compression, "pglz") == 0) + return TOAST_PGLZ_COMPRESSION; + else if (strcmp(compression, "lz4") == 0) { - GUC_check_errdetail("%s cannot be empty.", - "default_toast_compression"); - return false; +#ifndef USE_LZ4 + NO_LZ4_SUPPORT(); +#endif + return TOAST_LZ4_COMPRESSION; -- Justin
Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?
On Sat, Mar 20, 2021 at 02:12:34PM +0800, Julien Rouhaud wrote: > On Fri, Mar 19, 2021 at 12:53:18PM -0400, Bruce Momjian wrote: > > > > Well, given we don't really want to support multiple query id types > > being generated or displayed, the "error out" above should fix it. > > > > Let's do this --- tell extensions to error out if the query id is > > already set, either by compute_query_id or another extension. If an > > extension wants to generate its own query id and store it internally to > > the extension, that is fine, but the server-displayed query id should be > > generated once and never overwritten by an extension. > > Agreed, this will ensure that you won't dynamically change the queryid source. > > We should also document that changing it requires a restart and calling > pg_stat_statements_reset() afterwards. > > v19 adds some changes, plus extra documentation for pg_stat_statements about > the requirement for a queryid to be calculated, and a note that all documented > details only apply for the in-core source. I'm not sure if this is still the best > place to document those details anymore though. OK, after reading the entire thread, I don't think there are any remaining open issues with this patch and I think this is ready for committing. I have adjusted the doc section of the patches, attached. I have marked myself as committer in the commitfest app and hope to apply it in the next few days based on feedback. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
>From 5ab783123fc11e948963de7a7c3e6428051e3315 Mon Sep 17 00:00:00 2001 From: Bruce Momjian Date: Mon, 22 Mar 2021 17:43:22 -0400 Subject: [PATCH] qid-01-jumble_over_master squash commit --- .../pg_stat_statements/pg_stat_statements.c | 805 + .../pg_stat_statements.conf | 1 + doc/src/sgml/config.sgml | 25 + doc/src/sgml/pgstatstatements.sgml| 20 +- src/backend/parser/analyze.c | 14 +- src/backend/tcop/postgres.c | 6 +- src/backend/utils/misc/Makefile | 1 + src/backend/utils/misc/guc.c | 10 + src/backend/utils/misc/postgresql.conf.sample | 1 + src/backend/utils/misc/queryjumble.c (new)| 834 ++ src/include/parser/analyze.h | 4 +- src/include/utils/guc.h | 1 + src/include/utils/queryjumble.h (new) | 58 ++ 13 files changed, 995 insertions(+), 785 deletions(-) diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c index 62cccbfa44..bd8c96728c 100644 --- a/contrib/pg_stat_statements/pg_stat_statements.c +++ b/contrib/pg_stat_statements/pg_stat_statements.c @@ -8,24 +8,9 @@ * a shared hashtable. (We track only as many distinct queries as will fit * in the designated amount of shared memory.) * - * As of Postgres 9.2, this module normalizes query entries. Normalization - * is a process whereby similar queries, typically differing only in their - * constants (though the exact rules are somewhat more subtle than that) are - * recognized as equivalent, and are tracked as a single entry. This is - * particularly useful for non-prepared queries. - * - * Normalization is implemented by fingerprinting queries, selectively - * serializing those fields of each query tree's nodes that are judged to be - * essential to the query. This is referred to as a query jumble. This is - * distinct from a regular serialization in that various extraneous - * information is ignored as irrelevant or not essential to the query, such - * as the collations of Vars and, most notably, the values of constants. 
- * - * This jumble is acquired at the end of parse analysis of each query, and - * a 64-bit hash of it is stored into the query's Query.queryId field. - * The server then copies this value around, making it available in plan - * tree(s) generated from the query. The executor can then use this value - * to blame query costs on the proper queryId. + * Starting in Postgres 9.2, this module normalized query entries. As of + * Postgres 14, the normalization is done by the core if compute_query_id is + * enabled, or optionally by third-party modules. * * To facilitate presenting entries to users, we create "representative" query * strings in which constants are replaced with parameter symbols ($n), to @@ -114,8 +99,6 @@ static const uint32 PGSS_PG_MAJOR_VERSION = PG_VERSION_NUM / 100; #define USAGE_DEALLOC_PERCENT 5 /* free this % of entries at once */ #define IS_STICKY(c) ((c.calls[PGSS_PLAN] + c.calls[PGSS_EXEC]) == 0) -#define JUMBLE_SIZE1024 /* query serialization buffer size */ - /* * Extension version number, for supporting older extension versions' objects */ @@ -235,40 +218,6 @@ typedef struct pgssSharedState
Re: pg_upgrade failing for 200+ million Large Objects
> > Hi, > w.r.t. pg_upgrade_improvements.v2.diff. + blobBatchCount = 0; + blobInXact = false; The count and bool flag are always reset in tandem. It seems variable blobInXact is not needed. Cheers
Re: UPDATE ... SET (single_column) = row_constructor is a bit broken from V10 906bfcad7ba7c
On Mon, Mar 22, 2021 at 02:10:49PM +0530, Mahendra Singh Thalor wrote: > Hi Hackers, > > Commit 906bfcad7ba7c has improved handling for "UPDATE ... SET > (column_list) = row_constructor", but it broke some cases that worked prior to this commit. > After this commit, the query “DO UPDATE SET (t1_col)” raises an error where it previously worked fine. See prior discussions: https://www.postgresql.org/message-id/flat/20170719174507.GA19616%40telsasoft.com https://www.postgresql.org/message-id/flat/camjna7cdlzpcs0xnrpkvqmj6vb6g3eh8cygp9zbjxdpfftz...@mail.gmail.com https://www.postgresql.org/message-id/flat/87sh5rs74y@news-spur.riddles.org.uk https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=86182b18957b8f9e8045d55b137aeef7c9af9916 -- Justin
Re: proposal - psql - use pager for \watch command
On Tue, Mar 23, 2021 at 1:53 AM Pavel Stehule wrote:
> On Mon, Mar 22, 2021 at 1:13 PM Thomas Munro wrote:
>> The problem is that Apple's /dev/tty device is defective, and doesn't
>> work in poll(). It always returns immediately with revents=POLLNVAL,
>> but pspg assumes that data is ready and tries to read the keyboard and
>> then blocks until I press a key. This seems to fix it:
>>
>> +#ifndef __APPLE__
>> +       /* macOS can't use poll() on /dev/tty */
>>         state.tty = fopen("/dev/tty", "r+");
>> +#endif
>>         if (!state.tty)
>>                 state.tty = fopen(ttyname(fileno(stdout)), "r");
>
> it is hell.

Heh. I've recently spent many, many hours trying to make AIO work on macOS, and nothing surprises me anymore. BTW I found something from years ago on the 'net that fits with my observation about /dev/tty: https://www.mail-archive.com/bug-gnulib@gnu.org/msg00296.html

Curious, which other OS did you put that fallback case in for? I'm a little confused about why it works, so I'm not sure if it's the best possible change, but I'm not planning to dig any further now, too many patches, not enough time :-)

> Please, can you verify this fix?

It works perfectly for me on a macOS 11.2 system with that change, repainting the screen exactly when it should. I'm happy about that because (1) it means I can confirm that the proposed change to psql is working correctly on the 3 Unixes I have access to, and (2) I am sure that a lot of Mac users will appreciate being able to use super-duper \watch mode when this ships (a high percentage of PostgreSQL users I know use a Mac as their client machine).

>> A minor problem is that on macOS, _GNU_SOURCE doesn't seem to imply
>> NCURSES_WIDECHAR, so I suspect Unicode will be broken unless you
>> manually add -DNCURSES_WIDECHAR=1, though I didn't check.
>
> It is possible -
>
> can you run "pspg --version"

Looks like I misunderstood: it is showing "with wide char support", it's just that the "num" is 0 rather than 1.
I'm not planning to investigate that any further now, but I checked that it can show the output of SELECT 'špeĉiäl chârãçtérs' correctly.
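For readers unfamiliar with the failure mode discussed above: POSIX poll() reports a descriptor it cannot poll by setting POLLNVAL in revents while still returning a positive count, so a caller that treats "poll() returned > 0" as "data is ready" will block exactly as described. A minimal self-contained demonstration (this is not pspg's actual code; it probes a closed fd rather than macOS's /dev/tty, which behaves analogously):

```c
#include <assert.h>
#include <poll.h>

/* Poll a single fd for input with a zero timeout and return its revents.
 * An fd that cannot be polled comes back flagged POLLNVAL, not POLLIN,
 * even though poll() itself returns a positive count. */
short probe_fd(int fd)
{
    struct pollfd pfd;

    pfd.fd = fd;
    pfd.events = POLLIN;
    pfd.revents = 0;
    (void) poll(&pfd, 1, 0);
    return pfd.revents;
}
```

The defensive pattern is to check `revents & POLLNVAL` (and POLLERR/POLLHUP) before assuming readability.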
Re: Since '2001-09-09 01:46:40'::timestamp microseconds are lost when extracting epoch
On 19.03.21 21:06, Tom Lane wrote:
> I guess the immediate question is how much of a performance gap there is
> now between the float and numeric implementations.

Attached are my test script and the full output. To summarize, for cases that don't do any interesting computation and where the overhead is only the data type passing, the difference is like this:

-- old
select date_part('microseconds', current_timestamp + generate_series(0, 1000) * interval '1 second') \g /dev/null
Time: 2760.966 ms (00:02.761)

-- new
select extract(microseconds from current_timestamp + generate_series(0, 1000) * interval '1 second') \g /dev/null
Time: 3178.477 ms (00:03.178)

-- date
select date_part('day', current_date + generate_series(0, 1000) * interval '1 day') \g /dev/null
Time: 2753.082 ms (00:02.753)
select date_part('month', current_date + generate_series(0, 1000) * interval '1 day') \g /dev/null
Time: 2777.257 ms (00:02.777)
select date_part('quarter', current_date + generate_series(0, 1000) * interval '1 day') \g /dev/null
Time: 2788.313 ms (00:02.788)
select date_part('week', current_date + generate_series(0, 1000) * interval '1 day') \g /dev/null
Time: 2842.797 ms (00:02.843)
select date_part('year', current_date + generate_series(0, 1000) * interval '1 day') \g /dev/null
Time: 2848.463 ms (00:02.848)
select date_part('decade', current_date + generate_series(0, 1000) * interval '1 day') \g /dev/null
Time: 2844.086 ms (00:02.844)
select date_part('century', current_date + generate_series(0, 1000) * interval '1 day') \g /dev/null
Time: 2824.671 ms (00:02.825)
select date_part('millennium', current_date + generate_series(0, 1000) * interval '1 day') \g /dev/null
Time: 2846.615 ms (00:02.847)
select date_part('julian', current_date + generate_series(0, 1000) * interval '1 day') \g /dev/null
Time: 3011.378 ms (00:03.011)
select date_part('isoyear', current_date + generate_series(0, 1000) * interval '1 day') \g /dev/null
Time: 3059.134 ms (00:03.059)
select date_part('dow', current_date + generate_series(0, 1000) * interval '1 day') \g /dev/null
Time: 3014.891 ms (00:03.015)
select date_part('isodow', current_date + generate_series(0, 1000) * interval '1 day') \g /dev/null
Time: 3054.004 ms (00:03.054)
select date_part('doy', current_date + generate_series(0, 1000) * interval '1 day') \g /dev/null
Time: 3101.087 ms (00:03.101)
select date_part('epoch', current_date + generate_series(0, 1000) * interval '1 day') \g /dev/null
Time: 3156.491 ms (00:03.156)
select extract(day from current_date + generate_series(0, 1000) * interval '1 day') \g /dev/null
Time: 3303.320 ms (00:03.303)
select extract(month from current_date + generate_series(0, 1000) * interval '1 day') \g /dev/null
Time: 3289.939 ms (00:03.290)
select extract(quarter from current_date + generate_series(0, 1000) * interval '1 day') \g /dev/null
Time: 3301.930 ms (00:03.302)
select extract(week from current_date + generate_series(0, 1000) * interval '1 day') \g /dev/null
Time: 3347.644 ms (00:03.348)
select extract(year from current_date + generate_series(0, 1000) * interval '1 day') \g /dev/null
Time: 3358.538 ms (00:03.359)
select extract(decade from current_date + generate_series(0, 1000) * interval '1 day') \g /dev/null
Time: 3351.724 ms (00:03.352)
select extract(century from current_date + generate_series(0, 1000) * interval '1 day') \g /dev/null
Time: 3282.368 ms (00:03.282)
select extract(millennium from current_date + generate_series(0, 1000) * interval '1 day') \g /dev/null
Time: 3358.242 ms (00:03.358)
select extract(julian from current_date + generate_series(0, 1000) * interval '1 day') \g /dev/null
Time: 5758.773 ms (00:05.759)  FIXME
select extract(isoyear from current_date + generate_series(0, 1000) * interval '1 day') \g /dev/null
Time: 3780.507 ms (00:03.781)
select extract(dow from current_date + generate_series(0, 1000) * interval '1 day') \g /dev/null
Time: 3688.659 ms (00:03.689)
select extract(isodow from current_date + generate_series(0, 1000) * interval '1 day') \g /dev/null
Time: 3716.656 ms (00:03.717)
select extract(doy from current_date + generate_series(0, 1000) * interval '1 day') \g /dev/null
Time: 3636.682 ms (00:03.637)
select extract(epoch from current_date + generate_series(0, 1000) * interval '1 day') \g /dev/null
Time: 6423.341 ms (00:06.423)  FIXME

-- time
select date_part('microseconds', localtime + generate_series(0, 1000) * interval '1 second') \g /dev/null
Time: 2425.741 ms (00:02.426)
select date_part('milliseconds', localtime + generate_series(0, 1000) * interval '1 second') \g /dev/null
Time: 2790.781 ms (00:02.791)
select date_part('second', localtime + generate_series(0, 1000) * interval '1 second') \g /dev/null
Time: 2900.706 ms (00:02.901)
select date_part('minute', localtime + generate_series(0, 1000) * interval '1 second') \g /dev/null
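The two FIXME rows (julian and epoch) are the cases where the numeric path does real division. They are also exactly the cases this thread's subject line is about: once the epoch passes 10^9 seconds ('2001-09-09 01:46:40'), a seconds-plus-microseconds epoch needs 16 significant digits, while float8 text output carries roughly 15, so the final microsecond digit is rounded away. A small sketch of that display problem (illustrative only; this is printf's %.15g, not PostgreSQL's float8out):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Render an epoch-in-seconds the way a ~15-significant-digit float8
 * output would, to show the last microsecond digit being lost once
 * the integer part reaches ten digits (epoch >= 10^9 seconds). */
void epoch_to_text(double epoch_secs, char *buf, size_t buflen)
{
    snprintf(buf, buflen, "%.15g", epoch_secs);
}
```

With a numeric result the full 16 digits survive, which is the motivation for the patch's return-type change.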
Re: [HACKERS] Custom compression methods
On Mon, Mar 22, 2021 at 4:33 PM Robert Haas wrote: > On Mon, Mar 22, 2021 at 1:58 PM Justin Pryzby wrote: > > guc.c should no longer define this as extern: > > default_toast_compression_options > > Fixed. Fixed some more. -- Robert Haas EDB: http://www.enterprisedb.com v3-0001-Tidy-up-more-loose-ends-related-to-configurable-T.patch Description: Binary data
Re: [HACKERS] Custom compression methods
On Mon, Mar 22, 2021 at 1:58 PM Justin Pryzby wrote: > guc.c should no longer define this as extern: > default_toast_compression_options Fixed. > I think you should comment that default_toast_compression is an int as far as > guc.c is concerned, but storing one of the char values of TOAST_*_COMPRESSION Done. > Shouldn't varlena.c pg_column_compression() call GetCompressionMethodName()? > I guess it should already have done that. It has a 0-3 integer, not a char value. > Maybe pg_dump.c can't use those constants, though (?) Hmm, toast_compression.h might actually be safe for frontend code now, or if necessary we could add #ifdef FRONTEND stanzas to make it so. I don't know if that is really this patch's job, but I guess it could be. A couple of other things: - Upon further reflection, I think the NO_LZ4_SUPPORT() message is kinda not great. I'm thinking we should change it to say "LZ4 is not supported by this build" instead of "unsupported LZ4 compression method" and drop the hint and detail. That seems more like how we've handled other such cases. - It is not very nice that the three possible values of attcompression are TOAST_PGLZ_COMPRESSION, TOAST_LZ4_COMPRESSION, and InvalidCompressionMethod. One of those three identifiers looks very little like the other two, and there's no real good reason for that. I think we should try to standardize on something, but I'm not sure what it should be. It would also be nice if these names were more visually distinct from the related but very different enum values TOAST_PGLZ_COMPRESSION_ID and TOAST_LZ4_COMPRESSION_ID. Really, as the comments I added explain, we want to minimize the amount of code that knows about the 0-3 "ID" values, and use the char values whenever we can. -- Robert Haas EDB: http://www.enterprisedb.com v2-0001-Tidy-up-more-loose-ends-related-to-configurable-T.patch Description: Binary data
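The char-vs-ID duality Robert describes can be confined to a pair of conversion routines, so the rest of the code stays in "char-land". This is a hedged sketch: the char values ('p', 'l', '\0') and 0/1/2 IDs mirror my reading of PostgreSQL 14's toast_compression.h, but treat them as assumptions rather than quotes from the patch.

```c
#include <assert.h>

/* Assumed values, modeled on toast_compression.h */
#define TOAST_PGLZ_COMPRESSION   'p'
#define TOAST_LZ4_COMPRESSION    'l'
#define InvalidCompressionMethod '\0'

enum
{
    TOAST_PGLZ_COMPRESSION_ID = 0,
    TOAST_LZ4_COMPRESSION_ID = 1,
    TOAST_INVALID_COMPRESSION_ID = 2
};

/* Keep knowledge of the raw 0-3 IDs inside one conversion routine... */
int CompressionMethodToId(char method)
{
    switch (method)
    {
        case TOAST_PGLZ_COMPRESSION: return TOAST_PGLZ_COMPRESSION_ID;
        case TOAST_LZ4_COMPRESSION:  return TOAST_LZ4_COMPRESSION_ID;
        default:                     return TOAST_INVALID_COMPRESSION_ID;
    }
}

/* ...and its inverse, so callers never touch the IDs directly. */
char CompressionIdToMethod(int id)
{
    switch (id)
    {
        case TOAST_PGLZ_COMPRESSION_ID: return TOAST_PGLZ_COMPRESSION;
        case TOAST_LZ4_COMPRESSION_ID:  return TOAST_LZ4_COMPRESSION;
        default:                        return InvalidCompressionMethod;
    }
}
```

Centralizing the mapping is one way to address the naming asymmetry: only these two functions would ever need to change.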
Re: postgres_fdw: IMPORT FOREIGN SCHEMA ... LIMIT TO (partition)
On Thursday, 2021-01-21 at 15:56 +0100, Matthias van de Meent wrote: > Hi, > > Recently I was trying to copy some of the data of one database to > another through postgres_fdw, and found that it wouldn't import that > partition through IMPORT FOREIGN SCHEMA, even when I explicitly > specified the name of the table that contained the data in the LIMIT > TO clause. > > I realised the reason is that currently, postgres_fdw explicitly > disallows importing foreign partitions. This is a reasonable default > when importing a whole schema, but if I wanted to explicitly import > one of a partitioned tables' partitions, that would now require me to > manually copy the foreign table's definitions through the use of > CREATE FOREIGN TABLE, which is a hassle and prone to mistakes. > Hi, I took a look at this patch. The patch applies on current master. Documentation and adjusted regression tests are included. The regression tests pass without errors. The patch changes IMPORT FOREIGN SCHEMA to explicitly allow partition child tables in the LIMIT TO clause of the IMPORT FOREIGN SCHEMA command by relaxing the checks introduced with commit [1]. The reasons behind [1] are discussed in [2]. So the original behavior this patch wants to change was intentional, and what needs to be discussed here is whether we want to relax it a little. One argument for the original behavior was that it is cleaner to just import the parent automatically, which allows access to the children through the foreign table anyway, and to exclude partition children when querying pg_class. I haven't seen demand for this feature myself, but I could imagine use cases where just a single child or a set of child tables are candidates. For example, users might want to query only specific children and import them on another foreign server.
[1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=f49bcd4ef3e9a75de210357a4d9bbe3e004db956 [2] https://www.postgresql.org/message-id/20170309141531.gd9...@tamriel.snowman.net -- Thanks, Bernd
Re: Autovacuum worker doesn't immediately exit on postmaster death
On Mon, Mar 22, 2021 at 1:48 PM Stephen Frost wrote: > Thanks for that. Attached is just a rebased version with a commit > message added. If there aren't any other concerns, I'll commit this in > the next few days and back-patch it. When it comes to 12 and older, > does anyone want to opine about the wait event to use? I was thinking > PG_WAIT_TIMEOUT or WAIT_EVENT_PG_SLEEP ... I'm not sure if we should back-patch this, but I think if you do you should just add a wait event, rather than using a generic one. -- Robert Haas EDB: http://www.enterprisedb.com
Re: [HACKERS] Custom compression methods
On Mon, Mar 22, 2021 at 2:10 PM Tom Lane wrote: > Robert Haas writes: > > I think this is significantly cleaner than what we have now, and I > > also prefer it to your proposal. > > +1 in general. However, I suspect that you did not try to compile > this without --with-lz4, because if you had you'd have noticed the > other uses of NO_LZ4_SUPPORT() that you broke. I think you need > to leave that macro where it is. You're correct that I hadn't tried this without --with-lz4, but I did grep for other uses of NO_LZ4_SUPPORT() and found none. I also just tried it without --with-lz4 just now, and it worked fine. > Also, it's not nice for GUC check > functions to throw ereport(ERROR); we prefer the caller to be able > to decide if it's a hard error or not. That usage should be using > GUC_check_errdetail() or a cousin, so it can't share the macro anyway. I agree that these are valid points about GUC check functions in general, but the patch I sent adds 0 GUC check functions and removes 1, and it didn't do the stuff you describe here anyway. Are you sure you're looking at the patch I sent, toast-compression-guc-rmh.patch? I can't help wondering if you applied it to a dirty source tree or got the wrong file or something, because otherwise I don't understand why you're seeing things that I'm not seeing. -- Robert Haas EDB: http://www.enterprisedb.com
Re: shared-memory based stats collector
Hi, On 2021-03-19 14:27:38 -0700, Andres Freund wrote: > Yep, it's not even particularly hard to hit: > > S0: CREATE TABLE a_table(); > S0: INSERT INTO a_table(); > S0: disconnect > S1: set a breakpoint to just after the dshash_release_lock(), with an > if objid == a_table_oid > S1: SELECT pg_stat_get_live_tuples('a_table'::regclass); > (will break at above breakpoint, without having incremented the > refcount yet) > S2: DROP TABLE a_table; > S2: VACUUM pg_class; > > At that point S2's call to pgstat_vacuum_stat() will find the shared > stats entry for a_table, delete the entry from the shared hash table, > see that the stats data has a zero refcount, and free it. Once S1 wakes > up it'll use already freed (and potentially since reused) memory. I fixed this by initializing / incrementing the refcount while holding the dshash partition lock. To avoid the potential refcount leak in case the lookup cache insertion fails due to OOM, I changed things so that the lookup cache entry is inserted, not just looked up, earlier. That also avoids needing two hashtable ops in the cache miss case. The price of an empty hashtable entry in the !create case doesn't seem high. Related issue: in delete_current_stats_entry() there's the following comment: /* * Let the referrers drop the entry if any. Refcount won't be decremented * until "dropped" is set true and StatsShmem->gc_count is incremented * later. So we can check refcount to set dropped without holding a lock. * If no one is referring this entry, free it immediately. */ I don't think this explanation is correct. gc_count might have been incremented by another backend, or cleanup_dropped_stats_entries() might run. So the whole bit about refcounts seems wrong. I don't see what prevents a double-free here.
Consider what happens if
S1: cleanup_dropped_stats_entries() does pg_atomic_sub_fetch_u32(&ent->shared->refcount, 1)
S2: delete_current_stats_entry() does pg_atomic_read_u32(&ent->refcount), reading 0
S1: dsa_free(area, ent->dsapointer);
S2: dsa_free(area, pdsa);
World: boom

I think the appropriate fix might be to not have ->dropped (or rather have it just as a crosscheck), and have every non-dropped entry have an extra refcount. When dropping the entry the refcount is dropped, and we can safely free the entry. That way normal paths don't need to check ->dropped at all. Greetings, Andres Freund
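The suggested ownership scheme can be modeled in a toy single-threaded sketch (the real code would use pg_atomic_* operations and dsa_free(); only the invariant is shown here): the hash table itself owns one reference to every live entry, "drop" is just a release of that reference, and whichever release reaches zero frees the memory, so there is no separate dropped flag to race on.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Toy model of a shared stats entry.  refcount starts at 1: the hash
 * table's own reference.  "freed" stands in for dsa_free() so the test
 * can observe that exactly one release frees the entry. */
typedef struct StatsEntry
{
    int  refcount;
    bool freed;
} StatsEntry;

StatsEntry *entry_create(void)
{
    StatsEntry *e = malloc(sizeof *e);
    e->refcount = 1;            /* the table's reference */
    e->freed = false;
    return e;
}

/* A reader pins the entry while using it (under the partition lock). */
void entry_pin(StatsEntry *e)
{
    e->refcount++;
}

/* Both "drop from table" and "reader done" are the same operation:
 * release one reference; the last release frees. */
void entry_release(StatsEntry *e)
{
    if (--e->refcount == 0)
        e->freed = true;        /* dsa_free() in the real code */
}
```

Because the drop path and the reader path converge on a single refcount release, the double-free window in the scenario above disappears.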
Re: Proposal: Save user's original authenticated identity for logging
On Mon, 2021-03-22 at 15:16 +0900, Michael Paquier wrote: > On Fri, Mar 19, 2021 at 06:37:05PM +, Jacob Champion wrote: > > The same effect can be had by moving the log rotation to the top of the > > test that needs it, so I've done it that way in v7. > > After thinking more about 0001, I have come up with an even simpler > solution that has resulted in 11e1577. That's similar to what > PostgresNode::issues_sql_like() does. This also makes 0003 simpler > with its changes as this requires to change two lines in test_access. v8's test_access lost the in-order log search from v7; I've added it back in v9. The increased resistance to entropy seems worth the few extra lines. Thoughts? --Jacob

From 1939c94fe9d21b19a456ce99d03dfe8b4682d5af Mon Sep 17 00:00:00 2001
From: Jacob Champion
Date: Mon, 8 Feb 2021 10:53:20 -0800
Subject: [PATCH v9 1/2] ssl: store client's DN in port->peer_dn

Original patch by Andrew Dunstan:
https://www.postgresql.org/message-id/fd96ae76-a8e3-ef8e-a642-a592f5b76771%40dunslane.net
but I've taken out the clientname=DN functionality; all that will be
needed for the next patch is the DN string.
---
 src/backend/libpq/be-secure-openssl.c | 53 +++
 src/include/libpq/libpq-be.h          |  1 +
 2 files changed, 47 insertions(+), 7 deletions(-)

diff --git a/src/backend/libpq/be-secure-openssl.c b/src/backend/libpq/be-secure-openssl.c
index 5ce3f27855..18321703da 100644
--- a/src/backend/libpq/be-secure-openssl.c
+++ b/src/backend/libpq/be-secure-openssl.c
@@ -551,22 +551,25 @@ aloop:
 	/* Get client certificate, if available. */
 	port->peer = SSL_get_peer_certificate(port->ssl);

-	/* and extract the Common Name from it. */
+	/* and extract the Common Name / Distinguished Name from it. */
 	port->peer_cn = NULL;
+	port->peer_dn = NULL;
 	port->peer_cert_valid = false;
 	if (port->peer != NULL)
 	{
 		int			len;
+		X509_NAME  *x509name = X509_get_subject_name(port->peer);
+		char	   *peer_cn;
+		char	   *peer_dn;
+		BIO		   *bio = NULL;
+		BUF_MEM    *bio_buf = NULL;

-		len = X509_NAME_get_text_by_NID(X509_get_subject_name(port->peer),
-										NID_commonName, NULL, 0);
+		len = X509_NAME_get_text_by_NID(x509name, NID_commonName, NULL, 0);
 		if (len != -1)
 		{
-			char	   *peer_cn;
-
 			peer_cn = MemoryContextAlloc(TopMemoryContext, len + 1);
-			r = X509_NAME_get_text_by_NID(X509_get_subject_name(port->peer),
-										  NID_commonName, peer_cn, len + 1);
+			r = X509_NAME_get_text_by_NID(x509name, NID_commonName, peer_cn,
+										  len + 1);
 			peer_cn[len] = '\0';
 			if (r != len)
 			{
@@ -590,6 +593,36 @@
 			port->peer_cn = peer_cn;
 		}
+
+		bio = BIO_new(BIO_s_mem());
+		if (!bio)
+		{
+			pfree(port->peer_cn);
+			port->peer_cn = NULL;
+			return -1;
+		}
+		/* use commas instead of slashes */
+		X509_NAME_print_ex(bio, x509name, 0, XN_FLAG_RFC2253);
+		BIO_get_mem_ptr(bio, &bio_buf);
+		peer_dn = MemoryContextAlloc(TopMemoryContext, bio_buf->length + 1);
+		memcpy(peer_dn, bio_buf->data, bio_buf->length);
+		peer_dn[bio_buf->length] = '\0';
+		if (bio_buf->length != strlen(peer_dn))
+		{
+			ereport(COMMERROR,
+					(errcode(ERRCODE_PROTOCOL_VIOLATION),
+					 errmsg("SSL certificate's distinguished name contains embedded null")));
+			BIO_free(bio);
+			pfree(peer_dn);
+			pfree(port->peer_cn);
+			port->peer_cn = NULL;
+			return -1;
+		}
+
+		BIO_free(bio);
+
+		port->peer_dn = peer_dn;
+
 		port->peer_cert_valid = true;
 	}

@@ -618,6 +651,12 @@ be_tls_close(Port *port)
 		pfree(port->peer_cn);
 		port->peer_cn = NULL;
 	}
+
+	if (port->peer_dn)
+	{
+		pfree(port->peer_dn);
+		port->peer_dn = NULL;
+	}
 }

 ssize_t
diff --git a/src/include/libpq/libpq-be.h b/src/include/libpq/libpq-be.h
index 30fb4e613d..d970277894 100644
--- a/src/include/libpq/libpq-be.h
+++ b/src/include/libpq/libpq-be.h
@@ -190,6 +190,7 @@ typedef struct Port
 	 */
 	bool		ssl_in_use;
 	char	   *peer_cn;
+	char	   *peer_dn;
 	bool		peer_cert_valid;

 	/*
--
2.25.1

From e315f806dc7895a8c1a51cec4d073c5d3ccea74b Mon Sep 17 00:00:00 2001
From: Jacob Champion
Date: Wed, 3 Feb 2021 11:42:05 -0800
Subject: [PATCH v9 2/2] Log authenticated identity from all auth backends

The "authenticated identity" is the string used by an auth method to
identify a particular user. In many common cases this is the same as the
Postgres username, but for some third-party auth methods, the identifier
in use may be shortened or otherwise translated (e.g. through pg_ident
mappings) before the server stores it.

To help DBAs see who has actually interacted with the system, store the
original identity when authentication succeeds, and expose it via the
log_connections setting. The log entries look something like this example
(where a local user named "pchampion" is connecting to the database as
the "admin" user):

LOG: connection received: host=[local]
LOG: connection authenticated: identity="pchampion" method=peer (/data/pg_hba.conf:88)
LOG: connection
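The patch above rejects certificates whose DN contains an embedded NUL by comparing the length OpenSSL reported against strlen() of the copied bytes. That safety check can be sketched without OpenSSL (the function name and buffer size here are illustrative, not from the patch):

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Return true if a DN buffer of reported_len bytes would be truncated
 * when treated as a C string, i.e. it hides data after an embedded NUL.
 * Mirrors the patch's "bio_buf->length != strlen(peer_dn)" test. */
bool dn_has_embedded_nul(const char *data, size_t reported_len)
{
    char buf[256];

    if (reported_len >= sizeof(buf))
        return true;            /* treat oversize as suspect in this sketch */
    memcpy(buf, data, reported_len);
    buf[reported_len] = '\0';
    return strlen(buf) != reported_len;
}
```

Without the check, a certificate DN like "CN=trusted\0,O=evil" would compare equal to "CN=trusted" anywhere C string functions are used, which is why this class of bug tends to be CVE material.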
Re: Nicer error when connecting to standby with hot_standby=off
On Mon, Mar 22, 2021 at 2:52 PM Fujii Masao wrote: > > > > On 2021/03/23 3:25, James Coleman wrote: > > On Mon, Mar 22, 2021 at 1:49 PM Fujii Masao > > wrote: > >> > >> > >> > >> On 2021/03/19 23:35, James Coleman wrote: > >>> Here's an updated patch; I think I've gotten what Alvaro suggested. > >> > >> Thanks for updating the patch! But I was thinking that our consensus is > >> something like the attached patch. Could you check this patch? > > > > As far as I can tell (I might be missing something) your v5 patch does > > the same thing, albeit with different code organization. It did catch > > though that I'd neglected to add the DETAIL line as separate from the > > errmsg line. > > > > Is the attached (in the style of my v4, since I'm not following why we > > need to move the standby determination logic into a new > > CAC_NOCONSISTENT block) what you're thinking? Or is there something > > else I'm missing? > > I just did that to avoid adding more CAC_state. But basically it's > ok to check hot standby at either canAcceptConnections() or > ProcessStartupPacket(). > > case CAC_STARTUP: > ereport(FATAL, > (errcode(ERRCODE_CANNOT_CONNECT_NOW), > -errmsg("the database system is > starting up"))); > +errmsg("the database system is not > accepting connections"), > +errdetail("Consistent recovery state > has not been yet reached."))); > > Do you want to report this message even in crash recovery? Since crash > recovery is basically not so much related to "consistent recovery state", > at least for me the original message seems more suitable for crash recovery. > > Also if we adopt this message, the server with hot_standby=off reports > "Consistent recovery state has not been yet reached." in PM_STARTUP, > but stops reporting this message at PM_RECOVERY even if the consistent > recovery state has not been reached yet. Instead, it reports "Hot standby > mode is disabled." at PM_RECOVERY. Isn't this transition of message confusing? 
Are you saying we should only change the message for a single case: the case where we'd otherwise allow connections but EnableHotStandby is false? I believe that's what the original patch did, but then Alvaro's proposal included changing additional messages. James Coleman
Re: Nicer error when connecting to standby with hot_standby=off
On 2021/03/23 3:25, James Coleman wrote: On Mon, Mar 22, 2021 at 1:49 PM Fujii Masao wrote: On 2021/03/19 23:35, James Coleman wrote: Here's an updated patch; I think I've gotten what Alvaro suggested. Thanks for updating the patch! But I was thinking that our consensus is something like the attached patch. Could you check this patch? As far as I can tell (I might be missing something) your v5 patch does the same thing, albeit with different code organization. It did catch though that I'd neglected to add the DETAIL line as separate from the errmsg line. Is the attached (in the style of my v4, since I'm not following why we need to move the standby determination logic into a new CAC_NOCONSISTENT block) what you're thinking? Or is there something else I'm missing? I just did that to avoid adding more CAC_state. But basically it's ok to check hot standby at either canAcceptConnections() or ProcessStartupPacket(). case CAC_STARTUP: ereport(FATAL, (errcode(ERRCODE_CANNOT_CONNECT_NOW), -errmsg("the database system is starting up"))); +errmsg("the database system is not accepting connections"), +errdetail("Consistent recovery state has not been yet reached."))); Do you want to report this message even in crash recovery? Since crash recovery is basically not so much related to "consistent recovery state", at least for me the original message seems more suitable for crash recovery. Also if we adopt this message, the server with hot_standby=off reports "Consistent recovery state has not been yet reached." in PM_STARTUP, but stops reporting this message at PM_RECOVERY even if the consistent recovery state has not been reached yet. Instead, it reports "Hot standby mode is disabled." at PM_RECOVERY. Isn't this transition of message confusing? Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
Re: Proposal: Save user's original authenticated identity for logging
On Mon, 2021-03-22 at 18:22 +0100, Magnus Hagander wrote: > On Mon, Mar 22, 2021 at 7:16 AM Michael Paquier wrote: > > > > I have briefly looked at 0002 (0001 in the attached set), and it seems > > sane to me. I still need to look at 0003 (well, now 0002) in detail, > > which is very sensitive as one mistake would likely be a CVE-class > > bug. > > The 0002/0001/whateveritisaftertherebase is tracked over at > https://www.postgresql.org/message-id/flat/92e70110-9273-d93c-5913-0bccb6562740%40dunslane.net > isn't it? I've assumed the expectation is to have that one committed > from that thread, and then rebase using that. I think the primary thing that needs to be greenlit for both is the idea of using the RFC 2253/4514 format for Subject DNs. Other than that, the version here should only contain the changes necessary for both features (that is, port->peer_dn), so there's no hard dependency between the two. It's just on me to make sure my version is up-to-date. Which I believe it is, as of today. --Jacob
Re: [POC] Fast COPY FROM command for the table with foreign partitions
On 5/3/21 21:54, tsunakawa.ta...@fujitsu.com wrote: > I've managed to rebase it, although it took unexpectedly long. The patch is attached. It passes make check against core and postgres_fdw. I'll turn the CF status back to ready for committer shortly. The _() macro in the postgresExecForeignCopy routine: if (PQputCopyEnd(conn, OK ? NULL : _("canceled by server")) <= 0) uses gettext. On Linux it compiles fine because (as I understand it) it uses the standard implementation of gettext: objdump -t contrib/postgres_fdw/postgres_fdw.so | grep 'gettext' gettext@@GLIBC_2.2.5 but on macOS (and maybe elsewhere) we need to link libintl explicitly in the Makefile: SHLIB_LINK += $(filter -lintl, $(LIBS)) Alternatively, we could avoid using gettext at all in this part of the code. -- regards, Andrey Lepikhov Postgres Professional
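The reason this only bites some platforms: _() is the translation macro, which expands to gettext() only when NLS is enabled; glibc ships gettext inside libc, while macOS needs libintl linked in, hence the SHLIB_LINK fix above. A sketch of the macro pattern (simplified from how PostgreSQL's headers arrange it; NLS is assumed off here, so no gettext symbol is referenced at all):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified NLS wiring: with ENABLE_NLS undefined, _() is the identity
 * and the build has no dependency on libintl / gettext. */
#ifdef ENABLE_NLS
#include <libintl.h>
#define _(x) gettext(x)
#else
#define _(x) (x)
#endif

/* Mirror of the call site under discussion: the message handed to
 * PQputCopyEnd() when the copy is being aborted. */
const char *copy_end_message(int ok)
{
    return ok ? NULL : _("canceled by server");
}
```

This also shows why "don't use gettext in this spot" is a real alternative: dropping _() removes the libintl dependency entirely for the module.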
Re: New IndexAM API controlling index vacuum strategies
On Mon, Mar 22, 2021 at 7:05 AM Robert Haas wrote: > I agree. I was having trouble before understanding exactly what you > are proposing, but this makes sense to me and I agree it's a good > idea. Our ambition is for this to be one big multi-release umbrella project, with several individual enhancements that each deliver a user-visible benefit on their own. The fact that we're talking about a few things at once is confusing, but I think that you need a "grand bargain" kind of discussion for this. I believe that it actually makes sense to do it that way, difficult though it may be. Sometimes the goal is to improve performance, other times the goal is to improve robustness. Although the distinction gets blurry at the margins. If VACUUM was infinitely fast (say because of sorcery), then performance would be *unbeatable* -- plus we'd never have to worry about anti-wraparound vacuums not completing in time! > I'm not 100% sure whether we need a new GUC for this or not. I think > that if by default this triggers at the 90% of the hard-shutdown > limit, it would be unlikely, and perhaps unreasonable, for users to > want to raise the limit. However, I wonder whether some users will > want to lower the limit. Would it be reasonable for someone to want to > trigger this at 50% or 70% of XID exhaustion rather than waiting until > things get really bad? I wanted to avoid inventing a GUC for the mechanism in the third patch (not the emergency mechanism we're discussing right now, which was posted separately by Masahiko). I think that a GUC to control skipping index vacuuming purely because there are very few index tuples to delete in indexes will become a burden before long. In particular, we should eventually be able to vacuum some indexes but not others (on the same table) based on the local needs of each index.
As I keep pointing out, bottom-up index deletion has created a situation where there can be dramatically different needs among indexes on the same table -- it can literally prevent 100% of all page splits from version churn in those indexes that are never subject to logical changes from non-HOT updates. Whereas bottom-up index deletion does nothing for any index that is logically updated, for the obvious reason -- there is now frequently a sharp qualitative difference among indexes that vacuumlazy.c currently imagines have basically the same needs. Vacuuming these remaining indexes is a cost that users will actually understand and accept, too. But that has nothing to do with the emergency mechanism we're talking about right now. I actually like your idea of making the emergency mechanism a GUC. It's equivalent to index_cleanup, except that it is continuous and dynamic (not discrete and static). The fact that this GUC expresses what VACUUM should do in terms of the age of the target table's current relfrozenxid age (and nothing else) seems like exactly the right thing. As I said before: What else could possibly matter? So I don't see any risk of such a GUC becoming a burden. I also think that it's a useful knob to be able to tune. It's also going to help a lot with testing the feature. So let's have a GUC for that.
If, with your mechanism, there's an abrupt point when we > switch from never doing this for any table to always doing this for > every table, that might not be as good as something which does this > "sometimes" and then, if that isn't enough to avoid disaster, does it > "more," and eventually ramps up to doing it always, if trouble > continues. I don't know whether that's possible here, or what it would > look like, or even whether it's appropriate at all in this particular > case, so I just offer it as food for thought. That is exactly the kind of thing that some future highly evolved version of the broader incremental/dynamic VACUUM design might do. Your thoughts about the effective delay/throttling lessening as conditions change is in line with the stuff that we're thinking of doing. Though I don't believe Masahiko and I have talked about the delay stuff specifically in any of our private discussions about it. I am a big believer in the idea that we should have a variety of strategies that are applied incrementally and dynamically, in response to an immediate local need (say at the index level). VACUUM should be able to organically figure out the best strategy (or combination of strategies) itself, over time. Sometimes it will be very important to recognize that most indexes have been able to use techniques like bottom-up index deletion, and so really don't need to be vacuumed at all. Other times the cost delay
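The "trigger at a fraction of the hard limit" idea being discussed reduces to a one-line predicate. A hedged sketch with illustrative numbers (the actual stop limit lives elsewhere in the server and the default fraction is whatever the patch settles on; both are assumptions here):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative stand-in for the hard xidStopLimit-style shutdown
 * threshold; the real value is computed by the server, not a constant. */
#define WRAPAROUND_STOP_LIMIT 2000000000U

/* Fire the emergency mode (skip index vacuuming, ignore cost delays)
 * once the table's relfrozenxid age exceeds trigger_fraction of the
 * hard limit.  trigger_fraction would be the proposed GUC: 0.9 as the
 * discussed default, lowerable to e.g. 0.5 or 0.7 by cautious users. */
bool vacuum_failsafe_triggered(uint32_t relfrozenxid_age,
                               double trigger_fraction)
{
    return (double) relfrozenxid_age >
           trigger_fraction * (double) WRAPAROUND_STOP_LIMIT;
}
```

Because the predicate is checked continuously during a vacuum, the behavior is "continuous and dynamic" in exactly the sense described above: lowering the GUC just moves the same threshold earlier.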
Re: Nicer error when connecting to standby with hot_standby=off
On Mon, Mar 22, 2021 at 1:49 PM Fujii Masao wrote: > > > > On 2021/03/19 23:35, James Coleman wrote: > > Here's an updated patch; I think I've gotten what Alvaro suggested. > > Thanks for updating the patch! But I was thinking that our consensus is > something like the attached patch. Could you check this patch? As far as I can tell (I might be missing something) your v5 patch does the same thing, albeit with different code organization. It did catch, though, that I'd neglected to add the DETAIL line as separate from the errmsg line. Is the attached (in the style of my v4, since I'm not following why we need to move the standby determination logic into a new CAC_NOCONSISTENT block) what you're thinking? Or is there something else I'm missing? Thanks, James v6-0001-Improve-standby-connection-denied-error-message.patch Description: Binary data
Re: [HACKERS] Custom compression methods
Robert Haas writes: > I think this is significantly cleaner than what we have now, and I > also prefer it to your proposal. +1 in general. However, I suspect that you did not try to compile this without --with-lz4, because if you had you'd have noticed the other uses of NO_LZ4_SUPPORT() that you broke. I think you need to leave that macro where it is. Also, it's not nice for GUC check functions to throw ereport(ERROR); we prefer the caller to be able to decide if it's a hard error or not. That usage should be using GUC_check_errdetail() or a cousin, so it can't share the macro anyway. regards, tom lane
Re: Fix pg_upgrade to preserve datdba
On 3/21/21 3:56 PM, Tom Lane wrote: Jan Wieck writes: So let's focus on the actual problem of running out of XIDs and memory while doing the upgrade involving millions of small large objects. Right. So as far as --single-transaction vs. --create goes, that's mostly a definitional problem. As long as the contents of a DB are restored in one transaction, it's not gonna matter if we eat one or two more XIDs while creating the DB itself. So we could either relax pg_restore's complaint, or invent a different switch that's named to acknowledge that it's not really only one transaction. That still leaves us with the lots-o-locks problem. However, once we've crossed the Rubicon of "it's not really only one transaction", you could imagine that the switch is "--fewer-transactions", and the idea is for pg_restore to commit after every (say) 10 operations. That would both bound its lock requirements and greatly cut its XID consumption. It leaves us with three things. 1) tremendous amounts of locks 2) tremendous amounts of memory needed 3) taking forever because it is single threaded. I created a pathological case here on a VM with 24GB of RAM, 80GB of SWAP sitting on NVME. The database has 20 million large objects, each of which has 2 GRANTS, 1 COMMENT and 1 SECURITY LABEL (dummy). Each LO only contains a string "large object ", so the whole database in 9.5 is about 15GB in size. A stock pg_upgrade to version 14devel using --link takes about 15 hours. This is partly because the pg_dump and pg_restore both grow to something like 50GB+ to hold the TOC. Which sounds out of touch considering that the entire system catalog on disk is less than 15GB. But aside from the ridiculous amount of swapping, the whole thing also suffers from consuming about 80 million transactions and apparently having just as many network round trips with a single client. 
The work you described sounded like it could fit into that paradigm, with the additional ability to run some parallel restore tasks that are each consuming a bounded number of locks. I have attached a POC patch that implements two new options for pg_upgrade. --restore-jobs=NUM --jobs parameter passed to pg_restore --restore-blob-batch-size=NUM number of blobs restored in one xact It does a bit more than just that. It rearranges the way large objects are dumped so that most of the commands are all in one TOC entry and the entry is emitted into SECTION_DATA when in binary upgrade mode (which guarantees that there isn't any actual BLOB data in the dump). This greatly reduces the number of network round trips and when using 8 parallel restore jobs, almost saturates the 4-core VM. Reducing the number of TOC entries also reduces the total virtual memory need of pg_restore to 15G, so there is a lot less swapping going on. It cuts down the pg_upgrade time from 15 hours to 1.5 hours. In that run I used --restore-jobs=8 and --restore-blob-batch-size=1 (with a max_locks_per_transaction=12000). As said, this isn't a "one size fits all" solution. The pg_upgrade parameters for --jobs and --restore-jobs will really depend on the situation. Hundreds of small databases want --jobs, but one database with millions of large objects wants --restore-jobs. Regards, Jan -- Jan Wieck Principal Database Engineer Amazon Web Services diff --git a/src/bin/pg_dump/parallel.c b/src/bin/pg_dump/parallel.c index c7351a4..4a611d0 100644 --- a/src/bin/pg_dump/parallel.c +++ b/src/bin/pg_dump/parallel.c @@ -864,6 +864,11 @@ RunWorker(ArchiveHandle *AH, ParallelSlot *slot) WaitForCommands(AH, pipefd); /* + * Close the BLOB batch transaction, if one is still open. + */ + CommitBlobTransaction((Archive *)AH); + + /* * Disconnect from database and clean up. 
*/ set_cancel_slot_archive(slot, NULL); diff --git a/src/bin/pg_dump/pg_backup.h b/src/bin/pg_dump/pg_backup.h index 0296b9b..cd8a590 100644 --- a/src/bin/pg_dump/pg_backup.h +++ b/src/bin/pg_dump/pg_backup.h @@ -203,6 +203,8 @@ typedef struct Archive int numWorkers; /* number of parallel processes */ char *sync_snapshot_id; /* sync snapshot id for parallel operation */ + int blobBatchSize; /* # of blobs to restore per transaction */ + /* info needed for string escaping */ int encoding; /* libpq code for client_encoding */ bool std_strings; /* standard_conforming_strings */ @@ -269,6 +271,7 @@ extern void WriteData(Archive *AH, const void *data, size_t dLen); extern int StartBlob(Archive *AH, Oid oid); extern int EndBlob(Archive *AH, Oid oid); +extern void CommitBlobTransaction(Archive *AH); extern void CloseArchive(Archive *AH); extern void SetArchiveOptions(Archive *AH, DumpOptions *dopt, RestoreOptions *ropt); diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c index 1f82c64..51a862a 100644 --- a/src/bin/pg_dump/pg_backup_archiver.c +++ b/src/bin/pg_dump/pg_backup_archiver.c @@ -68,6 +68,8 @@ typedef struct
Re: [HACKERS] Custom compression methods
On Mon, Mar 22, 2021 at 01:38:36PM -0400, Robert Haas wrote: > On Mon, Mar 22, 2021 at 12:16 PM Robert Haas wrote: > > But, what about giving the default_toast_compression_method GUC an > > assign hook that sets a global variable of type "char" to the > > appropriate value? Then GetDefaultToastCompression() goes away > > entirely. That might be worth exploring. > > Actually, we can do even better. We should just make the values > actually assigned to the GUC be TOAST_PGLZ_COMPRESSION etc. rather > than TOAST_PGLZ_COMPRESSION_ID etc. Then a whole lot of complexity > just goes away. I added some comments explaining why using > TOAST_PGLZ_COMPRESSION is the wrong thing anyway. Then I got hacking > and rearranged a few other things. So the attached patch does these > thing: > > - Changes default_toast_compression to an enum, as in your patch, but > now with values that are the same as what ultimately gets stored in > attcompression. > - Adds a comment warning against incautious use of > TOAST_PGLZ_COMPRESSION_ID, etc. > - Moves default_toast_compression_options to guc.c. > - After doing the above two things, we can remove the #include of > utils/guc.h into access/toast_compression.h, so the patch does that. > - Moves NO_LZ4_SUPPORT, GetCompressionMethodName, and > CompressionNameToMethod to guc.c. Making these inline functions > doesn't save anything meaningful; it's more important not to export a > bunch of random identifiers. > - Removes an unnecessary cast to bool from the definition of > CompressionMethodIsValid. > > I think this is significantly cleaner than what we have now, and I > also prefer it to your proposal. +1 guc.c should no longer define this as extern: default_toast_compression_options I think you should comment that default_toast_compression is an int as far as guc.c is concerned, but stores one of the char values of TOAST_*_COMPRESSION. Shouldn't varlena.c pg_column_compression() call GetCompressionMethodName()? 
I guess it should already have done that. Maybe pg_dump.c can't use those constants, though (?) -- Justin
Re: Nicer error when connecting to standby with hot_standby=off
On 2021/03/19 23:35, James Coleman wrote: Here's an updated patch; I think I've gotten what Alvaro suggested. Thanks for updating the patch! But I was thinking that our consensus is something like the attached patch. Could you check this patch? Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c index ef0be4ca38..a800568d14 100644 --- a/src/backend/postmaster/postmaster.c +++ b/src/backend/postmaster/postmaster.c @@ -2294,6 +2294,19 @@ retry1: (errcode(ERRCODE_CANNOT_CONNECT_NOW), errmsg("the database system is starting up"))); break; + case CAC_NOCONSISTENT: + if (EnableHotStandby) + ereport(FATAL, + (errcode(ERRCODE_CANNOT_CONNECT_NOW), +errmsg("the database system is not accepting connections"), +errdetail("Consistent recovery state has not been yet reached."))); + else + ereport(FATAL, + (errcode(ERRCODE_CANNOT_CONNECT_NOW), +errmsg("the database system is not accepting connections"), +errdetail("Hot standby mode is disabled."))); + break; + case CAC_SHUTDOWN: ereport(FATAL, (errcode(ERRCODE_CANNOT_CONNECT_NOW), @@ -2435,10 +2448,11 @@ canAcceptConnections(int backend_type) { if (Shutdown > NoShutdown) return CAC_SHUTDOWN;/* shutdown is pending */ - else if (!FatalError && -(pmState == PM_STARTUP || - pmState == PM_RECOVERY)) + else if (!FatalError && pmState == PM_STARTUP) return CAC_STARTUP; /* normal startup */ + else if (!FatalError && pmState == PM_RECOVERY) + return CAC_NOCONSISTENT;/* consistent recovery state has not + * been yet reached */ else return CAC_RECOVERY;/* else must be crash recovery */ } diff --git a/src/include/libpq/libpq-be.h b/src/include/libpq/libpq-be.h index 30fb4e613d..299528255c 100644 --- a/src/include/libpq/libpq-be.h +++ b/src/include/libpq/libpq-be.h @@ -70,7 +70,12 @@ typedef struct typedef enum CAC_state { - CAC_OK, CAC_STARTUP, CAC_SHUTDOWN, CAC_RECOVERY, CAC_TOOMANY, + 
CAC_OK, + CAC_STARTUP, + CAC_NOCONSISTENT, + CAC_SHUTDOWN, + CAC_RECOVERY, + CAC_TOOMANY, CAC_SUPERUSER } CAC_state;
Re: Autovacuum worker doesn't immediately exit on postmaster death
Greetings, * Thomas Munro (thomas.mu...@gmail.com) wrote: > On Fri, Dec 11, 2020 at 7:57 AM Stephen Frost wrote: > > * Tom Lane (t...@sss.pgh.pa.us) wrote: > > > The if-we're-going-to-delay-anyway path in vacuum_delay_point seems > > > OK to add a touch more overhead to, though. > > > > Alright, for this part at least, seems like it'd be something like the > > attached. > > > > Only lightly tested, but does seem to address the specific example which > > was brought up on this thread. > > > > Thoughts..? > > +1 Thanks for that. Attached is just a rebased version with a commit message added. If there aren't any other concerns, I'll commit this in the next few days and back-patch it. When it comes to 12 and older, does anyone want to opine about the wait event to use? I was thinking PG_WAIT_TIMEOUT or WAIT_EVENT_PG_SLEEP ... Or do folks think this shouldn't be backpatched? That would mean it wouldn't help anyone for years, which would be pretty unfortunate, hence my feeling that it's worthwhile to backpatch. Thanks! Stephen From 9daf52b78d106c86e038dcefdb1d8345d22b9756 Mon Sep 17 00:00:00 2001 From: Stephen Frost Date: Mon, 22 Mar 2021 13:25:57 -0400 Subject: [PATCH] Use a WaitLatch for vacuum/autovacuum sleeping Instead of using pg_usleep() in vacuum_delay_point(), use a WaitLatch. This has the advantage that we will realize if the postmaster has been killed since the last time we decided to sleep while vacuuming. 
Discussion: https://postgr.es/m/CAFh8B=kcdk8k-Y21RfXPu5dX=bgPqJ8TC3p_qxR_ygdBS=j...@mail.gmail.com Backpatch: 9.6- --- src/backend/commands/vacuum.c | 8 +--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c index c064352e23..662aff04b4 100644 --- a/src/backend/commands/vacuum.c +++ b/src/backend/commands/vacuum.c @@ -2080,9 +2080,11 @@ vacuum_delay_point(void) if (msec > VacuumCostDelay * 4) msec = VacuumCostDelay * 4; - pgstat_report_wait_start(WAIT_EVENT_VACUUM_DELAY); - pg_usleep((long) (msec * 1000)); - pgstat_report_wait_end(); + (void) WaitLatch(MyLatch, + WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH, + msec, + WAIT_EVENT_VACUUM_DELAY); + ResetLatch(MyLatch); VacuumCostBalance = 0; -- 2.27.0 signature.asc Description: PGP signature
Re: [HACKERS] Custom compression methods
On Mon, Mar 22, 2021 at 12:16 PM Robert Haas wrote: > But, what about giving the default_toast_compression_method GUC an > assign hook that sets a global variable of type "char" to the > appropriate value? Then GetDefaultToastCompression() goes away > entirely. That might be worth exploring. Actually, we can do even better. We should just make the values actually assigned to the GUC be TOAST_PGLZ_COMPRESSION etc. rather than TOAST_PGLZ_COMPRESSION_ID etc. Then a whole lot of complexity just goes away. I added some comments explaining why using TOAST_PGLZ_COMPRESSION is the wrong thing anyway. Then I got hacking and rearranged a few other things. So the attached patch does these things: - Changes default_toast_compression to an enum, as in your patch, but now with values that are the same as what ultimately gets stored in attcompression. - Adds a comment warning against incautious use of TOAST_PGLZ_COMPRESSION_ID, etc. - Moves default_toast_compression_options to guc.c. - After doing the above two things, we can remove the #include of utils/guc.h into access/toast_compression.h, so the patch does that. - Moves NO_LZ4_SUPPORT, GetCompressionMethodName, and CompressionNameToMethod to guc.c. Making these inline functions doesn't save anything meaningful; it's more important not to export a bunch of random identifiers. - Removes an unnecessary cast to bool from the definition of CompressionMethodIsValid. I think this is significantly cleaner than what we have now, and I also prefer it to your proposal. Comments? -- Robert Haas EDB: http://www.enterprisedb.com toast-compression-guc-rmh.patch Description: Binary data
Re: Wired if-statement in gen_partprune_steps_internal
Should the status of this patch be updated to ready for committer to get in line for the Pg 14 deadline? *Ryan Lambert* On Sun, Mar 7, 2021 at 11:38 PM Amit Langote wrote: > On Fri, Mar 5, 2021 at 7:50 AM Ryan Lambert > wrote: > > On Wed, Mar 3, 2021 at 11:03 PM Amit Langote > wrote: > >> > >> Sorry, this seems to have totally slipped my mind. > >> > >> Attached is the patch I had promised. > >> > >> Also, I have updated the title of the CF entry to "Some cosmetic > >> improvements of partition pruning code", which I think is more > >> appropriate. > > > > Thank you. The updated patch passes installcheck-world. I ran a > handful of test queries with a small number of partitions and observed the > same plans before and after the patch. I cannot speak to the quality of the > code, though am happy to test any additional use cases that should be > verified. > > Thanks Ryan. > > There's no need to test it extensively, because no functionality is > changed with this patch. > > -- > Amit Langote > EDB: http://www.enterprisedb.com >
Re: Proposal: Save user's original authenticated identity for logging
On Mon, Mar 22, 2021 at 7:16 AM Michael Paquier wrote: > > On Fri, Mar 19, 2021 at 06:37:05PM +, Jacob Champion wrote: > > The same effect can be had by moving the log rotation to the top of the > > test that needs it, so I've done it that way in v7. > > After thinking more about 0001, I have come up with an even simpler > solution that has resulted in 11e1577. That's similar to what > PostgresNode::issues_sql_like() does. This also makes 0003 simpler > with its changes as this requires to change two lines in test_access. Man that renumbering threw me off :) > > Turns out it's easy now to have our cake and eat it too; a single if > > statement can implement the same search-forward functionality that was > > spread across multiple places before. So I've done that too. > > I have briefly looked at 0002 (0001 in the attached set), and it seems > sane to me. I still need to look at 0003 (well, now 0002) in details, > which is very sensible as one mistake would likely be a CVE-class > bug. The 0002/0001/whateveritisaftertherebase is tracked over at https://www.postgresql.org/message-id/flat/92e70110-9273-d93c-5913-0bccb6562...@dunslane.net isn't it? I've assumed the expectation is to have that one committed from that thread, and then rebase using that. -- Magnus Hagander Me: https://www.hagander.net/ Work: https://www.redpill-linpro.com/
Re: Disable WAL logging to speed up data loading
Greetings, * Laurenz Albe (laurenz.a...@cybertec.at) wrote: > On Mon, 2021-03-22 at 11:05 -0400, Stephen Frost wrote: > > > Perhaps allowing to set unlogged tables to logged ones without writing WAL > > > is a more elegant way to do that, but I cannot see how that would be any > > > more crash safe than this patch. Besides, the feature doesn't exist yet. > > > > I'm not suggesting it's somehow more crash safe- but it's at least very > > clear what happens in such a case, to wit: the entire table is cleared > > on crash recovery. > > I didn't look at the patch, but are you saying that after changing the > table to LOGGED, it somehow remembers that it is not crash safe and is > emptied if there is a crash before the next checkpoint? I'm not sure where the confusion is, but certainly once the table has been turned into LOGGED and that's been committed then it should be crash safe, even under 'minimal' with the optimization to avoid actually copying the entire table into the WAL. I don't see any particular reason why that isn't possible to do. Under the proposed 'none', you basically have to throw out the entire cluster on a crash, all because you don't want to use 'UNLOGGED' when you created the tables you want to load data into, or 'TRUNCATE' them in the transaction where you start the data load, either of which gives us enough indication and which we have infrastructure around dealing with in the event of a crash during the load without everything else having to be tossed and everything restored from a backup. That's both a better user experience from the perspective of having fewer WAL levels to understand and from just a general administration perspective so you don't have to go all the way back to a backup to bring the system back up. Thanks, Stephen signature.asc Description: PGP signature
Re: Protect syscache from bloating with negative cache entries
On Thu, Jan 28, 2021 at 05:16:52PM +0900, Kyotaro Horiguchi wrote: > At Thu, 28 Jan 2021 16:50:44 +0900 (JST), Kyotaro Horiguchi > wrote in > > I was going to write in the doc something like "you can inspect memory > > consumption by catalog caches using pg_backend_memory_contexts", but > > all the memory used by catalog cache is in CacheMemoryContext. Is it > > sensible for each catalog cache to have their own contexts? > > Something like this. Is this feature not going to make it into PG 14? It first appeared in the January, 2017 commitfest: https://commitfest.postgresql.org/32/931/ -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Re: Change default of checkpoint_completion_target
Greetings, * David Steele (da...@pgmasters.net) wrote: > I had a look at the patch and the change and new documentation seem sensible > to me. Thanks! > I think this phrase may be a bit too idiomatic: > > +consistent I/O load while also leaving some slop for checkpoint > > Perhaps just: > > +consistent I/O load while also leaving some time for checkpoint Yeah, good thought, updated. > It seems to me that the discussion about changing the wording for GUCs not > changeable after server start should be saved for another patch as long as this > patch follows the current convention. Agreed. Unless there's anything further on this, I'll plan to commit it tomorrow or Wednesday. Thanks! Stephen From 3ebe08dee4b9dfe2dff51fd1bad2eb36834e82ed Mon Sep 17 00:00:00 2001 From: Stephen Frost Date: Tue, 19 Jan 2021 13:53:34 -0500 Subject: [PATCH] Change the default of checkpoint_completion_target to 0.9 Common recommendations are that the checkpoint should be spread out as much as possible, provided we avoid having it take too long. This change updates the default to 0.9 (from 0.5) to match that recommendation. There was some debate about possibly removing the option entirely but it seems there may be some corner-cases where having it set much lower to try to force the checkpoint to be as fast as possible could result in fewer periods of time of reduced performance due to kernel flushing. General agreement is that the "spread more" is the preferred approach though and those who need to tune away from that value are much less common. 
Reviewed-By: Michael Paquier, Peter Eisentraut, Tom Lane, David Steele Discussion: https://postgr.es/m/20201207175329.GM16415%40tamriel.snowman.net --- doc/src/sgml/config.sgml | 12 ++-- doc/src/sgml/wal.sgml | 29 --- src/backend/utils/misc/guc.c | 2 +- src/backend/utils/misc/postgresql.conf.sample | 2 +- src/test/recovery/t/015_promotion_pages.pl| 1 - 5 files changed, 29 insertions(+), 17 deletions(-) diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 5679b40dd5..44763f0180 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -3302,9 +3302,15 @@ include_dir 'conf.d' Specifies the target of checkpoint completion, as a fraction of -total time between checkpoints. The default is 0.5. -This parameter can only be set in the postgresql.conf -file or on the server command line. +total time between checkpoints. The default is 0.9, which spreads the +checkpoint across almost all of the available interval, providing fairly +consistent I/O load while also leaving some time for checkpoint +completion overhead. Reducing this parameter is not recommended as that +causes the I/O from the checkpoint to have to complete faster, resulting +in a higher I/O rate, while then having a period of less I/O between the +completion of the checkpoint and the start of the next scheduled +checkpoint. This parameter can only be set in the +postgresql.conf file or on the server command line. diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml index ae4a3c1293..4354051c7b 100644 --- a/doc/src/sgml/wal.sgml +++ b/doc/src/sgml/wal.sgml @@ -571,22 +571,29 @@ writing dirty buffers during a checkpoint is spread over a period of time. That period is controlled by , which is - given as a fraction of the checkpoint interval. + given as a fraction of the checkpoint interval (configured by using + checkpoint_timeout). 
The I/O rate is adjusted so that the checkpoint finishes when the given fraction of checkpoint_timeout seconds have elapsed, or before max_wal_size is exceeded, whichever is sooner. - With the default value of 0.5, + With the default value of 0.9, PostgreSQL can be expected to complete each checkpoint - in about half the time before the next checkpoint starts. On a system - that's very close to maximum I/O throughput during normal operation, - you might want to increase checkpoint_completion_target - to reduce the I/O load from checkpoints. The disadvantage of this is that - prolonging checkpoints affects recovery time, because more WAL segments - will need to be kept around for possible use in recovery. Although - checkpoint_completion_target can be set as high as 1.0, - it is best to keep it less than that (perhaps 0.9 at most) since - checkpoints include some other activities besides writing dirty buffers. + a bit before the next scheduled checkpoint (at around 90% of the last checkpoint's + duration). This spreads out the I/O as much as possible to have the I/O load be + consistent during the checkpoint. The disadvantage of this is that prolonging + checkpoints affects recovery time, because more WAL segments will need to be kept +
Re: Disable WAL logging to speed up data loading
On Mon, 2021-03-22 at 11:05 -0400, Stephen Frost wrote: > > Perhaps allowing to set unlogged tables to logged ones without writing WAL > > is a more elegant way to do that, but I cannot see how that would be any > > more crash safe than this patch. Besides, the feature doesn't exist yet. > > I'm not suggesting it's somehow more crash safe- but it's at least very > clear what happens in such a case, to wit: the entire table is cleared > on crash recovery. I didn't look at the patch, but are you saying that after changing the table to LOGGED, it somehow remembers that it is not crash safe and is emptied if there is a crash before the next checkpoint? Wouldn't that cause corruption if you restore from an earlier backup? At least with the feature in this thread we'd get an error on recovery. Yours, Laurenz Albe
Re: Add docs stub for recovery.conf
Greetings, * Craig Ringer (craig.rin...@enterprisedb.com) wrote: > Pretty good to me. Thanks so much for your help and support with this. Thanks for helping me move it forward! > Index entries render as e.g. > > pg_xlogdump, The pg_xlogdump command > (see also pg_waldump) > > whereas with the obsolete subhead they would render as something like: > > obsolete, Obsolete or renamed features, settings and files > pg_xlogdump, The pg_xlogdump command > > The see also spelling is much easier to find in the index but doesn't make > it as obvious that it's obsoleted/replaced. > > A look at the docbook docs suggests we should use <see> not <seealso> for > these. > > A quick > > sed -i -e 's/<seealso>/<see>/g' -e 's/<\/seealso>/<\/see>/g' > doc/src/sgml/appendix-obsolete* > > causes them to render much better: > > pg_receivexlog, The pg_receivexlog command (see pg_receivewal) > > It might be worth changing the <title>s too, so I've done so in the attached. The terms now render as: > > pg_receivexlog, pg_receivexlog renamed to pg_recievewal (see > pg_receivewal) > > which is good enough in my opinion. The duplication is messy but an > expected artifact of index generation. I don't see any docbook > attribute that lets you suppress insertion of the <title> of the section > containing the <indexterm>, and it's not worth fiddling to try to eliminate > it with structural hacks. The attached changes the titles, changes <seealso> to <see>, and also > updates the comments in the obsolete entries SGML docs to specify that the > id must be unchanged + give a recommended index term format. Awesome, attached is just a rebase (not that anything really changed). Unless someone wants to speak up, I'll commit this soonish (hopefully tomorrow, but at least sometime later this week). Thanks! 
Stephen From 000cd577d6dbb9d6d6c571e2302657f1252e6a56 Mon Sep 17 00:00:00 2001 From: Stephen Frost Date: Mon, 22 Mar 2021 12:45:41 -0400 Subject: [PATCH] Add a docs section for obsoleted and renamed functions and settings The new appendix groups information on renamed or removed settings, commands, etc into an out-of-the-way part of the docs. The original id elements are retained in each subsection to ensure that the same filenames are produced for HTML docs. This prevents /current/ links on the web from breaking, and allows users of the web docs to follow links from old version pages to info on the changes in the new version. Prior to this change, a link to /current/ for renamed sections like the recovery.conf docs would just 404. Similarly if someone searched for recovery.conf they would find the pg11 docs, but there would be no /12/ or /current/ link, so they couldn't easily find out that it was removed in pg12 or how to adapt. Index entries are also added so that there's a breadcrumb trail for users to follow when they know the old name, but not what we changed it to. So a user who is trying to find out how to set standby_mode in PostgreSQL 12+, or where pg_resetxlog went, now has more chance of finding that information. 
Craig Ringer and Stephen Frost Discussion: https://postgr.es/m/CAGRY4nzPNOyYQ_1-pWYToUVqQ0ThqP5jdURnJMZPm539fdizOg%40mail.gmail.com --- .../sgml/appendix-obsolete-pgreceivexlog.sgml | 24 .../sgml/appendix-obsolete-pgresetxlog.sgml | 24 .../sgml/appendix-obsolete-pgxlogdump.sgml| 24 .../appendix-obsolete-recovery-config.sgml| 58 +++ doc/src/sgml/appendix-obsolete.sgml | 40 + doc/src/sgml/config.sgml | 4 +- doc/src/sgml/filelist.sgml| 7 +++ doc/src/sgml/high-availability.sgml | 16 - doc/src/sgml/postgres.sgml| 1 + doc/src/sgml/ref/pg_basebackup.sgml | 5 +- 10 files changed, 198 insertions(+), 5 deletions(-) create mode 100644 doc/src/sgml/appendix-obsolete-pgreceivexlog.sgml create mode 100644 doc/src/sgml/appendix-obsolete-pgresetxlog.sgml create mode 100644 doc/src/sgml/appendix-obsolete-pgxlogdump.sgml create mode 100644 doc/src/sgml/appendix-obsolete-recovery-config.sgml create mode 100644 doc/src/sgml/appendix-obsolete.sgml diff --git a/doc/src/sgml/appendix-obsolete-pgreceivexlog.sgml b/doc/src/sgml/appendix-obsolete-pgreceivexlog.sgml new file mode 100644 index 00..30bae2f7a2 --- /dev/null +++ b/doc/src/sgml/appendix-obsolete-pgreceivexlog.sgml @@ -0,0 +1,24 @@ + + + + + pg_receivexlog renamed to pg_recievewal + + + pg_receivexlog + pg_receivewal + + + +PostgreSQL 9.6 and below provided a command named +pg_receivexlog +pg_receivexlog +to fetch write-ahead-log (WAL) files. This command was renamed to pg_receivewal, see + for documentation of pg_receivewal and see +the release notes for PostgreSQL 10 for details +on this change. + + + diff --git a/doc/src/sgml/appendix-obsolete-pgresetxlog.sgml b/doc/src/sgml/appendix-obsolete-pgresetxlog.sgml new file
Re: [HACKERS] Custom compression methods
On Mon, Mar 22, 2021 at 11:13 AM Justin Pryzby wrote: > The first iteration was pretty rough, and there's still some question in my > mind about where default_toast_compression_options[] should be defined. If > it's in the header file, then I could use lengthof() - but then it probably > gets multiply defined. What do you want to use lengthof() for? > In the latest patch, there's multiple "externs". Maybe > guc.c doesn't need the extern, since it includes toast_compression.h. But > then > it's the only "struct config_enum_entry" which has an "extern" outside of > guc.c. Oh, yeah, we certainly shouldn't have an extern in guc.c itself, if we've already got it in the header file. As to the more general question of where to put stuff, I don't think there's any conceptual problem with putting it in a header file rather than in guc.c. It's not very scalable to just keep inventing new GUCs and sticking all their accoutrements into guc.c. That might have kind of made sense when guc.c was invented, since there were probably fewer settings there and guc.c itself was new, but at this point it's a well-established part of the infrastructure and having other subsystems cater to what it needs rather than the other way around seems logical. However, it's not great to have "utils/guc.h" included in "access/toast_compression.h", because then anything that includes "access/toast_compression.h" or "access/toast_internals.h" sucks in "utils/guc.h" even though it's not really topically related to what they intended to include. We can't avoid that just by choosing to put this enum in guc.c, because GetDefaultToastCompression() also uses it. But, what about giving the default_toast_compression_method GUC an assign hook that sets a global variable of type "char" to the appropriate value? Then GetDefaultToastCompression() goes away entirely. That might be worth exploring. > Also, it looks like you added default_toast_compression out of order, so maybe > you'd fix that at the same time. 
You know, I looked at where you had it and said to myself, "surely this is a silly place to put this, it would make much more sense to move this up a bit." Now I feel dumb. -- Robert Haas EDB: http://www.enterprisedb.com
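The assign-hook idea discussed in this thread can be sketched in miniature. This is a hedged, self-contained toy: the struct, macro values, and function names imitate (but are not) the real PostgreSQL definitions. The point is only the shape of the pattern: the GUC layer validates the enum string once, and the hook caches the chosen method as a plain char, so the former GetDefaultToastCompression() lookup reduces to reading a variable.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for guc.h's enum-option table entry. */
typedef struct config_enum_entry
{
    const char *name;
    int         val;
} config_enum_entry;

/* Hypothetical method identifiers; the real codes live in toast_compression.h. */
#define TOAST_PGLZ_COMPRESSION 'p'
#define TOAST_LZ4_COMPRESSION  'l'

static const config_enum_entry default_toast_compression_options[] = {
    {"pglz", TOAST_PGLZ_COMPRESSION},
    {"lz4", TOAST_LZ4_COMPRESSION},
    {NULL, 0}
};

/* Cached value; only the assign hook ever writes it. */
static char default_toast_compression = TOAST_PGLZ_COMPRESSION;

/* What the GUC machinery would call after validating the setting. */
static void
assign_default_toast_compression(int newval)
{
    default_toast_compression = (char) newval;
}

/* What GetDefaultToastCompression() collapses to under this scheme. */
static char
get_default_toast_compression(void)
{
    return default_toast_compression;
}
```

The enum table stays wherever it is most convenient; hot paths that need the default compression method never touch GUC state at all.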
Re: [HACKERS] Custom compression methods (mac+lz4.h)
Robert Haas writes: > On Mon, Mar 22, 2021 at 11:48 AM Tom Lane wrote: >> Possibly the former names should survive and the latter become >> wrappers around them, not sure. But we shouldn't be using the "4B" >> terminology anyplace except this part of postgres.h. > I would argue that it shouldn't be used any place at all, and that we > ought to go the other direction and get rid of the existing macros - > e.g. change #define VARATT_IS_1B_E to #define VARATT_IS_EXTERNAL > instead of defining the latter as a no-value-added wrapper around the > former. Maybe at one time somebody thought that the test for > VARATT_IS_EXTERNAL might someday have more cases than just > VARATT_IS_1B_E, but that's not looking like a good bet in 2021. Maybe. I think the original idea was exactly what the comment says, to have a layer of macros that'd deal with endianness issues and no more. That still seems like a reasonable plan to me, though perhaps it wasn't executed very well. regards, tom lane
Re: [HACKERS] Custom compression methods (mac+lz4.h)
On Mon, Mar 22, 2021 at 11:48 AM Tom Lane wrote: > > Anyway, this particular macro name was chosen, it seems, for symmetry > > with VARDATA_4B_C, but if you want to change it to something else, I'm > > OK with that, too. > > After looking at postgres.h for a bit, I'm thinking that what these > should have been symmetric with is the considerably-less-terrible > names used for the corresponding VARATT_EXTERNAL cases. Thus, > something like > > s/VARRAWSIZE_4B_C/VARDATA_COMPRESSED_GET_RAWSIZE/ > s/VARCOMPRESS_4B_C/VARDATA_COMPRESSED_GET_COMPRESSION/ Works for me. > Possibly the former names should survive and the latter become > wrappers around them, not sure. But we shouldn't be using the "4B" > terminology anyplace except this part of postgres.h. I would argue that it shouldn't be used any place at all, and that we ought to go the other direction and get rid of the existing macros - e.g. change #define VARATT_IS_1B_E to #define VARATT_IS_EXTERNAL instead of defining the latter as a no-value-added wrapper around the former. Maybe at one time somebody thought that the test for VARATT_IS_EXTERNAL might someday have more cases than just VARATT_IS_1B_E, but that's not looking like a good bet in 2021. -- Robert Haas EDB: http://www.enterprisedb.com
Re: [HACKERS] Custom compression methods (mac+lz4.h)
Robert Haas writes: > On Mon, Mar 22, 2021 at 10:41 AM Tom Lane wrote: >> Yeah, I thought about that too, but do we want to assume that >> VARRAWSIZE_4B_C is the correct way to get the decompressed size >> for all compression methods? > I think it's OK to assume this. OK, cool. >> (If so, I think it would be better style to have a less opaque macro >> name for the purpose.) > Complaining about the name of one particular TOAST-related macro name > seems a bit like complaining about the greenhouse gasses emitted by > one particular car. Maybe, but that's not a reason to make it worse. Anyway, my understanding of that is that the really opaque names are *only* meant to be used in this very stretch of postgres.h, ie they are just intermediate steps on the way to the macros below them. As an example, the only use of VARDATA_1B_E() is in VARDATA_EXTERNAL(). > Anyway, this particular macro name was chosen, it seems, for symmetry > with VARDATA_4B_C, but if you want to change it to something else, I'm > OK with that, too. After looking at postgres.h for a bit, I'm thinking that what these should have been symmetric with is the considerably-less-terrible names used for the corresponding VARATT_EXTERNAL cases. Thus, something like s/VARRAWSIZE_4B_C/VARDATA_COMPRESSED_GET_RAWSIZE/ s/VARCOMPRESS_4B_C/VARDATA_COMPRESSED_GET_COMPRESSION/ Possibly the former names should survive and the latter become wrappers around them, not sure. But we shouldn't be using the "4B" terminology anyplace except this part of postgres.h. regards, tom lane
Re: [HACKERS] Custom compression methods (mac+lz4.h)
On Mon, Mar 22, 2021 at 10:41 AM Tom Lane wrote: > > Okay, the fix makes sense. In fact, IMHO, in general also this fix > > looks like an optimization, I mean when slicelength >= > > VARRAWSIZE_4B_C(value), then why do we need to allocate extra memory > > even in the case of pglz. So shall we put this check directly in > > toast_decompress_datum_slice instead of handling it at the lz4 level? > > Yeah, I thought about that too, but do we want to assume that > VARRAWSIZE_4B_C is the correct way to get the decompressed size > for all compression methods? I think it's OK to assume this. If and when we add a third compression method, it seems certain to just grab one of the two remaining bit patterns. Now, things get a bit more complicated if and when we want to add a fourth method, because at that point you've got to ask yourself how comfortable you feel about stealing the last bit pattern for your feature. But, if the solution to that problem were to decide that whenever that last bit pattern is used, we will add an extra byte (or word) after va_tcinfo indicating the real compression method, then using VARRAWSIZE_4B_C here would still be correct. To imagine this decision being wrong, you have to posit a world in which one of the two remaining bit patterns for the high 2 bits cause the low 30 bits to be interpreted as something other than the size, which I guess is not totally impossible, but my first reaction is to think that such a design would be (1) hard to make work and (2) unnecessarily painful. > (If so, I think it would be better style to have a less opaque macro > name for the purpose.) Complaining about the name of one particular TOAST-related macro name seems a bit like complaining about the greenhouse gasses emitted by one particular car. They're pretty uniformly terrible. Does anyone really know when to use VARATT_IS_1B_E or VARATT_IS_4B_U or any of that cruft? Like, who decided that "is this varatt 1B E?" 
would be a perfectly reasonable way of asking "is this varlena a TOAST pointer?". While I'm complaining, it's hard to say enough bad things about the fact that we have 12 consecutive completely obscure macro definitions for which the only comments are (a) that they are endian-dependent - which isn't even true for all of them - and (b) that they are "considered internal." Apparently, they're SO internal that they don't even need to be understandable to other developers. Anyway, this particular macro name was chosen, it seems, for symmetry with VARDATA_4B_C, but if you want to change it to something else, I'm OK with that, too. -- Robert Haas EDB: http://www.enterprisedb.com
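The bit-pattern arithmetic Robert describes can be made concrete with a toy model: the low 30 bits of the compressed datum's size word hold the raw (uncompressed) size, and the high 2 bits select the compression method, which is why only four methods fit before the "extra byte after va_tcinfo" escape hatch would be needed. The macro names follow the renaming proposed in this thread, but they operate on a bare uint32 here, not a real varlena header.

```c
#include <assert.h>
#include <stdint.h>

#define VARLENA_EXTSIZE_BITS 30
#define VARLENA_EXTSIZE_MASK ((UINT32_C(1) << VARLENA_EXTSIZE_BITS) - 1)

/* Two of the four available 2-bit method codes; names are illustrative. */
#define TOAST_PGLZ_COMPRESSION_ID UINT32_C(0)
#define TOAST_LZ4_COMPRESSION_ID  UINT32_C(1)

/* Accessors spelled like the proposed replacements for the "4B" macros. */
#define VARDATA_COMPRESSED_GET_RAWSIZE(tcinfo) \
    ((tcinfo) & VARLENA_EXTSIZE_MASK)
#define VARDATA_COMPRESSED_GET_COMPRESSION(tcinfo) \
    ((tcinfo) >> VARLENA_EXTSIZE_BITS)

/* Pack a raw size and a method code into one 32-bit word. */
static uint32_t
make_tcinfo(uint32_t rawsize, uint32_t method)
{
    assert(rawsize <= VARLENA_EXTSIZE_MASK);    /* must fit in 30 bits */
    assert(method <= 3);                        /* only 2 bits available */
    return rawsize | (method << VARLENA_EXTSIZE_BITS);
}
```

Under this layout, reinterpreting the low 30 bits as anything other than a size (the scenario Robert calls "not totally impossible") would indeed require redefining what an entire high-bit pattern means.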
Re: [HACKERS] Custom compression methods (mac+lz4.h)
On Mon, Mar 22, 2021 at 8:11 PM Tom Lane wrote: > > Dilip Kumar writes: > > On Mon, Mar 22, 2021 at 5:22 AM Tom Lane wrote: > >> Also, after studying the documentation for LZ4_decompress_safe > >> and LZ4_decompress_safe_partial, I realized that liblz4 is also > >> counting on the *output* buffer size to not be a lie. So we > >> cannot pass it a number larger than the chunk's true decompressed > >> size. The attached patch resolves the issue I'm seeing. > > > Okay, the fix makes sense. In fact, IMHO, in general also this fix > > looks like an optimization, I mean when slicelength >= > > VARRAWSIZE_4B_C(value), then why do we need to allocate extra memory > > even in the case of pglz. So shall we put this check directly in > > toast_decompress_datum_slice instead of handling it at the lz4 level? > > Yeah, I thought about that too, but do we want to assume that > VARRAWSIZE_4B_C is the correct way to get the decompressed size > for all compression methods? Yeah, VARRAWSIZE_4B_C is the macro getting the rawsize of the data stored in the compressed varlena. > (If so, I think it would be better style to have a less opaque macro > name for the purpose.) Okay, I have added another macro that is less opaque and came up with this patch. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com v1-0001-fix-slice-decompression.patch Description: Binary data
Re: [HACKERS] Custom compression methods
On Mon, Mar 22, 2021 at 11:05:19AM -0400, Robert Haas wrote: > On Mon, Mar 22, 2021 at 10:44 AM Justin Pryzby wrote: > > Thanks. I just realized that if you also push the GUC change, then the docs > > should change from to > > > > doc/src/sgml/config.sgml: > > default_toast_compression (string) > > I've now also committed your 0005. As for 0006, aside from the note > above, which is a good one, is there any particular reason why this > patch is labelled as WIP? I think this change makes sense and we > should just do it unless there's some problem with it. The first iteration was pretty rough, and there's still some question in my mind about where default_toast_compression_options[] should be defined. If it's in the header file, then I could use lengthof() - but then it probably gets multiply defined. In the latest patch, there's multiple "externs". Maybe guc.c doesn't need the extern, since it includes toast_compression.h. But then it's the only "struct config_enum_entry" which has an "extern" outside of guc.c. Also, it looks like you added default_toast_compression out of order, so maybe you'd fix that at the same time. -- Justin
Re: Disable WAL logging to speed up data loading
Greetings, * Laurenz Albe (laurenz.a...@cybertec.at) wrote: > On Mon, 2021-03-22 at 09:46 -0400, Stephen Frost wrote: > > * tsunakawa.ta...@fujitsu.com (tsunakawa.ta...@fujitsu.com) wrote: > > > From: Stephen Frost > > > > The argument here seems to stem from the idea that issueing a 'TRUNCATE' > > > > inside the transaction before starting the 'COPY' command is 'too hard'. > > > > > > No, we can't ask using TRUNCATE because the user wants to add data to a > > > table. > > > > First- what are you expecting would actually happen during crash > > recovery in this specific case with your proposed new WAL level? > > > > Second, use partitioning, or unlogged tables (with the patch discussed > > elsewhere to allow flipping them to logged without writing the entire > > thing to WAL). > > Perhaps allowing to set unlogged tables to logged ones without writing WAL > is a more elegant way to do that, but I cannot see how that would be any > more crash safe than this patch. Besides, the feature doesn't exist yet. I'm not suggesting it's somehow more crash safe- but it's at least very clear what happens in such a case, to wit: the entire table is cleared on crash recovery. > So I think that argument doesn't carry much weight. We're talking about two different ways to accomplish essentially the same thing- one which introduces a new WAL level, vs. one which adds an optimization for a WAL level we already have. That the second is more elegant is more-or-less entirely the point I'm making here, so it seems pretty relevant. Thanks, Stephen signature.asc Description: PGP signature
Re: [HACKERS] Custom compression methods
On Mon, Mar 22, 2021 at 10:44 AM Justin Pryzby wrote: > Thanks. I just realized that if you also push the GUC change, then the docs > should change from to > > doc/src/sgml/config.sgml: > default_toast_compression (string) I've now also committed your 0005. As for 0006, aside from the note above, which is a good one, is there any particular reason why this patch is labelled as WIP? I think this change makes sense and we should just do it unless there's some problem with it. -- Robert Haas EDB: http://www.enterprisedb.com
Re: make the stats collector shutdown without writing the statsfiles if the immediate shutdown is requested.
On 2021/03/20 13:40, Masahiro Ikeda wrote:
> Sorry, there is no evidence we should call it. I thought that calling
> FreeWaitEventSet() is good manners after using CreateWaitEventSet(), and
> the performance impact of calling it seems to be too small. (Please let
> me know if my understanding is wrong.)
>
> The reason why I said this is a matter of manners is that I thought it's
> no problem even without calling FreeWaitEventSet() before the process
> exits, regardless of fast or immediate shutdown. Since the "WaitEventSet
> *wes" is a process-local resource, it will be released by the OS even if
> FreeWaitEventSet() is not called.
>
> I'm now changing my mind: skipping it is better because the code can be
> simpler, and it's unimportant for the above reason, especially when an
> immediate shutdown is requested, which is bad manners in itself.

Thanks for explaining this! So you're thinking that the v3 patch should be chosen? Probably some review comments I posted upthread need to be applied to the v3 patch, e.g., the rename of the SignalHandlerForUnsafeExit() function.

> BTW, SysLoggerMain() creates the WaitEventSet, but it doesn't call
> FreeWaitEventSet(). Is it better to call FreeWaitEventSet() before
> exiting there, too?

Yes, if it could cause an actual issue. Otherwise I don't have a strong reason to do that for now...

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center, Research and Development Headquarters
NTT DATA CORPORATION
Re: Disable WAL logging to speed up data loading
On Mon, 2021-03-22 at 09:46 -0400, Stephen Frost wrote: > * tsunakawa.ta...@fujitsu.com (tsunakawa.ta...@fujitsu.com) wrote: > > From: Stephen Frost > > > The argument here seems to stem from the idea that issueing a 'TRUNCATE' > > > inside the transaction before starting the 'COPY' command is 'too hard'. > > > > No, we can't ask using TRUNCATE because the user wants to add data to a > > table. > > First- what are you expecting would actually happen during crash > recovery in this specific case with your proposed new WAL level? > > Second, use partitioning, or unlogged tables (with the patch discussed > elsewhere to allow flipping them to logged without writing the entire > thing to WAL). Perhaps allowing to set unlogged tables to logged ones without writing WAL is a more elegant way to do that, but I cannot see how that would be any more crash safe than this patch. Besides, the feature doesn't exist yet. So I think that argument doesn't carry much weight. Yours, Laurenz Albe
Re: [HACKERS] Custom compression methods
On Mon, Mar 22, 2021 at 10:41:33AM -0400, Robert Haas wrote: > On Sun, Mar 21, 2021 at 7:55 PM Justin Pryzby wrote: > > Rebased again. > > Thanks, Justin. I committed 0003 and 0004 together as > 226e2be3876d0bda3dc33d16dfa0bed246b7b74f. I also committed 0001 and > 0002 together as 24f0e395ac5892cd12e8914646fe921fac5ba23d, but with > some revisions, because your text was not clear that this is setting > the default for new tables, not new values; it also implied that this > only affects out-of-line compression, which is not true. In lieu of > trying to explain how TOAST works here, I added a link. It looks, > though, like that documentation also needs to be patched for this > change. I'll look into that, and your remaining patches, next. Thanks. I just realized that if you also push the GUC change, then the docs should change from to doc/src/sgml/config.sgml: default_toast_compression (string) -- Justin
Re: [HACKERS] Custom compression methods (mac+lz4.h)
Dilip Kumar writes: > On Mon, Mar 22, 2021 at 5:22 AM Tom Lane wrote: >> Also, after studying the documentation for LZ4_decompress_safe >> and LZ4_decompress_safe_partial, I realized that liblz4 is also >> counting on the *output* buffer size to not be a lie. So we >> cannot pass it a number larger than the chunk's true decompressed >> size. The attached patch resolves the issue I'm seeing. > Okay, the fix makes sense. In fact, IMHO, in general also this fix > looks like an optimization, I mean when slicelength >= > VARRAWSIZE_4B_C(value), then why do we need to allocate extra memory > even in the case of pglz. So shall we put this check directly in > toast_decompress_datum_slice instead of handling it at the lz4 level? Yeah, I thought about that too, but do we want to assume that VARRAWSIZE_4B_C is the correct way to get the decompressed size for all compression methods? (If so, I think it would be better style to have a less opaque macro name for the purpose.) regards, tom lane
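The fix under discussion reduces to a one-line clamp. A minimal sketch, with an illustrative function name standing in for the logic inside toast_decompress_datum_slice(): the output size handed to the decompressor, and the buffer allocated for it, must never exceed the datum's true raw size, because LZ4_decompress_safe_partial() treats the stated output buffer size as authoritative; for pglz the same check simply avoids over-allocating the result.

```c
#include <assert.h>
#include <stdint.h>

/* Cap the requested slice length at the datum's decompressed (raw) size,
 * as obtained from the compressed varlena header.  This is both a
 * correctness requirement for liblz4 and a memory-saving optimization
 * for pglz. */
static inline int32_t
clamp_slice_length(int32_t slicelength, int32_t rawsize)
{
    return (slicelength < rawsize) ? slicelength : rawsize;
}
```

With the clamp applied before allocation, asking for a slice larger than the whole datum degenerates safely into full decompression.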
Re: [HACKERS] Custom compression methods
On Sun, Mar 21, 2021 at 7:55 PM Justin Pryzby wrote: > Rebased again. Thanks, Justin. I committed 0003 and 0004 together as 226e2be3876d0bda3dc33d16dfa0bed246b7b74f. I also committed 0001 and 0002 together as 24f0e395ac5892cd12e8914646fe921fac5ba23d, but with some revisions, because your text was not clear that this is setting the default for new tables, not new values; it also implied that this only affects out-of-line compression, which is not true. In lieu of trying to explain how TOAST works here, I added a link. It looks, though, like that documentation also needs to be patched for this change. I'll look into that, and your remaining patches, next. -- Robert Haas EDB: http://www.enterprisedb.com
Re: PG13 fails to startup even though the current transaction is equal to the target transaction
On 2021/03/22 21:40, Sean Jezewski wrote:
> Hi Kyotaro - Thanks for the response. I think it boils down to your comment:
>
>> I'm not sure. The direct cause of the "issue" is a promotion trigger
>> that came before reaching recovery target. That won't happen if the
>> "someone" doesn't do that.
>
> I think the question is 'under what conditions is it safe to do the
> promotion'? What is your recommendation in this case? The end of the
> archive has been reached. All transactions have been replayed. And in
> fact the current transaction id is exactly equal to the target recovery
> transaction id.

I guess that the transaction with this current XID had not been committed yet at that moment. Right? I thought that because you confirmed the XID by SELECT pg_catalog.txid_snapshot_xmax(pg_catalog.txid_current_snapshot()). IIUC this query doesn't return the XID of an already-committed transaction. The standby thinks that the recovery has reached the recovery target when the XID of a *committed* transaction becomes equal to recovery_target_xid. So in your case, the standby had not reached the recovery target when you requested the promotion. IMO this is why you got that FATAL error.

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center, Research and Development Headquarters
NTT DATA CORPORATION
Re: Minimal logical decoding on standbys
On Thu, Mar 18, 2021 at 5:34 AM Drouvot, Bertrand wrote:
> Thanks!
>
> Just made minor changes to make it compile again on current master (mainly had to take care of ResolveRecoveryConflictWithSnapshotFullXid(), which was introduced in e5d8a99903).
>
> Please find enclosed the new patch version that currently passes "make check" and the 2 associated TAP tests.

Unfortunately it still does not apply to the current master:

$ git am ~/Downloads/v10-000*.patch
Applying: Allow logical decoding on standby.
Applying: Add info in WAL records in preparation for logical slot conflict handling.
error: patch failed: src/backend/access/nbtree/nbtpage.c:32
error: src/backend/access/nbtree/nbtpage.c: patch does not apply
Patch failed at 0002 Add info in WAL records in preparation for logical slot conflict handling.
hint: Use 'git am --show-current-patch' to see the failed patch
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

> I'll have a look at the whole thread to check if there is anything else waiting in the pipe regarding this feature, unless some of you know off the top of their head?

Will do the same!!! But as far as I remember, the last time I checked, everything discussed was covered in this patch set.

Regards,

-- 
Fabrízio de Royes Mello
Timbira - http://www.timbira.com.br/
PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento
Re: New IndexAM API controlling index vacuum strategies
On Thu, Mar 18, 2021 at 9:42 PM Peter Geoghegan wrote: > The fact that we can *continually* reevaluate if an ongoing VACUUM is > at risk of taking too long is entirely the point here. We can in > principle end index vacuuming dynamically, whenever we feel like it > and for whatever reasons occur to us (hopefully these are good reasons > -- the point is that we get to pick and choose). We can afford to be > pretty aggressive about not giving up, while still having the benefit > of doing that when it *proves* necessary. Because: what are the > chances of the emergency mechanism ending index vacuuming being the > wrong thing to do if we only do that when the system clearly and > measurably has no more than about 10% of the possible XID space to go > before the system becomes unavailable for writes? I agree. I was having trouble before understanding exactly what you are proposing, but this makes sense to me and I agree it's a good idea. > > But ... should the thresholds for triggering these kinds of mechanisms > > really be hard-coded with no possibility of being configured in the > > field? What if we find out after the release is shipped that the > > mechanism works better if you make it kick in sooner, or later, or if > > it depends on other things about the system, which I think it almost > > certainly does? Thresholds that can't be changed without a recompile > > are bad news. That's why we have GUCs. > > I'm fine with a GUC, though only for the emergency mechanism. The > default really matters, though -- it shouldn't be necessary to tune > (since we're trying to address a problem that many people don't know > they have until it's too late). I still like 1.8 billion XIDs as the > value -- I propose that that be made the default. I'm not 100% sure whether we need a new GUC for this or not. I think that if by default this triggers at the 90% of the hard-shutdown limit, it would be unlikely, and perhaps unreasonable, for users to want to raise the limit. 
However, I wonder whether some users will want to lower the limit. Would it be reasonable for someone to want to trigger this at 50% or 70% of XID exhaustion rather than waiting until things get really bad? Also, one thing that I dislike about the current system is that, from a user perspective, when something goes wrong, nothing happens for a while and then the whole system goes bananas. It seems desirable to me to find ways of gradually ratcheting up the pressure, like cranking up the effective cost limit if we can somehow figure out that we're not keeping up. If, with your mechanism, there's an abrupt point when we switch from never doing this for any table to always doing this for every table, that might not be as good as something which does this "sometimes" and then, if that isn't enough to avoid disaster, does it "more," and eventually ramps up to doing it always, if trouble continues. I don't know whether that's possible here, or what it would look like, or even whether it's appropriate at all in this particular case, so I just offer it as food for thought. > > On another note, I cannot say enough bad things about the function > > name two_pass_strategy(). I sincerely hope that you're not planning to > > create a function which is a major point of control for VACUUM whose > > name gives no hint that it has anything to do with vacuum. > > You always hate my names for things. But that's fine by me -- I'm > usually not very attached to them. I'm happy to change it to whatever > you prefer. My taste in names may be debatable, but including the subsystem name in the function name seems like a pretty bare-minimum requirement, especially when the other words in the function name could refer to just about anything. -- Robert Haas EDB: http://www.enterprisedb.com
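The trigger condition being debated is easy to state in code. A simplified sketch: the function name echoes the idea rather than any committed API, the 1.8 billion default is the value floated in this thread, and the unsigned-subtraction age computation is a stand-in for PostgreSQL's full modulo-2^32 TransactionId machinery.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Proposed default: give up on index vacuuming once we are within roughly
 * 10% of the XID space that remains before the wraparound hard stop.
 * A GUC, if added, would replace this constant. */
#define FAILSAFE_XID_AGE  UINT32_C(1800000000)

/* Number of XIDs consumed since the oldest unfrozen XID in the table.
 * Unsigned 32-bit subtraction handles wraparound of the XID counter. */
static uint32_t
xid_age(uint32_t relfrozenxid, uint32_t next_xid)
{
    return next_xid - relfrozenxid;
}

/* Reevaluated continually during VACUUM: once true, skip further index
 * vacuuming so the table's relfrozenxid can be advanced sooner. */
static bool
xid_failsafe_check(uint32_t relfrozenxid, uint32_t next_xid)
{
    return xid_age(relfrozenxid, next_xid) >= FAILSAFE_XID_AGE;
}
```

Because the check is cheap, it can be repeated at each stage of an ongoing VACUUM rather than decided once up front, which is exactly the "continually reevaluate" property Peter describes.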
Re: Disable WAL logging to speed up data loading
Greetings, * tsunakawa.ta...@fujitsu.com (tsunakawa.ta...@fujitsu.com) wrote: > From: Stephen Frost > > The argument here seems to stem from the idea that issueing a 'TRUNCATE' > > inside the transaction before starting the 'COPY' command is 'too hard'. > > > I could be wrong and perhaps opinions will change in the future, but it > > really > > doesn't seem like the there's much support for adding a new WAL level just > > to > > avoid doing the TRUNCATE. Appealing to what other database systems > > support can be helpful in some cases but that's usually when we're looking > > at a > > new capability which multiple other systems have implemented. This isn't > > actually a new capability at all- the WAL level that you're looking for > > already > > exists and already gives the optimization you're looking for, as long as > > you issue > > a TRUNCATE at the start of the transaction. > > No, we can't ask using TRUNCATE because the user wants to add data to a table. First- what are you expecting would actually happen during crash recovery in this specific case with your proposed new WAL level? Second, use partitioning, or unlogged tables (with the patch discussed elsewhere to allow flipping them to logged without writing the entire thing to WAL). Thanks, Stephen signature.asc Description: PGP signature