Re: Disable WAL logging to speed up data loading
At Fri, 02 Oct 2020 13:51:35 +0900 (JST), Kyotaro Horiguchi wrote in > Sorry for the slippery brain... ^2 > At Fri, 02 Oct 2020 13:38:22 +0900 (JST), Kyotaro Horiguchi > wrote in > > At Fri, 2 Oct 2020 10:56:21 +0900, Fujii Masao > > wrote in > > > > > > On 2020/10/02 10:06, Kyotaro Horiguchi wrote: > > > > 1. Flip BM_PERMANENT of active buffers > > > > 2. adding/removing init fork > > > > 3. sync files, > > > > 4. Flip pg_class.relpersistence. > > > > It always skips table copy in the SET UNLOGGED case, > > > > > > Even in wal_level != minimal? > > > What happens in the standby side when SET UNLOGGED is executed without > > > the table rewrite in the primary? The table data should be truncated > > > in the standby? > > A table turned into unlogged on the primary is also turned into > > unlogged on the standby and it is inaccessible on the standby. > Maybe the storage dropped on unpatched Maybe the old storage is replaced with an empty storage on unpatched. > and left alone on patched. > > After the table is again turned into logged, the content is > > transferred via WAL records generated from the insertions into the new > > storage and it rebuilds the same storage on the standby on both > > patched and unpatched. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
RE: Disable WAL logging to speed up data loading
From: Fujii Masao > > The speedup has already been achieved with higher durability by > > wal_level=minimal in that case. > > I was thinking the same, i.e., wal_level=minimal + wal_skip_threshold would > speed up that initial data loading. First of all, thank you Horiguchi-san for trying to improve ALTER TABLE SET UNLOGGED/LOGGED. That should also be appealing. At the same time, as I said before, both features have good points. TBH, as a user, I'm kind of attracted by MySQL's approach because of its simplicity for users (although DBMS developers may be worried about this and that.) What tempts me is that I can just switch on the feature with a single configuration parameter, and continue to use existing SQL scripts and other data integration software without knowing what tables those load data into. In the same context, I don't have to add or delete ALTER TABLE statements when I have to change the set of tables to be loaded. For the same reason, I'm also interested in another Oracle feature, ALTER TABLESPACE LOGGING/NOLOGGING. BTW, does ALTER TABLE LOGGED/UNLOGGED on a partitioned table apply the change to all its partitions? It would be a bit tedious to add/delete ALTER TABLE LOGGED/UNLOGGED when I add/drop a partition. Regarding data migration, data movement is not limited only to major upgrades. It will be convenient to speed up the migration of the entire database cluster into a new instance for testing and new deployment. (I'm not sure about recent pg_upgrade, but pg_upgrade sometimes cannot upgrade very old versions.) To conclude, I hope both features will be realized, and wish we won't fall into a situation where words fly such as "Mine is enough. Yours is risky and not necessary." With that said, I think we may as well separate the thread some time later for the CF entry. Otherwise, we will have trouble finding the latest patch from the CF entry. > No, I have no strong objection against your trial. 
But I was thinking > that it's not so easy to design and implement wal_level=none. > For example, there are some functions and commands depending on > the existence of WAL, like pg_switch_wal(), PREPARE TRANSACTION > and COMMIT PREPARED. Probably you need to define how they should > work in wal_level=none, e.g., emit an error. Yeah, we thought pg_switch_wal() may need some treatment. We'll check PREPARE and COMMIT PREPARED as well. I'd appreciate it if you share what you notice at any time. It is possible that we should emit WAL records of some resource managers, like the bootstrap mode emits WAL only for RM_XLOG_ID. Regards Takayuki Tsunakawa
Re: Disable WAL logging to speed up data loading
Sorry for the slippery brain... At Fri, 02 Oct 2020 13:38:22 +0900 (JST), Kyotaro Horiguchi wrote in > At Fri, 2 Oct 2020 10:56:21 +0900, Fujii Masao > wrote in > > > > On 2020/10/02 10:06, Kyotaro Horiguchi wrote: > > > 1. Flip BM_PERMANENT of active buffers > > > 2. adding/removing init fork > > > 3. sync files, > > > 4. Flip pg_class.relpersistence. > > > It always skips table copy in the SET UNLOGGED case, > > > > Even in wal_level != minimal? > > What happens in the standby side when SET UNLOGGED is executed without > > the table rewrite in the primary? The table data should be truncated > > in the standby? > > A table turned into unlogged on the primary is also turned into > unlogged on the standby and it is inaccessible on the standby. > Maybe the storage is dropped on both patched and unpatched versions. Maybe the storage is dropped on unpatched and left alone on patched. > After the table is again turned into logged, the content is > transferred via WAL records generated from the insertions into the new > storage and it rebuilds the same storage on the standby on both > patched and unpatched. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: [Patch] Optimize dropping of relation buffers using dlist
At Fri, 02 Oct 2020 11:44:46 +0900 (JST), Kyotaro Horiguchi wrote in > At Thu, 1 Oct 2020 12:55:34 +, "k.jami...@fujitsu.com" > wrote in > - * XXX currently it sequentially searches the buffer pool, should > be > - * changed to more clever ways of searching. However, this routine > - * is used only in code paths that aren't very > performance-critical, > - * and we shouldn't slow down the hot paths to make it faster ... > + * XXX The relation might have extended before this, so this path > is > + * only optimized during recovery when we can get a reliable cached > + * value of blocks for specified relation. In addition, it is > safe to > + * do this since there are no other processes but the startup > process > + * that changes the relation size during recovery. Otherwise, or > if > + * not in recovery, proceed to usual invalidation process, where it > + * sequentially searches the buffer pool. > > This should no longer be a XXX comment. It seems to me somewhat > describing too-detailed at this function's level. How about something > like the follwoing? (excpet its syntax, or phrasing:p) > > === > If the expected maximum number of buffers to drop is small enough > compared to NBuffers, individual buffers are located by > BufTableLookup. Otherwise we scan through all buffers. Snnce we > mustn't leave a buffer behind, we take the latter way unless the > number is not reliably identified. See smgrcachednblocks() for > details. > === The second to last phrase is inverted, and some typos are found. FWIW this is the revised version. If the number of buffers to drop is expected to be small enough, we locate individual buffers using BufTableLookup. Otherwise we scan through all buffers. Since we mustn't leave a buffer behind, we take the latter way unless the sizes of all the involved forks are known to be accurate. See smgrcachednblocks() for details. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Disable WAL logging to speed up data loading
At Fri, 2 Oct 2020 10:56:21 +0900, Fujii Masao wrote in > > On 2020/10/02 10:06, Kyotaro Horiguchi wrote: > > At Thu, 1 Oct 2020 08:14:42 +, "osumi.takami...@fujitsu.com" > > wrote in > >> When I compare the 2 ideas, > >> one of the benefits of this ALTER TABLE's improvement > >> is that we can avoid the downtime > >> while that of wal_level='none' provides an easy and faster > >> major version up via output file of pg_dumpall. > > The speedup has already been achieved with higher durability by > > wal_level=minimal in that case. > > I was thinking the same, i.e., wal_level=minimal + wal_skip_threshold > would > speed up that initial data loading. Yeah. > > Or maybe you should consider using > > pg_upgrade instead. Even including the time to take a backup copy of > > the whole cluster, running pg_upgrade would be far faster than > > pg_dumpall then loading. > > > >> Both ideas have good points. > >> However, actually to modify ALTER TABLE's copy > >> looks far more difficult than wal_level='none' and > >> beyond my current ability. > >> So, I'd like to go forward with the direction of wal_level='none'. > >> Did you have strong objections for this direction ? > > No, I have no strong objection against your trial. But I was thinking > that it's not so easy to design and implement wal_level=none. > For example, there are some functions and commands depending on > the existence of WAL, like pg_switch_wal(), PREPARE TRANSACTION > and COMMIT PREPARED. Probably you need to define how they should > work in wal_level=none, e.g., emit an error. > > > > For fuel(?) of the discussion, I tried a very-quick PoC for in-place > > ALTER TABLE SET LOGGED/UNLOGGED and resulted as attached. After some > > trials of several ways, I drifted to the following way after poking > > several ways. > > Nice! > > > > 1. Flip BM_PERMANENT of active buffers > > 2. adding/removing init fork > > 3. sync files, > > 4. Flip pg_class.relpersistence. 
> > It always skips table copy in the SET UNLOGGED case, > > Even in wal_level != minimal? > What happens in the standby side when SET UNLOGGED is executed without > the table rewrite in the primary? The table data should be truncated > in the standby? A table turned into unlogged on the primary is also turned into unlogged on the standby and it is inaccessible on the standby. Maybe the storage is dropped on both patched and unpatched versions. After the table is again turned into logged, the content is transferred via WAL records generated from the insertions into the new storage and it rebuilds the same storage on the standby on both patched and unpatched. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: New statistics for tuning WAL buffer size
On 2020-10-02 10:21, Fujii Masao wrote: On 2020/10/01 13:35, Fujii Masao wrote: On 2020/10/01 12:56, Masahiro Ikeda wrote: On 2020-10-01 11:33, Amit Kapila wrote: On Thu, Oct 1, 2020 at 6:53 AM Kyotaro Horiguchi wrote: At Thu, 1 Oct 2020 09:05:19 +0900, Fujii Masao wrote in > > > On 2020/09/30 20:21, Amit Kapila wrote: > > On Tue, Sep 29, 2020 at 9:23 PM Fujii Masao > > wrote: > >> > >> On 2020/09/29 11:51, Masahiro Ikeda wrote: > >>> On 2020-09-29 11:43, Amit Kapila wrote: > On Tue, Sep 29, 2020 at 7:39 AM Masahiro Ikeda > wrote: > >>> Thanks for your suggestion. > >>> I understood that the point is that WAL-related stats have just one > >>> counter now. > >>> > >>> Since we may add some WAL-related stats like pgWalUsage.(bytes, > >>> records, fpi), > >>> I think that the current approach is good. > >> > >> +1 > >> > > Okay, it makes sense to keep it in the current form if we have a plan > > to extend this view with additional stats. However, why don't we > > expose it with a function similar to pg_stat_get_archiver() instead of > > providing individual functions like pg_stat_get_wal_buffers_full() and > > pg_stat_get_wal_stat_reset_time? > > We can adopt either of those approaches for pg_stat_wal. I think that > the former is a bit more flexible because we can collect only one of > WAL information even when pg_stat_wal will contain many information > in the future, by using the function. But you thought there are some > reasons that the latter is better for pg_stat_wal? FWIW I prefer to expose it by one SRF function rather than by subdivided functions. One of the reasons is the less oid consumption and/or reduction of definitions for intrinsic functions. Another reason is at least for me subdivided functions are not useful so much for on-the-fly examination on psql console. I'm often annoyed by realizing I can't recall the exact name of a function, say, pg_last_wal_receive_lsn or such but function names cannot be auto-completed on psql console. 
"select proname from pg_proc where proname like.. " is one of my friends:p On the other hand "select * from pg_stat_wal" requires no detailed memory. However subdivided functions might be useful if I wanted use just one number of wal-stats in a function, I think it is not a major usage and we can use a SQL query on the view instead. Another reason that I mildly want to object to subdivided functions is I was annoyed that a stats view makes many individual calls to functions that internally share the same statistics entry. That behavior required me to provide an entry-caching feature to my shared-memory statistics patch. All these are good reasons to expose it via one function and I think Understood. +1 to expose it as one function. that is why most of our existing views also use one function approach. Thanks for your comments. I didn't notice there are the above disadvantages to provide individual functions. I changed the latest patch to expose it via one function. Thanks for updating the patch! LGTM. Barring any other objection, I will commit it. I updated typedefs.list and pushed the patch. Thanks! Thanks to all reviewers! Regards, -- Masahiro Ikeda NTT DATA CORPORATION
Re: Assertion failure with barriers in parallel hash join
On Tue, Sep 29, 2020 at 9:12 PM Thomas Munro wrote: > On Tue, Sep 29, 2020 at 7:11 PM Michael Paquier wrote: > > #2 0x009027d2 in ExceptionalCondition > > (conditionName=conditionName@entry=0xa80846 "!barrier->static_party", > > > #4 0x00682ebf in ExecParallelHashJoinNewBatch > > Thanks. Ohhh. I think I see how that condition was reached and what > to do about it, but I'll need to look more closely. I'm away on > vacation right now, and will update in a couple of days when I'm back > at a real computer. Here's a throw-away patch to add some sleeps that trigger the problem, and a first draft fix. I'll do some more testing of this next week and see if I can simplify it. From 65f70e0be36ec61e1993907162cfd4edac46e063 Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Fri, 2 Oct 2020 15:24:23 +1300 Subject: [PATCH 1/2] Inject fault timing --- src/backend/executor/nodeHash.c | 11 +++ 1 file changed, 11 insertions(+) diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c index ea69eeb2a1..244805e69b 100644 --- a/src/backend/executor/nodeHash.c +++ b/src/backend/executor/nodeHash.c @@ -25,6 +25,7 @@ #include <math.h> #include <limits.h> +#include <unistd.h> #include "access/htup_details.h" #include "access/parallel.h" @@ -585,6 +586,13 @@ ExecHashTableCreate(HashState *state, List *hashOperators, List *hashCollations, ParallelHashJoinState *pstate = hashtable->parallel_state; Barrier*build_barrier; + if (ParallelWorkerNumber >= 1) + { + elog(LOG, "a short nap before attaching to build_barrier..."); + sleep(2); + elog(LOG, "nap finished"); + } + /* * Attach to the build barrier. The corresponding detach operation is * in ExecHashTableDetach. 
Note that we won't attach to the @@ -3198,6 +3206,9 @@ ExecHashTableDetach(HashJoinTable hashtable) if (DsaPointerIsValid(pstate->batches)) { dsa_free(hashtable->area, pstate->batches); +elog(LOG, "batch array freed, taking a long nap..."); +sleep(5); +elog(LOG, "finished nap, clearing pointer"); pstate->batches = InvalidDsaPointer; } } -- 2.20.1 From 21f745905dcaff738326434943e70c4292aae4a4 Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Fri, 2 Oct 2020 15:53:44 +1300 Subject: [PATCH 2/2] Fix race condition in parallel hash join batch cleanup. With unlucky timing and parallel_leader_participation off, PHJ could attempt to access per-batch state just as it was being freed. There was code intended to prevent that by checking for a cleared pointer, but it was racy. Fix, by introducing an extra barrier phase. The new phase PHJ_BUILD_RUNNING means that it's safe to access the per-batch state to find a batch to help with, and PHJ_BUILD_DONE means that it is too late. The last to detach will free the array of per-batch state as before, but now it will also atomically advance the phase at the same time, so that late attachers can avoid the hazard. This mirrors the way per-batch hash tables are freed (see phases PHJ_BATCH_PROBING and PHJ_BATCH_DONE). Revealed by a build farm failure, where BarrierAttach() failed a sanity check assertion, because the memory had been clobbered by dsa_free(). 
Discussion: https://postgr.es/m/20200929061142.GA29096%40paquier.xyz --- src/backend/executor/nodeHash.c | 47 - src/backend/executor/nodeHashjoin.c | 38 +-- src/include/executor/hashjoin.h | 3 +- 3 files changed, 56 insertions(+), 32 deletions(-) diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c index 244805e69b..b1013b452b 100644 --- a/src/backend/executor/nodeHash.c +++ b/src/backend/executor/nodeHash.c @@ -334,14 +334,21 @@ MultiExecParallelHash(HashState *node) hashtable->nbuckets = pstate->nbuckets; hashtable->log2_nbuckets = my_log2(hashtable->nbuckets); hashtable->totalTuples = pstate->total_tuples; - ExecParallelHashEnsureBatchAccessors(hashtable); + + /* + * Unless we're completely done and the batch state has been freed, make + * sure we have accessors. + */ + if (BarrierPhase(build_barrier) < PHJ_BUILD_DONE) + ExecParallelHashEnsureBatchAccessors(hashtable); /* * The next synchronization point is in ExecHashJoin's HJ_BUILD_HASHTABLE - * case, which will bring the build phase to PHJ_BUILD_DONE (if it isn't + * case, which will bring the build phase to PHJ_BUILD_RUNNING (if it isn't * there already). */ Assert(BarrierPhase(build_barrier) == PHJ_BUILD_HASHING_OUTER || + BarrierPhase(build_barrier) == PHJ_BUILD_RUNNING || BarrierPhase(build_barrier) == PHJ_BUILD_DONE); } @@ -632,7 +639,7 @@ ExecHashTableCreate(HashState *state, List *hashOperators, List *hashCollations, /* * The next Parallel Hash synchronization point is in * MultiExecParallelHash(), which will progress it all the way to - * PHJ_BUILD_DONE. The caller must not return control from this + * PHJ_BUILD_RUNNING. The caller must not return control from this *
Re: Why does PostgresNode.pm set such a low value of max_wal_senders?
At Wed, 30 Sep 2020 22:38:59 -0700, Noah Misch wrote in noah> Perhaps wal_level=minimal should stop its pedantic call for max_wal_senders=0. noah> As long as the relevant error messages are clear, it would be fine for noah> wal_level=minimal to ignore max_wal_senders and size resources as though noah> max_wal_senders=0. That could be one less snag for end users. (It's not noah> worth changing solely to save a line in PostgresNode, though.) At Thu, 01 Oct 2020 09:42:52 -0400, Tom Lane wrote in tgl> On the other point, I think that we should continue to complain tgl> about max_wal_senders > 0 with wal_level = minimal. If we reduce tgl> that to a LOG message, which'd be the net effect of trying to be tgl> laxer, people wouldn't see it and would then wonder why they can't tgl> start replication. FWIW, I'm on Noah's side. One reason for that is that if we implement the in-place relation persistence setting feature for bulk-data loading, wal_level would get flipped-then-back between minimal and replica or logical. The restriction about max_wal_senders is a pain in the ass in that case. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: [Patch] Optimize dropping of relation buffers using dlist
At Thu, 1 Oct 2020 12:55:34 +, "k.jami...@fujitsu.com" wrote in > On Thursday, October 1, 2020 4:52 PM, Tsunakawa-san wrote: > > > From: Kyotaro Horiguchi > > > I thought that the advantage of this optimization is that we don't > > > need to visit all buffers? If we need to run a full-scan for any > > > reason, there's no point in looking-up already-visited buffers again. > > > That's just wasteful cycles. Am I missing something? > > > > > > I don't understand. If we chose the optimized dropping, the reason > > > is the number of buffer lookups is fewer than a certain threshold. Why > > > do you think that the fork kind a buffer belongs to is relevant to the > > > criteria? > > > > I rethought about this, and you certainly have a point, but... OK, I think > > I > > understood. I should have thought in a complicated way. In other words, > > you're suggesting "Let's simply treat all forks as one relation to determine > > whether to optimize," right? That is, the code simply becomes: Exactly. The concept of the threshold is that if we are expected to repeat buffer look-ups more than that, we consider a one-time full scan more efficient. Since we know we are going to drop buffers of all (or the specified) forks of the relation at once, the number of look-ups is naturally the sum of the expected numbers of buffers of all forks. > > whether to optimize," right? That is, the code simply becomes: > > > > Sums up the number of buffers to invalidate in all forks; > > if (the cached sizes > > of all forks are valid && # of buffers to invalidate < THRESHOLD) { > > do the optimized way; > > return; > > } > > do the traditional way; > > > > This will be simple, and I'm +1. Thanks! > This is actually close to the v18 I posted trying Horiguchi-san's approach, but that > patch had a bug. So attached is an updated version (v20) trying this approach again. > I hope it's bug-free this time. Thanks for the new version. 
- * XXX currently it sequentially searches the buffer pool, should be - * changed to more clever ways of searching. However, this routine - * is used only in code paths that aren't very performance-critical, - * and we shouldn't slow down the hot paths to make it faster ... + * XXX The relation might have extended before this, so this path is The following description is found in the comment for FlushRelationBuffers. > * XXX currently it sequentially searches the buffer pool, should > be > * changed to more clever ways of searching. This routine is not > * used in any performance-critical code paths, so it's not worth > * adding additional overhead to normal paths to make it go faster; > * but see also DropRelFileNodeBuffers. This looks to me like "We won't do that kind of optimization for FlushRelationBuffers, but DropRelFileNodeBuffers would need it". If so, don't we need to revise that comment together? - * XXX currently it sequentially searches the buffer pool, should be - * changed to more clever ways of searching. However, this routine - * is used only in code paths that aren't very performance-critical, - * and we shouldn't slow down the hot paths to make it faster ... + * XXX The relation might have extended before this, so this path is + * only optimized during recovery when we can get a reliable cached + * value of blocks for specified relation. In addition, it is safe to + * do this since there are no other processes but the startup process + * that changes the relation size during recovery. Otherwise, or if + * not in recovery, proceed to usual invalidation process, where it + * sequentially searches the buffer pool. This should no longer be a XXX comment. It seems to me somewhat describing too-detailed at this function's level. How about something like the follwoing? (excpet its syntax, or phrasing:p) === If the expected maximum number of buffers to drop is small enough compared to NBuffers, individual buffers are located by BufTableLookup. 
Otherwise we scan through all buffers. Snnce we mustn't leave a buffer behind, we take the latter way unless the number is not reliably identified. See smgrcachednblocks() for details. === (I'm still mildly opposed to the function name, which seems exposing detail too much.) +* Get the total number of cached blocks and to-be-invalidated blocks +* of the relation. If a fork's nblocks is not valid, break the loop. The number of file blocks is not usually equal to the number of existing buffers for the file. We might need to explain that limitation here. + for (j = 0; j < nforks; j++) Though I understand that j is considered to be in a connection with fork number, I'm a bit uncomfortable that j is used for the outmost
Re: Improving connection scalability: GetSnapshotData()
On 2020/10/02 3:26, Andres Freund wrote: Hi Ian, Andrew, All, On 2020-09-30 15:43:17 -0700, Andres Freund wrote: Attached is an updated version of the test (better utility function, stricter regexes, bailing out instead of failing just the current when psql times out). I'm leaving it in this test for now, but it's fairly easy to use this way, in my opinion, so it may be worth moving to PostgresNode at some point. I pushed this yesterday. Ian, thanks again for finding this and helping with fixing & testing. Thanks! Apologies for not getting back to your earlier responses, have been distracted by Various Other Things. The tests I run which originally triggered the issue now run just fine. Regards Ian Barwick -- Ian Barwick https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Disable WAL logging to speed up data loading
On 2020/10/02 10:06, Kyotaro Horiguchi wrote: At Thu, 1 Oct 2020 08:14:42 +, "osumi.takami...@fujitsu.com" wrote in Hi, Horiguchi-San and Fujii-San. Thank you so much both of you. the table needs to be rewritten. One idea for that is to improve that command so that it skips the table rewrite if wal_level=minimal. Of course, also you can change wal_level after marking the table as unlogged. tablecmds.c: The idea is really interesting. I didn't come up with getting rid of the whole copy of the ALTER TABLE UNLOGGED/LOGGED commands only when wal_level='minimal'. * There are two reasons for requiring a rewrite when changing * persistence: on one hand, we need to ensure that the buffers * belonging to each of the two relations are marked with or without * BM_PERMANENT properly. On the other hand, since rewriting creates * and assigns a new relfilenode, we automatically create or drop an * init fork for the relation as appropriate. Thanks for sharing concrete comments in the source code. According to this comment, perhaps we can do that at least for wal_level=minimal. When I compare the 2 ideas, one of the benefits of this ALTER TABLE's improvement is that we can avoid the downtime while that of wal_level='none' provides an easy and faster major version up via output file of pg_dumpall. The speedup has already been achieved with higher durability by wal_level=minimal in that case. I was thinking the same, i.e., wal_level=minimal + wal_skip_threshold would speed up that initial data loading. Or maybe you should consider using pg_upgrade instead. Even including the time to take a backup copy of the whole cluster, running pg_upgrade would be far faster than pg_dumpall then loading. Both ideas have good points. However, actually to modify ALTER TABLE's copy looks far more difficult than wal_level='none' and beyond my current ability. So, I'd like to go forward with the direction of wal_level='none'. Did you have strong objections for this direction ? 
No, I have no strong objection against your trial. But I was thinking that it's not so easy to design and implement wal_level=none. For example, there are some functions and commands depending on the existence of WAL, like pg_switch_wal(), PREPARE TRANSACTION and COMMIT PREPARED. Probably you need to define how they should work in wal_level=none, e.g., emit an error. For fuel(?) of the discussion, I tried a very-quick PoC for in-place ALTER TABLE SET LOGGED/UNLOGGED and resulted as attached. After some trials of several ways, I drifted to the following way after poking several ways. Nice! 1. Flip BM_PERMANENT of active buffers 2. adding/removing init fork 3. sync files, 4. Flip pg_class.relpersistence. It always skips table copy in the SET UNLOGGED case, Even in wal_level != minimal? What happens in the standby side when SET UNLOGGED is executed without the table rewrite in the primary? The table data should be truncated in the standby? Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
Re: New statistics for tuning WAL buffer size
On 2020/10/01 13:35, Fujii Masao wrote: On 2020/10/01 12:56, Masahiro Ikeda wrote: On 2020-10-01 11:33, Amit Kapila wrote: On Thu, Oct 1, 2020 at 6:53 AM Kyotaro Horiguchi wrote: At Thu, 1 Oct 2020 09:05:19 +0900, Fujii Masao wrote in > > > On 2020/09/30 20:21, Amit Kapila wrote: > > On Tue, Sep 29, 2020 at 9:23 PM Fujii Masao > > wrote: > >> > >> On 2020/09/29 11:51, Masahiro Ikeda wrote: > >>> On 2020-09-29 11:43, Amit Kapila wrote: > On Tue, Sep 29, 2020 at 7:39 AM Masahiro Ikeda > wrote: > >>> Thanks for your suggestion. > >>> I understood that the point is that WAL-related stats have just one > >>> counter now. > >>> > >>> Since we may add some WAL-related stats like pgWalUsage.(bytes, > >>> records, fpi), > >>> I think that the current approach is good. > >> > >> +1 > >> > > Okay, it makes sense to keep it in the current form if we have a plan > > to extend this view with additional stats. However, why don't we > > expose it with a function similar to pg_stat_get_archiver() instead of > > providing individual functions like pg_stat_get_wal_buffers_full() and > > pg_stat_get_wal_stat_reset_time? > > We can adopt either of those approaches for pg_stat_wal. I think that > the former is a bit more flexible because we can collect only one of > WAL information even when pg_stat_wal will contain many information > in the future, by using the function. But you thought there are some > reasons that the latter is better for pg_stat_wal? FWIW I prefer to expose it by one SRF function rather than by subdivided functions. One of the reasons is the less oid consumption and/or reduction of definitions for intrinsic functions. Another reason is at least for me subdivided functions are not useful so much for on-the-fly examination on psql console. I'm often annoyed by realizing I can't recall the exact name of a function, say, pg_last_wal_receive_lsn or such but function names cannot be auto-completed on psql console. "select proname from pg_proc where proname like.. 
" is one of my friends:p On the other hand "select * from pg_stat_wal" requires no detailed memory. However subdivided functions might be useful if I wanted use just one number of wal-stats in a function, I think it is not a major usage and we can use a SQL query on the view instead. Another reason that I mildly want to object to subdivided functions is I was annoyed that a stats view makes many individual calls to functions that internally share the same statistics entry. That behavior required me to provide an entry-caching feature to my shared-memory statistics patch. All these are good reasons to expose it via one function and I think Understood. +1 to expose it as one function. that is why most of our existing views also use one function approach. Thanks for your comments. I didn't notice there are the above disadvantages to provide individual functions. I changed the latest patch to expose it via one function. Thanks for updating the patch! LGTM. Barring any other objection, I will commit it. I updated typedefs.list and pushed the patch. Thanks! Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
Re: buildfarm animal shoveler failing with "Illegal instruction"
Mark Wong writes: > I'm getting Tom set up with access too, in case he has time before me to > get a stack trace to see what's happening... tl;dr: it's hard to conclude that this is anything but a compiler bug. I was able to reproduce this on shoveler's host, but only when using the compiler shoveler uses (clang-3.9), not the 6.3 gcc that's also on there and is of similar vintage. Observations: * You don't need any complicated test case; "pg_dump template1" is enough. * Reverting 1ed6b8956's addition of a "postfix operators are not supported anymore" warning to dumpOpr() makes it go away. This despite the fact that that code is never reached when dumping template1. (We do enter dumpOpr, but the oprinfo->dobj.dump test always fails.) * Reducing the optimization level to -O1 or -O0 makes it go away. * Inserting a debugging fprintf in dumpOpr makes it go away. Since clang 3.9 is several years old, maybe we could move shoveler up to a newer version? Or dial it down to -O1 optimization? regards, tom lane
Re: enable_incremental_sort changes query behavior
On Thu, Oct 1, 2020 at 6:08 PM Tomas Vondra wrote: > > On Thu, Oct 01, 2020 at 09:02:57AM -0400, James Coleman wrote: > >On Thu, Oct 1, 2020 at 3:09 AM Jaime Casanova > > wrote: > >> > >> On Wed, 30 Sep 2020 at 21:21, James Coleman wrote: > >> > > >> > On Sat, Sep 26, 2020 at 2:49 PM Jaime Casanova > >> > wrote: > >> > > > >> > > Hi, > >> > > > >> > > With sqlsmith I found a query that gives this error: > >> > > ERROR: ORDER/GROUP BY expression not found in targetlist > >> > > > >> [...] > >> > > > >> > > But if I set enable_incremental_sort to off the query gets executed > >> > > without problems (attached the explain produced for that case) > >> > > >> > Thanks for the report. > >> > > >> > >> Hi, > >> > >> by experiment I reduced the query to this > >> > >> --- 0 --- > >> select distinct > >> subq_0.c1 as c0, > >> case when (true = pg_catalog.pg_rotate_logfile_old()) then > >> ref_0.t else ref_0.t > >> end > >> as c4 > >> from > >> public.ref_0, > >> lateral (select > >> > >> ref_0.i as c1 > >> from > >> generate_series(1, 100) as ref_1) as subq_0 > >> --- 0 --- > >> > >> the only custom table already needed can be created with this commands: > >> > >> --- 0 --- > >> create table ref_0 as select repeat('abcde', (random() * 10)::integer) > >> t, random() * 1000 i from generate_series(1, 50); > >> create index on ref_0 (i); > >> analyze ref_0 ; > >> --- 0 --- > >> > >> > >> > Is there by an chance an index on ref_0.radi_text_temp? > >> > > >> > >> there is an index involved but not on that field, commands above > >> create the index in the right column... after that, ANALYZE the table > >> > >> > And if you set enable_hashagg = off what plan do you get (or error)? > >> > > >> > >> same error > > > >I was able to reproduce the error without incremental sort enabled > >(i.e., it happens with a full sort also). 
The function call in the > >SELECT doesn't have to be in a case expression; for example I was able > >to reproduce changing that to `random()::text || ref_0.t`. > > > >It looks like the issue happens when: > >1. The sort happens within a parallel node. > >2. One of the sort keys is an expression containing a volatile > >function call and a column from the lateral join. > > > >Here are the settings I used with your above repro case to show it > >with regular sort: > > > > enable_hashagg=off > > enable_incremental_sort=off > > enable_seqscan=off > > parallel_setup_cost=10 > > parallel_tuple_cost=0 > > > >The plan (obtained by replacing the volatile function with a stable one): > > > > Unique > > -> Nested Loop > > -> Gather Merge > > Workers Planned: 2 > > -> Sort > > Sort Key: ref_0.i, (md5(ref_0.t)) > > -> Parallel Index Scan using ref_0_i_idx on ref_0 > > -> Function Scan on generate_series ref_1 > > > >Changing `md5(ref_0.t)` to `random()::text || ref_0.t` causes the error. > > > >I haven't been able to dig further than that yet, but my intuition is > >to poke around in the parallel query machinery? > > > > Nope. Bisect says this was introduced by this commit: > > ba3e76cc57 Consider Incremental Sort paths at additional places > > Looking at the diff, I suspect generate_useful_gather_paths (or maybe > get_useful_pathkeys_for_relation) fails to do something with the > pathkeys. > > Of course, that'd explain why it only happens for parallel plans, as > this builds gather nodes. I should have known I'd regret making a wild guess. I was in the middle of bisecting too, but had to set it aside for a bit and hadn't finished. I'll try to look into that commit (and I agree those functions are the likely places to look) as I get a chance here. James
Re: Disable WAL logging to speed up data loading
At Thu, 1 Oct 2020 08:14:42 +, "osumi.takami...@fujitsu.com" wrote in > Hi, Horiguchi-San and Fujii-San. > > > Thank you so much, both of you. > > > the table needs to be rewritten. One idea for that is to improve that > > > command so that it skips the table rewrite if wal_level=minimal. > > > Of course, also you can change wal_level after marking the table as > > > unlogged. > > > > tablecmd.c: > The idea is really interesting. > I hadn't come up with getting rid of the whole table copy in > the ALTER TABLE UNLOGGED/LOGGED commands > when wal_level='minimal'. > > > > * There are two reasons for requiring a rewrite when changing > > > * persistence: on one hand, we need to ensure that the buffers > > > * belonging to each of the two relations are marked with or without > > > * BM_PERMANENT properly. On the other hand, since rewriting creates > > > * and assigns a new relfilenode, we automatically create or drop an > > > * init fork for the relation as appropriate. > Thanks for sharing concrete comments in the source code. > > > According to this comment, perhaps we can do that at least for > > wal_level=minimal. > When I compare the 2 ideas, > one of the benefits of this ALTER TABLE improvement > is that we can avoid the downtime, > while wal_level='none' provides an easy and faster > major version upgrade via the output file of pg_dumpall. The speedup has already been achieved with higher durability by wal_level=minimal in that case. Or maybe you should consider using pg_upgrade instead. Even including the time to take a backup copy of the whole cluster, running pg_upgrade would be far faster than pg_dumpall then loading. > Both ideas have good points. > However, actually modifying ALTER TABLE's copy > looks far more difficult than wal_level='none' and > beyond my current ability. > So, I'd like to go forward with the direction of wal_level='none'. > Do you have strong objections to this direction? For fuel(?)
of the discussion, I tried a very quick PoC for in-place ALTER TABLE SET LOGGED/UNLOGGED; the result is attached. After trying several approaches, I settled on the following:

1. Flip BM_PERMANENT of active buffers
2. Add/remove the init fork
3. Sync files
4. Flip pg_class.relpersistence

It always skips the table copy in the SET UNLOGGED case, and skips it only when wal_level=minimal in the SET LOGGED case. Crash recovery seems to work, per some brief testing by hand. Of course, I haven't performed intensive tests on it. regards. -- Kyotaro Horiguchi NTT Open Source Software Center

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index dcaea7135f..d8a1c2a7e7 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -613,6 +613,26 @@ heapam_relation_set_new_filenode(Relation rel,
 	smgrclose(srel);
 }
 
+static void
+heapam_relation_set_persistence(Relation rel, char persistence)
+{
+	Assert(rel->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT ||
+		   rel->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED);
+
+	Assert (rel->rd_rel->relpersistence != persistence);
+
+	if (persistence == RELPERSISTENCE_UNLOGGED)
+	{
+		Assert(rel->rd_rel->relkind == RELKIND_RELATION ||
+			   rel->rd_rel->relkind == RELKIND_MATVIEW ||
+			   rel->rd_rel->relkind == RELKIND_TOASTVALUE);
+		RelationCreateInitFork(rel->rd_node);
+	}
+	else
+		RelationDropInitFork(rel->rd_node);
+}
+
+
 static void
 heapam_relation_nontransactional_truncate(Relation rel)
 {
@@ -2540,6 +2560,7 @@ static const TableAmRoutine heapam_methods = {
 	.compute_xid_horizon_for_tuples = heap_compute_xid_horizon_for_tuples,
 
 	.relation_set_new_filenode = heapam_relation_set_new_filenode,
+	.relation_set_persistence = heapam_relation_set_persistence,
 	.relation_nontransactional_truncate = heapam_relation_nontransactional_truncate,
 	.relation_copy_data = heapam_relation_copy_data,
 	.relation_copy_for_cluster = heapam_relation_copy_for_cluster,
diff --git a/src/backend/access/rmgrdesc/smgrdesc.c b/src/backend/access/rmgrdesc/smgrdesc.c
index a7c0cb1bc3..8397002613 100644
--- a/src/backend/access/rmgrdesc/smgrdesc.c
+++ b/src/backend/access/rmgrdesc/smgrdesc.c
@@ -40,6 +40,14 @@ smgr_desc(StringInfo buf, XLogReaderState *record)
 						 xlrec->blkno, xlrec->flags);
 		pfree(path);
 	}
+	else if (info == XLOG_SMGR_UNLINK)
+	{
+		xl_smgr_unlink *xlrec = (xl_smgr_unlink *) rec;
+		char	   *path = relpathperm(xlrec->rnode, xlrec->forkNum);
+
+		appendStringInfoString(buf, path);
+		pfree(path);
+	}
 }
 
 const char *
@@ -55,6 +63,9 @@ smgr_identify(uint8 info)
 		case XLOG_SMGR_TRUNCATE:
 			id = "TRUNCATE";
 			break;
+		case XLOG_SMGR_UNLINK:
+			id = "UNLINK";
+			break;
 	}
 
 	return id;
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index dbbd3aa31f..7f695e6d8b 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -60,6 +60,8 @@ int
RE: New statistics for tuning WAL buffer size
From: Fujii Masao > I think that we can improve that, for example, by storing backend id > into WalSndCtl and making pg_stat_get_wal_senders() directly > get the walsender's LocalPgBackendStatus with the backend id, > rather than joining pg_stat_get_activity() and pg_stat_get_wal_senders(). Yeah, I had something like that in mind. I think I'll take note of this as my private homework. (Of course, anyone can do it.) Regards Takayuki Tsunakawa
RE: Should walsender check correctness of WAL records?
From: Michael Paquier > CRC calculation would unlikely be the bottleneck here, no? I would assume > that the extra lseek() calls needed to look after the record data to be more > harmful. Maybe, although I'm not sure lseek() is necessary. I simply thought walsender was designed to just read and send WAL without caring about other things for maximal speed. > Yep. However, I would worry much more about the case of cold archives. In > my experience, there are higher risks to get a WAL segment corrupted because > it was on disk and that this disk got corrupted. Transmission is a one-time > short operation. Cold archives could stay on disk for weeks before getting > reused in WAL replay. Yes, I think cold archives should be checked regularly. pg_verifybackup and pg_waldump can be used for it, can't they? Regards Takayuki Tsunakawa
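For illustration, the per-record checksum under discussion is a CRC-32C over the record's header and payload; the standby (and offline tools such as pg_waldump) recompute it and compare against the stored value. Below is a minimal, table-free sketch of CRC-32C in C — an illustration of the check only, not PostgreSQL's pg_crc32c, which uses slice-by-8 tables or hardware instructions:

```c
#include <stdint.h>
#include <stddef.h>

/*
 * Minimal bitwise CRC-32C (Castagnoli polynomial, reflected form).
 * Illustrative only: real WAL verification covers the record header
 * and payload and uses an optimized implementation.
 */
uint32_t
crc32c(const void *data, size_t len)
{
	const uint8_t *p = (const uint8_t *) data;
	uint32_t	crc = 0xFFFFFFFF;

	while (len--)
	{
		crc ^= *p++;
		for (int i = 0; i < 8; i++)
			crc = (crc >> 1) ^ (0x82F63B78 & (0U - (crc & 1)));
	}
	return crc ^ 0xFFFFFFFF;
}
```

Any single flipped bit in a cold archive segment changes the computed value, which is why periodically re-verifying archived segments catches at-rest corruption.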
Re: Add information to rm_redo_error_callback()
On Thu, Oct 01, 2020 at 11:18:30AM +0200, Drouvot, Bertrand wrote: > Had a look at it and did a few tests: looks all good to me. > > No objections at all, thanks! Thanks for double-checking. Applied, then. -- Michael signature.asc Description: PGP signature
Re: New statistics for tuning WAL buffer size
On 2020/10/01 10:50, tsunakawa.ta...@fujitsu.com wrote: From: Kyotaro Horiguchi Another reason that I mildly want to object to subdivided functions is I was annoyed that a stats view makes many individual calls to functions that internally share the same statistics entry. That behavior required me to provide an entry-caching feature to my shared-memory statistics patch. +1 The views for troubleshooting performance problems should be as light as possible. IIRC, we saw frequently searching pg_stat_replication consume unexpectedly high CPU power, because it calls pg_stat_get_activity(null) to get all sessions and join them with the walsenders. At that time, we had hundreds of client sessions. We expected pg_stat_replication to be very lightweight because it provides information about a few walsenders. I think that we can improve that, for example, by storing backend id into WalSndCtl and making pg_stat_get_wal_senders() directly get the walsender's LocalPgBackendStatus with the backend id, rather than joining pg_stat_get_activity() and pg_stat_get_wal_senders(). Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
Re: Should walsender check correctness of WAL records?
On Fri, Oct 02, 2020 at 12:16:25AM +, tsunakawa.ta...@fujitsu.com wrote: > IIUC, walsender tries hard to send WAL as fast as possible to reduce > replication lag and transaction response time, so it doesn't try to > peek each WAL record. I think it's good. CRC calculation would unlikely be the bottleneck here, no? I would assume that the extra lseek() calls needed to look after the record data to be more harmful. > In any case, the WAL can get corrupt during transmission, and > writing and reading on the standby. So, the standby needs to check > the WAL record CRC. Yep. However, I would worry much more about the case of cold archives. In my experience, there are higher risks to get a WAL segment corrupted because it was on disk and that this disk got corrupted. Transmission is a one-time short operation. Cold archives could stay on disk for weeks before getting reused in WAL replay. -- Michael signature.asc Description: PGP signature
RE: Should walsender check correctness of WAL records?
From: Konstantin Knizhnik > Investigating one of customer's support cases I found out that walsender > is not calculating WAL records CRC and send them to replicas without any > checks. > As a result damaged WAL record causes errors on all replicas: IIUC, walsender tries hard to send WAL as fast as possible to reduce replication lag and transaction response time, so it doesn't try to peek each WAL record. I think it's good. In any case, the WAL can get corrupt during transmission, and writing and reading on the standby. So, the standby needs to check the WAL record CRC. Regards Takayuki Tsunakawa
Re: Asynchronous Append on postgres_fdw nodes.
On Tue, Sep 29, 2020 at 4:45 AM Etsuro Fujita wrote: > BTW: I noticed that you changed the ExecProcNode() API so that an > Append calling FDWs can know whether they return tuples immediately or > not: > That is, 1) in postgresIterateForeignScan() postgres_fdw sets the new > PlanState’s flag asyncstate to AS_AVAILABLE/AS_WAITING depending on > whether it returns a tuple immediately or not, and then 2) the Append > knows that from the new flag when the callback routine returns. I’m > not sure this is a good idea, because it seems likely that the > ExecProcNode() change would affect many other places in the executor, > making maintenance and/or future development difficult. I think the > FDW callback routines proposed in the original patch by Robert would > provide a cleaner way to do asynchronous execution of FDWs without > changing the ExecProcNode() API, IIUC: > > +On the other hand, nodes that wish to produce tuples asynchronously > +generally need to implement three methods: > + > +1. When an asynchronous request is made, the node's ExecAsyncRequest callback > +will be invoked; it should use ExecAsyncSetRequiredEvents to indicate the > +number of file descriptor events for which it wishes to wait and whether it > +wishes to receive a callback when the process latch is set. Alternatively, > +it can instead use ExecAsyncRequestDone if a result is available immediately. > + > +2. When the event loop wishes to wait or poll for file descriptor events and > +the process latch, the ExecAsyncConfigureWait callback is invoked to > configure > +the file descriptor wait events for which the node wishes to wait. This > +callback isn't needed if the node only cares about the process latch. > + > +3. When file descriptors or the process latch become ready, the node's > +ExecAsyncNotify callback is invoked. > > What is the reason for not doing it like this in your patch? I think we should avoid changing the ExecProcNode() API.
Thomas’ patch also provides a clean FDW API that doesn’t change the ExecProcNode() API, but I think the FDW API provided in Robert’s patch would be better designed, because I think it would support more types of asynchronous interaction between the core and FDWs. Consider this bit from Thomas’ patch, which produces a tuple when a file descriptor becomes ready:

+	if (event.events & WL_SOCKET_READABLE)
+	{
+		/* Linear search for the node that told us to wait for this fd. */
+		for (i = 0; i < node->nasyncplans; ++i)
+		{
+			if (event.fd == node->asyncfds[i])
+			{
+				TupleTableSlot *result;
+
+				/*
+ -->			 * We assume that because the fd is ready, it can produce
+ -->			 * a tuple now, which is not perfect. An improvement
+ -->			 * would be if it could say 'not yet, I'm still not
+ -->			 * ready', so eg postgres_fdw could PQconsumeInput and
+ -->			 * then say 'I need more input'.
+				 */
+				result = ExecProcNode(node->asyncplans[i]);
+				if (!TupIsNull(result))
+				{
+					/*
+					 * Remember this plan so that append_next_async will
+					 * keep trying this subplan first until it stops
+					 * feeding us buffered tuples.
+					 */
+					node->lastreadyplan = i;
+					/* We can stop waiting for this fd. */
+					node->asyncfds[i] = 0;
+					return result;
+				}
+				else
+				{
+					/*
+					 * This subplan has reached EOF. We'll go back and
+					 * wait for another one.
+					 */
+					forget_async_subplan(node, i);
+					break;
+				}
+			}
+		}
+	}

As commented above, his patch doesn’t allow an FDW to do another data fetch from the remote side before returning a tuple when the file descriptor becomes available, but Robert’s patch would, using his FDW API ForeignAsyncNotify(), which is called when the file descriptor becomes available, IIUC. I might be missing something, but I feel inclined to vote for Robert’s patch (more precisely, Robert’s patch as a base patch with (1) some planner/executor changes from Horiguchi-san’s patch and (2) postgres_fdw changes from Thomas’ patch adjusted to match Robert’s FDW API). Best regards, Etsuro Fujita
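The control flow in the quoted hunk — block until some subplan's socket is readable, then drive that subplan — can be sketched outside the executor with plain poll(). This is a toy stand-in with hypothetical names; the real patches route waiting through the executor's WaitEventSet and the ExecAsyncConfigureWait/ExecAsyncNotify callbacks rather than raw poll():

```c
#include <poll.h>

#define MAX_ASYNC_PLANS 16

/*
 * Toy event-loop step: wait up to timeout_ms until one of the subplan
 * file descriptors is readable, and return its index so the caller can
 * run that subplan (the role ExecProcNode plays in the quoted hunk).
 * Returns -1 on timeout or error.
 */
int
wait_ready_subplan(const int *fds, int nfds, int timeout_ms)
{
	struct pollfd pfd[MAX_ASYNC_PLANS];

	if (nfds < 1 || nfds > MAX_ASYNC_PLANS)
		return -1;
	for (int i = 0; i < nfds; i++)
	{
		pfd[i].fd = fds[i];
		pfd[i].events = POLLIN;
		pfd[i].revents = 0;
	}
	if (poll(pfd, nfds, timeout_ms) <= 0)
		return -1;				/* timed out, or poll() failed */
	for (int i = 0; i < nfds; i++)
	{
		if (pfd[i].revents & POLLIN)
			return i;			/* this subplan's fd is ready */
	}
	return -1;
}
```

The "not perfect" caveat in the comment applies here too: a readable fd only means bytes arrived, not that a complete tuple can be produced, which is what ForeignAsyncNotify()-style callbacks are meant to let the FDW decide.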
Re: NOTIFY docs fixup - emit and deliver consistency
On Tue, Sep 29, 2020 at 8:38 PM David G. Johnston < david.g.johns...@gmail.com> wrote: > On Tue, Sep 29, 2020 at 7:58 PM Tom Lane wrote: > >> Maybe we could use terminology along the lines of "added to the >> queue" and "removed from the queue"? >> > > Quickly looking it over with this in mind there are a few spots that can > be cleaned up and linked together by explicitly talking about a FIFO queue > as the mechanism instead of the less precise send/process/deliver. A bit > more invasive but I think it will be done more clearly with this approach. > As attached. v2-doc-notify-queue-focus.patch Description: Binary data
Re: Rejecting redundant options in Create Collation
On Thu, Oct 01, 2020 at 02:58:53PM -0400, Tom Lane wrote: > Hmm ... I think that that is pretty standard behavior for a lot of > our utility commands. Trying something at random, The behavior handling is a bit inconsistent. For example EXPLAIN and VACUUM don't do that, because their parenthesized grammar got introduced after the flavor that handles options as separate items in the query, so redundant options were not something possible with only the original grammar. > I can see the argument for complaining about repeated options rather > than letting this pass. But shouldn't we try to make it uniform > instead of letting different commands have different policies? Consistency would be a good thing. There is a very recent discussion in this area for COPY actually, where the option redundancy is handled incorrectly: https://www.postgresql.org/message-id/20200929072433.ga15...@paquier.xyz The argument goes both ways. If we keep accepting redundant options, I would suspect that users will complain about the lack of error checks for bogus input values, but I'd like to think that it can also be beneficial in some cases. Now, if we don't handle redundant options, that's less code, less risk of introducing semi-broken behaviors like what COPY has done for years, and a bit less work for new translators, though the last point is not really an issue as we have a generic error message here. From the point of view of a code maintainer, not handling redundant options at all is more appealing IMO in the long term. From the user perspective, there are benefits and disadvantages with both. So, in the end, it seems to me that not doing it anymore for any command would have more benefits in the long term than doing it. -- Michael signature.asc Description: PGP signature
Re: Improving connection scalability: GetSnapshotData()
On 10/1/20 4:22 PM, Andres Freund wrote: > Hi, > > On 2020-10-01 16:00:20 -0400, Andrew Dunstan wrote: >> My strong suspicion is that we're getting unwanted CRs. Note the >> presence of numerous instances of this in PostgresNode.pm: > >> $stdout =~ s/\r\n/\n/g if $Config{osname} eq 'msys'; >> >> So you probably want something along those lines at the top of the loop >> in send_query_and_wait: >> >> $$psql{stdout} =~ s/\r\n/\n/g if $Config{osname} eq 'msys'; > Yikes, that's ugly :(. > > > I assume it's not, as the comments says > # Note: on Windows, IPC::Run seems to convert \r\n to \n in program > output > # if we're using native Perl, but not if we're using MSys Perl. So do > it > # by hand in the latter case, here and elsewhere. > that IPC::Run converts things, but that native windows perl uses > https://perldoc.perl.org/perlrun#PERLIO > a PERLIO that includes :crlf, whereas msys probably doesn't? > > Any chance you could run something like > perl -mPerlIO -e 'print(PerlIO::get_layers(STDIN), "\n");' > on both native and msys perl? > > >> possibly also for stderr, just to make it more futureproof, and at the >> top of the file: >> >> use Config; >> >> >> Do you want me to test that first? > That'd be awesome. > > > The change I suggested makes jacana happy. cheers andrew -- Andrew Dunstanhttps://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: buildfarm animal shoveler failing with "Illegal instruction"
On Thu, Oct 01, 2020 at 12:12:44PM -0700, Andres Freund wrote: > Hi Mark, > > shoveler has been failing for a while with an odd error. E.g. > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=shoveler&dt=2020-09-18%2014%3A01%3A48 > > Illegal instruction > pg_dumpall: error: pg_dump failed on database "template1", exiting > waiting for server to shut down done > > None of the changes in that time frame look like they're likely causing > illegal instructions to be emitted that weren't before. So I am > wondering if anything changed on that machine around 2020-09-18 > 14:01:48 ? It looks like the last package update was 2020-06-10 06:59:26, according to the apt logs. I'm getting Tom set up with access too, in case he has time before me to get a stack trace to see what's happening... Regards, Mark
Re: WIP: BRIN multi-range indexes
On 2020-Oct-01, Tomas Vondra wrote: > OK, so this seems like a data corruption bug in BRIN, actually. Oh crap. You're right -- the data needs to be detoasted before being put in the index. I'll have a look at how this can be fixed.
Re: enable_incremental_sort changes query behavior
On Thu, Oct 01, 2020 at 09:02:57AM -0400, James Coleman wrote: On Thu, Oct 1, 2020 at 3:09 AM Jaime Casanova wrote: On Wed, 30 Sep 2020 at 21:21, James Coleman wrote: > > On Sat, Sep 26, 2020 at 2:49 PM Jaime Casanova > wrote: > > > > Hi, > > > > With sqlsmith I found a query that gives this error: > > ERROR: ORDER/GROUP BY expression not found in targetlist > > [...] > > > > But if I set enable_incremental_sort to off the query gets executed > > without problems (attached the explain produced for that case) > > Thanks for the report. > Hi, by experiment I reduced the query to this --- 0 --- select distinct subq_0.c1 as c0, case when (true = pg_catalog.pg_rotate_logfile_old()) then ref_0.t else ref_0.t end as c4 from public.ref_0, lateral (select ref_0.i as c1 from generate_series(1, 100) as ref_1) as subq_0 --- 0 --- the only custom table already needed can be created with this commands: --- 0 --- create table ref_0 as select repeat('abcde', (random() * 10)::integer) t, random() * 1000 i from generate_series(1, 50); create index on ref_0 (i); analyze ref_0 ; --- 0 --- > Is there by an chance an index on ref_0.radi_text_temp? > there is an index involved but not on that field, commands above create the index in the right column... after that, ANALYZE the table > And if you set enable_hashagg = off what plan do you get (or error)? > same error I was able to reproduce the error without incremental sort enabled (i.e., it happens with a full sort also). The function call in the SELECT doesn't have to be in a case expression; for example I was able to reproduce changing that to `random()::text || ref_0.t`. It looks like the issue happens when: 1. The sort happens within a parallel node. 2. One of the sort keys is an expression containing a volatile function call and a column from the lateral join. 
Here are the settings I used with your above repro case to show it with regular sort: enable_hashagg=off enable_incremental_sort=off enable_seqscan=off parallel_setup_cost=10 parallel_tuple_cost=0 The plan (obtained by replacing the volatile function with a stable one): Unique -> Nested Loop -> Gather Merge Workers Planned: 2 -> Sort Sort Key: ref_0.i, (md5(ref_0.t)) -> Parallel Index Scan using ref_0_i_idx on ref_0 -> Function Scan on generate_series ref_1 Changing `md5(ref_0.t)` to `random()::text || ref_0.t` causes the error. I haven't been able to dig further than that yet, but my intuition is to poke around in the parallel query machinery? Nope. Bisect says this was introduced by this commit: ba3e76cc57 Consider Incremental Sort paths at additional places Looking at the diff, I suspect generate_useful_gather_paths (or maybe get_useful_pathkeys_for_relation) fails to do something with the pathkeys. Of course, that'd explain why it only happens for parallel plans, as this builds gather nodes. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Improving connection scalability: GetSnapshotData()
Hi, On 2020-10-01 16:44:03 -0400, Andrew Dunstan wrote: > > I assume it's not, as the comments says > > # Note: on Windows, IPC::Run seems to convert \r\n to \n in program > > output > > # if we're using native Perl, but not if we're using MSys Perl. So do > > it > > # by hand in the latter case, here and elsewhere. > > that IPC::Run converts things, but that native windows perl uses > > https://perldoc.perl.org/perlrun#PERLIO > > a PERLIO that includes :crlf, whereas msys probably doesn't? > > > > Any chance you could run something like > > perl -mPerlIO -e 'print(PerlIO::get_layers(STDIN), "\n");' > > on both native and msys perl? > > msys (jacana): stdio > > native: unixcrlf Interesting. That suggests we could get around needing the if-msys branches in several places by setting PERLIO to unixcrlf somewhere centrally when using msys. Greetings, Andres Freund
Re: Improving connection scalability: GetSnapshotData()
On 10/1/20 4:22 PM, Andres Freund wrote: > Hi, > > On 2020-10-01 16:00:20 -0400, Andrew Dunstan wrote: >> My strong suspicion is that we're getting unwanted CRs. Note the >> presence of numerous instances of this in PostgresNode.pm: > >> $stdout =~ s/\r\n/\n/g if $Config{osname} eq 'msys'; >> >> So you probably want something along those lines at the top of the loop >> in send_query_and_wait: >> >> $$psql{stdout} =~ s/\r\n/\n/g if $Config{osname} eq 'msys'; > Yikes, that's ugly :(. > > > I assume it's not, as the comments says > # Note: on Windows, IPC::Run seems to convert \r\n to \n in program > output > # if we're using native Perl, but not if we're using MSys Perl. So do > it > # by hand in the latter case, here and elsewhere. > that IPC::Run converts things, but that native windows perl uses > https://perldoc.perl.org/perlrun#PERLIO > a PERLIO that includes :crlf, whereas msys probably doesn't? > > Any chance you could run something like > perl -mPerlIO -e 'print(PerlIO::get_layers(STDIN), "\n");' > on both native and msys perl? > > msys (jacana): stdio native: unixcrlf cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Improving connection scalability: GetSnapshotData()
Hi, On 2020-10-01 16:00:20 -0400, Andrew Dunstan wrote: > My strong suspicion is that we're getting unwanted CRs. Note the > presence of numerous instances of this in PostgresNode.pm: > $stdout =~ s/\r\n/\n/g if $Config{osname} eq 'msys'; > > So you probably want something along those lines at the top of the loop > in send_query_and_wait: > > $$psql{stdout} =~ s/\r\n/\n/g if $Config{osname} eq 'msys'; Yikes, that's ugly :(. I assume it's not, as the comments says # Note: on Windows, IPC::Run seems to convert \r\n to \n in program output # if we're using native Perl, but not if we're using MSys Perl. So do it # by hand in the latter case, here and elsewhere. that IPC::Run converts things, but that native windows perl uses https://perldoc.perl.org/perlrun#PERLIO a PERLIO that includes :crlf, whereas msys probably doesn't? Any chance you could run something like perl -mPerlIO -e 'print(PerlIO::get_layers(STDIN), "\n");' on both native and msys perl? > possibly also for stderr, just to make it more futureproof, and at the > top of the file: > > use Config; > > > Do you want me to test that first? That'd be awesome. > The difference between the canonical way perl states the regex is due to > perl version differences. It shouldn't matter. Thanks! Greetings, Andres Freund
Re: Improving connection scalability: GetSnapshotData()
On 10/1/20 2:26 PM, Andres Freund wrote: > Hi Ian, Andrew, All, > > On 2020-09-30 15:43:17 -0700, Andres Freund wrote: >> Attached is an updated version of the test (better utility function, >> stricter regexes, bailing out instead of failing just the current test when >> psql times out). I'm leaving it in this test for now, but it's fairly >> easy to use this way, in my opinion, so it may be worth moving to >> PostgresNode at some point. > I pushed this yesterday. Ian, thanks again for finding this and helping > with fixing & testing. > > > Unfortunately currently some buildfarm animals don't like the test for > reasons I don't quite understand. Looks like it's all windows + msys > animals that run the tap tests. On jacana and fairywren the new test > fails, but with a somewhat confusing problem: > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jacana&dt=2020-10-01%2015%3A32%3A34 > Bail out! aborting wait: program timed out > # stream contents: >>data > # (0 rows) > # << > # pattern searched for: (?m-xis:^\\(0 rows\\)$) > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2020-10-01%2014%3A12%3A13 > Bail out! aborting wait: program timed out > stream contents: >>data > (0 rows) > << > pattern searched for: (?^m:^\\(0 rows\\)$) > > I don't know what the -xis indicates on jacana, and why it's not present > on fairywren. Nor do I know why the pattern doesn't match in the first > place, sure looks like it should? > > Andrew, do you have an insight into how mingw's regex match differs > from native windows and proper unixoid systems? I guess it's somewhere > around line endings or such? > > Jacana successfully deals with 013_crash_restart.pl, which does use the > same mechanism as the new 021_row_visibility.pl - I think the only real > difference is that I used ^ and $ in the regexes in the latter... My strong suspicion is that we're getting unwanted CRs.
Note the presence of numerous instances of this in PostgresNode.pm: $stdout =~ s/\r\n/\n/g if $Config{osname} eq 'msys'; So you probably want something along those lines at the top of the loop in send_query_and_wait: $$psql{stdout} =~ s/\r\n/\n/g if $Config{osname} eq 'msys'; possibly also for stderr, just to make it more futureproof, and at the top of the file: use Config; Do you want me to test that first? The difference between the canonical way perl states the regex is due to perl version differences. It shouldn't matter. cheers andrew -- Andrew Dunstanhttps://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
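The normalization Andrew suggests — s/\r\n/\n/g on the captured psql output — is plain CRLF-to-LF rewriting. As a language-neutral illustration of what that one-liner does (not code from PostgresNode.pm), here is the same transformation in C:

```c
#include <stddef.h>
#include <string.h>

/*
 * In-place CRLF -> LF normalization: the C analogue of the Perl
 * substitution s/\r\n/\n/g applied to captured psql output on msys.
 * Returns the new string length.
 */
size_t
normalize_crlf(char *s)
{
	size_t	r = 0, w = 0;

	while (s[r])
	{
		if (s[r] == '\r' && s[r + 1] == '\n')
			r++;				/* drop the CR, keep the LF */
		s[w++] = s[r++];
	}
	s[w] = '\0';
	return w;
}
```

Note that a lone \r is preserved; only the two-byte \r\n sequence is collapsed, matching the Perl substitution — which is why `^\(0 rows\)$` would match the normalized stream but not one still carrying trailing CRs.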
buildfarm animal shoveler failing with "Illegal instruction"
Hi Mark, shoveler has been failing for a while with an odd error. E.g. https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=shoveler&dt=2020-09-18%2014%3A01%3A48 Illegal instruction pg_dumpall: error: pg_dump failed on database "template1", exiting waiting for server to shut down done None of the changes in that time frame look like they're likely causing illegal instructions to be emitted that weren't before. So I am wondering if anything changed on that machine around 2020-09-18 14:01:48 ? Greetings, Andres Freund
Re: Rejecting redundant options in Create Collation
"Daniel Verite" writes: > Currently, it's not an error for CREATE COLLATION to be invoked > with options repeated several times. The last (rightmost) value is kept > and the others are lost. > ... > I suggest the attached simple patch to raise an error when any of > these options is specified multiple times. Hmm ... I think that that is pretty standard behavior for a lot of our utility commands. Trying something at random, regression=# create operator <<< (leftarg = int, leftarg = int, regression(# rightarg = int99, procedure = int4pl, rightarg = int); CREATE OPERATOR Note the lack of any comment on the flat-out-wrong first value for rightarg :-( I can see the argument for complaining about repeated options rather than letting this pass. But shouldn't we try to make it uniform instead of letting different commands have different policies? regards, tom lane
Rejecting redundant options in Create Collation
Hi,

Currently, it's not an error for CREATE COLLATION to be invoked with options repeated several times. The last (rightmost) value is kept and the others are lost. For instance:

CREATE COLLATION x (lc_ctype='en_US.UTF8', lc_collate='en_US.UTF8', lc_ctype='C')

silently ignores lc_ctype='en_US.UTF8'. But that kind of invocation isn't likely to be legit. It's more plausible that it's the result of some mistake or confusion. The same goes for the other options:

CREATE COLLATION [ IF NOT EXISTS ] name (
    [ LOCALE = locale, ]
    [ LC_COLLATE = lc_collate, ]
    [ LC_CTYPE = lc_ctype, ]
    [ PROVIDER = provider, ]
    [ DETERMINISTIC = boolean, ]
    [ VERSION = version ]
)

I suggest the attached simple patch to raise an error when any of these options is specified multiple times.

Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: https://www.manitou-mail.org
Twitter: @DanielVerite

diff --git a/src/backend/commands/collationcmds.c b/src/backend/commands/collationcmds.c
index 9f6582c530..7dfd544396 100644
--- a/src/backend/commands/collationcmds.c
+++ b/src/backend/commands/collationcmds.c
@@ -108,6 +108,19 @@ DefineCollation(ParseState *pstate, List *names, List *parameters, bool if_not_e
 			break;
 		}
 
+		if (*defelp != NULL)
+		{
+			/*
+			 * If the option was previously set, it means that it occurs
+			 * several times in the list, which is not allowed.
+			 */
+			ereport(ERROR,
+					(errcode(ERRCODE_SYNTAX_ERROR),
+					 errmsg("conflicting or redundant options"),
+					 parser_errposition(pstate, defel->location)));
+			break;
+		}
 		*defelp = defel;
 	}
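The rule the patch enforces can be illustrated outside C. A minimal Python sketch of the same "reject repeated options" check (the function name and error wording here are mine, not PostgreSQL's):

```python
def check_options(options):
    """Reject any option name appearing more than once, mirroring the
    check the patch above adds to DefineCollation()."""
    seen = set()
    for name, value in options:
        key = name.lower()
        if key in seen:
            # The patch raises ERRCODE_SYNTAX_ERROR at this point.
            raise ValueError('conflicting or redundant options: "%s"' % key)
        seen.add(key)
    return dict(options)

# Last-value-wins would silently drop lc_ctype='en_US.UTF8';
# with the check, the repeated option is reported instead.
try:
    check_options([("lc_ctype", "en_US.UTF8"),
                   ("lc_collate", "en_US.UTF8"),
                   ("lc_ctype", "C")])
except ValueError as e:
    print(e)
```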
Re: WIP: BRIN multi-range indexes
On Wed, Sep 30, 2020 at 07:57:19AM -0400, John Naylor wrote:
> On Mon, Sep 28, 2020 at 10:12 PM Tomas Vondra wrote:
>> Is it actually all that different from the existing BRIN indexes?
>> Consider this example:
>>
>> create table x (a text, b text, c text);
>> create index on x using brin (a,b,c);
>>
>> create or replace function random_str(p_len int) returns text as $$
>> select string_agg(x, '') from (select chr(1 + (254 * random())::int ) as x from generate_series(1,$1)) foo;
>> $$ language sql;
>>
>> test=# insert into x select random_str(1000), random_str(1000), random_str(1000);
>> ERROR: index row size 9056 exceeds maximum 8152 for index "x_a_b_c_idx"
>
> Hmm, okay. As for which comes first, insert or index creation, I'm
> baffled, too. I also would expect the example above would take up a bit
> over 6000 bytes, but not 9000.

OK, so this seems like a data corruption bug in BRIN, actually.

The ~9000 bytes is actually about right, because the strings are in UTF-8, so roughly 1.5 bytes per character seems about right. And we have 6 values to store (3 columns, min/max for each), so 6 * 1500 = 9000.

The real question is how come INSERT + CREATE INDEX actually manages to create an index tuple. And the answer is pretty simple - brin_form_tuple kinda ignores toasting, happily building index tuples where some values are toasted.
Consider this:

create table x (a text, b text, c text);
insert into x select random_str(1000), random_str(1000), random_str(1000);
create index on x using brin (a,b,c);
delete from x;
vacuum x;

set enable_seqscan=off;

insert into x select random_str(10), random_str(10), random_str(10);
ERROR: missing chunk number 0 for toast value 16530 in pg_toast_16525

explain analyze select * from x where a = 'xxx';
ERROR: missing chunk number 0 for toast value 16530 in pg_toast_16525

select * from brin_page_items(get_raw_page('x_a_b_c_idx', 2), 'x_a_b_c_idx'::regclass);
ERROR: missing chunk number 0 for toast value 16547 in pg_toast_16541

Interestingly enough, running the select before the insert seems to work - not sure why. Anyway, it has behaved like this since 9.5 :-(

regards

--
Tomas Vondra
http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Improving connection scalability: GetSnapshotData()
Hi Ian, Andrew, All,

On 2020-09-30 15:43:17 -0700, Andres Freund wrote:
> Attached is an updated version of the test (better utility function,
> stricter regexes, bailing out instead of failing just the current test
> when psql times out). I'm leaving it in this test for now, but it's fairly
> easy to use this way, in my opinion, so it may be worth moving to
> PostgresNode at some point.

I pushed this yesterday. Ian, thanks again for finding this and helping with fixing & testing.

Unfortunately some buildfarm animals currently don't like the test, for reasons I don't quite understand. It looks like it's all windows + msys animals that run the tap tests. On jacana and fairywren the new test fails, but with a somewhat confusing problem:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jacana=2020-10-01%2015%3A32%3A34
Bail out! aborting wait: program timed out
# stream contents: >>data
# (0 rows)
# <<
# pattern searched for: (?m-xis:^\\(0 rows\\)$)

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren=2020-10-01%2014%3A12%3A13
Bail out! aborting wait: program timed out
stream contents: >>data
(0 rows)
<<
pattern searched for: (?^m:^\\(0 rows\\)$)

I don't know what the -xis indicates on jacana, or why it's not present on fairywren. Nor do I know why the pattern doesn't match in the first place; it sure looks like it should.

Andrew, do you have any insight into how mingw's regex matching differs from native windows and proper unixoid systems? I guess it's somewhere around line endings or such?

Jacana successfully deals with 013_crash_restart.pl, which uses the same mechanism as the new 021_row_visibility.pl - I think the only real difference is that I used ^ and $ in the regexes in the latter...

Greetings,

Andres Freund
Re: small cleanup: unify scanstr() functions
On Thu, Oct 1, 2020 at 11:19 AM Tom Lane wrote:
> I'm unsure this topic is worth messing with. But if we do so, I'd kind
> of like to move scanstr() out of parser/scansup.c. Since it's not used
> by the core scanner, only bootstrap, it seems out of place there; and
> it's confusing because somebody coming across it for the first time
> would reasonably assume that it does have something to do with how
> we lex normal SQL strings.
>
> Of course, that just begs the question of where to put it instead.
> But since guc-file.l lives in utils/misc/, in or beside that file
> does not seem that awful.

A ringing endorsement if I ever heard one. ;-) Seems fine if it remains in guc-file.l. Cosmetic considerations are:

- The declaration in guc.h should fit in with code that's already there. I put it next to the *Parse* functions.
- There seem to be a few unused headers included into bootscanner.l. To keep "#include guc.h" from looking as strange as those, I added a comment. It might be worth removing those extra includes, but that's for another day.

> We might consider renaming it to something a shade less generic, too.
> I have no immediate suggestions on that score.

I've named it DeescapeQuotedString to see how that looks.

> In short: maybe instead of what you have here, leave GUC_scanstr()
> alone except for a possible rename; make bootscanner.l use that;
> and drop the now-unused scanstr().

v2 done that way.

--
John Naylor
https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

v2-0001-Remove-duplicate-scanstr-function.patch
Description: Binary data
Improve choose_custom_plan for initial partition prune case
Given the plan example:

CREATE TABLE measurement (
    city_id   int not null,
    logdate   date not null,
    peaktemp  int,
    unitsales int
) PARTITION BY RANGE (logdate);

CREATE TABLE measurement_y2006m02 PARTITION OF measurement FOR VALUES FROM ('2006-02-01') TO ('2006-03-01');
CREATE TABLE measurement_y2006m03 PARTITION OF measurement FOR VALUES FROM ('2006-03-01') TO ('2006-04-01');

prepare s as select * from measurement where logdate = $1;
execute s('2006-02-01');

The generic plan will probably not be chosen because its estimated cost does not account for the cost that can be eliminated at initial_prune time, while the custom plan eliminates that cost at planning time, which makes the cost comparison unfair.

I'm wondering whether we can get an estimated cost reduction from initial_prune for the generic plan, based on the partitions pruned at plan time in the custom plan, and then subtract that cost from the generic plan. I have just gone through the related code but haven't written anything yet. I'd like to know whether this is a correct direction to go.

--
Best Regards
Andy Fan
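To make the idea concrete, here is a hypothetical sketch, loosely modeled on the shape of choose_custom_plan()'s comparison (the function name, parameters, and numbers are mine, purely for illustration), of crediting the generic plan with the cost expected to disappear at initial pruning:

```python
def prefer_custom_plan(generic_cost, avg_custom_cost, avg_planning_cost,
                       est_initial_prune_saving=0.0):
    # Today's comparison effectively ignores that part of generic_cost
    # disappears at executor startup via initial pruning. Subtracting an
    # estimate of that saving (derived, as proposed above, from the
    # partitions the custom plan pruned at plan time) makes the
    # comparison fairer.
    adjusted_generic_cost = generic_cost - est_initial_prune_saving
    return avg_custom_cost + avg_planning_cost < adjusted_generic_cost

# Without the credit the custom plan wins; with a sizable estimated
# pruning saving, the generic plan becomes competitive.
assert prefer_custom_plan(100.0, 60.0, 10.0) is True
assert prefer_custom_plan(100.0, 60.0, 10.0,
                          est_initial_prune_saving=40.0) is False
```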
Re: [PATCH] Automatic HASH and LIST partition creation
On 30.09.2020 22:58, Rahila Syed wrote: Hi Anastasia, I tested the syntax with some basic commands and it works fine, regression tests also pass. Thank you for your review. Couple of comments: 1. The syntax used omits the { IMMEDIATE | DEFERRED} keywords suggested in the earlier discussions. I think it is intuitive to include IMMEDIATE with the current implementation so that the syntax can be extended with a DEFERRED clause in future for dynamic partitions. CREATE TABLE tbl_lst (i int) PARTITION BY LIST (i) CONFIGURATION (values in (1, 2), (3, 4) DEFAULT PARTITION tbl_default); After some consideration, I decided that we don't actually need to introduce IMMEDIATE | DEFERRED keyword. For hash and list partitions it will always be immediate, as the number of partitions cannot change after we initially set it. For range partitions, on the contrary, it doesn't make much sense to make partitions immediately, because in many use-cases one bound will be open. 2. One suggestion for generation of partition names is to append a unique id to avoid conflicts. Can you please give an example of such a conflict? I agree that current naming scheme is far from perfect, but I think that 'tablename'_partnum provides unique name for each partition. 3. Probably, here you mean to write list and hash instead of range and list as per the current state. Range and list partitioning also support automatic creation of partitions with an optional CONFIGURATION clause. 4. Typo in default_part_name +VALUES IN ( partition_bound_expr [, ...] ), [( partition_bound_expr [, ...] )] [, ...] [DEFAULT PARTITION defailt_part_name] +MODULUS numeric_literal Yes, you're right. I will fix these typos in next version of the patch. Thank you, Rahila Syed -- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: Retry Cached Remote Connections for postgres_fdw in case remote backend gets killed/goes away
On Thu, Oct 1, 2020 at 8:10 PM Fujii Masao wrote:
>
> pg_stat_clear_snapshot() can be used to reset the entry.
>

Thanks. I didn't know about it.

>
> + EXIT WHEN proccnt = 0;
> +END LOOP;
>
> Isn't it better to sleep here, to avoid the busy loop?
>

+1.

>
> So what I thought was something like
>
> CREATE OR REPLACE PROCEDURE wait_for_backend_termination()
> LANGUAGE plpgsql
> AS $$
> BEGIN
>     LOOP
>         PERFORM * FROM pg_stat_activity WHERE application_name = 'fdw_retry_check';
>         EXIT WHEN NOT FOUND;
>         PERFORM pg_sleep(1), pg_stat_clear_snapshot();
>     END LOOP;
> END;
> $$;
>

Changed. Attaching the v8 patch, please review it. Both make check and make check-world pass on v8.

I have another question not related to this patch: though we have a wait_pid() function, we are not able to use it like pg_terminate_backend() in other modules. Wouldn't it be nice if we could make it generic under the name pg_wait_pid() and usable across all pg modules?

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

v8-Retry-Cached-Remote-Connections-For-postgres_fdw.patch
Description: Binary data
Should walsender check correctness of WAL records?
Hi hackers,

While investigating one of our customer's support cases I found out that walsender does not calculate the CRC of WAL records and sends them to replicas without any checks. As a result, a damaged WAL record causes errors on all replicas:

LOG: incorrect resource manager data checksum in record at 5FB9/D199F7D8
FATAL: terminating walreceiver process due to administrator command

I wonder if it would be better to detect this problem earlier, at the master? We could try to recover the damaged WAL record (it is not always possible, but...), or at least not advance replication slots, making it possible for the DBA to restore the corrupted WAL segment from archive and resume replication. Right now the only choice is to restore the replicas using basebackup, which may take a significant amount of time (for a larger database). And during this time the master will not be protected from failures.

Or is the extra overhead of computing the CRC in the WAL sender assumed to be too high?

Sorry if this question was already discussed - I failed to find it in the archive.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
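The kind of check being proposed can be sketched generically. Note the assumptions: PostgreSQL actually uses CRC-32C over the WAL record, and the real record layout is more involved; zlib's plain CRC-32 and a toy 4-byte-prefix format stand in here, so this only illustrates the verify-before-send idea:

```python
import zlib

def make_record(payload: bytes) -> bytes:
    # Toy record format: a 4-byte checksum of the payload, then the payload.
    crc = zlib.crc32(payload)
    return crc.to_bytes(4, "little") + payload

def record_is_valid(record: bytes) -> bool:
    # What a sender-side check would do before shipping the record:
    # recompute the checksum and compare it with the stored one.
    stored = int.from_bytes(record[:4], "little")
    return stored == zlib.crc32(record[4:])

rec = make_record(b"wal record data")
assert record_is_valid(rec)

# Flip one payload byte to simulate on-disk corruption: the receiver-side
# error shown above would instead be caught before sending.
damaged = rec[:-1] + bytes([rec[-1] ^ 0xFF])
assert not record_is_valid(damaged)
```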
Re: small cleanup: unify scanstr() functions
John Naylor writes: > In the attached, the GUC_scanstr() function body is moved to scansup.c > to replace scanstr(), but with a different order of if-statements to > make the diff smaller. Since we have control over what goes in the BKI > file, we can use single-quoted escape strings there, allowing removal > of special case code for double quotes. I'm unsure this topic is worth messing with. But if we do so, I'd kind of like to move scanstr() out of parser/scansup.c. Since it's not used by the core scanner, only bootstrap, it seems out of place there; and it's confusing because somebody coming across it for the first time would reasonably assume that it does have something to do with how we lex normal SQL strings. Of course, that just begs the question of where to put it instead. But since guc-file.l lives in utils/misc/, in or beside that file does not seem that awful. We might consider renaming it to something a shade less generic, too. I have no immediate suggestions on that score. In short: maybe instead of what you have here, leave GUC_scanstr() alone except for a possible rename; make bootscanner.l use that; and drop the now-unused scanstr(). regards, tom lane
Re: Retry Cached Remote Connections for postgres_fdw in case remote backend gets killed/goes away
On 2020/10/01 21:14, Bharath Rupireddy wrote:
> On Wed, Sep 30, 2020 at 11:32 PM Fujii Masao wrote:
>>> And another way, if we don't want to use wait_pid() is to have a
>>> plpgsql stored procedure, that in a loop keeps on checking for the
>>> backend pid from pg_stat_activity, exits when the pid is 0, and then
>>> proceeds to issue the next foreign table query. Thoughts?
>>
>> +1 for this approach! We can use plpgsql or DO command.
>
> Used plpgsql procedure as we have to use the procedure 2 times.
>
>>> mypid = -1;
>>> while (mypid != 0)
>>> SELECT pid INTO mypid FROM pg_stat_activity WHERE backend_type =
>>> 'client backend' AND application_name = 'fdw_retry_check';
>>
>> Or we can just wait for the number of processes with
>> appname='fdw_retry_check' to be zero rather than checking the pid.
>
> Done. Attaching v7 patch with above changes. Please review it.

Thanks for updating the patch!

+-- committed the txn. The entry of the terminated backend from pg_stat_activity
+-- would be removed only after the txn commit.

pg_stat_clear_snapshot() can be used to reset the entry.

+ EXIT WHEN proccnt = 0;
+END LOOP;

Isn't it better to sleep here, to avoid the busy loop?

So what I thought was something like

CREATE OR REPLACE PROCEDURE wait_for_backend_termination()
LANGUAGE plpgsql
AS $$
BEGIN
    LOOP
        PERFORM * FROM pg_stat_activity WHERE application_name = 'fdw_retry_check';
        EXIT WHEN NOT FOUND;
        PERFORM pg_sleep(1), pg_stat_clear_snapshot();
    END LOOP;
END;
$$;

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
Re: Logical replication CPU-bound with TRUNCATE/DROP/CREATE many tables
On Thu, Oct 1, 2020 at 4:44 PM Amit Kapila wrote: > > On Thu, Oct 1, 2020 at 10:12 AM Dilip Kumar wrote: > > > > On Wed, Sep 30, 2020 at 3:00 PM Amit Kapila wrote: > > > > > > On Wed, Sep 30, 2020 at 12:16 PM Dilip Kumar > > > wrote: > > > > > And you might need to update the below comment as well: > > > typedef struct ReorderBufferTXN > > > { > > > .. > > > /* > > > * List of ReorderBufferChange structs, including new Snapshots and new > > > * CommandIds > > > */ > > > dlist_head changes; > > > > > You forgot to address the above comment. Fixed > Few other comments: > == > 1. > void > ReorderBufferAddInvalidations(ReorderBuffer *rb, TransactionId xid, > @@ -2813,12 +2830,27 @@ ReorderBufferAddInvalidations(ReorderBuffer > *rb, TransactionId xid, > SharedInvalidationMessage *msgs) > { > ReorderBufferTXN *txn; > + MemoryContext oldcontext; > + ReorderBufferChange *change; > > txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true); > > + oldcontext = MemoryContextSwitchTo(rb->context); > + > + change = ReorderBufferGetChange(rb); > + change->action = REORDER_BUFFER_CHANGE_INVALIDATION; > + change->data.inval.ninvalidations = nmsgs; > + change->data.inval.invalidations = (SharedInvalidationMessage *) > + MemoryContextAlloc(rb->context, > +sizeof(SharedInvalidationMessage) * nmsgs); > + memcpy(change->data.inval.invalidations, msgs, > +sizeof(SharedInvalidationMessage) * nmsgs); > + > + ReorderBufferQueueChange(rb, xid, lsn, change, false); > + > /* > - * We collect all the invalidations under the top transaction so that we > - * can execute them all together. > + * Additionally, collect all the invalidations under the top transaction so > + * that we can execute them all together. 
See comment atop this function > */ > if (txn->toptxn) > txn = txn->toptxn; > @@ -2830,8 +2862,7 @@ ReorderBufferAddInvalidations(ReorderBuffer *rb, > TransactionId xid, > { > txn->ninvalidations = nmsgs; > txn->invalidations = (SharedInvalidationMessage *) > > Here what is the need for using MemoryContextAlloc, can't we directly > use palloc? Also, isn't it better to store the invalidation in txn > before queuing it for change because doing so can lead to the > processing of this and other changes accumulated till that point > before recording the same in txn. As such, there is no problem with it > but still, I think if any error happens while processing those changes > we would be able to clear the cache w.r.t this particular > invalidation. Fixed > 2. > XXX Do we need to care about relcacheInitFileInval and > + * the other fields added to ReorderBufferChange, or just > + * about the message itself? > > I don't think we need this comment in the patch. > > 3. > - * This needs to be called for each XLOG_XACT_INVALIDATIONS message and > - * accumulates all the invalidation messages in the toplevel transaction. > - * This is required because in some cases where we skip processing the > - * transaction (see ReorderBufferForget), we need to execute all the > - * invalidations together. > + * This needs to be called for each XLOG_XACT_INVALIDATIONS message. The > + * invalidation messages will be added in the reorder buffer as a change as > + * well as all the invalidations will be accumulated under the toplevel > + * transaction. We collect them as a change so that during decoding, we can > + * execute only those invalidations which are specific to the command instead > + * of executing all the invalidations, OTOH after decoding is complete or on > + * transaction abort (see ReorderBufferForget) we need to execute all the > + * invalidations to avoid any cache pollution so it is better to keep them > + * together > > Can we rewrite the comment as below? 
> This needs to be called for each XLOG_XACT_INVALIDATIONS message and > accumulates all the invalidation messages in the toplevel transaction > as well as in the form of change in reorder buffer. We require to > record it in form of change so that we can execute only required > invalidations instead of executing all the invalidations on each > CommandId increment. We also need to accumulate these in top-level txn > because in some cases where we skip processing the transaction (see > ReorderBufferForget), we need to execute all the invalidations > together. Done > 4. > +void ReorderBufferAddInvalidation(ReorderBuffer *, TransactionId, > XLogRecPtr lsn, > + int nmsgs, SharedInvalidationMessage *msgs); > As pointed by Keisuke-San as well, this is not required. Fixed > 5. Can you please once repeat the performance test done by Keisuke-San > to see if you have similar observations? Additionally, see if you are > also seeing the inconsistency related to the Truncate message reported > by him and if so why? > Okay, I will set up and do this test early next week. Keisuke-San, can you send me your complete test script? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: Manager for commit fest 2020-09
Michael Paquier writes: > And one month later, here we are with the following results and a CF > now closed: > Committed: 48 > Moved to next CF: 147 > Withdrawn: 4 > Rejected: 3 > Returned with Feedback: 33 > Total: 235 Once again, thanks for doing all that tedious work! regards, tom lane
Re: Transactions involving multiple postgres foreign servers, take 2
On Wed, 30 Sep 2020 at 16:02, tsunakawa.ta...@fujitsu.com wrote: > > From: Masahiko Sawada > > To avoid misunderstanding, I didn't mean to disregard the performance. > > I mean especially for the transaction management feature it's > > essential to work fine even in failure cases. So I hope we have a > > safe, robust, and probably simple design for the first version that > > might be low performance yet though but have a potential for > > performance improvement and we will be able to try to improve > > performance later. > > Yes, correctness (safety?) is a basic premise. I understand that given the > time left for PG 14, we haven't yet given up a sound design that offers > practical or normally expected performance. I don't think the design has not > well thought yet to see if it's simple or complex. At least, I don't believe > doing "send commit request, perform commit on a remote server, and wait for > reply" sequence one transaction at a time in turn is what this community (and > other DBMSs) tolerate. A kid's tricycle is safe, but it's not safe to ride a > tricycle on the road. Let's not rush to commit and do our best! Okay. I'd like to resolve my concern that I repeatedly mentioned and we don't find a good solution yet. That is, how we handle errors raised by FDW transaction callbacks during committing/rolling back prepared foreign transactions. Actually, this has already been discussed before[1] and we concluded at that time that using a background worker to commit/rolling back foreign prepared transactions is the best way. Anyway, let me summarize the discussion on this issue so far. With your idea, after the local commit, the backend process directly call transaction FDW API to commit the foreign prepared transactions. However, it's likely to happen an error (i.g. ereport(ERROR)) during that due to various reasons. It could be an OOM by memory allocation, connection error whatever. 
In case an error happens during committing prepared foreign transactions, the user will get the error but it's too late. The local transaction and possibly other foreign prepared transactions have already been committed. You proposed the first idea to avoid such a situation: the FDW implementor can write the code while trying to reduce the possibility of errors happening as much as possible, for example by using palloc_extended(MCXT_ALLOC_NO_OOM) and hash_search(HASH_ENTER_NULL), but I think it's not a comprehensive solution. They might miss it, not know about it, or use other functions provided by the core that could lead to an error.

Another idea is to use PG_TRY() and PG_CATCH(). IIUC with this idea, the FDW implementor catches an error but ignores it rather than rethrowing it with PG_RE_THROW() in order to return control to the core after an error. I'm really not sure that's a correct usage of those macros. In addition, after returning to the core, it will retry resolving the same or other foreign transactions. That is, after ignoring an error, the core needs to continue working and possibly call the transaction callbacks of other FDW implementations.

Regards,

[1] https://www.postgresql.org/message-id/CA%2BTgmoY%3DVkHrzXD%3Djw5DA%2BPp-ePW_6_v5n%2BTJk40s5Q9VXY-Pw%40mail.gmail.com

--
Masahiko Sawada
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
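The failure mode described above can be sketched with a toy model (this is not the FDW API: `commit_fn` merely stands in for an FDW's commit-prepared callback, and the names are mine):

```python
def resolve_prepared(servers, commit_fn):
    # Naive in-backend resolution: commit each server's prepared
    # transaction in turn. If one callback raises, the servers already
    # processed are committed -- the error reaches the user "too late".
    # Swallowing the error and pressing on (the PG_TRY/PG_CATCH idea)
    # leaves the failed ones unresolved, to be retried later, e.g. by a
    # background resolver process as concluded in [1].
    unresolved = []
    for server in servers:
        try:
            commit_fn(server)
        except Exception:
            unresolved.append(server)
    return unresolved

committed = []

def commit_fn(server):
    if server == "srv2":          # simulate a connection error mid-resolution
        raise ConnectionError("could not reach srv2")
    committed.append(server)

# srv1 and srv3 commit despite srv2's failure; srv2 stays prepared.
assert resolve_prepared(["srv1", "srv2", "srv3"], commit_fn) == ["srv2"]
assert committed == ["srv1", "srv3"]
```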
Re: Why does PostgresNode.pm set such a low value of max_wal_senders?
Michael Paquier writes: > On Wed, Sep 30, 2020 at 10:38:59PM -0700, Noah Misch wrote: >> In favor of minimal values, we've had semaphore-starved buildfarm members in >> the past. Perhaps those days are over, seeing that this commit has not yet >> broken a buildfarm member in that particular way. Keeping max_wal_senders=10 >> seems fine. > Indeed, I am not spotting anything suspicious here. Yeah, so far so good. Note that PostgresNode.pm does attempt to cater for semaphore-starved machines, by cutting max_connections as much as it can. In practice the total semaphore usage of a subscription test is probably still less than that of one postmaster with default max_connections. >> No, PostgreSQL commit 54c2ecb changed that. I recommend an explicit >> max_wal_senders=10 in PostgresNode, which makes it easy to test >> wal_level=minimal: > Ah, thanks, I have missed this piece. So we really need to have a > value set in this module after all. Agreed, I'll go put it back. On the other point, I think that we should continue to complain about max_wal_senders > 0 with wal_level = minimal. If we reduce that to a LOG message, which'd be the net effect of trying to be laxer, people wouldn't see it and would then wonder why they can't start replication. regards, tom lane
Re: enable_incremental_sort changes query behavior
On Thu, Oct 1, 2020 at 3:09 AM Jaime Casanova wrote: > > On Wed, 30 Sep 2020 at 21:21, James Coleman wrote: > > > > On Sat, Sep 26, 2020 at 2:49 PM Jaime Casanova > > wrote: > > > > > > Hi, > > > > > > With sqlsmith I found a query that gives this error: > > > ERROR: ORDER/GROUP BY expression not found in targetlist > > > > [...] > > > > > > But if I set enable_incremental_sort to off the query gets executed > > > without problems (attached the explain produced for that case) > > > > Thanks for the report. > > > > Hi, > > by experiment I reduced the query to this > > --- 0 --- > select distinct > subq_0.c1 as c0, > case when (true = pg_catalog.pg_rotate_logfile_old()) then > ref_0.t else ref_0.t > end > as c4 > from > public.ref_0, > lateral (select > > ref_0.i as c1 > from > generate_series(1, 100) as ref_1) as subq_0 > --- 0 --- > > the only custom table already needed can be created with this commands: > > --- 0 --- > create table ref_0 as select repeat('abcde', (random() * 10)::integer) > t, random() * 1000 i from generate_series(1, 50); > create index on ref_0 (i); > analyze ref_0 ; > --- 0 --- > > > > Is there by an chance an index on ref_0.radi_text_temp? > > > > there is an index involved but not on that field, commands above > create the index in the right column... after that, ANALYZE the table > > > And if you set enable_hashagg = off what plan do you get (or error)? > > > > same error I was able to reproduce the error without incremental sort enabled (i.e., it happens with a full sort also). The function call in the SELECT doesn't have to be in a case expression; for example I was able to reproduce changing that to `random()::text || ref_0.t`. It looks like the issue happens when: 1. The sort happens within a parallel node. 2. One of the sort keys is an expression containing a volatile function call and a column from the lateral join. 
Here are the settings I used with your above repro case to show it with regular sort: enable_hashagg=off enable_incremental_sort=off enable_seqscan=off parallel_setup_cost=10 parallel_tuple_cost=0 The plan (obtained by replacing the volatile function with a stable one): Unique -> Nested Loop -> Gather Merge Workers Planned: 2 -> Sort Sort Key: ref_0.i, (md5(ref_0.t)) -> Parallel Index Scan using ref_0_i_idx on ref_0 -> Function Scan on generate_series ref_1 Changing `md5(ref_0.t)` to `random()::text || ref_0.t` causes the error. I haven't been able to dig further than that yet, but my intuition is to poke around in the parallel query machinery? James
RE: [Patch] Optimize dropping of relation buffers using dlist
On Thursday, October 1, 2020 4:52 PM, Tsunakawa-san wrote:
> From: Kyotaro Horiguchi
>> I thought that the advantage of this optimization is that we don't
>> need to visit all buffers? If we need to run a full scan for any
>> reason, there's no point in looking up already-visited buffers again.
>> That's just wasteful cycles. Am I missing something?
>>
>> I don't understand. If we chose the optimized dropping, the reason
>> is the number of buffer lookups is fewer than a certain threshold. Why
>> do you think that the fork kind a buffer belongs to is relevant to the
>> criteria?
>
> I rethought about this, and you certainly have a point, but... OK, I think I
> understood. I should have thought in a complicated way. In other words,
> you're suggesting "Let's simply treat all forks as one relation to determine
> whether to optimize," right? That is, the code simply becomes:
>
> Sum up the number of buffers to invalidate in all forks;
> if (the cached sizes of all forks are valid && # of buffers to invalidate < THRESHOLD)
> {
>     do the optimized way;
>     return;
> }
> do the traditional way;
>
> This will be simple, and I'm +1.

This is actually close to the v18 I posted trying Horiguchi-san's approach, but that patch had a bug. So attached is an updated version (v20) trying this approach again. I hope it's bug-free this time.

Regards,
Kirk Jamison

v20-Optimize-DropRelFileNodeBuffers-during-recovery.patch
Description: v20-Optimize-DropRelFileNodeBuffers-during-recovery.patch

v1-Prevent-invalidating-blocks-in-smgrextend-during-recovery.patch
Description: v1-Prevent-invalidating-blocks-in-smgrextend-during-recovery.patch
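The decision described in the quoted pseudocode can be sketched as follows (an illustration of the decision shape only; the names and the callable stand-ins are mine, not the patch's actual code):

```python
def drop_relation_buffers(cached_nblocks, threshold,
                          drop_by_lookup, drop_by_full_scan):
    # cached_nblocks: per-fork cached sizes; None means the cached size
    # for that fork is invalid. Treat all forks of the relation as one
    # unit: take the optimized per-buffer lookup path only when every
    # fork's size is cached and the total is under the threshold.
    if all(n is not None for n in cached_nblocks):
        total = sum(cached_nblocks)
        if total < threshold:
            drop_by_lookup(total)   # visit just this relation's buffers
            return "optimized"
    drop_by_full_scan()             # otherwise scan the whole buffer pool once
    return "full-scan"

noop = lambda *args: None
assert drop_relation_buffers([10, 2, 1], 128, noop, noop) == "optimized"
assert drop_relation_buffers([10, None, 1], 128, noop, noop) == "full-scan"
assert drop_relation_buffers([200, 2, 1], 128, noop, noop) == "full-scan"
```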
Re: Retry Cached Remote Connections for postgres_fdw in case remote backend gets killed/goes away
On Wed, Sep 30, 2020 at 11:32 PM Fujii Masao wrote: > > > And another way, if we don't want to use wait_pid() is to have a > > plpgsql stored procedure, that in a loop keeps on checking for the > > backed pid from pg_stat_activity, exit when pid is 0. and then proceed > > to issue the next foreign table query. Thoughts? > > +1 for this approach! We can use plpgsql or DO command. > Used plpgsql procedure as we have to use the procedure 2 times. > > > > > mypid = -1; > > while (mypid != 0) > > SELECT pid INTO mypid FROM pg_stat_activity WHERE backend_type = > > 'client backend' AND application_name = 'fdw_retry_check'; > > Or we can just wait for the number of processes with > appname='fdw_retry_check' to be zero rather than checking the pid. > Done. Attaching v7 patch with above changes. Please review it. With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com From e2a85116116e79c76a7b0fcfc1a25eff57627ed4 Mon Sep 17 00:00:00 2001 From: Bharath Rupireddy Date: Thu, 1 Oct 2020 15:16:26 +0530 Subject: [PATCH v7] Retry Cached Remote Connections For postgres_fdw Remote connections are cached for postgres_fdw in the local backend. There are high chances that the remote backend may not be available i.e. it can get killed, the subsequent foreign queries from local backend/session that use cached connection will fail as the remote backend would have become unavailable. This patch proposes a retry mechanism. Local backend/session uses cached connection and if it receives connection bad error, it deletes the cached entry and retry to get a new connection. This retry happens only at the start of remote xact. 
--- contrib/postgres_fdw/connection.c | 55 +--- .../postgres_fdw/expected/postgres_fdw.out| 85 +++ contrib/postgres_fdw/sql/postgres_fdw.sql | 64 ++ 3 files changed, 192 insertions(+), 12 deletions(-) diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c index 08daf26fdf..441124963f 100644 --- a/contrib/postgres_fdw/connection.c +++ b/contrib/postgres_fdw/connection.c @@ -108,6 +108,7 @@ PGconn * GetConnection(UserMapping *user, bool will_prep_stmt) { bool found; + volatile bool retry_conn = false; ConnCacheEntry *entry; ConnCacheKey key; @@ -159,23 +160,25 @@ GetConnection(UserMapping *user, bool will_prep_stmt) /* Reject further use of connections which failed abort cleanup. */ pgfdw_reject_incomplete_xact_state_change(entry); +retry: /* * If the connection needs to be remade due to invalidation, disconnect as - * soon as we're out of all transactions. + * soon as we're out of all transactions. Also, if previous attempt to + * start new remote transaction failed on the cached connection, disconnect + * it to retry a new connection. */ - if (entry->conn != NULL && entry->invalidated && entry->xact_depth == 0) + if ((entry->conn != NULL && entry->invalidated && + entry->xact_depth == 0) || retry_conn) { - elog(DEBUG3, "closing connection %p for option changes to take effect", - entry->conn); + if (retry_conn) + elog(DEBUG3, "closing connection %p to reestablish a new one", + entry->conn); + else + elog(DEBUG3, "closing connection %p for option changes to take effect", + entry->conn); disconnect_pg_server(entry); } - /* - * We don't check the health of cached connection here, because it would - * require some overhead. Broken connection will be detected when the - * connection is actually used. - */ - /* * If cache entry doesn't have a connection, we have to establish a new * connection. 
(If connect_pg_server throws an error, the cache entry @@ -206,9 +209,37 @@ GetConnection(UserMapping *user, bool will_prep_stmt) } /* - * Start a new transaction or subtransaction if needed. + * We check the health of the cached connection here, while the remote xact + * is going to begin. Broken connection will be detected and the + * connection is retried. We do this only once during each begin remote + * xact call, if the connection is still broken after the retry attempt, + * then the query will fail. */ - begin_remote_xact(entry); + PG_TRY(); + { + /* Start a new transaction or subtransaction if needed. */ + begin_remote_xact(entry); + retry_conn = false; + } + PG_CATCH(); + { + if (!entry->conn || + PQstatus(entry->conn) != CONNECTION_BAD || + entry->xact_depth > 0 || + retry_conn) + PG_RE_THROW(); + retry_conn = true; + } + PG_END_TRY(); + + if (retry_conn) + { + ereport(DEBUG3, +(errmsg("could not start remote transaction on connection %p", + entry->conn)), + errdetail_internal("%s", pchomp(PQerrorMessage(entry->conn; + goto retry; + } /* Remember if caller will prepare statements */ entry->have_prep_stmt |= will_prep_stmt; diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out index 10e23d02ed..09d0578d05 100644 ---
small cleanup: unify scanstr() functions
Hi, Looking at the guc file code, GUC_scanstr() is almost the same as the exported function scanstr(), except the latter requires finicky extra coding around double quotes both in its callers and while creating the input. In the attached, the GUC_scanstr() function body is moved to scansup.c to replace scanstr(), but with a different order of if-statements to make the diff smaller. Since we have control over what goes in the BKI file, we can use single-quoted escape strings there, allowing removal of special-case code for double quotes. -- John Naylor  https://www.2ndQuadrant.com/  PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services From 93fce2a68674e1f3ec8999b15c8526535e12fffa Mon Sep 17 00:00:00 2001 From: John Naylor Date: Mon, 14 Sep 2020 11:29:43 -0400 Subject: [PATCH] Unify scanstr() implementations Use single-quoted escaped strings in the BKI file, the same as those used for GUC parameters. Rewrite scanstr() to match GUC_scanstr() and remove the latter. Get rid of special case coding around double quotes both in the callers of scanstr() and while formatting the input. --- doc/src/sgml/bki.sgml | 6 +- src/backend/bootstrap/bootscanner.l | 9 +-- src/backend/catalog/genbki.pl | 6 +- src/backend/parser/scansup.c| 34 +- src/backend/utils/misc/guc-file.l | 97 + src/bin/initdb/initdb.c | 34 +++--- 6 files changed, 35 insertions(+), 151 deletions(-) diff --git a/doc/src/sgml/bki.sgml b/doc/src/sgml/bki.sgml index 4e696d1d3e..b7f0bccf35 100644 --- a/doc/src/sgml/bki.sgml +++ b/doc/src/sgml/bki.sgml @@ -752,7 +752,7 @@ $ perl rewrite_dat_with_prokind.pl pg_proc.dat next token that syntactically cannot belong to the preceding command starts a new one. (Usually you would put a new command on a new line, for clarity.) Tokens can be certain key words, special - characters (parentheses, commas, etc.), numbers, or double-quoted + characters (parentheses, commas, etc.), numbers, or single-quoted strings. Everything is case sensitive. 
@@ -876,7 +876,7 @@ $ perl rewrite_dat_with_prokind.pl pg_proc.dat NULL values can be specified using the special key word _null_. Values that do not look like - identifiers or digit strings must be double quoted. + identifiers or digit strings must be single-quoted. @@ -1046,7 +1046,7 @@ $ perl rewrite_dat_with_prokind.pl pg_proc.dat create test_table 420 (oid = oid, cola = int4, colb = text) open test_table -insert ( 421 1 "value1" ) +insert ( 421 1 'value 1' ) insert ( 422 2 _null_ ) close test_table diff --git a/src/backend/bootstrap/bootscanner.l b/src/backend/bootstrap/bootscanner.l index 1048e70d05..452dc067d0 100644 --- a/src/backend/bootstrap/bootscanner.l +++ b/src/backend/bootstrap/bootscanner.l @@ -66,7 +66,7 @@ static int yyline = 1; /* line number for error reporting */ id [-A-Za-z0-9_]+ -sid \"([^\"])*\" +sid \'([^']|\'\')*\' /* * Keyword tokens return the keyword text (as a constant string) in yylval.kw, @@ -120,14 +120,11 @@ NOT{ yylval.kw = "NOT"; return XNOT; } NULL { yylval.kw = "NULL"; return XNULL; } {id} { - yylval.str = scanstr(yytext); + yylval.str = pstrdup(yytext); return ID; } {sid} { - /* leading and trailing quotes are not passed to scanstr */ - yytext[strlen(yytext) - 1] = '\0'; - yylval.str = scanstr(yytext+1); - yytext[strlen(yytext)] = '"'; /* restore yytext */ + yylval.str = scanstr(yytext); return ID; } diff --git a/src/backend/catalog/genbki.pl b/src/backend/catalog/genbki.pl index dc5f442397..ef3105af44 100644 --- a/src/backend/catalog/genbki.pl +++ b/src/backend/catalog/genbki.pl @@ -845,17 +845,15 @@ sub print_bki_insert # since that represents a NUL char in C code. $bki_value = '' if $bki_value eq '\0'; - # Handle single quotes by doubling them, and double quotes by - # converting them to octal escapes, because that's what the + # Handle single quotes by doubling them, because that's what the # bootstrap scanner requires. 
We do not process backslashes # specially; this allows escape-string-style backslash escapes # to be used in catalog data. $bki_value =~ s/'/''/g; - $bki_value =~ s/"/\\042/g; # Quote value if needed. We need not quote values that satisfy # the "id" pattern in bootscanner.l, currently "[-A-Za-z0-9_]+". - $bki_value = sprintf(qq'"%s"', $bki_value) + $bki_value = sprintf("'%s'", $bki_value) if length($bki_value) == 0 or $bki_value =~ /[^-A-Za-z0-9_]/; diff --git a/src/backend/parser/scansup.c b/src/backend/parser/scansup.c index cac70d5df7..f35ee91d64 100644 --- a/src/backend/parser/scansup.c +++ b/src/backend/parser/scansup.c @@ -23,8 +23,8 @@ /* * scanstr * - * if the string passed in has escaped codes, map the escape codes to actual - * chars + * Strip the
Re: Error code missing for "wrong length of inner sequence" error
> On 1 Oct 2020, at 12:54, Heikki Linnakangas wrote: > Most checks when converting between SQL and Python types use the PLy_elog() > function, which uses the generic ERRCODE_EXTERNAL_ROUTINE_EXCEPTION error > code, but I think ERRCODE_ARRAY_SUBSCRIPT_ERROR is better. On that note, wouldn't the dimension check errors in PLySequence_ToArray be just as well off using normal ereport()'s? Only one of them seems to error out in a way that could propagate an error from Python to postgres. > Thoughts? If not, I'm going to add errcode(ERRCODE_ARRAY_SUBSCRIPT_ERROR) to > that. +1 on using ERRCODE_ARRAY_SUBSCRIPT_ERROR as it better conveys meaning. cheers ./daniel
Re: Logical replication CPU-bound with TRUNCATE/DROP/CREATE many tables
On Thu, Oct 1, 2020 at 10:12 AM Dilip Kumar wrote: > > On Wed, Sep 30, 2020 at 3:00 PM Amit Kapila wrote: > > > > On Wed, Sep 30, 2020 at 12:16 PM Dilip Kumar wrote: > > > And you might need to update the below comment as well: > > typedef struct ReorderBufferTXN > > { > > .. > > /* > > * List of ReorderBufferChange structs, including new Snapshots and new > > * CommandIds > > */ > > dlist_head changes; > > You forgot to address the above comment. Few other comments: == 1. void ReorderBufferAddInvalidations(ReorderBuffer *rb, TransactionId xid, @@ -2813,12 +2830,27 @@ ReorderBufferAddInvalidations(ReorderBuffer *rb, TransactionId xid, SharedInvalidationMessage *msgs) { ReorderBufferTXN *txn; + MemoryContext oldcontext; + ReorderBufferChange *change; txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true); + oldcontext = MemoryContextSwitchTo(rb->context); + + change = ReorderBufferGetChange(rb); + change->action = REORDER_BUFFER_CHANGE_INVALIDATION; + change->data.inval.ninvalidations = nmsgs; + change->data.inval.invalidations = (SharedInvalidationMessage *) + MemoryContextAlloc(rb->context, +sizeof(SharedInvalidationMessage) * nmsgs); + memcpy(change->data.inval.invalidations, msgs, +sizeof(SharedInvalidationMessage) * nmsgs); + + ReorderBufferQueueChange(rb, xid, lsn, change, false); + /* - * We collect all the invalidations under the top transaction so that we - * can execute them all together. + * Additionally, collect all the invalidations under the top transaction so + * that we can execute them all together. See comment atop this function */ if (txn->toptxn) txn = txn->toptxn; @@ -2830,8 +2862,7 @@ ReorderBufferAddInvalidations(ReorderBuffer *rb, TransactionId xid, { txn->ninvalidations = nmsgs; txn->invalidations = (SharedInvalidationMessage *) Here what is the need for using MemoryContextAlloc, can't we directly use palloc? 
Also, isn't it better to store the invalidation in txn before queuing it for change because doing so can lead to the processing of this and other changes accumulated till that point before recording the same in txn. As such, there is no problem with it but still, I think if any error happens while processing those changes we would be able to clear the cache w.r.t this particular invalidation. 2. XXX Do we need to care about relcacheInitFileInval and + * the other fields added to ReorderBufferChange, or just + * about the message itself? I don't think we need this comment in the patch. 3. - * This needs to be called for each XLOG_XACT_INVALIDATIONS message and - * accumulates all the invalidation messages in the toplevel transaction. - * This is required because in some cases where we skip processing the - * transaction (see ReorderBufferForget), we need to execute all the - * invalidations together. + * This needs to be called for each XLOG_XACT_INVALIDATIONS message. The + * invalidation messages will be added in the reorder buffer as a change as + * well as all the invalidations will be accumulated under the toplevel + * transaction. We collect them as a change so that during decoding, we can + * execute only those invalidations which are specific to the command instead + * of executing all the invalidations, OTOH after decoding is complete or on + * transaction abort (see ReorderBufferForget) we need to execute all the + * invalidations to avoid any cache pollution so it is better to keep them + * together Can we rewrite the comment as below? This needs to be called for each XLOG_XACT_INVALIDATIONS message and accumulates all the invalidation messages in the toplevel transaction as well as in the form of change in reorder buffer. We require to record it in form of change so that we can execute only required invalidations instead of executing all the invalidations on each CommandId increment. 
We also need to accumulate these in the top-level txn because in some cases where we skip processing the transaction (see ReorderBufferForget), we need to execute all the invalidations together. 4. +void ReorderBufferAddInvalidation(ReorderBuffer *, TransactionId, XLogRecPtr lsn, + int nmsgs, SharedInvalidationMessage *msgs); As pointed out by Keisuke-san as well, this is not required. 5. Can you please repeat the performance test done by Keisuke-san to see if you have similar observations? Additionally, see if you are also seeing the inconsistency related to the Truncate message reported by him and if so why? -- With Regards, Amit Kapila.
Error code missing for "wrong length of inner sequence" error
In PLySequence_ToArray_recurse(), there's this check: if (PySequence_Length(list) != dims[dim]) ereport(ERROR, (errmsg("wrong length of inner sequence: has length %d, but %d was expected", (int) PySequence_Length(list), dims[dim]), (errdetail("To construct a multidimensional array, the inner sequences must all have the same length."; It's missing an error code, making it implicitly ERRCODE_INTERNAL_ERROR, which is surely not right. That's simple to fix, but what error code should we use? I'm inclined to use ERRCODE_ARRAY_SUBSCRIPT_ERROR, because that's used in the similar error message with an SQL ARRAY expression, like "ARRAY[[1], [2, 3]]". Most checks when converting between SQL and Python types use the PLy_elog() function, which uses the generic ERRCODE_EXTERNAL_ROUTINE_EXCEPTION error code, but I think ERRCODE_ARRAY_SUBSCRIPT_ERROR is better. You get this error e.g. if you try to return a Python list of lists from a PL/Python function as a multi-dimensional array, but the sublists have different lengths: create function return_array() returns int[] as $$ return [[1], [1,2]] $$ language plpythonu; postgres=# select return_array(); ERROR: wrong length of inner sequence: has length 2, but 1 was expected DETAIL: To construct a multidimensional array, the inner sequences must all have the same length. CONTEXT: while creating return value PL/Python function "return_array" Thoughts? If not, I'm going to add errcode(ERRCODE_ARRAY_SUBSCRIPT_ERROR) to that. - Heikki
Re: Logical replication CPU-bound with TRUNCATE/DROP/CREATE many tables
On Thu, Oct 1, 2020 at 2:43 PM Keisuke Kuroda wrote: > > > > I test the patch on the master HEAD(9796f455) and it works fine. > > > * make installcheck-world: passed > > > * walsender process continues to use 100% of the CPU 1core when > > > TRUNCATE 1000 partition: 1s or less > > > ** before patch : 230s > > > > Does this result indicate that it is still CPU bound but it does the > > actual decoding and completes in 1s instead of spending 230s mainly to > > execute unnecessary invalidations? > > Okay, to check the decoding and invalidation times, > I measured the time for the results to reach pg_recvlogical, and profiled with perf. > > Before the patch, most of the processing was done > by hash_seq_search in RelfilenodeMapInvalidateCallback. > After the patch, this problem has been resolved. > > # test > > (1) start pg_recvlogical > (2) INSERT to 2000 partition for creating RelfilenodemapHash > (3) TRUNCATE 1000 partition > > (1) > pg_recvlogical --create-slot --start -f - --if-not-exists \ > --plugin=test_decoding --slot=test1 --dbname=postgres --username=postgres \ > --option=include-timestamp | gawk '{ print strftime("%Y-%m-%d > %H:%M:%S"), $0; fflush(); }' > > (2) > DO $$ > DECLARE > i INT; > j INT; > schema TEXT; > tablename TEXT; > BEGIN > FOR i IN 1 .. 2 LOOP > schema := 'nsp_' || to_char(i, 'FM000'); > tablename := 'tbl_' || to_char(i, 'FM000'); > EXECUTE 'INSERT INTO ' || schema || '.' || tablename || ' > SELECT i,i,i,0,0,0,0,0,0,0,0,0,''a''::bytea,''a''::bytea FROM > generate_series(0,,1) AS i'; > END LOOP; > END; > $$ LANGUAGE plpgsql; > > (3) > TRUNCATE TABLE nsp_001.tbl_001; > > # before > > (3) TRUNCATE > decode + invalidation = 222s > > 2020-10-01 16:58:29 BEGIN 529 > 2020-10-01 17:02:11 COMMIT 529 (at 2020-10-01 16:58:29.281747+09) > > (Before the patch was applied, I was concerned that > the "table nsp_001.tbl_001 ... TRUNCATE" was not being output.) > Why is it that before the patch these were not displayed and after the patch, we started to see this? 
-- With Regards, Amit Kapila.
Re: Improving connection scalability: GetSnapshotData()
On Tue, 15 Sep 2020 at 07:17, Andres Freund wrote: > > Hi, > > On 2020-09-09 17:02:58 +0900, Ian Barwick wrote: > > Attached, though bear in mind I'm not very familiar with parts of this, > > particularly 2PC stuff, so consider it educated guesswork. > > Thanks for this, and the test case! > > Your commit fixes the issues, but not quite correctly. Multixacts > shouldn't matter, so we don't need to do anything there. And for the > increases, I think they should be inside the already existing > ProcArrayLock acquisition, as in the attached. > > > I've also included a quite heavily revised version of your test. Instead > of using dblink I went for having a long-running psql that I feed over > stdin. The main reason for not liking the previous version is that it > seems fragile, with the sleep and everything. I expanded it to cover > 2PC as well. > > The test probably needs a bit of cleanup, wrapping some of the > redundancy around the pump_until calls. > > I think the approach of having a long-running psql session is really > useful, and probably would speed up some tests. Does anybody have a good > idea for how best, without undue effort, to integrate this into > PostgresNode.pm? I don't really have a great idea, so I think I'd leave > it with a local helper in the new test? 2ndQ has some infra for that and various other TAP enhancements that I'd like to try to upstream. I'll ask what I can share and how. -- Craig Ringer http://www.2ndQuadrant.com/ 2ndQuadrant - PostgreSQL Solutions for the Enterprise
Re: Logical replication CPU-bound with TRUNCATE/DROP/CREATE many tables
> > I test the patch on the master HEAD(9796f455) and it works fine. > > * make installcheck-world: passed > > * walsender process continues to use 100% of the CPU 1core when > > TRUNCATE 1000 partition: 1s or less > > ** before patch : 230s > > Does this result indicate that it is still CPU bound but it does the > actual decoding and completes in 1s instead of spending 230s mainly to > execute unnecessary invalidations? Okay, to check the decoding and invalidation times, I measured the time for the results to reach pg_recvlogical, and profiled with perf. Before the patch, most of the processing was done by hash_seq_search in RelfilenodeMapInvalidateCallback. After the patch, this problem has been resolved. # test (1) start pg_recvlogical (2) INSERT to 2000 partition for creating RelfilenodemapHash (3) TRUNCATE 1000 partition (1) pg_recvlogical --create-slot --start -f - --if-not-exists \ --plugin=test_decoding --slot=test1 --dbname=postgres --username=postgres \ --option=include-timestamp | gawk '{ print strftime("%Y-%m-%d %H:%M:%S"), $0; fflush(); }' (2) DO $$ DECLARE i INT; j INT; schema TEXT; tablename TEXT; BEGIN FOR i IN 1 .. 2 LOOP schema := 'nsp_' || to_char(i, 'FM000'); tablename := 'tbl_' || to_char(i, 'FM000'); EXECUTE 'INSERT INTO ' || schema || '.' || tablename || ' SELECT i,i,i,0,0,0,0,0,0,0,0,0,''a''::bytea,''a''::bytea FROM generate_series(0,,1) AS i'; END LOOP; END; $$ LANGUAGE plpgsql; (3) TRUNCATE TABLE nsp_001.tbl_001; # before (3) TRUNCATE decode + invalidation = 222s 2020-10-01 16:58:29 BEGIN 529 2020-10-01 17:02:11 COMMIT 529 (at 2020-10-01 16:58:29.281747+09) (Before the patch was applied, I was concerned that the "table nsp_001.tbl_001 ... TRUNCATE" was not being output.) Samples: 228K of event 'cpu-clock:uhH', Event count (approx.): 5708075 Overhead Command Shared Object Symbol - 96.00% postgres postgres[.] 
hash_seq_search hash_seq_search - RelfilenodeMapInvalidateCallback - LocalExecuteInvalidationMessage ReorderBufferExecuteInvalidations ReorderBufferProcessTXN ReorderBufferCommit DecodeCommit DecodeXactOp LogicalDecodingProcessRecord XLogSendLogical WalSndLoop StartLogicalReplication exec_replication_command PostgresMain BackendRun BackendStartup ServerLoop PostmasterMain main __libc_start_main _start +0.77% postgres postgres[.] LocalExecuteInvalidationMessage 0.47% postgres postgres[.] hash_bytes_uint32 0.42% postgres postgres[.] hash_search_with_hash_value 0.37% postgres postgres[.] RelfilenodeMapInvalidateCallback 0.32% postgres postgres[.] CatCacheInvalidate 0.23% postgres postgres[.] PlanCacheRelCallback 0.23% postgres postgres[.] ReorderBufferExecuteInvalidations # after (3) TRUNCATE decode + invalidation = 1s or less 2020-10-01 17:09:43 BEGIN 537 2020-10-01 17:09:43 table nsp_001.tbl_001, nsp_001.part_0001, nsp_001.part_0002 ... nsp_001.part_0999, nsp_001.part_1000: TRUNCATE: (no-flags) 2020-10-01 17:09:43 COMMIT 537 (at 2020-10-01 17:09:43.192989+09) Samples: 337 of event 'cpu-clock:uhH', Event count (approx.): 8425 Overhead Command Shared Object Symbol - 47.48% postgres postgres[.] hash_seq_search hash_seq_search RelfilenodeMapInvalidateCallback LocalExecuteInvalidationMessage ReorderBufferExecuteInvalidations ReorderBufferProcessTXN ReorderBufferCommit DecodeCommit DecodeXactOp LogicalDecodingProcessRecord XLogSendLogical WalSndLoop StartLogicalReplication exec_replication_command PostgresMain BackendRun BackendStartup ServerLoop PostmasterMain main __libc_start_main _start +3.26% postgres postgres[.] heap_hot_search_buffer +2.67% postgres postgres[.] pg_comp_crc32c_sse42 +2.08% postgres libc-2.17.so[.] __memcpy_ssse3_back +2.08% postgres postgres[.] DecodeXLogRecord +2.08% postgres postgres[.] hash_search_with_hash_value +1.48% postgres libpthread-2.17.so [.] __libc_recv +1.19% postgres libc-2.17.so[.] __memset_sse2 +1.19% postgres libpthread-2.17.so [.] 
__errno_location +1.19% postgres postgres[.] LocalExecuteInvalidationMessage +1.19% postgres postgres[.] ReadPageInternal +1.19% postgres postgres[.] XLogReadRecord +1.19% postgres postgres[.] socket_is_send_pending hash_seq_search Best Regards, -- Keisuke Kuroda NTT Software Innovation Center keisuke.kuroda.3...@gmail.com
RE: Disable WAL logging to speed up data loading
Hi, Horiguchi-san and Fujii-san. Thank you so much, both of you. > > the table needs to be rewritten. One idea for that is to improve that > > command so that it skips the table rewrite if wal_level=minimal. > > Of course, also you can change wal_level after marking the table as > > unlogged. > > tablecmd.c: The idea is really interesting. The idea of skipping the whole table copy in ALTER TABLE UNLOGGED/LOGGED when wal_level='minimal' hadn't occurred to me. > > * There are two reasons for requiring a rewrite when changing > > * persistence: on one hand, we need to ensure that the buffers > > * belonging to each of the two relations are marked with or without > > * BM_PERMANENT properly. On the other hand, since rewriting creates > > * and assigns a new relfilenode, we automatically create or drop an > > * init fork for the relation as appropriate. Thanks for sharing concrete comments in the source code. > According to this comment, perhaps we can do that at least for > wal_level=minimal. When I compare the two ideas, one of the benefits of the ALTER TABLE improvement is that we can avoid the downtime, while wal_level='none' provides an easy and faster major version upgrade via the output file of pg_dumpall. Both ideas have good points. However, actually modifying ALTER TABLE's copy looks far more difficult than wal_level='none' and beyond my current ability. So, I'd like to go forward with the direction of wal_level='none'. Do you have strong objections to this direction? Regards, Takamichi Osumi
Re: Why does PostgresNode.pm set such a low value of max_wal_senders?
On Wed, Sep 30, 2020 at 10:38:59PM -0700, Noah Misch wrote: > In favor of minimal values, we've had semaphore-starved buildfarm members in > the past. Perhaps those days are over, seeing that this commit has not yet > broken a buildfarm member in that particular way. Keeping max_wal_senders=10 > seems fine. Indeed, I am not spotting anything suspicious here. > No, PostgreSQL commit 54c2ecb changed that. I recommend an explicit > max_wal_senders=10 in PostgresNode, which makes it easy to test > wal_level=minimal: > > printf '%s\n%s\n%s\n' 'log_statement = all' 'wal_level = minimal' > 'max_wal_senders = 0' >/tmp/minimal.conf > make check-world TEMP_CONFIG=/tmp/minimal.conf > > thorntail is doing the equivalent, hence the failures. Ah, thanks, I had missed this piece. So we really need to have a value set in this module after all. -- Michael
Re: PATCH: Batch/pipelining support for libpq
This patch fixes compilation on windows and compilation of the documentation. Matthieu Garrigues On Thu, Oct 1, 2020 at 8:41 AM Matthieu Garrigues wrote: > > On Thu, Oct 1, 2020 at 6:35 AM Michael Paquier wrote: > > > The documentation is failing to build, and the patch does not build > > correctly on Windows. Could you address that? > > -- > > Michael > > Yes I'm on it. > > -- > Matthieu diff --git a/doc/src/sgml/libpq.sgml b/doc/src/sgml/libpq.sgml index 92556c7ce0..932561d0c5 100644 --- a/doc/src/sgml/libpq.sgml +++ b/doc/src/sgml/libpq.sgml @@ -4829,6 +4829,465 @@ int PQflush(PGconn *conn); + + Batch mode and query pipelining + + + libpq + batch mode + + + + libpq + pipelining + + + + libpq supports queueing up queries into + a pipeline to be executed as a batch on the server. Batching queries allows + applications to avoid a client/server round-trip after each query to get + the results before issuing the next query. + + + + When to use batching + + +Much like asynchronous query mode, there is no performance disadvantage to +using batching and pipelining. It increases client application complexity +and extra caution is required to prevent client/server deadlocks but +can sometimes offer considerable performance improvements. + + + +Batching is most useful when the server is distant, i.e. network latency +(ping time) is high, and when many small operations are being performed in +rapid sequence. There is usually less benefit in using batches when each +query takes many multiples of the client/server round-trip time to execute. +A 100-statement operation run on a server 300ms round-trip-time away would take +30 seconds in network latency alone without batching; with batching it may spend +as little as 0.3s waiting for results from the server. + + + +Use batches when your application does lots of small +INSERT, UPDATE and +DELETE operations that can't easily be transformed into +operations on sets or into a +COPY operation. 
+ + + +Batching is not useful when information from one operation is required by the +client before it knows enough to send the next operation. The client must +introduce a synchronisation point and wait for a full client/server +round-trip to get the results it needs. However, it's often possible to +adjust the client design to exchange the required information server-side. +Read-modify-write cycles are especially good candidates; for example: + + BEGIN; + SELECT x FROM mytable WHERE id = 42 FOR UPDATE; + -- result: x=2 + -- client adds 1 to x: + UPDATE mytable SET x = 3 WHERE id = 42; + COMMIT; +could be much more efficiently done with: + + UPDATE mytable SET x = x + 1 WHERE id = 42; + + + + + + The batch API was introduced in PostgreSQL 14.0, but clients using the PostgreSQL 14.0 version of libpq can + use batches on server versions 7.4 and newer. Batching works on any server + that supports the v3 extended query protocol. + + + + + + + Using batch mode + + +To issue batches the application must switch +a connection into batch mode. Enter batch mode with PQenterBatchMode(conn) or test +whether batch mode is active with PQbatchStatus(conn). In batch mode only asynchronous operations are permitted, and +COPY is not recommended as it most likely will trigger failure in batch processing. +Using any synchronous command execution function such as PQfn, +PQexec, or one of their sibling functions is an error condition. +Functions allowed in batch mode are described in . + + + +The client uses libpq's asynchronous query functions to dispatch work, +marking the end of each batch with PQbatchSendQueue. +And to get results, it uses PQgetResult. It may eventually exit +batch mode with PQexitBatchMode once all results are +processed. + + + + + It is best to use batch mode with libpq in + non-blocking mode. If used in + blocking mode it is possible for a client/server deadlock to occur. 
The + client will block trying to send queries to the server, but the server will + block trying to send results from queries it has already processed to the + client. This only occurs when the client sends enough queries to fill its + output buffer and the server's receive buffer before switching to + processing input from the server, but it's hard to predict exactly when + that'll happen so it's best to always use non-blocking mode. + Batch mode consumes more memory when send/recv is not done as required even in non-blocking mode. + + + + +Issuing queries + + + After entering batch mode the application dispatches requests + using normal asynchronous libpq functions such as + PQsendQueryParams, PQsendPrepare, + PQsendQueryPrepared,
RE: [Patch] Optimize dropping of relation buffers using dlist
From: Kyotaro Horiguchi > I thought that the advantage of this optimization is that we don't > need to visit all buffers? If we need to run a full-scan for any > reason, there's no point in looking up already-visited buffers > again. That's just wasteful cycles. Am I missing something? > > I don't understand. If we chose the optimized dropping, the reason > is that the number of buffer lookups is fewer than a certain threshold. Why > do you think that the kind of fork a buffer belongs to is relevant to the > criteria? I rethought about this, and you certainly have a point, but... OK, I think I understood. I had been thinking about it in an overly complicated way. In other words, you're suggesting "Let's simply treat all forks as one relation to determine whether to optimize," right? That is, the code simply becomes: Sums up the number of buffers to invalidate in all forks; if (the cached sizes of all forks are valid && # of buffers to invalidate < THRESHOLD) { do the optimized way; return; } do the traditional way; This will be simple, and I'm +1. Regards Takayuki Tsunakawa
Re: Protect syscache from bloating with negative cache entries
At Thu, 1 Oct 2020 13:37:29 +0900, Michael Paquier wrote in > On Wed, Jan 22, 2020 at 02:38:19PM +0900, Kyotaro Horiguchi wrote: > > I changed my mind to attach the benchmark patch as a .txt file, > > expecting the checkers not to pick it up as part of the patchset. > > > > I have been in precise performance measurement mode for a long time, > > but I think it's settled. I'd like to return to normal mode and > > explain this patch. > > Looking at the CF bot, this patch set does not compile properly. > Could you look at that? It is complaining that TimestampDifference is implicitly declared. I'm not sure of the exact cause, but maybe some refactoring in header file inclusion caused that. This is the rebased version. Thanks! -- Kyotaro Horiguchi NTT Open Source Software Center >From 07ce4cf303a365c3f3efc3f133551a67e4646875 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi Date: Fri, 10 Jan 2020 15:02:26 +0900 Subject: [PATCH v2 1/3] base_change Adds struct members needed by the catcache expiration feature and a GUC variable that controls the behavior of the feature. But no substantial code is added yet. If the mere existence of these variables can cause degradation, benchmarking after this patch will show it. 
---
 src/backend/access/transam/xact.c  |  3 +++
 src/backend/utils/cache/catcache.c | 15 +++++++++++++++
 src/backend/utils/misc/guc.c       | 13 +++++++++++++
 src/include/utils/catcache.h       | 20 ++++++++++++++++++++
 4 files changed, 51 insertions(+)

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index af6afcebb1..1040713955 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1085,6 +1085,9 @@ ForceSyncCommit(void)
 static void
 AtStart_Cache(void)
 {
+	if (xactStartTimestamp != 0)
+		SetCatCacheClock(xactStartTimestamp);
+
 	AcceptInvalidationMessages();
 }

diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index 3613ae5f44..c5bad63aac 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -60,9 +60,18 @@
 #define CACHE_elog(...)
 #endif

+/*
+ * GUC variable to define the minimum age of entries that will be considered
+ * to be evicted in seconds. -1 to disable the feature.
+ */
+int			catalog_cache_prune_min_age = 300;
+
 /* Cache management header --- pointer is NULL until created */
 static CatCacheHeader *CacheHdr = NULL;

+/* Clock for the last accessed time of a catcache entry. */
+TimestampTz	catcacheclock = 0;
+
 static inline HeapTuple SearchCatCacheInternal(CatCache *cache,
 											   int nkeys,
 											   Datum v1, Datum v2,
@@ -99,6 +108,12 @@ static void CatCacheFreeKeys(TupleDesc tupdesc, int nkeys, int *attnos,
 static void CatCacheCopyKeys(TupleDesc tupdesc, int nkeys, int *attnos,
 							 Datum *srckeys, Datum *dstkeys);

+/* GUC assign function */
+void
+assign_catalog_cache_prune_min_age(int newval, void *extra)
+{
+	catalog_cache_prune_min_age = newval;
+}

 /*
  * internal support functions

diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 596bcb7b84..e39c756f80 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -88,6 +88,8 @@
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/bytea.h"
+#include "utils/catcache.h"
+#include "utils/guc_tables.h"
 #include "utils/float.h"
 #include "utils/guc_tables.h"
 #include "utils/memutils.h"
@@ -2358,6 +2360,17 @@ static struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},

+	{
+		{"catalog_cache_prune_min_age", PGC_USERSET, RESOURCES_MEM,
+			gettext_noop("System catalog cache entries that live unused for longer than this seconds are considered for removal."),
+			gettext_noop("The value of -1 turns off pruning."),
+			GUC_UNIT_S
+		},
+		&catalog_cache_prune_min_age,
+		300, -1, INT_MAX,
+		NULL, assign_catalog_cache_prune_min_age, NULL
+	},
+
 	/*
 	 * We use the hopefully-safely-small value of 100kB as the compiled-in
 	 * default for max_stack_depth.  InitializeGUCOptions will increase it if

diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h
index f4aa316604..3d3870f05a 100644
--- a/src/include/utils/catcache.h
+++ b/src/include/utils/catcache.h
@@ -22,6 +22,7 @@
 #include "access/htup.h"
 #include "access/skey.h"
+#include "datatype/timestamp.h"
 #include "lib/ilist.h"
 #include "utils/relcache.h"
@@ -61,6 +62,7 @@ typedef struct catcache
 	slist_node	cc_next;		/* list link */
 	ScanKeyData	cc_skey[CATCACHE_MAXKEYS];	/* precomputed key info for heap
 											 * scans */
+	TimestampTz	cc_oldest_ts;	/* timestamp of the oldest tuple in the hash */

 	/*
 	 * Keep these at the end, so that compiling catcache.c with CATCACHE_STATS
@@ -119,6 +121,8 @@ typedef struct catctup
 	bool		dead;			/* dead but not yet removed? */
 	bool		negative;		/* negative cache entry? */
 	HeapTupleData tuple;		/* tuple management header */
+	unsigned int naccess;		/* # of access to this
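[Editor's sketch] The idea behind catalog_cache_prune_min_age can be modeled outside PostgreSQL as a clock-stamped cache whose entries become eviction candidates once they have gone unused for longer than the minimum age. This is a toy model, not the patch's code; as with the GUC, -1 disables pruning:

```python
class AgedCache:
    """Minimal model of min-age-based cache pruning."""

    def __init__(self, prune_min_age=300):
        self.prune_min_age = prune_min_age  # seconds; -1 disables pruning
        self.entries = {}                   # key -> (value, last_access_time)

    def put(self, key, value, now):
        self.entries[key] = (value, now)

    def get(self, key, now):
        value, _ = self.entries[key]
        self.entries[key] = (value, now)    # bump the access clock on hit
        return value

    def prune(self, now):
        """Drop entries unused for longer than prune_min_age; return count."""
        if self.prune_min_age < 0:
            return 0
        stale = [k for k, (_, ts) in self.entries.items()
                 if now - ts > self.prune_min_age]
        for k in stale:
            del self.entries[k]
        return len(stale)
```

In the real patch the "clock" is the transaction start timestamp set in AtStart_Cache(), so pruning decisions cost no extra gettimeofday() calls on the cache's hot path.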
Re: Add information to rm_redo_error_callback()
On Thu, Sep 24, 2020 at 03:03:46PM +0900, Michael Paquier wrote:
> Hmm. I still think that knowing at least about a FPW could be an
> interesting piece of information even here. Anyway, instead of
> copying a logic that exists already in xlog_outrec(), why not moving
> the block information print into a separate routine out of the
> WAL_DEBUG section, and just reuse the same format for the context of
> the redo error callback? That would also be more consistent with what
> we do in pg_waldump where we don't show the fork name of a block when
> it is on a MAIN_FORKNUM. And this would avoid a third copy of the
> same logic. If we add the XID, previous LSN and the record length
> on the stack of what is printed, we could just reuse the existing
> routine, still that's perhaps too much information displayed.

Seeing nothing, I took a swing at that, and finished with the attached that refactors the logic and prints the block information as wanted. Any objections to that?
--
Michael

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 79a77ebbfe..e96075158a 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -940,6 +940,7 @@ static bool CheckForStandbyTrigger(void);
 #ifdef WAL_DEBUG
 static void xlog_outrec(StringInfo buf, XLogReaderState *record);
 #endif
+static void xlog_block_info(StringInfo buf, XLogReaderState *record);
 static void xlog_outdesc(StringInfo buf, XLogReaderState *record);
 static void pg_start_backup_callback(int code, Datum arg);
 static void pg_stop_backup_callback(int code, Datum arg);
@@ -10258,6 +10259,19 @@ xlog_outrec(StringInfo buf, XLogReaderState *record)
 	appendStringInfo(buf, "; len %u",
 					 XLogRecGetDataLen(record));
+
+	xlog_block_info(buf, record);
+}
+#endif							/* WAL_DEBUG */
+
+/*
+ * Returns a string giving information about all the blocks in an
+ * XLogRecord.
+ */
+static void
+xlog_block_info(StringInfo buf, XLogReaderState *record)
+{
+	int			block_id;
+
 	/* decode block references */
 	for (block_id = 0; block_id <= record->max_block_id; block_id++)
 	{
@@ -10284,7 +10298,6 @@ xlog_outrec(StringInfo buf, XLogReaderState *record)
 			appendStringInfoString(buf, " FPW");
 		}
 	}
-#endif							/* WAL_DEBUG */

 /*
  * Returns a string describing an XLogRecord, consisting of its identity
@@ -11765,6 +11778,7 @@ rm_redo_error_callback(void *arg)
 	initStringInfo(&buf);
 	xlog_outdesc(&buf, record);
+	xlog_block_info(&buf, record);

 	/* translator: %s is a WAL record description */
 	errcontext("WAL redo at %X/%X for %s",
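[Editor's sketch] The shape of that refactoring, one block-formatting helper shared by the WAL_DEBUG output path and the redo error-context path instead of a second (or third) copy of the loop, can be illustrated in miniature. The names and record layout here are hypothetical, not the actual xlog.c code:

```python
def block_info(record):
    """Format the block references of a WAL record once, for any caller."""
    parts = []
    for blk in record["blocks"]:
        s = f"blkref #{blk['id']}: rel {blk['rel']} blk {blk['blkno']}"
        if blk.get("fpw"):
            s += " FPW"   # a full-page write is worth seeing on error, too
        parts.append(s)
    return "; ".join(parts)

def debug_outrec(record):
    # debug-build path (analogous to xlog_outrec under WAL_DEBUG)
    return f"len {record['len']}; " + block_info(record)

def redo_error_context(record):
    # error-context path (analogous to rm_redo_error_callback)
    return f"WAL redo for {record['desc']}: " + block_info(record)
```

Both callers now emit identical block descriptions, so a redo failure report carries the same information a WAL_DEBUG build would print, with no duplicated formatting logic to drift out of sync.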
Re: enable_incremental_sort changes query behavior
On Wed, 30 Sep 2020 at 21:21, James Coleman wrote:
>
> On Sat, Sep 26, 2020 at 2:49 PM Jaime Casanova wrote:
> >
> > Hi,
> >
> > With sqlsmith I found a query that gives this error:
> > ERROR: ORDER/GROUP BY expression not found in targetlist
> > [...]
> >
> > But if I set enable_incremental_sort to off the query gets executed
> > without problems (attached the explain produced for that case)
>
> Thanks for the report.

Hi, by experiment I reduced the query to this:

--- 0 ---
select distinct
       subq_0.c1 as c0,
       case when (true = pg_catalog.pg_rotate_logfile_old())
            then ref_0.t else ref_0.t end as c4
from public.ref_0,
     lateral (select ref_0.i as c1
              from generate_series(1, 100) as ref_1) as subq_0
--- 0 ---

The only custom table needed can be created with these commands:

--- 0 ---
create table ref_0 as
  select repeat('abcde', (random() * 10)::integer) t,
         random() * 1000 i
  from generate_series(1, 50);
create index on ref_0 (i);
analyze ref_0;
--- 0 ---

> Is there by any chance an index on ref_0.radi_text_temp?

There is an index involved, but not on that field; the commands above create the index on the right column... after that, ANALYZE the table.

> And if you set enable_hashagg = off what plan do you get (or error)?

Same error.

--
Jaime Casanova
www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Get memory contexts of an arbitrary backend process
Hi,

On Fri, Sep 25, 2020 at 4:28 PM torikoshia wrote:
> Thanks for all your comments, I updated the patch.

Thanks for updating the patch. I did a brief test and code review.

> I added a shared hash table consisted of minimal members
> mainly for managing whether the file is dumped or not.
> Some members like 'loc' seem useful in the future, but I
> haven't added them since it's not essential at this point.

Yes, that would be good.

+	/*
+	 * Since we allow only one session can request to dump memory context at
+	 * the same time, check whether the dump files already exist.
+	 */
+	while (stat(dumpfile, &stat_tmp) == 0 || stat(tmpfile, &stat_tmp) == 0)
+	{
+		pg_usleep(100L);
+	}

If pg_get_backend_memory_contexts() is executed by two or more sessions at the same time, it cannot be run exclusively in this way. Currently it seems to cause a crash when doing so. This is easy to reproduce and can be done as follows.

[Session-1]
BEGIN;
LOCK TABLE t1;

[Session-2]
BEGIN;
LOCK TABLE t1; <- waiting

[Session-3]
select * FROM pg_get_backend_memory_contexts();

[Session-4]
select * FROM pg_get_backend_memory_contexts();

If you issue commit or abort at session-1, you will get SEGV.

Instead of checking for the existence of the file, it might be better to use a hash (mcxtdumpHash) entry with an LWLock.

+	if (proc == NULL)
+	{
+		ereport(WARNING,
+				(errmsg("PID %d is not a PostgreSQL server process", dst_pid)));
+		return (Datum) 1;
+	}

Shouldn't it clear the hash entry before returning?

+	/* Wait until target process finished dumping file. */
+	while (!entry->is_dumped)
+	{
+		CHECK_FOR_INTERRUPTS();
+		pg_usleep(1L);
+	}

If the target is killed and exits before dumping the memory information, you're in an infinite loop here. So how about making sure that any process that has to stop before doing a memory dump changes the status of the hash (entry->is_dumped) before stopping, so the caller can detect it? I'm not sure whether it's best or not, but you might want to use something like the on_shmem_exit callback.
In the current design, if the caller stops processing before reading the dumped file, you will have an orphaned file. It looks like the following.

[Session-1]
BEGIN;
LOCK TABLE t1;

[Session-2]
BEGIN;
LOCK TABLE t1; <- waiting

[Session-3]
select * FROM pg_get_backend_memory_contexts();

If you cancel or terminate session-3, then issue commit or abort at session-1, you will get orphaned files in pg_memusage. So if only one session is allowed to request a dump at a time, it could call pg_memusage_reset() before sending the signal in this function.

Best regards,
--
Tatsuhito Kasahara
kasahara.tatsuhito _at_ gmail.com
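[Editor's sketch] The direction suggested in this review, a lock-protected in-flight entry instead of polling for file existence, with the entry always cleared on the requester's exit path, might look like this in miniature. Python's threading lock stands in for PostgreSQL's LWLocks and shared hash table, and all names are hypothetical:

```python
import threading

class DumpRegistry:
    """At most one in-flight dump request per target; cleared even on failure."""

    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}              # target_pid -> "requested" | "dumped"

    def request(self, pid):
        with self._lock:                 # analogous to taking the LWLock
            if pid in self._inflight:
                raise RuntimeError("dump already in progress for this target")
            self._inflight[pid] = "requested"

    def mark_dumped(self, pid):
        with self._lock:                 # target side: flip status under the lock
            self._inflight[pid] = "dumped"

    def finish(self, pid):
        with self._lock:
            self._inflight.pop(pid, None)   # analogous to on_shmem_exit cleanup

def collect(registry, pid, signal_target, read_dump):
    registry.request(pid)
    try:
        signal_target(pid)
        return read_dump(pid)
    finally:
        registry.finish(pid)             # no orphaned state if the caller dies
```

Because the entry is created and checked under one lock, two concurrent callers cannot both believe they are the exclusive requester, which is exactly what the existence-check-and-sleep loop fails to guarantee.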
Re: Parallel copy
On Tue, Sep 29, 2020 at 3:16 PM Greg Nancarrow wrote:
>
> Hi Vignesh and Bharath,
>
> It seems the Parallel Copy patch is treating RI_TRIGGER_PK as
> parallel-unsafe.
> Can you explain why this is?

I don't think we need to restrict this case, and even if there is some reason to do so, that reason should probably be mentioned in the comments.

--
With Regards,
Amit Kapila.
Re: Resetting spilled txn statistics in pg_stat_replication
On Thu, Oct 1, 2020 at 12:06 PM Amit Kapila wrote:
>
> Thanks for the review.

Oops, forgot to attach the updated patches; doing so now.

--
With Regards,
Amit Kapila.

v7-0001-Track-statistics-for-spilling-of-changes-from-Reo.patch
v7-0002-Test-stats.patch
Re: Resetting spilled txn statistics in pg_stat_replication
On Wed, Sep 30, 2020 at 4:34 PM Dilip Kumar wrote:
>
> On Wed, Sep 30, 2020 at 2:40 PM Amit Kapila wrote:
> >
> > On Wed, Sep 30, 2020 at 1:12 PM Dilip Kumar wrote:
> > >
> > > On Fri, Sep 25, 2020 at 4:33 PM Amit Kapila wrote:
> > > >
> > > > On Thu, Sep 24, 2020 at 5:44 PM Amit Kapila wrote:
> > > >
> > > > I have done some more testing of this patch, especially for the case
> > > > where we spill before streaming the transaction, and found everything
> > > > is working as expected. Additionally, I have changed a few more
> > > > comments and ran pgindent. I am still not very sure whether we want to
> > > > display physical slots in this view, as all the stats are for logical
> > > > slots, but anyway we can add stats w.r.t. physical slots in the future.
> > > > I am fine either way (don't show physical slots in this view, or show
> > > > them but keep stats as 0). Let me know if you have any thoughts on
> > > > these points; other than that I am happy with the current state of the
> > > > patch.
> > >
> > > IMHO, it will make more sense to only show the logical replication
> > > slots in this view.
> > >
> >
> > Okay, Sawada-San, others, do you have any opinion on this matter?

I have changed it so that the view will only show logical replication slots, and adjusted the docs accordingly. I think we can easily extend it to show physical replication slots if required in the future.

> I have started looking into this patch, I have a few comments.
>
> +        Number of times transactions were spilled to disk. Transactions
> +        may get spilled repeatedly, and this counter gets incremented on every
> +        such invocation.
> +
> +   Tracking of spilled transactions works only for logical replication. In
>
> The number of spaces used after the full stop is not uniform.

I have removed this sentence, as it is not required once we only show logical slots in the view, and I have updated the other parts of the doc accordingly.
> +	/* update the statistics iff we have spilled anything */
> +	if (spilled)
> +	{
> +		rb->spillCount += 1;
> +		rb->spillBytes += size;
> +
> +		/* Don't consider already serialized transactions. */
>
> The single-line comments are not uniform: "update the statistics" starts
> with a small letter and doesn't end with a full stop, whereas "Don't
> consider" starts with a capital and ends with a full stop.

Actually, it doesn't matter in this case, but I have changed it to keep things consistent.

> +	/*
> +	 * We set this flag to indicate if the transaction is ever serialized.
> +	 * We need this to accurately update the stats.
> +	 */
> +	txn->txn_flags |= RBTXN_IS_SERIALIZED_CLEAR;
>
> I feel we can explain the exact scenario in the comment, i.e. if we
> stream after spilling, we still need to know that the transaction
> spilled in the past so that we don't count it again as a new spilled
> transaction.

Okay, I have expanded the comment a bit more to explain this.

> ... old slot. We send the drop
> + * and create messages while holding ReplicationSlotAllocationLock to
> + * reduce that possibility. If the messages arrived in reverse order, we
> + * would lose one statistics update message. But
>
> Spacing after the full stop is not uniform.

Changed.

> +	/*
> +	 * Statistics about transactions spilled to disk.
> +	 *
> +	 * A single transaction may be spilled repeatedly, which is why we keep
> +	 * two different counters. For spilling, the transaction counter includes
> +	 * both toplevel transactions and subtransactions.
> +	 */
> +	int64		spillCount;		/* spill-to-disk invocation counter */
> +	int64		spillTxns;		/* number of transactions spilled to disk */
> +	int64		spillBytes;		/* amount of data spilled to disk */
>
> Can we keep the order as spillTxns, spillCount, spillBytes, because
> every other place keeps that order, so it will look more uniform?

Changed here and at one more place as per this suggestion.

> Other than that I did not see any problem.

Thanks for the review.
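[Editor's sketch] The accounting under review, where spillCount grows on every spill-to-disk invocation but spillTxns counts each transaction only the first time it is serialized, can be modeled like this. This is a sketch of the bookkeeping only, not reorderbuffer.c itself:

```python
class SpillStats:
    """Model of the spill counters: txns, invocations, bytes."""

    def __init__(self):
        self.spillTxns = 0    # transactions spilled to disk at least once
        self.spillCount = 0   # spill-to-disk invocations
        self.spillBytes = 0   # total amount of data spilled

    def record_spill(self, txn, size):
        self.spillCount += 1
        self.spillBytes += size
        if not txn.get("serialized"):
            # Don't count an already-serialized transaction again; the flag
            # is remembered even if the transaction is streamed later.
            txn["serialized"] = True
            self.spillTxns += 1
```

The per-transaction flag is the analogue of the serialized marker discussed above: without it, a transaction that spills, streams, and spills again would inflate spillTxns.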
-- With Regards, Amit Kapila.
RE: [Patch] Optimize dropping of relation buffers using dlist
From: Jamison, Kirk/ジャミソン カーク
> For the non-recovery path, did you mean by any chance
> measuring the cache hit rate for varying shared_buffers?

No. You can test the speed of DropRelFileNodeBuffers() during normal operation, i.e. by running TRUNCATE from psql, instead of performing recovery. To enable that, you can just remove the checks for recovery, i.e. the checks for InRecovery and for whether the value is cached or not.

Regards
Takayuki Tsunakawa
Re: making update/delete of inheritance trees scale better
Hi, On Thu, Oct 1, 2020 at 1:32 PM Michael Paquier wrote: > > On Fri, Sep 11, 2020 at 07:20:56PM +0900, Amit Langote wrote: > > I have been working away at this and have updated the patches for many > > cosmetic and some functional improvements. > > Please note that this patch set fails to apply. Could you provide a > rebase please? Yeah, I'm working on posting an updated patch. -- Amit Langote EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: Batch/pipelining support for libpq
On Thu, Oct 1, 2020 at 6:35 AM Michael Paquier wrote: > The documentation is failing to build, and the patch does not build > correctly on Windows. Could you address that? > -- > Michael Yes I'm on it. -- Matthieu