Re: New docs chapter on Transaction Management and related changes
On Wed, 30 Nov 2022 at 01:51, Bruce Momjian wrote:
> Thanks to Simon for getting this important
> information in our docs, and for the valuable feedback from others that
> made this even better.

And thanks to you for pulling that all together, Bruce.

--
Simon Riggs
http://www.EnterpriseDB.com/
Re: Slow standby snapshot
On Tue, 29 Nov 2022 at 20:46, Tom Lane wrote:
>
> I wrote:
> > That seems like a fairly bad idea: it will add extra contention
> > on ProcArrayLock, and I see no real strong argument that the path
> > can't get traversed often enough for that to matter. It would
> > likely be better for KnownAssignedXidsCompress to obtain the lock
> > for itself, only after it knows there is something worth doing.
>
> Since we're running out of time in the current commitfest,
> I went ahead and changed that, and made the cosmetic fixes
> I wanted, and pushed.

That is a complete patch from multiple angles; very happy here. Thanks for a great job.

--
Simon Riggs
http://www.EnterpriseDB.com/
Re: Reducing power consumption on idle servers
On Wed, 30 Nov 2022 at 03:50, Thomas Munro wrote:
>
> On Wed, Nov 30, 2022 at 1:32 AM Simon Riggs wrote:
> > Re-attaching patch for bgwriter and walwriter, so it is clear this is
> > not yet committed.
>
> I'm just curious, and not suggesting that 60s wakeups are a problem
> for the polar ice caps, but why even time out at all? Are the latch
> protocols involved not reliable enough? At a guess from a quick
> glance, the walwriter's is but maybe the bgwriter could miss a wakeup
> as it races against StrategyGetBuffer(), which means you might stay
> asleep until the *next* buffer allocation, but that's already true I
> think, and a 60s timeout is not much of a defence.

That sounds reasonable. It does sound like we agree that the existing behavior of waking up every 5s or 2.5s is not good. I hope you will act to improve that.

The approach taken in this patch, and others of mine, has been to offer a minimal change that achieves the objective of lengthy hibernation to save power. Removing the timeout entirely may not work in other circumstances I have not explored. Doing that requires someone to check that it actually works, and for others to believe that check has occurred. For me, that is too time-consuming to happen in this dev cycle, and time is part of the objective, since a perfect design with unreleased code has no utility.

--
Simon Riggs
http://www.EnterpriseDB.com/
Re: O(n) tasks cause lengthy startups and checkpoints
On Wed, 30 Nov 2022 at 03:56, Nathan Bossart wrote:
>
> On Tue, Nov 29, 2022 at 12:02:44PM +, Simon Riggs wrote:
> > The last important point for me is tests, in src/test/modules
> > probably. It might be possible to reuse the final state of other
> > modules' tests to test cleanup, or at least integrate a custodian test
> > into each module.
>
> Of course. I found some existing tests for the test_decoding plugin that
> appear to reliably generate the files we want the custodian to clean up, so
> I added them there.

Thanks for adding the tests; I can see they run clean.

The only minor thing I would personally add is a note in each piece of code to explain where the tests are for each one, and/or something in the main custodian file that says tests exist within src/test/modules.

Otherwise, ready for committer.

--
Simon Riggs
http://www.EnterpriseDB.com/
Re: Bug in wait time when waiting on nested subtransaction
On Mon, 28 Nov 2022 at 21:10, Robert Haas wrote:
>
> On Mon, Nov 28, 2022 at 3:01 PM Tom Lane wrote:
> > One thing we need to be pretty careful of here is to not break the
> > promise of atomic commit. At topmost commit, all subxacts must
> > appear committed simultaneously. It's not quite clear to me whether
> > we need a similar guarantee in the rollback case. It seems like
> > we shouldn't, but maybe I'm missing something, in which case maybe
> > the current behavior is correct?
>
> In short, I think Simon is right that there's a problem and right
> about which commit caused it, but I'm not sure what I think we ought
> to do about it.

I'm comfortable with ignoring it, on the basis that it *is* a performance optimization, but I suggest we keep the test (with modified output) and document the behavior, if we do.

The really big issue is the loss of performance we get from having subtrans point only to its immediate parent, which makes XidInMVCCSnapshot() go really slow in the presence of lots of subtransactions. So ignoring the issue on this thread will open the door for the optimization posted for this patch:
https://commitfest.postgresql.org/40/3806/

--
Simon Riggs
http://www.EnterpriseDB.com/
Re: SUBTRANS: Minimizing calls to SubTransSetParent()
On Thu, 17 Nov 2022 at 17:29, Simon Riggs wrote:
> (yes, the last line shows x10 performance patched, that is not a typo)

New version of patch, now just a one-line patch! Results show it's still a good win for XidInMVCCSnapshot().

--
Simon Riggs
http://www.EnterpriseDB.com/

subtrans_points_to_topxid.v13.patch
Description: Binary data
Re: Reducing power consumption on idle servers
On Sun, 20 Nov 2022 at 20:00, Simon Riggs wrote:
>
> On Thu, 24 Mar 2022 at 16:21, Robert Haas wrote:
> >
> > On Thu, Mar 24, 2022 at 12:02 PM Simon Riggs wrote:
> > > What changes will be acceptable for bgwriter, walwriter and logical
> > > worker?
> >
> > Hmm, I think it would be fine to introduce some kind of hibernation
> > mechanism for logical workers. bgwriter and wal writer already have a
> > hibernation mechanism, so I'm not sure what your concern is there
> > exactly. In your initial email you said you weren't proposing changes
> > there, but maybe that changed somewhere in the course of the
> > subsequent discussion. If you could catch me up on your present
> > thinking that would be helpful.
>
> Now that we seem to have solved the problem for Startup process, let's
> circle back to the others

Thanks for committing changes to the startup process.

> Bgwriter does hibernate currently, but only at 50x the bgwriter_delay.
> At default values that is 5s, but could be much less. So we need to
> move that up to 60s, same as others.
>
> WALwriter also hibernates currently, but only at 25x the
> wal_writer_delay. At default values that is 2.5s, but could be much
> less. So we need to move that up to 60s, same as others. At the same
> time, make sure that when we hibernate we use a new WaitEvent,
> similarly to the way bgwriter reports its hibernation state (which
> also helps test the patch).

Re-attaching patch for bgwriter and walwriter, so it is clear this is not yet committed.

I've added this to the next CF, since the entry was closed when the startup patch was committed.

--
Simon Riggs
http://www.EnterpriseDB.com/

hibernate_bgwriter_walwriter.v5.patch
Description: Binary data
Re: O(n) tasks cause lengthy startups and checkpoints
On Mon, 28 Nov 2022 at 23:40, Nathan Bossart wrote:
>
> Okay, here is a new patch set. 0004 adds logic to prevent custodian tasks
> from delaying shutdown.

That all seems good, thanks.

The last important point for me is tests, in src/test/modules probably. It might be possible to reuse the final state of other modules' tests to test cleanup, or at least integrate a custodian test into each module.

--
Simon Riggs
http://www.EnterpriseDB.com/
Re: Reducing power consumption on idle servers
On Mon, 28 Nov 2022 at 23:16, Thomas Munro wrote:
>
> I found some more comments and some documentation to word-smith very
> lightly, and pushed.

Thanks

> The comments were stray references to the trigger file. It's
> a little confusing because the remaining mechanism also uses a file,
> but it uses a signal first so seems better to refer to promotion
> requests rather than files.

...and again

> /me wonders how many idle servers there are in the world

My estimate is there are 100 million servers in use worldwide, with only about 1% of them on a continuously busy duty cycle and most of them not in the cloud. If we guess that we save 10W when idle, then that saves some proportion of a GW.

It's not a huge contribution to the effort, but if by doing this we inspire others to do the same, we may yet make a difference.

--
Simon Riggs
http://www.EnterpriseDB.com/
Re: Bug in wait time when waiting on nested subtransaction
On Mon, 28 Nov 2022 at 17:38, Robert Haas wrote:
>
> On Mon, Nov 28, 2022 at 10:28 AM Simon Riggs wrote:
> > So we have these options
> >
> > 1. Removing the XactLockTableDelete() call in CommitSubTransaction().
> > That releases lock waiters earlier than expected, which requires
> > pushups in XactLockTableWait() to cope with that (which are currently
> > broken). Why do we do it? To save shmem space in the lock table should
> > anyone want to run a transaction that contains thousands of
> > subtransactions, or more. So option (1) alone would eventually cause
> > us to run out of space in the lock table and a transaction would
> > receive ERRORs rather than be allowed to run for a long time.
>
> This seems unprincipled to me. The point of having a lock on the
> subtransaction in the lock table is so that people can wait for the
> subtransaction to end. If we don't remove the lock table entry at
> subtransaction end, then that lock table entry doesn't serve that
> purpose any more.

An easy point to confuse: "subtransaction to end". The subtransaction is "still running" to other backends even AFTER it has been subcommitted, but its state now transfers to the parent. So the subtransaction doesn't cease running until it aborts, one of its parents aborts, or top-level commit. The subxid lock should, on principle, exist until one of those events occurs. It doesn't, but that is an optimization, for the stated reason. (All of the above is current behavior.)

> > 2. In XactLockTableWait(), replace the call to SubTransGetParent(), so
> > we go up the levels one by one as we did before. However, (2) causes
> > huge subtrans contention and if we implemented that and backpatched it
> > the performance issues could be significant. So my feeling is that if
> > we do (2) then we should not backpatch it.
>
> What I find suspicious about the coding of this function is that it
> doesn't distinguish between commits and aborts at all.
> Like, if a subtransaction commits, the changes don't become globally visible
> until the parent transaction also commits. If a subtransaction aborts,
> though, what happens to the top-level XID doesn't seem to matter at
> all. The comment seems to agree:
>
> * Note that this does the right thing for subtransactions: if we wait on a
> * subtransaction, we will exit as soon as it aborts or its top parent commits.
>
> I feel like what I'd expect to see given this comment is code which
> (1) waits until the supplied XID is no longer running, (2) checks
> whether the XID aborted, and if so return at once, and (3) otherwise
> recurse to the parent XID. But the code doesn't do that. Perhaps
> that's not actually the right thing to do, since it seems like a big
> behavior change, but then I don't understand the comment.

As I mention above, the correct behavior is that the subxact doesn't cease running until it aborts, one of its parents aborts, or top-level commit. Which is slightly different from the comment, which may explain why the bug exists.

> Incidentally, one possible optimization here to try to release locking
> traffic would be to try waiting for the top-parent first using a
> conditional lock acquisition. If that works, cool. If not, go back
> around and work up the tree level by level. Since that path would only
> be taken in the unhappy case where we're probably going to have to
> wait anyway, the cost probably wouldn't be too bad.

That sounds like a potential bug fix (not just an optimization).

> > My preferred solution would be a mix of the above, call it option (3)
> >
> > 3.
> > a) Include the lock table entry for the first 64 subtransactions only,
> > so that we limit shmem. For those first 64 entries, have the subtrans
> > point direct to top, since this makes a subtrans lookup into an O(1)
> > operation, which is important for performance of later actions.
> > b) For any subtransactions after first 64, delete the subxid lock on
> > subcommit, to save shmem, but make subtrans point to the immediate
> > parent (only), not the topxid. That fixes the bug, but causes
> > performance problems in XidInMVCCSnapshot() and others, so we also do
> > c) and d)
> >
> > c) At top level commit, go back and alter subtrans again for subxids
> > so now it points to the topxid, so that we avoid O(N) behavior in
> > XidInMVCCSnapshot() and other callers. Additional cost for longer
> > transactions, but it saves considerable cost in later callers that
> > need to call GetTopmostTransaction.
> >
> > d) Optimize SubTransGetTopmostTransaction() so it retrieves entries
> > page-at-a-time. This will reduce the contention of repeatedly
> > re-visiting the same page(s) and ensure that a page is less often
> > paged out when we are still using it.
Re: Bug in wait time when waiting on nested subtransaction
On Mon, 28 Nov 2022 at 18:53, Alvaro Herrera wrote:
>
> On 2022-Nov-28, Simon Riggs wrote:
>
> > A narrative description of the issue follows:
> > session1 - requests multiple nested subtransactions like this:
> > BEGIN; ...
> > SAVEPOINT subxid1; ...
> > SAVEPOINT subxid2; ...
>
> > However, if subxid2 subcommits, then the lock wait moves from subxid2
> > to the topxid.
>
> Hmm, do we really do that? Seems very strange .. it sounds to me like
> the lock should have been transferred to subxid1 (which is subxid2's
> parent), not to the top-level Xid.

Correct; that is exactly what I'm saying and why we have a bug since 3c27944fb2141.

> Maybe what the user wanted was to
> release subxid1 before establishing subxid2? Or do they want to
> continue to be able to rollback to subxid1 after establishing subxid2?
> (but why?)

This isn't a description of a user's actions, it is a script that illustrates the bug in XactLockTableWait(). Perhaps a better example would be nested code blocks with EXCEPTION clauses where the outer block fails... e.g.

DO $$
BEGIN
  SELECT 1;
  BEGIN
    SELECT 1;
  EXCEPTION WHEN OTHERS THEN
    RAISE NOTICE 's2';
  END;
  RAISE division_by_zero;  -- now back in outer subxact, which now fails
EXCEPTION WHEN OTHERS THEN
  RAISE NOTICE 's1';
END;
$$;

Of course, debugging this is harder since there is no way to return the current subxid in SQL.

--
Simon Riggs
http://www.EnterpriseDB.com/
Bug in wait time when waiting on nested subtransaction
Issue in current XactLockTableWait, starting with 3c27944fb2141, discovered while reviewing https://commitfest.postgresql.org/40/3806/

Test demonstrating the problem is 001-isotest-tuplelock-subxact.v1.patch

A narrative description of the issue follows:

session1 - requests multiple nested subtransactions like this:
BEGIN; ...
SAVEPOINT subxid1; ...
SAVEPOINT subxid2; ...

If another session2 sees an xid from subxid2, it waits. If subxid2 aborts, then session2 sees the abort and can continue processing normally.

However, if subxid2 subcommits, then the lock wait moves from subxid2 to the topxid. If subxid1 subsequently aborts, it will also abort subxid2, but session2 waiting for subxid2 to complete doesn't see this and waits for topxid instead. Which means that it waits longer than it should, and later arriving lock waiters may actually get served first.

So it's a bug, but not too awful, since in many cases people don't use nested subtransactions, and if they do use SAVEPOINTs, they don't often close them using RELEASE. And in most cases the extra wait time is not very long, hence why nobody ever reported this issue.

Thanks to Robert Haas and Julien Tachoires for asking the question "are you sure the existing coding is correct?". You were both right; it is not.

How to fix? Correct lock wait can be achieved while a subxid is running if we do either
* a lock table entry for the subxid OR
* a subtrans entry that points to its immediate parent

So we have these options

1. Removing the XactLockTableDelete() call in CommitSubTransaction(). That releases lock waiters earlier than expected, which requires pushups in XactLockTableWait() to cope with that (which are currently broken). Why do we do it? To save shmem space in the lock table should anyone want to run a transaction that contains thousands of subtransactions, or more.
So option (1) alone would eventually cause us to run out of space in the lock table and a transaction would receive ERRORs rather than be allowed to run for a long time.

2. In XactLockTableWait(), replace the call to SubTransGetParent(), so we go up the levels one by one as we did before. However, (2) causes huge subtrans contention and if we implemented that and backpatched it the performance issues could be significant. So my feeling is that if we do (2) then we should not backpatch it.

So both (1) and (2) have issues.

The main result from patch https://commitfest.postgresql.org/40/3806/ is that having subtrans point direct to topxid is very good for performance in XidInMVCCSnapshot(), and presumably other places also. This bug prevents the simple application of a patch to improve performance. So now we need a stronger mix of ideas to both resolve the bug and fix the subtrans contention issue in HEAD.

My preferred solution would be a mix of the above, call it option (3)

3.
a) Include the lock table entry for the first 64 subtransactions only, so that we limit shmem. For those first 64 entries, have the subtrans point direct to top, since this makes a subtrans lookup into an O(1) operation, which is important for performance of later actions.

b) For any subtransactions after first 64, delete the subxid lock on subcommit, to save shmem, but make subtrans point to the immediate parent (only), not the topxid. That fixes the bug, but causes performance problems in XidInMVCCSnapshot() and others, so we also do c) and d)

c) At top level commit, go back and alter subtrans again for subxids so now it points to the topxid, so that we avoid O(N) behavior in XidInMVCCSnapshot() and other callers. Additional cost for longer transactions, but it saves considerable cost in later callers that need to call GetTopmostTransaction.

d) Optimize SubTransGetTopmostTransaction() so it retrieves entries page-at-a-time.
This will reduce the contention of repeatedly re-visiting the same page(s) and ensure that a page is less often paged out when we are still using it.

Thoughts?

--
Simon Riggs
http://www.EnterpriseDB.com/

001-isotest-tuplelock-subxact.v1.patch
Description: Binary data
Re: O(n) tasks cause lengthy startups and checkpoints
On Sun, 27 Nov 2022 at 23:34, Nathan Bossart wrote:
>
> Thanks for taking a look!
>
> On Thu, Nov 24, 2022 at 05:31:02PM +, Simon Riggs wrote:
> > * not sure I believe that everything it does can always be aborted out
> > of and shutdown - to achieve that you will need
> > CHECK_FOR_INTERRUPTS() calls in the loops in patches 5 and 6 at least
>
> I did something like this earlier, but was advised to simply let the
> functions finish as usual during shutdown [0]. I think this is what the
> checkpointer process does today, anyway.

If we say "The custodian is not an essential process and can shutdown quickly when requested.", and yet we know it's not true in all cases, then that will lead to misunderstandings and bugs. If we perform a restart and the custodian is performing extra work that delays shutdown, then it also delays restart. Given the title of the thread, we should be looking to improve that, or at least know it occurred.

> > * not sure why you want immediate execution of custodian tasks - I
> > feel supporting two modes will be a lot harder. For me, I would run
> > locally when !IsUnderPostmaster and also in an Assert build, so we can
> > test it works right - i.e. running in its own process is just a
> > production optimization for performance (which is the stated reason
> > for having this)
>
> I added this because 0004 involves requesting a task from the postmaster,
> so checking for IsUnderPostmaster doesn't work. Those tasks would always
> run immediately. However, we could use IsPostmasterEnvironment instead,
> which would allow us to remove the "immediate" argument. I did it this way
> in v14.

Thanks

> > 0005 seems good from what I know
> > * There is no check to see if it worked in any sane time
>
> What did you have in mind? Should the custodian begin emitting WARNINGs
> after a while?

I think it might be useful if it logged anything that took an "extended period", TBD. Maybe that is already covered by startup process logging.
Please tell me that still works?

> > Rather than explicitly use DEBUG1 everywhere I would have an
> > #define CUSTODIAN_LOG_LEVEL LOG
> > so we can run with it in LOG mode and then set it to DEBUG1 with a one
> > line change in a later phase of Beta
>
> I can create a separate patch for this, but I don't think I've ever seen
> this sort of thing before.

Much of recovery is coded that way, for the same reason.

> Is the idea just to help with debugging during
> the development phase?

"Just", yes.

Tests would be desirable also, under src/test/modules.

--
Simon Riggs
http://www.EnterpriseDB.com/
Re: Add 64-bit XIDs into PostgreSQL 15
On Fri, 21 Oct 2022 at 17:09, Maxim Orlov wrote:
> Reviews and opinions are very welcome!

I'm wondering whether the safest way to handle this is by creating a new TAM called "heap64", so that all storage changes happen there. (Obviously there are still many other changes in core, but they are more easily fixed.)

That would reduce the code churn around "heap", allowing us to keep it stable while we move to the brave new world.

Many current users see stability as one of the greatest strengths of Postgres, so while I very much support this move, I wonder if this gives us a way to have both stability and innovation at the same time?

--
Simon Riggs
http://www.EnterpriseDB.com/
Re: O(n) tasks cause lengthy startups and checkpoints
On Thu, 24 Nov 2022 at 00:19, Nathan Bossart wrote:
>
> On Sun, Nov 06, 2022 at 02:38:42PM -0800, Nathan Bossart wrote:
> > rebased
>
> another rebase for cfbot

0001 seems good to me
* I like that it sleeps forever until requested
* not sure I believe that everything it does can always be aborted out of and shutdown - to achieve that you will need CHECK_FOR_INTERRUPTS() calls in the loops in patches 5 and 6 at least
* not sure why you want immediate execution of custodian tasks - I feel supporting two modes will be a lot harder. For me, I would run locally when !IsUnderPostmaster and also in an Assert build, so we can test it works right - i.e. running in its own process is just a production optimization for performance (which is the stated reason for having this)

0005 seems good from what I know
* There is no check to see if it worked in any sane time
* It seems possible that "Old" might change meaning - will that make it break/fail?

0006 seems good also
* same comments for 5

Rather than explicitly use DEBUG1 everywhere I would have an
#define CUSTODIAN_LOG_LEVEL LOG
so we can run with it in LOG mode and then set it to DEBUG1 with a one line change in a later phase of Beta

I can't really comment with knowledge on sub-patches 0002 to 0004. Perhaps you should aim to get 1, 5, 6 committed first and then return to the others in a later CF/separate thread?

--
Simon Riggs
http://www.EnterpriseDB.com/
Re: Damage control for planner's get_actual_variable_endpoint() runaway
On Tue, 22 Nov 2022 at 18:44, Tom Lane wrote:
>
> I wrote:
> > Still wondering if there's really no CHECK_FOR_INTERRUPT anywhere
> > else in this loop.
>
> I did some experimentation using the test case Jakub presented
> to start with, and verified that that loop does respond promptly
> to control-C even in HEAD. So there are CFI(s) in the loop as
> I thought, and we don't need another.

Thanks for checking. Sorry for not responding earlier.

> What we do need is some more work on nearby comments. I'll
> see about that and push it.

Thanks; nicely streamlined.

--
Simon Riggs
http://www.EnterpriseDB.com/
Re: Hash index build performance tweak from sorting
On Wed, 23 Nov 2022 at 13:04, David Rowley wrote:
> After getting rid of the HashInsertState code and just adding bool
> sorted to _hash_doinsert() and _hash_pgaddtup(), the resulting patch
> is much more simple:

Seems good to me and I wouldn't argue with any of your comments.

> and v4 includes 7 extra lines in hashinsert.c for the Assert() I
> mentioned in my previous email plus a bunch of extra comments.

Oh, I did already include that in v3 as requested.

> I'd rather see this solved like v4 is doing it.

Please do. No further comments. Thanks for your help

--
Simon Riggs
http://www.EnterpriseDB.com/
Re: Allow single table VACUUM in transaction block
On Tue, 22 Nov 2022 at 16:43, Justin Pryzby wrote:
>
> On Mon, Nov 21, 2022 at 03:07:25PM +, Simon Riggs wrote:
> > Attached patch implements VACUUM (BACKGROUND).
> >
> > There are quite a few small details to consider; please read the docs
> > and comments.
> >
> > There is a noticeable delay before the background vacuum starts.
>
> You disallowed some combinations of unsupported options, but not others,
> like FULL, PARALLEL, etc. They should either be supported or
> prohibited.
>
> + /* use default values */
> + tab.at_params.log_min_duration = 0;
>
> 0 isn't the default ?
>
> Maybe VERBOSE should mean to set min_duration=0, otherwise it should use
> the default ?

+1

> You only handle one rel, but ExecVacuum() has a loop around rels.
>
> +NOTICE: autovacuum of "vactst" has been requested, using the options specified
>
> => I don't think it's useful to say "using the options specified".
>
> Should autovacuum de-duplicate requests ?
> BRIN doesn't do that, but it's intended for append-only tables, so the
> issue doesn't really come up.

Easy to do

> Could add psql tab-completion.
>
> Is it going to be confusing that the session's GUC variables won't be
> transmitted to autovacuum ? For example, the freeze and costing
> parameters.

I think we should start with the "how do I want it to behave" parts of the above and leave spelling and tab completion as final items.

Other questions are whether there should be a limit on the number of background vacuums submitted at any time. Whether there should be a GUC that specifies the max number of queued tasks. Do we need a query that shows what items are queued? etc.

Justin, if you wanted to take up the patch from here, I would be more than happy. You have the knowledge and insight to make this work right. We should probably start a new CF patch entry so we can return the original patch as rejected, then continue with this new idea separately.

--
Simon Riggs
http://www.EnterpriseDB.com/
Re: Slow standby snapshot
On Tue, 22 Nov 2022 at 16:53, Tom Lane wrote:
>
> Simon Riggs writes:
> > On Tue, 22 Nov 2022 at 16:28, Tom Lane wrote:
> >> If we do those things, do we need a wasted-work counter at all?
>
> > The wasted work counter works well to respond to heavy read-only
> > traffic and also avoids wasted compressions for write-heavy workloads.
> > So I still like it the best.
>
> This argument presumes that maintenance of the counter is free,
> which it surely is not. I don't know how bad contention on that
> atomically-updated variable could get, but it seems like it could
> be an issue when lots of processes are acquiring snapshots.

I understand. I was assuming that you and Andres liked that approach.

In the absence of that approach, falling back to a counter that compresses every N xids would be best, in addition to the two new forced compression events.

--
Simon Riggs
http://www.EnterpriseDB.com/
Re: Slow standby snapshot
On Tue, 22 Nov 2022 at 16:28, Tom Lane wrote:
>
> Simon Riggs writes:
> > We seem to have replaced one magic constant with another, so not sure
> > if this is autotuning, but I like it much better than what we had
> > before (i.e. better than my prev patch).
>
> Yeah, the magic constant is still magic, even if it looks like it's
> not terribly sensitive to the exact value.
>
> > 1. I was surprised that you removed the limits on size and just had
> > the wasted work limit. If there is no read traffic that will mean we
> > hardly ever compress, which means the removal of xids at commit will
> > get slower over time. I would prefer that we forced compression on a
> > regular basis, such as every time we process an XLOG_RUNNING_XACTS
> > message (every 15s), as well as when we hit certain size limits.
>
> > 2. If there is lots of read traffic but no changes flowing, it would
> > also make sense to force compression when the startup process goes
> > idle rather than wait for the work to be wasted first.
>
> If we do those things, do we need a wasted-work counter at all?
>
> I still suspect that 90% of the problem is the max_connections
> dependency in the existing heuristic, because of the fact that
> you have to push max_connections to the moon before it becomes
> a measurable problem. If we do
>
> -    if (nelements < 4 * PROCARRAY_MAXPROCS ||
> -        nelements < 2 * pArray->numKnownAssignedXids)
> +    if (nelements < 2 * pArray->numKnownAssignedXids)
>
> and then add the forced compressions you suggest, where
> does that put us?

The forced compressions I propose happen
* when idle - since we have time to do it when that happens, which happens often since most workloads are bursty
* every 15s - since we already have the lock, which is overall much less often than every 64 commits, as benchmarked by Michail

I didn't mean to imply that superseded the wasted work approach; it was meant to be in addition to it.
The wasted work counter works well to respond to heavy read-only traffic and also avoids wasted compressions for write-heavy workloads. So I still like it the best.

> Also, if we add more forced compressions, it seems like we should have
> a short-circuit for a forced compression where there's nothing to do.
> So more or less like
>
>     nelements = head - tail;
>     if (!force)
>     {
>         if (nelements < 2 * pArray->numKnownAssignedXids)
>             return;
>     }
>     else
>     {
>         if (nelements == pArray->numKnownAssignedXids)
>             return;
>     }

+1

> I'm also wondering why there's not an
>
>     Assert(compress_index == pArray->numKnownAssignedXids);
>
> after the loop, to make sure our numKnownAssignedXids tracking
> is sane.

+1

--
Simon Riggs
http://www.EnterpriseDB.com/
Re: Damage control for planner's get_actual_variable_endpoint() runaway
On Mon, 21 Nov 2022 at 20:55, Robert Haas wrote:
>
> On Mon, Nov 21, 2022 at 1:17 PM Andres Freund wrote:
> > On November 21, 2022 9:37:34 AM PST, Robert Haas wrote:
> > > On Mon, Nov 21, 2022 at 12:30 PM Andres Freund wrote:
> > > > This can't quite be right - isn't this only applying the limit if we found a
> > > > visible tuple?
> > >
> > > It doesn't look that way to me, but perhaps I'm just too dense to see
> > > the problem?
> >
> > The earlier version didn't have the issue, but the latest seems to only
> > limit after a visible tuple has been found. Note the continue; when
> > fetching a heap tuple fails.
>
> Oh, that line was removed in Simon's patch but not in Jakub's version,
> I guess. Jakub's version also leaves out the last_block = block line
> which seems pretty critical.

New patch version reporting for duty, sir. Please take it from here!

--
Simon Riggs
http://www.EnterpriseDB.com/

0001-Damage-control-for-planner-s-get_actual_variable_end.v3.patch
Description: Binary data
Re: Code checks for App Devs, using new options for transaction behavior
On Mon, 7 Nov 2022 at 14:25, Simon Riggs wrote:
>
> On Wed, 2 Nov 2022 at 07:40, Simon Riggs wrote:
> >
> > > What will be the behavior if someone declares a savepoint with this
> > > name ("_internal_nested_xact"). Will this interfere with this new
> > > functionality?
> >
> > Clearly! I guess you are saying we should disallow them.
> >
> > > Have we tested scenarios like that?
> >
> > No, but that can be done.
>
> More tests as requested, plus minor code rework, plus comment updates.

New versions

--
Simon Riggs
http://www.EnterpriseDB.com/

001_psql_parse_only.v1.patch
Description: Binary data

002_nested_xacts.v11.patch
Description: Binary data

003_rollback_on_commit.v2.patch
Description: Binary data
Re: Damage control for planner's get_actual_variable_endpoint() runaway
On Mon, 21 Nov 2022 at 22:15, Tom Lane wrote: > > Andres Freund writes: > > On 2022-11-21 16:17:56 -0500, Robert Haas wrote: > >> But ... what if they're not? Could the index contain a large number of > >> pages containing just 1 tuple each, or no tuples at all? If so, maybe > >> we can read ten bazillion index pages trying to find each heap tuple > >> and still end up in trouble. > > > ISTM that if you have an index in such a poor condition that a single > > value lookup reads thousands of pages inside the index, planner > > estimates taking long is going to be the smallest of your worries... > > Yeah, that sort of situation is going to make any operation on the > index slow, not only get_actual_variable_endpoint(). That was also my conclusion: this is actually a common antipattern for our indexes, not anything specific to the planner. In another recent situation, I saw a very bad case of performance for a "queue table". In that use case the rows are inserted at head and removed from tail. Searching for the next item to be removed from the queue involves an increasingly long tail search, in the case that a long running transaction prevents us from marking the index entries killed. Many tables exhibit such usage, for example, the neworder table in TPC-C. We optimized the case of frequent insertions into the rightmost index page; now we also need to optimize the case of a long series of deletions from the leftmost index pages. Not sure how, just framing the problem. -- Simon Riggshttp://www.EnterpriseDB.com/
Re: Damage control for planner's get_actual_variable_endpoint() runaway
On Mon, 21 Nov 2022 at 23:34, Robert Haas wrote: > > On Mon, Nov 21, 2022 at 6:17 PM Tom Lane wrote: > > evidence that it's a live problem. API warts are really hard to > > get rid of once instituted. > > Yeah, agreed. Agreed, happy not to; that version was just a WIP to see how it might work. -- Simon Riggs http://www.EnterpriseDB.com/
Re: Damage control for planner's get_actual_variable_endpoint() runaway
On Mon, 21 Nov 2022 at 18:17, Andres Freund wrote: > > Hi, > > On November 21, 2022 9:37:34 AM PST, Robert Haas > wrote: > >On Mon, Nov 21, 2022 at 12:30 PM Andres Freund wrote: > >> This can't quite be right - isn't this only applying the limit if we found > >> a > >> visible tuple? > > > >It doesn't look that way to me, but perhaps I'm just too dense to see > >the problem? > > The earlier version didn't have the issue, but the latest seems to only limit > after a visible tuple has been found. Note the continue; when fetching a heap > tuple fails. Agreed, resolved in this version. Robert, something like this perhaps? limit on both the index and the heap. -- Simon Riggshttp://www.EnterpriseDB.com/ 0001-Damage-control-for-planner-s-get_actual_variable_end.v2.patch Description: Binary data
Re: Damage control for planner's get_actual_variable_endpoint() runaway
On Mon, 21 Nov 2022 at 15:23, Robert Haas wrote: > > On Mon, Nov 21, 2022 at 10:14 AM Simon Riggs > wrote: > > > > What we need is a solution that avoids reading an unbounded number of > > > > tuples under any circumstances. I previously suggested using > > > > SnapshotAny here, but Tom didn't like that. I'm not sure if there are > > > > safety issues there or if Tom was just concerned about the results > > > > being misleading. Either way, maybe there's some variant on that theme > > > > that could work. For instance, could we teach the index scan to stop > > > > if the first 100 tuples that it finds are all invisible? Or to reach > > > > at most 1 page, or at most 10 pages, or something? > > > > > > A hard limit on the number of index pages examined seems like it > > > might be a good idea. > > > > Good, that is what the patch does. > > > > Oh, that's surprisingly simple. Nice! > > Is there any reason to tie this into page costs? I'd be more inclined > to just make it a hard limit on the number of pages. I think that > would be more predictable and less prone to surprising (bad) behavior. > And to be honest I would be inclined to make it quite a small number. > Perhaps 5 or 10. Is there a good argument for going any higher? +1, that makes the patch smaller and the behavior more predictable. (Just didn't want to do anything too simple, in case it looked like a kluge.) -- Simon Riggshttp://www.EnterpriseDB.com/
Re: Damage control for planner's get_actual_variable_endpoint() runaway
On Mon, 21 Nov 2022 at 15:01, Tom Lane wrote: > > > What we need is a solution that avoids reading an unbounded number of > > tuples under any circumstances. I previously suggested using > > SnapshotAny here, but Tom didn't like that. I'm not sure if there are > > safety issues there or if Tom was just concerned about the results > > being misleading. Either way, maybe there's some variant on that theme > > that could work. For instance, could we teach the index scan to stop > > if the first 100 tuples that it finds are all invisible? Or to reach > > at most 1 page, or at most 10 pages, or something? > > A hard limit on the number of index pages examined seems like it > might be a good idea. Good, that is what the patch does. > > If we don't find a > > match, we could either try to use a dead tuple, or we could just > > return false which, I think, would end up using the value from > > pg_statistic rather than any updated value. > > Yeah, the latter seems like the best bet. Yes, just breaking out of the loop is enough to use the default value. > If we do install such a thing, should we undo any of the previous > changes that backed off the reliability of the result? Not sure. -- Simon Riggshttp://www.EnterpriseDB.com/
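The bounded probe discussed in this thread can be sketched as a toy model (not the real planner code): walk index entries in order, spending one unit of budget per new heap page visited; once the small page budget is gone, give up and let the planner fall back to the pg_statistic value. All names here are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct { int heap_page; bool visible; } ToyIndexEntry;

static bool
probe_endpoint(const ToyIndexEntry *entries, int nentries,
               int page_budget, int *endpoint_out)
{
    int last_page = -1;

    for (int i = 0; i < nentries; i++)
    {
        if (entries[i].heap_page != last_page)
        {
            if (page_budget-- <= 0)
                return false;       /* budget spent: caller uses stats instead */
            last_page = entries[i].heap_page;
        }
        if (entries[i].visible)
        {
            *endpoint_out = i;      /* found a live endpoint tuple */
            return true;
        }
    }
    return false;
}

/* self-check helper: probe a fixed toy index (dead tuples on pages 0 and 1,
 * one live tuple on page 2) with the given budget; -1 means "gave up" */
static int
toy_probe(int page_budget)
{
    const ToyIndexEntry entries[] =
        {{0, false}, {0, false}, {1, false}, {1, false}, {2, true}};
    int endpoint = -1;

    if (!probe_endpoint(entries, 5, page_budget, &endpoint))
        return -1;
    return endpoint;
}
```

With a budget of 5 or 10 pages, as Robert suggests, a bloated endpoint region costs at most that many page visits regardless of how many dead tuples precede the first visible one.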
Re: Allow single table VACUUM in transaction block
On Fri, 18 Nov 2022 at 11:54, Simon Riggs wrote: > > On Thu, 17 Nov 2022 at 20:00, Justin Pryzby wrote: > > > > On Wed, Nov 16, 2022 at 05:14:07PM -0500, Greg Stark wrote: > > > I think this requesting autovacuum worker should be a distinct > > > command. Or at least an explicit option to vacuum. > > > > +1. I was going to suggest VACUUM (NOWAIT) .. > > Yes, I have no problem with an explicit command. > > At the moment the patch runs VACUUM in the background in an autovacuum > process, but the call is asynchronous, since we do not wait for the > command to finish (or even start). > > So the command names I was thinking of would be one of these: > > VACUUM (BACKGROUND) or VACUUM (AUTOVACUUM) - which might be clearer > or > VACUUM (ASYNC) - which is more descriptive of the behavior > > or we could go for both > VACUUM (BACKGROUND, ASYNC) - since this allows us to have a > BACKGROUND, SYNC version in the future Attached patch implements VACUUM (BACKGROUND). There are quite a few small details to consider; please read the docs and comments. There is a noticeable delay before the background vacuum starts. -- Simon Riggshttp://www.EnterpriseDB.com/ background_vacuum.v3.patch Description: Binary data
Re: Allow single table VACUUM in transaction block
On Fri, 18 Nov 2022 at 18:26, Tom Lane wrote: > > Robert Haas writes: > > On Fri, Nov 18, 2022 at 7:04 AM Simon Riggs > > wrote: > >> So if consistency is also a strong requirement, then maybe we should > >> make that new command the default, i.e. make VACUUM always just a > >> request to vacuum in background. That way it will be consistent. > > > Since one fairly common reason for running vacuum in the foreground is > > needing to vacuum a table when all autovacuum workers are busy, or > > when they are vacuuming it with a cost limit and it needs to get done > > sooner, I think this would surprise a lot of users in a negative way. > > It would also break a bunch of our regression tests, which expect a > VACUUM to complete immediately. > > >> Can we at least have a vacuum_runs_in_background = on | off, to allow > >> users to take advantage of this WITHOUT needing to rewrite all of > >> their scripts? > > > I'm not entirely convinced that's a good idea, but happy to hear what > > others think. > > I think the answer to that one is flat no. We learned long ago that GUCs > with significant semantic impact on queries are a bad idea. For example, > if a user issues VACUUM expecting behavior A and she gets behavior B > because somebody changed the postgresql.conf entry, she won't be happy. > > Basically, I am not buying Simon's requirement that this be transparent. > I think the downsides would completely outweigh whatever upside there > may be (and given the shortage of prior complaints, I don't think the > upside is very large). Just to say I'm happy with that decision and will switch to the request for a background vacuum. -- Simon Riggshttp://www.EnterpriseDB.com/
Re: Patch: Global Unique Index
On Thu, 17 Nov 2022 at 22:01, Cary Huang wrote: > > Patch: Global Unique Index Let me start by expressing severe doubt on the usefulness of such a feature, but also salute your efforts to contribute. > In other words, a global unique index and a regular partitioned index are > essentially the same in terms of their storage structure except that one can > do cross-partition uniqueness check, the other cannot. This is the only workable architecture, since it allows DETACH to be feasible, which is essential. You don't seem to mention that this would require a uniqueness check on each partition. Is that correct? This would result in O(N) cost of uniqueness checks, severely limiting load speed. I notice you don't offer any benchmarks on load speed or the overhead associated with this, which is not good if you want to be taken seriously, but at least it is recoverable. (It might be necessary to specify some partitions as READ ONLY, to allow us to record their min/max values for the indexed cols, allowing us to do this more quickly.) > - Supported Features - > 1. Global unique index is supported only on btree index type Why? Surely any index type that supports uniqueness is good. > - Not-supported Features - > 1. Global uniqueness check with Sub partition tables is not yet supported as > we do not have immediate use case and it may involve majoy change in current > implementation Hmm, sounds like a problem. Arranging the calls recursively should work. > - Create a global unique index - > To create a regular unique index on a partitioned table, Postgres has to > perform heap scan and sorting on every child partition. Uniqueness check > happens during the sorting phase and will raise an error if multiple tuples > with the same index key are sorted together. To achieve global uniqueness > check, we make Postgres perform the sorting after all of the child partitions > have been scanned instead of on the "sort per partition" fashion. 
In > otherwords, the sorting only happens once at the very end and it sorts the > tuples coming from all the partitions and therefore can ensure global > uniqueness. My feeling is that performance on this will suck so badly that we must warn people away from it, and tell people if they want this, create the index at the start and let it load. Hopefully CREATE INDEX CONCURRENTLY still works. Let's see some benchmarks on this also please. You'll need to think about progress reporting early because correctly reporting the progress and expected run times are likely critical for usability. > Example: > > > CREATE TABLE gidx_part (a int, b int, c text) PARTITION BY RANGE (a); > > CREATE TABLE gidx_part1 partition of gidx_part FOR VALUES FROM (0) TO (10); > > CREATE TABLE gidx_part2 partition of gidx_part FOR VALUES FROM (10) TO (20); > > CREATE UNIQUE INDEX global_unique_idx ON gidx_part USING BTREE(b) GLOBAL; > > INSERT INTO gidx_part values(5, 5, 'test'); > > INSERT INTO gidx_part values(15, 5, 'test'); > ERROR: duplicate key value violates unique constraint "gidx_part1_b_idx" > DETAIL: Key (b)=(5) already exists. Well done. > - DETACH - > Since we retain the same partitioned structure, detaching a partition with > global unique index is straightforward. Upon DETACH, Postgres will change its > relkind from RELKIND_GLOBAL_INDEX to RELKIND_INDEX and remove their > inheritance relationship as usual. It's the only way that works > - Optimizer, query planning and vacuum - > Since no major modification is done on global unique index's structure and > storage, it works in the same way as a regular partitioned index. No major > change is required to be done on optimizer, planner and vacuum process as > they should work in the same way as regular index. Agreed Making a prototype is a great first step. The next step is to understand the good and the bad aspects of it, so you can see what else needs to be done. 
You need to be honest and real about the fact that this may not actually be desirable in practice, or may only be useful in a restricted use case. That means performance analysis of create, load, attach, detach, INSERT, SELECT, UPD/DEL and anything else that might be affected, together with algorithmic analysis of what happens for larger N and larger tables. Expect many versions; set aside many days. Best of luck -- Simon Riggs http://www.EnterpriseDB.com/
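The O(N) uniqueness-check concern raised above can be made concrete with a toy model: a "global" unique index that is physically N per-partition indexes must probe every partition on each INSERT. Here part_has_key() stands in for a per-partition btree descent; all names and data are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define NPARTS 4

static const int part_keys[NPARTS][2] = {
    {1, 2}, {11, 12}, {21, 22}, {31, 32}
};

static bool
part_has_key(int part, int key)
{
    for (int i = 0; i < 2; i++)
        if (part_keys[part][i] == key)
            return true;
    return false;
}

/* Returns how many partition probes were needed to reach a verdict;
 * a novel key always costs NPARTS probes -- O(N) in partition count. */
static int
global_unique_check(int key, bool *is_duplicate)
{
    int probes = 0;

    *is_duplicate = false;
    for (int p = 0; p < NPARTS; p++)
    {
        probes++;                       /* one index probe per partition */
        if (part_has_key(p, key))
        {
            *is_duplicate = true;
            break;
        }
    }
    return probes;
}

static int probes_for(int key) { bool d; return global_unique_check(key, &d); }
static bool is_dup(int key)    { bool d; (void) global_unique_check(key, &d); return d; }
```

This is why load speed suffers: the common case (inserting a fresh key) pays the worst-case cost, which the suggested min/max recording for READ ONLY partitions would let us skip.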
Re: Reducing power consumption on idle servers
On Mon, 21 Nov 2022 at 08:40, Laurenz Albe wrote: > > On Mon, 2022-11-21 at 07:36 +, Simon Riggs wrote: > > On Mon, 21 Nov 2022 at 05:07, Laurenz Albe wrote: > > > > > > On Mon, 2022-11-21 at 10:13 +1300, Thomas Munro wrote: > > > > I'll wait 24 hours before committing, to > > > > provide a last chance for anyone who wants to complain about dropping > > > > promote_trigger_file. > > > > > > Remove "promote_trigger_file"? Now I have never seen anybody use that > > > parameter, but I don't think that it is a good idea to deviate from our > > > usual standard of deprecating a feature for about five years before > > > actually removing it. > > > > We aren't removing the ability to promote, just enforcing a change to > > a better mechanism, hence I don't see a reason for a long(er) > > deprecation period than we have already had. > > We have had a deprecation period? I looked at the documentation, but found > no mention of a deprecation. How hard can it be to leave the GUC and only > poll for the existence of the file if it is set? > > I personally don't need the GUC, and I know nobody who does, Nobody else does either. > I disagree. With the same argument, you could rip out "serial", since we > have supported identity columns since v11. ...and this is not a user-facing change; only HA systems interface with this. > but I think > we should not be cavalier about introducing unnecessary compatibility breaks. I agree we should not be cavalier, nor has anyone been so. The first version of the patch was cautious on this very point, but since many people think we should remove it, it is not a user-facing feature and nobody on this thread knows anybody or anything that uses it, I have changed my mind and now think we should remove it. We have two versions to choose from now:
* v6 deprecates this option
* v10 removes this option
so it is no effort at all to choose either option.
The main issue is about reducing power usage, so please let's move to commit something soon, one way or another. -- Simon Riggs http://www.EnterpriseDB.com/
Re: Reducing power consumption on idle servers
On Sun, 20 Nov 2022 at 22:55, Nathan Bossart wrote: > > On Mon, Nov 21, 2022 at 10:31:15AM +1300, Thomas Munro wrote: > > On Mon, Nov 21, 2022 at 9:00 AM Simon Riggs > > wrote: > >> As a 3rd patch, I will work on making logical workers hibernate. > > > > Duelling patch warning: Nathan mentioned[1] that he's hacking on a > > patch for that, along the lines of the recent walreceiver change IIUC. > > I coded something up last week, but, like the walreceiver patch, it caused > check-world to take much longer [0], and I haven't looked into whether it > could be easily fixed. I'm hoping to make some time for this again in the > near future. > > [0] https://postgr.es/m/20221113222644.GA1269110%40nathanxps13 OK, Nathan, will leave this one to you - remembering that we need to fix ALL processes to get a useful power reduction when idle. -- Simon Riggshttp://www.EnterpriseDB.com/
Re: Reducing power consumption on idle servers
On Mon, 21 Nov 2022 at 05:07, Laurenz Albe wrote: > > On Mon, 2022-11-21 at 10:13 +1300, Thomas Munro wrote: > > I'll wait 24 hours before committing, to > > provide a last chance for anyone who wants to complain about dropping > > promote_trigger_file. > > Remove "promote_trigger_file"? Now I have never seen anybody use that > parameter, but I don't think that it is a good idea to deviate from our > usual standard of deprecating a feature for about five years before > actually removing it. We aren't removing the ability to promote, just enforcing a change to a better mechanism, hence I don't see a reason for a long(er) deprecation period than we have already had. -- Simon Riggshttp://www.EnterpriseDB.com/
Re: Reducing power consumption on idle servers
On Thu, 24 Mar 2022 at 16:21, Robert Haas wrote: > > On Thu, Mar 24, 2022 at 12:02 PM Simon Riggs > > What changes will be acceptable for bgwriter, walwriter and logical worker? > > Hmm, I think it would be fine to introduce some kind of hibernation > mechanism for logical workers. bgwriter and wal writer already have a > hibernation mechanism, so I'm not sure what your concern is there > exactly. In your initial email you said you weren't proposing changes > there, but maybe that changed somewhere in the course of the > subsequent discussion. If you could catch me up on your present > thinking that would be helpful. Now that we seem to have solved the problem for the Startup process, let's circle back to the others. Bgwriter does hibernate currently, but only at 50x the bgwriter_delay. At default values that is 5s, but could be much less. So we need to move that up to 60s, same as others. WALwriter also hibernates currently, but only at 25x the wal_writer_delay. At default values that is 2.5s, but could be much less. So we need to move that up to 60s, same as others. At the same time, make sure that when we hibernate we use a new WaitEvent, similarly to the way bgwriter reports its hibernation state (which also helps test the patch). As a 3rd patch, I will work on making logical workers hibernate. -- Simon Riggs http://www.EnterpriseDB.com/ hibernate_bgwriter_walwriter.v5.patch Description: Binary data
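The timing change described above can be sketched minimally: today each process sleeps at (delay GUC x hibernate factor), and the proposal raises every hibernating sleep to a common 60s floor. The 50x/25x factors come from the message above; the concrete delay values below are illustrative assumptions, not the GUC defaults.

```c
#include <assert.h>

/* Effective hibernation sleep: the factor-scaled delay, clamped up to a
 * common floor so idle sleeps never fall below 60s. */
static int
hibernate_ms(int delay_ms, int factor, int floor_ms)
{
    int sleep_ms = delay_ms * factor;

    return (sleep_ms < floor_ms) ? floor_ms : sleep_ms;
}
```

With small delay GUCs the floor dominates (so wakeups drop to once a minute), while an operator who already configured a very long delay keeps their longer sleep.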
Re: Slow standby snapshot
On Sun, 20 Nov 2022 at 13:45, Michail Nikolaev wrote: > If such approach looks committable - I could do more careful > performance testing to find the best value for > WASTED_SNAPSHOT_WORK_LIMIT_TO_COMPRESS. Nice patch. We seem to have replaced one magic constant with another, so not sure if this is autotuning, but I like it much better than what we had before (i.e. better than my prev patch). A few thoughts: 1. I was surprised that you removed the limits on size and just had the wasted work limit. If there is no read traffic that will mean we hardly ever compress, which means the removal of xids at commit will get slower over time. I would prefer that we forced compression on a regular basis, such as every time we process an XLOG_RUNNING_XACTS message (every 15s), as well as when we hit certain size limits. 2. If there is lots of read traffic but no changes flowing, it would also make sense to force compression when the startup process goes idle rather than wait for the work to be wasted first. Quick patch to add those two compression events also. That should favour the smaller wasted work limits. -- Simon Riggs http://www.EnterpriseDB.com/ events_that_force_compression.v1.patch Description: Binary data
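The combined trigger policy proposed in this message can be sketched as a single decision function. The names and the limit value are placeholders, not the patch's: compress when enough snapshot work has been wasted, and additionally force it on each XLOG_RUNNING_XACTS record (roughly every 15s) or when the startup process goes idle, so write-only and read-only workloads both stay compact.

```c
#include <assert.h>
#include <stdbool.h>

#define WASTED_WORK_LIMIT 1000  /* placeholder for the tunable constant */

static bool
should_compress(long wasted_work, bool saw_running_xacts, bool startup_idle)
{
    if (saw_running_xacts || startup_idle)
        return true;            /* the two forced compression events */
    return wasted_work >= WASTED_WORK_LIMIT;
}
```

The forced events put an upper bound on how stale the array can get under a write-only workload, which is exactly the gap a wasted-work-only trigger leaves open.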
Re: Reducing power consumption on idle servers
On Sat, 19 Nov 2022 at 10:59, Simon Riggs wrote: > New version attached. Fix for doc xref. -- Simon Riggs http://www.EnterpriseDB.com/ hibernate_startup.v10.patch Description: Binary data
Re: Reducing power consumption on idle servers
On Fri, 18 Nov 2022 at 20:26, Thomas Munro wrote: > > On Sat, Nov 19, 2022 at 7:54 AM Simon Riggs > wrote: > > I agree. I can't see a reason to keep it anymore. > > +Use of promote_trigger_file is deprecated. If you're > > I think 'deprecated' usually implies that it still works but you > should avoid it. I think you need something stronger. Whisky? Or maybe just reword the sentence... New version attached. > > I'm nervous about not having any wakeup at all, but since we are > > removing the parameter there is no other reason not to do as Andres > > suggests. > > Why? If we're accidentally relying on this timeout for recovery to > not hang in some situation, that's a bug waiting to be discovered and > fixed and it won't be this patch's fault. > > > New version attached, which assumes that the SIGALRMs are silenced on > > the other thread. > > I tested this + Bharath's v5 from the other thread. meson test > passes, and tracing the recovery process shows that it is indeed, > finally, completely idle. Huzzah! Thanks for testing! Finally completely idle? Good to achieve that. -- Simon Riggshttp://www.EnterpriseDB.com/ hibernate_startup.v9.patch Description: Binary data
Re: Reducing power consumption on idle servers
On Thu, 17 Nov 2022 at 20:38, Robert Haas wrote: > > On Thu, Nov 17, 2022 at 2:55 AM Simon Riggs > wrote: > > No, it will have a direct effect only on people using promote_trigger_file > > who do not read and act upon the deprecation notice before upgrading > > by making a one line change to their failover scripts. > > TBH, I wonder if we shouldn't just nuke promote_trigger_file. > pg_promote was added in 2018, and pg_ctl promote was added in 2011. So > even if we haven't said promote_trigger_file was deprecated, it hasn't > been the state-of-the-art way of doing this in a really long time. If > we think that there are still a lot of people using it, or if popular > tools are relying on it, then perhaps a deprecation period is > warranted anyway. But I think we should at least consider the > possibility that it's OK to just get rid of it right away. I agree. I can't see a reason to keep it anymore. I'm nervous about not having any wakeup at all, but since we are removing the parameter there is no other reason not to do as Andres suggests. New version attached, which assumes that the SIGALRMs are silenced on the other thread. -- Simon Riggshttp://www.EnterpriseDB.com/ hibernate_startup.v8.patch Description: Binary data
Re: Allow single table VACUUM in transaction block
On Thu, 17 Nov 2022 at 20:06, Robert Haas wrote: > > On Wed, Nov 16, 2022 at 5:14 PM Greg Stark wrote: > > However I'm not a fan of commands that sometimes do one thing and > > sometimes magically do something very different. I don't like the idea > > that the same vacuum command would sometimes run in-process and > > sometimes do this out of process request. And the rules for when it > > does which are fairly complex to explain -- it runs in process unless > > you're in a transaction when it runs out of process unless it's a > > temporary table ... > > 100% agree. I agree as well. At the moment, the problem (OT) is that VACUUM behaves inconsistently:
Outside a transaction - works perfectly
In a transaction - throws ERROR, which prevents a whole script from executing correctly
What we are trying to do is avoid the ERROR. I don't want them to behave like this, but that's the only option possible to avoid ERROR. So if consistency is also a strong requirement, then maybe we should make that new command the default, i.e. make VACUUM always just a request to vacuum in background. That way it will be consistent. Can we at least have a vacuum_runs_in_background = on | off, to allow users to take advantage of this WITHOUT needing to rewrite all of their scripts? -- Simon Riggs http://www.EnterpriseDB.com/
Re: Allow single table VACUUM in transaction block
On Thu, 17 Nov 2022 at 20:00, Justin Pryzby wrote: > > On Wed, Nov 16, 2022 at 05:14:07PM -0500, Greg Stark wrote: > > I think this requesting autovacuum worker should be a distinct > > command. Or at least an explicit option to vacuum. > > +1. I was going to suggest VACUUM (NOWAIT) .. Yes, I have no problem with an explicit command. At the moment the patch runs VACUUM in the background in an autovacuum process, but the call is asynchronous, since we do not wait for the command to finish (or even start). So the command names I was thinking of would be one of these:
VACUUM (BACKGROUND) or VACUUM (AUTOVACUUM) - which might be clearer
or
VACUUM (ASYNC) - which is more descriptive of the behavior
or we could go for both:
VACUUM (BACKGROUND, ASYNC) - since this allows us to have a BACKGROUND, SYNC version in the future.
Thoughts? -- Simon Riggs http://www.EnterpriseDB.com/
Re: Reducing power consumption on idle servers
On Fri, 18 Nov 2022 at 08:55, Bharath Rupireddy wrote: > However, I'm a bit > worried about how it'll affect the tools/providers/extensions that > depend on it. Who is that? Which ones depend upon it? -- Simon Riggs http://www.EnterpriseDB.com/
Re: SUBTRANS: Minimizing calls to SubTransSetParent()
On Thu, 17 Nov 2022 at 20:29, Tomas Vondra wrote: > > On 11/17/22 18:29, Simon Riggs wrote: > > On Thu, 17 Nov 2022 at 17:04, Simon Riggs > > wrote: > >> > > 003 includes the idea to not-always do SubTransSetParent() > > > I'm a bit confused by the TransactionIdsAreOnSameXactPage naming. Isn't > this really checking clog pages? Yes, it checks clog pages. I named it to match pg_xact, the clog's new name. -- Simon Riggs http://www.EnterpriseDB.com/
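A check like the one being discussed reduces to comparing the xid-to-page mapping of the two xids. A hedged sketch follows; the 32768 xids-per-page figure assumes 8 kB pages and two status bits per transaction, and should be treated as an assumption of this sketch rather than a quote from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

#define CLOG_XACTS_PER_PAGE 32768  /* assumed: BLCKSZ (8192) * 4 xacts/byte */

/* Two xids are on the same clog (pg_xact) page iff their page numbers match. */
static bool
xids_on_same_xact_page(TransactionId a, TransactionId b)
{
    return a / CLOG_XACTS_PER_PAGE == b / CLOG_XACTS_PER_PAGE;
}
```

The interesting behavior is at page boundaries: xids 32767 and 32768 are adjacent but live on different pages.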
Re: SUBTRANS: Minimizing calls to SubTransSetParent()
On Wed, 16 Nov 2022 at 15:44, Tom Lane wrote: > > Simon Riggs writes: > > On Wed, 16 Nov 2022 at 00:09, Tom Lane wrote: > >> No, that's not what XidInMVCCSnapshot does. If snapshot->suboverflowed > >> is set (ie, somebody somewhere/somewhen overflowed), then it does > >> SubTransGetTopmostTransaction and searches only the xips with the result. > >> This behavior requires that all live subxids be correctly mapped by > >> SubTransGetTopmostTransaction, or we'll draw false conclusions. > > > Your comments are correct wrt to the existing coding, but not to the > > patch, which is coded as described and does not suffer those issues. > > Ah, OK. > > Still ... I really do not like this patch. It introduces a number of > extremely fragile assumptions, and I think those would come back to > bite us someday, even if they hold now which I'm still unsure about. Completely understand. It took me months to think this through. > It doesn't help that you've chosen to document them only at the place > making them and not at the place(s) likely to break them. Yes, apologies for that, I focused on the holistic explanation in the slides. > Also, to be blunt, this is not Ready For Committer. It's more WIP, > because even if the code is okay there are comments all over the system > that you've invalidated. (At the very least, the file header comments > in subtrans.c and the comments in struct SnapshotData need work; I've > not looked hard but I'm sure there are more places with comments > bearing on these data structures.) New version with greatly improved comments coming very soon. > Perhaps it would be a good idea to split up the patch. The business > about making pg_subtrans flat rather than a tree seems like a good > idea in any event, although as I said it doesn't seem like we've got > a fleshed-out version of that here. We could push forward on getting > that done and then separately consider the rest of it. 
Yes, I thought you might ask that, so after some thought I have found a clean way to do it and have split this into two parts. Julien has agreed to do further perf tests and is working on that now. I will post new versions soon, tomorrow at the earliest. -- Simon Riggs http://www.EnterpriseDB.com/
Re: Reducing power consumption on idle servers
On Thu, 17 Nov 2022 at 07:36, Bharath Rupireddy wrote: > > promote_trigger_file is not tested and there are better ways, so > > deprecating it in this release is fine. > > Hm, but.. > > > Anyone that relies on it can update their mechanisms to a supported > > one with a one-line change. Realistically, anyone using it won't be on > > the latest release anyway, at least for a long time, since if they use > > manual methods then they are well behind the times. > > I may be overly pessimistic here - the change from 5 sec to 60 sec for > detecting promote_trigger_file will have a direct impact on failovers > I believe. No, it will have a direct effect only on people using promote_trigger_file who do not read and act upon the deprecation notice before upgrading by making a one line change to their failover scripts. Since pretty much everyone doing HA uses external HA software (cloud or otherwise) this shouldn't affect anyone. -- Simon Riggshttp://www.EnterpriseDB.com/
Re: Reducing power consumption on idle servers
On Wed, 16 Nov 2022 at 12:47, Bharath Rupireddy wrote: > > On Wed, Nov 16, 2022 at 2:34 PM Simon Riggs > wrote: > > > > Reposting v6 now so that patch tester doesn't think this has failed > > when the patch on other thread gets applied. > > Intention of the patch, that is, to get rid of promote_trigger_file > GUC sometime in future, looks good to me. However, the timeout change > to 60 sec from 5 sec seems far-reaching. While it discourages the use > of the GUC, it can impact many existing production servers that still > rely on promote_trigger_file as it can increase the failover times > impacting SLAs around failover. The purpose of 60s is to allow for power reduction, so 5s won't do. promote_trigger_file is not tested and there are better ways, so deprecating it in this release is fine. Anyone that relies on it can update their mechanisms to a supported one with a one-line change. Realistically, anyone using it won't be on the latest release anyway, at least for a long time, since if they use manual methods then they are well behind the times. -- Simon Riggshttp://www.EnterpriseDB.com/
Re: Reducing power consumption on idle servers
On Sun, 13 Nov 2022 at 23:07, Simon Riggs wrote: > > On Sun, 13 Nov 2022 at 21:28, Thomas Munro wrote: > > > > On Mon, Nov 14, 2022 at 5:52 AM Simon Riggs > > wrote: > > > The attached patch is a reduced version of the original. It covers only: > > > * deprecation of the promote_trigger_file - there are no tests that > > > use that, hence why there is no test coverage for the patch > > > * changing the sleep time of the startup process to 60s > > > * docs and comments > > > > LPGTM. If we also fix the bogus SIGALRM wakeups[1], then finally a > > completely idle recovery process looks like: > > > > kevent(8,0x0,0,{ },1,{ 60.0 }) = 0 (0x0) > > kevent(8,0x0,0,{ },1,{ 60.0 }) = 0 (0x0) > > kevent(8,0x0,0,{ },1,{ 60.0 }) = 0 (0x0) > > > > Presumably it would have no timeout at all in the next release. > > > > [1] > > https://www.postgresql.org/message-id/calj2acuiyn+zmpguvmgeoy1u7ino2qsvqrnufk8swpvk3a8...@mail.gmail.com > > Clearly, I haven't been watching Hackers! Thanks for the nudge. > > See if this does the trick? Looks like the SIGALRM wakeups will be silenced on the other thread, so we can just revert back to using v6 of this patch. Reposting v6 now so that patch tester doesn't think this has failed when the patch on other thread gets applied. -- Simon Riggshttp://www.EnterpriseDB.com/ hibernate_startup.v6.patch Description: Binary data
Re: when the startup process doesn't (logging startup delays)
On Wed, 16 Nov 2022 at 06:47, Bharath Rupireddy wrote: > > On Tue, Nov 15, 2022 at 10:55 PM Robert Haas wrote: > > > > On Tue, Nov 15, 2022 at 8:33 AM Bharath Rupireddy > > wrote: > > > Please review the v2 patch. > > > > It seems to me that this will call disable_startup_progress_timeout > > once per WAL record, which seems like an unnecessary expense. How > > about leaving the code inside the loop just as we have it, and putting > > if (StandbyMode) disable_startup_progress_timeout() before entering > > the loop? > > That can be done, only if we can disable the timeout in another place > when the StandbyMode is set to true in ReadRecord(), that is, after > the standby server finishes crash recovery and enters standby mode. > > I'm attaching the v3 patch for further review. Please find the CF > entry here - https://commitfest.postgresql.org/41/4012/. begin_startup_progress_phase() checks to see if feature is disabled twice, so I think you can skip the check and just rely on the check in enable(). Otherwise, all good. -- Simon Riggshttp://www.EnterpriseDB.com/
Re: Hash index build performance tweak from sorting
n do this in > the code, but I feel like we do it less often in newer code. e.g we do > it in aset.c but not generation.c (which is much newer than aset.c). > My personal preference would be just to name the struct > HashInsertState and have no extra pointer typedefs. Not done, but not disagreeing either, just not very comfortable actually making those changes. -- Simon Riggs http://www.EnterpriseDB.com/ hash_inserted_sorted.v3.patch Description: Binary data
Re: SUBTRANS: Minimizing calls to SubTransSetParent()
On Wed, 16 Nov 2022 at 00:09, Tom Lane wrote: > > Simon Riggs writes: > > On Tue, 15 Nov 2022 at 21:03, Tom Lane wrote: > >> The subxidStatus.overflowed check quoted above has a similar sort > >> of myopia: it's checking whether our current transaction has > >> already suboverflowed. But (a) that doesn't prove it won't suboverflow > >> later, and (b) the relevant logic in XidInMVCCSnapshot needs to run > >> SubTransGetTopmostTransaction if *any* proc in the snapshot has > >> suboverflowed. > > > Not the way it is coded now. > > > First, we search the subxid cache in snapshot->subxip. > > Then, and only if the snapshot overflowed (i.e. ANY xact overflowed), > > do we check subtrans. > > No, that's not what XidInMVCCSnapshot does. If snapshot->suboverflowed > is set (ie, somebody somewhere/somewhen overflowed), then it does > SubTransGetTopmostTransaction and searches only the xips with the result. > This behavior requires that all live subxids be correctly mapped by > SubTransGetTopmostTransaction, or we'll draw false conclusions. Your comments are correct wrt to the existing coding, but not to the patch, which is coded as described and does not suffer those issues. > We could perhaps make it do what you suggest, Already done in the patch since v5. > but that would require > a complete performance analysis to make sure we're not giving up > more than we would gain. I agree that a full performance analysis is sensible and an objective analysis has been performed by Julien Tachoires. This is described here, along with other explanations: https://docs.google.com/presentation/d/1A7Ar8_LM5EdC2OHL_j3U9J-QwjMiGw9mmXeBLJOmFlg/edit?usp=sharing It is important to understand the context here: there is already a well documented LOSS of performance with the current coding. The patch alleviates that, and I have not been able to find a performance case where there is any negative impact. Further tests welcome. 
> Also, both GetSnapshotData and CopySnapshot assume that the subxips > array is not used if suboverflowed is set, and don't bother > (continuing to) populate it. So we would need code changes and > additional cycles in those areas too. Already done in the patch since v5. Any additional cycles apply only to the case of snapshot overflow, which currently performs very badly. > I'm not sure about your other claims, but I'm pretty sure this one > point is enough to kill the patch. Then please look again because there are misunderstandings above. -- Simon Riggshttp://www.EnterpriseDB.com/
Re: Slow standby snapshot
On Wed, 16 Nov 2022 at 00:15, Tom Lane wrote: > > Andres Freund writes: > > On 2022-11-15 23:14:42 +, Simon Riggs wrote: > >> Hence more frequent compression is effective at reducing the overhead. > >> But too frequent compression slows down the startup process, which > >> can't then keep up. > >> So we're just looking for an optimal frequency of compression for any > >> given workload. > > > What about making the behaviour adaptive based on the amount of wasted > > effort > > during those two operations, rather than just a hardcoded "emptiness" > > factor? > > Not quite sure how we could do that, given that those things aren't even > happening in the same process. But yeah, it does feel like the proposed > approach is only going to be optimal over a small range of conditions. I have not been able to think of a simple way to autotune it. > > I don't think the xids % KAX_COMPRESS_FREQUENCY == 0 filter is a good idea - > > if you have a workload with plenty subxids you might end up never > > compressing > > because xids divisible by KAX_COMPRESS_FREQUENCY will end up as a subxid > > most/all of the time. > > Yeah, I didn't think that was too safe either. > It'd be more reliable > to use a static counter to skip all but every N'th compress attempt > (something we could do inside KnownAssignedXidsCompress itself, instead > of adding warts at the call sites). I was thinking exactly that myself, for the reason of keeping it all inside KnownAssignedXidsCompress(). -- Simon Riggshttp://www.EnterpriseDB.com/
Re: Slow standby snapshot
On Tue, 15 Nov 2022 at 23:06, Tom Lane wrote: > > BTW, while nosing around this code I came across this statement > (procarray.c, about line 4550 in HEAD): > > * ... To add XIDs to the array, we just insert > * them into slots to the right of the head pointer and then advance the head > * pointer. This wouldn't require any lock at all, except that on machines > * with weak memory ordering we need to be careful that other processors > * see the array element changes before they see the head pointer change. > * We handle this by using a spinlock to protect reads and writes of the > * head/tail pointers. (We could dispense with the spinlock if we were to > * create suitable memory access barrier primitives and use those instead.) > * The spinlock must be taken to read or write the head/tail pointers unless > * the caller holds ProcArrayLock exclusively. > > Nowadays we've *got* those primitives. Can we get rid of > known_assigned_xids_lck, and if so would it make a meaningful > difference in this scenario? I think you could do that *as well*, since it does act as an overhead but that is not related to the main issues: * COMMITs: xids are removed from the array by performing a binary search - this gets more and more expensive as the array gets wider * SNAPSHOTs: scanning the array for snapshots becomes more expensive as the array gets wider Hence more frequent compression is effective at reducing the overhead. But too frequent compression slows down the startup process, which can't then keep up. So we're just looking for an optimal frequency of compression for any given workload. -- Simon Riggshttp://www.EnterpriseDB.com/
Re: SUBTRANS: Minimizing calls to SubTransSetParent()
On Tue, 15 Nov 2022 at 21:03, Tom Lane wrote: > > Simon Riggs writes: > > New version attached. > > I looked at this patch and I do not see how it can possibly be safe. I grant you it is complex, so please bear with me. > The most direct counterexample arises from the fact that > HeapCheckForSerializableConflictOut checks SubTransGetTopmostTransaction > in some cases. You haven't tried to analyze when, but just disabled > the optimization in serializable mode: > > + * 2. When IsolationIsSerializable() we sometimes need to access > topxid. > + *This occurs only when SERIALIZABLE is requested by app user. > +... > +if (MyProc->subxidStatus.overflowed || > +IsolationIsSerializable() || > > However, what this is checking is whether *our current transaction* > is serializable. If we're not serializable, but other transactions > in the system are, then this fails to store information that they'll > need for correctness. You'd have to have some way of knowing that > no transaction in the system is using serializable mode, and that > none will start to do so while this transaction is still in-doubt. Not true. src/backend/storage/lmgr/README-SSI says this... * Any transaction which is run at a transaction isolation level other than SERIALIZABLE will not be affected by SSI. If you want to enforce business rules through SSI, all transactions should be run at the SERIALIZABLE transaction isolation level, and that should probably be set as the default. If HeapCheckForSerializableConflictOut() cannot find a subxid's parent then it will not be involved in serialization errors. So skipping the storage of subxids in subtrans for non-serializable xacts is valid for both SSI and non-SSI xacts. Thus the owning transaction can decide to skip the insert into subtrans if it is not serializable. > I fear that's already enough to kill the idea; but there's more. 
> The subxidStatus.overflowed check quoted above has a similar sort > of myopia: it's checking whether our current transaction has > already suboverflowed. But (a) that doesn't prove it won't suboverflow > later, and (b) the relevant logic in XidInMVCCSnapshot needs to run > SubTransGetTopmostTransaction if *any* proc in the snapshot has > suboverflowed. Not the way it is coded now. First, we search the subxid cache in snapshot->subxip. Then, and only if the snapshot overflowed (i.e. ANY xact overflowed), do we check subtrans. Thus, the owning xact knows that anyone else will find the first 64 xids in the subxid cache, so it need not insert them into subtrans, even if someone else overflowed. When the owning xact overflows, it knows it must now insert the subxid into subtrans before the xid is used anywhere in storage, which the patch does. This allows each owning xact to decide what to do, independent of the actions of others. > Lastly, I don't see what the "transaction on same page" business > has got to do with anything. The comment is certainly failing > to make the case that it's safe to skip filling subtrans when that > is true. That seems strange, I grant you. It's the same logic that is used in TransactionIdSetTreeStatus(), in reverse. I understand it 'cos I wrote it. TRANSACTION_STATUS_SUB_COMMITTED is only ever used if the topxid and subxid are on different pages. Therefore TransactionIdDidCommit() won't ever see a value of TRANSACTION_STATUS_SUB_COMMITTED unless they are on separate pages. So the owning transaction can predict in advance whether anyone will ever call SubTransGetParent() for one of its xids. If they might, then we record the values just in case. If they NEVER will, then we can skip recording them. And just to be clear, all 3 of the above preconditions must be true before the owning xact decides to skip writing a subxid to subtrans. 
> I think we could salvage this small idea: > > + * Insert entries into subtrans for this xid, noting that the > entry > + * points directly to the topxid, not the immediate parent. This > is > + * done for two reasons: > + * (1) so it is faster in a long chain of subxids, because the > + * algorithm is then O(1), no matter how many subxids are > assigned. > > but some work would be needed to update the comments around > SubTransGetParent and SubTransGetTopmostTransaction to explain that > they're no longer reliably different. I think that that is okay for > the existing use-cases, but they'd better be documented. In fact, > couldn't we simplify them down to one function? Given the restriction > that we don't look back in pg_subtrans further than TransactionXmin, > I don't think that updated code would ever need to resolve cases > written by older code. So we could remove the recursive checks > entirely, o
Re: when the startup process doesn't (logging startup delays)
On Tue, 15 Nov 2022 at 13:33, Bharath Rupireddy wrote: > > On Mon, Nov 14, 2022 at 9:31 PM Robert Haas wrote: > > > > On Mon, Nov 14, 2022 at 7:37 AM Simon Riggs > > wrote: > > > > Whilte at it, I noticed that we report redo progress for PITR, but we > > > > don't report when standby enters archive recovery mode, say due to a > > > > failure in the connection to primary or after the promote signal is > > > > found. Isn't it useful to report in this case as well to know the > > > > recovery progress? > > > > > > I think your patch disables progress too early, effectively turning > > > off the standby progress feature. The purpose was to report on things > > > that take long periods during recovery, not just prior to recovery. > > > > > > I would advocate that we disable progress only while waiting, as I've > > > done here: > > > https://www.postgresql.org/message-id/CANbhV-GcWjZ2cmj0uCbZDWQUHnneMi_4EfY3dVWq0-yD5o7Ccg%40mail.gmail.com > > > > Maybe I'm confused here, but I think that, on a standby, startup > > progress messages are only printed until the main redo loop is > > reached. Otherwise, we would print a message on a standby every 10s > > forever, which seems like a thing that most users would not like. So I > > think that Bharath has the right idea here. > > Yes, the idea is to disable the timeout on standby completely since we > actually don't report any recovery progress. Keeping it enabled, > unnecessarily calls startup_progress_timeout_handler() every > log_startup_progress_interval seconds i.e. 10 seconds. That's the > intention of the patch. As long as we don't get the SIGALRMs that Thomas identified, then I'm happy. -- Simon Riggshttp://www.EnterpriseDB.com/
Re: New docs chapter on Transaction Management and related changes
On Tue, 8 Nov 2022 at 03:41, Bruce Momjian wrote: > > On Mon, Nov 7, 2022 at 10:58:05AM +, Simon Riggs wrote: > > What I've posted is the merged patch, i.e. your latest patch, plus > > changes to RELEASE SAVEPOINT from you on Oct 16, plus changes based on > > the later comments from Robert and I. > > Thanks. I have two changes to your patch. First, I agree "destroy" is > the wrong word for this, but I don't think "subcommit" is good, for > three reasons: > > 1. Release merges the non-aborted changes into the previous transaction > _and_ frees their resources --- "subcommit" doesn't have both meanings, > which I think means if we need a single word, we should use "release" > and later define what that means. > > 2. The "subcommit" concept doesn't closely match the user-visible > behavior, even though we use subtransactions to accomplish this. Release > is more of a rollup/merge into the previously-active > transaction/savepoint. > > 3. "subcommit" is an implementation detail that I don't think we should > expose to users in the manual pages. I don't understand this - you seem to be presuming that "subcommit" means something different and then objecting to that difference. For me, "Subcommit" exactly matches what is happening because the code comments and details already use Subcommit in exactly this way. The main purpose of this patch is to talk about what is happening using the same language as we do in the code. The gap between the code and the docs isn't helping anyone. > I adjusted the first paragraph of RELEASE SAVEPOINT to highlight the > above issues. My original patch had similar wording. > > The first attachment shows my changes to your patch, and the second > attachment is my full patch. OK, though this makes the patch tester look like this doesn't apply. -- Simon Riggshttp://www.EnterpriseDB.com/
Re: SUBTRANS: Minimizing calls to SubTransSetParent()
On Mon, 7 Nov 2022 at 21:14, Simon Riggs wrote: > These results are compelling, thank you. > > Setting this to Ready for Committer. New version attached. -- Simon Riggs http://www.EnterpriseDB.com/ 002_minimize_calls_to_SubTransSetParent.v12.patch Description: Binary data
Re: Allow single table VACUUM in transaction block
On Mon, 14 Nov 2022 at 19:52, Simon Riggs wrote: > > On Tue, 8 Nov 2022 at 03:10, Simon Riggs wrote: > > > > On Mon, 7 Nov 2022 at 08:20, Simon Riggs > > wrote: > > > > > Temp tables are actually easier, since we don't need any of the > > > concurrency features we get with lazy vacuum. > > > Thoughts? > > New patch, which does this, when in a xact block > > 1. For temp tables, only VACUUM FULL is allowed > > 2. For persistent tables, an AV task is created to perform the vacuum, > which eventually performs a vacuum > > The patch works, but there are various aspects of the design that need > input. Thoughts? New version. -- Simon Riggs http://www.EnterpriseDB.com/ single_table_vacuum_in_xact.v4.patch Description: Binary data
Re: Allow single table VACUUM in transaction block
On Tue, 8 Nov 2022 at 03:10, Simon Riggs wrote: > > On Mon, 7 Nov 2022 at 08:20, Simon Riggs wrote: > > > Temp tables are actually easier, since we don't need any of the > > concurrency features we get with lazy vacuum. > Thoughts? New patch, which does this, when in a xact block 1. For temp tables, only VACUUM FULL is allowed 2. For persistent tables, an AV task is created to perform the vacuum, which eventually performs a vacuum The patch works, but there are various aspects of the design that need input. Thoughts? -- Simon Riggs http://www.EnterpriseDB.com/ single_table_vacuum_in_xact.v3.patch Description: Binary data
Re: when the startup process doesn't (logging startup delays)
On Tue, 8 Nov 2022 at 12:33, Bharath Rupireddy wrote: > > On Tue, Nov 8, 2022 at 4:35 PM Thomas Munro wrote: > > > > On Sat, Oct 30, 2021 at 7:44 AM Robert Haas wrote: > > > Committed. > > > > Is it expected that an otherwise idle standby's recovery process > > receives SIGALRM every N seconds, or should the timer be canceled at > > that point, as there is no further progress to report? > > Nice catch. Yeah, that seems unnecessary, see the below standby logs. > I think we need to disable_timeout(STARTUP_PROGRESS_TIMEOUT, false);, > something like the attached? I think there'll be no issue with the > patch since the StandbyMode gets reset only at the end of recovery (in > FinishWalRecovery()) but it can very well be set during recovery (in > ReadRecord()). Note that I also added an assertion in > has_startup_progress_timeout_expired(), just in case. > > 2022-11-08 11:28:23.563 UTC [980909] LOG: SIGALRM handle_sig_alarm received > 2022-11-08 11:28:23.563 UTC [980909] LOG: > startup_progress_timeout_handler called > 2022-11-08 11:28:33.563 UTC [980909] LOG: SIGALRM handle_sig_alarm received > 2022-11-08 11:28:33.563 UTC [980909] LOG: > startup_progress_timeout_handler called > 2022-11-08 11:28:43.563 UTC [980909] LOG: SIGALRM handle_sig_alarm received > 2022-11-08 11:28:43.563 UTC [980909] LOG: > startup_progress_timeout_handler called > 2022-11-08 11:28:53.563 UTC [980909] LOG: SIGALRM handle_sig_alarm received > 2022-11-08 11:28:53.563 UTC [980909] LOG: > startup_progress_timeout_handler called > > Whilte at it, I noticed that we report redo progress for PITR, but we > don't report when standby enters archive recovery mode, say due to a > failure in the connection to primary or after the promote signal is > found. Isn't it useful to report in this case as well to know the > recovery progress? I think your patch disables progress too early, effectively turning off the standby progress feature. 
The purpose was to report on things that take long periods during recovery, not just prior to recovery. I would advocate that we disable progress only while waiting, as I've done here: https://www.postgresql.org/message-id/CANbhV-GcWjZ2cmj0uCbZDWQUHnneMi_4EfY3dVWq0-yD5o7Ccg%40mail.gmail.com -- Simon Riggshttp://www.EnterpriseDB.com/
Re: Reducing power consumption on idle servers
On Sun, 13 Nov 2022 at 21:28, Thomas Munro wrote: > > On Mon, Nov 14, 2022 at 5:52 AM Simon Riggs > wrote: > > The attached patch is a reduced version of the original. It covers only: > > * deprecation of the promote_trigger_file - there are no tests that > > use that, hence why there is no test coverage for the patch > > * changing the sleep time of the startup process to 60s > > * docs and comments > > LPGTM. If we also fix the bogus SIGALRM wakeups[1], then finally a > completely idle recovery process looks like: > > kevent(8,0x0,0,{ },1,{ 60.0 }) = 0 (0x0) > kevent(8,0x0,0,{ },1,{ 60.0 }) = 0 (0x0) > kevent(8,0x0,0,{ },1,{ 60.0 }) = 0 (0x0) > > Presumably it would have no timeout at all in the next release. > > [1] > https://www.postgresql.org/message-id/calj2acuiyn+zmpguvmgeoy1u7ino2qsvqrnufk8swpvk3a8...@mail.gmail.com Clearly, I haven't been watching Hackers! Thanks for the nudge. See if this does the trick? -- Simon Riggshttp://www.EnterpriseDB.com/ hibernate_startup.v7.patch Description: Binary data
Re: Reducing power consumption on idle servers
On Thu, 24 Mar 2022 at 16:02, Simon Riggs wrote: > > On Thu, 24 Mar 2022 at 15:39, Robert Haas wrote: > > > > On Thu, Mar 24, 2022 at 6:59 AM Simon Riggs > > wrote: > > > The proposals of this patch are the following, each of which can be > > > independently accepted/rejected: > > > 1. fix the sleep pattern of bgwriter, walwriter and logical worker > > > (directly affects powersave) > > > 2. deprecate promote_trigger_file, which then allows us to fix the > > > sleep for startup process (directly affects powersave) > > > 3. treat hibernation in all procs the same, for simplicity, and to > > > make sure we don't forget one later > > > 4. provide a design pattern for background worker extensions to > > > follow, so as to encourage powersaving > > > > Unfortunately, the patch isn't split in a way that corresponds to this > > division. I think I'm +1 on change #2 -- deprecating > > promote_trigger_file seems like a good idea to me independently of any > > power-saving considerations. But I'm not sure that I am on board with > > any of the other changes. > > OK, so change (2) is good. Separate patch attached. Thanks to Ian for prompting me to pick up this thread again; apologies for getting distracted. The attached patch is a reduced version of the original. It covers only: * deprecation of the promote_trigger_file - there are no tests that use that, hence why there is no test coverage for the patch * changing the sleep time of the startup process to 60s * docs and comments Other points will be discussed in another branch of this thread. -- Simon Riggshttp://www.EnterpriseDB.com/ hibernate_startup.v6.patch Description: Binary data
Re: New docs chapter on Transaction Management and related changes
On Mon, 7 Nov 2022 at 22:04, Laurenz Albe wrote: > > On Sat, 2022-11-05 at 10:08 +, Simon Riggs wrote: > > Agreed; new compilation patch attached, including mine and then > > Robert's suggested rewordings. > > Thanks. There is clearly a lot of usefule information in this. > > Some comments: > > > --- a/doc/src/sgml/func.sgml > > +++ b/doc/src/sgml/func.sgml > > @@ -24673,7 +24673,10 @@ SELECT collation for ('foo' COLLATE "de_DE"); > > > > Returns the current transaction's ID. It will assign a new one if > > the > > current transaction does not have one already (because it has not > > -performed any database updates). > > +performed any database updates); see > +linkend="transaction-id"/> for details. If executed in a > > +subtransaction this will return the top-level xid; see > +linkend="subxacts"/> for details. > > > > > > I would use a comma after "subtransaction", and I think it would be better to > write > "transaction ID" instead of "xid". > > > @@ -24690,6 +24693,7 @@ SELECT collation for ('foo' COLLATE "de_DE"); > > ID is assigned yet. (It's best to use this variant if the > > transaction > > might otherwise be read-only, to avoid unnecessary consumption of > > an > > XID.) > > +If executed in a subtransaction this will return the top-level xid. > > > > > > Same as above. > > > @@ -24733,6 +24737,8 @@ SELECT collation for ('foo' COLLATE "de_DE"); > > > > Returns a current snapshot, a data structure > > showing which transaction IDs are now in-progress. > > +Only top-level xids are included in the snapshot; subxids are not > > +shown; see for details. > > > > > > Again, I would avoid "xid" and "subxid", or at least use "transaction ID > (xid)" > and similar. > > > --- a/doc/src/sgml/ref/release_savepoint.sgml > > +++ b/doc/src/sgml/ref/release_savepoint.sgml > > @@ -34,23 +34,16 @@ RELEASE [ SAVEPOINT ] > > savepoint_name > >Description > > > > > > - RELEASE SAVEPOINT destroys a savepoint previously > > defined > > - in the current transaction. 
> > + RELEASE SAVEPOINT will subcommit the subtransaction > > + established by the named savepoint, if one exists. This will release > > + any resources held by the subtransaction. If there were any > > + subtransactions of the named savepoint, these will also be subcommitted. > > > > > > > > "Subtransactions of the named savepoint" is somewhat confusing; how about > "subtransactions of the subtransaction established by the named savepoint"? > > If that is too long and explicit, perhaps "subtransactions of that > subtransaction". > > > @@ -78,7 +71,7 @@ RELEASE [ SAVEPOINT ] > > savepoint_name > > > > > > It is not possible to release a savepoint when the transaction is in > > - an aborted state. > > + an aborted state, to do that use . > > > > > > > > I think the following is more English: > "It is not possible ... state; to do that, use ." > > > --- a/doc/src/sgml/ref/rollback.sgml > > +++ b/doc/src/sgml/ref/rollback.sgml > > @@ -56,11 +56,14 @@ ROLLBACK [ WORK | TRANSACTION ] [ AND [ NO ] CHAIN ] > > AND CHAIN > > > > > > - If AND CHAIN is specified, a new transaction is > > + If AND CHAIN is specified, a new unaborted > > transaction is > >immediately started with the same transaction characteristics (see > > >linkend="sql-set-transaction"/>) as the just finished one. > > Otherwise, > >no new transaction is started. > > I don't think that is an improvement. "Unaborted" is an un-word. A new > transaction > is always "unaborted", isn't it? > > > --- a/doc/src/sgml/wal.sgml > > +++ b/doc/src/sgml/wal.sgml > > @@ -909,4 +910,36 @@ > > seem to be a problem in practice. > > > > > > + > > + > > + > > + Two-Phase Transactions > > + > > + > > + PostgreSQL supports a two-phase commit (2PC) > [...] > > + pg_twophase directory. Currently-prepared > > + transactions can be inspected using > + >
Re: Allow single table VACUUM in transaction block
On Mon, 7 Nov 2022 at 08:20, Simon Riggs wrote: > Temp tables are actually easier, since we don't need any of the > concurrency features we get with lazy vacuum. So the answer is to > always run a VACUUM FULL on temp tables since this skips any issues > with indexes etc. So I see 3 options for what to do next 1. Force the FULL option for all tables, when executed in a transaction block. This gets round the reasonable objections to running a concurrent vacuum in a shared xact block. As Justin points out, CLUSTER is already supported, which uses the same code. 2. Force the FULL option for temp tables, when executed in a transaction block. In a later patch, queue up an autovacuum run for regular tables. 3. Return this patch with feedback. (But then what happens with temp tables?) Thoughts? -- Simon Riggs http://www.EnterpriseDB.com/
Re: SUBTRANS: Minimizing calls to SubTransSetParent()
On Tue, 1 Nov 2022 at 08:55, Julien Tachoires wrote:
> > > 4. Test results with transaction 1
> > >
> > > TPS vs number of sub-transaction
> > >
> > > nsubx  HEAD    patched
> > >   0    441109  439474
> > >   8    439045  438103
> > >  16    439123  436993
> > >  24    436269  434194
> > >  32    439707  437429
> > >  40    439997  437220
> > >  48    439388  437422
> > >  56    439409  437210
> > >  64    439748  437366
> > >  72     92869  434448
> > >  80     66577  434100
> > >  88     61243  434255
> > >  96     57016  434419
> > > 104     52132  434917
> > > 112     49181  433755
> > > 120     46581  434044
> > > 128     44067  434268
> > >
> > > Perf profiling on HEAD with 80 sub-transactions:
> > > Overhead  Symbol
> > > 51.26%  [.] LWLockAttemptLock
> > > 24.59%  [.] LWLockRelease
> > >  0.36%  [.] base_yyparse
> > >  0.35%  [.] PinBuffer
> > >  0.34%  [.] AllocSetAlloc
> > >  0.33%  [.] hash_search_with_hash_value
> > >  0.22%  [.] LWLockAcquire
> > >  0.20%  [.] UnpinBuffer
> > >  0.15%  [.] SimpleLruReadPage_ReadOnly
> > >  0.15%  [.] _bt_compare
> > >
> > > Perf profiling on patched with 80 sub-transactions:
> > > Overhead  Symbol
> > > 2.64%  [.] AllocSetAlloc
> > > 2.09%  [.] base_yyparse
> > > 1.76%  [.] hash_search_with_hash_value
> > > 1.62%  [.] LWLockAttemptLock
> > > 1.26%  [.] MemoryContextAllocZeroAligned
> > > 0.93%  [.] _bt_compare
> > > 0.92%  [.] expression_tree_walker_impl.part.4
> > > 0.84%  [.] SearchCatCache1
> > > 0.79%  [.] palloc
> > > 0.64%  [.] core_yylex
> > >
> > > 5. Test results with transaction 2
> > >
> > > nsubx  HEAD    patched
> > >   0    440145  443816
> > >   8    438867  443081
> > >  16    438634  441786
> > >  24    436406  440187
> > >  32    439203  442447
> > >  40    439819  443574
> > >  48    439314  442941
> > >  56    439801  443736
> > >  64    439074  441970
> > >  72    439833  444132
> > >  80    148737  439941
> > >  88    413714  443343
> > >  96    251098  442021
> > > 104     70190  443488
> > > 112    405507  438866
> > > 120    177827  443202
> > > 128    399431  441842
> > >
> > > From the performance point of view, this patch clearly fixes the
> > > dramatic TPS collapse shown in these tests.
> > I think these are really promising results. Although the perf result > > shows that the bottleneck on the SLRU is no more there with the patch, > > I think it would be nice to see the wait event as well. > > Please find attached samples returned by the following query when > testing transaction 1 with 80 subxacts: > SELECT wait_event_type, wait_event, locktype, mode, database, > relation, COUNT(*) from pg_stat_activity AS psa JOIN pg_locks AS pl ON > (psa.pid = pl.pid) GROUP BY 1, 2, 3, 4, 5, 6 ORDER BY 7 DESC; These results are compelling, thank you. Setting this to Ready for Committer. -- Simon Riggs http://www.EnterpriseDB.com/
Re: Code checks for App Devs, using new options for transaction behavior
On Wed, 2 Nov 2022 at 07:40, Simon Riggs wrote: > > What will be the behavior if someone declares a savepoint with this > > name ("_internal_nested_xact"). Will this interfere with this new > > functionality? > > Clearly! I guess you are saying we should disallow them. > > > Have we tested scenarios like that? > > No, but that can be done. More tests as requested, plus minor code rework, plus comment updates. -- Simon Riggs http://www.EnterpriseDB.com/ 002_nested_xacts.v10.patch Description: Binary data 001_psql_parse_only.v1.patch Description: Binary data 003_rollback_on_commit.v1.patch Description: Binary data
Re: Error for WITH options on partitioned tables
On Mon, 7 Nov 2022 at 08:55, Karina Litskevich wrote: > > Hi David, > > > I am not very clear about why `build_reloptions` is removed in patch > > `v2-0002-better-error-message-for-setting-parameters-for-p.patch`, if > > you can help explain would be great. > > "build_reloptions" parses "reloptions" and takes for it a list of allowed > options defined by the 5th argument "relopt_elems" and the 6th argument > "num_relopt_elems", which are NULL and 0 in the removed call. If "validate" > is false, it ignores options, which are not in the list, while parsing. If > "validate" is true, it "elog"s ERROR when it meets option, which is not in the > allowed list. Karina's changes make sense to me, so +1. This is a minor patch, so I will set this as Ready For Committer. -- Simon Riggs http://www.EnterpriseDB.com/
Re: New docs chapter on Transaction Management and related changes
On Mon, 7 Nov 2022 at 07:43, Bruce Momjian wrote: > > On Fri, Nov 4, 2022 at 04:17:28PM +0100, Laurenz Albe wrote: > > On Sat, 2022-10-15 at 21:08 -0400, Bruce Momjian wrote: > > > I therefore merged all three paragraphs into > > > one and tried to make the text saner; release_savepoint.sgml diff > > > attached, URL content updated. > > > > I wanted to have a look at this, but I am confused. The original patch > > was much bigger. Is this just an incremental patch? If yes, it would > > be nice to have a "grand total" patch, so that I can read it all > > in one go. > > Yeah, I said: > > Yes, I didn't like global either --- I like your wording. I added > your > other changes too, with slight rewording. Merged patch to be posted > in >- > a later email. > > but that was unclear. Let me post one now, and Simon posted one too. What I've posted is the merged patch, i.e. your latest patch, plus changes to RELEASE SAVEPOINT from you on Oct 16, plus changes based on the later comments from Robert and I. -- Simon Riggshttp://www.EnterpriseDB.com/
Re: Allow single table VACUUM in transaction block
On Sun, 6 Nov 2022 at 20:40, Peter Geoghegan wrote: > > On Sun, Nov 6, 2022 at 11:14 AM Tom Lane wrote: > > In general, I do not believe in encouraging users to run VACUUM > > manually in the first place. We would be far better served by > > spending our effort to improve autovacuum's shortcomings. > > I couldn't agree more. A lot of problems seem related to the idea that > VACUUM is just a command that the DBA periodically runs to get a > predictable fixed result, a little like CREATE INDEX. That conceptual > model isn't exactly wrong; it just makes it much harder to apply any > kind of context about the needs of the table over time. There is a > natural cycle to how VACUUM (really autovacuum) is run, and the > details matter. > > There is a significant amount of relevant context that we can't really > use right now. That wouldn't be true if VACUUM only ran within an > autovacuum worker (by definition). The VACUUM command itself would > still be available, and support the same user interface, more or less. > Under the hood the VACUUM command would work by enqueueing a VACUUM > job, to be performed asynchronously by an autovacuum worker. Perhaps > the initial enqueue operation could be transactional, fixing Simon's > complaint. Ah, I see you got to this idea first! Yes, what we need is for the "VACUUM command" to not fail in a script. Not sure anyone cares where the work takes place. Enqueuing a request for autovacuum to do that work, then blocking until it is complete would do the job. > "No more VACUUMs outside of autovacuum" would enable more advanced > autovacuum.c scheduling, allowing us to apply a lot more context about > the costs and benefits, without having to treat manual VACUUM as an > independent thing. We could coalesce together redundant VACUUM jobs, > suspend and resume VACUUM operations, and have more strategies to deal > with problems as they emerge. +1, but clearly this would not make temp table VACUUMs work. 
> > I'd like to see some sort of direct attack on its inability to deal
> > with temp tables, for instance. (Force the owning backend to
> > do it? Temporarily change the access rules so that the data
> > moves to shared buffers? Dunno, but we sure haven't tried hard.)

This was a $DIRECT attack on making temp tables work! ;-)

Temp tables are actually easier, since we don't need any of the
concurrency features we get with lazy vacuum. So the answer is to
always run a VACUUM FULL on temp tables, since this skips any issues
with indexes etc.

We would need to check a few things first; maybe something like this
(mostly borrowed heavily from COPY):

    InvalidateCatalogSnapshot();
    if (!ThereAreNoPriorRegisteredSnapshots() || !ThereAreNoReadyPortals())
        ereport(WARNING,
                (errcode(ERRCODE_INVALID_TRANSACTION_STATE),
                 errmsg("vacuum of temporary table ignored because of prior transaction activity")));
    CheckTableNotInUse(rel, "VACUUM");

> This is a good example of the kind of thing I have in mind. Perhaps it
> could work by killing the backend that owns the temp relation when
> things truly get out of hand? I think that that would be a perfectly
> reasonable trade-off.

+1

> Another related idea: better behavior in the event of a manually
> issued VACUUM (now just an enqueued autovacuum) that cannot do useful
> work due to the presence of a long running snapshot. The VACUUM
> doesn't have to dutifully report "success" when there is no practical
> sense in which it was successful. There could be a back and forth
> conversation between autovacuum.c and vacuumlazy.c that makes sure
> that something useful happens sooner or later. The passage of time
> really matters here.

Regrettably, neither vacuum nor autovacuum waits for xmin to change;
perhaps it should.

-- 
Simon Riggs
http://www.EnterpriseDB.com/
Re: Allow single table VACUUM in transaction block
On Sun, 6 Nov 2022 at 18:50, Peter Geoghegan wrote: > > On Thu, Oct 27, 2022 at 2:31 AM Simon Riggs > wrote: > > Fix, so that this works without issue: > > > > BEGIN; > > > > VACUUM (ANALYZE) vactst; > > > > COMMIT; > > > > Allows both ANALYZE and vacuum of toast tables, but not VACUUM FULL. > > > > When in a xact block, we do not set PROC_IN_VACUUM, > > nor update datfrozenxid. > > It doesn't seem like a good idea to add various new special cases to > VACUUM just to make scripts like this work. Usability is a major concern that doesn't get a high enough priority. > I'm pretty sure that there > are several deep, subtle reasons why VACUUM cannot be assumed safe to > run in a user transaction. I expected there were, so it's good to discuss them. Thanks for the input. > If we absolutely have to do this, then the least worst approach might > be to make VACUUM into a no-op rather than throwing an ERROR -- demote > the ERROR into a WARNING. You could argue that we're just arbitrarily > deciding to not do a VACUUM just to be able to avoid throwing an error > if we do that. But isn't that already true with the patch that we > have? Is it really a "true VACUUM" if the operation can never advance > datfrozenxid? At least selectively demoting the ERROR to a WARNING is > "transparent" about it. I'll answer that part in my reply to Tom, since there are good ideas in both. -- Simon Riggshttp://www.EnterpriseDB.com/
Re: New docs chapter on Transaction Management and related changes
On Fri, 4 Nov 2022 at 15:17, Laurenz Albe wrote: > > On Sat, 2022-10-15 at 21:08 -0400, Bruce Momjian wrote: > > I therefore merged all three paragraphs into > > one and tried to make the text saner; release_savepoint.sgml diff > > attached, URL content updated. > > I wanted to have a look at this, but I am confused. The original patch > was much bigger. Is this just an incremental patch? If yes, it would > be nice to have a "grand total" patch, so that I can read it all > in one go. Agreed; new compilation patch attached, including mine and then Robert's suggested rewordings. -- Simon Riggshttp://www.EnterpriseDB.com/ xact_docs.v9.patch Description: Binary data
Re: Allow single table VACUUM in transaction block
Hi Rahila, Thanks for your review. On Fri, 4 Nov 2022 at 07:37, Rahila Syed wrote: >> I would like to bring up a few points that I came across while looking into >> the vacuum code. >> >> 1. As a result of this change to allow VACUUM inside a user transaction, I >> think there is some chance of causing >> a block/delay of concurrent VACUUMs if a VACUUM is being run under a long >> running transaction. >> 2. Also, if a user runs VACUUM in a transaction, performance optimizations >> like PROC_IN_VACUUM won't work. >> 3. Also, if VACUUM happens towards the end of a long running transaction, >> the snapshot will be old >> and xmin horizon for vacuum would be somewhat old as compared to current >> lazy vacuum which >> acquires a new snapshot just before scanning the table. >> >> So, while I understand the need of the feature, I am wondering if there >> should be some mention >> of above caveats in documentation with the recommendation that VACUUM should >> be run outside >> a transaction, in general. >> > > Sorry, I just noticed that you have already mentioned some of these in the > documentation as follows, so it seems > it is already taken care of. > > +VACUUM cannot be executed inside a transaction block, > +unless a single table is specified and FULL is not > +specified. When executing inside a transaction block the vacuum scan can > +hold back the xmin horizon and does not update the database datfrozenxid, > +as a result this usage is not useful for database maintenance, but is > provided > +to allow vacuuming in special circumstances, such as temporary or private > +work tables. Yes, I wondered whether we should have a NOTICE or WARNING to remind people of those points? -- Simon Riggshttp://www.EnterpriseDB.com/
Re: Allow single table VACUUM in transaction block
On Tue, 1 Nov 2022 at 23:56, Simon Riggs wrote: > > I haven't checked the rest of the patch, but +1 for allowing VACUUM FULL > > within a user txn. > > My intention was to prevent that. I am certainly quite uneasy about > changing anything related to CLUSTER/VF, since they are old, complex > and bug prone. > > So for now, I will block VF, as was my original intent. > > I will be guided by what others think... so you may yet get your wish. > > > > Maybe the error message needs to be qualified "...when multiple > > relations are specified". > > > > ERROR: VACUUM cannot run inside a transaction block > > Hmm, that is standard wording based on the statement type, but I can > set a CONTEXT message also. Will update accordingly. > > Thanks for your input. New version attached, as described. Other review comments and alternate opinions welcome. -- Simon Riggshttp://www.EnterpriseDB.com/ single_table_vacuum.v2.patch Description: Binary data
Re: Code checks for App Devs, using new options for transaction behavior
On Wed, 2 Nov 2022 at 03:52, Dilip Kumar wrote:
>
> On Mon, Oct 31, 2022 at 6:54 PM Simon Riggs
> wrote:
>
> > > > What is the behavior if "nested_transactions" value is changed within
> > > > a transaction execution, suppose the value was on and we have created
> > > > a few levels of nested subtransactions and within the same transaction
> > > > I switched it to off or to outer?
>
> I think you missed the above comment?

[copy of earlier reply]
Patch does the same dance as with other xact variables. XactNesting is
the value within the transaction and in the patch this is not
exported, so cannot be set externally. XactNesting is set at
transaction start to the variable DefaultXactNesting, which is set by
the GUC.

So it's not a problem, but thanks for checking.

> > > I am not sure whether it is good to not allow PREPARE or we can just
> > > prepare it and come out of the complete nested transaction. Suppose
> > > we have multiple savepoints and we say prepare then it will just
> > > succeed so why does it have to be different here?
> >
> > I'm happy to discuss what the behavior should be in this case. It is
> > not a common case, and people don't put PREPARE in their scripts
> > except maybe in a test.
> >
> > My reasoning for this code is that we don't want to accept a COMMIT
> > until we reach the top level of nesting, so the behavior should be
> > similar for PREPARE, which is just the first half of the final commit.
>
> Yeah this is not a very common case. And we can see opinions from
> others as well. But I think your reasoning for doing it this way also
> makes sense to me.
>
> I have some more comments for 0002
> 1.
> +    if (XactNesting == XACT_NEST_OUTER && XactNestingLevel > 0)
> +    {
> +        /* Throw ERROR */
> +        ereport(ERROR,
> +                (errmsg("nested ROLLBACK, level %u aborts outer transaction", XactNestingLevel--)));
> +    }
>
> I did not understand in case of 'outer' if we are giving rollback from
> inner nesting level why it is throwing error? Documentation just says
> this[1] but it did not mention the error. I agree that we might need
> to give the rollback as many times as the nesting level but giving
> errors seems confusing to me.

Docs mention ROLLBACK at any level will abort the transaction, which
is what the ERROR does.

> [1]
> +
> + A setting of outer will cause a nested
> + BEGIN to be remembered, so that an equal number
> + of COMMIT or ROLLBACK commands
> + are required to end the nesting. In that case a ROLLBACK
> + at any level will abort the entire outer transaction.
> + Once we reach the top-level transaction,
> + the final COMMIT will end the transaction.
> + This ensures that all commands within the outer transaction are atomic.
> +
>
> 2.
> +    if (XactNesting == XACT_NEST_OUTER)
> +    {
> +        if (XactNestingLevel <= 0)
> +            s->blockState = TBLOCK_END;
> +        else
> +            ereport(NOTICE,
> +                    (errcode(ERRCODE_ACTIVE_SQL_TRANSACTION),
> +                     errmsg("nested COMMIT, level %u", XactNestingLevel)));
> +        XactNestingLevel--;
> +        return true;
> +    }

This is decrementing the nesting level for XACT_NEST_OUTER, until we
reach the top level, when the commit is allowed.

> +    while (s->parent != NULL && !found_subxact)
>      {
> +        if (XactNesting == XACT_NEST_ALL &&
> +            XactNestingLevel > 0 &&
> +            PointerIsValid(s->name) &&
> +            strcmp(s->name, NESTED_XACT_NAME) == 0)
> +            found_subxact = true;
> +
>          if (s->blockState == TBLOCK_SUBINPROGRESS)
>              s->blockState = TBLOCK_SUBCOMMIT
>
> I think these changes should be explained in the comments.

This locates the correct subxact by name, as you mention in (4).

> 3.
> +    if (XactNesting == XACT_NEST_OUTER)
> +    {
> +        if (XactNestingLevel > 0)
> +        {
> +            ereport(NOTICE,
> +                    (errmsg("nested COMMIT, level %u in aborted transaction", XactNestingLevel)));
> +            XactNestingLevel--;
> +            return false;
> +        }
> +    }
>
> Better to writ
Re: Error for WITH options on partitioned tables
Apologies, I only just noticed these messages. I have set WoA until I
have read, understood and can respond to your detailed input, thanks.

On Fri, 28 Oct 2022 at 22:21, David Zhang wrote:
>
> Hi Karina,
>
> I am not very clear about why `build_reloptions` is removed in patch
> `v2-0002-better-error-message-for-setting-parameters-for-p.patch`, if
> you can help explain would be great.
>
> From my observation, it seems the WITH option has different behavior
> when creating partitioned table and index. For example,
>
> pgbench -i --partitions=2 --partition-method=range -d postgres
>
> postgres=# create index idx_bid on pgbench_accounts using btree(bid)
> with (fillfactor = 90);
> CREATE INDEX
>
> postgres=# select relname, relkind, reloptions from pg_class where
> relnamespace=2200 order by oid;
>            relname           | relkind |   reloptions
> ----------------------------+---------+-----------------
>  idx_bid                    | I       | {fillfactor=90}
>  pgbench_accounts_1_bid_idx | i       | {fillfactor=90}
>  pgbench_accounts_2_bid_idx | i       | {fillfactor=90}
>
> I can see the `fillfactor` parameter has been added to the indexes,
> however, if I try to alter `fillfactor`, I got an error like below.
>
> postgres=# alter index idx_bid set (fillfactor=40);
> ERROR: ALTER action SET cannot be performed on relation "idx_bid"
> DETAIL: This operation is not supported for partitioned indexes.
>
> This generic error message is reported by
> `errdetail_relkind_not_supported`, and there is a similar error message
> for partitioned tables. Anyone knows if this can be an option for
> reporting this `fillfactor` parameter not supported error.
>
> Best regards,
>
> David
>
> On 2022-10-14 8:16 a.m., Karina Litskevich wrote:
> > Hi, Simon!
> >
> > The new error message looks better. But I believe it is better to use
> > "parameters" instead of "options" as it is called "storage parameters"
> > in the documentation. I also believe it is better to report error in
> > partitioned_table_reloptions() (thanks to Japin Li for mentioning this
> > function) as it also fixes the error message in the following situation:
> >
> > test=# CREATE TABLE parted_col_comment (a int, b text) PARTITION BY
> > LIST (a);
> > CREATE TABLE
> > test=# ALTER TABLE parted_col_comment SET (fillfactor=100);
> > ERROR: unrecognized parameter "fillfactor"
> >
> > I attached the new versions of patches.
> >
> > I'm not sure about the errcode. May be it is better to report error with
> > ERRCODE_INVALID_OBJECT_DEFINITION for CREATE TABLE and with
> > ERRCODE_WRONG_OBJECT_TYPE for ALTER TABLE (as when you try "ALTER TABLE
> > partitioned INHERIT nonpartitioned;" an ERROR with
> > ERRCODE_WRONG_OBJECT_TYPE is reported). Then either the desired code
> > should be passed to partitioned_table_reloptions() or similar checks
> > and ereport calls should be placed in two different places. I'm not
> > sure whether it is a good idea to change the code in one of these ways
> > just to change the error code.
> >
> > Best regards,
> > Karina Litskevich
> > Postgres Professional: http://postgrespro.com/

-- 
Simon Riggs
http://www.EnterpriseDB.com/
Re: Allow single table VACUUM in transaction block
On Thu, 27 Oct 2022 at 21:07, Justin Pryzby wrote:
>
> On Thu, Oct 27, 2022 at 10:31:31AM +0100, Simon Riggs wrote:
> > Allows both ANALYZE and vacuum of toast tables, but not VACUUM FULL.
>
> Maybe I misunderstood what you meant: you said "not VACUUM FULL", but
> with your patch, that works:
>
> postgres=# begin; VACUUM FULL pg_class; commit;
> BEGIN
> VACUUM
> COMMIT
>
> Actually, I've thought before that it was a bit weird that CLUSTER can be
> run within a transaction, but VACUUM FULL cannot (even though it does a
> CLUSTER behind the scenes). VACUUM FULL can process multiple relations,
> whereas CLUSTER can't, but it seems nice to allow vacuum full for the
> case of a single relation.
>
> I haven't checked the rest of the patch, but +1 for allowing VACUUM FULL
> within a user txn.

My intention was to prevent that. I am certainly quite uneasy about
changing anything related to CLUSTER/VF, since they are old, complex
and bug prone.

So for now, I will block VF, as was my original intent.

I will be guided by what others think... so you may yet get your wish.

> Maybe the error message needs to be qualified "...when multiple
> relations are specified".
>
> ERROR: VACUUM cannot run inside a transaction block

Hmm, that is standard wording based on the statement type, but I can
set a CONTEXT message also. Will update accordingly.

Thanks for your input.

-- 
Simon Riggs
http://www.EnterpriseDB.com/
Re: psql: Add command to use extended query protocol
On Tue, 1 Nov 2022 at 20:48, Peter Eisentraut wrote:
>
> On 01.11.22 10:10, Simon Riggs wrote:
> > On Fri, 28 Oct 2022 at 07:53, Peter Eisentraut
> > wrote:
> >>
> >> This adds a new psql command \gp that works like \g (or semicolon) but
> >> uses the extended query protocol. Parameters can also be passed, like
> >>
> >> SELECT $1, $2 \gp 'foo' 'bar'
> >
> > +1 for the concept. The patch looks simple and complete.
> >
> > I find it strange to use it the way you have shown above, i.e. \gp on
> > same line after a query.
>
> That's how all the "\g" commands work.

Yes, I see that, but it also works exactly the way I said, i.e.

  SELECT 'foo' \g

is the same thing as

  SELECT 'foo'
  \g

But there are no examples in the docs of the latter usage, and so it
is a surprise to me and probably to others also.

> > ...since if we used this in a script, it would be used like this, I think...
> >
> >   SELECT $1, $2
> >   \gp 'foo' 'bar'
> >   \gp 'bar' 'baz'
> >   ...
>
> Interesting, but I think for that we should use named prepared
> statements, so that would be a separate "\gsomething" command in psql, like
>
> SELECT $1, $2 \gprep p1
> \grun p1 'foo' 'bar'
> \grun p1 'bar' 'baz'

Not sure I understand this... you seem to be arguing against your own
patch?? I quite liked the way you had it, I'm just asking for the docs
to put the \gp on the following line.

-- 
Simon Riggs
http://www.EnterpriseDB.com/
Re: psql: Add command to use extended query protocol
On Fri, 28 Oct 2022 at 07:53, Peter Eisentraut wrote:
>
> This adds a new psql command \gp that works like \g (or semicolon) but
> uses the extended query protocol. Parameters can also be passed, like
>
> SELECT $1, $2 \gp 'foo' 'bar'

+1 for the concept. The patch looks simple and complete.

I find it strange to use it the way you have shown above, i.e. \gp on
same line after a query. For me it would be clearer to have tests and
docs showing this:

  SELECT $1, $2
  \gp 'foo' 'bar'

> Perhaps this would also be useful for general psql scripting.

...since if we used this in a script, it would be used like this, I think...

  SELECT $1, $2
  \gp 'foo' 'bar'
  \gp 'bar' 'baz'
  ...

-- 
Simon Riggs
http://www.EnterpriseDB.com/
Re: Commit fest 2022-11
On Mon, 31 Oct 2022 at 05:42, Michael Paquier wrote: > As per the world clock, the next commit fest will begin in 30 hours > (11/1 0:00 AoE time). I may have missed something, but it looks like > we have no CFM for this one yet. > > Opinions, thoughts or volunteers? If we have no volunteers, maybe it is because the job of CFM is too big now and needs to be split. I can offer to be co-CFM, and think I can offer to be CFM for about 100 patches, but I don't have the time to do the whole thing myself. -- Simon Riggshttp://www.EnterpriseDB.com/
Re: Code checks for App Devs, using new options for transaction behavior
On Mon, 31 Oct 2022 at 12:22, Dilip Kumar wrote:
>
> On Mon, Oct 31, 2022 at 5:03 PM Dilip Kumar wrote:
> >
> > On Mon, Oct 31, 2022 at 4:23 PM Dilip Kumar wrote:
> > >
> > > On Sun, Oct 30, 2022 at 11:32 PM Simon Riggs
> > > wrote:
> > > >
> > > > On Fri, 28 Oct 2022 at 10:33, Simon Riggs
> > > > wrote:
> > > >
> > > > > Thanks for the feedback, I will make all of those corrections in the
> > > > > next version.
> > > >
> > > > New version attached. I've rolled 002-004 into one patch, but can
> > > > split again as needed.
> > >
> > > I like the idea of "parse only" and "nested xact", thanks for working
> > > on this. I will look into patches in more detail, especially nested
> > > xact. IMHO there is no point in merging "nested xact" and "rollback on
> > > commit". They might be changing the same code location but these two
> > > are completely different ideas, in fact all these three should be
> > > reviewed as three separate threads as you mentioned in the first email
> > > in the thread.
> >
> > What is the behavior if "nested_transactions" value is changed within
> > a transaction execution, suppose the value was on and we have created
> > a few levels of nested subtransactions and within the same transaction
> > I switched it to off or to outer?
>
> 1.
> @@ -3815,6 +3861,10 @@ PrepareTransactionBlock(const char *gid)
>      /* Set up to commit the current transaction */
>      result = EndTransactionBlock(false);
>
> +    /* Don't allow prepare until we are back to an unnested state at level 0 */
> +    if (XactNestingLevel > 0)
> +        return false;
>
> I am not sure whether it is good to not allow PREPARE or we can just
> prepare it and come out of the complete nested transaction. Suppose
> we have multiple savepoints and we say prepare then it will just
> succeed so why does it have to be different here?

I'm happy to discuss what the behavior should be in this case. It is
not a common case, and people don't put PREPARE in their scripts
except maybe in a test.

My reasoning for this code is that we don't want to accept a COMMIT
until we reach the top level of nesting, so the behavior should be
similar for PREPARE, which is just the first half of the final commit.

Note that the nesting of begin/commit is completely separate to the
existence/non-existence of subtransactions, especially with
nested_transactions = 'outer'.

> 2.
>      case TBLOCK_SUBABORT:
>          ereport(WARNING,
>                  (errcode(ERRCODE_ACTIVE_SQL_TRANSACTION),
>                   errmsg("there is already a transaction in progress")));
> +        if (XactNesting == XACT_NEST_OUTER)
> +            XactNestingLevel++;
>          break;
>
> I did not understand what this change is for, can you tell me the
> scenario or a test case to hit this?

Well spotted, thanks. That seems to be some kind of artefact. There is
no test that exercises that, since it is an unintended change, so I
have removed it.

> Remaining part w.r.t "nested xact" patch looks fine, I haven't tested
> it yet though.

New versions attached, separated again as you suggested.

-- 
Simon Riggs
http://www.EnterpriseDB.com/
001_psql_parse_only.v1.patch
Description: Binary data
002_nested_xacts.v9.patch
Description: Binary data
003_rollback_on_commit.v1.patch
Description: Binary data
Re: Code checks for App Devs, using new options for transaction behavior
On Mon, 31 Oct 2022 at 11:33, Dilip Kumar wrote:
> What is the behavior if "nested_transactions" value is changed within
> a transaction execution, suppose the value was on and we have created
> a few levels of nested subtransactions and within the same transaction
> I switched it to off or to outer?

Patch does the same dance as with other xact variables. XactNesting is
the value within the transaction and in the patch this is not
exported, so cannot be set externally. XactNesting is set at
transaction start to the variable DefaultXactNesting, which is set by
the GUC.

So it's not a problem, but thanks for checking.

-- 
Simon Riggs
http://www.EnterpriseDB.com/
Re: Code checks for App Devs, using new options for transaction behavior
On Fri, 28 Oct 2022 at 10:33, Simon Riggs wrote: > Thanks for the feedback, I will make all of those corrections in the > next version. New version attached. I've rolled 002-004 into one patch, but can split again as needed. -- Simon Riggshttp://www.EnterpriseDB.com/ 001_psql_parse_only.v1.patch Description: Binary data 002_nested_xacts_and_rollback_on_commit.v8.patch Description: Binary data
Re: Code checks for App Devs, using new options for transaction behavior
On Fri, 28 Oct 2022 at 07:54, Erik Rijkers wrote: > > Op 27-10-2022 om 18:35 schreef Simon Riggs: > > On Thu, 27 Oct 2022 at 12:09, Simon Riggs > > wrote: > > > >> Comments please > > > > Update from patch tester results. > > > > > [001_psql_parse_only.v1.patch ] > > [002_nested_xacts.v7.patch] > > [003_rollback_on_commit.v1.patch ] > > [004_add_params_to_sample.v1.patch] > > > patch 002 has (2x) : >'transction' should be >'transaction' > > also in patch 002: >'at any level will be abort' should be >'at any level will abort' > > I also dislike the 'we' in > >'Once we reach the top-level transaction,' > > That seems a bit too much like the 'we developers working together to > make a database server system' which is of course used often and > usefully on this mailinglist and in code itself. But I think > user-facing docs should be careful with that team-building 'we'. I > remember well how it confused me, many years ago. Better, IMHO: > >'Once the top-level transaction is reached,' Thanks for the feedback, I will make all of those corrections in the next version. I'm guessing you like the features?? -- Simon Riggshttp://www.EnterpriseDB.com/
Re: Code checks for App Devs, using new options for transaction behavior
On Thu, 27 Oct 2022 at 12:09, Simon Riggs wrote: > Comments please Update from patch tester results. -- Simon Riggshttp://www.EnterpriseDB.com/ 002_nested_xacts.v7.patch Description: Binary data 001_psql_parse_only.v1.patch Description: Binary data 003_rollback_on_commit.v1.patch Description: Binary data 004_add_params_to_sample.v1.patch Description: Binary data
Re: Allow single table VACUUM in transaction block
On Thu, 27 Oct 2022 at 10:31, Simon Riggs wrote:
> Tests, docs.

The patch tester says that a pg_upgrade test is failing on Windows,
but it works for me:

t/002_pg_upgrade.pl .. ok

If anybody can shed any light on that, much appreciated.

-- 
Simon Riggs
http://www.EnterpriseDB.com/
Code checks for App Devs, using new options for transaction behavior
In the past, developers have wondered how we can provide "--dry-run"
functionality
https://www.postgresql.org/message-id/15791.1450383201%40sss.pgh.pa.us

This is important for application developers, especially when
migrating programs to Postgres. Presented here are 3 features aimed at
developers, each of which is being actively used by me in a large and
complex migration project.

* psql --parse-only

Checks the syntax of all SQL in a script, but without actually
executing it. This is very important in the early stages of complex
migrations, because we need to see whether the code would generate
syntax errors before we attempt to execute it. When there are many
dependencies between objects, actual execution fails very quickly if
we run in a single transaction, yet running outside of a transaction
can leave a difficult cleanup task. Fixing errors iteratively is
difficult when there are long chains of dependencies between objects,
since there is no easy way to predict how long it will take to make
everything work unless you understand how many syntax errors exist in
the script.

001_psql_parse_only.v1.patch

* nested_transactions = off (default) | all | outer

Handles nested BEGIN/COMMIT, which can cause chaos on failure. This is
an important part of guaranteeing that everything that gets executed
is part of a single atomic transaction, which can then be rolled back
- this is a pre-requisite for the last feature.

002_nested_xacts.v7.patch

The default behavior is unchanged (off).
Setting "all" treats nested BEGIN/COMMIT as subtransactions, allowing
some parts to fail without rolling back the outer transaction.
Setting "outer" flattens nested BEGIN/COMMIT into one single outer
transaction, so that any failure rolls back the entire transaction.

* rollback_on_commit = off (default) | on

Forces transactions to fail their final commit, ensuring that no
lasting change is made when a script is tested, i.e. accept COMMIT,
but do rollback instead.

003_rollback_on_commit.v1.patch

We will probably want to review these on separate threads, but the
common purpose of these features is hopefully clear from these notes.
001 and 003 are fairly small patches, 002 is longer.

Comments please

-- 
Simon Riggs
http://www.EnterpriseDB.com/
001_psql_parse_only.v1.patch
Description: Binary data
002_nested_xacts.v7.patch
Description: Binary data
003_rollback_on_commit.v1.patch
Description: Binary data
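As a sketch of how the proposed settings would combine in a migration test run (a hypothetical session: `nested_transactions` and `rollback_on_commit` are the GUCs proposed by patches 002 and 003 and do not exist in released PostgreSQL; table name t is invented):

```sql
SET nested_transactions = 'outer';  -- proposed GUC: flatten nested BEGIN/COMMIT
SET rollback_on_commit = on;        -- proposed GUC: test run, keep no changes

BEGIN;                    -- outer transaction starts
INSERT INTO t VALUES (1);
BEGIN;                    -- nested BEGIN is only remembered (level 1)
INSERT INTO t VALUES (2);
COMMIT;                   -- nested COMMIT: does not end the transaction
COMMIT;                   -- top level reached; rollback_on_commit discards all
```

A ROLLBACK issued at any nesting level in this mode would abort the entire outer transaction, per the documentation quoted in the review discussion.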
Allow single table VACUUM in transaction block
It is a common user annoyance to have a script fail because someone
added a VACUUM, especially when using the --single-transaction option.

Fix, so that this works without issue:

BEGIN;
VACUUM (ANALYZE) vactst;
COMMIT;

Allows both ANALYZE and vacuum of toast tables, but not VACUUM FULL.

When in a xact block, we do not set PROC_IN_VACUUM, nor update
datfrozenxid.

Tests, docs.

-- 
Simon Riggs
http://www.EnterpriseDB.com/
single_table_vacuum.v1.patch
Description: Binary data
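A sketch of the intended rules, based on the description above and the doc wording quoted elsewhere in the thread ("unless a single table is specified and FULL is not specified"); the sessions are illustrative, not actual test output:

```sql
BEGIN;
VACUUM (ANALYZE) vactst;   -- allowed by the patch: single table, not FULL
COMMIT;

BEGIN;
VACUUM FULL vactst;        -- still an error: FULL is excluded
ROLLBACK;

BEGIN;
VACUUM;                    -- still an error: no single table specified
ROLLBACK;
```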
Re: New docs chapter on Transaction Management and related changes
On Sun, 16 Oct 2022 at 02:08, Bruce Momjian wrote: > > On Fri, Oct 14, 2022 at 05:46:55PM -0400, Robert Treat wrote: > > On Fri, Oct 14, 2022 at 3:51 PM Bruce Momjian wrote: > > > Attached is the merged patch from all the great comments I received. I > > > have also rebuilt the docs with the updated patch: > > > > > > https://momjian.us/tmp/pgsql/ > > > > > > > + RELEASE SAVEPOINT also subcommits and destroys > > + all savepoints that were established after the named savepoint was > > + established. This means that any subtransactions of the named savepoint > > + will also be subcommitted and destroyed. > > > > Wonder if we should be more explicit that data changes are preserved, > > not destroyed... something like: > > "This means that any changes within subtransactions of the named > > savepoint will be subcommitted and those subtransactions will be > > destroyed." > > Good point. I reread the section and there was just too much confusion > over subtransactions, partly because the behavior just doesn't map > easily to subtransaction. I therefore merged all three paragraphs into > one and tried to make the text saner; release_savepoint.sgml diff > attached, URL content updated. Just got around to reading this, thanks for changes. The rewording doesn't work for me. The use of the word "destroy" is very misleading, since the effect is to commit. So I think we must avoid use of the word destroy completely. Possible rewording: RELEASE SAVEPOINT will subcommit the subtransaction established by the named savepoint, releasing any resources held by it. If there were any subtransactions created by the named savepoint, these will also be subcommitted. The point is that savepoints create subtransactions, but they are not the only way to create them, so we cannot equate savepoint and subtransaction completely. -- Simon Riggshttp://www.EnterpriseDB.com/
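To make the proposed wording concrete, here is a minimal session showing that RELEASE SAVEPOINT subcommits rather than discards (table t is assumed to already exist):

```sql
BEGIN;
INSERT INTO t VALUES (1);
SAVEPOINT s1;
INSERT INTO t VALUES (2);
SAVEPOINT s2;              -- a subtransaction created after s1
INSERT INTO t VALUES (3);
RELEASE SAVEPOINT s1;      -- subcommits s1 and also s2; rows 2 and 3 are kept
COMMIT;                    -- all three rows become durable together
```

The subcommitted rows only become permanent at the outer COMMIT; a subsequent ROLLBACK of the outer transaction would still discard them, which is why "destroy" misdescribes what RELEASE does to the savepoints' changes.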
Re: New docs chapter on Transaction Management and related changes
On Fri, 14 Oct 2022 at 08:55, Simon Riggs wrote: > In related changes, I've also done some major rewording of the RELEASE > SAVEPOINT command, since it was incorrectly described as having "no > other user visible behavior". A complex example is given to explain, > using the terminology established in the main patch. ROLLBACK TO SAVEPOINT also needs some clarification, patch attached. (Commentary: It's confusing to me that ROLLBACK TO SAVEPOINT starts a new subtransaction, whereas RELEASE SAVEPOINT does not. You might imagine they would both start a new subtransaction, but that is not the case.) -- Simon Riggshttp://www.EnterpriseDB.com/ doc-rollback-to.v1.patch Description: Binary data
Re: New docs chapter on Transaction Management and related changes
On Fri, 14 Oct 2022 at 09:46, Alvaro Herrera wrote: > > +1 for this new chapter. This latest patch looks pretty good. I think > that introducing the concept of "sub-commit" as in Simon's follow-up > patch clarifies things, though the word itself looks very odd. Maybe > it's okay. The addition of the savepoint example looks good also. > > On 2022-Oct-13, Bruce Momjian wrote: > > > + > > + PostgreSQL supports a two-phase commit (2PC) > > + protocol that allows multiple distributed systems to work together > > + in a transactional manner. The commands are PREPARE > > + TRANSACTION, COMMIT PREPARED and > > I suggest/request that we try to avoid breaking tagged constants in > DocBook; doing so makes it much easier to miss them later when grepping > for them (don't laugh, it has happened to me). Also, it breaks > formatting in some weird cases. I know this makes editing a bit harder > because you can't just reflow with your editor like you would normal > text. So this'd be: > > + in a transactional manner. The commands are PREPARE > TRANSACTION, > + COMMIT PREPARED and > > with whatever word wrapping you like, except breaking between PREPARE > and TRANSACTION. > > > + > > +In addition to vxid and xid, > > +when a transaction is prepared it is also identified by a Global > > +Transaction Identifier (GID). GIDs > > +are string literals up to 200 bytes long, which must be > > +unique amongst other currently prepared transactions. > > +The mapping of GID to xid is shown in > + > > linkend="view-pg-prepared-xacts">pg_prepared_xacts. > > + > > Maybe say "is prepared for two-phase commit", to make the topic of this > paragraph more obvious? > > > + > > +Lock waits on table-level locks are shown waiting for > > +virtualxid, while lock waits on row-level > > +locks are shown waiting for transactionid. > > +Row-level read and write locks are recorded directly in locked > > +rows and can be inspected using the > > +extension. 
Row-level read locks might also require the assignment > > +of multixact IDs (mxid). Mxids are recorded in > > +the pg_multixact directory. > > + > > Hmm, ok. > > > + > > +The parent xid of each subxid is recorded in the > > +pg_subtrans directory. No entry is made for > > +top-level xids since they do not have a parent, nor is an entry made > > +for read-only subtransactions. > > + > > Maybe say "the immediate parent xid of each ...", or is it too obvious? +1 to all of those suggestions -- Simon Riggs http://www.EnterpriseDB.com/
Re: New docs chapter on Transaction Management and related changes
On Thu, 13 Oct 2022 at 23:06, Bruce Momjian wrote: > > Thanks, updated patch attached. You can see the output at: Thank you for your work to tighten and clean up this patch, much appreciated. I had two minor typos, plus a slight rewording to avoid using the word "global" in the last section, since that is associated with distributed or 2PC transactions. For those changes, I provide a patch-on-patch so you can see clearly. In related changes, I've also done some major rewording of the RELEASE SAVEPOINT command, since it was incorrectly described as having "no other user visible behavior". A complex example is given to explain, using the terminology established in the main patch. -- Simon Riggs http://www.EnterpriseDB.com/ xact.patch-on-patch Description: Binary data doc-release-savepoint.v1.patch Description: Binary data
Re: Startup process on a hot standby crashes with an error "invalid memory alloc request size 1073741824" while replaying "Standby/LOCK" records
On Wed, 5 Oct 2022 at 16:30, Tom Lane wrote: > > Kyotaro Horiguchi writes: > > At Tue, 4 Oct 2022 17:15:31 -0700, Nathan Bossart > > wrote in > >> On Tue, Oct 04, 2022 at 07:53:11PM -0400, Tom Lane wrote: > >>> After further thought, maybe it'd be better to do it as attached, > >>> with one long-lived hash table for all the locks. > > > First one is straight forward outcome from the current implement but I > > like the new one. I agree that it is natural and that the expected > > overhead per (typical) transaction is lower than both the first one > > and doing the same operation on a list. I don't think that space > > inefficiency in that extent doesn't matter since it is the startup > > process. > > To get some hard numbers about this, I made a quick hack to collect > getrusage() numbers for the startup process (patch attached for > documentation purposes). I then ran the recovery/t/027_stream_regress.pl > test a few times and collected the stats (also attached). This seems > like a reasonably decent baseline test, since the core regression tests > certainly take lots of AccessExclusiveLocks what with all the DDL > involved, though they shouldn't ever take large numbers at once. Also > they don't run long enough for any lock list bloat to occur, so these > numbers don't reflect a case where the patches would provide benefit. > > If you look hard, there's maybe about a 1% user-CPU penalty for patch 2, > although that's well below the run-to-run variation so it's hard to be > sure that it's real. The same comments apply to the max resident size > stats. So I'm comforted that there's not a significant penalty here. > > I'll go ahead with patch 2 if there's not objection. Happy to see this change. > One other point to discuss: should we consider back-patching? I've > got mixed feelings about that myself. I don't think that cases where > this helps significantly are at all mainstream, so I'm kind of leaning > to "patch HEAD only". 
It looks fine to eventually back-patch, since StandbyReleaseLockTree() was optimized to be called only when the transaction had actually taken some AccessExclusiveLocks. The performance loss is therefore minor and isolated to the users of such locks, so I see no problems with it. -- Simon Riggs http://www.EnterpriseDB.com/
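For anyone wanting to reproduce Tom's measurement approach, a minimal sketch of collecting the two statistics he reports for the startup process (user CPU and max resident size) via getrusage() might look like the following. The function names here are mine, purely illustrative, and not taken from the actual instrumentation patch.

```c
#include <assert.h>
#include <sys/resource.h>
#include <sys/time.h>

/* Return user CPU time consumed by this process, in microseconds. */
static long
user_cpu_usec(void)
{
	struct rusage ru;

	if (getrusage(RUSAGE_SELF, &ru) != 0)
		return -1;
	return (long) ru.ru_utime.tv_sec * 1000000L + ru.ru_utime.tv_usec;
}

/* Return peak resident set size; ru_maxrss is reported in kilobytes on Linux. */
static long
max_rss_kb(void)
{
	struct rusage ru;

	if (getrusage(RUSAGE_SELF, &ru) != 0)
		return -1;
	return ru.ru_maxrss;
}
```

Sampling these at process exit, as the quoted hack does, gives numbers directly comparable across runs of 027_stream_regress.pl.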
Re: SUBTRANS: Minimizing calls to SubTransSetParent()
On Fri, 16 Sept 2022 at 13:20, Simon Riggs wrote: > > Thanks for the review. > > v10 attached v11 attached, corrected for recent commit 14ff44f80c09718d43d853363941457f5468cc03. -- Simon Riggs http://www.EnterpriseDB.com/ 002_minimize_calls_to_SubTransSetParent.v11.patch Description: Binary data
Re: Pruning never visible changes
On Thu, 22 Sept 2022 at 15:16, Alvaro Herrera wrote: > > On 2022-Sep-22, Simon Riggs wrote: > > > On Mon, 19 Sept 2022 at 00:16, Greg Stark wrote: > > > > VACUUM was willing to remove a committed-dead tuple immediately if it > > > was > > > deleted by the same transaction that inserted it. The idea is that > > > such a > > > tuple could never have been visible to any other transaction, so we > > > don't > > > need to keep it around to satisfy MVCC snapshots. However, there was > > > already an exception for tuples that are part of an update chain, and > > > this > > > exception created a problem: we might remove TOAST tuples (which are > > > never > > > part of an update chain) while their parent tuple stayed around (if > > > it was > > > part of an update chain). This didn't pose a problem for most things, > > > since the parent tuple is indeed dead: no snapshot will ever consider > > > it > > > visible. But MVCC-safe CLUSTER had a problem, since it will try to > > > copy > > > RECENTLY_DEAD tuples to the new table. It then has to copy their > > > TOAST > > > data too, and would fail if VACUUM had already removed the toast > > > tuples. > > > Good research Greg, thank you. Only took 10 years for me to notice it > > was gone ;-) > > But this begs the question: is the proposed change safe, given that > ancient consideration? I don't think TOAST issues have been mentioned > in this thread so far, so I wonder if there is a test case that verifies > that this problem doesn't occur for some reason. Oh, completely agreed. I will submit a modified patch that adds a test case and just a comment to explain why we can't remove such rows. -- Simon Riggshttp://www.EnterpriseDB.com/
Re: Pruning never visible changes
On Mon, 19 Sept 2022 at 00:16, Greg Stark wrote: > > On Fri, 16 Sept 2022 at 10:27, Tom Lane wrote: > > > > Simon Riggs writes: > > > A user asked me whether we prune never visible changes, such as > > > BEGIN; > > > INSERT... > > > UPDATE.. (same row) > > > COMMIT; > > > > Didn't we just have this discussion in another thread? > > Well. not "just" :) > > commit 44e4bbf75d56e643b6afefd5cdcffccb68cce414 > Author: Tom Lane > Date: Fri Apr 29 16:29:42 2011 -0400 > > Remove special case for xmin == xmax in HeapTupleSatisfiesVacuum(). > > VACUUM was willing to remove a committed-dead tuple immediately if it was > deleted by the same transaction that inserted it. The idea is that such a > tuple could never have been visible to any other transaction, so we don't > need to keep it around to satisfy MVCC snapshots. However, there was > already an exception for tuples that are part of an update chain, and this > exception created a problem: we might remove TOAST tuples (which are never > part of an update chain) while their parent tuple stayed around (if it was > part of an update chain). This didn't pose a problem for most things, > since the parent tuple is indeed dead: no snapshot will ever consider it > visible. But MVCC-safe CLUSTER had a problem, since it will try to copy > RECENTLY_DEAD tuples to the new table. It then has to copy their TOAST > data too, and would fail if VACUUM had already removed the toast tuples. > > Easiest fix is to get rid of the special case for xmin == xmax. This may > delay reclaiming dead space for a little bit in some cases, but it's by > far > the most reliable way to fix the issue. > > Per bug #5998 from Mark Reid. Back-patch to 8.3, which is the oldest > version with MVCC-safe CLUSTER. Good research Greg, thank you. Only took 10 years for me to notice it was gone ;-) -- Simon Riggshttp://www.EnterpriseDB.com/
Re: Hash index build performance tweak from sorting
On Wed, 21 Sept 2022 at 02:32, David Rowley wrote: > > On Tue, 2 Aug 2022 at 03:37, Simon Riggs wrote: > > Using the above test case, I'm getting a further 4-7% improvement on > > already committed code with the attached patch, which follows your > > proposal. > > > > The patch passes info via a state object, useful to avoid API churn in > > later patches. > > Hi Simon, > > I took this patch for a spin and saw a 2.5% performance increase using > the random INT test that Tom posted. The index took an average of > 7227.47 milliseconds on master and 7045.05 with the patch applied. Hi David, Thanks for tests and review. I'm just jumping on a plane, so may not respond in detail until next Mon. -- Simon Riggs http://www.EnterpriseDB.com/
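For readers following the thread, the core idea being benchmarked here — presorting index entries by hash value so bucket pages are filled sequentially rather than revisited at random — can be sketched as below. This is a simplification under assumed names: the real patch works on IndexTuples and passes state via a build-state object, as mentioned above.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct
{
	uint32_t	hash;	/* hash of the index key */
	uint32_t	tid;	/* stand-in for the heap tuple id */
} IndexEntry;

static int
cmp_hash(const void *a, const void *b)
{
	uint32_t	ha = ((const IndexEntry *) a)->hash;
	uint32_t	hb = ((const IndexEntry *) b)->hash;

	if (ha < hb)
		return -1;
	if (ha > hb)
		return 1;
	return 0;
}

/* Sort entries by hash so bucket pages are visited in order during the build. */
static void
sort_entries(IndexEntry *entries, size_t n)
{
	qsort(entries, n, sizeof(IndexEntry), cmp_hash);
}
```

The win comes from locality: once sorted, all insertions into a given bucket happen together, instead of each tuple potentially touching a different bucket page.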
Re: Pruning never visible changes
On Fri, 16 Sept 2022 at 18:37, Tom Lane wrote: > > Simon Riggs writes: > > On Fri, 16 Sept 2022 at 15:26, Tom Lane wrote: > >> You cannot > >> do that, at least not without checking that the originating > >> transaction has no snapshots that could see the older row version. > > > I'm saying this is possible only AFTER the end of the originating > > xact, so there are no issues with additional snapshots. > > I see, so the point is just that we can prune even if the originating > xact hasn't yet passed the global xmin horizon. I agree that's safe, > but will it fire often enough to be worth the trouble? It is an edge case with limited utility, I agree, which is why I looked for a cheap test (xmin == xmax only). This additional test is also valid, but too expensive to apply: TransactionIdGetTopmostTransactionId(xmax) == TransactionIdGetTopmostTransactionId(xmin) > Also, why > does it need to be restricted to certain shapes of HOT chains --- > that is, why can't we do exactly what we'd do if the xact *were* > past the xmin horizon? I thought it important to maintain the integrity of the HOT chain, in that the xmax of one tuple version matches the xmin of the next. So pruning only from the root of the chain allows us to maintain that validity check. I'm on the fence myself, which is why I didn't post it immediately after I had written it. If not, it would be useful to add a note in the code comments to explain why we don't do this. -- Simon Riggs http://www.EnterpriseDB.com/
Re: Pruning never visible changes
On Fri, 16 Sept 2022 at 21:07, Laurenz Albe wrote: > > On Fri, 2022-09-16 at 10:26 -0400, Tom Lane wrote: > > Simon Riggs writes: > > > A user asked me whether we prune never visible changes, such as > > > BEGIN; > > > INSERT... > > > UPDATE.. (same row) > > > COMMIT; > > > > Didn't we just have this discussion in another thread? > > For reference: that was > https://postgr.es/m/f6a491b32cb44bb5daaafec835364f7149348fa1.ca...@cybertec.at Thanks. I confirm I hadn't seen that, and indeed, I wrote the patch on 5 Sept before you posted. -- Simon Riggs http://www.EnterpriseDB.com/
Re: Slow standby snapshot
On Fri, 16 Sept 2022 at 17:08, Michail Nikolaev wrote: > > Hello everyone. > > To find the best frequency for calling KnownAssignedXidsCompress in > Simon's patch, I made a set of benchmarks. It looks like every 8th xid > is a little too often. > > Setup and method is the same as previous (1). 16-core machines, > max_connections = 5000. Tests were running for about a day, 220 runs > in total (each version for 20 times, evenly distributed throughout the > day). > > Starting from the 60th second, a 30-second-long transaction was started on the primary. > > Graphs in attachment. So, looks like 64 is the best value here. It > gives even a little bit more TPS than smaller values. > > Yes, such benchmark does not cover all possible cases, but it is > better to measure at least something when selecting constants :) This is very good and clear, thank you. > If someone has an idea of different benchmark scenarios - please share them. > So, updated version (with 64 and some commit message) in attachment too. > > [1]: > https://www.postgresql.org/message-id/flat/CANtu0ohzBFTYwdLtcanWo4%2B794WWUi7LY2rnbHyorJdE8_ZnGg%40mail.gmail.com#379c1be7b8134ada5a574078d51b64c6 I've cleaned up the comments and used a #define for the constant. IMHO this is ready for commit. I will add it to the next CF. -- Simon Riggs http://www.EnterpriseDB.com/ v8c-new-heuristic-to-compress-KnownAssignedXids.patch Description: Binary data
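For context, the heuristic being tuned above is roughly of this shape: compress the KnownAssignedXids array only once enough removed entries ("holes") have accumulated. This is a deliberately simplified sketch — the names are mine, and the committed code applies additional conditions — but the constant 64 is the value the benchmarks in this thread point to.

```c
#include <assert.h>
#include <stdbool.h>

/* Benchmark result from this thread: compressing more often than every
 * ~64 removed xids costs more than it saves. */
#define KAX_COMPRESS_THRESHOLD 64

/*
 * Decide whether it is worth compacting the KnownAssignedXids array.
 * "size" is the span of array slots in use, "elements" the number of
 * still-valid xids among them; the difference is the number of holes
 * left behind by earlier removals.
 */
static bool
should_compress(int size, int elements)
{
	return (size - elements) >= KAX_COMPRESS_THRESHOLD;
}
```

The point of the threshold is to amortize: each compression is O(size), so triggering it on every removal degrades a hot path, while waiting too long bloats binary-search costs for snapshot builders.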
Re: Pruning never visible changes
On Fri, 16 Sept 2022 at 15:26, Tom Lane wrote: > > Simon Riggs writes: > > A user asked me whether we prune never visible changes, such as > > BEGIN; > > INSERT... > > UPDATE.. (same row) > > COMMIT; > > Didn't we just have this discussion in another thread? Not that I was aware of, but it sounds like a different case anyway. > You cannot > do that, at least not without checking that the originating > transaction has no snapshots that could see the older row version. I'm saying this is possible only AFTER the end of the originating xact, so there are no issues with additional snapshots. i.e. the never visible row has to be judged RECENTLY_DEAD before we even check. -- Simon Riggs http://www.EnterpriseDB.com/
Pruning never visible changes
A user asked me whether we prune never visible changes, such as BEGIN; INSERT... UPDATE.. (same row) COMMIT; Once committed, the original insert is no longer visible to anyone, so "ought to be able to be pruned", sayeth the user. And they also say that changing the app is much harder, as ever. After some thought, Yes, we can prune, but not in all cases - only if the never visible tuple is at the root end of the update chain. The only question is whether that can be done cheaply enough to bother with. The answer in one specific case is Yes, in other cases No. This patch adds a new test for this use case, and code to remove the never visible row when the changes are made by the same xid. (I'm pretty sure there used to be a test for this some years back and I'm guessing it was removed because it isn't always possible to remove the tuple, which this new patch honours.) Please let me know what you think. -- Simon Riggs http://www.EnterpriseDB.com/ never_visible.v1.patch Description: Binary data
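The cheap check described above can be sketched as follows. This is purely illustrative — the real logic lives in the heap pruning code and involves HeapTupleSatisfiesVacuum and walking the HOT chain — so the surrounding conditions are reduced here to booleans supplied by the caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/*
 * Cheap test from the patch description: a tuple inserted and deleted by
 * the same (now committed) transaction was never visible to anyone else,
 * so it can be pruned -- but only when it sits at the root end of its
 * update chain, preserving the xmax->xmin linkage of the rest of the chain.
 */
static bool
never_visible(TransactionId xmin, TransactionId xmax,
			  bool xact_committed, bool is_chain_root)
{
	return xact_committed && is_chain_root && xmin == xmax;
}
```

Note that xmin == xmax deliberately misses the subtransaction case (INSERT in a subxact, UPDATE in its parent); catching that would need the topmost-xid comparison discussed elsewhere in the thread, which is too expensive for this path.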
Re: SUBTRANS: Minimizing calls to SubTransSetParent()
On Thu, 15 Sept 2022 at 12:36, Japin Li wrote: > > > On Thu, 15 Sep 2022 at 18:04, Simon Riggs > wrote: > > On Wed, 14 Sept 2022 at 15:21, Alvaro Herrera > > wrote: > >> > >> On 2022-Aug-30, Simon Riggs wrote: > >> > >> > 001_new_isolation_tests_for_subxids.v3.patch > >> > Adds new test cases to master without adding any new code, specifically > >> > addressing the two areas of code that are not tested by existing tests. > >> > This gives us a baseline from which we can do test driven development. > >> > I'm hoping this can be reviewed and committed fairly smoothly. > >> > >> I gave this a quick run to confirm the claimed increase of coverage. It > >> checks out, so pushed. > > > > Thank you. > > > > So now we just have the main part of the patch, reattached here for > > the auto patch tester's benefit. > > Hi Simon, > > Thanks for the updated patch, here are some comments. Thanks for your comments. > There is a typo, > s/SubTransGetTopMostTransaction/SubTransGetTopmostTransaction/g. > > +*call SubTransGetTopMostTransaction() if that xact > overflowed; > > > Is there a punctuation mark missing on the following first line? > > +* 2. When IsolationIsSerializable() we sometimes need to > access topxid > +*This occurs only when SERIALIZABLE is requested by app > user. > > > When we use function name in comments, some places we use parentheses, > but others do not use it. Why? I think, we should keep them consistent, > at least in the same commit. > > +* 3. When TransactionIdSetTreeStatus will use a status of > SUB_COMMITTED, > +*which then requires us to consult subtrans to find > parent, which > +*is needed to avoid race condition. In this case we ask > Clog/Xact > +*module if TransactionIdsAreOnSameXactPage(). Since we > start a new > +*clog page every 32000 xids, this is usually <<1% of > subxids. I've reworded those comments, hoping to address all of your above points. 
> Maybe we should declare a topxid to avoid calling GetTopTransactionId() > twice when we need to set the subtrans parent? > > + TransactionId subxid = > XidFromFullTransactionId(s->fullTransactionId); > + TransactionId topxid = GetTopTransactionId(); > ... > + if (MyProc->subxidStatus.overflowed || > + IsolationIsSerializable() || > + !TransactionIdsAreOnSameXactPage(topxid, subxid)) > + { > ... > + SubTransSetParent(subxid, topxid); > + } Seems a minor point, but I've done this anyway. Thanks for the review. v10 attached -- Simon Riggs http://www.EnterpriseDB.com/ 002_minimize_calls_to_SubTransSetParent.v10.patch Description: Binary data
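To illustrate the condition in the quoted hunk: a subtrans entry is needed only when the subxid cache overflowed, serializable isolation is in use, or the topxid and subxid fall on different clog pages. A standalone sketch under simplified names (XACTS_PER_PAGE mirrors CLOG_XACTS_PER_PAGE, which is 32768 with the default 8 kB block size — the "32000" in the patch comments is a round figure):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* CLOG_XACTS_PER_PAGE with the default 8 kB block size (4 xacts per byte). */
#define XACTS_PER_PAGE 32768

/* Two xids share a clog page iff they land in the same page-sized bucket. */
static bool
xids_on_same_xact_page(TransactionId a, TransactionId b)
{
	return a / XACTS_PER_PAGE == b / XACTS_PER_PAGE;
}

/* The shape of the test in the quoted patch hunk, with the inputs
 * (overflow flag, isolation level) passed in as plain booleans. */
static bool
need_subtrans_entry(bool overflowed, bool serializable,
					TransactionId topxid, TransactionId subxid)
{
	return overflowed || serializable ||
		!xids_on_same_xact_page(topxid, subxid);
}
```

Since a new clog page starts only every ~32k xids, the cross-page case fires for well under 1% of subxids, which is what makes skipping SubTransSetParent() worthwhile.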