Re: [HACKERS] WAL logging problem in 9.4.3?
On Mon, Apr 06, 2020 at 09:46:31AM +0900, Kyotaro Horiguchi wrote:
> At Sat, 4 Apr 2020 15:32:12 -0700, Noah Misch wrote in
> > On Sat, Apr 04, 2020 at 06:24:34PM -0400, Tom Lane wrote:
> > > Shouldn't the CF entry get closed?
> >
> > Once the buildfarm is clean for a day, sure. The buildfarm has already
> > revealed a missing perl2host call.
>
> Thank you for (re-) committing this and the following fix. I hope this
> doesn't bring in another failure.

I have closed the CF entry.

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jacana&dt=2020-04-05%2000%3A00%3A27
happened, but I doubt it is related. A wait_for_catchup that usually takes
<1s instead timed out after 397s. I can't reproduce it. In the past, another
animal on the same machine had the same failure:

  sysname  │      snapshot       │ branch │ bfurl
 ──────────┼─────────────────────┼────────┼───────
 bowerbird │ 2019-11-17 15:22:42 │ HEAD   │ http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=bowerbird&dt=2019-11-17%2015%3A22%3A42
 bowerbird │ 2020-01-10 17:30:49 │ HEAD   │ http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=bowerbird&dt=2020-01-10%2017%3A30%3A49
Re: [HACKERS] WAL logging problem in 9.4.3?
At Sat, 4 Apr 2020 15:32:12 -0700, Noah Misch wrote in
> On Sat, Apr 04, 2020 at 06:24:34PM -0400, Tom Lane wrote:
> > Shouldn't the CF entry get closed?
>
> Once the buildfarm is clean for a day, sure. The buildfarm has already
> revealed a missing perl2host call.

Thank you for (re-) committing this and the following fix. I hope this
doesn't bring in another failure.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
Re: [HACKERS] WAL logging problem in 9.4.3?
On Sat, Apr 04, 2020 at 06:24:34PM -0400, Tom Lane wrote:
> Shouldn't the CF entry get closed?

Once the buildfarm is clean for a day, sure. The buildfarm has already
revealed a missing perl2host call.
Re: [HACKERS] WAL logging problem in 9.4.3?
Robert Haas writes:
> On Wed, Apr 1, 2020 at 11:51 PM Noah Misch wrote:
> > I've translated the non-vote comments into estimated votes of -0.3, -0.6,
> > -0.4, +0.5, and -0.3. Hence, I revoke the plan to back-patch.

> FWIW, I also think that it would be better not to back-patch.

FWIW, I also concur with not back-patching; the risk/reward ratio does not
look favorable. Maybe later.

> Last but not least, I would like to join with others in expressing my
> thanks to you for your hard work on this problem.

+1 on that, too.

Shouldn't the CF entry get closed?

			regards, tom lane
Re: [HACKERS] WAL logging problem in 9.4.3?
On Wed, Apr 1, 2020 at 11:51 PM Noah Misch wrote:
> I've translated the non-vote comments into estimated votes of -0.3, -0.6,
> -0.4, +0.5, and -0.3. Hence, I revoke the plan to back-patch.

FWIW, I also think that it would be better not to back-patch. The risk of
back-patching is that this will break things, whereas the risk of not
back-patching is that we will harm people who are affected by this bug for a
longer period of time than would otherwise be the case. Because this patch is
complex, the risk of breaking things seems higher than normal. On the other
hand, the number of users adversely affected by the bug appears to be
relatively low. Taken together, these factors persuade me that we should not
back-patch at this time.

It is possible that in the future things may look different. In the happy
event that this patch causes no more problems following commit, while at the
same time we have more complaints about the underlying problem, we can make a
decision to back-patch at a later time.

This brings me to another point: because this patch changes the WAL format, a
straight revert will be impossible once a release has occurred. Therefore, if
we hold off on back-patching for now and later decide that we erred, we can
proceed at that time and it will probably not be much harder than it would be
to do it now. On the other hand, if we decide to back-patch now and later
decide that we have erred, we will have additional engineering work to do to
cater to people who have already installed the version containing the
back-patched fix and now need to upgrade again. Perhaps the WAL format
changes are simple enough that this isn't likely to be a huge issue even if
it happens, but it does seem better to avoid the chance that it might. A
further factor is that releases which break WAL compatibility are
undesirable, and should only be undertaken when necessary.
Last but not least, I would like to join with others in expressing my thanks
to you for your hard work on this problem. While the process of developing a
fix has not been without bumps, few people would have had the time, patience,
diligence, and skill to take this effort as far as you have. Kyotaro
Horiguchi and others likewise deserve credit for all of the many hours that
they have put into this work. The entire PostgreSQL community owes all of you
a debt of gratitude, and you have my thanks.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] WAL logging problem in 9.4.3?
On Mon, Mar 30, 2020 at 11:37:57PM -0700, Andres Freund wrote:
> On 2020-03-30 23:28:54 -0700, Noah Misch wrote:
> > On Mon, Mar 30, 2020 at 04:43:00PM +0900, Michael Paquier wrote:
> > > On Sun, Mar 29, 2020 at 09:41:01PM -0700, Noah Misch wrote:
> > > > I think attached v41nm is ready for commit. Would anyone like to vote against
> > > > back-patching this? It's hard to justify lack of back-patch for a data-loss
> > > > bug, but this is atypically invasive. (I'm repeating the question, since some
> > > > folks missed my 2020-02-18 question.) Otherwise, I'll push this on Saturday.
> > >
> > > The invasiveness of the patch is a concern. Have you considered a
> > > different strategy? For example, we are soon going to be in beta for
> > > 13, so you could consider committing the patch only on HEAD first.
> > > If there are issues to take care of, you can then leverage the beta
> > > testing to address any issues found. Finally, once some dust has
> > > settled on the concept and we have gained enough confidence, we could
> > > consider a back-patch.
> >
> > No. Does anyone favor this proposal more than back-patching normally?
>
> I have not reviewed the patch, so I don't have a good feeling for its
> riskiness. But it does sound fairly invasive. Given that we've lived
> with this issue for many years by now, and that the rate of incidents
> seems to have been fairly low, I think living with the issue for a bit
> longer to gain confidence might be a good choice. But I'd not push back
> if you, being much more informed, think the risk/reward balance favors
> immediate backpatching.

I've translated the non-vote comments into estimated votes of -0.3, -0.6,
-0.4, +0.5, and -0.3. Hence, I revoke the plan to back-patch.
Re: [HACKERS] WAL logging problem in 9.4.3?
Hi,

On 2020-03-30 23:28:54 -0700, Noah Misch wrote:
> On Mon, Mar 30, 2020 at 04:43:00PM +0900, Michael Paquier wrote:
> > On Sun, Mar 29, 2020 at 09:41:01PM -0700, Noah Misch wrote:
> > > I think attached v41nm is ready for commit. Would anyone like to vote against
> > > back-patching this? It's hard to justify lack of back-patch for a data-loss
> > > bug, but this is atypically invasive. (I'm repeating the question, since some
> > > folks missed my 2020-02-18 question.) Otherwise, I'll push this on Saturday.
> >
> > The invasiveness of the patch is a concern. Have you considered a
> > different strategy? For example, we are soon going to be in beta for
> > 13, so you could consider committing the patch only on HEAD first.
> > If there are issues to take care of, you can then leverage the beta
> > testing to address any issues found. Finally, once some dust has
> > settled on the concept and we have gained enough confidence, we could
> > consider a back-patch.
>
> No. Does anyone favor this proposal more than back-patching normally?

I have not reviewed the patch, so I don't have a good feeling for its
riskiness. But it does sound fairly invasive. Given that we've lived
with this issue for many years by now, and that the rate of incidents
seems to have been fairly low, I think living with the issue for a bit
longer to gain confidence might be a good choice. But I'd not push back
if you, being much more informed, think the risk/reward balance favors
immediate backpatching.

Greetings,

Andres Freund
Re: [HACKERS] WAL logging problem in 9.4.3?
On Mon, Mar 30, 2020 at 04:43:00PM +0900, Michael Paquier wrote:
> On Sun, Mar 29, 2020 at 09:41:01PM -0700, Noah Misch wrote:
> > I think attached v41nm is ready for commit. Would anyone like to vote against
> > back-patching this? It's hard to justify lack of back-patch for a data-loss
> > bug, but this is atypically invasive. (I'm repeating the question, since some
> > folks missed my 2020-02-18 question.) Otherwise, I'll push this on Saturday.
>
> The invasiveness of the patch is a concern. Have you considered a
> different strategy? For example, we are soon going to be in beta for
> 13, so you could consider committing the patch only on HEAD first.
> If there are issues to take care of, you can then leverage the beta
> testing to address any issues found. Finally, once some dust has
> settled on the concept and we have gained enough confidence, we could
> consider a back-patch.

No. Does anyone favor this proposal more than back-patching normally?
Re: [HACKERS] WAL logging problem in 9.4.3?
On Sun, Mar 29, 2020 at 09:41:01PM -0700, Noah Misch wrote:
> I think attached v41nm is ready for commit. Would anyone like to vote against
> back-patching this? It's hard to justify lack of back-patch for a data-loss
> bug, but this is atypically invasive. (I'm repeating the question, since some
> folks missed my 2020-02-18 question.) Otherwise, I'll push this on Saturday.

The invasiveness of the patch is a concern. Have you considered a
different strategy? For example, we are soon going to be in beta for
13, so you could consider committing the patch only on HEAD first.
If there are issues to take care of, you can then leverage the beta
testing to address any issues found. Finally, once some dust has
settled on the concept and we have gained enough confidence, we could
consider a back-patch. In short, my point is just that even if this stuff
has been discussed for years, I see no urgency in back-patching, per the
lack of complaints we have in -bugs or such.

--
Michael
Re: [HACKERS] WAL logging problem in 9.4.3?
At Sun, 29 Mar 2020 23:08:27 -0700, Noah Misch wrote in
> On Mon, Mar 30, 2020 at 02:56:11PM +0900, Kyotaro Horiguchi wrote:
> > At Sun, 29 Mar 2020 21:41:01 -0700, Noah Misch wrote in
> > > Since pendingSyncHash is always NULL under XLogIsNeeded(), I also removed some
> > > XLogIsNeeded() tests that immediately preceded !pendingSyncHash tests.
> >
> > Sounds reasonable. In AddPendingSync, don't we put
> > Assert(!XLogIsNeeded()) instead of "Assert(pendingSyncHash == NULL)"?
> > The former guarantees the relationship between XLogIsNeeded() and
> > pendingSyncHash, and the existing latter assertion looks redundant as
> > it is placed just after "if (pendingSyncHash)".
>
> The "Assert(pendingSyncHash == NULL)" is indeed useless; I will remove it. I
> am not inclined to replace it with Assert(!XLogIsNeeded()). This static
> function is not likely to get more callers, so the chance of accidentally
> calling it under XLogIsNeeded() is too low.

Agreed.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
Re: [HACKERS] WAL logging problem in 9.4.3?
On Mon, Mar 30, 2020 at 02:56:11PM +0900, Kyotaro Horiguchi wrote:
> At Sun, 29 Mar 2020 21:41:01 -0700, Noah Misch wrote in
> > Since pendingSyncHash is always NULL under XLogIsNeeded(), I also removed some
> > XLogIsNeeded() tests that immediately preceded !pendingSyncHash tests.
>
> Sounds reasonable. In AddPendingSync, don't we put
> Assert(!XLogIsNeeded()) instead of "Assert(pendingSyncHash == NULL)"?
> The former guarantees the relationship between XLogIsNeeded() and
> pendingSyncHash, and the existing latter assertion looks redundant as
> it is placed just after "if (pendingSyncHash)".

The "Assert(pendingSyncHash == NULL)" is indeed useless; I will remove it. I
am not inclined to replace it with Assert(!XLogIsNeeded()). This static
function is not likely to get more callers, so the chance of accidentally
calling it under XLogIsNeeded() is too low.
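[Editor's illustration] The invariant under discussion — pendingSyncHash is only ever created when XLogIsNeeded() is false, which is what makes the extra guards removable — can be sketched in a toy Python model. This is not PostgreSQL source; the names mirror the C symbols but the logic is a deliberately simplified stand-in.

```python
# Toy model of the invariant: the pending-sync hash is created lazily,
# and only ever when WAL logging is being skipped (wal_level=minimal).
xlog_is_needed = False      # stand-in for XLogIsNeeded(); minimal => False
pending_sync_hash = None    # stand-in for pendingSyncHash, created lazily


def add_pending_sync(rnode):
    """Register a relfilenode whose WAL is being skipped this transaction."""
    global pending_sync_hash
    # The invariant AddPendingSync() could assert, per the thread:
    assert not xlog_is_needed
    if pending_sync_hash is None:
        pending_sync_hash = set()
    pending_sync_hash.add(rnode)


def relfilenode_skipping_wal(rnode):
    # No separate XLogIsNeeded() test is needed here: the hash is None
    # whenever WAL is logged, so testing the hash alone suffices.
    return pending_sync_hash is not None and rnode in pending_sync_hash


add_pending_sync((1663, 16384, 16385))  # hypothetical (spc, db, rel) OIDs
assert relfilenode_skipping_wal((1663, 16384, 16385))
assert not relfilenode_skipping_wal((1663, 16384, 99999))
```

The model also shows why the parallel-worker failure occurs: a worker starting with `pending_sync_hash = None` would answer "not skipping WAL" for every relation, regardless of what the leader registered.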
Re: [HACKERS] WAL logging problem in 9.4.3?
At Sun, 29 Mar 2020 21:41:01 -0700, Noah Misch wrote in
> I think attached v41nm is ready for commit. Would anyone like to vote against
> back-patching this? It's hard to justify lack of back-patch for a data-loss
> bug, but this is atypically invasive. (I'm repeating the question, since some
> folks missed my 2020-02-18 question.) Otherwise, I'll push this on Saturday.
>
> On Mon, Mar 23, 2020 at 05:20:27PM +0900, Kyotaro Horiguchi wrote:
> > At Sat, 21 Mar 2020 15:49:20 -0700, Noah Misch wrote in
> > > The proximate cause is the RelFileNodeSkippingWAL() call that we added to
> > > MarkBufferDirtyHint(). MarkBufferDirtyHint() runs in parallel workers, but
> > > parallel workers have zeroes for pendingSyncHash and rd_*Subid.
> > > Kyotaro, can you look through the affected code and propose a strategy for
> > > good coexistence of parallel query with the WAL skipping mechanism?
> >
> > Bi-directional communication between leader and workers is too much.
> > It wouldn't be acceptable to inhibit the problematic operations on
> > workers such like heap-prune or btree pin removal. If we do pending
> > syncs just before worker start, it won't fix the issue.
> >
> > The attached patch passes a list of pending-sync relfilenodes at
> > worker start.
>
> If you were to issue pending syncs and also cease skipping WAL for affected
> relations, that would fix the issue. Your design is better, though. I made
> two notable changes:
>
> - The patch was issuing syncs or FPIs every time a parallel worker exited. I
>   changed it to skip most of smgrDoPendingSyncs() in parallel workers, like
>   AtEOXact_RelationMap() does.

Exactly. Thank you for fixing it.

> - PARALLEL_KEY_PENDING_SYNCS is most similar to PARALLEL_KEY_REINDEX_STATE and
>   PARALLEL_KEY_COMBO_CID. parallel.c, not execParallel.c, owns those. I
>   moved PARALLEL_KEY_PENDING_SYNCS to parallel.c, which also called for style
>   changes in the associated storage.c functions.

That sounds better. Moving the responsibility for creating the pending-syncs
array reduces copying. RestorePendingSyncs() (and AddPendingSync()) look
better.

> Since pendingSyncHash is always NULL under XLogIsNeeded(), I also removed some
> XLogIsNeeded() tests that immediately preceded !pendingSyncHash tests.

Sounds reasonable. In AddPendingSync, don't we put
Assert(!XLogIsNeeded()) instead of "Assert(pendingSyncHash == NULL)"?
The former guarantees the relationship between XLogIsNeeded() and
pendingSyncHash, and the existing latter assertion looks redundant as
it is placed just after "if (pendingSyncHash)".

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
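[Editor's illustration] The approach adopted here — the leader serializes the set of pending-sync relfilenodes into a flat buffer at worker launch, and each worker rebuilds a lookup table from it — can be sketched with a toy Python round-trip. This is an illustrative model, not the parallel.c/storage.c code; the wire format (a count followed by fixed-width OID triples) is an assumption for the sketch.

```python
# Toy model of passing pending-sync relfilenodes to parallel workers:
# leader flattens the set into bytes, worker rebuilds a hash from them.
import struct
from typing import List, Set, Tuple

RelFileNode = Tuple[int, int, int]  # (tablespace OID, database OID, relfilenode)


def serialize_pending_syncs(nodes: List[RelFileNode]) -> bytes:
    """Leader side: pack a count, then one fixed-width triple per node."""
    out = struct.pack("<I", len(nodes))
    for spc, db, rel in nodes:
        out += struct.pack("<III", spc, db, rel)
    return out


def restore_pending_syncs(buf: bytes) -> Set[RelFileNode]:
    """Worker side: rebuild an (immature) pending-sync lookup set."""
    (n,) = struct.unpack_from("<I", buf, 0)
    nodes = set()
    for i in range(n):
        nodes.add(struct.unpack_from("<III", buf, 4 + 12 * i))
    return nodes


leader_list = [(1663, 16384, 16385), (1663, 16384, 16390)]  # hypothetical OIDs
worker_hash = restore_pending_syncs(serialize_pending_syncs(leader_list))
assert worker_hash == set(leader_list)
```

Because workers perform no transactional or DDL operations, a one-way snapshot like this at worker start is sufficient; nothing needs to flow back to the leader.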
Re: [HACKERS] WAL logging problem in 9.4.3?
Thanks for the labour on this.

At Sat, 21 Mar 2020 15:49:20 -0700, Noah Misch wrote in
> On Sat, Mar 21, 2020 at 12:01:27PM -0700, Noah Misch wrote:
> > Pushed, after adding a missing "break" to gist_identify() and tweaking two
..
> The proximate cause is the RelFileNodeSkippingWAL() call that we added to
> MarkBufferDirtyHint(). MarkBufferDirtyHint() runs in parallel workers, but
> parallel workers have zeroes for pendingSyncHash and rd_*Subid. I hacked up
> the attached patch to understand the scope of the problem (not to commit). It
> logs a message whenever a parallel worker uses pendingSyncHash or
> RelationNeedsWAL(). Some of the cases happen often enough to make logs huge,
> so the patch suppresses logging for them. You can see the lower-volume calls
> like this:
>
> printf '%s\n%s\n%s\n%s\n' 'log_statement = all' 'wal_level = minimal' 'max_wal_senders = 0' 'force_parallel_mode = regress' >/tmp/minimal_parallel.conf
> make check-world TEMP_CONFIG=/tmp/minimal_parallel.conf
> find . -name log | xargs grep -rl 'nm0 invalid'
>
> Not all are actual bugs. For example, get_relation_info() behaves fine:
>
> 	/* Temporary and unlogged relations are inaccessible during recovery. */
> 	if (!RelationNeedsWAL(relation) && RecoveryInProgress())

But the relcache entry shows wrong information about the new-ness of its
storage, and that is the root cause of all the other problems.

> Kyotaro, can you look through the affected code and propose a strategy for
> good coexistence of parallel query with the WAL skipping mechanism?

Bi-directional communication between leader and workers is too much. It
wouldn't be acceptable to inhibit the problematic operations on workers, such
as heap pruning or btree pin removal. Doing the pending syncs just before
worker start wouldn't fix the issue, either.

The attached patch passes a list of pending-sync relfilenodes at worker
start. Workers create an (immature) pending-sync hash from the list and
create relcache entries using the hash. Given that parallel workers don't
perform transactional operations and DDL operations, workers need only the
list of relfilenodes. The list might be long, but I don't think it is
realistic that so many tables are truncated or created and then scanned in
parallel within a transaction while wal_level = minimal.

> Since I don't expect one strategy to win clearly and quickly, I plan to revert
> the main patch around 2020-03-22 17:30 UTC. That will give the patch about
> twenty-four hours in the buildfarm, so more animals can report in. I will
> leave the three smaller patches in place.

Thank you for your trouble and the check code. Sorry for not responding in
time.

> > fairywren failed differently on 9.5; I have not yet studied it:
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2020-03-21%2018%3A01%3A10
>
> This did not remain specific to 9.5. On platforms where SIZEOF_SIZE_T==4 or
> SIZEOF_LONG==4, wal_skip_threshold cannot exceed 2GB. A simple s/1TB/1GB/ in
> the test should fix this.

Oops. I felt that 2TB looked too large but didn't take it seriously. 1GB is
1048576 kB, which is less than the said limit of 2097151, so the attached
second patch does that.

The attached is a proposal of a fix for the issue, on top of the reverted
commit.

- v36-0001-Skip-WAL-for-new-relfilenodes-under-wal_level-mi.patch
  The reverted patch.
- v36-0002-Fix-GUC-value-in-TAP-test.patch
  Change wal_skip_threshold from 2TB to 2GB in the TAP test.
- v36-0003-Fix-the-name-of-struct-pendingSyncs.patch
  I found that the struct for a pending-sync hash entry is named differently
  from the pending-delete hash entry. Change it so that the two follow a
  similar naming convention.
- v36-0004-Propagage-pending-sync-information-to-parallel-w.patch
  The proposed fix for the parallel-worker problem.

The make check-world above didn't fail with this patch.

regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center

>From 7aa0ba8fe7bccda5e24a8d25af925372aae8402f Mon Sep 17 00:00:00 2001
From: Noah Misch
Date: Sat, 21 Mar 2020 09:38:26 -0700
Subject: [PATCH v36 1/4] Skip WAL for new relfilenodes, under
 wal_level=minimal.

Until now, only selected bulk operations (e.g. COPY) did this. If a
given relfilenode received both a WAL-skipping COPY and a WAL-logged
operation (e.g. INSERT), recovery could lose tuples from the COPY. See
src/backend/access/transam/README section "Skipping WAL for New
RelFileNode" for the new coding rules. Maintainers of table access
methods should examine that section.

To maintain data durability, just before commit, we choose between an
fsync of the relfilenode and copying its contents to WAL. A new GUC,
wal_skip_threshold, guides that choice. If this change slows a workload
that creates small, permanent relfilenodes under wal_level=minimal, try
adjusting wal_skip_threshold. Users setting a timeout on COMMIT may
need to adjust that timeout, and log_min_duration_statement analysis
will reflect time consumption moving to COMMIT
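[Editor's illustration] The commit-time choice the patch's commit message describes — fsync the relfilenode or copy its contents to WAL, steered by wal_skip_threshold — can be modeled in a few lines. This is a hedged sketch, not the smgrDoPendingSyncs() code; the threshold value and the exact comparison are assumptions for illustration (the GUC is in kilobytes).

```python
# Toy model of the wal_skip_threshold decision at commit time:
# small new relfilenodes are cheaper to copy into WAL as full-page images,
# large ones are cheaper to fsync directly.
WAL_SKIP_THRESHOLD_KB = 2048  # hypothetical setting, in kilobytes


def commit_action(relation_size_bytes: int,
                  threshold_kb: int = WAL_SKIP_THRESHOLD_KB) -> str:
    """Return 'log' to copy the relation's contents to WAL, 'sync' to fsync."""
    if relation_size_bytes / 1024 < threshold_kb:
        return "log"   # emit the pages to WAL
    return "sync"      # fsync the relfilenode instead


assert commit_action(8192) == "log"       # one 8kB page: copy to WAL
assert commit_action(1 << 30) == "sync"   # 1GB relation: fsync it
```

This also explains the portability note above: on platforms where the relevant C type is 32 bits, a threshold setting expressed in kB cannot represent sizes of 2GB or more, hence the s/1TB/1GB/ fix in the TAP test.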
Re: [HACKERS] WAL logging problem in 9.4.3?
On Sat, Mar 21, 2020 at 12:01:27PM -0700, Noah Misch wrote:
> Pushed, after adding a missing "break" to gist_identify() and tweaking two
> more comments. However, a diverse minority of buildfarm members are failing
> like this, in most branches:
>
> Mar 21 13:16:37 # Failed test 'wal_level = minimal, SET TABLESPACE, hint bit'
> Mar 21 13:16:37 # at t/018_wal_optimize.pl line 231.
> Mar 21 13:16:37 # got: '1'
> Mar 21 13:16:37 # expected: '2'
> Mar 21 13:16:46 # Looks like you failed 1 test of 34.
> Mar 21 13:16:46 [13:16:46] t/018_wal_optimize.pl
> --
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2020-03-21%2016%3A52%3A05
>
> Since I run two of the failing animals, I expect to reproduce this soon.

force_parallel_mode = regress was the setting needed to reproduce this:

printf '%s\n%s\n%s\n' 'log_statement = all' 'force_parallel_mode = regress' >/tmp/force_parallel.conf
make -C src/test/recovery check PROVE_TESTS=t/018_wal_optimize.pl TEMP_CONFIG=/tmp/force_parallel.conf

The proximate cause is the RelFileNodeSkippingWAL() call that we added to
MarkBufferDirtyHint(). MarkBufferDirtyHint() runs in parallel workers, but
parallel workers have zeroes for pendingSyncHash and rd_*Subid. I hacked up
the attached patch to understand the scope of the problem (not to commit). It
logs a message whenever a parallel worker uses pendingSyncHash or
RelationNeedsWAL(). Some of the cases happen often enough to make logs huge,
so the patch suppresses logging for them. You can see the lower-volume calls
like this:

printf '%s\n%s\n%s\n%s\n' 'log_statement = all' 'wal_level = minimal' 'max_wal_senders = 0' 'force_parallel_mode = regress' >/tmp/minimal_parallel.conf
make check-world TEMP_CONFIG=/tmp/minimal_parallel.conf
find . -name log | xargs grep -rl 'nm0 invalid'

Not all are actual bugs. For example, get_relation_info() behaves fine:

	/* Temporary and unlogged relations are inaccessible during recovery. */
	if (!RelationNeedsWAL(relation) && RecoveryInProgress())

Kyotaro, can you look through the affected code and propose a strategy for
good coexistence of parallel query with the WAL skipping mechanism?

Since I don't expect one strategy to win clearly and quickly, I plan to revert
the main patch around 2020-03-22 17:30 UTC. That will give the patch about
twenty-four hours in the buildfarm, so more animals can report in. I will
leave the three smaller patches in place.

> fairywren failed differently on 9.5; I have not yet studied it:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2020-03-21%2018%3A01%3A10

This did not remain specific to 9.5. On platforms where SIZEOF_SIZE_T==4 or
SIZEOF_LONG==4, wal_skip_threshold cannot exceed 2GB. A simple s/1TB/1GB/ in
the test should fix this.

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index a25d539..885049d 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -7183,7 +7183,7 @@ log_heap_clean(Relation reln, Buffer buffer,
 	XLogRecPtr	recptr;
 
 	/* Caller should not call me on a non-WAL-logged relation */
-	Assert(RelationNeedsWAL(reln));
+	Assert(RelationNeedsWALKnownProblem(reln));
 
 	xlrec.latestRemovedXid = latestRemovedXid;
 	xlrec.nredirected = nredirected;
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 1794cfd..5be469d 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -258,7 +258,7 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
 	/*
 	 * Emit a WAL XLOG_HEAP2_CLEAN record showing what we did
 	 */
-	if (RelationNeedsWAL(relation))
+	if (RelationNeedsWALKnownProblem(relation))
 	{
 		XLogRecPtr	recptr;
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 8ff49ce..cfa72ce 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -67,7 +67,7 @@ _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp)
 	LockBuffer(sp->buf, BUFFER_LOCK_UNLOCK);
 
 	if (IsMVCCSnapshot(scan->xs_snapshot) &&
-		RelationNeedsWAL(scan->indexRelation) &&
+		RelationNeedsWALKnownProblem(scan->indexRelation) &&
 		!scan->xs_want_itup)
 	{
 		ReleaseBuffer(sp->buf);
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 0ed7c64..14fb36f 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -93,6 +93,8 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
 	BackendId	backend;
 	bool		needs_wal;
 
+	PreventParallelWorker(); /* can't update pendingSyncHash */
+
 	switch (relpersistence)
Re: [HACKERS] WAL logging problem in 9.4.3?
Wow, this thread started in 2015. :-O

	Date: Fri, 3 Jul 2015 00:05:24 +0200

---

On Sat, Mar 21, 2020 at 12:01:27PM -0700, Noah Misch wrote:
> On Sun, Mar 15, 2020 at 08:46:47PM -0700, Noah Misch wrote:
> > On Wed, Mar 04, 2020 at 04:29:19PM +0900, Kyotaro Horiguchi wrote:
> > > The attached is back-patches from 9.5 through master.
> >
> > Thanks. I've made some edits. I'll plan to push the attached patches on
> > Friday or Saturday.
>
> Pushed, after adding a missing "break" to gist_identify() and tweaking two
> more comments. However, a diverse minority of buildfarm members are failing
> like this, in most branches:
>
> Mar 21 13:16:37 # Failed test 'wal_level = minimal, SET TABLESPACE, hint bit'
> Mar 21 13:16:37 # at t/018_wal_optimize.pl line 231.
> Mar 21 13:16:37 # got: '1'
> Mar 21 13:16:37 # expected: '2'
> Mar 21 13:16:46 # Looks like you failed 1 test of 34.
> Mar 21 13:16:46 [13:16:46] t/018_wal_optimize.pl
> --
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2020-03-21%2016%3A52%3A05
>
> Since I run two of the failing animals, I expect to reproduce this soon.
>
> fairywren failed differently on 9.5; I have not yet studied it:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2020-03-21%2018%3A01%3A10
>
> > > It lacks a part of TAP infrastructure nowadays we have, but I
> > > want to have the test (and it actually found a bug I made during this
> > > work). So I added a patch to back-patch TestLib.pm, PostgresNode.pm
> > > and RecursiveCopy.pm along with 018_wal_optimize.pl.
> > > (0004-Add-TAP-test-for-WAL-skipping-feature.patch)
> >
> > That is a good idea. Rather than make it specific to this test, I would like
> > to back-patch all applicable test files from 9.6 src/test/recovery. I'll plan
> > to push that one part on Thursday.
>
> That push did not cause failures.

--
  Bruce Momjian  https://momjian.us
  EnterpriseDB   https://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +
Re: [HACKERS] WAL logging problem in 9.4.3?
On Sun, Mar 15, 2020 at 08:46:47PM -0700, Noah Misch wrote:
> On Wed, Mar 04, 2020 at 04:29:19PM +0900, Kyotaro Horiguchi wrote:
> > The attached is back-patches from 9.5 through master.
>
> Thanks. I've made some edits. I'll plan to push the attached patches on
> Friday or Saturday.

Pushed, after adding a missing "break" to gist_identify() and tweaking two
more comments. However, a diverse minority of buildfarm members are failing
like this, in most branches:

Mar 21 13:16:37 # Failed test 'wal_level = minimal, SET TABLESPACE, hint bit'
Mar 21 13:16:37 # at t/018_wal_optimize.pl line 231.
Mar 21 13:16:37 # got: '1'
Mar 21 13:16:37 # expected: '2'
Mar 21 13:16:46 # Looks like you failed 1 test of 34.
Mar 21 13:16:46 [13:16:46] t/018_wal_optimize.pl
--
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2020-03-21%2016%3A52%3A05

Since I run two of the failing animals, I expect to reproduce this soon.

fairywren failed differently on 9.5; I have not yet studied it:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2020-03-21%2018%3A01%3A10

> > It lacks a part of TAP infrastructure nowadays we have, but I
> > want to have the test (and it actually found a bug I made during this
> > work). So I added a patch to back-patch TestLib.pm, PostgresNode.pm
> > and RecursiveCopy.pm along with 018_wal_optimize.pl.
> > (0004-Add-TAP-test-for-WAL-skipping-feature.patch)
>
> That is a good idea. Rather than make it specific to this test, I would like
> to back-patch all applicable test files from 9.6 src/test/recovery. I'll plan
> to push that one part on Thursday.

That push did not cause failures.
Re: [HACKERS] WAL logging problem in 9.4.3?
Some fixes..

At Wed, 04 Mar 2020 16:29:19 +0900 (JST), Kyotaro Horiguchi wrote in
> At first I fixed several issues in 018_wal_optimize.pl:
>
> - TRUNCATE INSERT, TRUNCATE INSERT PREPARE
>
> It wrongly passes if finally we see the value only from the first
> INSERT. I changed it so that it checks the value, not the number of
> values. Finally it checks both number of values and the largest value.
...
> log_newpage_range has been introduced at PG12. Fortunately the
> required infrastructure is introduced at PG9.5 so what I need to do
> for PG95-PG11 is back-patching the function and its counter part in
- xlog_redo. It doesn't WAL format itself but XLOG_FPI gets to have 2 or
+ xlog_redo. It doesn't change WAL format itself but XLOG_FPI gets to have 2 or
> more backup pages so the compatibility is forward only. That is, newer
> minor versions read WAL from older minor versions, but not vice
> versa. I'm not sure it is back-patchable so in the attached the
> end-of-xact WAL feature is separated for PG9.5-PG11.
> (000x-Add-end-of-xact-WAL-feature-of-WAL-skipping.patch)

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
Re: [HACKERS] WAL logging problem in 9.4.3?
At Sun, 1 Mar 2020 11:56:32 -0800, Noah Misch wrote in
> On Thu, Feb 27, 2020 at 04:00:24PM +0900, Kyotaro Horiguchi wrote:
> > It sounds somewhat obscure.
>
> I see. I won't use that.

Thanks.

> > Couldn't we enumerate examples? And if we
> > could use pg_relation_filenode, I think we can use just
> > "filenode". (Though the word is used in the documentation, it is not
> > defined anywhere..)
>
> func.sgml does define the term. Nonetheless, I'm not using it.

Ah, "The filenode is the base component of the file name(s) used for the
relation".. So it's very similar to "on-disk file" in a sense.

> > In minimal level, no information is logged for
> > permanent relations for the remainder of a transaction that creates
> > them or changes their filenode. For example, CREATE
> > TABLE, CLUSTER or REFRESH MATERIALIZED VIEW are commands of that
> > category.
> >
> > # sorry for bothering you..
>
> Including examples is fine. Attached v36nm has just comment and doc changes.
> Would you translate this into back-patch versions for v9.5 through v12?

The explicit list of commands that initiate the WAL-skipping mode works for
me. I'm going to work on the translation right now.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
Re: [HACKERS] WAL logging problem in 9.4.3?
On Thu, Feb 27, 2020 at 04:00:24PM +0900, Kyotaro Horiguchi wrote:
> At Tue, 25 Feb 2020 21:36:12 -0800, Noah Misch wrote in
> > On Tue, Feb 25, 2020 at 10:01:51AM +0900, Kyotaro Horiguchi wrote:
> > > At Sat, 22 Feb 2020 21:12:20 -0800, Noah Misch wrote in
> > > > On Fri, Feb 21, 2020 at 04:49:59PM +0900, Kyotaro Horiguchi wrote:
> > > If we decide to keep the consistency there, I would like to describe
> > > the code is there for consistency, not for the benefit of a specific
> > > assertion.
> > >
> > > (cluster.c:1116)
> > > -	 * new. The next step for rel2 is deletion, but copy rd_*Subid for the
> > > -	 * benefit of AssertPendingSyncs_RelationCache().
> > > +	 * new. The next step for rel2 is deletion, but copy rd_*Subid for the
> > > +	 * consistency of the fields. It is checked later by
> > > +	 * AssertPendingSyncs_RelationCache().
> >
> > I think the word "consistency" is too vague for "consistency of the fields" to
> > convey information. May I just remove the last sentence of the comment
> > (everything after "* new.")?
>
> I'm fine with that:)
>
> > > I agree that relation works as the generic name of table-like
> > > objects. Addition to that, doesn't using the word "storage file" make
> > > it more clearly? I'm not confident on the wording itself, but it will
> > > look like the following.
> >
> > The docs rarely use "storage file" or "on-disk file" as terms. I hesitate to
> > put more emphasis on files, because they are part of the implementation, not
> > part of the user interface. The term "rewrites"/"rewriting" has the same
> > problem, though. Yet another alternative would be to talk about operations
> > that change the pg_relation_filenode() return value:
> >
> > In minimal level, no information is logged for permanent
> > relations for the remainder of a transaction that creates them or changes
> > what pg_relation_filenode returns for them.
> >
> > What do you think?
>
> It sounds somewhat obscure.

I see. I won't use that.

> Couldn't we enumerate examples? And if we
> could use pg_relation_filenode, I think we can use just
> "filenode". (Though the word is used in the documentation, it is not
> defined anywhere..)

func.sgml does define the term. Nonetheless, I'm not using it.

> In minimal level, no information is logged for
> permanent relations for the remainder of a transaction that creates
> them or changes their filenode. For example, CREATE
> TABLE, CLUSTER or REFRESH MATERIALIZED VIEW are commands of that
> category.
>
> # sorry for bothering you..

Including examples is fine. Attached v36nm has just comment and doc changes.
Would you translate this into back-patch versions for v9.5 through v12?

Author: Noah Misch
Commit: Noah Misch

    Fix cosmetic blemishes involving rd_createSubid.

    Remove an obsolete comment from AtEOXact_cleanup(). Restore formatting
    of a comment in struct RelationData, mangled by the pgindent run in
    commit 9af4159fce6654aa0e081b00d02bca40b978745c. Back-patch to 9.5
    (all supported versions), because another fix stacks on this.

diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 50f8912..44fa843 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -3029,10 +3029,7 @@ AtEOXact_cleanup(Relation relation, bool isCommit)
 	 *
 	 * During commit, reset the flag to zero, since we are now out of the
 	 * creating transaction.  During abort, simply delete the relcache entry
-	 * --- it isn't interesting any longer.  (NOTE: if we have forgotten the
-	 * new-ness of a new relation due to a forced cache flush, the entry will
-	 * get deleted anyway by shared-cache-inval processing of the aborted
-	 * pg_class insertion.)
+	 * --- it isn't interesting any longer.
 	 */
 	if (relation->rd_createSubid != InvalidSubTransactionId)
 	{
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index 31d8a1a..bf7a7df 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -63,7 +63,7 @@ typedef struct RelationData
 								 * rd_replidindex) */
 	bool		rd_statvalid;	/* is rd_statlist valid? */
 
-	/*
+	/*--
 	 * rd_createSubid is the ID of the highest subtransaction the rel has
 	 * survived into; or zero if the rel was not created in the current top
 	 * transaction. This can be now be relied on, whereas previously it could
@@ -73,8 +73,13 @@ typedef struct RelationData
 	 * have forgotten changing it). rd_newRelfilenodeSubid can be forgotten
 	 * when a relation has multiple new relfilenodes within a single
 	 * transaction, with one of them occurring in a subsequently aborted
-	 * subtransaction, e.g. BEGIN; TRUNCATE t; SAVEPOINT save; TRUNCATE t;
-	 *
Re: [HACKERS] WAL logging problem in 9.4.3?
At Tue, 25 Feb 2020 21:36:12 -0800, Noah Misch wrote in > On Tue, Feb 25, 2020 at 10:01:51AM +0900, Kyotaro Horiguchi wrote: > > At Sat, 22 Feb 2020 21:12:20 -0800, Noah Misch wrote in > > > On Fri, Feb 21, 2020 at 04:49:59PM +0900, Kyotaro Horiguchi wrote: > > I agree that the new #ifdef can invite a Heisenbug. I thought that > > you didn't want that because it doesn't make a substantial difference. > > v35nm added swap_relation_files() code so AssertPendingSyncs_RelationCache() > could check rd_droppedSubid relations. v30nm, which did not have > rd_droppedSubid, removed swap_relation_files() code that wasn't making a > difference. OK, I understand: the additional code still makes a difference in an --enable-cassert build. > > If we decide to keep the consistency there, I would like to describe > > that the code is there for consistency, not for the benefit of a specific > > assertion. > > > > (cluster.c:1116) > > -* new. The next step for rel2 is deletion, but copy rd_*Subid for the > > -* benefit of AssertPendingSyncs_RelationCache(). > > +* new. The next step for rel2 is deletion, but copy rd_*Subid for the > > +* consistency of the fields. It is checked later by > > +* AssertPendingSyncs_RelationCache(). > > I think the word "consistency" is too vague for "consistency of the fields" to > convey information. May I just remove the last sentence of the comment > (everything after "* new.")? I'm fine with that:) > > I agree that relation works as the generic name of table-like > > objects. In addition to that, doesn't using the word "storage file" make > > it clearer? I'm not confident in the wording itself, but it will > > look like the following. > > The docs rarely use "storage file" or "on-disk file" as terms. I hesitate to > put more emphasis on files, because they are part of the implementation, not > part of the user interface. The term "rewrites"/"rewriting" has the same > problem, though. 
Yet another alternative would be to talk about operations > that change the pg_relation_filenode() return value: > > In minimal level, no information is logged for permanent > relations for the remainder of a transaction that creates them or changes > what pg_relation_filenode returns for them. > > What do you think? It sounds somewhat obscure. Couldn't we enumerate examples? And if we could use pg_relation_filenode, I think we can use just "filenode". (Though the word is used in the documentation, it is not defined anywhere..) In minimal level, no information is logged for permanent relations for the remainder of a transaction that creates them or changes their filenode. For example, CREATE TABLE, CLUSTER or REFRESH MATERIALIZED VIEW are commands in that category. # sorry for bothering you.. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
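[Editor's note] The wording under discussion concerns operations that assign a relation a new filenode. As a concrete illustration of what "changes what pg_relation_filenode returns" means, here is a hypothetical psql session (not part of the patch; table name made up), assuming a server running with wal_level = minimal:

```sql
-- pg_relation_filenode() reports the relation's current on-disk filenode.
-- Commands such as TRUNCATE, CLUSTER, and REFRESH MATERIALIZED VIEW assign
-- a fresh filenode; under wal_level = minimal, data written to that new
-- filenode can skip WAL until the creating transaction commits.
BEGIN;
CREATE TABLE filenode_demo (i int);
SELECT pg_relation_filenode('filenode_demo');  -- initial filenode
TRUNCATE filenode_demo;                        -- assigns a new filenode
SELECT pg_relation_filenode('filenode_demo');  -- value differs from above
COMMIT;
```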
Re: [HACKERS] WAL logging problem in 9.4.3?
On Tue, Feb 25, 2020 at 10:01:51AM +0900, Kyotaro Horiguchi wrote: > At Sat, 22 Feb 2020 21:12:20 -0800, Noah Misch wrote in > > On Fri, Feb 21, 2020 at 04:49:59PM +0900, Kyotaro Horiguchi wrote: > > > At Wed, 19 Feb 2020 17:29:08 +0900 (JST), Kyotaro Horiguchi > > > wrote in > > > > At Tue, 18 Feb 2020 23:44:52 -0800, Noah Misch > > > > wrote in > > > In swap_relation_files, we can remove rel2-related code when #ifndef > > > USE_ASSERT_CHECKING. > > > > When state is visible to many compilation units, we should avoid making that > > state depend on --enable-cassert. That would be a recipe for a Heisenbug. > > In > > a hot code path, it might be worth the risk. > > I aggree that the new #ifdef can invite a Heisenbug. I thought that > you didn't want that because it doesn't make substantial difference. v35nm added swap_relation_files() code so AssertPendingSyncs_RelationCache() could check rd_droppedSubid relations. v30nm, which did not have rd_droppedSubid, removed swap_relation_files() code that wasn't making a difference. > If we decide to keep the consistency there, I would like to describe > the code is there for consistency, not for the benefit of a specific > assertion. > > (cluster.c:1116) > -* new. The next step for rel2 is deletion, but copy rd_*Subid for the > -* benefit of AssertPendingSyncs_RelationCache(). > +* new. The next step for rel2 is deletion, but copy rd_*Subid for the > +* consistency of the fieles. It is checked later by > +* AssertPendingSyncs_RelationCache(). I think the word "consistency" is too vague for "consistency of the fields" to convey information. May I just remove the last sentence of the comment (everything after "* new.")? > > > config.sgml: > > > +When wal_level is minimal > > > and a > > > +transaction commits after creating or rewriting a permanent > > > table, > > > +materialized view, or index, this setting determines how to > > > persist > > > > > > "creating or truncation" a permanent table? 
and maybe "refreshing > > > matview and reindex". I'm not sure that they can be merged that way. > ... > > I like mentioning truncation, but I dislike how this implies that CREATE > > INDEX, CREATE MATERIALIZED VIEW, and ALTER INDEX SET TABLESPACE aren't in > > scope. While I usually avoid the word "relation" in documentation, I can > > justify it here to make the sentence less complex. How about the following? > > > > --- a/doc/src/sgml/config.sgml > > +++ b/doc/src/sgml/config.sgml > > @@ -2484,9 +2484,9 @@ include_dir 'conf.d' > > In minimal level, no information is logged for > > -tables or indexes for the remainder of a transaction that creates > > or > > -truncates them. This can make bulk operations much faster (see > > -). But minimal WAL does not contain > > -enough information to reconstruct the data from a base backup and > > the > > -WAL logs, so replica or higher must be used to > > -enable WAL archiving () and > > -streaming replication. > > +permanent relations for the remainder of a transaction that > > creates, > > +rewrites, or truncates them. This can make bulk operations much > > +faster (see ). But minimal WAL does > > +not contain enough information to reconstruct the data from a base > > +backup and the WAL logs, so replica or higher > > must > > +be used to enable WAL archiving ( > linkend="guc-archive-mode"/>) > > +and streaming replication. > > > > @@ -2891,9 +2891,9 @@ include_dir 'conf.d' > > When wal_level is minimal > > and a > > -transaction commits after creating or rewriting a permanent table, > > -materialized view, or index, this setting determines how to persist > > -the new data. If the data is smaller than this setting, write it > > to > > -the WAL log; otherwise, use an fsync of the data file. Depending > > on > > -the properties of your storage, raising or lowering this value > > might > > -help if such commits are slowing concurrent transactions. The > > default > > -is two megabytes (2MB). 
> > +transaction commits after creating, rewriting, or truncating a > > +permanent relation, this setting determines how to persist the new > > +data. If the data is smaller than this setting, write it to the > > WAL > > +log; otherwise, use an fsync of the data file. Depending on the > > +properties of your storage, raising or lowering this value might > > help > > +if such commits are slowing concurrent transactions. The default > > is > > +two megabytes (2MB). > > > > I agree that relation works as the generic name of table-like > objects. Addition to that, doesn't using the word "storage file" make > it more clearly? I'm not confident on the wording itself, but it will > look like the
Re: [HACKERS] WAL logging problem in 9.4.3?
At Sat, 22 Feb 2020 21:12:20 -0800, Noah Misch wrote in > On Fri, Feb 21, 2020 at 04:49:59PM +0900, Kyotaro Horiguchi wrote: > > At Wed, 19 Feb 2020 17:29:08 +0900 (JST), Kyotaro Horiguchi > > wrote in > > > At Tue, 18 Feb 2020 23:44:52 -0800, Noah Misch wrote > > > in > > > > - When reusing an index build, instead of storing the dropped relid in > > > > the > > > > IndexStmt and opening the dropped relcache entry in ATExecAddIndex(), > > > > store > > > > the subid fields in the IndexStmt. This is less code, and I felt > > > > RelationIdGetRelationCache() invited misuse. > > > > > > Hmm. I'm not sure that index_create having the new subid parameters is > > > good. And the last if(OidIsValid) clause handles storage persistence > > > so I did that there. But I'm not strongly against it. > > Agreed. My choice there was not a clear improvement. > > > The change to alter_table.sql and create_table.sql is expected to > > cause an assertion failure. Don't we need that kind of explanation in > > the comment? > > Test comments generally describe the feature unique to that test, not how the > test might break. Some tests list bug numbers, but that doesn't apply here. Agreed. > > In swap_relation_files, we can remove rel2-related code when #ifndef > > USE_ASSERT_CHECKING. > > When state is visible to many compilation units, we should avoid making that > state depend on --enable-cassert. That would be a recipe for a Heisenbug. In > a hot code path, it might be worth the risk. I agree that the new #ifdef can invite a Heisenbug. I thought that you didn't want that because it doesn't make a substantial difference. If we decide to keep the consistency there, I would like to describe that the code is there for consistency, not for the benefit of a specific assertion. (cluster.c:1116) -* new. The next step for rel2 is deletion, but copy rd_*Subid for the -* benefit of AssertPendingSyncs_RelationCache(). +* new. 
The next step for rel2 is deletion, but copy rd_*Subid for the +* consistency of the fields. It is checked later by +* AssertPendingSyncs_RelationCache(). > > The patch adds the test for createSubid to pg_visibility.out. It > > doesn't fail without CLOBBER_CACHE_ALWAYS during the regression test, but > > CLOBBER_CACHE_ALWAYS causes initdb to fail, and the added check won't be > > reached. I'm not sure it is useful. > > I agree it's not clearly useful, but tests don't need to meet a "clearly > useful" standard. When a fast test is not clearly redundant with another > test, we generally accept it. In the earlier patch version that inspired this > test, RELCACHE_FORCE_RELEASE sufficed to make it fail. > > > config.sgml: > > +When wal_level is minimal > > and a > > +transaction commits after creating or rewriting a permanent table, > > +materialized view, or index, this setting determines how to persist > > > > "creating or truncating" a permanent table? and maybe "refreshing > > matview and reindex". I'm not sure that they can be merged that way. ... > I like mentioning truncation, but I dislike how this implies that CREATE > INDEX, CREATE MATERIALIZED VIEW, and ALTER INDEX SET TABLESPACE aren't in > scope. While I usually avoid the word "relation" in documentation, I can > justify it here to make the sentence less complex. How about the following? > > --- a/doc/src/sgml/config.sgml > +++ b/doc/src/sgml/config.sgml > @@ -2484,9 +2484,9 @@ include_dir 'conf.d' > In minimal level, no information is logged for > -tables or indexes for the remainder of a transaction that creates or > -truncates them. This can make bulk operations much faster (see > -). But minimal WAL does not contain > -enough information to reconstruct the data from a base backup and the > -WAL logs, so replica or higher must be used to > -enable WAL archiving () and > -streaming replication. > +permanent relations for the remainder of a transaction that creates, > +rewrites, or truncates them. 
This can make bulk operations much > +faster (see ). But minimal WAL does > +not contain enough information to reconstruct the data from a base > +backup and the WAL logs, so replica or higher must > +be used to enable WAL archiving () > +and streaming replication. > > @@ -2891,9 +2891,9 @@ include_dir 'conf.d' > When wal_level is minimal and a > -transaction commits after creating or rewriting a permanent table, > -materialized view, or index, this setting determines how to persist > -the new data. If the data is smaller than this setting, write it to > -the WAL log; otherwise, use an fsync of the data file. Depending on > -the properties of your storage, raising or lowering this value might > -help if such commits are slowing concurrent transactions. The > default > -is two megabytes (2MB). >
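[Editor's note] The second hunk above describes the commit-time decision for WAL-skipped relations: new data smaller than the setting is written to WAL at commit, larger data is fsync'd instead. A sketch of the bulk-load pattern this setting governs (hypothetical table and file names; assumes a server started with wal_level = minimal, and refers to the threshold GUC this doc hunk documents, committed as wal_skip_threshold):

```sql
-- The table's new filenode is invisible to other transactions until commit,
-- so the COPY below need not be WAL-logged row by row.  At commit, data
-- under the threshold (default 2MB) goes to WAL; larger data is fsync'd.
BEGIN;
CREATE TABLE bulk_load_demo (id int, payload text);          -- hypothetical
COPY bulk_load_demo FROM '/tmp/bulk.csv' WITH (FORMAT csv);  -- hypothetical path
COMMIT;
```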
Re: [HACKERS] WAL logging problem in 9.4.3?
On Fri, Feb 21, 2020 at 04:49:59PM +0900, Kyotaro Horiguchi wrote: > At Wed, 19 Feb 2020 17:29:08 +0900 (JST), Kyotaro Horiguchi > wrote in > > At Tue, 18 Feb 2020 23:44:52 -0800, Noah Misch wrote in > > > - When reusing an index build, instead of storing the dropped relid in the > > > IndexStmt and opening the dropped relcache entry in ATExecAddIndex(), > > > store > > > the subid fields in the IndexStmt. This is less code, and I felt > > > RelationIdGetRelationCache() invited misuse. > > > > Hmm. I'm not sure that index_create having the new subid parameters is > > good. And the last if(OidIsValid) clause handles storage persistence > > so I did that there. But I don't strongly against it. Agreed. My choice there was not a clear improvement. > The change on alter_table.sql and create_table.sql is expecting to > cause assertion failure. Don't we need that kind of explanation in > the comment? Test comments generally describe the feature unique to that test, not how the test might break. Some tests list bug numbers, but that doesn't apply here. > In swap_relation_files, we can remove rel2-related code when #ifndef > USE_ASSERT_CHECKING. When state is visible to many compilation units, we should avoid making that state depend on --enable-cassert. That would be a recipe for a Heisenbug. In a hot code path, it might be worth the risk. > The patch adds the test for createSubid to pg_visibility.out. It > doesn't fail without CLOBBER_CACHE_ALWAYS while regression test but > CLOBBER_CACHE_ALWAYS causes initdb fail and the added check won't be > reached. I'm not sure it is useful. I agree it's not clearly useful, but tests don't need to meet a "clearly useful" standard. When a fast test is not clearly redundant with another test, we generally accept it. In the earlier patch version that inspired this test, RELCACHE_FORCE_RELEASE sufficed to make it fail. 
> config.sgml: > +When wal_level is minimal and a > +transaction commits after creating or rewriting a permanent table, > +materialized view, or index, this setting determines how to persist > > "creating or truncation" a permanent table? and maybe "refreshing > matview and reindex". I'm not sure that they can be merged that way. > --- a/doc/src/sgml/config.sgml > +++ b/doc/src/sgml/config.sgml > @@ -2889,13 +2889,13 @@ include_dir 'conf.d' > > > When wal_level is minimal and a > -transaction commits after creating or rewriting a permanent table, > -materialized view, or index, this setting determines how to persist > -the new data. If the data is smaller than this setting, write it to > -the WAL log; otherwise, use an fsync of the data file. Depending on > -the properties of your storage, raising or lowering this value might > -help if such commits are slowing concurrent transactions. The > default > -is two megabytes (2MB). > +transaction commits after creating or truncating a permanent table, > +refreshing a materialized view, or reindexing, this setting > determines > +how to persist the new data. If the data is smaller than this > +setting, write it to the WAL log; otherwise, use an fsync of the data > +file. Depending on the properties of your storage, raising or > +lowering this value might help if such commits are slowing concurrent > +transactions. The default is two megabytes (2MB). > I like mentioning truncation, but I dislike how this implies that CREATE INDEX, CREATE MATERIALIZED VIEW, and ALTER INDEX SET TABLESPACE aren't in scope. While I usually avoid the word "relation" in documentation, I can justify it here to make the sentence less complex. How about the following? --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -2484,9 +2484,9 @@ include_dir 'conf.d' In minimal level, no information is logged for -tables or indexes for the remainder of a transaction that creates or -truncates them. 
This can make bulk operations much faster (see -). But minimal WAL does not contain -enough information to reconstruct the data from a base backup and the -WAL logs, so replica or higher must be used to -enable WAL archiving () and -streaming replication. +permanent relations for the remainder of a transaction that creates, +rewrites, or truncates them. This can make bulk operations much +faster (see ). But minimal WAL does +not contain enough information to reconstruct the data from a base +backup and the WAL logs, so replica or higher must +be used to enable WAL archiving () +and streaming replication. @@ -2891,9 +2891,9 @@ include_dir 'conf.d' When wal_level is minimal and a -transaction commits after creating or rewriting a permanent table, -materialized view, or index, this
Re: [HACKERS] WAL logging problem in 9.4.3?
Hello. I looked through the latest patch. At Wed, 19 Feb 2020 17:29:08 +0900 (JST), Kyotaro Horiguchi wrote in > At Tue, 18 Feb 2020 23:44:52 -0800, Noah Misch wrote in > > - When reusing an index build, instead of storing the dropped relid in the > > IndexStmt and opening the dropped relcache entry in ATExecAddIndex(), > > store > > the subid fields in the IndexStmt. This is less code, and I felt > > RelationIdGetRelationCache() invited misuse. > > Hmm. I'm not sure that index_create having the new subid parameters is > good. And the last if(OidIsValid) clause handles storage persistence > so I did that there. But I'm not strongly against it. > > Please give me a bit more time to look at it. The change to alter_table.sql and create_table.sql is expected to cause an assertion failure. Don't we need that kind of explanation in the comment? In swap_relation_files, we can remove rel2-related code when #ifndef USE_ASSERT_CHECKING. The patch adds the test for createSubid to pg_visibility.out. It doesn't fail without CLOBBER_CACHE_ALWAYS during the regression test, but CLOBBER_CACHE_ALWAYS causes initdb to fail, and the added check won't be reached. I'm not sure it is useful. config.sgml: +When wal_level is minimal and a +transaction commits after creating or rewriting a permanent table, +materialized view, or index, this setting determines how to persist "creating or truncating" a permanent table? and maybe "refreshing matview and reindex". I'm not sure that they can be merged that way. Items other than the one related to pg_visibility.sql are in the attached patch. The others look good to me. regards. 
-- Kyotaro Horiguchi NTT Open Source Software Center diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 3068e1e94a..38a2edf860 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -2889,13 +2889,13 @@ include_dir 'conf.d' When wal_level is minimal and a -transaction commits after creating or rewriting a permanent table, -materialized view, or index, this setting determines how to persist -the new data. If the data is smaller than this setting, write it to -the WAL log; otherwise, use an fsync of the data file. Depending on -the properties of your storage, raising or lowering this value might -help if such commits are slowing concurrent transactions. The default -is two megabytes (2MB). +transaction commits after creating or truncating a permanent table, +refreshing a materialized view, or reindexing, this setting determines +how to persist the new data. If the data is smaller than this +setting, write it to the WAL log; otherwise, use an fsync of the data +file. Depending on the properties of your storage, raising or +lowering this value might help if such commits are slowing concurrent +transactions. The default is two megabytes (2MB). 
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c index 391a8a9ea3..682619c9db 100644 --- a/src/backend/commands/cluster.c +++ b/src/backend/commands/cluster.c @@ -1118,16 +1118,18 @@ swap_relation_files(Oid r1, Oid r2, bool target_is_pg_class, */ { Relationrel1, - rel2; + rel2 PG_USED_FOR_ASSERTS_ONLY; rel1 = relation_open(r1, NoLock); +#ifdef USE_ASSERT_CHECKING rel2 = relation_open(r2, NoLock); rel2->rd_createSubid = rel1->rd_createSubid; rel2->rd_newRelfilenodeSubid = rel1->rd_newRelfilenodeSubid; rel2->rd_firstRelfilenodeSubid = rel1->rd_firstRelfilenodeSubid; + relation_close(rel2, NoLock); +#endif RelationAssumeNewRelfilenode(rel1); relation_close(rel1, NoLock); - relation_close(rel2, NoLock); } /* diff --git a/src/test/regress/expected/alter_table.out b/src/test/regress/expected/alter_table.out index 7c2181ac2f..3c500944cd 100644 --- a/src/test/regress/expected/alter_table.out +++ b/src/test/regress/expected/alter_table.out @@ -1985,7 +1985,8 @@ select * from another; drop table another; -- Create an index that skips WAL, then perform a SET DATA TYPE that skips --- rewriting the index. +-- rewriting the index. Inadvertent changes on rd_createSubid causes +-- asseertion failure. begin; create table skip_wal_skip_rewrite_index (c varchar(10) primary key); alter table skip_wal_skip_rewrite_index alter c type varchar(20); diff --git a/src/test/regress/expected/create_table.out b/src/test/regress/expected/create_table.out index 6acf31725f..dae7595957 100644 --- a/src/test/regress/expected/create_table.out +++ b/src/test/regress/expected/create_table.out @@ -332,6 +332,7 @@ ERROR: set-returning functions are not allowed in DEFAULT
Re: [HACKERS] WAL logging problem in 9.4.3?
Sorry, just one fix. (omitting some typos, though..) At Wed, 19 Feb 2020 17:29:08 +0900 (JST), Kyotaro Horiguchi wrote in > At Tue, 18 Feb 2020 23:44:52 -0800, Noah Misch wrote in > > I think attached v35nm is ready for commit to master. Would anyone like to > > talk me out of back-patching this? I would not enjoy back-patching it, but > > it's hard to justify lack of back-patch for a data-loss bug. > > > > Notable changes since v34: > > > > - Separate a few freestanding fixes into their own patches. > > All of the three patches look fine. > > > On Mon, Jan 27, 2020 at 07:28:31PM +0900, Kyotaro Horiguchi wrote: > > > --- a/src/backend/catalog/storage.c > > > +++ b/src/backend/catalog/storage.c > > > @@ -388,13 +388,7 @@ RelationPreTruncate(Relation rel) > > > /* Record largest maybe-unsynced block of files under tracking */ > > > pending = hash_search(pendingSyncHash, &(rel->rd_smgr->smgr_rnode.node), > > > HASH_FIND, NULL); > > > - if (pending) > > > - { > > > - BlockNumber nblocks = smgrnblocks(rel->rd_smgr, MAIN_FORKNUM); > > > - > > > - if (pending->max_truncated < nblocks) > > > - pending->max_truncated = nblocks; > > > - } > > > + pending->is_truncated = true; > > > > - Fix this crashing when "pending" is NULL, as it is in this test case: > > > > begin; > > create temp table t (); > > create table t2 (); -- cause pendingSyncHash to exist > > truncate t; > > rollback; > > That's terrible... Thanks for fixint it. > > > - Fix the "deleted while still in use" problem that Thomas Munro reported, > > by > > removing the heap_create() change. Restoring the saved rd_createSubid had > > made obsolete the heap_create() change. check-world now passes with > > wal_level=minimal and CLOBBER_CACHE_ALWAYS. > > Ok, as in the previous mail. > > > - Set rd_droppedSubid in RelationForgetRelation(), not > > RelationClearRelation(). 
RelationForgetRelation() knows it is processing > > a > > drop, but RelationClearRelation() could only infer that from > > circumstantial > > evidence. This seems more future-proof to me. > > Agreed. Different from RelationClearRelation, RelationForgetRelation > is called only for dropping the relation. > > > - When reusing an index build, instead of storing the dropped relid in the > > IndexStmt and opening the dropped relcache entry in ATExecAddIndex(), > > store > > the subid fields in the IndexStmt. This is less code, and I felt > > RelationIdGetRelationCache() invited misuse. > > Hmm. I'm not sure that index_create having the new subid parameters is > good. And the last if(OidIsValid) clause handles storage persistence > so I did that there. But I'm not strongly against it. Hmm. I'm not sure that index_create having the new subid parameters is good. And the last if(OidIsValid) clause in AtExecAddIndex handles storage persistence so I did that there. But I'm not strongly against it. > Please give me a bit more time to look at it. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
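[Editor's note] For context, the index-build reuse being discussed is exercised by the regression test the patch adds to alter_table; in SQL form (adapted from the patch's expected output shown later in this thread, with the closing commit assumed; runnable under wal_level = minimal):

```sql
-- Under wal_level = minimal, the primary-key index created here skips WAL.
-- The SET DATA TYPE below does not rewrite the index, so the reused index
-- build must carry over the rd_*Subid state; getting that wrong trips
-- AssertPendingSyncs_RelationCache() in assert-enabled builds.
begin;
create table skip_wal_skip_rewrite_index (c varchar(10) primary key);
alter table skip_wal_skip_rewrite_index alter c type varchar(20);
commit;
```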
Re: [HACKERS] WAL logging problem in 9.4.3?
At Tue, 18 Feb 2020 23:44:52 -0800, Noah Misch wrote in > I think attached v35nm is ready for commit to master. Would anyone like to > talk me out of back-patching this? I would not enjoy back-patching it, but > it's hard to justify lack of back-patch for a data-loss bug. > > Notable changes since v34: > > - Separate a few freestanding fixes into their own patches. All three patches look fine. > On Mon, Jan 27, 2020 at 07:28:31PM +0900, Kyotaro Horiguchi wrote: > > --- a/src/backend/catalog/storage.c > > +++ b/src/backend/catalog/storage.c > > @@ -388,13 +388,7 @@ RelationPreTruncate(Relation rel) > > /* Record largest maybe-unsynced block of files under tracking */ > > pending = hash_search(pendingSyncHash, &(rel->rd_smgr->smgr_rnode.node), > > HASH_FIND, NULL); > > - if (pending) > > - { > > - BlockNumber nblocks = smgrnblocks(rel->rd_smgr, MAIN_FORKNUM); > > - > > - if (pending->max_truncated < nblocks) > > - pending->max_truncated = nblocks; > > - } > > + pending->is_truncated = true; > > - Fix this crashing when "pending" is NULL, as it is in this test case: > > begin; > create temp table t (); > create table t2 (); -- cause pendingSyncHash to exist > truncate t; > rollback; That's terrible... Thanks for fixing it. > - Fix the "deleted while still in use" problem that Thomas Munro reported, by > removing the heap_create() change. Restoring the saved rd_createSubid had > made obsolete the heap_create() change. check-world now passes with > wal_level=minimal and CLOBBER_CACHE_ALWAYS. Ok, as in the previous mail. > - Set rd_droppedSubid in RelationForgetRelation(), not > RelationClearRelation(). RelationForgetRelation() knows it is processing a > drop, but RelationClearRelation() could only infer that from circumstantial > evidence. This seems more future-proof to me. Agreed. Different from RelationClearRelation, RelationForgetRelation is called only for dropping the relation. 
> - When reusing an index build, instead of storing the dropped relid in the > IndexStmt and opening the dropped relcache entry in ATExecAddIndex(), store > the subid fields in the IndexStmt. This is less code, and I felt > RelationIdGetRelationCache() invited misuse. Hmm. I'm not sure that index_create having the new subid parameters is good. And the last if(OidIsValid) clause handles storage persistence so I did that there. But I'm not strongly against it. Please give me a bit more time to look at it. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: [HACKERS] WAL logging problem in 9.4.3?
I think attached v35nm is ready for commit to master. Would anyone like to talk me out of back-patching this? I would not enjoy back-patching it, but it's hard to justify lack of back-patch for a data-loss bug. Notable changes since v34: - Separate a few freestanding fixes into their own patches. On Mon, Jan 27, 2020 at 07:28:31PM +0900, Kyotaro Horiguchi wrote: > --- a/src/backend/catalog/storage.c > +++ b/src/backend/catalog/storage.c > @@ -388,13 +388,7 @@ RelationPreTruncate(Relation rel) > /* Record largest maybe-unsynced block of files under tracking */ > pending = hash_search(pendingSyncHash, &(rel->rd_smgr->smgr_rnode.node), > HASH_FIND, NULL); > - if (pending) > - { > - BlockNumber nblocks = smgrnblocks(rel->rd_smgr, MAIN_FORKNUM); > - > - if (pending->max_truncated < nblocks) > - pending->max_truncated = nblocks; > - } > + pending->is_truncated = true; - Fix this crashing when "pending" is NULL, as it is in this test case: begin; create temp table t (); create table t2 (); -- cause pendingSyncHash to exist truncate t; rollback; - Fix the "deleted while still in use" problem that Thomas Munro reported, by removing the heap_create() change. Restoring the saved rd_createSubid had made obsolete the heap_create() change. check-world now passes with wal_level=minimal and CLOBBER_CACHE_ALWAYS. - Set rd_droppedSubid in RelationForgetRelation(), not RelationClearRelation(). RelationForgetRelation() knows it is processing a drop, but RelationClearRelation() could only infer that from circumstantial evidence. This seems more future-proof to me. - When reusing an index build, instead of storing the dropped relid in the IndexStmt and opening the dropped relcache entry in ATExecAddIndex(), store the subid fields in the IndexStmt. This is less code, and I felt RelationIdGetRelationCache() invited misuse. Author: Noah Misch Commit: Noah Misch Fix cosmetic blemishes involving rd_createSubid. Remove an obsolete comment from AtEOXact_cleanup(). 
Restore formatting of a comment in struct RelationData, mangled by the pgindent run in commit 9af4159fce6654aa0e081b00d02bca40b978745c. Back-patch to 9.5 (all supported versions), because another fix stacks on this. diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c index 50f8912..44fa843 100644 --- a/src/backend/utils/cache/relcache.c +++ b/src/backend/utils/cache/relcache.c @@ -3029,10 +3029,7 @@ AtEOXact_cleanup(Relation relation, bool isCommit) * * During commit, reset the flag to zero, since we are now out of the * creating transaction. During abort, simply delete the relcache entry -* --- it isn't interesting any longer. (NOTE: if we have forgotten the -* new-ness of a new relation due to a forced cache flush, the entry will -* get deleted anyway by shared-cache-inval processing of the aborted -* pg_class insertion.) +* --- it isn't interesting any longer. */ if (relation->rd_createSubid != InvalidSubTransactionId) { diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h index 31d8a1a..bf7a7df 100644 --- a/src/include/utils/rel.h +++ b/src/include/utils/rel.h @@ -63,7 +63,7 @@ typedef struct RelationData * rd_replidindex) */ boolrd_statvalid; /* is rd_statlist valid? */ - /* + /*-- * rd_createSubid is the ID of the highest subtransaction the rel has * survived into; or zero if the rel was not created in the current top * transaction. This can be now be relied on, whereas previously it could @@ -73,8 +73,13 @@ typedef struct RelationData * have forgotten changing it). rd_newRelfilenodeSubid can be forgotten * when a relation has multiple new relfilenodes within a single * transaction, with one of them occurring in a subsequently aborted -* subtransaction, e.g. BEGIN; TRUNCATE t; SAVEPOINT save; TRUNCATE t; -* ROLLBACK TO save; -- rd_newRelfilenodeSubid is now forgotten +* subtransaction, e.g. 
+* BEGIN; +* TRUNCATE t; +* SAVEPOINT save; +* TRUNCATE t; +* ROLLBACK TO save; +* -- rd_newRelfilenodeSubid is now forgotten */ SubTransactionId rd_createSubid;/* rel was created in current xact */ SubTransactionId rd_newRelfilenodeSubid;/* new relfilenode assigned in Author: Noah Misch Commit: Noah Misch During heap rebuild, lock any TOAST index until end of transaction. swap_relation_files() calls toast_get_valid_index() to find and lock this index, just before
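The "pending" NULL crash fixed in the notes above reduces to an unchecked hash lookup. A minimal self-contained sketch of the guarded version (the struct and function names model PostgreSQL's pendingSyncHash machinery and are illustrative, not the actual definitions):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Minimal model of the RelationPreTruncate() crash.  pendingSyncHash can
 * exist (because some other relation in the transaction is WAL-skipping)
 * while the relation being truncated has no entry in it, so the hash
 * lookup can return NULL and must be guarded before setting the flag.
 */
typedef struct PendingRelSyncModel
{
	bool		is_truncated;	/* truncated in the current transaction */
} PendingRelSyncModel;

static void
model_pre_truncate(PendingRelSyncModel *pending)
{
	/* the v34 code dereferenced "pending" unconditionally here */
	if (pending)
		pending->is_truncated = true;
}
```

With the guard in place, a sequence like the quoted test case (create a tracked temp table so the hash exists, then truncate an untracked table) no longer dereferences a NULL pointer.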
Re: [HACKERS] WAL logging problem in 9.4.3?
Oops. I was working on the wrong branch and got stuck in a slow build on Windows...

At Tue, 18 Feb 2020 00:53:37 -0800, Noah Misch wrote in
> On Tue, Feb 18, 2020 at 03:56:15PM +1300, Thomas Munro wrote:
> > CREATE TYPE priv_testtype1 AS (a int, b text);
> > +ERROR: relation 24844 deleted while still in use
> > REVOKE USAGE ON TYPE priv_testtype1 FROM PUBLIC;
> >
> > https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.79923
> >
> > It didn't fail on the same OS a couple of days earlier:
> >
> > https://ci.appveyor.com/project/postgresql-cfbot/postgresql/builds/30829686
>
> Thanks for the report. This reproduces consistently under
> CLOBBER_CACHE_ALWAYS (which, coincidentally, I started today). Removing the
> heap_create() change fixes it. Since we now restore a saved rd_createSubid,
> the heap_create() change is obsolete. My next version will include that fix.

Yes, ATExecAddIndex correctly set createSubid even without that change.

> The system uses rd_createSubid to mean two things. First, rd_node is new.
> Second, the rel might not yet be in catalogs, so we can't rebuild its relcache
> entry. The first can be false while the second is true, hence this failure.
> However, the second is true in a relatively-narrow period in which we don't
> run arbitrary user code. Hence, that simple fix suffices.

I had overlooked the second meaning. I thought the failure was caused by
invalidation, but I couldn't get a core dump on Windows 10. The comment for
RelationCacheInvalidate only faintly explains that second meaning.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
Re: [HACKERS] WAL logging problem in 9.4.3?
On Tue, Feb 18, 2020 at 03:56:15PM +1300, Thomas Munro wrote: > CREATE TYPE priv_testtype1 AS (a int, b text); > +ERROR: relation 24844 deleted while still in use > REVOKE USAGE ON TYPE priv_testtype1 FROM PUBLIC; > > https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.79923 > > It didn't fail on the same OS a couple of days earlier: > > https://ci.appveyor.com/project/postgresql-cfbot/postgresql/builds/30829686 Thanks for the report. This reproduces consistently under CLOBBER_CACHE_ALWAYS (which, coincidentally, I started today). Removing the heap_create() change fixes it. Since we now restore a saved rd_createSubid, the heap_create() change is obsolete. My next version will include that fix. The system uses rd_createSubid to mean two things. First, rd_node is new. Second, the rel might not yet be in catalogs, so we can't rebuild its relcache entry. The first can be false while the second is true, hence this failure. However, the second is true in a relatively-narrow period in which we don't run arbitrary user code. Hence, that simple fix suffices.
Re: [HACKERS] WAL logging problem in 9.4.3?
On Mon, Jan 27, 2020 at 11:30 PM Kyotaro Horiguchi wrote: > Hello, this is rebased then addressed version. Hi, I haven't followed this thread but I just noticed this strange looking failure: CREATE TYPE priv_testtype1 AS (a int, b text); +ERROR: relation 24844 deleted while still in use REVOKE USAGE ON TYPE priv_testtype1 FROM PUBLIC; https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.79923 It didn't fail on the same OS a couple of days earlier: https://ci.appveyor.com/project/postgresql-cfbot/postgresql/builds/30829686
Re: [HACKERS] WAL logging problem in 9.4.3?
Hello, this is the rebased and updated version.

- A valid rd_firstRelfilenodeSubid now causes drop-pending of the relcache
  entry, just as rd_createSubid does. The flag is no longer forgotten in the
  last example.
- Revert the (really) useless change to AssertPendingSyncs_RelationCache.
- Fix several comments. Some of the fixes are just rewording and some are
  related to the first change above.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

>From 91ae812abd0fcfe0172b7bb0ad563d3a7e5dd009 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi
Date: Thu, 21 Nov 2019 15:28:06 +0900
Subject: [PATCH v34 1/4] Rework WAL-skipping optimization

While wal_level=minimal we omit WAL-logging for certain operations on
relfilenodes that are created in the current transaction. The files are
fsynced at commit. The machinery accelerates bulk-insertion operations, but
it fails for certain sequences of operations, and a crash just after commit
may leave broken table files.

This patch overhauls the machinery so that WAL logging is omitted for all
operations on such relfilenodes. It also introduces a new feature whereby
small files are emitted as WAL records instead of being synced. The new GUC
variable wal_skip_threshold controls the threshold.
--- doc/src/sgml/config.sgml| 43 ++- doc/src/sgml/perform.sgml | 47 +-- src/backend/access/common/toast_internals.c | 4 +- src/backend/access/gist/gistutil.c | 31 +- src/backend/access/gist/gistxlog.c | 21 ++ src/backend/access/heap/heapam.c| 45 +-- src/backend/access/heap/heapam_handler.c| 22 +- src/backend/access/heap/rewriteheap.c | 21 +- src/backend/access/nbtree/nbtsort.c | 41 +-- src/backend/access/rmgrdesc/gistdesc.c | 5 + src/backend/access/transam/README | 45 ++- src/backend/access/transam/xact.c | 15 + src/backend/access/transam/xloginsert.c | 10 +- src/backend/access/transam/xlogutils.c | 18 +- src/backend/catalog/heap.c | 4 + src/backend/catalog/storage.c | 248 - src/backend/commands/cluster.c | 12 +- src/backend/commands/copy.c | 58 +-- src/backend/commands/createas.c | 13 +- src/backend/commands/matview.c | 12 +- src/backend/commands/tablecmds.c| 11 +- src/backend/storage/buffer/bufmgr.c | 125 ++- src/backend/storage/lmgr/lock.c | 12 + src/backend/storage/smgr/md.c | 36 +- src/backend/storage/smgr/smgr.c | 35 ++ src/backend/utils/cache/relcache.c | 161 +++-- src/backend/utils/misc/guc.c| 13 + src/include/access/gist_private.h | 2 + src/include/access/gistxlog.h | 1 + src/include/access/heapam.h | 3 - src/include/access/rewriteheap.h| 2 +- src/include/access/tableam.h| 15 +- src/include/catalog/storage.h | 6 + src/include/storage/bufmgr.h| 4 + src/include/storage/lock.h | 3 + src/include/storage/smgr.h | 1 + src/include/utils/rel.h | 55 ++- src/include/utils/relcache.h| 8 +- src/test/recovery/t/018_wal_optimize.pl | 374 src/test/regress/expected/alter_table.out | 6 + src/test/regress/sql/alter_table.sql| 7 + 41 files changed, 1243 insertions(+), 352 deletions(-) create mode 100644 src/test/recovery/t/018_wal_optimize.pl diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index e07dc01e80..ea1c866a15 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -2481,21 +2481,14 @@ include_dir 'conf.d' levels. 
This parameter can only be set at server start. -In minimal level, WAL-logging of some bulk -operations can be safely skipped, which can make those -operations much faster (see ). -Operations in which this optimization can be applied include: - - CREATE TABLE AS - CREATE INDEX - CLUSTER - COPY into tables that were created or truncated in the same - transaction - -But minimal WAL does not contain enough information to reconstruct the -data from a base backup and the WAL logs, so replica or -higher must be used to enable WAL archiving -() and streaming replication. +In minimal level, no information is logged for +tables or indexes for the remainder of a transaction that creates or +truncates them. This can make bulk operations much faster (see +). But minimal WAL does not contain +enough information to reconstruct the data from a base backup and the +WAL logs, so replica or higher must be used to +enable WAL archiving () and +
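The wal_skip_threshold behavior described in the commit message boils down to a commit-time size test. A simplified sketch (the real code in storage.c sums the sizes of all forks of each WAL-skipped relfilenode; the function name here is illustrative):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Sketch of the commit-time decision introduced by wal_skip_threshold:
 * for a WAL-skipped relfilenode, small files are cheaper to emit as WAL
 * records (log_newpage_range()), while large ones are cheaper to fsync.
 * Sizes are in kilobytes, matching the GUC's unit.
 */
static bool
model_choose_fsync(uint64_t total_size_kb, uint64_t wal_skip_threshold_kb)
{
	return total_size_kb >= wal_skip_threshold_kb;
}
```

The threshold trades WAL volume against fsync latency: raising it makes commit WAL-log more data but sync fewer files.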
Re: [HACKERS] WAL logging problem in 9.4.3?
By the way, the previous version looks somewhat different from what I thought
I had posted..

At Sun, 26 Jan 2020 20:57:00 -0800, Noah Misch wrote in
> On Mon, Jan 27, 2020 at 01:44:13PM +0900, Kyotaro Horiguchi wrote:
> > > The purpose of this loop is to create relcache entries for rels locked in the
> > > current transaction. (The "r == NULL" case happens for rels no longer visible
> > > in catalogs. It is harmless.) Since RelationIdGetRelationCache() never
> > > creates a relcache entry, calling it defeats that purpose.
> > > RelationIdGetRelation() is the right function to call.
> >
> > I thought that the all required entry exist in the cache but actually
> > it's safer that recreate dropped caches. Does the following works?
> >
> > r = RelationIdGetRelation(relid);
> > + /* if not found, fetch a "dropped" entry if any */
> > + if (r == NULL)
> > + r = RelationIdGetRelationCache(relid);
> > if (r == NULL)
> > continue;
>
> That does not materially change the function's behavior. Notice that the
> function does one thing with "r", which is to call RelationClose(r). The
> function calls RelationIdGetRelation() for its side effects, not for its
> return value.

..Right. The following loop accesses the relcache hash directly, so there is
no need to store the returned r in the rels array.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
Re: [HACKERS] WAL logging problem in 9.4.3?
On Mon, Jan 27, 2020 at 01:44:13PM +0900, Kyotaro Horiguchi wrote: > At Sun, 26 Jan 2020 20:22:01 -0800, Noah Misch wrote in > > Diffing the two latest versions of one patch: > > > --- v32-0002-Fix-the-defect-1.patch 2020-01-18 14:32:47.499129940 > > > -0800 > > > +++ v33-0002-Fix-the-defect-1.patch 2020-01-26 16:23:52.846391035 > > > -0800 > > > +@@ -2978,8 +3054,8 @@ AssertPendingSyncs_RelationCache(void) > > > + LOCKTAG_RELATION) > > > + continue; > > > + relid = > > > ObjectIdGetDatum(locallock->tag.lock.locktag_field2); > > > +-r = RelationIdGetRelation(relid); > > > +-if (r == NULL) > > > ++r = RelationIdGetRelationCache(relid); > > > > The purpose of this loop is to create relcache entries for rels locked in > > the > > current transaction. (The "r == NULL" case happens for rels no longer > > visible > > in catalogs. It is harmless.) Since RelationIdGetRelationCache() never > > creates a relcache entry, calling it defeats that purpose. > > RelationIdGetRelation() is the right function to call. > > I thought that the all required entry exist in the cache but actually > it's safer that recreate dropped caches. Does the following works? > > r = RelationIdGetRelation(relid); > + /* if not found, fetch a "dropped" entry if any */ > + if (r == NULL) > + r = RelationIdGetRelationCache(relid); > if (r == NULL) > continue; That does not materially change the function's behavior. Notice that the function does one thing with "r", which is to call RelationClose(r). The function calls RelationIdGetRelation() for its side effects, not for its return value.
Re: [HACKERS] WAL logging problem in 9.4.3?
Thanks!

At Sun, 26 Jan 2020 20:22:01 -0800, Noah Misch wrote in
> Diffing the two latest versions of one patch:
> > --- v32-0002-Fix-the-defect-1.patch 2020-01-18 14:32:47.499129940 -0800
> > +++ v33-0002-Fix-the-defect-1.patch 2020-01-26 16:23:52.846391035 -0800
> > +@@ -2978,8 +3054,8 @@ AssertPendingSyncs_RelationCache(void)
> > + LOCKTAG_RELATION)
> > + continue;
> > + relid = ObjectIdGetDatum(locallock->tag.lock.locktag_field2);
> > +- r = RelationIdGetRelation(relid);
> > +- if (r == NULL)
> > ++ r = RelationIdGetRelationCache(relid);
>
> The purpose of this loop is to create relcache entries for rels locked in the
> current transaction. (The "r == NULL" case happens for rels no longer visible
> in catalogs. It is harmless.) Since RelationIdGetRelationCache() never
> creates a relcache entry, calling it defeats that purpose.
> RelationIdGetRelation() is the right function to call.

I thought that all the required entries already existed in the cache, but it
is indeed safer to recreate dropped caches. Does the following work?

r = RelationIdGetRelation(relid);
+ /* if not found, fetch a "dropped" entry if any */
+ if (r == NULL)
+ r = RelationIdGetRelationCache(relid);
if (r == NULL)
continue;

> On Tue, Jan 21, 2020 at 06:45:57PM +0900, Kyotaro Horiguchi wrote:
> > Three other fixes not mentined above are made. One is the useless
> > rd_firstRelfilenodeSubid in the condition to dicide whether to
> > preserve or not a relcache entry
>
> It was not useless. Test case:
>
> create table t (c int);
> begin;
> alter table t alter c type bigint; -- sets rd_firstRelfilenodeSubid
> savepoint q; drop table t; rollback to q; -- forgets rd_firstRelfilenodeSubid
> commit; -- assertion failure, after
> s/RelationIdGetRelationCache/RelationIdGetRelation/ discussed above

Mmm? I somehow thought that that relcache entry could never be dropped, and I
believe I did consider that case, of course. But yes, you're right. I'll post
an updated version.

regards.
-- Kyotaro Horiguchi NTT Open Source Software Center
Re: [HACKERS] WAL logging problem in 9.4.3?
Diffing the two latest versions of one patch: > --- v32-0002-Fix-the-defect-1.patch 2020-01-18 14:32:47.499129940 -0800 > +++ v33-0002-Fix-the-defect-1.patch 2020-01-26 16:23:52.846391035 -0800 > +@@ -2978,8 +3054,8 @@ AssertPendingSyncs_RelationCache(void) > + LOCKTAG_RELATION) > + continue; > + relid = ObjectIdGetDatum(locallock->tag.lock.locktag_field2); > +-r = RelationIdGetRelation(relid); > +-if (r == NULL) > ++r = RelationIdGetRelationCache(relid); The purpose of this loop is to create relcache entries for rels locked in the current transaction. (The "r == NULL" case happens for rels no longer visible in catalogs. It is harmless.) Since RelationIdGetRelationCache() never creates a relcache entry, calling it defeats that purpose. RelationIdGetRelation() is the right function to call. On Tue, Jan 21, 2020 at 06:45:57PM +0900, Kyotaro Horiguchi wrote: > Three other fixes not mentined above are made. One is the useless > rd_firstRelfilenodeSubid in the condition to dicide whether to > preserve or not a relcache entry It was not useless. Test case: create table t (c int); begin; alter table t alter c type bigint; -- sets rd_firstRelfilenodeSubid savepoint q; drop table t; rollback to q; -- forgets rd_firstRelfilenodeSubid commit; -- assertion failure, after s/RelationIdGetRelationCache/RelationIdGetRelation/ discussed above
Re: [HACKERS] WAL logging problem in 9.4.3?
Thank you for the comment.

At Sat, 18 Jan 2020 19:51:39 -0800, Noah Misch wrote in
> On Tue, Jan 14, 2020 at 07:35:22PM +0900, Kyotaro Horiguchi wrote:
> > At Thu, 26 Dec 2019 12:46:39 +0900 (JST), Kyotaro Horiguchi wrote in
> > > At Wed, 25 Dec 2019 16:15:21 -0800, Noah Misch wrote in
> > > > === Defect 1: Forgets to skip WAL after SAVEPOINT; DROP TABLE; ROLLBACK TO ...
> >
> > This is complex than expected. The DROP TABLE unconditionally removed
> > relcache entry. To fix that, I tried to use rd_isinvalid but it failed
> > because there's a state that a relcache invalid but the corresponding
> > catalog entry is alive.
> >
> > In the attached patch 0002, I added a boolean in relcache that
> > indicates that the relation is already removed in catalog but not
> > committed.
>
> This design could work, but some if its properties aren't ideal. For example,
> RelationIdGetRelation() can return a !rd_isvalid relation when the relation
> has been dropped. What others designs did you consider, if any?

I thought that entries with rd_isdropped set to true could not be fetched by
other transactions, because the relid is not visible to them. The same
session could still do so by repeatedly reindexing or invalidating the same
relation, and I think that is safe because the content of the entry couldn't
have changed and the cached content is reusable. That being said, it does
make things unclear.

I came up with two alternatives. One is a variant of RelationIdGetRelation
for this purpose. The new function RelationIdGetRelationCache is currently
used (only) in ATExecAddIndex, so we could restrict it to return only dropped
relations. The other is a separate "stashed" relcache, but that seems to make
things too complex.
> On Thu, Jan 16, 2020 at 02:20:57PM +0900, Kyotaro Horiguchi wrote:
> > --- a/src/backend/utils/cache/relcache.c
> > +++ b/src/backend/utils/cache/relcache.c
> > @@ -3114,8 +3153,10 @@ AtEOXact_cleanup(Relation relation, bool isCommit)
> > */
> > if (relation->rd_createSubid != InvalidSubTransactionId)
> > {
> > - if (isCommit)
> > - relation->rd_createSubid = InvalidSubTransactionId;
> > + relation->rd_createSubid = InvalidSubTransactionId;
> > +
> > + if (isCommit && !relation->rd_isdropped)
> > + {} /* Nothing to do */
>
> What is the purpose of this particular change? This executes at the end of a
> top-level transaction. We've already done any necessary syncing, and we're
> clearing any flags that caused WAL skipping. I think it's no longer
> productive to treat dropped relations differently.

It executes the pending *relcache* drop that we should have done in
ATPostAlterTypeCleanup (or in RelationClearRelation) if the newness flags had
been false. The entry would miss its chance of being removed (and then bloat
the relcache) if we didn't do it there. I added a comment there explaining
that.

> > @@ -3232,6 +3272,19 @@ AtEOSubXact_cleanup(Relation relation, bool isCommit,
> > }
> > }
> >
> > + /*
> > +  * If this relation registered pending sync then dropped, subxact rollback
> > +  * cancels the uncommitted drop, and commit propagates it to the parent.
> > +  */
> > + if (relation->rd_isdropped)
> > + {
> > + Assert (!relation->rd_isvalid &&
> > + (relation->rd_createSubid != InvalidSubTransactionId ||
> > +  relation->rd_firstRelfilenodeSubid != InvalidSubTransactionId));
> > + if (!isCommit)
> > + relation->rd_isdropped = false;
>
> This does the wrong thing when there exists some subtransaction rollback that
> does not rollback the DROP:

Sorry for my blunder. I actually considered something like that along the
way. In the end, I concluded that the dropped flag ought to behave the same
way as rd_createSubid.
> \pset null 'NULL'
> begin;
> create extension pg_visibility;
> create table droppedtest (c int);
> select 'droppedtest'::regclass::oid as oid \gset
> savepoint q; drop table droppedtest; release q; -- rd_dropped==true
> select * from pg_visibility_map(:oid); -- processes !rd_isvalid rel (not ideal)
> savepoint q; select 1; rollback to q; -- rd_dropped==false (wrong)
> savepoint q; select 1; rollback to q;
> select pg_relation_size(:oid), pg_relation_filepath(:oid),
> has_table_privilege(:oid, 'SELECT'); -- all nulls, okay
> select * from pg_visibility_map(:oid); -- assertion failure
> rollback;

And I taught RelationIdGetRelation not to return dropped relations, so the
(not ideal) cases just fail as before.

Three other fixes not mentioned above are made. One is the useless
rd_firstRelfilenodeSubid in the condition that decides whether or not to
preserve a relcache entry, plus the forgotten copying of the other newness
flags. Another is the forgotten SWAPFIELD on rd_droppedSubid. The last is the
forgotten change in out/equal/copyfuncs.

Please find the attached.

regards.

--
Kyotaro
Re: [HACKERS] WAL logging problem in 9.4.3?
On Tue, Jan 14, 2020 at 07:35:22PM +0900, Kyotaro Horiguchi wrote:
> At Thu, 26 Dec 2019 12:46:39 +0900 (JST), Kyotaro Horiguchi wrote in
> > At Wed, 25 Dec 2019 16:15:21 -0800, Noah Misch wrote in
> > > === Defect 1: Forgets to skip WAL after SAVEPOINT; DROP TABLE; ROLLBACK TO
> > >
> > > A test in transactions.sql now fails in AssertPendingSyncs_RelationCache(),
> > > when running "make check" under wal_level=minimal. I test this way:
> > >
> > > printf '%s\n%s\n' 'wal_level = minimal' 'max_wal_senders = 0' >$PWD/minimal.conf
> > > make check TEMP_CONFIG=$PWD/minimal.conf
> > >
> > > Self-contained demonstration:
> > > begin;
> > > create table t (c int);
> > > savepoint q; drop table t; rollback to q; -- forgets table is skipping wal
> > > commit; -- assertion failure
>
> This is complex than expected. The DROP TABLE unconditionally removed
> relcache entry. To fix that, I tried to use rd_isinvalid but it failed
> because there's a state that a relcache invalid but the corresponding
> catalog entry is alive.
>
> In the attached patch 0002, I added a boolean in relcache that
> indicates that the relation is already removed in catalog but not
> committed.

This design could work, but some of its properties aren't ideal. For example,
RelationIdGetRelation() can return a !rd_isvalid relation when the relation
has been dropped. What other designs did you consider, if any?

On Thu, Jan 16, 2020 at 02:20:57PM +0900, Kyotaro Horiguchi wrote:
> --- a/src/backend/utils/cache/relcache.c
> +++ b/src/backend/utils/cache/relcache.c
> @@ -3114,8 +3153,10 @@ AtEOXact_cleanup(Relation relation, bool isCommit)
> */
> if (relation->rd_createSubid != InvalidSubTransactionId)
> {
> - if (isCommit)
> - relation->rd_createSubid = InvalidSubTransactionId;
> + relation->rd_createSubid = InvalidSubTransactionId;
> +
> + if (isCommit && !relation->rd_isdropped)
> + {} /* Nothing to do */

What is the purpose of this particular change?
This executes at the end of a top-level transaction. We've already done any necessary syncing, and we're clearing any flags that caused WAL skipping. I think it's no longer productive to treat dropped relations differently. > @@ -3232,6 +3272,19 @@ AtEOSubXact_cleanup(Relation relation, bool isCommit, > } > } > > + /* > + * If this relation registered pending sync then dropped, subxact > rollback > + * cancels the uncommitted drop, and commit propagates it to the parent. > + */ > + if (relation->rd_isdropped) > + { > + Assert (!relation->rd_isvalid && > + (relation->rd_createSubid != > InvalidSubTransactionId || > + relation->rd_firstRelfilenodeSubid != > InvalidSubTransactionId)); > + if (!isCommit) > + relation->rd_isdropped = false; This does the wrong thing when there exists some subtransaction rollback that does not rollback the DROP: \pset null 'NULL' begin; create extension pg_visibility; create table droppedtest (c int); select 'droppedtest'::regclass::oid as oid \gset savepoint q; drop table droppedtest; release q; -- rd_dropped==true select * from pg_visibility_map(:oid); -- processes !rd_isvalid rel (not ideal) savepoint q; select 1; rollback to q; -- rd_dropped==false (wrong) savepoint q; select 1; rollback to q; select pg_relation_size(:oid), pg_relation_filepath(:oid), has_table_privilege(:oid, 'SELECT'); -- all nulls, okay select * from pg_visibility_map(:oid); -- assertion failure rollback;
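The failure mode demonstrated above is that a bare boolean cannot record *which* subtransaction performed the drop. A self-contained sketch of the subid-based design the later patch versions adopt (rd_droppedSubid, mirroring rd_createSubid semantics; the names and simplified end-of-subxact logic here are illustrative, not the patch's actual code):

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int SubTransactionIdModel;
#define INVALID_SUBXID 0

/*
 * Model of tracking an uncommitted DROP with the id of the
 * subtransaction that performed it, instead of a bare boolean.  An
 * unrelated subtransaction rollback then cannot forget the pending drop:
 * only ending the owning subtransaction changes the state.
 */
typedef struct RelModel
{
	SubTransactionIdModel droppedSubid;	/* subxact that dropped the rel */
} RelModel;

static void
model_at_eosubxact(RelModel *rel, bool isCommit,
				   SubTransactionIdModel mySubid,
				   SubTransactionIdModel parentSubid)
{
	if (rel->droppedSubid == mySubid)
		rel->droppedSubid = isCommit ? parentSubid : INVALID_SUBXID;
}
```

In the test sequence above, the drop happens in one savepoint and is released into the top transaction; the later unrelated `savepoint q; ... rollback to q` runs as a different subtransaction, so the subid comparison leaves the pending drop intact, whereas the boolean version was cleared by any subxact abort.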
Re: [HACKERS] WAL logging problem in 9.4.3?
All the known defects are fixed.

At Wed, 25 Dec 2019 16:15:21 -0800, Noah Misch wrote in
> === Defect 3: storage.c checks size decrease of MAIN_FORKNUM only
>
> storage.c tracks only MAIN_FORKNUM in pendingsync->max_truncated. Is it
> possible for MAIN_FORKNUM to have a net size increase while FSM_FORKNUM has a
> net size decrease? I haven't tested, but this sequence seems possible:
>
> TRUNCATE
> reduces MAIN_FORKNUM from 100 blocks to 0 blocks
> reduces FSM_FORKNUM from 3 blocks to 0 blocks
> COPY
> raises MAIN_FORKNUM from 0 blocks to 110 blocks
> does not change FSM_FORKNUM
> COMMIT
> should fsync, but wrongly chooses log_newpage_range() approach
>
> If that's indeed a problem, beside the obvious option of tracking every fork's
> max_truncated, we could convert max_truncated to a bool and use fsync anytime
> the relation experienced an mdtruncate(). (While FSM_FORKNUM is not critical
> for database operations, the choice to subject it to checksums entails
> protecting it here.) If that's not a problem, would you explain?

That causes a page-load failure, since the FSM can offer a nonexistent heap
block, and that failure leads to an ERROR for an SQL statement. It's not
critical, but it is surely a problem. I'd like to take the bool option,
because the insert-after-truncate sequence rarely happens. That case is not
the main target of this optimization, so it is enough for us to make sure
that the operation doesn't lead to such errors.

Attached is the nm30 patch followed by the three fix patches for the three
defects. The new member "RelationData.isremoved" has been renamed to
"isdropped" in this version.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

>From 3a811c76874ae3c596e138369766ad00888c572c Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi
Date: Thu, 21 Nov 2019 15:28:06 +0900
Subject: [PATCH v32 1/4] Rework WAL-skipping optimization

While wal_level=minimal we omit WAL-logging for certain operations on
relfilenodes that are created in the current transaction.
The files are fsynced at commit. The machinery accelerates bulk-insertion operations but it fails in certain sequence of operations and a crash just after commit may leave broken table files. This patch overhauls the machinery so that WAL-loggings on all operations are omitted for such relfilenodes. This patch also introduces a new feature that small files are emitted as a WAL record instead of syncing. The new GUC variable wal_skip_threshold controls the threshold. --- doc/src/sgml/config.sgml| 43 ++- doc/src/sgml/perform.sgml | 47 +-- src/backend/access/common/toast_internals.c | 4 +- src/backend/access/gist/gistutil.c | 31 +- src/backend/access/gist/gistxlog.c | 21 ++ src/backend/access/heap/heapam.c| 45 +-- src/backend/access/heap/heapam_handler.c| 22 +- src/backend/access/heap/rewriteheap.c | 21 +- src/backend/access/nbtree/nbtsort.c | 41 +-- src/backend/access/rmgrdesc/gistdesc.c | 5 + src/backend/access/transam/README | 45 ++- src/backend/access/transam/xact.c | 15 + src/backend/access/transam/xloginsert.c | 10 +- src/backend/access/transam/xlogutils.c | 18 +- src/backend/catalog/heap.c | 4 + src/backend/catalog/storage.c | 248 - src/backend/commands/cluster.c | 12 +- src/backend/commands/copy.c | 58 +-- src/backend/commands/createas.c | 11 +- src/backend/commands/matview.c | 12 +- src/backend/commands/tablecmds.c| 11 +- src/backend/storage/buffer/bufmgr.c | 125 ++- src/backend/storage/lmgr/lock.c | 12 + src/backend/storage/smgr/md.c | 36 +- src/backend/storage/smgr/smgr.c | 35 ++ src/backend/utils/cache/relcache.c | 159 +++-- src/backend/utils/misc/guc.c| 13 + src/include/access/gist_private.h | 2 + src/include/access/gistxlog.h | 1 + src/include/access/heapam.h | 3 - src/include/access/rewriteheap.h| 2 +- src/include/access/tableam.h| 15 +- src/include/catalog/storage.h | 6 + src/include/storage/bufmgr.h| 4 + src/include/storage/lock.h | 3 + src/include/storage/smgr.h | 1 + src/include/utils/rel.h | 57 ++- src/include/utils/relcache.h| 8 +- 
src/test/recovery/t/018_wal_optimize.pl | 374 src/test/regress/expected/alter_table.out | 6 + src/test/regress/sql/alter_table.sql| 7 + 41 files changed, 1242 insertions(+), 351 deletions(-) create mode 100644 src/test/recovery/t/018_wal_optimize.pl diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 5d45b6f7cb..0e7a0bc0ee 100644 --- a/doc/src/sgml/config.sgml
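The TRUNCATE-then-COPY sequence from the defect-3 discussion above can be checked against the chosen fix in a few lines. A toy model (fork bookkeeping and the decision rule are heavily simplified from storage.c/md.c; names are illustrative):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model of defect 3: pendingsync->max_truncated followed only the
 * main fork, so a TRUNCATE-then-COPY sequence can leave the main fork
 * with a net size increase while the FSM fork shrank.  Collapsing the
 * tracking to a single "was truncated" flag, as chosen for the fix,
 * routes commit to the fsync path whenever any fork was truncated.
 */
typedef struct RelForksModel
{
	unsigned	main_blocks;
	unsigned	fsm_blocks;
	bool		is_truncated;	/* any mdtruncate() in this transaction */
} RelForksModel;

static void
model_truncate(RelForksModel *rel)
{
	rel->main_blocks = 0;
	rel->fsm_blocks = 0;
	rel->is_truncated = true;
}
```

Main-fork-only tracking would see a net growth here and wrongly pick log_newpage_range(); the single flag does not lose the FSM shrink.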
Re: [HACKERS] WAL logging problem in 9.4.3?
Hello. I added a fix for defect 2.

At Wed, 25 Dec 2019 16:15:21 -0800, Noah Misch wrote in
> === Defect 2: Forgets to skip WAL due to oversimplification in heap_create()
>
> In ALTER TABLE cases where TryReuseIndex() avoided an index rebuild, we need
> to transfer WAL-skipped state to the new index relation. Before v24nm, the
> new index relation skipped WAL unconditionally. Since v24nm, the new index
> relation never skips WAL. I've added a test to alter_table.sql that reveals
> this problem under wal_level=minimal.

The fix for this defect uses the mechanism that preserves the relcache entry
for a dropped relation. If ATExecAddIndex can obtain such a relcache entry
for the old relation, the entry should still hold the newness flags, and we
can copy them to the new relcache entry. I added a member named oldRelId to
struct IndexStmt to let the function access the relcache entry for the old
index relation.

I forgot to assign version 31 to the last patch, so I reused the number for
this version.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

>From 591872bdd7b18566fe2529d20e4073900dec32fd Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi
Date: Thu, 21 Nov 2019 15:28:06 +0900
Subject: [PATCH v31 1/3] Rework WAL-skipping optimization

While wal_level=minimal we omit WAL-logging for certain operations on
relfilenodes that are created in the current transaction. The files are
fsynced at commit. The machinery accelerates bulk-insertion operations, but
it fails for certain sequences of operations, and a crash just after commit
may leave broken table files.

This patch overhauls the machinery so that WAL logging is omitted for all
operations on such relfilenodes. It also introduces a new feature whereby
small files are emitted as WAL records instead of being synced. The new GUC
variable wal_skip_threshold controls the threshold.
--- doc/src/sgml/config.sgml| 43 ++-- doc/src/sgml/perform.sgml | 47 +--- src/backend/access/common/toast_internals.c | 4 +- src/backend/access/gist/gistutil.c | 31 ++- src/backend/access/gist/gistxlog.c | 21 ++ src/backend/access/heap/heapam.c| 45 +--- src/backend/access/heap/heapam_handler.c| 22 +- src/backend/access/heap/rewriteheap.c | 21 +- src/backend/access/nbtree/nbtsort.c | 41 +--- src/backend/access/rmgrdesc/gistdesc.c | 5 + src/backend/access/transam/README | 45 +++- src/backend/access/transam/xact.c | 15 ++ src/backend/access/transam/xloginsert.c | 10 +- src/backend/access/transam/xlogutils.c | 18 +- src/backend/catalog/heap.c | 4 + src/backend/catalog/storage.c | 248 ++-- src/backend/commands/cluster.c | 12 +- src/backend/commands/copy.c | 58 + src/backend/commands/createas.c | 11 +- src/backend/commands/matview.c | 12 +- src/backend/commands/tablecmds.c| 11 +- src/backend/storage/buffer/bufmgr.c | 125 +- src/backend/storage/lmgr/lock.c | 12 + src/backend/storage/smgr/md.c | 36 ++- src/backend/storage/smgr/smgr.c | 35 +++ src/backend/utils/cache/relcache.c | 159 ++--- src/backend/utils/misc/guc.c| 13 + src/include/access/gist_private.h | 2 + src/include/access/gistxlog.h | 1 + src/include/access/heapam.h | 3 - src/include/access/rewriteheap.h| 2 +- src/include/access/tableam.h| 15 +- src/include/catalog/storage.h | 6 + src/include/storage/bufmgr.h| 4 + src/include/storage/lock.h | 3 + src/include/storage/smgr.h | 1 + src/include/utils/rel.h | 57 +++-- src/include/utils/relcache.h| 8 +- src/test/regress/expected/alter_table.out | 6 + src/test/regress/sql/alter_table.sql| 7 + 40 files changed, 868 insertions(+), 351 deletions(-) diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 5d45b6f7cb..0e7a0bc0ee 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -2481,21 +2481,14 @@ include_dir 'conf.d' levels. This parameter can only be set at server start. 
-In minimal level, WAL-logging of some bulk -operations can be safely skipped, which can make those -operations much faster (see ). -Operations in which this optimization can be applied include: - - CREATE TABLE AS - CREATE INDEX - CLUSTER - COPY into tables that were created or truncated in the same - transaction - -But minimal WAL does not contain enough information to reconstruct the -data from a base backup and the WAL logs, so replica or -higher
Re: [HACKERS] WAL logging problem in 9.4.3?
Hello, this is a fix for defect 1 of 3.

At Thu, 26 Dec 2019 12:46:39 +0900 (JST), Kyotaro Horiguchi wrote in
> Thank you for the findings.
>
> At Wed, 25 Dec 2019 16:15:21 -0800, Noah Misch wrote in
> > By improving AssertPendingSyncs_RelationCache() and by testing with
> > -DRELCACHE_FORCE_RELEASE, I now know of three defects in the attached v30nm.
> > Would you fix these?
>
> I'd like to do that, please give me som time.
>
> > === Defect 1: Forgets to skip WAL after SAVEPOINT; DROP TABLE; ROLLBACK TO
> >
> > A test in transactions.sql now fails in AssertPendingSyncs_RelationCache(),
> > when running "make check" under wal_level=minimal. I test this way:
> >
> > printf '%s\n%s\n' 'wal_level = minimal' 'max_wal_senders = 0' >$PWD/minimal.conf
> > make check TEMP_CONFIG=$PWD/minimal.conf
> >
> > Self-contained demonstration:
> > begin;
> > create table t (c int);
> > savepoint q; drop table t; rollback to q; -- forgets table is skipping wal
> > commit; -- assertion failure

This is more complex than expected. DROP TABLE unconditionally removed the
relcache entry. To fix that, I tried to use rd_isinvalid, but that failed
because there is a state in which the relcache entry is invalid but the
corresponding catalog entry is still alive.

In the attached patch 0002, I added a boolean to the relcache that indicates
that the relation has already been removed from the catalog but the removal
is not yet committed. I needed to ignore invalid relcache entries in
AssertPendingSyncs_RelationCache, but I think that is the right thing to do.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

>From 75ce09ae56227f9b87b1e9fcae1cad016857344c Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi
Date: Thu, 21 Nov 2019 15:28:06 +0900
Subject: [PATCH 1/2] Rework WAL-skipping optimization

While wal_level=minimal we omit WAL-logging for certain operations on
relfilenodes that are created in the current transaction. The files are
fsynced at commit.
The machinery accelerates bulk-insertion operations, but it fails for
certain sequences of operations, and a crash just after commit may leave
broken table files. This patch overhauls the machinery so that
WAL-logging of all operations is omitted for such relfilenodes.

This patch also introduces a new feature: small files are emitted as WAL
records instead of being synced. The new GUC variable wal_skip_threshold
controls the threshold.
---
 doc/src/sgml/config.sgml                    |  43 ++--
 doc/src/sgml/perform.sgml                   |  47 +---
 src/backend/access/common/toast_internals.c |   4 +-
 src/backend/access/gist/gistutil.c          |  31 ++-
 src/backend/access/gist/gistxlog.c          |  21 ++
 src/backend/access/heap/heapam.c            |  45 +---
 src/backend/access/heap/heapam_handler.c    |  22 +-
 src/backend/access/heap/rewriteheap.c       |  21 +-
 src/backend/access/nbtree/nbtsort.c         |  41 +---
 src/backend/access/rmgrdesc/gistdesc.c      |   5 +
 src/backend/access/transam/README           |  45 +++-
 src/backend/access/transam/xact.c           |  15 ++
 src/backend/access/transam/xloginsert.c     |  10 +-
 src/backend/access/transam/xlogutils.c      |  18 +-
 src/backend/catalog/heap.c                  |   4 +
 src/backend/catalog/storage.c               | 248 ++--
 src/backend/commands/cluster.c              |  12 +-
 src/backend/commands/copy.c                 |  58 +
 src/backend/commands/createas.c             |  11 +-
 src/backend/commands/matview.c              |  12 +-
 src/backend/commands/tablecmds.c            |  11 +-
 src/backend/storage/buffer/bufmgr.c         | 125 +-
 src/backend/storage/lmgr/lock.c             |  12 +
 src/backend/storage/smgr/md.c               |  36 ++-
 src/backend/storage/smgr/smgr.c             |  35 +++
 src/backend/utils/cache/relcache.c          | 159 ++---
 src/backend/utils/misc/guc.c                |  13 +
 src/include/access/gist_private.h           |   2 +
 src/include/access/gistxlog.h               |   1 +
 src/include/access/heapam.h                 |   3 -
 src/include/access/rewriteheap.h            |   2 +-
 src/include/access/tableam.h                |  15 +-
 src/include/catalog/storage.h               |   6 +
 src/include/storage/bufmgr.h                |   4 +
 src/include/storage/lock.h                  |   3 +
 src/include/storage/smgr.h                  |   1 +
 src/include/utils/rel.h                     |  57 +++--
 src/include/utils/relcache.h                |   8 +-
 src/test/regress/expected/alter_table.out   |   6 +
 src/test/regress/sql/alter_table.sql        |   7 +
 40 files changed, 868 insertions(+), 351 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 5d1c90282f..12d07b11e4 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2481,21 +2481,14 @@ include_dir 'conf.d'
 levels.  This parameter can only be set at server start.
-In minimal level, WAL-logging of some bulk
-
Re: [HACKERS] WAL logging problem in 9.4.3?
Hello, Noah. At Wed, 25 Dec 2019 20:22:04 -0800, Noah Misch wrote in > On Thu, Dec 26, 2019 at 12:46:39PM +0900, Kyotaro Horiguchi wrote: > > At Wed, 25 Dec 2019 16:15:21 -0800, Noah Misch wrote in > > > Skip AssertPendingSyncs_RelationCache() at abort, like v24nm did. > > > Making > > > that work no matter what does ereport(ERROR) would be tricky and > > > low-value. > > > > Right about ereport, but I'm not sure remove the whole assertion from abort. > > You may think of a useful assert location that lacks the problems of asserting > at abort. For example, I considered asserting in PortalRunMulti() and > PortalRun(), just after each command, if still in a transaction. Thanks for the suggestion. I'll consider that > > > - Reverted most post-v24nm changes to swap_relation_files(). Under > > > "-DRELCACHE_FORCE_RELEASE", relcache.c quickly discards the > > > rel1->rd_node.relNode update. Clearing rel2->rd_createSubid is not > > > right if > > > we're running CLUSTER for the second time in one transaction. I used > > > > I don't agree to that. As I think I have mentioned upthread, rel2 is > > wrongly marked as "new in this tranction" at that time, which hinders > > the opportunity of removal and such entries wrongly persist for the > > backend life and causes problems. (That was found by abort-time > > AssertPendingSyncs_RelationCache()..) > > I can't reproduce rel2's relcache entry wrongly persisting for the life of a > backend. If that were happening, I would expect repeating a CLUSTER command N > times to increase hash_get_num_entries(RelationIdCache) by at least N. I > tried that, but hash_get_num_entries(RelationIdCache) did not increase. In a > non-assert build, how can I reproduce problems caused by incorrect > rd_createSubid on rel2? As wrote in the another mail. I don't see such a problem and agree to the removal. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: [HACKERS] WAL logging problem in 9.4.3?
At Thu, 26 Dec 2019 12:46:39 +0900 (JST), Kyotaro Horiguchi wrote in
> > - Reverted most post-v24nm changes to swap_relation_files().  Under
> >   "-DRELCACHE_FORCE_RELEASE", relcache.c quickly discards the
> >   rel1->rd_node.relNode update.  Clearing rel2->rd_createSubid is not
> >   right if we're running CLUSTER for the second time in one
> >   transaction.  I used
>
> I don't agree with that. As I think I have mentioned upthread, rel2 is
> wrongly marked as "new in this transaction" at that time, which hinders
> the opportunity for removal, and such entries wrongly persist for the
> life of the backend and cause problems. (That was found by the
> abort-time AssertPendingSyncs_RelationCache().)

I played with the new version for a while and I don't see such a problem.
I don't recall clearly what I saw at the time I thought I saw a problem,
but I have changed my mind and agree with the change. It's far more
reasonable and clearer, as long as it works correctly.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
Re: [HACKERS] WAL logging problem in 9.4.3?
On Thu, Dec 26, 2019 at 12:46:39PM +0900, Kyotaro Horiguchi wrote: > At Wed, 25 Dec 2019 16:15:21 -0800, Noah Misch wrote in > > Skip AssertPendingSyncs_RelationCache() at abort, like v24nm did. Making > > that work no matter what does ereport(ERROR) would be tricky and > > low-value. > > Right about ereport, but I'm not sure remove the whole assertion from abort. You may think of a useful assert location that lacks the problems of asserting at abort. For example, I considered asserting in PortalRunMulti() and PortalRun(), just after each command, if still in a transaction. > > - Reverted most post-v24nm changes to swap_relation_files(). Under > > "-DRELCACHE_FORCE_RELEASE", relcache.c quickly discards the > > rel1->rd_node.relNode update. Clearing rel2->rd_createSubid is not right > > if > > we're running CLUSTER for the second time in one transaction. I used > > I don't agree to that. As I think I have mentioned upthread, rel2 is > wrongly marked as "new in this tranction" at that time, which hinders > the opportunity of removal and such entries wrongly persist for the > backend life and causes problems. (That was found by abort-time > AssertPendingSyncs_RelationCache()..) I can't reproduce rel2's relcache entry wrongly persisting for the life of a backend. If that were happening, I would expect repeating a CLUSTER command N times to increase hash_get_num_entries(RelationIdCache) by at least N. I tried that, but hash_get_num_entries(RelationIdCache) did not increase. In a non-assert build, how can I reproduce problems caused by incorrect rd_createSubid on rel2?
Re: [HACKERS] WAL logging problem in 9.4.3?
Thank you for the findings. At Wed, 25 Dec 2019 16:15:21 -0800, Noah Misch wrote in > By improving AssertPendingSyncs_RelationCache() and by testing with > -DRELCACHE_FORCE_RELEASE, I now know of three defects in the attached v30nm. > Would you fix these? I'd like to do that, please give me som time. > === Defect 1: Forgets to skip WAL after SAVEPOINT; DROP TABLE; ROLLBACK TO > > A test in transactions.sql now fails in AssertPendingSyncs_RelationCache(), > when running "make check" under wal_level=minimal. I test this way: > > printf '%s\n%s\n' 'wal_level = minimal' 'max_wal_senders = 0' > >$PWD/minimal.conf > make check TEMP_CONFIG=$PWD/minimal.conf > > Self-contained demonstration: > begin; > create table t (c int); > savepoint q; drop table t; rollback to q; -- forgets table is skipping wal > commit; -- assertion failure > > > === Defect 2: Forgets to skip WAL due to oversimplification in heap_create() > > In ALTER TABLE cases where TryReuseIndex() avoided an index rebuild, we need > to transfer WAL-skipped state to the new index relation. Before v24nm, the > new index relation skipped WAL unconditionally. Since v24nm, the new index > relation never skips WAL. I've added a test to alter_table.sql that reveals > this problem under wal_level=minimal. > > > === Defect 3: storage.c checks size decrease of MAIN_FORKNUM only > > storage.c tracks only MAIN_FORKNUM in pendingsync->max_truncated. Is it > possible for MAIN_FORKNUM to have a net size increase while FSM_FORKNUM has a > net size decrease? 
I haven't tested, but this sequence seems possible: > > TRUNCATE > reduces MAIN_FORKNUM from 100 blocks to 0 blocks > reduces FSM_FORKNUM from 3 blocks to 0 blocks > COPY > raises MAIN_FORKNUM from 0 blocks to 110 blocks > does not change FSM_FORKNUM > COMMIT > should fsync, but wrongly chooses log_newpage_range() approach > > If that's indeed a problem, beside the obvious option of tracking every fork's > max_truncated, we could convert max_truncated to a bool and use fsync anytime > the relation experienced an mdtruncate(). (While FSM_FORKNUM is not critical > for database operations, the choice to subject it to checksums entails > protecting it here.) If that's not a problem, would you explain? > > === Non-defect notes > > Once you have a correct patch, would you run check-world with > -DCLOBBER_CACHE_ALWAYS? That may reveal additional defects. It may take a > day or more, but that's fine. Sure. > The new smgrimmedsync() calls are potentially fragile, because they sometimes > target a file of a dropped relation. However, the mdexists() test prevents > anything bad from happening. No change is necessary. Example: > > SET wal_skip_threshold = 0; > BEGIN; > SAVEPOINT q; > CREATE TABLE t (c) AS SELECT 1; > ROLLBACK TO q; -- truncates the relfilenode > CHECKPOINT; -- unlinks the relfilenode > COMMIT; -- calls mdexists() on the relfilenode > > > === Notable changes in v30nm > > - Changed "wal_skip_threshold * 1024" to an expression that can't overflow. > Without this, wal_skip_threshold=1TB behaved like wal_skip_threshold=0. Ahh.., I wrongly understood that MAX_KILOBYTES inhibits that setting. work_mem and maintenance_work_mem are casted to double or long before calculation. In this case it's enough that calculation unit becomes kilobytes instad of bytes. > - Changed AssertPendingSyncs_RelationCache() to open all relations on which > the transaction holds locks. 
This permits detection of cases where > RelationNeedsWAL() returns true but storage.c will sync the relation. > > Removed the assertions from RelationIdGetRelation(). Using > "-DRELCACHE_FORCE_RELEASE" made them fail for usage patterns that aren't > actually problematic, since invalidation updates rd_node while other code > updates rd_firstRelfilenodeSubid. This is not a significant loss, now that > AssertPendingSyncs_RelationCache() opens relations. (I considered making > the update of rd_firstRelfilenodeSubid more like rd_node, where we store it > somewhere until the next CommandCounterIncrement(), which would make it > actually affect RelationNeedsWAL(). That might have been better in general, > but it felt complex without clear benefits.) > > Skip AssertPendingSyncs_RelationCache() at abort, like v24nm did. Making > that work no matter what does ereport(ERROR) would be tricky and low-value. Right about ereport, but I'm not sure remove the whole assertion from abort. > - Extracted the RelationTruncate() changes into new function > RelationPreTruncate(), so table access methods that can't use > RelationTruncate() have another way to request that work. Sounds reasonable. Also the new behavior of max_truncated looks fine. > - Changed wal_skip_threshold default to 2MB. My second preference was for > 4MB. In your data, 2MB and 4MB had similar performance at optimal > wal_buffers, but 4MB performed worse at low
Re: [HACKERS] WAL logging problem in 9.4.3?
By improving AssertPendingSyncs_RelationCache() and by testing with -DRELCACHE_FORCE_RELEASE, I now know of three defects in the attached v30nm. Would you fix these? === Defect 1: Forgets to skip WAL after SAVEPOINT; DROP TABLE; ROLLBACK TO A test in transactions.sql now fails in AssertPendingSyncs_RelationCache(), when running "make check" under wal_level=minimal. I test this way: printf '%s\n%s\n' 'wal_level = minimal' 'max_wal_senders = 0' >$PWD/minimal.conf make check TEMP_CONFIG=$PWD/minimal.conf Self-contained demonstration: begin; create table t (c int); savepoint q; drop table t; rollback to q; -- forgets table is skipping wal commit; -- assertion failure === Defect 2: Forgets to skip WAL due to oversimplification in heap_create() In ALTER TABLE cases where TryReuseIndex() avoided an index rebuild, we need to transfer WAL-skipped state to the new index relation. Before v24nm, the new index relation skipped WAL unconditionally. Since v24nm, the new index relation never skips WAL. I've added a test to alter_table.sql that reveals this problem under wal_level=minimal. === Defect 3: storage.c checks size decrease of MAIN_FORKNUM only storage.c tracks only MAIN_FORKNUM in pendingsync->max_truncated. Is it possible for MAIN_FORKNUM to have a net size increase while FSM_FORKNUM has a net size decrease? I haven't tested, but this sequence seems possible: TRUNCATE reduces MAIN_FORKNUM from 100 blocks to 0 blocks reduces FSM_FORKNUM from 3 blocks to 0 blocks COPY raises MAIN_FORKNUM from 0 blocks to 110 blocks does not change FSM_FORKNUM COMMIT should fsync, but wrongly chooses log_newpage_range() approach If that's indeed a problem, beside the obvious option of tracking every fork's max_truncated, we could convert max_truncated to a bool and use fsync anytime the relation experienced an mdtruncate(). (While FSM_FORKNUM is not critical for database operations, the choice to subject it to checksums entails protecting it here.) 
If that's not a problem, would you explain? === Non-defect notes Once you have a correct patch, would you run check-world with -DCLOBBER_CACHE_ALWAYS? That may reveal additional defects. It may take a day or more, but that's fine. The new smgrimmedsync() calls are potentially fragile, because they sometimes target a file of a dropped relation. However, the mdexists() test prevents anything bad from happening. No change is necessary. Example: SET wal_skip_threshold = 0; BEGIN; SAVEPOINT q; CREATE TABLE t (c) AS SELECT 1; ROLLBACK TO q; -- truncates the relfilenode CHECKPOINT; -- unlinks the relfilenode COMMIT; -- calls mdexists() on the relfilenode === Notable changes in v30nm - Changed "wal_skip_threshold * 1024" to an expression that can't overflow. Without this, wal_skip_threshold=1TB behaved like wal_skip_threshold=0. - Changed AssertPendingSyncs_RelationCache() to open all relations on which the transaction holds locks. This permits detection of cases where RelationNeedsWAL() returns true but storage.c will sync the relation. Removed the assertions from RelationIdGetRelation(). Using "-DRELCACHE_FORCE_RELEASE" made them fail for usage patterns that aren't actually problematic, since invalidation updates rd_node while other code updates rd_firstRelfilenodeSubid. This is not a significant loss, now that AssertPendingSyncs_RelationCache() opens relations. (I considered making the update of rd_firstRelfilenodeSubid more like rd_node, where we store it somewhere until the next CommandCounterIncrement(), which would make it actually affect RelationNeedsWAL(). That might have been better in general, but it felt complex without clear benefits.) Skip AssertPendingSyncs_RelationCache() at abort, like v24nm did. Making that work no matter what does ereport(ERROR) would be tricky and low-value. - Extracted the RelationTruncate() changes into new function RelationPreTruncate(), so table access methods that can't use RelationTruncate() have another way to request that work. 
- Changed wal_skip_threshold default to 2MB. My second preference was for 4MB. In your data, 2MB and 4MB had similar performance at optimal wal_buffers, but 4MB performed worse at low wal_buffers. - Reverted most post-v24nm changes to swap_relation_files(). Under "-DRELCACHE_FORCE_RELEASE", relcache.c quickly discards the rel1->rd_node.relNode update. Clearing rel2->rd_createSubid is not right if we're running CLUSTER for the second time in one transaction. I used relation_open(r1, NoLock) instead of AccessShareLock, because we deserve an assertion failure if we hold no lock at that point. - Change toast_get_valid_index() to retain locks until end of transaction. When I adopted relation_open(r1, NoLock) in swap_relation_files(), that revealed that we retain no lock on the TOAST index. - Ran pgindent and perltidy. Updated some comments and names. On Mon, Dec 09, 2019 at
Re: [HACKERS] WAL logging problem in 9.4.3?
At Tue, 24 Dec 2019 16:35:35 +0900 (JST), Kyotaro Horiguchi wrote in
> I rebased the patch and changed the default value for the GUC variable
> wal_skip_threshold to 4096 kilobytes in config.sgml, storage.c and
> guc.c. 4096kB was chosen as the nice round binary number closest to
> 500 pages * 8kB = 4000kB.

The value in the doc was not correct. I fixed only that value, from 3192
to 4096kB.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

>From 2f184c140ab442ee29103be830b3389b71e8e609 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi
Date: Thu, 21 Nov 2019 15:28:06 +0900
Subject: [PATCH v29] Rework WAL-skipping optimization

Under wal_level=minimal we omit WAL-logging for certain operations on
relfilenodes that are created in the current transaction. The files are
fsynced at commit.

The machinery accelerates bulk-insertion operations, but it fails for
certain sequences of operations, and a crash just after commit may leave
broken table files. This patch overhauls the machinery so that
WAL-logging of all operations is omitted for such relfilenodes.

This patch also introduces a new feature: small files are emitted as WAL
records instead of being synced. The new GUC variable wal_skip_threshold
controls the threshold.
--- doc/src/sgml/config.sgml | 43 ++-- doc/src/sgml/perform.sgml| 47 + src/backend/access/gist/gistutil.c | 31 ++- src/backend/access/gist/gistxlog.c | 21 ++ src/backend/access/heap/heapam.c | 45 +--- src/backend/access/heap/heapam_handler.c | 22 +- src/backend/access/heap/rewriteheap.c| 21 +- src/backend/access/nbtree/nbtsort.c | 41 +--- src/backend/access/rmgrdesc/gistdesc.c | 5 + src/backend/access/transam/README| 47 - src/backend/access/transam/xact.c| 15 ++ src/backend/access/transam/xloginsert.c | 10 +- src/backend/access/transam/xlogutils.c | 17 +- src/backend/catalog/heap.c | 4 + src/backend/catalog/storage.c| 257 +-- src/backend/commands/cluster.c | 31 +++ src/backend/commands/copy.c | 58 + src/backend/commands/createas.c | 11 +- src/backend/commands/matview.c | 12 +- src/backend/commands/tablecmds.c | 11 +- src/backend/storage/buffer/bufmgr.c | 123 ++- src/backend/storage/smgr/md.c| 35 ++- src/backend/storage/smgr/smgr.c | 37 src/backend/utils/cache/relcache.c | 122 --- src/backend/utils/misc/guc.c | 13 ++ src/bin/psql/input.c | 1 + src/include/access/gist_private.h| 2 + src/include/access/gistxlog.h| 1 + src/include/access/heapam.h | 3 - src/include/access/rewriteheap.h | 2 +- src/include/access/tableam.h | 18 +- src/include/catalog/storage.h| 5 + src/include/storage/bufmgr.h | 4 + src/include/storage/smgr.h | 1 + src/include/utils/rel.h | 57 +++-- src/include/utils/relcache.h | 8 +- src/test/regress/pg_regress.c| 2 + 37 files changed, 839 insertions(+), 344 deletions(-) diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 5d1c90282f..d893864c40 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -2481,21 +2481,14 @@ include_dir 'conf.d' levels. This parameter can only be set at server start. -In minimal level, WAL-logging of some bulk -operations can be safely skipped, which can make those -operations much faster (see ). 
-Operations in which this optimization can be applied include: - - CREATE TABLE AS - CREATE INDEX - CLUSTER - COPY into tables that were created or truncated in the same - transaction - -But minimal WAL does not contain enough information to reconstruct the -data from a base backup and the WAL logs, so replica or -higher must be used to enable WAL archiving -() and streaming replication. +In minimal level, no information is logged for +tables or indexes for the remainder of a transaction that creates or +truncates them. This can make bulk operations much faster (see +). But minimal WAL does not contain +enough information to reconstruct the data from a base backup and the +WAL logs, so replica or higher must be used to +enable WAL archiving () and +streaming replication. In logical level, the same information is logged as @@ -2887,6 +2880,26 @@ include_dir 'conf.d' + + wal_skip_threshold (integer) + + wal_skip_threshold configuration parameter + + + + +When wal_level is minimal and a +transaction commits after creating or
Re: [HACKERS] WAL logging problem in 9.4.3?
At Tue, 10 Dec 2019 16:59:25 +0900 (JST), Kyotaro Horiguchi wrote in
> shared_buffers=1GB/wal_buffers=16MB (default). pgbench -s 20 uses 256MB
> of storage, so all of it can be loaded into shared memory.
>
> The attached graph shows a larger benefit in TPS drop and latency
> increase for HDD. The DDL page count at the crossover point between
> commit-FPW and commit-sync moves from roughly 300 to 200 in TPS and
> latency, and from 1000 to 600 in DDL runtime. If we can rely on the two
> graphs, 500 (or 512) pages seems to be the most promising candidate for
> the default value of wal_skip_threshold.

I rebased the patch and changed the default value for the GUC variable
wal_skip_threshold to 4096 kilobytes in config.sgml, storage.c and
guc.c. 4096kB was chosen as the nice round binary number closest to
500 pages * 8kB = 4000kB.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

>From 64b4f63016c776cd5102b70f5562328dc5d371fa Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi
Date: Thu, 21 Nov 2019 15:28:06 +0900
Subject: [PATCH] Rework WAL-skipping optimization

Under wal_level=minimal we omit WAL-logging for certain operations on
relfilenodes that are created in the current transaction. The files are
fsynced at commit.

The machinery accelerates bulk-insertion operations, but it fails for
certain sequences of operations, and a crash just after commit may leave
broken table files. This patch overhauls the machinery so that
WAL-logging of all operations is omitted for such relfilenodes.

This patch also introduces a new feature: small files are emitted as WAL
records instead of being synced. The new GUC variable wal_skip_threshold
controls the threshold.
--- doc/src/sgml/config.sgml | 43 ++-- doc/src/sgml/perform.sgml| 47 + src/backend/access/gist/gistutil.c | 31 ++- src/backend/access/gist/gistxlog.c | 21 ++ src/backend/access/heap/heapam.c | 45 +--- src/backend/access/heap/heapam_handler.c | 22 +- src/backend/access/heap/rewriteheap.c| 21 +- src/backend/access/nbtree/nbtsort.c | 41 +--- src/backend/access/rmgrdesc/gistdesc.c | 5 + src/backend/access/transam/README| 47 - src/backend/access/transam/xact.c| 15 ++ src/backend/access/transam/xloginsert.c | 10 +- src/backend/access/transam/xlogutils.c | 17 +- src/backend/catalog/heap.c | 4 + src/backend/catalog/storage.c| 257 +-- src/backend/commands/cluster.c | 31 +++ src/backend/commands/copy.c | 58 + src/backend/commands/createas.c | 11 +- src/backend/commands/matview.c | 12 +- src/backend/commands/tablecmds.c | 11 +- src/backend/storage/buffer/bufmgr.c | 123 ++- src/backend/storage/smgr/md.c| 35 ++- src/backend/storage/smgr/smgr.c | 37 src/backend/utils/cache/relcache.c | 122 --- src/backend/utils/misc/guc.c | 13 ++ src/bin/psql/input.c | 1 + src/include/access/gist_private.h| 2 + src/include/access/gistxlog.h| 1 + src/include/access/heapam.h | 3 - src/include/access/rewriteheap.h | 2 +- src/include/access/tableam.h | 18 +- src/include/catalog/storage.h| 5 + src/include/storage/bufmgr.h | 4 + src/include/storage/smgr.h | 1 + src/include/utils/rel.h | 57 +++-- src/include/utils/relcache.h | 8 +- src/test/regress/pg_regress.c| 2 + 37 files changed, 839 insertions(+), 344 deletions(-) diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 5d1c90282f..d455a42c9d 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -2481,21 +2481,14 @@ include_dir 'conf.d' levels. This parameter can only be set at server start. -In minimal level, WAL-logging of some bulk -operations can be safely skipped, which can make those -operations much faster (see ). 
-Operations in which this optimization can be applied include: - - CREATE TABLE AS - CREATE INDEX - CLUSTER - COPY into tables that were created or truncated in the same - transaction - -But minimal WAL does not contain enough information to reconstruct the -data from a base backup and the WAL logs, so replica or -higher must be used to enable WAL archiving -() and streaming replication. +In minimal level, no information is logged for +tables or indexes for the remainder of a transaction that creates or +truncates them. This can make bulk operations much faster (see +). But minimal WAL does not contain +enough information to reconstruct the data from a base backup and the +WAL logs, so replica or higher must be
Re: [HACKERS] WAL logging problem in 9.4.3?
On Mon, Dec 9, 2019 at 4:04 AM Kyotaro Horiguchi wrote: > Yeah, only 0.5GB of shared_buffers makes the default value of > wal_buffers reach to the heaven. I think I can take numbers on that > condition. (I doubt that it's meaningful if I increase only > wal_buffers manually.) Heaven seems a bit exalted, but I think we really only have a formula because somebody might have really small shared_buffers for some reason and be unhappy about us gobbling up a comparatively large amount of memory for WAL buffers. The current limit means that normal installations get what they need without manual tuning, and small installations - where performance presumably sucks anyway for other reasons - keep a small memory footprint. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] WAL logging problem in 9.4.3?
Hello.

At Sun, 8 Dec 2019 10:09:51 -0800, Noah Misch wrote in
> I reviewed your latest code, and it's nearly complete.  mdimmedsync()
> syncs only "active segments" (as defined in md.c), but
> smgrDoPendingSyncs() must sync active and inactive segments.  This
> matters when mdtruncate() truncated the relation after the last
> checkpoint, causing active segments to become inactive.  In such cases,
> syncs of the inactive segments will have been queued for execution
> during the next checkpoint.  Since we skipped the XLOG_SMGR_TRUNCATE
> record, we must complete those syncs before commit.  Let's

Got it! You're so great. Thanks.

> just modify smgrimmedsync() to always sync active and inactive
> segments; that's fine to do in other smgrimmedsync() callers, even
> though they operate on relations that can't have inactive segments.

Agreed, and done that way. Even though it's not harmful to leave
inactive segments open, I chose to close them after the sync. As
mentioned in the comment added to the function, inactive segments may
not be closed if an error happens during file sync. md works properly
even in that case (as the file comment says) and, anyway, mdnblocks
leaves the first inactive segment open if there's no partial segment.

I don't understand why mdclose checks for (v->mdfd_vfd >= 0) on an open
segment, but mdimmedsync assumes that won't happen, and I follow that
assumption. (I suspect that the if condition in mdclose should be an
assertion..)

> On Tue, Dec 03, 2019 at 08:51:46PM +0900, Kyotaro Horiguchi wrote:
> > At Thu, 28 Nov 2019 17:23:19 -0500, Noah Misch wrote in
> > > On Thu, Nov 28, 2019 at 09:35:08PM +0900, Kyotaro Horiguchi wrote:
> > > > I measured the performance with the latest patch set.
> > > >
> > > > > 1. Determine $DDL_COUNT, a number of DDL transactions that take
> > > > >    about one minute when done via syncs.
> > > > > 2. Start "pgbench -rP1 --progress-timestamp -T180 -c10 -j10".
> > > > > 3. Wait 10s.
> > > > > 4.
Start one DDL backend that runs $DDL_COUNT transactions. > > > > > 5. Save DDL start timestamp, DDL end timestamp, and pgbench output. > > > wal_buffers| 512 > > This value (4 MiB) is lower than a tuned production system would have. In > future benchmarks (if any) use wal_buffers=2048 (16 MiB). Yeah, only 0.5GB of shared_buffers makes the default value of wal_buffers reach to the heaven. I think I can take numbers on that condition. (I doubt that it's meaningful if I increase only wal_buffers manually.) Anyway the default value ought to be defined based on the default configuration. In the attached patch, I merged all pieces in the previous version and the change made this time (only md.c is changed this time). regards. -- Kyotaro Horiguchi NTT Open Source Software Center >From a523eeddbd8840a02c81267a217c8f4aecc1fe5e Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi Date: Thu, 21 Nov 2019 15:28:06 +0900 Subject: [PATCH v27] Rework WAL-skipping optimization While wal_level=minimal we omit WAL-logging for certain some operations on relfilenodes that are created in the current transaction. The files are fsynced at commit. The machinery accelerates bulk-insertion operations but it fails in certain sequence of operations and a crash just after commit may leave broken table files. This patch overhauls the machinery so that WAL-loggings on all operations are omitted for such relfilenodes. This patch also introduces a new feature that small files are emitted as a WAL record instead of syncing. The new GUC variable wal_skip_threshold controls the threshold. 
--- doc/src/sgml/config.sgml | 43 ++-- doc/src/sgml/perform.sgml| 47 + src/backend/access/gist/gistutil.c | 31 ++- src/backend/access/gist/gistxlog.c | 21 ++ src/backend/access/heap/heapam.c | 45 +--- src/backend/access/heap/heapam_handler.c | 22 +- src/backend/access/heap/rewriteheap.c| 21 +- src/backend/access/nbtree/nbtsort.c | 41 +--- src/backend/access/rmgrdesc/gistdesc.c | 5 + src/backend/access/transam/README| 47 - src/backend/access/transam/xact.c| 15 ++ src/backend/access/transam/xloginsert.c | 10 +- src/backend/access/transam/xlogutils.c | 17 +- src/backend/catalog/heap.c | 4 + src/backend/catalog/storage.c| 257 +-- src/backend/commands/cluster.c | 31 +++ src/backend/commands/copy.c | 58 + src/backend/commands/createas.c | 11 +- src/backend/commands/matview.c | 12 +- src/backend/commands/tablecmds.c | 11 +- src/backend/storage/buffer/bufmgr.c | 123 ++- src/backend/storage/smgr/md.c| 35 ++- src/backend/storage/smgr/smgr.c | 37 src/backend/utils/cache/relcache.c | 122 --- src/backend/utils/misc/guc.c | 13 ++ src/bin/psql/input.c | 1 +
Re: [HACKERS] WAL logging problem in 9.4.3?
I reviewed your latest code, and it's nearly complete. mdimmedsync() syncs only "active segments" (as defined in md.c), but smgrDoPendingSyncs() must sync active and inactive segments. This matters when mdtruncate() truncated the relation after the last checkpoint, causing active segments to become inactive. In such cases, syncs of the inactive segments will have been queued for execution during the next checkpoint. Since we skipped the XLOG_SMGR_TRUNCATE record, we must complete those syncs before commit. Let's just modify smgrimmedsync() to always sync active and inactive segments; that's fine to do in other smgrimmedsync() callers, even though they operate on relations that can't have inactive segments. On Tue, Dec 03, 2019 at 08:51:46PM +0900, Kyotaro Horiguchi wrote: > At Thu, 28 Nov 2019 17:23:19 -0500, Noah Misch wrote in > > On Thu, Nov 28, 2019 at 09:35:08PM +0900, Kyotaro Horiguchi wrote: > > > I measured the performance with the latest patch set. > > > > > > > 1. Determine $DDL_COUNT, a number of DDL transactions that take about > > > > one > > > >minute when done via syncs. > > > > 2. Start "pgbench -rP1 --progress-timestamp -T180 -c10 -j10". > > > > 3. Wait 10s. > > > > 4. Start one DDL backend that runs $DDL_COUNT transactions. > > > > 5. Save DDL start timestamp, DDL end timestamp, and pgbench output. > wal_buffers| 512 This value (4 MiB) is lower than a tuned production system would have. In future benchmarks (if any) use wal_buffers=2048 (16 MiB).
Re: [HACKERS] WAL logging problem in 9.4.3?
Hello.

At Thu, 28 Nov 2019 17:23:19 -0500, Noah Misch wrote in
> On Thu, Nov 28, 2019 at 09:35:08PM +0900, Kyotaro Horiguchi wrote:
> > I measured the performance with the latest patch set.
> >
> > > 1. Determine $DDL_COUNT, a number of DDL transactions that take about one
> > >    minute when done via syncs.
> > > 2. Start "pgbench -rP1 --progress-timestamp -T180 -c10 -j10".
> > > 3. Wait 10s.
> > > 4. Start one DDL backend that runs $DDL_COUNT transactions.
> > > 5. Save DDL start timestamp, DDL end timestamp, and pgbench output.
>
> If you have the raw data requested in (5), please share them here so folks
> have the option to reproduce your graphs and calculations.

Sorry, I forgot to attach the scripts. The raw data was lost to an unstable
connection, and the steps were quite crude; I prioritized showing some
numbers at the time. I have revised the scripts to be more automated and
will take the numbers again.

> > 2. Start server with wal_level = replica (all other variables are not
> >    changed) then run the attached ./bench.sh
>
> The bench.sh attachment was missing; please attach it. Please give the
> output of this command:
>
>   select name, setting from pg_settings where setting <> boot_val;

(I intentionally show all the results..)
=# select name, setting from pg_settings where setting <> boot_val;
            name            |      setting
----------------------------+--------------------
 application_name           | psql
 archive_command            | (disabled)
 client_encoding            | UTF8
 data_directory_mode        | 0700
 default_text_search_config | pg_catalog.english
 lc_collate                 | en_US.UTF-8
 lc_ctype                   | en_US.UTF-8
 lc_messages                | en_US.UTF-8
 lc_monetary                | en_US.UTF-8
 lc_numeric                 | en_US.UTF-8
 lc_time                    | en_US.UTF-8
 log_checkpoints            | on
 log_file_mode              | 0600
 log_timezone               | Asia/Tokyo
 max_stack_depth            | 2048
 max_wal_senders            | 0
 max_wal_size               | 10240
 server_encoding            | UTF8
 shared_buffers             | 16384
 TimeZone                   | Asia/Tokyo
 unix_socket_permissions    | 0777
 wal_buffers                | 512
 wal_level                  | minimal
(23 rows)

The result for the "replica" setting in the benchmark script is used as the
base numbers (the denominator of the percentages).

> > 3. Restart server with wal_level = replica then run the bench.sh
> >    twice.
>
> I assume this is wal_level=minimal, not wal_level=replica.

Oops! That's wrong; I ran once with replica, then twice with minimal.

Anyway, I revised the benchmarking scripts and attached them. The parameters
written in benchmain.sh were chosen so that ./bench2.pl 5 s takes around 60
seconds on a wal_level=minimal server. I'll send the complete data tomorrow
(in JST). The attached f.txt is the result of a preliminary test only with
pages=100 and 250 (with HDD).

The attached files are:

 benchmain.sh - main script
 bench2.sh    - run a benchmark with a single set of parameters
 bench1.pl    - benchmark client program
 summarize.pl - script to summarize benchmain.sh's output
 f.txt.gz     - result only for pages=100, DDL count = 2200 (not 2250)

How to run:

 $ /..unpatched_path../initdb -D (wal_level=replica, max_wal_senders=0, log_checkpoints=yes, max_wal_size=10GB)
 $ /..patched_path../initdb -D (wal_level=minimal, max_wal_senders=0, log_checkpoints=yes, max_wal_size=10GB)
 $ ./benchmain.sh ># output raw data
 $ ./summarize.pl [-v] <# show summary

With the attached f.txt, summarize.pl gives the following output. WAL wins
at these page counts.
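For readers without the attachments, the kind of summarization summarize.pl performs — bucketing pgbench per-second progress lines into before/during/after the DDL window and averaging tps and latency per bucket — can be sketched in Python. The progress-line format below is an assumption based on typical `pgbench -rP1 --progress-timestamp` output, and the function name is made up:

```python
import re

# Assumed pgbench progress format: "progress: <epoch> s, <tps> tps, lat <ms> ms ..."
PROGRESS_RE = re.compile(r"progress: ([\d.]+) s, ([\d.]+) tps, lat ([\d.]+) ms")

def summarize(progress_lines, ddl_start, ddl_end):
    """Bucket pgbench -P1 --progress-timestamp lines into before/during/
    after the DDL window; return mean (tps, latency_ms) per bucket."""
    buckets = {"before": [], "during": [], "after": []}
    for line in progress_lines:
        m = PROGRESS_RE.search(line)
        if not m:
            continue
        ts, tps, lat = (float(g) for g in m.groups())
        if ts < ddl_start:
            key = "before"
        elif ts <= ddl_end:
            key = "during"
        else:
            key = "after"
        buckets[key].append((tps, lat))
    return {
        k: ((sum(t for t, _ in v) / len(v), sum(l for _, l in v) / len(v))
            if v else (0.0, 0.0))
        for k, v in buckets.items()
    }
```

The percentages in the report above would then be each bucket's mean divided by the "before" mean.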
$ cat f.txt | ./summarize.pl
## params: wal_level=replica mode=none pages=100 count=353 scale=20
 (% are relative to "before")
 before: tps 262.3 (100.0%), lat  39.840 ms (100.0%)  (29 samples)
 during: tps 120.7 ( 46.0%), lat 112.508 ms (282.4%)  (35 samples)
 after:  tps 106.3 ( 40.5%), lat 163.492 ms (410.4%)  (86 samples)
 DDL time: 34883 ms ( 100.0% relative to mode=none)
## params: wal_level=minimal mode=sync pages=100 count=353 scale=20
 (% are relative to "before")
 before: tps 226.3 (100.0%), lat  48.091 ms (100.0%)  (29 samples)
 during: tps  83.0 ( 36.7%), lat 184.942 ms (384.6%)  (100 samples)
 after:  tps  82.6 ( 36.5%), lat 196.863 ms (409.4%)  (21 samples)
 DDL time: 99239 ms ( 284.5% relative to mode=none)
## params: wal_level=minimal mode=WAL pages=100 count=353 scale=20
 (% are relative to "before")
 before: tps 240.3 (100.0%), lat  44.686 ms (100.0%)  (29 samples)
 during: tps 129.6 ( 53.9%), lat 113.585 ms (254.2%)  (31 samples)
 after:  tps 124.5 ( 51.8%), lat 141.992 ms (317.8%)  (90 samples)
 DDL time: 30392 ms (  87.1% relative to mode=none)
## params: wal_level=replica mode=none pages=250 count=258 scale=20
 (% are relative to "before")
 before: tps 266.3 (100.0%), lat  45.884 ms (100.0%)  (29 samples)
 during:
Re: [HACKERS] WAL logging problem in 9.4.3?
On Thu, Nov 28, 2019 at 09:35:08PM +0900, Kyotaro Horiguchi wrote: > I measured the performance with the latest patch set. > > > 1. Determine $DDL_COUNT, a number of DDL transactions that take about one > >minute when done via syncs. > > 2. Start "pgbench -rP1 --progress-timestamp -T180 -c10 -j10". > > 3. Wait 10s. > > 4. Start one DDL backend that runs $DDL_COUNT transactions. > > 5. Save DDL start timestamp, DDL end timestamp, and pgbench output. If you have the raw data requested in (5), please share them here so folks have the option to reproduce your graphs and calculations. > I did the following benchmarking. > > 1. Initialize bench database > > $ pgbench -i -s 20 > > 2. Start server with wal_level = replica (all other variables are not > changed) then run the attached ./bench.sh The bench.sh attachment was missing; please attach it. Please give the output of this command: select name, setting from pg_settings where setting <> boot_val; > 3. Restart server with wal_level = replica then run the bench.sh > twice. I assume this is wal_level=minimal, not wal_level=replica.
Re: [HACKERS] WAL logging problem in 9.4.3?
I measured the performance with the latest patch set.

> 1. Determine $DDL_COUNT, a number of DDL transactions that take about one
>    minute when done via syncs.
> 2. Start "pgbench -rP1 --progress-timestamp -T180 -c10 -j10".
> 3. Wait 10s.
> 4. Start one DDL backend that runs $DDL_COUNT transactions.
> 5. Save DDL start timestamp, DDL end timestamp, and pgbench output.

I did the following benchmarking.

1. Initialize bench database

   $ pgbench -i -s 20

2. Start server with wal_level = replica (all other variables are not
   changed) then run the attached ./bench.sh

   $ ./bench.sh

   where count is the number of repetitions, pages is the number of pages
   to write in a run, and mode is "s" (sync) or "w" (WAL). The mode doesn't
   matter if wal_level = replica. The script shows the following result.

   | before: tps 240.2, lat 44.087 ms (29 samples)
   | during: tps 109.1, lat 114.887 ms (14 samples)
   | after : tps 269.9, lat 39.557 ms (107 samples)
   | DDL time = 13965 ms
   | # transaction type:

   before:   mean numbers before "the DDL" starts.
   during:   mean numbers while "the DDL" is running.
   after :   mean numbers after "the DDL" ends.
   DDL time: the time it took to run "the DDL".

3. Restart server with wal_level = replica then run the bench.sh twice.

   $ ./bench.sh s
   $ ./bench.sh w

Finally I got three graphs (attached 1, 2, 3 PNGs).

* Graph 1 - The effect of the DDL on pgbench's TPS

  The vertical axis means "during TPS" / "before TPS" in %. Larger is
  better. The horizontal axis means the table size in pages.

  Replica and Minimal-sync are almost flat. Minimal-WAL gets worse as
  table size increases. 500 pages seems to be the crosspoint.

* Graph 2 - The effect of the DDL on pgbench's latency

  The vertical axis means "during-latency" / "before-latency" in %.
  Smaller is better.

  Like TPS, but WAL latency gets worse more quickly as table size
  increases. The crosspoint seems to be 300 pages or so.

* Graph 3 - The effect of pgbench's workload on DDL runtime
  The vertical axis means "time the DDL takes to run with pgbench" /
  "time the DDL takes to run solely". Smaller is better.

  Replica and Minimal-SYNC show a similar tendency. On Minimal-WAL the
  DDL runs quite fast with small tables. The crosspoint seems to be about
  2500 pages.

Seeing this, I have become worried that the optimization might give a far
smaller advantage than expected. Putting that aside, it seems to me that
the default value for the threshold would be 500-1000, the same as the
previous benchmark showed.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
Re: [HACKERS] WAL logging problem in 9.4.3?
At Tue, 26 Nov 2019 21:37:52 +0900 (JST), Kyotaro Horiguchi wrote in
> It is not fully checked. I haven't merged and measured performance yet,
> but I post the status-quo patch for now.

It was actually an inconsistency caused by swap_relation_files.

1. rd_createSubid of the relcache entry for r2 is not turned off. This
   prevents the relcache entry from being flushed. Commit processes
   pendingSyncs and leaves the relcache entry with rd_createSubid !=
   Invalid, which is an inconsistency.

2. relation_open(r1) returns a relcache entry whose relfilenode has the
   old value (relfilenode1), since the command counter has not been
   incremented. On the other hand, if it is incremented just before,
   AssertPendingSyncConsistency() aborts because of the inconsistency
   between relfilenode and rd_firstRel*.

As a result, I returned to thinking that we need to modify both relcache
entries with the right relfilenode.

I once thought that taking AEL in the function had no side effects, but
the code path is also executed when wal_level = replica or higher. And, as
I mentioned upthread, we can even get there without taking any lock on r1,
or sometimes with only ShareLock. So upgrading to AEL emits a Standby/LOCK
WAL record, which propagates to standbys. After all, I'd like to take the
weakest lock (AccessShareLock) there.

The attached is the new version of the patch.

- v26-0001-version-nm24.patch
  Same as v24.

- v26-0002-change-swap_relation_files.patch
  Changes to swap_relation_files as mentioned above.

- v26-0003-Improve-the-performance-of-relation-syncs.patch
  Do multiple pending syncs in one shared_buffers scan.

- v26-0004-Revert-FlushRelationBuffersWithoutRelcache.patch
  v26-0003 makes the function useless. Remove it.

- v26-0005-Fix-gistGetFakeLSN.patch
  gistGetFakeLSN fix.

- v26-0006-Sync-files-shrinked-by-truncation.patch
  Fix the problem of commit-time FPI after truncation after a checkpoint.

I'm not sure this is the right direction, but pendingSyncHash is removed
from the pendingDeletes list again.

regards.
-- Kyotaro Horiguchi NTT Open Source Software Center >From ee96bb1e14969823eab79ab1531d68e8aadc1915 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi Date: Thu, 21 Nov 2019 15:28:06 +0900 Subject: [PATCH v26 1/6] version nm24 Noah Misch's version 24. --- doc/src/sgml/config.sgml | 43 +++-- doc/src/sgml/perform.sgml| 47 ++ src/backend/access/gist/gistutil.c | 7 +- src/backend/access/heap/heapam.c | 45 +- src/backend/access/heap/heapam_handler.c | 22 +-- src/backend/access/heap/rewriteheap.c| 21 +-- src/backend/access/nbtree/nbtsort.c | 41 ++--- src/backend/access/transam/README| 47 +- src/backend/access/transam/xact.c| 14 ++ src/backend/access/transam/xloginsert.c | 10 +- src/backend/access/transam/xlogutils.c | 17 +- src/backend/catalog/heap.c | 4 + src/backend/catalog/storage.c| 198 +-- src/backend/commands/cluster.c | 11 ++ src/backend/commands/copy.c | 58 +-- src/backend/commands/createas.c | 11 +- src/backend/commands/matview.c | 12 +- src/backend/commands/tablecmds.c | 11 +- src/backend/storage/buffer/bufmgr.c | 37 +++-- src/backend/storage/smgr/md.c| 9 +- src/backend/utils/cache/relcache.c | 122 ++ src/backend/utils/misc/guc.c | 13 ++ src/include/access/heapam.h | 3 - src/include/access/rewriteheap.h | 2 +- src/include/access/tableam.h | 18 +-- src/include/catalog/storage.h| 5 + src/include/storage/bufmgr.h | 5 + src/include/utils/rel.h | 57 +-- src/include/utils/relcache.h | 8 +- src/test/regress/pg_regress.c| 2 + 30 files changed, 551 insertions(+), 349 deletions(-) diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index d4d1fe45cc..d0f7dbd7d7 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -2483,21 +2483,14 @@ include_dir 'conf.d' levels. This parameter can only be set at server start. -In minimal level, WAL-logging of some bulk -operations can be safely skipped, which can make those -operations much faster (see ). 
-Operations in which this optimization can be applied include: - - CREATE TABLE AS - CREATE INDEX - CLUSTER - COPY into tables that were created or truncated in the same - transaction - -But minimal WAL does not contain enough information to reconstruct the -data from a base backup and the WAL logs, so replica or -higher must be used to enable WAL archiving -() and streaming replication. +In minimal level, no information is logged for +tables or indexes for the remainder of a transaction that creates or +truncates them. This
Re: [HACKERS] WAL logging problem in 9.4.3?
At Sun, 24 Nov 2019 22:08:39 -0500, Noah Misch wrote in
> On Mon, Nov 25, 2019 at 11:08:54AM +0900, Kyotaro Horiguchi wrote:
> > At Sat, 23 Nov 2019 16:21:36 -0500, Noah Misch wrote in
> > > I noticed an additional defect:
> > >
> > > BEGIN;
> > > CREATE TABLE t (c) AS SELECT 1;
> > > CHECKPOINT; -- write and fsync the table's one page
> > > TRUNCATE t; -- no WAL
> > > COMMIT; -- no FPI, just the commit record
> > >
> > > If we crash after the COMMIT and before the next fsync or OS-elected sync
> > > of the table's file, the table will stay on disk with its pre-TRUNCATE
> > > content.
> >
> > The TRUNCATE replaces relfilenode in the catalog
>
> No, it does not. Since the relation is new in the transaction, the TRUNCATE
> uses the heap_truncate_one_rel() strategy.
..
> The zero-pages case is not special. Here's an example of the problem with a
> nonzero size:

I got it. That is, if the file has had blocks beyond its size at commit, we
should sync the file even if it is small enough.

It needs to track the before-truncation size, as this patch used to do.
pendingSyncHash is resurrected to do truncation-size tracking. That
information cannot be stored in SMgrRelation, which can disappear due to
invalidation, nor in Relation, which is not available in the storage layer.

smgrDoPendingDeletes needs to be called at abort again to clean up the
now-useless hash. I'm not sure of the exact cause, but
AssertPendingSyncs_RelationCache() fails at abort (so it is not called at
abort). smgrDoPendingSyncs and RelFileNodeSkippingWAL() become simpler by
using the hash.

It is not fully checked. I haven't merged and measured performance yet, but
I post the status-quo patch for now.

- v25-0001-version-nm.patch
  Noah's v24 patch.

- v25-0002-Revert-FlushRelationBuffersWithoutRelcache.patch
  Remove a useless function (added by this patch..).

- v25-0003-Improve-the-performance-of-relation-syncs.patch
  Make smgrDoPendingSyncs scan shared buffers once.

- v25-0004-Adjust-gistGetFakeLSN.patch
  Amendment for gistGetFakeLSN.
This uses GetXLogInsertRecPtr as long as it is different from the previous call and emits dummy WAL if we need a new LSN. Since other than switch_wal record cannot be empty so the dummy WAL has an integer content for now. v25-0005-Sync-files-shrinked-by-truncation.patch Amendment for the truncation problem. regards. -- Kyotaro Horiguchi NTT Open Source Software Center >From 86d7c2dee819b1171f0a02c56e4cda065c64246f Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi Date: Thu, 21 Nov 2019 15:28:06 +0900 Subject: [PATCH v25 1/5] version nm --- doc/src/sgml/config.sgml | 43 +++-- doc/src/sgml/perform.sgml| 47 ++ src/backend/access/gist/gistutil.c | 7 +- src/backend/access/heap/heapam.c | 45 +- src/backend/access/heap/heapam_handler.c | 22 +-- src/backend/access/heap/rewriteheap.c| 21 +-- src/backend/access/nbtree/nbtsort.c | 41 ++--- src/backend/access/transam/README| 47 +- src/backend/access/transam/xact.c| 14 ++ src/backend/access/transam/xloginsert.c | 10 +- src/backend/access/transam/xlogutils.c | 17 +- src/backend/catalog/heap.c | 4 + src/backend/catalog/storage.c| 198 +-- src/backend/commands/cluster.c | 11 ++ src/backend/commands/copy.c | 58 +-- src/backend/commands/createas.c | 11 +- src/backend/commands/matview.c | 12 +- src/backend/commands/tablecmds.c | 11 +- src/backend/storage/buffer/bufmgr.c | 37 +++-- src/backend/storage/smgr/md.c| 9 +- src/backend/utils/cache/relcache.c | 122 ++ src/backend/utils/misc/guc.c | 13 ++ src/include/access/heapam.h | 3 - src/include/access/rewriteheap.h | 2 +- src/include/access/tableam.h | 18 +-- src/include/catalog/storage.h| 5 + src/include/storage/bufmgr.h | 5 + src/include/utils/rel.h | 57 +-- src/include/utils/relcache.h | 8 +- src/test/regress/pg_regress.c| 2 + 30 files changed, 551 insertions(+), 349 deletions(-) diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index d4d1fe45cc..d0f7dbd7d7 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -2483,21 +2483,14 @@ include_dir 'conf.d' 
levels. This parameter can only be set at server start. -In minimal level, WAL-logging of some bulk -operations can be safely skipped, which can make those -operations much faster (see ). -Operations in which this optimization can be applied include: - - CREATE TABLE AS - CREATE INDEX - CLUSTER - COPY into tables that were created or truncated in the same - transaction - -But minimal WAL does not contain enough information to reconstruct
Re: [HACKERS] WAL logging problem in 9.4.3?
On Mon, Nov 25, 2019 at 03:58:14PM -0500, Robert Haas wrote: > On Sat, Nov 23, 2019 at 4:21 PM Noah Misch wrote: > > I noticed an additional defect: > > > > BEGIN; > > CREATE TABLE t (c) AS SELECT 1; > > CHECKPOINT; -- write and fsync the table's one page > > TRUNCATE t; -- no WAL > > COMMIT; -- no FPI, just the commit record > > > > If we crash after the COMMIT and before the next fsync or OS-elected sync of > > the table's file, the table will stay on disk with its pre-TRUNCATE content. > > Shouldn't the TRUNCATE be triggering an fsync() to happen before > COMMIT is permitted to complete? With wal_skip_threshold=0, you do get an fsync(). The patch tries to avoid at-commit fsync of small files by WAL-logging file contents instead. However, the patch doesn't WAL-log enough to handle files that decreased in size. > You'd have the same problem if the > TRUNCATE were replaced by INSERT, unless fsync() happens in that case. I think an insert would be fine. You'd get an FPI record for the relation's one page, which fully reproduces the relation.
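The size-based trade-off described here — WAL-log the contents of small WAL-skipped files at commit, fsync larger ones — can be made concrete with a toy decision function. This is a hedged sketch only: the kilobyte unit for wal_skip_threshold and the function name are assumptions for illustration; 8192 bytes is PostgreSQL's default BLCKSZ.

```python
BLCKSZ = 8192  # default PostgreSQL block size

def commit_action(nblocks, wal_skip_threshold_kb):
    """Model of the commit-time choice for a relation whose writes
    skipped WAL: a small relation has its pages WAL-logged (the cost
    piggybacks on the commit's WAL flush), while a larger one is
    fsync'd directly. Names and units are illustrative assumptions."""
    size_kb = nblocks * BLCKSZ // 1024
    return "wal_log_pages" if size_kb < wal_skip_threshold_kb else "fsync"
```

With a 64 kB threshold, a one-page (8 kB) table would be WAL-logged at commit, while a 100-page (800 kB) table would be fsync'd — which is why the thread's benchmarks hunt for the page count where the two strategies cross over.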
Re: [HACKERS] WAL logging problem in 9.4.3?
On Sat, Nov 23, 2019 at 4:21 PM Noah Misch wrote: > I noticed an additional defect: > > BEGIN; > CREATE TABLE t (c) AS SELECT 1; > CHECKPOINT; -- write and fsync the table's one page > TRUNCATE t; -- no WAL > COMMIT; -- no FPI, just the commit record > > If we crash after the COMMIT and before the next fsync or OS-elected sync of > the table's file, the table will stay on disk with its pre-TRUNCATE content. Shouldn't the TRUNCATE be triggering an fsync() to happen before COMMIT is permitted to complete? You'd have the same problem if the TRUNCATE were replaced by INSERT, unless fsync() happens in that case. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] WAL logging problem in 9.4.3?
On Mon, Nov 25, 2019 at 11:08:54AM +0900, Kyotaro Horiguchi wrote:
> At Sat, 23 Nov 2019 16:21:36 -0500, Noah Misch wrote in
> > This benchmark procedure may help:
> >
> > 1. Determine $DDL_COUNT, a number of DDL transactions that take about one
> >    minute when done via syncs.
> > 2. Start "pgbench -rP1 --progress-timestamp -T180 -c10 -j10".
> > 3. Wait 10s.
> > 4. Start one DDL backend that runs $DDL_COUNT transactions.
> > 5. Save DDL start timestamp, DDL end timestamp, and pgbench output.
> >
> > I would compare pgbench tps and latency between the seconds when DDL is and
> > is not running. As you did in earlier tests, I would repeat it using
> > various page counts, with and without sync.
>
> I understood the "DDL" is not pure DDLs but a kind of define-then-load,
> like "CREATE TABLE AS", "CREATE TABLE" then "COPY FROM".

When I wrote "DDL", I meant the four-command transaction that you already
used in benchmarks.

> > I noticed an additional defect:
> >
> > BEGIN;
> > CREATE TABLE t (c) AS SELECT 1;
> > CHECKPOINT; -- write and fsync the table's one page
> > TRUNCATE t; -- no WAL
> > COMMIT; -- no FPI, just the commit record
> >
> > If we crash after the COMMIT and before the next fsync or OS-elected sync of
> > the table's file, the table will stay on disk with its pre-TRUNCATE content.
>
> The TRUNCATE replaces relfilenode in the catalog

No, it does not. Since the relation is new in the transaction, the TRUNCATE
uses the heap_truncate_one_rel() strategy.

> Since the file has no pages, it's right that no FPI is emitted.

Correct.

> If we don't rely on such behavior, we can make sure of that by turning the
> zero-pages case from WAL into file sync. I'll do that in the next version.

The zero-pages case is not special.
Here's an example of the problem with a nonzero size:

BEGIN;
CREATE TABLE t (c) AS SELECT * FROM generate_series(1,10);
CHECKPOINT;  -- write and fsync the table's many pages
TRUNCATE t;  -- no WAL
INSERT INTO t VALUES (0);  -- no WAL
COMMIT;  -- FPI for one page; nothing removes the additional pages
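The fix direction discussed in this subthread — remember the largest size a WAL-skipped file has had, and if it shrank by commit time, realize that page images alone cannot remove the stale tail — can be modeled with a toy tracker. Class and method names here are made up for illustration; the real mechanism is the pendingSyncHash mentioned above.

```python
class PendingSyncTracker:
    """Toy model of truncation-size tracking for WAL-skipped relations:
    record the peak size of each file since it started skipping WAL.
    At commit, if the current size is below the peak, WAL-logging the
    surviving pages is insufficient -- the file must also be
    truncated/synced so the stale tail cannot survive a crash."""
    def __init__(self):
        self.max_blocks = {}   # relfilenode -> largest size seen
        self.cur_blocks = {}   # relfilenode -> current size

    def extend(self, rel, nblocks):
        self.cur_blocks[rel] = nblocks
        self.max_blocks[rel] = max(self.max_blocks.get(rel, 0), nblocks)

    def truncate(self, rel, nblocks):
        self.cur_blocks[rel] = nblocks  # peak in max_blocks is kept

    def must_sync_at_commit(self, rel):
        return self.cur_blocks.get(rel, 0) < self.max_blocks.get(rel, 0)
```

In the nonzero-size example above, the table peaks at many pages, is truncated, then grows back to one page; the tracker reports that commit must sync (or truncate) the file rather than rely on the single FPI.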
Re: [HACKERS] WAL logging problem in 9.4.3?
At Sat, 23 Nov 2019 16:21:36 -0500, Noah Misch wrote in
> On Wed, Nov 20, 2019 at 03:05:46PM +0900, Kyotaro Horiguchi wrote:
> > By the way, before finalize this, I'd like to share the result of a
> > brief benchmarking.
>
> What non-default settings did you use? Please give the output of this or a
> similar command:

Only wal_level=minimal and max_wal_senders=0.

>   select name, setting from pg_settings where setting <> boot_val;
>
> If you run more benchmarks and weren't already using wal_buffers=16MB, I
> recommend using it.

Roger.

> > With 10 pgbench sessions.
> >  pages    SYNC      WAL
> >     1:   915 ms    301 ms
> >     3:  1634 ms    508 ms
> >     5:  1634 ms    293 ms
> >    10:  1671 ms   1043 ms
> >    17:  1600 ms    333 ms
> >    31:  1864 ms    314 ms
> >    56:  1562 ms    448 ms
> >   100:  1538 ms    394 ms
> >   177:  1697 ms   1047 ms
> >   316:  3074 ms   1788 ms
> >   562:  3306 ms   1245 ms
> >  1000:  3440 ms   2182 ms
> >  1778:  5064 ms   6464 ms  # WAL's slope becomes steep
> >  3162:  8675 ms   8165 ms
>
> For picking a default wal_skip_threshold, it would have been more informative
> to see how this changes pgbench latency statistics. Some people want DDL to
> be fast, but more people want DDL not to reduce the performance of concurrent
> non-DDL. This benchmark procedure may help:
>
> 1. Determine $DDL_COUNT, a number of DDL transactions that take about one
>    minute when done via syncs.
> 2. Start "pgbench -rP1 --progress-timestamp -T180 -c10 -j10".
> 3. Wait 10s.
> 4. Start one DDL backend that runs $DDL_COUNT transactions.
> 5. Save DDL start timestamp, DDL end timestamp, and pgbench output.
>
> I would compare pgbench tps and latency between the seconds when DDL is and is
> not running. As you did in earlier tests, I would repeat it using various
> page counts, with and without sync.

I understood the "DDL" is not pure DDLs but a kind of define-then-load,
like "CREATE TABLE AS", "CREATE TABLE" then "COPY FROM".

> On Wed, Nov 20, 2019 at 05:31:43PM +0900, Kyotaro Horiguchi wrote:
> > +Prefer to do the same in future access methods.
However, two other approaches
> > +can work. First, an access method can irreversibly transition a given fork
> > +from WAL-skipping to WAL-writing by calling FlushRelationBuffers() and
> > +smgrimmedsync(). Second, an access method can opt to write WAL
> > +unconditionally for permanent relations. When using the second method, do
> > +not call RelationCopyStorage(), which skips WAL.
> >
> > Even using these methods, TransactionCommit flushes out buffers then
> > sync files again. Isn't a description something like the following
> > needed?
> >
> > ===
> > Even an access method switched a in-transaction created relfilenode to
> > WAL-writing, Commit(Prepare)Transaction flushed all buffers for the
> > file then smgrimmedsync() the file.
> > ===
>
> It is enough that the text says to prefer the approach that core access
> methods use. The extra flush and sync when using a non-preferred approach
> wastes some performance, but it is otherwise harmless.

Ah, right, and I agree.

> > +  rel1 = relation_open(r1, AccessExclusiveLock);
> > +  RelationAssumeNewRelfilenode(rel1);
> >
> > It cannot be accessed from other sessions. Theoretically it doesn't
> > need a lock but NoLock cannot be used there since there's a path that
> > doesn't take lock on the relation. But AEL seems too strong and it
> > causes unnecessary side effects. Couldn't we use weaker locks?
>
> We could use NoLock. I assumed we already hold AccessExclusiveLock, in which
> case this has no side effects.

I forgot that this optimization is used only in non-replication
configurations. So I agree that AEL has no side effects.

> On Thu, Nov 21, 2019 at 04:01:07PM +0900, Kyotaro Horiguchi wrote:
> > At Sun, 17 Nov 2019 20:54:34 -0800, Noah Misch wrote in
> > > === Defect 1: gistGetFakeLSN()
> > >
> > > When I modified pg_regress.c to use wal_level=minimal for all suites,
> > > src/test/isolation/specs/predicate-gist.spec failed the assertion in
> > > gistGetFakeLSN().
One could reproduce the problem just by running this > > > sequence in psql: > > > > > > begin; > > > create table gist_point_tbl(id int4, p point); > > > create index gist_pointidx on gist_point_tbl using gist(p); > > > insert into gist_point_tbl (id, p) > > > select g, point(g*10, g*10) from generate_series(1, 1000) g; > > > > > > I've included a wrong-in-general hack to make the test pass. I see two > > > main > > > options for fixing this: > > > > > > (a) Introduce an empty WAL record that reserves an LSN and has no other > > > effect. Make GiST use that for permanent relations that are skipping WAL. > > > Further optimizations are possible. For example, we could use a > > > backend-local > > > counter (like the one gistGetFakeLSN() uses for temp relations) until the > > > counter is greater a recent real LSN. That
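Option (a)'s "backend-local counter until it catches up with a real LSN" optimization can be sketched as follows. This is a hedged Python model, not GiST code: `get_current_lsn` and `reserve_real_lsn` stand in for GetXLogInsertRecPtr and the proposed empty-WAL-record insertion, and LSNs are plain integers.

```python
class FakeLSNGenerator:
    """Hand out monotonically increasing fake LSNs from a local counter,
    falling back to reserving a real LSN (by emitting a dummy WAL record)
    only once the counter would catch up with the real WAL position.
    This models the optimization floated for option (a)."""
    def __init__(self, get_current_lsn, reserve_real_lsn):
        self._current = get_current_lsn    # e.g. GetXLogInsertRecPtr
        self._reserve = reserve_real_lsn   # emits a dummy record, returns LSN
        self._counter = 1

    def next_fake_lsn(self):
        if self._counter < self._current():
            lsn = self._counter            # cheap: no WAL traffic at all
            self._counter += 1
            return lsn
        lsn = self._reserve()              # counter caught up: reserve for real
        self._counter = lsn + 1
        return lsn
```

As long as the real insert position stays ahead, no WAL is written at all, which is why the dummy record would "almost never appear".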
Re: [HACKERS] WAL logging problem in 9.4.3?
On Sat, Nov 23, 2019 at 11:35:09AM -0500, Noah Misch wrote:
> That longstanding optimization is too useful to remove, but likely not useful
> enough to add today if we didn't already have it. The initial-data-load use
> case remains plausible. I can also imagine using wal_level=minimal for data
> warehouse applications where one can quickly rebuild from the authoritative
> data.

I can easily imagine cases where a user would like to use the benefit of the
optimization for an initial data load, and afterwards update wal_level to
replica, so that they avoid the initial WAL burst, which serves no real
purpose. So the first argument is pretty strong IMO, the second much less.
--
Michael
Re: [HACKERS] WAL logging problem in 9.4.3?
On Wed, Nov 20, 2019 at 03:05:46PM +0900, Kyotaro Horiguchi wrote: > By the way, before finalize this, I'd like to share the result of a > brief benchmarking. What non-default settings did you use? Please give the output of this or a similar command: select name, setting from pg_settings where setting <> boot_val; If you run more benchmarks and weren't already using wal_buffers=16MB, I recommend using it. > With 10 pgbench sessions. > pages SYNC WAL > 1: 915 ms 301 ms > 3: 1634 ms 508 ms > 5: 1634 ms 293ms >10: 1671 ms 1043 ms >17: 1600 ms 333 ms >31: 1864 ms 314 ms >56: 1562 ms 448 ms > 100: 1538 ms 394 ms > 177: 1697 ms 1047 ms > 316: 3074 ms 1788 ms > 562: 3306 ms 1245 ms > 1000: 3440 ms 2182 ms > 1778: 5064 ms 6464 ms # WAL's slope becomes steep > 3162: 8675 ms 8165 ms For picking a default wal_skip_threshold, it would have been more informative to see how this changes pgbench latency statistics. Some people want DDL to be fast, but more people want DDL not to reduce the performance of concurrent non-DDL. This benchmark procedure may help: 1. Determine $DDL_COUNT, a number of DDL transactions that take about one minute when done via syncs. 2. Start "pgbench -rP1 --progress-timestamp -T180 -c10 -j10". 3. Wait 10s. 4. Start one DDL backend that runs $DDL_COUNT transactions. 5. Save DDL start timestamp, DDL end timestamp, and pgbench output. I would compare pgbench tps and latency between the seconds when DDL is and is not running. As you did in earlier tests, I would repeat it using various page counts, with and without sync. On Wed, Nov 20, 2019 at 05:31:43PM +0900, Kyotaro Horiguchi wrote: > +Prefer to do the same in future access methods. However, two other > approaches > +can work. First, an access method can irreversibly transition a given fork > +from WAL-skipping to WAL-writing by calling FlushRelationBuffers() and > +smgrimmedsync(). Second, an access method can opt to write WAL > +unconditionally for permanent relations. 
When using the second method, do > not > +call RelationCopyStorage(), which skips WAL. > > Even using these methods, TransactionCommit flushes out buffers then > sync files again. Isn't a description something like the following > needed? > > === > Even an access method switched a in-transaction created relfilenode to > WAL-writing, Commit(Prepare)Transaction flushed all buffers for the > file then smgrimmedsync() the file. > === It is enough that the text says to prefer the approach that core access methods use. The extra flush and sync when using a non-preferred approach wastes some performance, but it is otherwise harmless. > + rel1 = relation_open(r1, AccessExclusiveLock); > + RelationAssumeNewRelfilenode(rel1); > > It cannot be accessed from other sessions. Theoretically it doesn't > need a lock but NoLock cannot be used there since there's a path that > doesn't take lock on the relation. But AEL seems too strong and it > causes unecessary side effect. Couldn't we use weaker locks? We could use NoLock. I assumed we already hold AccessExclusiveLock, in which case this has no side effects. On Thu, Nov 21, 2019 at 04:01:07PM +0900, Kyotaro Horiguchi wrote: > At Sun, 17 Nov 2019 20:54:34 -0800, Noah Misch wrote in > > === Defect 1: gistGetFakeLSN() > > > > When I modified pg_regress.c to use wal_level=minimal for all suites, > > src/test/isolation/specs/predicate-gist.spec failed the assertion in > > gistGetFakeLSN(). One could reproduce the problem just by running this > > sequence in psql: > > > > begin; > > create table gist_point_tbl(id int4, p point); > > create index gist_pointidx on gist_point_tbl using gist(p); > > insert into gist_point_tbl (id, p) > > select g, point(g*10, g*10) from generate_series(1, 1000) g; > > > > I've included a wrong-in-general hack to make the test pass. I see two main > > options for fixing this: > > > > (a) Introduce an empty WAL record that reserves an LSN and has no other > > effect. 
Make GiST use that for permanent relations that are skipping WAL. > > Further optimizations are possible. For example, we could use a > > backend-local > > counter (like the one gistGetFakeLSN() uses for temp relations) until the > > counter is greater a recent real LSN. That optimization is probably too > > clever, though it would make the new WAL record almost never appear. > > > > (b) Exempt GiST from most WAL skipping. GiST index build could still skip > > WAL, but it would do its own smgrimmedsync() in addition to the one done at > > commit. Regular GiST mutations would test RELPERSISTENCE_PERMANENT instead > > of > > RelationNeedsWal(), and we'd need some hack for index_copy_data() and > > possibly > > other AM-independent code that skips WAL. > > > > Overall, I like the cleanliness of (a). The main argument for (b) is that > > it > > ensures we have all the
Re: [HACKERS] WAL logging problem in 9.4.3?
On Fri, Nov 22, 2019 at 01:21:31PM +0100, Peter Eisentraut wrote: > On 2019-11-05 22:16, Robert Haas wrote: > >First, I'd like to restate my understanding of the problem just to see > >whether I've got the right idea and whether we're all on the same > >page. When wal_level=minimal, we sometimes try to skip WAL logging on > >newly-created relations in favor of fsync-ing the relation at commit > >time. > > How useful is this behavior, relative to all the effort required? > > Even if the benefit is significant, how many users can accept running with > wal_level=minimal and thus without replication or efficient backups? That longstanding optimization is too useful to remove, but likely not useful enough to add today if we didn't already have it. The initial-data-load use case remains plausible. I can also imagine using wal_level=minimal for data warehouse applications where one can quickly rebuild from the authoritative data. > Is there perhaps an alternative approach involving unlogged tables to get a > similar performance benefit? At wal_level=replica, it seems inevitable that ALTER TABLE SET LOGGED will need to WAL-log the table contents. I suppose we could keep wal_level=minimal and change its only difference from wal_level=replica to be that ALTER TABLE SET LOGGED skips WAL. Currently, ALTER TABLE SET LOGGED also rewrites the table; that would need to change. I'd want to add ALTER INDEX SET LOGGED, too. After all that, users would need to modify their applications. Overall, it's possible, but it's not a clear win over the status quo.
Re: [HACKERS] WAL logging problem in 9.4.3?
On 2019-11-05 22:16, Robert Haas wrote:
> First, I'd like to restate my understanding of the problem just to see
> whether I've got the right idea and whether we're all on the same
> page. When wal_level=minimal, we sometimes try to skip WAL logging on
> newly-created relations in favor of fsync-ing the relation at commit
> time.

How useful is this behavior, relative to all the effort required?

Even if the benefit is significant, how many users can accept running with wal_level=minimal and thus without replication or efficient backups?

Is there perhaps an alternative approach involving unlogged tables to get a similar performance benefit?

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] WAL logging problem in 9.4.3?
At Thu, 21 Nov 2019 16:01:07 +0900 (JST), Kyotaro Horiguchi wrote in
> > For deleting relfilenodes, smgrDoPendingDeletes() collects a list for
> > smgrdounlinkall() to pass to DropRelFileNodesAllBuffers(), which is
> > sophisticated about optimizing the shared buffers scan. Commit 279628a
> > introduced that, in 2013. I think smgrDoPendingSyncs() should do likewise, to
> Seems reasonable. Please wait a minute.

This is the first cut of that. It makes the function FlushRelationBuffersWithoutRelcache, which was introduced earlier in this work, useless. The first patch reverts it; the second patch then adds the bulk sync feature.

The new function FlushRelFileNodesAllBuffers, unlike DropRelFileNodesAllBuffers, takes an SMgrRelation, which is required by FlushBuffer(). So it takes a somewhat tricky approach: it uses the type SMgrSortArray, a pointer to which is layout-compatible with a pointer to RelFileNode.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

>From c51b44734d88fb19b568c4c0240848c8be2b7cf4 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi
Date: Thu, 21 Nov 2019 19:28:35 +0900
Subject: [PATCH 1/2] Revert FlushRelationBuffersWithoutRelcache.

The succeeding patch makes the function useless, and it is no longer
useful globally. Revert it.
--- src/backend/storage/buffer/bufmgr.c | 27 ++- src/include/storage/bufmgr.h| 2 -- 2 files changed, 10 insertions(+), 19 deletions(-) diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c index 746ce477fc..67bbb26cae 100644 --- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c @@ -3203,27 +3203,20 @@ PrintPinnedBufs(void) void FlushRelationBuffers(Relation rel) { - RelationOpenSmgr(rel); - - FlushRelationBuffersWithoutRelcache(rel->rd_smgr, - RelationUsesLocalBuffers(rel)); -} - -void -FlushRelationBuffersWithoutRelcache(SMgrRelation smgr, bool islocal) -{ - RelFileNode rnode = smgr->smgr_rnode.node; - int i; + int i; BufferDesc *bufHdr; - if (islocal) + /* Open rel at the smgr level if not already done */ + RelationOpenSmgr(rel); + + if (RelationUsesLocalBuffers(rel)) { for (i = 0; i < NLocBuffer; i++) { uint32 buf_state; bufHdr = GetLocalBufferDescriptor(i); - if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) && + if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) && ((buf_state = pg_atomic_read_u32(>state)) & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY)) { @@ -3240,7 +3233,7 @@ FlushRelationBuffersWithoutRelcache(SMgrRelation smgr, bool islocal) PageSetChecksumInplace(localpage, bufHdr->tag.blockNum); -smgrwrite(smgr, +smgrwrite(rel->rd_smgr, bufHdr->tag.forkNum, bufHdr->tag.blockNum, localpage, @@ -3270,18 +3263,18 @@ FlushRelationBuffersWithoutRelcache(SMgrRelation smgr, bool islocal) * As in DropRelFileNodeBuffers, an unlocked precheck should be safe * and saves some cycles. 
*/ - if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode)) + if (!RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node)) continue; ReservePrivateRefCountEntry(); buf_state = LockBufHdr(bufHdr); - if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) && + if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) && (buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY)) { PinBuffer_Locked(bufHdr); LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED); - FlushBuffer(bufHdr, smgr); + FlushBuffer(bufHdr, rel->rd_smgr); LWLockRelease(BufferDescriptorGetContentLock(bufHdr)); UnpinBuffer(bufHdr, true); } diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h index 8097d5ab22..8cd1cf25d9 100644 --- a/src/include/storage/bufmgr.h +++ b/src/include/storage/bufmgr.h @@ -192,8 +192,6 @@ extern BlockNumber RelationGetNumberOfBlocksInFork(Relation relation, ForkNumber forkNum); extern void FlushOneBuffer(Buffer buffer); extern void FlushRelationBuffers(Relation rel); -extern void FlushRelationBuffersWithoutRelcache(struct SMgrRelationData *smgr, -bool islocal); extern void FlushDatabaseBuffers(Oid dbid); extern void DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber *forkNum, int nforks, BlockNumber *firstDelBlock); -- 2.23.0 >From 882731fcf063269d0bf85c57f23c83b9570e5df5 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi Date: Thu, 21 Nov 2019 19:33:18 +0900 Subject: [PATCH 2/2] Improve the performance of relation syncs. We can improve performance of syncing multiple files at once in the same way as b41669118. This reduces the number of scans on the whole shared_bufffers from the number of synced relations to one. --- src/backend/catalog/storage.c | 28 +-- src/backend/storage/buffer/bufmgr.c | 113 src/backend/storage/smgr/smgr.c | 38 +- src/include/storage/bufmgr.h| 1 + src/include/storage/smgr.h
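The point of the patch pair above is to replace one full shared_buffers scan per synced relation with a single scan that checks each buffer against a sorted list of target relfilenodes, as DropRelFileNodesAllBuffers already does. The access pattern can be sketched in isolation; in this hedged illustration plain integers stand in for RelFileNodes, and the names are mine, not the patch's:

```c
#include <assert.h>
#include <stdlib.h>

/*
 * One pass over all buffers with a binary search per buffer:
 * O(NBuffers * log nrels) instead of the O(NBuffers * nrels) cost of
 * scanning the buffer pool once per relation.
 */
static int
cmp_int(const void *a, const void *b)
{
    int x = *(const int *) a;
    int y = *(const int *) b;

    return (x > y) - (x < y);
}

/* Count buffers whose owning "relfilenode" appears in rels[]. */
static int
count_hits(const int *buf_rel, int nbuffers, int *rels, int nrels)
{
    int hits = 0;

    qsort(rels, nrels, sizeof(int), cmp_int);   /* sort the targets once */
    for (int i = 0; i < nbuffers; i++)          /* single pass over buffers */
    {
        if (bsearch(&buf_rel[i], rels, nrels, sizeof(int), cmp_int))
            hits++;
    }
    return hits;
}
```

In the real code the "hit" action is flushing the buffer rather than counting it, but the sorted-list-plus-single-scan shape is the optimization being ported from commit 279628a.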
Re: [HACKERS] WAL logging problem in 9.4.3?
Wow.. This is embarrassing.. *^^*. At Thu, 21 Nov 2019 16:01:07 +0900 (JST), Kyotaro Horiguchi wrote in > I should have replied this first. > > At Sun, 17 Nov 2019 20:54:34 -0800, Noah Misch wrote in > > On Tue, Nov 05, 2019 at 02:53:35PM -0800, Noah Misch wrote: > > > I started pre-commit editing on 2019-10-28, and comment+README updates > > > have > > > been the largest part of that. I'll check my edits against the things you > > > list here, and I'll share on-list before committing. I've now marked the > > > CF > > > entry Ready for Committer. > > > > Having dedicated many days to that, I am attaching v24nm. I know of two > > remaining defects: > > > > === Defect 1: gistGetFakeLSN() > > > > When I modified pg_regress.c to use wal_level=minimal for all suites, > > src/test/isolation/specs/predicate-gist.spec failed the assertion in > > gistGetFakeLSN(). One could reproduce the problem just by running this > > sequence in psql: > > > > begin; > > create table gist_point_tbl(id int4, p point); > > create index gist_pointidx on gist_point_tbl using gist(p); > > insert into gist_point_tbl (id, p) > > select g, point(g*10, g*10) from generate_series(1, 1000) g; > > > > I've included a wrong-in-general hack to make the test pass. I see two main > > options for fixing this: > > > > (a) Introduce an empty WAL record that reserves an LSN and has no other > > effect. Make GiST use that for permanent relations that are skipping WAL. > > Further optimizations are possible. For example, we could use a > > backend-local > > counter (like the one gistGetFakeLSN() uses for temp relations) until the > > counter is greater a recent real LSN. That optimization is probably too > > clever, though it would make the new WAL record almost never appear. > > > > (b) Exempt GiST from most WAL skipping. GiST index build could still skip > > WAL, but it would do its own smgrimmedsync() in addition to the one done at > > commit. 
Regular GiST mutations would test RELPERSISTENCE_PERMANENT instead
> > of RelationNeedsWal(), and we'd need some hack for index_copy_data() and
> > possibly other AM-independent code that skips WAL.
> >
> > Overall, I like the cleanliness of (a). The main argument for (b) is that
> > it ensures we have all the features to opt-out of WAL skipping, which could be
> > useful for out-of-tree index access methods. (I think we currently have the
> > features for a tableam to do so, but not for an indexam to do so.) Overall, I
> > lean toward (a). Any other ideas or preferences?
>
> I don't like (b) either.
>
> What we need there is some sequence of numbers for page LSNs that is
> really compatible with real LSNs. Couldn't we use GetXLogInsertRecPtr() in
> that case? Or, I'm not sure, but I suppose that nothing happens when an
> UNLOGGED GiST index gets turned into a LOGGED one.

Yes, I just forgot to remove these lines when writing the following.

> Rewriting the table as SET LOGGED does would work, but that is not realistic.

> > === Defect 2: repetitive work when syncing many relations
> >
> > For deleting relfilenodes, smgrDoPendingDeletes() collects a list for
> > smgrdounlinkall() to pass to DropRelFileNodesAllBuffers(), which is
> > sophisticated about optimizing the shared buffers scan. Commit 279628a
> > introduced that, in 2013. I think smgrDoPendingSyncs() should do likewise, to
> > further reduce the chance of causing performance regressions. (One could,
> > however, work around the problem by raising wal_skip_threshold.) Kyotaro, if
> > you agree, could you modify v24nm to implement that?
>
> Seems reasonable. Please wait a minute.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
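The two LSN sources under discussion (a session-local counter for temporary relations, and something tied to the real WAL insert position for WAL-skipped permanent relations) can be modeled in a few lines. This is a standalone sketch with a stand-in variable for GetXLogInsertRecPtr(); it is my simplified illustration, not the actual gistGetFakeLSN() code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t lsn_t;

/* Stand-in for GetXLogInsertRecPtr(): the current end of WAL. */
static lsn_t wal_insert_ptr = 1000;

static lsn_t
get_fake_lsn(bool is_temp)
{
    static lsn_t temp_counter = 1;

    if (is_temp)
    {
        /*
         * Temporary relations are only visible to this session, so a
         * private monotonically increasing counter is enough.
         */
        return temp_counter++;
    }

    /*
     * Permanent but WAL-skipped: hand out a real-looking LSN so that, if
     * the relation later starts WAL-logging (e.g. after commit), its page
     * LSNs stay consistent with the actual WAL stream.
     */
    return wal_insert_ptr;
}
```

The design tension the thread is weighing: option (a) reserves a genuine LSN via an empty WAL record, while this GetXLogInsertRecPtr()-style shortcut reuses the current insert position without writing anything.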
Re: [HACKERS] WAL logging problem in 9.4.3?
I should have replied this first. At Sun, 17 Nov 2019 20:54:34 -0800, Noah Misch wrote in > On Tue, Nov 05, 2019 at 02:53:35PM -0800, Noah Misch wrote: > > I started pre-commit editing on 2019-10-28, and comment+README updates have > > been the largest part of that. I'll check my edits against the things you > > list here, and I'll share on-list before committing. I've now marked the CF > > entry Ready for Committer. > > Having dedicated many days to that, I am attaching v24nm. I know of two > remaining defects: > > === Defect 1: gistGetFakeLSN() > > When I modified pg_regress.c to use wal_level=minimal for all suites, > src/test/isolation/specs/predicate-gist.spec failed the assertion in > gistGetFakeLSN(). One could reproduce the problem just by running this > sequence in psql: > > begin; > create table gist_point_tbl(id int4, p point); > create index gist_pointidx on gist_point_tbl using gist(p); > insert into gist_point_tbl (id, p) > select g, point(g*10, g*10) from generate_series(1, 1000) g; > > I've included a wrong-in-general hack to make the test pass. I see two main > options for fixing this: > > (a) Introduce an empty WAL record that reserves an LSN and has no other > effect. Make GiST use that for permanent relations that are skipping WAL. > Further optimizations are possible. For example, we could use a backend-local > counter (like the one gistGetFakeLSN() uses for temp relations) until the > counter is greater a recent real LSN. That optimization is probably too > clever, though it would make the new WAL record almost never appear. > > (b) Exempt GiST from most WAL skipping. GiST index build could still skip > WAL, but it would do its own smgrimmedsync() in addition to the one done at > commit. Regular GiST mutations would test RELPERSISTENCE_PERMANENT instead of > RelationNeedsWal(), and we'd need some hack for index_copy_data() and possibly > other AM-independent code that skips WAL. > > Overall, I like the cleanliness of (a). 
The main argument for (b) is that it
> ensures we have all the features to opt-out of WAL skipping, which could be
> useful for out-of-tree index access methods. (I think we currently have the
> features for a tableam to do so, but not for an indexam to do so.) Overall, I
> lean toward (a). Any other ideas or preferences?

I don't like (b) either.

What we need there is some sequence of numbers for page LSNs that is really compatible with real LSNs. Couldn't we use GetXLogInsertRecPtr() in that case? Or, I'm not sure, but I suppose that nothing happens when an UNLOGGED GiST index gets turned into a LOGGED one.

Rewriting the table as SET LOGGED does would work, but that is not realistic.

> === Defect 2: repetitive work when syncing many relations
>
> For deleting relfilenodes, smgrDoPendingDeletes() collects a list for
> smgrdounlinkall() to pass to DropRelFileNodesAllBuffers(), which is
> sophisticated about optimizing the shared buffers scan. Commit 279628a
> introduced that, in 2013. I think smgrDoPendingSyncs() should do likewise, to
> further reduce the chance of causing performance regressions. (One could,
> however, work around the problem by raising wal_skip_threshold.) Kyotaro, if
> you agree, could you modify v24nm to implement that?

Seems reasonable. Please wait a minute.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index 66c52d6dd6..387b1f7d18 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -1017,8 +1017,7 @@ gistGetFakeLSN(Relation rel)
 	 * XXX before commit fix this. This is not correct for
 	 * RELPERSISTENCE_PERMANENT, but it suffices to make tests pass.
 	 */
-	if (rel->rd_rel->relpersistence == RELPERSISTENCE_TEMP
-		|| rel->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT)
+	if (rel->rd_rel->relpersistence == RELPERSISTENCE_TEMP)
 	{
 		/*
 		 * Temporary relations are only accessible in our session, so a simple
@@ -1026,6 +1025,15 @@ gistGetFakeLSN(Relation rel)
 		 */
 		return counter++;
 	}
+	else if (rel->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT)
+	{
+		/*
+		 * Even though we are skipping WAL-logging of a permanent relations,
+		 * the LSN must be a real one because WAL-logging starts after commit.
+		 */
+		Assert(!RelationNeedsWAL(rel));
+		return GetXLogInsertRecPtr();
+	}
 	else
 	{
 		/*
Re: [HACKERS] WAL logging problem in 9.4.3?
At Sun, 17 Nov 2019 20:54:34 -0800, Noah Misch wrote in
> On Tue, Nov 05, 2019 at 02:53:35PM -0800, Noah Misch wrote:
> > I started pre-commit editing on 2019-10-28, and comment+README updates have
> > been the largest part of that. I'll check my edits against the things you
> > list here, and I'll share on-list before committing. I've now marked the CF
> > entry Ready for Committer.

I looked at the version.

> Notable changes in v24nm:
>
> - Wrote section "Skipping WAL for New RelFileNode" in
> src/backend/access/transam/README to be the main source concerning the new
> coding rules.

Thanks for writing this.

+Prefer to do the same in future access methods. However, two other approaches
+can work. First, an access method can irreversibly transition a given fork
+from WAL-skipping to WAL-writing by calling FlushRelationBuffers() and
+smgrimmedsync(). Second, an access method can opt to write WAL
+unconditionally for permanent relations. When using the second method, do not
+call RelationCopyStorage(), which skips WAL.

Even when using these methods, TransactionCommit flushes out the buffers and then syncs the files again. Isn't a description something like the following needed?

===
Even if an access method has switched an in-transaction-created relfilenode to
WAL-writing, Commit(Prepare)Transaction still flushes all buffers for the file
and then smgrimmedsync()s the file.
===

> - Updated numerous comments and doc sections.
>
> - Eliminated the pendingSyncs list in favor of a "sync" field in
> pendingDeletes. I mostly did this to eliminate the possibility of the lists
> getting out of sync. This removed considerable parallel code for managing a
> second list at end-of-xact. We now call smgrDoPendingSyncs() only when
> committing or preparing a top-level transaction.

Mmm. Right. The second list was a trace of older versions, which may have needed additional work at rollback. Actually, as of v23 the function syncs no files at rollback. It is wiser to merge the two.
> - Whenever code sets an rd_*Subid field of a Relation, it must call
> EOXactListAdd(). swap_relation_files() was not doing so, so the field
> remained set during the next transaction. I introduced
> RelationAssumeNewRelfilenode() to handle both tasks, and I located the call
> so it also affects the mapped relation case.

Ugh.. Thanks for pointing that out.

By the way,

+	/*
+	 * Recognize that rel1's relfilenode (swapped from rel2) is new in this
+	 * subtransaction. Since the next step for rel2 is deletion, don't bother
+	 * recording the newness of its relfilenode.
+	 */
+	rel1 = relation_open(r1, AccessExclusiveLock);
+	RelationAssumeNewRelfilenode(rel1);

The relation cannot be accessed from other sessions. Theoretically it doesn't need a lock, but NoLock cannot be used there, since there's a path that doesn't take a lock on the relation. But AccessExclusiveLock seems too strong and causes unnecessary side effects. Couldn't we use a weaker lock?

... Time is up. I'll continue looking at this.

regards.

> - In ALTER TABLE cases where TryReuseIndex() avoided an index rebuild,
> rd_createSubid remained set. (That happened before this patch, but it has
> been harmless.) I fixed this in heap_create().
>
> - Made smgrDoPendingSyncs() stop exempting FSM_FORKNUM. A sync is necessary
> when checksums are enabled. Observe the precedent that
> RelationCopyStorage() has not been exempting FSM_FORKNUM.
>
> - Pass log_newpage_range() a "false" for page_std, for the same reason
> RelationCopyStorage() does.
>
> - log_newpage_range() ignored its forkNum and page_std arguments, so we logged
> the wrong data for non-main forks. Before this patch, callers always passed
> MAIN_FORKNUM and "true", hence the lack of complaints.
>
> - Restored table_finish_bulk_insert(), though heapam no longer provides a
> callback. The API is still well-defined, and other table AMs might have use
> for it. Removing it feels like a separate proposal.
>
> - Removed TABLE_INSERT_SKIP_WAL.
Any out-of-tree code using it should revisit > itself in light of this patch. > > - Fixed smgrDoPendingSyncs() to reinitialize total_blocks for each relation; > it was overcounting. > > - Made us skip WAL after SET TABLESPACE, like we do after CLUSTER. > > - Moved the wal_skip_threshold docs from "Resource Consumption" -> "Disk" to > "Write Ahead Log" -> "Settings", between similar settings > wal_writer_flush_after and commit_delay. The other place I considered was > "Resource Consumption" -> "Asynchronous Behavior", due to the similarity of > backend_flush_after. > > - Gave each test a unique name. Changed test table names to be descriptive, > e.g. test7 became trunc_trig. > > - Squashed all patches into one. Split patches are good when one could > reasonably choose to push a subset, but that didn't apply here. I wouldn't > push a GUC implementation without its documentation. Since the tests fail > without the main bug fix, I
Re: [HACKERS] WAL logging problem in 9.4.3?
I'm in the benchmarking week.. Thanks for reviewing!

At Sun, 17 Nov 2019 20:54:34 -0800, Noah Misch wrote in
> On Tue, Nov 05, 2019 at 02:53:35PM -0800, Noah Misch wrote:
> > I started pre-commit editing on 2019-10-28, and comment+README updates have
> > been the largest part of that. I'll check my edits against the things you
> > list here, and I'll share on-list before committing. I've now marked the CF
> > entry Ready for Committer.

I'll look into that soon. By the way, before finalizing this, I'd like to share the results of a brief benchmark.

First, I measured the direct effect of WAL skipping. I measured the time required to do the following sequence for the COMMIT-FPW-WAL case and the COMMIT-fsync case. WAL and heap files are on a non-server-spec HDD.

BEGIN;
TRUNCATE t;
INSERT INTO t (SELECT a FROM generate_series(1, n) a);
COMMIT;

REPLICA means the time with wal_level = replica.
SYNC    means the time with wal_level = minimal and forced file sync.
WAL     means the time with wal_level = minimal and forced commit-WAL.
pages is the number of pages of the table.
(REPLICA comes from run.sh 1, SYNC/WAL comes from run.sh 2)

pages   REPLICA     SYNC      WAL
   1:    144 ms    683 ms   217 ms
   3:    303 ms    995 ms   385 ms
   5:    271 ms   1007 ms   217 ms
  10:    157 ms   1043 ms   224 ms
  17:    189 ms   1007 ms   193 ms
  31:    202 ms   1091 ms   230 ms
  56:    265 ms   1175 ms   226 ms
 100:    510 ms   1307 ms   270 ms
 177:    790 ms   1523 ms   524 ms
 316:   1827 ms   1643 ms   719 ms
 562:   1904 ms   2109 ms  1148 ms
1000:   3060 ms   2979 ms  2113 ms
1778:   6077 ms   3945 ms  3618 ms
3162:  13038 ms   7078 ms  6734 ms

There was a crossing point around 3000 pages. (bench1() finds that by bisecting, run.sh 3.)

With multiple sessions, the crossing point moves, but it does not get very small.

10 processes (run.pl 4 10)
The numbers in parentheses are WAL[n]/WAL[n-1].

pages    SYNC        WAL
 316:   8436 ms    4694 ms
 562:  12067 ms    9627 ms (x2.1)  # WAL wins
1000:  19154 ms   43262 ms (x4.5)  # SYNC wins. WAL's slope becomes steep.
1778:  32495 ms   63863 ms (x1.4)

100 processes (run.pl 4 100)

pages    SYNC        WAL
  10:  13275 ms    1868 ms
  17:  15919 ms    4438 ms (x2.3)
  31:  17063 ms    6431 ms (x1.5)
  56:  23193 ms   14276 ms (x2.2)  # WAL wins
 100:  35220 ms   67843 ms (x4.8)  # SYNC wins. WAL's slope becomes steep.

With 10 pgbench sessions:

pages    SYNC      WAL
   1:    915 ms   301 ms
   3:   1634 ms   508 ms
   5:   1634 ms   293 ms
  10:   1671 ms  1043 ms
  17:   1600 ms   333 ms
  31:   1864 ms   314 ms
  56:   1562 ms   448 ms
 100:   1538 ms   394 ms
 177:   1697 ms  1047 ms
 316:   3074 ms  1788 ms
 562:   3306 ms  1245 ms
1000:   3440 ms  2182 ms
1778:   5064 ms  6464 ms  # WAL's slope becomes steep
3162:   8675 ms  8165 ms

I don't think the result for 100 processes is meaningful, so excluding that result, a candidate for wal_skip_threshold can be 1000. Thoughts?

The benchmark script is attached.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

#! /usr/bin/perl
use strict;
use IPC::Open2;
use Time::HiRes qw (gettimeofday tv_interval);

my $tupperpage = 226;
my @time = ();

sub bench {
	my ($header, $nprocs, $ntups, $threshold) = @_;
	my @result = ();
	my @rds = ();

	for (my $ip = 0 ; $ip < $nprocs ; $ip++)
	{
		pipe(my $rd, my $wr);
		$rds[$ip] = $rd;

		my $pid = fork();
		die "fork failed: $!\n" if ($pid < 0);
		if ($pid == 0)
		{
			close($rd);
			my $pid = open2(my $psqlrd, my $psqlwr, "psql postgres");
			print $psqlwr "SET wal_skip_threshold to $threshold;\n";
			print $psqlwr "DROP TABLE IF EXISTS t$ip;";
			print $psqlwr "CREATE TABLE t$ip (a int);\n";

			my @st = gettimeofday();
			for (my $i = 0 ; $i < 10 ; $i++)
			{
				print $psqlwr "BEGIN;";
				print $psqlwr "TRUNCATE t$ip;";
				print $psqlwr "INSERT INTO t$ip (SELECT a FROM generate_series(1, $ntups) a);";
				print $psqlwr "COMMIT;";
			}
			close($psqlwr);
			waitpid($pid, 0);
			print $wr $ip, " ", 1000 * tv_interval(\@st, [gettimeofday]), "\n";
			exit;
		}
		close($wr);
	}

	my $rpid;
	while (($rpid = wait()) == 0) {}

	my $sum = 0;
	for (my $ip = 0 ; $ip < $nprocs ; $ip++)
	{
		my $ret = readline($rds[$ip]);
		die "format? $ret\n" if ($ret !~ /^([0-9]+) ([0-9.]+)$/);
		$sum += $2;
	}
	printf "$header: procs $nprocs: time
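The crossing points measured in the benchmark above are exactly what a wal_skip_threshold setting has to capture. As a standalone illustration (my own simplified model in pages, not the patch's code, which compares sizes against the GUC), the commit-time choice the threshold drives looks like this:

```c
#include <assert.h>

/*
 * Simplified model: at commit, a WAL-skipped relation is either fsync'd
 * to disk or has its pages logged as full-page images, whichever the
 * threshold predicts to be cheaper. The benchmark in this mail is a
 * search for a good default value of that threshold.
 */
enum sync_method
{
    LOG_PAGES,  /* write the pages to WAL at commit */
    SYNC_FILE   /* fsync the relation file at commit */
};

static enum sync_method
choose_pending_sync(unsigned nblocks, unsigned threshold_blocks)
{
    if (nblocks < threshold_blocks)
        return LOG_PAGES;   /* few pages: FPIs are cheaper than an fsync */
    return SYNC_FILE;       /* large relation: one fsync beats logging it all */
}
```

With the candidate threshold of 1000 pages proposed above, a 10-page table would be logged while the 3162-page table from the single-session run would be synced.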
Re: [HACKERS] WAL logging problem in 9.4.3?
On Tue, Nov 05, 2019 at 02:53:35PM -0800, Noah Misch wrote:
> I started pre-commit editing on 2019-10-28, and comment+README updates have
> been the largest part of that. I'll check my edits against the things you
> list here, and I'll share on-list before committing. I've now marked the CF
> entry Ready for Committer.

Having dedicated many days to that, I am attaching v24nm. I know of two remaining defects:

=== Defect 1: gistGetFakeLSN()

When I modified pg_regress.c to use wal_level=minimal for all suites, src/test/isolation/specs/predicate-gist.spec failed the assertion in gistGetFakeLSN(). One could reproduce the problem just by running this sequence in psql:

begin;
create table gist_point_tbl(id int4, p point);
create index gist_pointidx on gist_point_tbl using gist(p);
insert into gist_point_tbl (id, p)
select g, point(g*10, g*10) from generate_series(1, 1000) g;

I've included a wrong-in-general hack to make the test pass. I see two main options for fixing this:

(a) Introduce an empty WAL record that reserves an LSN and has no other effect. Make GiST use that for permanent relations that are skipping WAL. Further optimizations are possible. For example, we could use a backend-local counter (like the one gistGetFakeLSN() uses for temp relations) until the counter is greater than a recent real LSN. That optimization is probably too clever, though it would make the new WAL record almost never appear.

(b) Exempt GiST from most WAL skipping. GiST index build could still skip WAL, but it would do its own smgrimmedsync() in addition to the one done at commit. Regular GiST mutations would test RELPERSISTENCE_PERMANENT instead of RelationNeedsWAL(), and we'd need some hack for index_copy_data() and possibly other AM-independent code that skips WAL.

Overall, I like the cleanliness of (a). The main argument for (b) is that it ensures we have all the features to opt out of WAL skipping, which could be useful for out-of-tree index access methods.
(I think we currently have the features for a tableam to do so, but not for an indexam to do so.) Overall, I lean toward (a). Any other ideas or preferences? === Defect 2: repetitive work when syncing many relations For deleting relfilenodes, smgrDoPendingDeletes() collects a list for smgrdounlinkall() to pass to DropRelFileNodesAllBuffers(), which is sophisticated about optimizing the shared buffers scan. Commit 279628a introduced that, in 2013. I think smgrDoPendingSyncs() should do likewise, to further reduce the chance of causing performance regressions. (One could, however, work around the problem by raising wal_skip_threshold.) Kyotaro, if you agree, could you modify v24nm to implement that? Notable changes in v24nm: - Wrote section "Skipping WAL for New RelFileNode" in src/backend/access/transam/README to be the main source concerning the new coding rules. - Updated numerous comments and doc sections. - Eliminated the pendingSyncs list in favor of a "sync" field in pendingDeletes. I mostly did this to eliminate the possibility of the lists getting out of sync. This removed considerable parallel code for managing a second list at end-of-xact. We now call smgrDoPendingSyncs() only when committing or preparing a top-level transaction. - Whenever code sets an rd_*Subid field of a Relation, it must call EOXactListAdd(). swap_relation_files() was not doing so, so the field remained set during the next transaction. I introduced RelationAssumeNewRelfilenode() to handle both tasks, and I located the call so it also affects the mapped relation case. - In ALTER TABLE cases where TryReuseIndex() avoided an index rebuild, rd_createSubid remained set. (That happened before this patch, but it has been harmless.) I fixed this in heap_create(). - Made smgrDoPendingSyncs() stop exempting FSM_FORKNUM. A sync is necessary when checksums are enabled. Observe the precedent that RelationCopyStorage() has not been exempting FSM_FORKNUM. 
- Pass log_newpage_range() a "false" for page_std, for the same reason RelationCopyStorage() does. - log_newpage_range() ignored its forkNum and page_std arguments, so we logged the wrong data for non-main forks. Before this patch, callers always passed MAIN_FORKNUM and "true", hence the lack of complaints. - Restored table_finish_bulk_insert(), though heapam no longer provides a callback. The API is still well-defined, and other table AMs might have use for it. Removing it feels like a separate proposal. - Removed TABLE_INSERT_SKIP_WAL. Any out-of-tree code using it should revisit itself in light of this patch. - Fixed smgrDoPendingSyncs() to reinitialize total_blocks for each relation; it was overcounting. - Made us skip WAL after SET TABLESPACE, like we do after CLUSTER. - Moved the wal_skip_threshold docs from "Resource Consumption" -> "Disk" to "Write Ahead Log" -> "Settings", between similar settings
Re: [HACKERS] WAL logging problem in 9.4.3?
Thank you for looking at this.

At Tue, 5 Nov 2019 16:16:14 -0500, Robert Haas wrote in
> On Fri, Oct 25, 2019 at 9:21 AM Kyotaro Horiguchi
> wrote:
> > This is the fixed verison v22.
> First, I'd like to restate my understanding of the problem just to see
..
> Second, for anyone who is not following this thread closely but is

Thanks for restating the issue and summarizing this patch. All of the description matches my understanding.

> perform all of the relevant fsync() calls at commit time. This is
> further augmented with a mechanism that instead logs all the relation
> pages in lieu of fsync()ing if the relation is very small, on the
> theory that logging a few FPIs will be cheaper than an fsync(). I view
> this additional mechanism as perhaps a bit much for a bug fix patch,
> but I understand that the goal is to prevent a performance regression,
> and it's not really over the top, so I think it's probably OK.

Thanks. It would need some benchmarking, as mentioned upthread. My new machine now works steadily, so I will do that.

> sound. I think there are a number of places where the comments could
> be better; I'll include a few points on that further down. I also
> think that the code in swap_relation_files() which takes ExclusiveLock
> on the relations looks quite strange. It's hard to understand why it's
> needed at all, or why that lock level is used. On the flip side, I

Right. It *was* a mistaken spelling of AccessExclusiveLock. On second thought, the callers must already have taken locks on them at the level required for relfilenode swapping. However, one problematic case is the toast indexes of the target relation, which are not locked at all. Finally I used AccessShareLock, since other than NoLock it is the only choice that doesn't raise the lock level. Anyway, the toast relation is not accessible outside the session.
(Done in the attached.)

> think that the test suite looks really impressive and should be of
> considerable help not only in making sure that this is fixed but
> detecting if it gets broken again in the future. Perhaps it doesn't
> cover every scenario we care about, but if that turns out to be the
> case, it seems like it would be easily to further generalize. I really
> like the idea of this *kind* of test framework.

The paths running swap_relation_files() are not covered: CLUSTER, REFRESH MATERIALIZED VIEW and ALTER TABLE. CLUSTER and ALTER TABLE can interact with INSERTs, but MATVIEW cannot. Copying some of the existing test cases that use them will work. (Not yet done.)

> Comments on comments, and other nitpicking:
>
> - in-trasaction is mis-spelled in the doc patch. accidentially is
> mis-spelled in the 0002 patch.

Thanks. I found another couple of typos, "issueing" -> "issuing" and "skpped" -> "skipped", by ispell'ing the git diff output, and fixed them all.

> - I think the header comment for the new TAP test could do a far
> better job explaining the overall goal of this testing than it
> actually does.

I rewrote it...

> - I think somewhere in relcache.c or rel.h there ought to be comments
> explaining the precise degree to which rd_createSubid,
> rd_newRelfilenodeSubid, and rd_firstRelfilenodeSubid are reliable,
> including problem scenarios. This patch removes some language of this
> sort from CopyFrom(), which was a funny place to have that information
> in the first place, but I don't see that it adds anything to replace
> it. I also think that we ought to explain - for the fields that are
> reliable - that they need to be reliable precisely for the purpose of
> not breaking this stuff. There's a bit of this right now:
>
> + * rd_firstRelfilenodeSubid is the ID of the first subtransaction the
> + * relfilenode change has took place in the current transaction. Unlike
> + * newRelfilenodeSubid, this won't be accidentially forgotten.
A valid OID > + * means that the currently active relfilenode is transaction-local and we > + * sync the relation at commit instead of WAL-logging. > > ...but I think that needs to be somewhat expanded and clarified. Agreed. It would be crude but I put augmenting descriptions of how the variables work and descriptions in contrast to rd_first*. # rd_first* is not a hint in the sense that it is reliable but it is # mentioned as hint in some places, which will need fix. If the fix of MarkBufferDirtyHint is ok, I'll merge it into 0002. regards. -- Kyotaro Horiguchi NTT Open Source Software Center >From c5e7243ba05677ad84bc8b6b03077cadcaadf4b8 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi Date: Fri, 25 Oct 2019 13:07:41 +0900 Subject: [PATCH v23 1/5] TAP test for copy-truncation optimization. --- src/test/recovery/t/018_wal_optimize.pl | 321 1 file changed, 321 insertions(+) create mode 100644 src/test/recovery/t/018_wal_optimize.pl diff --git a/src/test/recovery/t/018_wal_optimize.pl b/src/test/recovery/t/018_wal_optimize.pl new file mode 100644 index 00..ac62c77a42 --- /dev/null +++ b/src/test/recovery/t/018_wal_optimize.pl @@ -0,0 +1,321 @@ +# Test recovery for skipping
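The distinction being documented here, rd_newRelfilenodeSubid being a forgettable hint while rd_firstRelfilenodeSubid must stay reliable, comes down to an update rule: the "first" field is set only once per transaction and survives later relfilenode changes. A standalone model of that rule (the field names mirror the relcache fields, but the code and the simplified skip_wal() predicate are mine; the real test also depends on wal_level and persistence):

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned subid_t;
#define INVALID_SUBID 0

struct relcache_entry
{
    subid_t rd_newRelfilenodeSubid;   /* subxact of latest change; a hint */
    subid_t rd_firstRelfilenodeSubid; /* subxact of first change; reliable */
};

/* Called whenever the relation is assigned a new relfilenode. */
static void
assume_new_relfilenode(struct relcache_entry *rel, subid_t subid)
{
    rel->rd_newRelfilenodeSubid = subid;
    if (rel->rd_firstRelfilenodeSubid == INVALID_SUBID)
        rel->rd_firstRelfilenodeSubid = subid;  /* set once, kept thereafter */
}

/* WAL can be skipped only if the current relfilenode is transaction-local. */
static bool
skip_wal(const struct relcache_entry *rel)
{
    return rel->rd_firstRelfilenodeSubid != INVALID_SUBID;
}
```

Because skip_wal() determines whether commit must sync the file, a stale or forgotten rd_firstRelfilenodeSubid would mean a missed sync and possible data loss, which is why it must never be "accidentally forgotten" the way the hint field may be.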
Re: [HACKERS] WAL logging problem in 9.4.3?
On Tue, Nov 05, 2019 at 04:16:14PM -0500, Robert Haas wrote:
> On Fri, Oct 25, 2019 at 9:21 AM Kyotaro Horiguchi wrote:
> > This is the fixed version v22.
>
> I'd like to offer a few thoughts on this thread and on these patches,
> which is now more than 4 years old and more than 150 messages in
> length. ...

Your understanding matches mine. Thanks for studying this. I had been feeling nervous about being the sole reviewer of the latest design.

> Comments on comments, and other nitpicking:

I started pre-commit editing on 2019-10-28, and comment+README updates have been the largest part of that. I'll check my edits against the things you list here, and I'll share on-list before committing.

I've now marked the CF entry Ready for Committer.
Re: [HACKERS] WAL logging problem in 9.4.3?
On Fri, Oct 25, 2019 at 9:21 AM Kyotaro Horiguchi wrote:
> This is the fixed version v22.

I'd like to offer a few thoughts on this thread and on these patches, which is now more than 4 years old and more than 150 messages in length.

First, I'd like to restate my understanding of the problem just to see whether I've got the right idea and whether we're all on the same page. When wal_level=minimal, we sometimes try to skip WAL logging on newly-created relations in favor of fsync-ing the relation at commit time. The idea is that if the transaction aborts or is aborted by a crash, the contents of the relation don't need to be reproduced because they are irrelevant, so no WAL is needed; if the transaction commits, we can't lose any data on a crash because we've already fsync'd; and standbys don't matter because wal_level=minimal precludes having any. However, we're not entirely consistent about skipping WAL-logging: some operations do and others don't, and this causes confusion if a crash occurs, because we might try to replay some of the things that happened to that relation but not all of them. For example, the original poster complained about a sequence of steps where an index truncation was logged but subsequent index insertions were not; a badly-timed crash will replay the truncation but can't replay the index insertions because they weren't logged in the first place; consequently, while the state was actually OK at the beginning of replay, it's no longer OK by the end. Replaying nothing would've been OK, but replaying some things and not others isn't.

Second, for anyone who is not following this thread closely but is interested in a summary, I'd like to summarize how I believe that the current patch proposes to solve the problem. As I understand it, the approach taken by the patch is to try to change things so that we log nothing at all for relations created or truncated in the current top-level transaction, and everything for others.
To achieve this, the patch makes a number of changes, three of which seem to me to be particularly key.

One, the patch changes the relcache infrastructure with the goal of making it possible to reliably identify whether a relation has been created or truncated in the current toplevel transaction; our current code does have tracking for this, but it's not 100% accurate.

Two, the patch changes the definition of RelationNeedsWAL() so that it not only checks that the relation is a permanent one, but also that either wal_level != minimal or the relation was not created in the current transaction. It seems to me that if RelationNeedsWAL() is used to gate every test for whether or not to write WAL pertaining to a particular relation, this ought to achieve the desired behavior of logging either everything or nothing. It is not quite clear to me how we can be sure that we use it in every relevant place.

Three, the patch replaces the various ad-hoc bits of code that fsync permanent relations after unlogged operations with a new tracking mechanism that arranges to perform all of the relevant fsync() calls at commit time. This is further augmented with a mechanism that instead logs all the relation pages in lieu of fsync()ing if the relation is very small, on the theory that logging a few FPIs will be cheaper than an fsync(). I view this additional mechanism as perhaps a bit much for a bug-fix patch, but I understand that the goal is to prevent a performance regression, and it's not really over the top, so I think it's probably OK.

Third, I'd like to offer a couple of general comments on the state of these patches. Broadly, I think they look pretty good. They seem quite well-engineered to me and as far as I can see the overall approach is sound. I think there are a number of places where the comments could be better; I'll include a few points on that further down.
I also think that the code in swap_relation_files() which takes ExclusiveLock on the relations looks quite strange. It's hard to understand why it's needed at all, or why that lock level is used.

On the flip side, I think that the test suite looks really impressive and should be of considerable help not only in making sure that this is fixed but in detecting if it gets broken again in the future. Perhaps it doesn't cover every scenario we care about, but if that turns out to be the case, it seems like it would be easy to further generalize. I really like the idea of this *kind* of test framework.

Comments on comments, and other nitpicking:

- in-trasaction is mis-spelled in the doc patch. accidentially is mis-spelled in the 0002 patch.

- I think the header comment for the new TAP test could do a far better job explaining the overall goal of this testing than it actually does.

- I think somewhere in relcache.c or rel.h there ought to be comments explaining the precise degree to which rd_createSubid, rd_newRelfilenodeSubid, and rd_firstRelfilenodeSubid are reliable, including problem scenarios.
Re: [HACKERS] WAL logging problem in 9.4.3?
On Fri, Oct 25, 2019 at 1:13 PM Kyotaro Horiguchi wrote:
> Hello. Thanks for the comment.
>
> # Sorry in advance for possibly breaking the thread.
>
> > MarkBufferDirtyHint() writes WAL even when rd_firstRelfilenodeSubid or
> > rd_createSubid is set; see attached test case. It needs to skip WAL
> > whenever RelationNeedsWAL() returns false.
>
> Thanks for pointing that out. And the test patch helped me very much.
>
> Most callers can tell that to the function, but SetHintBits() cannot
> easily. Rather, I think we shouldn't even try to do that. Instead, in
> the attached, MarkBufferDirtyHint() asks storage.c for the sync-pending
> state of the relfilenode for the buffer. In the attached patch (0003)
> RelFileNodeSkippingWAL loops over pendingSyncs, but it is called only
> at the time an FPW is added, so I believe it doesn't affect performance
> so much. However, we can use a hash for pendingSyncs instead of a
> linked list. Anyway the change is in its own file
> v21-0003-Fix-MarkBufferDirtyHint.patch, which will be merged into 0002.
>
> AFAICS all XLogInsert calls are guarded by RelationNeedsWAL() or are in
> non-wal_minimal code paths.
>
> > Cylinder and track sizes are obsolete as user-visible concepts. (They're not
> > constant for a given drive, and I think modern disks provide no way to read
> > the relevant parameters.) I like the name "wal_skip_threshold", and my second
>
> I strongly agree. Thanks for the draft. I used it as-is. I couldn't come
> up with an appropriate second description of the GUC so I just removed it.
>
> # it was "For rotating magnetic disks, it is around the size of a
> # track or sylinder."
>
> > the relevant parameters.) I like the name "wal_skip_threshold", and
> > my second choice would be "wal_skip_min_size". Possibly documented
> > as follows:
> ..
> > Any other opinions on the GUC name?
>
> I prefer the first candidate. I already used the terminology in
> storage.c and the name fits more to the context.
>
> > * We emit newpage WAL records for smaller size of relations.
> > *
> > * Small WAL records have a chance to be emitted at once along with
> > * other backends' WAL records. We emit WAL records instead of syncing
> > * for files that are smaller than a certain threshold expecting faster
> > - * commit. The threshold is defined by the GUC effective_io_block_size.
> > + * commit. The threshold is defined by the GUC wal_skip_threshold.
>
> It's wrong that it also skips changing flags.
> I'll fix it soon.

This is the fixed version v22. The attached are:

- v22-0001-TAP-test-for-copy-truncation-optimization.patch
  Same as v20, v21.
- v22-0002-Fix-WAL-skipping-feature.patch
  GUC name changed. Same as v21.
- v22-0003-Fix-MarkBufferDirtyHint.patch
  PoC of fixing the function; will be merged into 0002. (New in v21, fixed in v22)
- v22-0004-Documentation-for-wal_skip_threshold.patch
  GUC name and description changed. (Previous 0003, same as v21)
- v22-0005-Additional-test-for-new-GUC-setting.patch
  Including an adjusted version of wal-optimize-noah-tests-v3.patch.
  Maybe test names need further adjustment. (Previous 0004, same as v21)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

v22-0004-Documentation-for-wal_skip_threshold.patch Description: Binary data
v22-0001-TAP-test-for-copy-truncation-optimization.patch Description: Binary data
v22-0002-Fix-WAL-skipping-feature.patch Description: Binary data
v22-0005-Additional-test-for-new-GUC-setting.patch Description: Binary data
v22-0003-Fix-MarkBufferDirtyHint.patch Description: Binary data
Re: [HACKERS] WAL logging problem in 9.4.3?
Ugh!

On Fri, Oct 25, 2019 at 13:13, Kyotaro Horiguchi wrote:
> that. Instead, in the attached, MarkBufferDirtyHint() asks storage.c
> for sync-pending state of the relfilenode for the buffer. In the
> attached patch (0003)
> regards.

It's wrong that it also skips changing flags. I'll fix it soon.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
Re: [HACKERS] WAL logging problem in 9.4.3?
Hello. Thanks for the comment.

# Sorry in advance for possibly breaking the thread.

> MarkBufferDirtyHint() writes WAL even when rd_firstRelfilenodeSubid or
> rd_createSubid is set; see attached test case. It needs to skip WAL
> whenever RelationNeedsWAL() returns false.

Thanks for pointing that out. And the test patch helped me very much.

Most callers can tell that to the function, but SetHintBits() cannot easily. Rather, I think we shouldn't even try to do that. Instead, in the attached, MarkBufferDirtyHint() asks storage.c for the sync-pending state of the relfilenode for the buffer. In the attached patch (0003) RelFileNodeSkippingWAL loops over pendingSyncs, but it is called only at the time an FPW is added, so I believe it doesn't affect performance so much. However, we can use a hash for pendingSyncs instead of a linked list. Anyway the change is in its own file v21-0003-Fix-MarkBufferDirtyHint.patch, which will be merged into 0002.

AFAICS all XLogInsert calls are guarded by RelationNeedsWAL() or are in non-wal_minimal code paths.

> Cylinder and track sizes are obsolete as user-visible concepts. (They're not
> constant for a given drive, and I think modern disks provide no way to read
> the relevant parameters.) I like the name "wal_skip_threshold", and my second

I strongly agree. Thanks for the draft. I used it as-is. I couldn't come up with an appropriate second description of the GUC so I just removed it.

# it was "For rotating magnetic disks, it is around the size of a
# track or sylinder."

> the relevant parameters.) I like the name "wal_skip_threshold", and
> my second choice would be "wal_skip_min_size". Possibly documented
> as follows:
..
> Any other opinions on the GUC name?

I prefer the first candidate. I already used the terminology in storage.c and the name fits more to the context.

> * We emit newpage WAL records for smaller size of relations.
> *
> * Small WAL records have a chance to be emitted at once along with
> * other backends' WAL records.
> * We emit WAL records instead of syncing
> * for files that are smaller than a certain threshold expecting faster
> - * commit. The threshold is defined by the GUC effective_io_block_size.
> + * commit. The threshold is defined by the GUC wal_skip_threshold.

The attached are:

- v21-0001-TAP-test-for-copy-truncation-optimization.patch
  Same as v20.
- v21-0002-Fix-WAL-skipping-feature.patch
  GUC name changed.
- v21-0003-Fix-MarkBufferDirtyHint.patch
  PoC of fixing the function; will be merged into 0002. (New)
- v21-0004-Documentation-for-wal_skip_threshold.patch
  GUC name and description changed. (Previous 0003)
- v21-0005-Additional-test-for-new-GUC-setting.patch
  Including an adjusted version of wal-optimize-noah-tests-v3.patch.
  Maybe test names need further adjustment. (Previous 0004)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

>From 34149545942480d8dcc1cc587f40091b19b5aa39 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi
Date: Thu, 11 Oct 2018 10:03:21 +0900
Subject: [PATCH v21 1/5] TAP test for copy-truncation optimization.

---
 src/test/recovery/t/018_wal_optimize.pl | 312
 1 file changed, 312 insertions(+)
 create mode 100644 src/test/recovery/t/018_wal_optimize.pl

diff --git a/src/test/recovery/t/018_wal_optimize.pl b/src/test/recovery/t/018_wal_optimize.pl
new file mode 100644
index 00..b041121745
--- /dev/null
+++ b/src/test/recovery/t/018_wal_optimize.pl
@@ -0,0 +1,312 @@
+# Test WAL replay for optimized TRUNCATE and COPY records
+#
+# WAL truncation is optimized in some cases with TRUNCATE and COPY queries
+# which sometimes interact badly with the other optimizations in line with
+# several setting values of wal_level, particularly when using "minimal" or
+# "replica". The optimization may be enabled or disabled depending on the
+# scenarios dealt here, and should never result in any type of failures or
+# data loss.
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 26;
+
+sub check_orphan_relfilenodes
+{
+	my ($node, $test_name) = @_;
+
+	my $db_oid = $node->safe_psql('postgres',
+		"SELECT oid FROM pg_database WHERE datname = 'postgres'");
+	my $prefix = "base/$db_oid/";
+	my $filepaths_referenced = $node->safe_psql('postgres', "
+		SELECT pg_relation_filepath(oid) FROM pg_class
+		WHERE reltablespace = 0 and relpersistence <> 't' and
+		pg_relation_filepath(oid) IS NOT NULL;");
+	is_deeply([sort(map { "$prefix$_" }
+					grep(/^[0-9]+$/,
+						 slurp_dir($node->data_dir . "/$prefix")))],
+			  [sort split /\n/, $filepaths_referenced],
+			  $test_name);
+	return;
+}
+
+# Wrapper routine tunable for wal_level.
+sub run_wal_optimize
+{
+	my $wal_level = shift;
+
+	# Primary needs to have wal_level = minimal here
+	my $node = get_new_node("node_$wal_level");
+	$node->init;
+	$node->append_conf('postgresql.conf', qq(
+wal_level = $wal_level
+max_prepared_transactions = 1
+));
+	$node->start;
+
+	# Setup
+	my $tablespace_dir =
Re: [HACKERS] WAL logging problem in 9.4.3?
[Casual readers with opinions on GUC naming: consider skipping to the end.] MarkBufferDirtyHint() writes WAL even when rd_firstRelfilenodeSubid or rd_createSubid is set; see attached test case. It needs to skip WAL whenever RelationNeedsWAL() returns false. On Tue, Aug 27, 2019 at 03:49:32PM +0900, Kyotaro Horiguchi wrote: > At Sun, 25 Aug 2019 22:08:43 -0700, Noah Misch wrote in > <20190826050843.gb3153...@rfd.leadboat.com> > > Consider a one-page relfilenode. Doing all the things you list for a single > > page may be cheaper than locking millions of buffer headers. > > If I understand you correctly, I would say that *all* buffers > that don't belong to in-transaction-created files are skipped > before taking locks. No lock conflict happens with other > backends. > > FlushRelationBuffers uses double-checked-locking as follows: I had misread the code; you're right. > > This should be GUC-controlled, especially since this is back-patch material. > > Is this size of patch back-patchable? Its size is not an obstacle. It's not ideal to back-patch such a user-visible performance change, but it would be worse to leave back branches able to corrupt data during recovery. On Wed, Aug 28, 2019 at 03:42:10PM +0900, Kyotaro Horiguchi wrote: > - Use log_newpage instead of fsync for small tables. > I'm trying to measure performance difference on WAL/fsync. I would measure it with simultaneous pgbench instances: 1. DDL pgbench instance repeatedly creates and drops a table of X kilobytes, using --rate to make this happen a fixed number of times per second. 2. Regular pgbench instance runs the built-in script at maximum qps. For each X, try one test run with effective_io_block_size = X-1 and one with effective_io_block_size = X. If the regular pgbench instance gets materially higher qps with effective_io_block_size = X-1, the ideal default is =X. 
> + xreflabel="effective_io_block_size">
> + effective_io_block_size (integer)
> +
> + effective_io_block_size configuration parameter
> +
> +
> +
> + Specifies the expected maximum size of a file for which
> + fsync returns in the minimum required duration. It is
> + approximately the size of a track or sylinder for magnetic disks.
> + The value is specified in kilobytes and the default is
> + 64 kilobytes.
> +
> +
> + When is minimal,
> + WAL-logging is skipped for tables created in-trasaction. If a table
> + is smaller than that size at commit, it is WAL-logged instead of
> + issueing fsync on it.
> +
> +
> +

Cylinder and track sizes are obsolete as user-visible concepts. (They're not constant for a given drive, and I think modern disks provide no way to read the relevant parameters.) I like the name "wal_skip_threshold", and my second choice would be "wal_skip_min_size". Possibly documented as follows:

    When wal_level is minimal and a transaction commits after creating or
    rewriting a permanent table, materialized view, or index, this setting
    determines how to persist the new data. If the data is smaller than this
    setting, write it to the WAL log; otherwise, use an fsync of the data
    file. Depending on the properties of your storage, raising or lowering
    this value might help if such commits are slowing concurrent
    transactions. The default is 64 kilobytes (64kB).

Any other opinions on the GUC name?
diff --git a/src/test/recovery/t/018_wal_optimize.pl b/src/test/recovery/t/018_wal_optimize.pl
index 95063ab..5d476a4 100644
--- a/src/test/recovery/t/018_wal_optimize.pl
+++ b/src/test/recovery/t/018_wal_optimize.pl
@@ -11,7 +11,7 @@ use warnings;
 use PostgresNode;
 use TestLib;
-use Test::More tests => 28;
+use Test::More tests => 32;
 
 sub check_orphan_relfilenodes
 {
@@ -43,6 +43,8 @@ sub run_wal_optimize
 	$node->append_conf('postgresql.conf', qq(
 wal_level = $wal_level
 max_prepared_transactions = 1
+wal_log_hints = on
+effective_io_block_size = 0
 ));
 	$node->start;
 
@@ -194,6 +196,24 @@ max_prepared_transactions = 1
 	is($result, qq(3),
 	   "wal_level = $wal_level, SET TABLESPACE in subtransaction");
 
+	$node->safe_psql('postgres', "
+		BEGIN;
+		CREATE TABLE test3a5 (c int PRIMARY KEY);
+		SAVEPOINT q; INSERT INTO test3a5 VALUES (1); ROLLBACK TO q;
+		CHECKPOINT;
+		INSERT INTO test3a5 VALUES (1);  -- set index hint bit
+		INSERT INTO test3a5 VALUES (2);
+		COMMIT;");
+	$node->stop('immediate');
+	$node->start;
+	$result = $node->psql('postgres', );
+	my ($ret, $stdout, $stderr) = $node->psql(
+		'postgres', "INSERT INTO test3a5 VALUES (2);");
+	is($ret, qq(3),
+	   "wal_level = $wal_level, unique index LP_DEAD");
+	like($stderr, qr/violates unique/,
+		 "wal_level = $wal_level, unique index LP_DEAD message");
+
 	# UPDATE touches two buffers; one is BufferNeedsWAL(); the other is not.
 	$node->safe_psql('postgres', "
 		BEGIN;
Re: [HACKERS] WAL logging problem in 9.4.3?
On Mon, Sep 02, 2019 at 05:15:00PM -0400, Alvaro Herrera wrote:
> I have updated this patch's status to "needs review", since v20 has not
> received any comments yet.
>
> Noah, you're listed as committer for this patch. Are you still on the
> hook for getting it done during the v13 timeframe?

Yes, assuming "getting it done" = "getting the CF entry to a state other than Needs Review".
Re: [HACKERS] WAL logging problem in 9.4.3?
I have updated this patch's status to "needs review", since v20 has not received any comments yet.

Noah, you're listed as committer for this patch. Are you still on the hook for getting it done during the v13 timeframe?

-- 
Álvaro Herrera    https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] WAL logging problem in 9.4.3?
Hello, Noah.

At Tue, 27 Aug 2019 15:49:32 +0900 (Tokyo Standard Time), Kyotaro Horiguchi wrote in <20190827.154932.250364935.horikyota@gmail.com>
> I'm not sure whether the knob shows apparent performance gain and
> whether we can offer the criteria to identify the proper
> value. But I'll add this feature with a GUC
> effective_io_block_size defaults to 64kB as the threshold in the
> next version. (The name and default value are arguable, of course.)

This is a new version of the patch based on the discussion. The differences from v19 are as follows.

- Removed the new stuff in twophase.c. The action on PREPARE TRANSACTION is now taken in PrepareTransaction(). Instead of storing pending syncs in two-phase files, the function immediately syncs all files that can survive the transaction end. (twophase.c, xact.c)

- Separated pendingSyncs from pendingDeletes. pendingSyncs gets handled differently from pendingDeletes, so it is separated.

- Let smgrDoPendingSyncs() avoid performing fsync on to-be-deleted files. In previous versions the function synced all recorded files even if they were being deleted. Since we use WAL-logging as the alternative to fsync now, performance matters more than before. Thus this version avoids useless fsyncs.

- Use log_newpage instead of fsync for small tables. As in the discussion up-thread, I think I understand how WAL-logging can work better than fsync. smgrDoPendingSyncs issues log_newpage for all blocks in a table smaller than the GUC variable "effective_io_block_size". I found log_newpage_range(), which does exactly what is needed here, but it requires a Relation, which is not available there. I removed an assertion in CreateFakeRelcacheEntry so that it works in non-recovery mode.

- Rebased and fixed some bugs.

I'm trying to measure the performance difference on WAL/fsync.
By the way, smgrDoPendingDeletes is called from CommitTransaction and AbortTransaction directly, and from AbortSubTransaction via AtSubAbort_smgr(), which calls only smgrDoPendingDeletes() and is called only from AbortSubTransaction. I think these should be unified either way. Any opinions?

CommitTransaction()
  + smgrDoPendingDeletes()
AbortTransaction()
  + smgrDoPendingDeletes()
AbortSubTransaction()
  AtSubAbort_smgr()
    + smgrDoPendingDeletes()

# Looking around, the prefixes AtEOXact/PreCommit/AtAbort don't
# seem to be used according to any consistent principle.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

>From 83deb772808cdd3afdb44a7630656cc827adfe33 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi
Date: Thu, 11 Oct 2018 10:03:21 +0900
Subject: [PATCH 1/4] TAP test for copy-truncation optimization.

---
 src/test/recovery/t/018_wal_optimize.pl | 312
 1 file changed, 312 insertions(+)
 create mode 100644 src/test/recovery/t/018_wal_optimize.pl

diff --git a/src/test/recovery/t/018_wal_optimize.pl b/src/test/recovery/t/018_wal_optimize.pl
new file mode 100644
index 00..b041121745
--- /dev/null
+++ b/src/test/recovery/t/018_wal_optimize.pl
@@ -0,0 +1,312 @@
+# Test WAL replay for optimized TRUNCATE and COPY records
+#
+# WAL truncation is optimized in some cases with TRUNCATE and COPY queries
+# which sometimes interact badly with the other optimizations in line with
+# several setting values of wal_level, particularly when using "minimal" or
+# "replica". The optimization may be enabled or disabled depending on the
+# scenarios dealt here, and should never result in any type of failures or
+# data loss.
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 26;
+
+sub check_orphan_relfilenodes
+{
+	my ($node, $test_name) = @_;
+
+	my $db_oid = $node->safe_psql('postgres',
+		"SELECT oid FROM pg_database WHERE datname = 'postgres'");
+	my $prefix = "base/$db_oid/";
+	my $filepaths_referenced = $node->safe_psql('postgres', "
+		SELECT pg_relation_filepath(oid) FROM pg_class
+		WHERE reltablespace = 0 and relpersistence <> 't' and
+		pg_relation_filepath(oid) IS NOT NULL;");
+	is_deeply([sort(map { "$prefix$_" }
+					grep(/^[0-9]+$/,
+						 slurp_dir($node->data_dir . "/$prefix")))],
+			  [sort split /\n/, $filepaths_referenced],
+			  $test_name);
+	return;
+}
+
+# Wrapper routine tunable for wal_level.
+sub run_wal_optimize
+{
+	my $wal_level = shift;
+
+	# Primary needs to have wal_level = minimal here
+	my $node = get_new_node("node_$wal_level");
+	$node->init;
+	$node->append_conf('postgresql.conf', qq(
+wal_level = $wal_level
+max_prepared_transactions = 1
+));
+	$node->start;
+
+	# Setup
+	my $tablespace_dir = $node->basedir . '/tablespace_other';
+	mkdir ($tablespace_dir);
+	$tablespace_dir = TestLib::perl2host($tablespace_dir);
+	$node->safe_psql('postgres',
+		"CREATE TABLESPACE other LOCATION '$tablespace_dir';");
+
+	# Test direct truncation optimization. No tuples
+	$node->safe_psql('postgres', "
+		BEGIN;
+		CREATE TABLE test1 (id serial
Re: [HACKERS] WAL logging problem in 9.4.3?
Hello.

At Sun, 25 Aug 2019 22:08:43 -0700, Noah Misch wrote in <20190826050843.gb3153...@rfd.leadboat.com>
> On Thu, Aug 22, 2019 at 09:06:06PM +0900, Kyotaro Horiguchi wrote:
> > At Mon, 19 Aug 2019 23:03:14 -0700, Noah Misch wrote in <20190820060314.ga3086...@rfd.leadboat.com>
> > > On Mon, Aug 19, 2019 at 06:59:59PM +0900, Kyotaro Horiguchi wrote:
> > > > I'm not sure the point of the behavior. I suppose that the "log"
> > > > is a sequence of new_page records. It also needs to be synced and
> > > > it is always larger than the file to be synced. I can't think of
> > > > an appropriate threshold without the point.
> > >
> > > Yes, it would be a sequence of new-page records. FlushRelationBuffers() locks
> > > every buffer header containing a buffer of the current database. The belief
> > > has been that writing one page to xlog is cheaper than FlushRelationBuffers()
> > > in a busy system with large shared_buffers.
> >
> > I'm at a loss.. The decision between WAL and sync is made at
> > commit time, when we no longer have a pin on a buffer. When
> > emitting WAL, opposite to the assumption, a lock needs to be
> > re-acquired for every page to emit log_newpage. What is worse,
> > we may need to reload evicted buffers. If the file has been
> > CopyFrom'ed, the ring buffer strategy makes the situation even
> > worse. That doesn't seem cheap at all..
>
> Consider a one-page relfilenode. Doing all the things you list for a single
> page may be cheaper than locking millions of buffer headers.

If I understand you correctly, I would say that *all* buffers that don't belong to in-transaction-created files are skipped before taking locks. No lock conflict happens with other backends. FlushRelationBuffers uses double-checked locking as follows:

FlushRelationBuffers_common():
  ..
  if (!islocal)
  {
    for (i for all buffers)
    {
      if (RelFileNodeEquals(bufHdr->tag.rnode, rnode))
      {
        LockBufHdr(bufHdr);
        if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) && valid && dirty)
        {
          PinBuffer_Locked(bufHdr);
          LWLockAcquire();
          FlushBuffer();

128GB of shared buffers contains 16M buffers. On my perhaps-Windows-Vista-era box, such a loop takes 15ms. (Since it has only 6GB, the test ignores the effect of cache that comes from the difference of the buffer size.) (attached 1)

With WAL-emitting we find every buffer of the file using the buffer hash, so we suffer partition locks instead of the 15ms of local latency. That seems worse.

> > If there were any chance on WAL for smaller files here, it would
> > be on the files smaller than the ring size of bulk-write
> > strategy (16MB).
>
> Like you, I expect the optimal threshold is less than 16MB, though you should
> benchmark to see. Under the ideal threshold, when a transaction creates a new
> relfilenode just smaller than the threshold, that transaction will be somewhat
> slower than it would be if the threshold were zero. Locking every buffer

I looked closer at this. For a 16MB file, the cost of write-fsyncing is almost the same as that of WAL-emitting. It was about 200 ms on the Vista-era machine with non-performant rotating magnetic disks with xfs. (attached 2, 3) Although write-fsyncing of a relation file makes no lock conflict with other backends, WAL-emitting delays other backends' commits at most by that many milliseconds.

In summary, the characteristics of the two methods on a 16MB file are as follows.

File write:
- 15ms of buffer scan without locks (@128GB shared buffers)
+ no hash search for a buffer
= take locks on all buffers only of the file, one by one (to write)
+ plus 200ms of write-fdatasync (of the whole relation file), which doesn't conflict with other backends (except via CPU time slots and IO bandwidth)
WAL write:
+ no buffer scan
- 2048 (16MB/8kB) partition lock acquisitions on finding every buffer for the target file, which can conflict with other backends
= take locks on all buffers only of the file, one by one (to take FPWs)
- plus 200ms of open(create)-write-fdatasync (of a WAL file of default size), which can delay commits on other backends at most by that duration

> header causes a distributed slow-down for other queries, and protecting the
> latency of non-DDL queries is typically more useful than accelerating
> TRUNCATE, CREATE TABLE, etc. Writing more WAL also slows down other queries;
> beyond a certain relfilenode size, the extra WAL harms non-DDL queries more
> than the buffer scan harms them. That's about where the threshold should be.

If the discussion above is correct, we shouldn't use WAL-write even for files around 16MB. For smaller shared_buffers and file sizes, the delays are:

Scanning all buffers takes:
  15 ms for 128GB shared_buffers
  4.5ms for 32GB shared_buffers

fdatasync takes:
  200 ms for 16MB/sync
  51 ms for 1MB/sync
  46 ms for 512kB/sync
  40 ms for 256kB/sync
  37 ms for 128kB/sync
  35 ms for
Re: [HACKERS] WAL logging problem in 9.4.3?
At Tue, 27 Aug 2019 15:49:32 +0900 (Tokyo Standard Time), Kyotaro Horiguchi wrote in <20190827.154932.250364935.horikyota@gmail.com>
> 128GB shared buffers contain 16M buffers. On my
> perhaps-Windows-Vista-era box, such loop takes 15ms. (Since it
> has only 6GB, the test is ignoring the effect of cache that comes
> from the difference of the buffer size). (attached 1)
...
> For a 16MB file, the cost of write-fsyncing cost is almost the
> same to that of WAL-emitting cost. It was about 200 ms on the
> Vista-era machine with non-performant rotating magnetic disks
> with xfs. (attached 2, 3) Although write-fsyncing of relation
> file makes no lock conflict with other backends, WAL-emitting
> delays other backends' commits at most by that milliseconds.

FWIW, the attached are the programs I used to take the numbers.

testloop.c: takes the time to loop over buffers as in FlushRelationBuffers
testfile.c: takes the time to sync a heap file (one file per size)
testfile2.c: takes the time to emit a WAL record (16MB per file)

regards.
-- 
Kyotaro Horiguchi
NTT Open Source Software Center

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

typedef struct RelFileNode
{
    unsigned int spc;
    unsigned int db;
    unsigned int rel;
} RelFileNode;

typedef struct Buffer
{
    RelFileNode rnode;
} Buffer;

//#define NBUFFERS ((int)((128.0 * 1024 * 1024 * 1024) / (8.0 * 1024)))
#define NBUFFERS ((int)((32.0 * 1024 * 1024 * 1024) / (8.0 * 1024)))

int
main(void)
{
    int i;
    RelFileNode t = {1, 2, 3};
    Buffer *bufs = (Buffer *) malloc(sizeof(Buffer) * NBUFFERS);
    struct timeval st, ed;
    int matches = 0, unmatches = 0;
    Buffer *b;

    for (i = 0 ; i < NBUFFERS ; i++)
    {
        bufs[i].rnode.spc = random() * 100;
        bufs[i].rnode.db = random() * 100;
        bufs[i].rnode.rel = random() * 1;
    }

    /* start measuring */
    gettimeofday(&st, NULL);
    b = bufs;
    for (i = 0 ; i < NBUFFERS ; i++)
    {
        if (b->rnode.spc == t.spc && b->rnode.db == t.db &&
            b->rnode.rel == t.rel)
            matches++;
        else
            unmatches++;
        b++;
    }
    gettimeofday(&ed, NULL);

    printf("%lf ms for %d loops, matches %d, unmatches %d\n",
           (double)((ed.tv_sec - st.tv_sec) * 1000.0 +
                    (ed.tv_usec - st.tv_usec) / 1000.0),
           i, matches, unmatches);
    return 0;
}

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/time.h>

//#define FILE_SIZE (16 * 1024 * 1024)
//#define LOOPS 100
#define FILE_SIZE (64 * 1024)
#define LOOPS 1000
//#define FILE_SIZE (8 * 1024)
//#define LOOPS 1000
//#define FILE_SIZE (1 * 1024)
//#define LOOPS 1000
//#define FILE_SIZE (512)
//#define LOOPS 1000
//#define FILE_SIZE (128)
//#define LOOPS 1000

char buf[FILE_SIZE];
char fname[256];

int
main(void)
{
    int i, j;
    int fd = -1;
    struct timeval st, ed;
    double accum = 0.0;
    int bufperfile = (int)((16.0 * 1024 * 1024) / FILE_SIZE);

    for (i = 0 ; i < LOOPS ; i++)
    {
        snprintf(fname, 256, "test%03d.file", i);
        unlink(fname); /* ignore errors */
    }

    for (i = 0 ; i < LOOPS ; i++)
    {
        for (j = 0 ; j < FILE_SIZE ; j++)
            buf[j] = random() * 256;

        if (i % bufperfile == 0)
        {
            if (fd >= 0)
                close(fd);
            snprintf(fname, 256, "test%03d.file", i / bufperfile);
            fd = open(fname, O_CREAT | O_RDWR, 0644);
            if (fd < 0)
            {
                fprintf(stderr, "open error: %m\n");
                exit(1);
            }
            memset(buf, 0, sizeof(buf));
            if (write(fd, buf, sizeof(buf)) < 0)
            {
                fprintf(stderr, "init write error: %m\n");
                exit(1);
            }
            if (fsync(fd) < 0)
            {
                fprintf(stderr, "init fsync error: %m\n");
                exit(1);
            }
            if (lseek(fd, 0, SEEK_SET) < 0)
            {
                fprintf(stderr, "init lseek error: %m\n");
                exit(1);
            }
        }

        gettimeofday(&st, NULL);
        if (write(fd, buf, FILE_SIZE) < 0)
        {
            fprintf(stderr, "write error: %m\n");
            exit(1);
        }
        if (fdatasync(fd) < 0)
        {
            fprintf(stderr, "fdatasync error: %m\n");
            exit(1);
        }
        gettimeofday(&ed, NULL);
        accum += (double)((ed.tv_sec - st.tv_sec) * 1000.0 +
                          (ed.tv_usec - st.tv_usec) / 1000.0);
    }

    printf("%.2lf ms for %d %dkB-records (%d MB), %.2lf ms per %dkB)\n",
           accum, i, FILE_SIZE / 1024, i * FILE_SIZE, accum / i,
           FILE_SIZE / 1024);
    return 0;
}

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/time.h>

//#define FILE_SIZE (16 * 1024 * 1024)
//#define LOOPS 100
//#define FILE_SIZE (8 * 1024)
//#define LOOPS 1000
//#define FILE_SIZE (1 * 1024)
//#define LOOPS 1000
//#define FILE_SIZE (512)
//#define LOOPS 1000
#define FILE_SIZE (128)
#define LOOPS 1000

char buf[FILE_SIZE];

int
main(void)
{
    int i;
    int fd = -1;
    double accum = 0.0;
    struct timeval st, ed;

    for (i = 0 ; i <
Re: [HACKERS] WAL logging problem in 9.4.3?
On Thu, Aug 22, 2019 at 09:06:06PM +0900, Kyotaro Horiguchi wrote: > At Mon, 19 Aug 2019 23:03:14 -0700, Noah Misch wrote in > <20190820060314.ga3086...@rfd.leadboat.com> > > On Mon, Aug 19, 2019 at 06:59:59PM +0900, Kyotaro Horiguchi wrote: > > > At Sat, 17 Aug 2019 20:52:30 -0700, Noah Misch wrote > > > in <20190818035230.gb3021...@rfd.leadboat.com> > > > > The https://postgr.es/m/559fa0ba.3080...@iki.fi design had another > > > > component > > > > not appearing here. It said, "Instead, at COMMIT, we'd fsync() the > > > > relation, > > > > or if it's smaller than some threshold, WAL-log the contents of the > > > > whole file > > > > at that point." Please write the part to WAL-log the contents of small > > > > files > > > > instead of syncing them. > > > > > > I'm not sure the point of the behavior. I suppose that the "log" > > > is a sequence of new_page records. It also needs to be synced and > > > it is always larger than the file to be synced. I can't think of > > > an appropriate threshold without the point. > > > > Yes, it would be a sequence of new-page records. FlushRelationBuffers() > > locks > > every buffer header containing a buffer of the current database. The belief > > has been that writing one page to xlog is cheaper than > > FlushRelationBuffers() > > in a busy system with large shared_buffers. > > I'm at a loss.. The decision between WAL and sync is made at > commit time, when we no longer have a pin on a buffer. When > emitting WAL, opposite to the assumption, lock needs to be > re-acquired for every page to emit log_new_page. What is worse, > we may need to reload evicted buffers. If the file has been > CopyFrom'ed, ring buffer strategy makes the situnation farther > worse. That doesn't seem cheap at all.. Consider a one-page relfilenode. Doing all the things you list for a single page may be cheaper than locking millions of buffer headers. 
> If there were any chance on WAL for smaller files here, it would > be on the files smaller than the ring size of bulk-write > strategy(16MB). Like you, I expect the optimal threshold is less than 16MB, though you should benchmark to see. Under the ideal threshold, when a transaction creates a new relfilenode just smaller than the threshold, that transaction will be somewhat slower than it would be if the threshold were zero. Locking every buffer header causes a distributed slow-down for other queries, and protecting the latency of non-DDL queries is typically more useful than accelerating TRUNCATE, CREATE TABLE, etc. Writing more WAL also slows down other queries; beyond a certain relfilenode size, the extra WAL harms non-DDL queries more than the buffer scan harms them. That's about where the threshold should be. This should be GUC-controlled, especially since this is back-patch material. We won't necessarily pick the best value on the first attempt, and the best value could depend on factors like the filesystem, the storage hardware, and the database's latency goals. One could define the GUC as an absolute size (e.g. 1MB) or as a ratio of shared_buffers (e.g. GUC value of 0.001 means the threshold is 1MB when shared_buffers is 1GB). I'm not sure which is better.
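The commit-time decision described above, with a GUC controlling the cutoff, would reduce to something like the following sketch. The names wal_skip_threshold and wal_log_at_commit, and the 64kB default, are illustrative assumptions here, not code from the patch:

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical GUC mirroring the proposal above: relfilenodes no larger
 * than this are WAL-logged (as new-page records) at commit; larger ones
 * are fsync'ed instead.  64kB is a placeholder, not a benchmarked value.
 */
static size_t wal_skip_threshold = 64 * 1024;

/* Decide how to make a WAL-skipped relfilenode durable at commit. */
static bool
wal_log_at_commit(size_t relfilenode_size)
{
	/*
	 * Small file: emitting a few new-page records is cheaper than
	 * scanning (and locking) every buffer header in shared_buffers.
	 */
	return relfilenode_size <= wal_skip_threshold;
}
```

A ratio-of-shared_buffers GUC would differ only in how wal_skip_threshold is computed before the comparison.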
Re: [HACKERS] WAL logging problem in 9.4.3?
Hello. At Mon, 19 Aug 2019 23:03:14 -0700, Noah Misch wrote in <20190820060314.ga3086...@rfd.leadboat.com> > On Mon, Aug 19, 2019 at 06:59:59PM +0900, Kyotaro Horiguchi wrote: > > At Sat, 17 Aug 2019 20:52:30 -0700, Noah Misch wrote in > > <20190818035230.gb3021...@rfd.leadboat.com> > > > For two-phase commit, PrepareTransaction() needs to execute pending syncs. > > > > Now TwoPhaseFileHeader has two new members for (commit-time) > > pending syncs. Pending-syncs are useless on wal-replay, but that > > is needed for commit-prepared. > > There's no need to modify TwoPhaseFileHeader or the COMMIT PREPARED sql > command, which is far too late to be syncing new relation files. (A crash may > have already destroyed their data.) PrepareTransaction(), which implements > the PREPARE TRANSACTION command, is the right place for these syncs. > > A failure in these new syncs needs to prevent the transaction from being > marked committed. Hence, in CommitTransaction(), these new syncs need to Agreed. > happen after the last step that could create assign a new relfilenode and > before RecordTransactionCommit(). I suspect it's best to do it after > PreCommit_on_commit_actions() and before AtEOXact_LargeObject(). I don't find an obvious problem there. Since pending deletes and pending syncs are separately processed, I'm planning to make a separate list for syncs from deletes. > > > This should sync all forks, not just MAIN_FORKNUM. Code that writes WAL > > > for > > > FSM_FORKNUM and VISIBILITYMAP_FORKNUM checks RelationNeedsWAL(). There > > > may be > > > no bug today, but it's conceptually wrong to make RelationNeedsWAL() > > > return > > > false due to this code, use RelationNeedsWAL() for multiple forks, and > > > then > > > not actually sync all forks. > > > > I agree that all forks needs syncing, but FSM and VM are checking > > RelationNeedsWAL(modified). 
To make sure, are you suggesting to > > sync all forks instead of emitting WAL for them, or suggesting > > that VM and FSM to emit WALs even when the modified > > RelationNeedsWAL returns false (+ sync all forks)? > > I hadn't thought that far. What do you think is best? As in the latest patch, sync ALL forks then no WALs. We could skip syncing FSM but I'm not sure it's worth doing. > > > The https://postgr.es/m/559fa0ba.3080...@iki.fi design had another > > > component > > > not appearing here. It said, "Instead, at COMMIT, we'd fsync() the > > > relation, > > > or if it's smaller than some threshold, WAL-log the contents of the whole > > > file > > > at that point." Please write the part to WAL-log the contents of small > > > files > > > instead of syncing them. > > > > I'm not sure the point of the behavior. I suppose that the "log" > > is a sequence of new_page records. It also needs to be synced and > > it is always larger than the file to be synced. I can't think of > > an appropriate threshold without the point. > > Yes, it would be a sequence of new-page records. FlushRelationBuffers() locks > every buffer header containing a buffer of the current database. The belief > has been that writing one page to xlog is cheaper than FlushRelationBuffers() > in a busy system with large shared_buffers. I'm at a loss.. The decision between WAL and sync is made at commit time, when we no longer have a pin on a buffer. When emitting WAL, opposite to the assumption, the lock needs to be re-acquired for every page to emit log_new_page. What is worse, we may need to reload evicted buffers. If the file has been CopyFrom'ed, the ring buffer strategy makes the situation even worse. That doesn't seem cheap at all.. If there were any chance on WAL for smaller files here, it would be on the files smaller than the ring size of the bulk-write strategy (16MB). 
If we pick up every buffer page of the file instead of scanning through all buffers, that makes things worse by conflicts on partition locks. Any thoughts? # Sorry time's up today. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
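The "sync ALL forks" direction discussed in this exchange amounts to a loop over fork numbers. In the sketch below the smgr calls are stubbed so it stands alone; it is an approximation of the idea, not the committed PostgreSQL code:

```c
#include <stdbool.h>

/* Fork numbers as in src/include/common/relpath.h */
typedef enum ForkNumber
{
	MAIN_FORKNUM = 0,
	FSM_FORKNUM,
	VISIBILITYMAP_FORKNUM,
	INIT_FORKNUM
} ForkNumber;
#define MAX_FORKNUM INIT_FORKNUM

/* Stubs standing in for the real smgrexists()/smgrimmedsync(). */
static bool fork_present[MAX_FORKNUM + 1] = {true, true, false, false};
static bool fork_synced[MAX_FORKNUM + 1];

static bool smgrexists_stub(ForkNumber forknum) { return fork_present[forknum]; }
static void smgrimmedsync_stub(ForkNumber forknum) { fork_synced[forknum] = true; }

/*
 * At commit, sync every fork that exists, not just MAIN_FORKNUM:
 * FSM and VM writes also skipped WAL under the optimization.
 */
static void
sync_pending_rel(void)
{
	for (int forknum = 0; forknum <= MAX_FORKNUM; forknum++)
		if (smgrexists_stub((ForkNumber) forknum))
			smgrimmedsync_stub((ForkNumber) forknum);
}
```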
Re: [HACKERS] WAL logging problem in 9.4.3?
On Wed, Aug 21, 2019 at 04:32:38PM +0900, Kyotaro Horiguchi wrote: > At Mon, 19 Aug 2019 18:59:59 +0900 (Tokyo Standard Time), Kyotaro Horiguchi > wrote in > <20190819.185959.118543656.horikyota@gmail.com> > > At Sat, 17 Aug 2019 20:52:30 -0700, Noah Misch wrote in > > <20190818035230.gb3021...@rfd.leadboat.com> > > > For two-phase commit, PrepareTransaction() needs to execute pending syncs. > > Now TwoPhaseFileHeader has two new members for pending syncs. It > is useless on wal-replay, but that is needed for commit-prepared. Syncs need to happen in PrepareTransaction(), not in commit-prepared. I wrote about that in https://postgr.es/m/20190820060314.ga3086...@rfd.leadboat.com
Re: [HACKERS] WAL logging problem in 9.4.3?
Hello. New version is attached. At Mon, 19 Aug 2019 18:59:59 +0900 (Tokyo Standard Time), Kyotaro Horiguchi wrote in <20190819.185959.118543656.horikyota@gmail.com> > Thank you for taking time. > > At Sat, 17 Aug 2019 20:52:30 -0700, Noah Misch wrote in > <20190818035230.gb3021...@rfd.leadboat.com> > > For two-phase commit, PrepareTransaction() needs to execute pending syncs. Now TwoPhaseFileHeader has two new members for pending syncs. It is useless on wal-replay, but that is needed for commit-prepared. > > On Thu, Jul 25, 2019 at 10:39:36AM +0900, Kyotaro Horiguchi wrote: > > > --- a/src/backend/access/heap/heapam_handler.c > > > +++ b/src/backend/access/heap/heapam_handler.c > > > @@ -715,12 +702,6 @@ heapam_relation_copy_for_cluster(Relation OldHeap, > > > Relation NewHeap, > ... > > > - use_wal = XLogIsNeeded() && RelationNeedsWAL(NewHeap); > > > - > > > /* use_wal off requires smgr_targblock be initially invalid */ > > > Assert(RelationGetTargetBlock(NewHeap) == InvalidBlockNumber); > > > > Since you're deleting the use_wal variable, update that last comment. Oops! Rewrote it. > > > --- a/src/backend/catalog/storage.c > > > +++ b/src/backend/catalog/storage.c > > > @@ -428,21 +450,34 @@ smgrDoPendingDeletes(bool isCommit) > ... > > > + smgrimmedsync(srel, MAIN_FORKNUM); > > > > This should sync all forks, not just MAIN_FORKNUM. Code that writes WAL for > > FSM_FORKNUM and VISIBILITYMAP_FORKNUM checks RelationNeedsWAL(). There may > > be > > no bug today, but it's conceptually wrong to make RelationNeedsWAL() return > > false due to this code, use RelationNeedsWAL() for multiple forks, and then > > not actually sync all forks. > > I agree that all forks needs syncing, but FSM and VM are checking > RelationNeedsWAL(modified). To make sure, are you suggesting to > sync all forks instead of emitting WAL for them, or suggesting > that VM and FSM to emit WALs even when the modified > RelationNeedsWAL returns false (+ sync all forks)? 
All forks are synced and have no WALs emitted (as before) in the attached version 19. FSM and VM are not changed. > > The https://postgr.es/m/559fa0ba.3080...@iki.fi design had another component > > not appearing here. It said, "Instead, at COMMIT, we'd fsync() the > > relation, > > or if it's smaller than some threshold, WAL-log the contents of the whole > > file > > at that point." Please write the part to WAL-log the contents of small > > files > > instead of syncing them. > > I'm not sure the point of the behavior. I suppose that the "log" > is a sequence of new_page records. It also needs to be synced and > it is always larger than the file to be synced. I can't think of > an appropriate threshold without the point. This is not included in this version. I'll continue to consider this. > > > --- a/src/backend/commands/copy.c > > > +++ b/src/backend/commands/copy.c > > > @@ -2725,28 +2722,9 @@ CopyFrom(CopyState cstate) > > >* If it does commit, we'll have done the table_finish_bulk_insert() at > > >* the bottom of this routine first. > > >* > > > - * As mentioned in comments in utils/rel.h, the in-same-transaction test > > > - * is not always set correctly, since in rare cases > > > rd_newRelfilenodeSubid > > > - * can be cleared before the end of the transaction. The exact case is > > > - * when a relation sets a new relfilenode twice in same transaction, yet > > > - * the second one fails in an aborted subtransaction, e.g. > > > - * > > > - * BEGIN; > > > - * TRUNCATE t; > > > - * SAVEPOINT save; > > > - * TRUNCATE t; > > > - * ROLLBACK TO save; > > > - * COPY ... > > The comment material being deleted is still correct, so don't delete it. The code is changed to use rd_firstRelfilenodeSubid instead of rd_newRelfilenodeSubid, which has the issue mentioned in the deleted section. So this is right but irrelevant to the code here. The same thing is written in the comment in RelationData. 
(In short, not reverted) > > Moreover, the code managing rd_firstRelfilenodeSubid has a similar bug. The > > attached patch adds an assertion that RelationNeedsWAL() and the > > pendingDeletes array have the same opinion about the relfilenode, and it > > expands a test case to fail that assertion. .. > > In general, to add a field like this, run "git grep -n 'rd_.*Subid'" and > > audit > > all the lines printed. Many bits of code need to look at all three, > > e.g. RelationClose(). I forgot to maintain rd_firstRelfilenodeSubid in many places and the assertion failure no longer happens after I fixed it. Contrary to my previous mail, of course useless pending entries are removed at subtransaction abort and no needless syncs happen in that meaning. But another type of useless sync was seen with the previous version 18. (In short, fixed.) > > This field needs to be 100% reliable. In other words, > > it must equal InvalidSubTransactionId if and only if the relfilenode matches > > the
Re: [HACKERS] WAL logging problem in 9.4.3?
Hello. At Mon, 19 Aug 2019 18:59:59 +0900 (Tokyo Standard Time), Kyotaro Horiguchi wrote in <20190819.185959.118543656.horikyota@gmail.com> > > The comment material being deleted is still correct, so don't delete it. > > Moreover, the code managing rd_firstRelfilenodeSubid has a similar bug. The > > attached patch adds an assertion that RelationNeedsWAL() and the > > pendingDeletes array have the same opinion about the relfilenode, and it > > expands a test case to fail that assertion. > > (Un?)Fortunately, that doesn't fail.. (with rebased version on > the recent master) I'll recheck that tomorrow. I saw the assertion failure. It's part of the intended behavior. In this patch, the relcache doesn't hold the whole history of relfilenodes, so we cannot remove useless pending syncs perfectly. On the other hand, they are harmless except that they cause extra sync of files that are removed immediately. So I chose not to remove pending syncs once they are registered. If we want consistency here, we need to record the creator subxid in the PendingRelOps (PendingRelDelete) struct and do rather large work at subtransaction end. > > > --- a/src/include/utils/rel.h > > > +++ b/src/include/utils/rel.h > > > @@ -74,11 +74,13 @@ typedef struct RelationData > > > SubTransactionId rd_createSubid;/* rel was created in current > > > xact */ > > > SubTransactionId rd_newRelfilenodeSubid;/* new relfilenode > > > assigned in > > > > > > * current xact */ > > > + SubTransactionId rd_firstRelfilenodeSubid; /* new relfilenode > > > assigned > > > + > > > * first in current xact */ > > > > In general, to add a field like this, run "git grep -n 'rd_.*Subid'" and > > audit > > all the lines printed. Many bits of code need to look at all three, > > e.g. RelationClose(). > > Agreed. I'll recheck that. > > This field needs to be 100% reliable. 
In other words, > > it must equal InvalidSubTransactionId if and only if the relfilenode matches > > the relfilenode that would be in place if the top transaction rolled back. > > I don't get this. I think the variable moves as you suggested. It > is handled same way with fd_new* in AtEOSubXact_cleanup but the > difference is in assignment but rollback. rd_fist* won't change > after the first assignment so rollback of the subid means > relfilenode is also rolled back to the initial value at the > beginning of the top transaction. So I'll add this in the next version to see how it looks. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
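The rd_first* lifetime described here (set once in a transaction, promoted to the parent when a subtransaction commits, invalidated when it aborts) can be modeled as follows. This follows the AtEOSubXact_cleanup() pattern but is a standalone approximation, not the actual relcache code:

```c
typedef unsigned int SubTransactionId;
#define InvalidSubTransactionId ((SubTransactionId) 0)

typedef struct RelationData
{
	SubTransactionId rd_createSubid;			/* rel created in current xact */
	SubTransactionId rd_newRelfilenodeSubid;	/* latest relfilenode assignment */
	SubTransactionId rd_firstRelfilenodeSubid;	/* first assignment in this xact */
} RelationData;

/*
 * At subtransaction end: on commit, the assignment now belongs to the
 * parent; on abort, the relfilenode change is rolled back, so the marker
 * reverts to "unchanged since top-transaction start".
 */
static void
at_eosubxact_relfilenode(RelationData *rel, int isCommit,
						 SubTransactionId mySubid,
						 SubTransactionId parentSubid)
{
	if (rel->rd_firstRelfilenodeSubid == mySubid)
		rel->rd_firstRelfilenodeSubid =
			isCommit ? parentSubid : InvalidSubTransactionId;
	if (rel->rd_newRelfilenodeSubid == mySubid)
		rel->rd_newRelfilenodeSubid =
			isCommit ? parentSubid : InvalidSubTransactionId;
}
```

Because rd_firstRelfilenodeSubid is never overwritten by later assignments in the same transaction, aborting the subxid that first set it means the relation is back on the relfilenode it had at top-transaction start, which is exactly the "100% reliable" property requested above.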
Re: [HACKERS] WAL logging problem in 9.4.3?
On Mon, Aug 19, 2019 at 06:59:59PM +0900, Kyotaro Horiguchi wrote: > At Sat, 17 Aug 2019 20:52:30 -0700, Noah Misch wrote in > <20190818035230.gb3021...@rfd.leadboat.com> > > For two-phase commit, PrepareTransaction() needs to execute pending syncs. > > Now TwoPhaseFileHeader has two new members for (commit-time) > pending syncs. Pending-syncs are useless on wal-replay, but that > is needed for commit-prepared. There's no need to modify TwoPhaseFileHeader or the COMMIT PREPARED SQL command, which is far too late to be syncing new relation files. (A crash may have already destroyed their data.) PrepareTransaction(), which implements the PREPARE TRANSACTION command, is the right place for these syncs. A failure in these new syncs needs to prevent the transaction from being marked committed. Hence, in CommitTransaction(), these new syncs need to happen after the last step that could assign a new relfilenode and before RecordTransactionCommit(). I suspect it's best to do it after PreCommit_on_commit_actions() and before AtEOXact_LargeObject(). > > On Thu, Jul 25, 2019 at 10:39:36AM +0900, Kyotaro Horiguchi wrote: > > > --- a/src/backend/catalog/storage.c > > > +++ b/src/backend/catalog/storage.c > > > @@ -428,21 +450,34 @@ smgrDoPendingDeletes(bool isCommit) > ... > > > + smgrimmedsync(srel, MAIN_FORKNUM); > > > > This should sync all forks, not just MAIN_FORKNUM. Code that writes WAL for > > FSM_FORKNUM and VISIBILITYMAP_FORKNUM checks RelationNeedsWAL(). There may > > be > > no bug today, but it's conceptually wrong to make RelationNeedsWAL() return > > false due to this code, use RelationNeedsWAL() for multiple forks, and then > > not actually sync all forks. > > I agree that all forks needs syncing, but FSM and VM are checking > RelationNeedsWAL(modified). 
To make sure, are you suggesting to > sync all forks instead of emitting WAL for them, or suggesting > that VM and FSM to emit WALs even when the modified > RelationNeedsWAL returns false (+ sync all forks)? I hadn't thought that far. What do you think is best? > > The https://postgr.es/m/559fa0ba.3080...@iki.fi design had another component > > not appearing here. It said, "Instead, at COMMIT, we'd fsync() the > > relation, > > or if it's smaller than some threshold, WAL-log the contents of the whole > > file > > at that point." Please write the part to WAL-log the contents of small > > files > > instead of syncing them. > > I'm not sure the point of the behavior. I suppose that the "log" > is a sequence of new_page records. It also needs to be synced and > it is always larger than the file to be synced. I can't think of > an appropriate threshold without the point. Yes, it would be a sequence of new-page records. FlushRelationBuffers() locks every buffer header containing a buffer of the current database. The belief has been that writing one page to xlog is cheaper than FlushRelationBuffers() in a busy system with large shared_buffers. > > > --- a/src/backend/commands/copy.c > > > +++ b/src/backend/commands/copy.c > > > @@ -2725,28 +2722,9 @@ CopyFrom(CopyState cstate) > > >* If it does commit, we'll have done the table_finish_bulk_insert() at > > >* the bottom of this routine first. > > >* > > > - * As mentioned in comments in utils/rel.h, the in-same-transaction test > > > - * is not always set correctly, since in rare cases > > > rd_newRelfilenodeSubid > > > - * can be cleared before the end of the transaction. The exact case is > > > - * when a relation sets a new relfilenode twice in same transaction, yet > > > - * the second one fails in an aborted subtransaction, e.g. > > > - * > > > - * BEGIN; > > > - * TRUNCATE t; > > > - * SAVEPOINT save; > > > - * TRUNCATE t; > > > - * ROLLBACK TO save; > > > - * COPY ... 
> > > > The comment material being deleted is still correct, so don't delete it. > > Moreover, the code managing rd_firstRelfilenodeSubid has a similar bug. The > > attached patch adds an assertion that RelationNeedsWAL() and the > > pendingDeletes array have the same opinion about the relfilenode, and it > > expands a test case to fail that assertion. > > (Un?)Fortunately, that doesn't fail.. (with rebased version on > the recent master) I'll recheck that tomorrow. Did you build with --enable-cassert? > > > --- a/src/include/utils/rel.h > > > +++ b/src/include/utils/rel.h > > > @@ -74,11 +74,13 @@ typedef struct RelationData > > > SubTransactionId rd_createSubid;/* rel was created in current > > > xact */ > > > SubTransactionId rd_newRelfilenodeSubid;/* new relfilenode > > > assigned in > > > > > > * current xact */ > > > + SubTransactionId rd_firstRelfilenodeSubid; /* new relfilenode > > > assigned > > > + > > > * first in current xact
Re: [HACKERS] WAL logging problem in 9.4.3?
Thank you for taking time. At Sat, 17 Aug 2019 20:52:30 -0700, Noah Misch wrote in <20190818035230.gb3021...@rfd.leadboat.com> > For two-phase commit, PrepareTransaction() needs to execute pending syncs. Now TwoPhaseFileHeader has two new members for (commit-time) pending syncs. Pending-syncs are useless on wal-replay, but that is needed for commit-prepared. > On Thu, Jul 25, 2019 at 10:39:36AM +0900, Kyotaro Horiguchi wrote: > > --- a/src/backend/access/heap/heapam_handler.c > > +++ b/src/backend/access/heap/heapam_handler.c > > @@ -715,12 +702,6 @@ heapam_relation_copy_for_cluster(Relation OldHeap, > > Relation NewHeap, ... > > - use_wal = XLogIsNeeded() && RelationNeedsWAL(NewHeap); > > - > > /* use_wal off requires smgr_targblock be initially invalid */ > > Assert(RelationGetTargetBlock(NewHeap) == InvalidBlockNumber); > > Since you're deleting the use_wal variable, update that last comment. Oops. Rewrote it. > > --- a/src/backend/catalog/storage.c > > +++ b/src/backend/catalog/storage.c > > @@ -428,21 +450,34 @@ smgrDoPendingDeletes(bool isCommit) ... > > + smgrimmedsync(srel, MAIN_FORKNUM); > > This should sync all forks, not just MAIN_FORKNUM. Code that writes WAL for > FSM_FORKNUM and VISIBILITYMAP_FORKNUM checks RelationNeedsWAL(). There may be > no bug today, but it's conceptually wrong to make RelationNeedsWAL() return > false due to this code, use RelationNeedsWAL() for multiple forks, and then > not actually sync all forks. I agree that all forks need syncing, but FSM and VM are checking RelationNeedsWAL(modified). To make sure, are you suggesting to sync all forks instead of emitting WAL for them, or suggesting that VM and FSM emit WALs even when the modified RelationNeedsWAL returns false (+ sync all forks)? > The https://postgr.es/m/559fa0ba.3080...@iki.fi design had another component > not appearing here. 
It said, "Instead, at COMMIT, we'd fsync() the relation, > or if it's smaller than some threshold, WAL-log the contents of the whole file > at that point." Please write the part to WAL-log the contents of small files > instead of syncing them. I'm not sure the point of the behavior. I suppose that the "log" is a sequence of new_page records. It also needs to be synced and it is always larger than the file to be synced. I can't think of an appropriate threshold without the point. > > --- a/src/backend/commands/copy.c > > +++ b/src/backend/commands/copy.c > > @@ -2725,28 +2722,9 @@ CopyFrom(CopyState cstate) > > * If it does commit, we'll have done the table_finish_bulk_insert() at > > * the bottom of this routine first. > > * > > -* As mentioned in comments in utils/rel.h, the in-same-transaction test > > -* is not always set correctly, since in rare cases > > rd_newRelfilenodeSubid > > -* can be cleared before the end of the transaction. The exact case is > > -* when a relation sets a new relfilenode twice in same transaction, yet > > -* the second one fails in an aborted subtransaction, e.g. > > -* > > -* BEGIN; > > -* TRUNCATE t; > > -* SAVEPOINT save; > > -* TRUNCATE t; > > -* ROLLBACK TO save; > > -* COPY ... > > The comment material being deleted is still correct, so don't delete it. > Moreover, the code managing rd_firstRelfilenodeSubid has a similar bug. The > attached patch adds an assertion that RelationNeedsWAL() and the > pendingDeletes array have the same opinion about the relfilenode, and it > expands a test case to fail that assertion. (Un?)Fortunately, that doesn't fail.. (with rebased version on the recent master) I'll recheck that tomorrow. 
> > --- a/src/include/utils/rel.h > > +++ b/src/include/utils/rel.h > > @@ -74,11 +74,13 @@ typedef struct RelationData > > SubTransactionId rd_createSubid;/* rel was created in current > > xact */ > > SubTransactionId rd_newRelfilenodeSubid;/* new relfilenode > > assigned in > > > > * current xact */ > > + SubTransactionId rd_firstRelfilenodeSubid; /* new relfilenode > > assigned > > + > > * first in current xact */ > In general, to add a field like this, run "git grep -n 'rd_.*Subid'" and audit > all the lines printed. Many bits of code need to look at all three, > e.g. RelationClose(). Agreed. I'll recheck that. > This field needs to be 100% reliable. In other words, > it must equal InvalidSubTransactionId if and only if the relfilenode matches > the relfilenode that would be in place if the top transaction rolled back. I don't get this. I think the variable moves as you suggested. It is handled the same way as rd_new* in AtEOSubXact_cleanup, but the difference is in assignment, not rollback. rd_first* won't change after the first assignment, so rollback of the subid means
Re: [HACKERS] WAL logging problem in 9.4.3?
For two-phase commit, PrepareTransaction() needs to execute pending syncs. On Thu, Jul 25, 2019 at 10:39:36AM +0900, Kyotaro Horiguchi wrote: > --- a/src/backend/access/heap/heapam_handler.c > +++ b/src/backend/access/heap/heapam_handler.c > @@ -715,12 +702,6 @@ heapam_relation_copy_for_cluster(Relation OldHeap, > Relation NewHeap, > /* Remember if it's a system catalog */ > is_system_catalog = IsSystemRelation(OldHeap); > > - /* > - * We need to log the copied data in WAL iff WAL archiving/streaming is > - * enabled AND it's a WAL-logged rel. > - */ > - use_wal = XLogIsNeeded() && RelationNeedsWAL(NewHeap); > - > /* use_wal off requires smgr_targblock be initially invalid */ > Assert(RelationGetTargetBlock(NewHeap) == InvalidBlockNumber); Since you're deleting the use_wal variable, update that last comment. > --- a/src/backend/catalog/storage.c > +++ b/src/backend/catalog/storage.c > @@ -428,21 +450,34 @@ smgrDoPendingDeletes(bool isCommit) > { > SMgrRelation srel; > > - srel = smgropen(pending->relnode, > pending->backend); > - > - /* allocate the initial array, or extend it, if > needed */ > - if (maxrels == 0) > + if (pending->dosync) > { > - maxrels = 8; > - srels = palloc(sizeof(SMgrRelation) * > maxrels); > + /* Perform pending sync of WAL-skipped > relation */ > + > FlushRelationBuffersWithoutRelcache(pending->relnode, > + > false); > + srel = smgropen(pending->relnode, > pending->backend); > + smgrimmedsync(srel, MAIN_FORKNUM); This should sync all forks, not just MAIN_FORKNUM. Code that writes WAL for FSM_FORKNUM and VISIBILITYMAP_FORKNUM checks RelationNeedsWAL(). There may be no bug today, but it's conceptually wrong to make RelationNeedsWAL() return false due to this code, use RelationNeedsWAL() for multiple forks, and then not actually sync all forks. The https://postgr.es/m/559fa0ba.3080...@iki.fi design had another component not appearing here. 
It said, "Instead, at COMMIT, we'd fsync() the relation, or if it's smaller than some threshold, WAL-log the contents of the whole file at that point." Please write the part to WAL-log the contents of small files instead of syncing them. > --- a/src/backend/commands/copy.c > +++ b/src/backend/commands/copy.c > @@ -2725,28 +2722,9 @@ CopyFrom(CopyState cstate) >* If it does commit, we'll have done the table_finish_bulk_insert() at >* the bottom of this routine first. >* > - * As mentioned in comments in utils/rel.h, the in-same-transaction test > - * is not always set correctly, since in rare cases > rd_newRelfilenodeSubid > - * can be cleared before the end of the transaction. The exact case is > - * when a relation sets a new relfilenode twice in same transaction, yet > - * the second one fails in an aborted subtransaction, e.g. > - * > - * BEGIN; > - * TRUNCATE t; > - * SAVEPOINT save; > - * TRUNCATE t; > - * ROLLBACK TO save; > - * COPY ... The comment material being deleted is still correct, so don't delete it. Moreover, the code managing rd_firstRelfilenodeSubid has a similar bug. The attached patch adds an assertion that RelationNeedsWAL() and the pendingDeletes array have the same opinion about the relfilenode, and it expands a test case to fail that assertion. > --- a/src/include/utils/rel.h > +++ b/src/include/utils/rel.h > @@ -74,11 +74,13 @@ typedef struct RelationData > SubTransactionId rd_createSubid;/* rel was created in current > xact */ > SubTransactionId rd_newRelfilenodeSubid;/* new relfilenode > assigned in > > * current xact */ > + SubTransactionId rd_firstRelfilenodeSubid; /* new relfilenode > assigned > + > * first in current xact */ In general, to add a field like this, run "git grep -n 'rd_.*Subid'" and audit all the lines printed. Many bits of code need to look at all three, e.g. RelationClose(). This field needs to be 100% reliable. 
In other words, it must equal InvalidSubTransactionId if and only if the relfilenode matches the relfilenode that would be in place if the top transaction rolled back. nm diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c index 6ebe75a..d74e9a5 100644 ---
Re: [HACKERS] WAL logging problem in 9.4.3?
Hello. At Fri, 2 Aug 2019 11:35:06 +1200, Thomas Munro wrote in > On Sat, Jul 27, 2019 at 6:26 PM Noah Misch wrote: > > On Thu, Jul 25, 2019 at 10:39:36AM +0900, Kyotaro Horiguchi wrote: > > > No substantial change have been made by this rebasing. > > > > Thanks. I'll likely review this on 2019-08-20. If someone opts to review > > it > > earlier, I welcome that. > > Cool. That'll be in time to be marked committed in the September CF, > this patch's 16th. Yeah, this patch has been reborn far simpler and more generic (or robust) thanks to Noah. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: [HACKERS] WAL logging problem in 9.4.3?
On Sat, Jul 27, 2019 at 6:26 PM Noah Misch wrote: > On Thu, Jul 25, 2019 at 10:39:36AM +0900, Kyotaro Horiguchi wrote: > > No substantial change have been made by this rebasing. > > Thanks. I'll likely review this on 2019-08-20. If someone opts to review it > earlier, I welcome that. Cool. That'll be in time to be marked committed in the September CF, this patch's 16th. -- Thomas Munro https://enterprisedb.com
Re: [HACKERS] WAL logging problem in 9.4.3?
On Thu, Jul 25, 2019 at 10:39:36AM +0900, Kyotaro Horiguchi wrote: > No substantial change have been made by this rebasing. Thanks. I'll likely review this on 2019-08-20. If someone opts to review it earlier, I welcome that.
Re: [HACKERS] WAL logging problem in 9.4.3?
I found the CF bot complaining about this. It seems that some comment fixes by the recent 21039555cd are the cause. No substantial change has been made by this rebasing. regards. On Fri, Jul 12, 2019 at 5:37 PM Kyotaro Horiguchi wrote: > > At Fri, 12 Jul 2019 17:30:41 +0900 (Tokyo Standard Time), Kyotaro Horiguchi > wrote in > <20190712.173041.236938840.horikyota@gmail.com> > > The v16 seems no longer works so I'll send further rebased version. > > It's just by renaming of TestLib::real_dir to perl2host. > This is rebased version v17. > > regards. > > -- > Kyotaro Horiguchi > NTT Open Source Software Center -- Kyotaro Horiguchi NTT Open Source Software Center v18-0001-TAP-test-for-copy-truncation-optimization.patch Description: Binary data v18-0002-Fix-WAL-skipping-feature.patch Description: Binary data v18-0003-Rename-smgrDoPendingDeletes-to-smgrDoPendingOperatio.patch Description: Binary data
Re: [HACKERS] WAL logging problem in 9.4.3?
At Fri, 12 Jul 2019 17:30:41 +0900 (Tokyo Standard Time), Kyotaro Horiguchi wrote in <20190712.173041.236938840.horikyota@gmail.com> > The v16 seems no longer works so I'll send further rebased version. It's just by renaming of TestLib::real_dir to perl2host. This is rebased version v17. regards. -- Kyotaro Horiguchi NTT Open Source Software Center >From 9bcd4acb14c5cef2d4bdf20c9be8c86597a9cf7c Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi Date: Thu, 11 Oct 2018 10:03:21 +0900 Subject: [PATCH 1/3] TAP test for copy-truncation optimization. --- src/test/recovery/t/018_wal_optimize.pl | 291 1 file changed, 291 insertions(+) create mode 100644 src/test/recovery/t/018_wal_optimize.pl diff --git a/src/test/recovery/t/018_wal_optimize.pl b/src/test/recovery/t/018_wal_optimize.pl new file mode 100644 index 00..b26cd8efd5 --- /dev/null +++ b/src/test/recovery/t/018_wal_optimize.pl @@ -0,0 +1,291 @@ +# Test WAL replay for optimized TRUNCATE and COPY records +# +# WAL truncation is optimized in some cases with TRUNCATE and COPY queries +# which sometimes interact badly with the other optimizations in line with +# several setting values of wal_level, particularly when using "minimal" or +# "replica". The optimization may be enabled or disabled depending on the +# scenarios dealt here, and should never result in any type of failures or +# data loss. +use strict; +use warnings; + +use PostgresNode; +use TestLib; +use Test::More tests => 24; + +sub check_orphan_relfilenodes +{ + my($node, $test_name) = @_; + + my $db_oid = $node->safe_psql('postgres', + "SELECT oid FROM pg_database WHERE datname = 'postgres'"); + my $prefix = "base/$db_oid/"; + my $filepaths_referenced = $node->safe_psql('postgres', " + SELECT pg_relation_filepath(oid) FROM pg_class + WHERE reltablespace = 0 and relpersistence <> 't' and + pg_relation_filepath(oid) IS NOT NULL;"); + is_deeply([sort(map { "$prefix$_" } + grep(/^[0-9]+$/, + slurp_dir($node->data_dir . 
"/$prefix")))], + [sort split /\n/, $filepaths_referenced], + $test_name); + return; +} + +# Wrapper routine tunable for wal_level. +sub run_wal_optimize +{ + my $wal_level = shift; + + # Primary needs to have wal_level = minimal here + my $node = get_new_node("node_$wal_level"); + $node->init; + $node->append_conf('postgresql.conf', qq( +wal_level = $wal_level +)); + $node->start; + + # Setup + my $tablespace_dir = $node->basedir . '/tablespace_other'; + mkdir ($tablespace_dir); + $tablespace_dir = TestLib::perl2host($tablespace_dir); + $node->safe_psql('postgres', + "CREATE TABLESPACE other LOCATION '$tablespace_dir';"); + + # Test direct truncation optimization. No tuples + $node->safe_psql('postgres', " + BEGIN; + CREATE TABLE test1 (id serial PRIMARY KEY); + TRUNCATE test1; + COMMIT;"); + + $node->stop('immediate'); + $node->start; + + my $result = $node->safe_psql('postgres', "SELECT count(*) FROM test1;"); + is($result, qq(0), + "wal_level = $wal_level, optimized truncation with empty table"); + + # Test truncation with inserted tuples within the same transaction. + # Tuples inserted after the truncation should be seen. + $node->safe_psql('postgres', " + BEGIN; + CREATE TABLE test2 (id serial PRIMARY KEY); + INSERT INTO test2 VALUES (DEFAULT); + TRUNCATE test2; + INSERT INTO test2 VALUES (DEFAULT); + COMMIT;"); + + $node->stop('immediate'); + $node->start; + + $result = $node->safe_psql('postgres', "SELECT count(*) FROM test2;"); + is($result, qq(1), + "wal_level = $wal_level, optimized truncation with inserted table"); + + # Data file for COPY query in follow-up tests. + my $basedir = $node->basedir; + my $copy_file = "$basedir/copy_data.txt"; + TestLib::append_to_file($copy_file, qq(2,3 +20001,30001 +20002,30002)); + + # Test truncation with inserted tuples using COPY. Tuples copied after the + # truncation should be seen. 
+ $node->safe_psql('postgres', " + BEGIN; + CREATE TABLE test3 (id serial PRIMARY KEY, id2 int); + INSERT INTO test3 (id, id2) VALUES (DEFAULT, generate_series(1,3000)); + TRUNCATE test3; + COPY test3 FROM '$copy_file' DELIMITER ','; + COMMIT;"); + $node->stop('immediate'); + $node->start; + $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3;"); + is($result, qq(3), + "wal_level = $wal_level, optimized truncation with copied table"); + + # Like previous test, but rollback SET TABLESPACE in a subtransaction. + $node->safe_psql('postgres', " + BEGIN; + CREATE TABLE test3a (id serial PRIMARY KEY, id2 int); + INSERT INTO test3a (id, id2) VALUES (DEFAULT, generate_series(1,3000)); + TRUNCATE test3a; + SAVEPOINT s; ALTER TABLE test3a SET TABLESPACE other; ROLLBACK TO s; + COPY test3a FROM '$copy_file' DELIMITER ','; + COMMIT;"); + $node->stop('immediate'); + $node->start; + $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3a;"); + is($result, qq(3), + "wal_level = $wal_level, SET TABLESPACE in
Re: [HACKERS] WAL logging problem in 9.4.3?
Many messages seem to have been lost during the move to the new environment. I'm digging through the archive but couldn't find the message for v15. At Thu, 11 Jul 2019 18:03:35 -0700, Noah Misch wrote in <20190712010335.gb1610...@rfd.leadboat.com> > On Wed, Jul 10, 2019 at 01:19:14PM +0900, Kyotaro Horiguchi wrote: > > Hello. Rebased the patch to master(bd56cd75d2). > > It looks like you did more than just a rebase, because this v16 no longer > modifies many files that v14 did modify. (That's probably good, since you had > pending review comments.) What other changes did you make? Yeah.. Maybe I forgot to send pre-v15 or v16 before rebasing. v14: WAL logging is controlled by AMs, and syncing at commit is controlled according to that behavior. At-commit sync is still controlled on a per-relation basis, which means it must be processed before the transaction state becomes TRANS_COMMIT. So it needs to be split out of AtEOXact_RelationCache() into a separate PreCommit_RelationSync(). v15: The biggest change is that at-commit sync now works at the smgr level. At-commit sync is scheduled at creation of a storage file (RelationCreateStorage), and smgrDoPendingDeletes() (or smgrDoPendingOperations() after the rename) runs the syncs. AMs are no longer involved, and all permanent relations are entirely WAL-skipped in the creation transaction while wal_level=minimal. All storage files created for a relation are synced once and then removed from the pending list at commit. v16: rebased. v16 no longer seems to work, so I'll send a further rebased version. Sorry for the late reply and confusion. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: [HACKERS] WAL logging problem in 9.4.3?
On Wed, Jul 10, 2019 at 01:19:14PM +0900, Kyotaro Horiguchi wrote: > Hello. Rebased the patch to master(bd56cd75d2). It looks like you did more than just a rebase, because this v16 no longer modifies many files that v14 did modify. (That's probably good, since you had pending review comments.) What other changes did you make?
Re: [HACKERS] WAL logging problem in 9.4.3?
Hello. Rebased the patch to master(bd56cd75d2). regards. -- Kyotaro Horiguchi NTT Open Source Software Center >From ac52e2c1c56a96c1745149ff4220a3a116d6c811 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi Date: Thu, 11 Oct 2018 10:03:21 +0900 Subject: [PATCH 1/3] TAP test for copy-truncation optimization. --- src/test/recovery/t/018_wal_optimize.pl | 291 1 file changed, 291 insertions(+) create mode 100644 src/test/recovery/t/018_wal_optimize.pl diff --git a/src/test/recovery/t/018_wal_optimize.pl b/src/test/recovery/t/018_wal_optimize.pl new file mode 100644 index 00..4fa8be728e --- /dev/null +++ b/src/test/recovery/t/018_wal_optimize.pl @@ -0,0 +1,291 @@ +# Test WAL replay for optimized TRUNCATE and COPY records +# +# WAL truncation is optimized in some cases with TRUNCATE and COPY queries +# which sometimes interact badly with the other optimizations in line with +# several setting values of wal_level, particularly when using "minimal" or +# "replica". The optimization may be enabled or disabled depending on the +# scenarios dealt here, and should never result in any type of failures or +# data loss. +use strict; +use warnings; + +use PostgresNode; +use TestLib; +use Test::More tests => 24; + +sub check_orphan_relfilenodes +{ + my($node, $test_name) = @_; + + my $db_oid = $node->safe_psql('postgres', + "SELECT oid FROM pg_database WHERE datname = 'postgres'"); + my $prefix = "base/$db_oid/"; + my $filepaths_referenced = $node->safe_psql('postgres', " + SELECT pg_relation_filepath(oid) FROM pg_class + WHERE reltablespace = 0 and relpersistence <> 't' and + pg_relation_filepath(oid) IS NOT NULL;"); + is_deeply([sort(map { "$prefix$_" } + grep(/^[0-9]+$/, + slurp_dir($node->data_dir . "/$prefix")))], + [sort split /\n/, $filepaths_referenced], + $test_name); + return; +} + +# Wrapper routine tunable for wal_level. 
+sub run_wal_optimize +{ + my $wal_level = shift; + + # Primary needs to have wal_level = minimal here + my $node = get_new_node("node_$wal_level"); + $node->init; + $node->append_conf('postgresql.conf', qq( +wal_level = $wal_level +)); + $node->start; + + # Setup + my $tablespace_dir = $node->basedir . '/tablespace_other'; + mkdir ($tablespace_dir); + $tablespace_dir = TestLib::real_dir($tablespace_dir); + $node->safe_psql('postgres', + "CREATE TABLESPACE other LOCATION '$tablespace_dir';"); + + # Test direct truncation optimization. No tuples + $node->safe_psql('postgres', " + BEGIN; + CREATE TABLE test1 (id serial PRIMARY KEY); + TRUNCATE test1; + COMMIT;"); + + $node->stop('immediate'); + $node->start; + + my $result = $node->safe_psql('postgres', "SELECT count(*) FROM test1;"); + is($result, qq(0), + "wal_level = $wal_level, optimized truncation with empty table"); + + # Test truncation with inserted tuples within the same transaction. + # Tuples inserted after the truncation should be seen. + $node->safe_psql('postgres', " + BEGIN; + CREATE TABLE test2 (id serial PRIMARY KEY); + INSERT INTO test2 VALUES (DEFAULT); + TRUNCATE test2; + INSERT INTO test2 VALUES (DEFAULT); + COMMIT;"); + + $node->stop('immediate'); + $node->start; + + $result = $node->safe_psql('postgres', "SELECT count(*) FROM test2;"); + is($result, qq(1), + "wal_level = $wal_level, optimized truncation with inserted table"); + + # Data file for COPY query in follow-up tests. + my $basedir = $node->basedir; + my $copy_file = "$basedir/copy_data.txt"; + TestLib::append_to_file($copy_file, qq(2,3 +20001,30001 +20002,30002)); + + # Test truncation with inserted tuples using COPY. Tuples copied after the + # truncation should be seen. 
+ $node->safe_psql('postgres', " + BEGIN; + CREATE TABLE test3 (id serial PRIMARY KEY, id2 int); + INSERT INTO test3 (id, id2) VALUES (DEFAULT, generate_series(1,3000)); + TRUNCATE test3; + COPY test3 FROM '$copy_file' DELIMITER ','; + COMMIT;"); + $node->stop('immediate'); + $node->start; + $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3;"); + is($result, qq(3), + "wal_level = $wal_level, optimized truncation with copied table"); + + # Like previous test, but rollback SET TABLESPACE in a subtransaction. + $node->safe_psql('postgres', " + BEGIN; + CREATE TABLE test3a (id serial PRIMARY KEY, id2 int); + INSERT INTO test3a (id, id2) VALUES (DEFAULT, generate_series(1,3000)); + TRUNCATE test3a; + SAVEPOINT s; ALTER TABLE test3a SET TABLESPACE other; ROLLBACK TO s; + COPY test3a FROM '$copy_file' DELIMITER ','; + COMMIT;"); + $node->stop('immediate'); + $node->start; + $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3a;"); + is($result, qq(3), + "wal_level = $wal_level, SET TABLESPACE in subtransaction"); + + # in different subtransaction patterns + $node->safe_psql('postgres', " + BEGIN; + CREATE TABLE test3a2 (id serial PRIMARY KEY, id2 int); + INSERT INTO test3a2 (id, id2) VALUES (DEFAULT, generate_series(1,3000)); + TRUNCATE
Re: [HACKERS] WAL logging problem in 9.4.3?
On Tue, May 28, 2019 at 4:33 AM Noah Misch wrote: > > On Mon, May 27, 2019 at 02:08:26PM +0900, Kyotaro HORIGUCHI wrote: > > At Fri, 24 May 2019 19:33:32 -0700, Noah Misch wrote in > > <20190525023332.ge1624...@rfd.leadboat.com> > > > On Mon, May 20, 2019 at 03:54:30PM +0900, Kyotaro HORIGUCHI wrote: > > > > Following this direction, the attached PoC works *at least for* > > > > the wal_optimization TAP tests, but doing pending flush not in > > > > smgr but in relcache. > > > > > > This task, syncing files created in the current transaction, is not the > > > kind > > > of task normally assigned to a cache. We already have a module, > > > storage.c, > > > that maintains state about files created in the current transaction. Why > > > did > > > you use relcache instead of storage.c? > > > > The reason was at-commit sync needs buffer flush beforehand. But > > FlushRelationBufferWithoutRelCache() in v11 can do > > that. storage.c is reasonable as the place. > > Okay. I do want this to work in 9.5 and later, but I'm not aware of a reason > relcache.c would be a better code location in older branches. Unless you > think of a reason to prefer relcache.c, please use storage.c. > > > > On Tue, May 21, 2019 at 09:29:48PM +0900, Kyotaro HORIGUCHI wrote: > > > > This is a tidier version of the patch. > > > > > > > - Move the substantial work to table/index AMs. > > > > > > > > Each AM can decide whether to support WAL skip or not. > > > > Currently heap and nbtree support it. > > > > > > Why would an AM find it important to disable WAL skip? > > > > The reason is currently it's AM's responsibility to decide > > whether to skip WAL or not. > > I see. Skipping the sync would be a mere optimization; no AM would require it > for correctness. An AM might want RelationNeedsWAL() to keep returning true > despite the sync happening, perhaps because it persists data somewhere other > than the forks of pg_class.relfilenode. 
Since the index and table APIs > already assume one relfilenode captures all persistent data, I'm not seeing a > use case for an AM overriding this behavior. Let's take away the AM's > responsibility for this decision, making the system simpler. A future patch > could let AM code decide, if someone find a real-world use case for > AM-specific logic around when to skip WAL. > It seems there is some feedback for this patch, and the CF is going to start in two days. Are you planning to work on this patch for the next CF? If not, it is better to bump it. It is not a good idea to have the patch sitting in "waiting on author" at the beginning of a CF unless the author is actively working on it and is going to produce a new version in the next few days. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] WAL logging problem in 9.4.3?
On Mon, May 27, 2019 at 02:08:26PM +0900, Kyotaro HORIGUCHI wrote: > At Fri, 24 May 2019 19:33:32 -0700, Noah Misch wrote in > <20190525023332.ge1624...@rfd.leadboat.com> > > On Mon, May 20, 2019 at 03:54:30PM +0900, Kyotaro HORIGUCHI wrote: > > > Following this direction, the attached PoC works *at least for* > > > the wal_optimization TAP tests, but doing pending flush not in > > > smgr but in relcache. > > > > This task, syncing files created in the current transaction, is not the kind > > of task normally assigned to a cache. We already have a module, storage.c, > > that maintains state about files created in the current transaction. Why > > did > > you use relcache instead of storage.c? > > The reason was at-commit sync needs buffer flush beforehand. But > FlushRelationBufferWithoutRelCache() in v11 can do > that. storage.c is reasonable as the place. Okay. I do want this to work in 9.5 and later, but I'm not aware of a reason relcache.c would be a better code location in older branches. Unless you think of a reason to prefer relcache.c, please use storage.c. > > On Tue, May 21, 2019 at 09:29:48PM +0900, Kyotaro HORIGUCHI wrote: > > > This is a tidier version of the patch. > > > > > - Move the substantial work to table/index AMs. > > > > > > Each AM can decide whether to support WAL skip or not. > > > Currently heap and nbtree support it. > > > > Why would an AM find it important to disable WAL skip? > > The reason is currently it's AM's responsibility to decide > whether to skip WAL or not. I see. Skipping the sync would be a mere optimization; no AM would require it for correctness. An AM might want RelationNeedsWAL() to keep returning true despite the sync happening, perhaps because it persists data somewhere other than the forks of pg_class.relfilenode. Since the index and table APIs already assume one relfilenode captures all persistent data, I'm not seeing a use case for an AM overriding this behavior. 
Let's take away the AM's responsibility for this decision, making the system simpler. A future patch could let AM code decide, if someone finds a real-world use case for AM-specific logic around when to skip WAL.
Re: [HACKERS] WAL logging problem in 9.4.3?
Thanks for the comment! At Fri, 24 May 2019 19:33:32 -0700, Noah Misch wrote in <20190525023332.ge1624...@rfd.leadboat.com> > On Mon, May 20, 2019 at 03:54:30PM +0900, Kyotaro HORIGUCHI wrote: > > Following this direction, the attached PoC works *at least for* > > the wal_optimization TAP tests, but doing pending flush not in > > smgr but in relcache. > > This task, syncing files created in the current transaction, is not the kind > of task normally assigned to a cache. We already have a module, storage.c, > that maintains state about files created in the current transaction. Why did > you use relcache instead of storage.c? The reason was that at-commit sync needs a buffer flush beforehand. But FlushRelationBufferWithoutRelCache() in v11 can do that, so storage.c is reasonable as the place. > On Tue, May 21, 2019 at 09:29:48PM +0900, Kyotaro HORIGUCHI wrote: > > This is a tidier version of the patch. > > > - Move the substantial work to table/index AMs. > > > > Each AM can decide whether to support WAL skip or not. > > Currently heap and nbtree support it. > > Why would an AM find it important to disable WAL skip? The reason is that it is currently the AM's responsibility to decide whether to skip WAL or not. regards. -- Kyotaro Horiguchi NTT Open Source Software Center