Re: [HACKERS] WAL logging problem in 9.4.3?
At Wed, 13 Sep 2017 17:42:39 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI wrote in <20170913.174239.25978735.horiguchi.kyot...@lab.ntt.co.jp>
> filterdiff seems to did something wrong..

# to did...

The patch was broken by filterdiff, so I am sending a new patch made directly with git format-patch. I confirmed that a build completes with this patch applied.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

From 7086b5855080065f73de4d099cbaab09511f01fc Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi
Date: Tue, 12 Sep 2017 13:01:33 +0900
Subject: [PATCH] Fix WAL logging problem

---
 src/backend/access/heap/heapam.c        | 113 +---
 src/backend/access/heap/pruneheap.c     |   3 +-
 src/backend/access/heap/rewriteheap.c   |   4 +-
 src/backend/access/heap/visibilitymap.c |   3 +-
 src/backend/access/transam/xact.c       |   7 +
 src/backend/catalog/storage.c           | 318 +---
 src/backend/commands/copy.c             |  13 +-
 src/backend/commands/createas.c         |   9 +-
 src/backend/commands/matview.c          |   6 +-
 src/backend/commands/tablecmds.c        |   8 +-
 src/backend/commands/vacuumlazy.c       |   6 +-
 src/backend/storage/buffer/bufmgr.c     |  40 +++-
 src/backend/utils/cache/relcache.c      |  13 ++
 src/include/access/heapam.h             |   8 +-
 src/include/catalog/storage.h           |   5 +-
 src/include/storage/bufmgr.h            |   2 +
 src/include/utils/rel.h                 |   8 +
 17 files changed, 476 insertions(+), 90 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index d20f038..e40254d 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -34,6 +34,28 @@
  * the POSTGRES heap access method used for all POSTGRES
  * relations.
  *
+ * WAL CONSIDERATIONS
+ * All heap operations are normally WAL-logged. but there are a few
+ * exceptions. Temporary and unlogged relations never need to be
+ * WAL-logged, but we can also skip WAL-logging for a table that was
+ * created in the same transaction, if we don't need WAL for PITR or
+ * WAL archival purposes (i.e. if wal_level=minimal), and we fsync()
+ * the file to disk at COMMIT instead.
+ *
+ * The same-relation optimization is not employed automatically on all
+ * updates to a table that was created in the same transacton, because
+ * for a small number of changes, it's cheaper to just create the WAL
+ * records than fsyncing() the whole relation at COMMIT. It is only
+ * worthwhile for (presumably) large operations like COPY, CLUSTER,
+ * or VACUUM FULL. Use heap_register_sync() to initiate such an
+ * operation; it will cause any subsequent updates to the table to skip
+ * WAL-logging, if possible, and cause the heap to be synced to disk at
+ * COMMIT.
+ *
+ * To make that work, all modifications to heap must use
+ * HeapNeedsWAL() to check if WAL-logging is needed in this transaction
+ * for the given block.
+ *
  *-------------------------------------------------------------------------
  */
 #include "postgres.h"
@@ -56,6 +78,7 @@
 #include "access/xlogutils.h"
 #include "catalog/catalog.h"
 #include "catalog/namespace.h"
+#include "catalog/storage.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "port/atomics.h"
@@ -2373,12 +2396,6 @@ ReleaseBulkInsertStatePin(BulkInsertState bistate)
  * The new tuple is stamped with current transaction ID and the specified
  * command ID.
  *
- * If the HEAP_INSERT_SKIP_WAL option is specified, the new tuple is not
- * logged in WAL, even for a non-temp relation. Safe usage of this behavior
- * requires that we arrange that all new tuples go into new pages not
- * containing any tuples from other transactions, and that the relation gets
- * fsync'd before commit. (See also heap_sync() comments)
- *
  * The HEAP_INSERT_SKIP_FSM option is passed directly to
  * RelationGetBufferForTuple, which see for more info.
  *
@@ -2409,6 +2426,7 @@ ReleaseBulkInsertStatePin(BulkInsertState bistate)
  * TID where the tuple was stored. But note that any toasting of fields
  * within the tuple data is NOT reflected into *tup.
  */
+extern HTAB *pendingSyncs;
 Oid
 heap_insert(Relation relation, HeapTuple tup, CommandId cid,
             int options, BulkInsertState bistate)
@@ -2482,7 +2500,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
     MarkBufferDirty(buffer);

     /* XLOG stuff */
-    if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer))
     {
         xl_heap_insert xlrec;
         xl_heap_header xlhdr;
@@ -2681,12 +2699,10 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
     int         ndone;
     char       *scratch = NULL;
     Page        page;
-    bool        needwal;
     Size        saveFreeSpace;
     bool        need_tuple_data = RelationIsLogicallyLogged(relation);
     bool        need_cids =
Re: [HACKERS] WAL logging problem in 9.4.3?
At Wed, 13 Sep 2017 15:05:31 +1200, Thomas Munro wrote in
Re: [HACKERS] WAL logging problem in 9.4.3?
Kyotaro HORIGUCHI wrote:
> The CF status of this patch turned into "Waiting on Author" by
> automated CI checking.

I object to patches being switched to "Waiting on Author" automatically by machinery. Sending occasional reminder messages to authors, letting them know about outdated patches, seems acceptable to me at this stage. It will take some time for this machinery to be perfected; only when it is beyond experimental mode will it be acceptable to change patches' status in an automated fashion.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WAL logging problem in 9.4.3?
On Wed, Sep 13, 2017 at 1:04 PM, Kyotaro HORIGUCHI wrote:
> The CF status of this patch turned into "Waiting on Author" by
> automated CI checking. However, I still don't get any error even
> on the current master (69835bc) after make distclean. Also, the
> "problematic" patch and my working branch have no differences
> other than line shifts. (So I haven't posted a new one.)
>
> I looked at the location heapam.c:2502, where the CI complains,
> in my working branch and found different code there.
>
> https://travis-ci.org/postgresql-cfbot/postgresql/builds/27450
>
> 1363 heapam.c:2502:18: error: ‘HEAP_INSERT_SKIP_WAL’ undeclared (first use in this function)
> 1364 if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
>
> heapam.c:2502@work branch
> 2502: /* XLOG stuff */
> 2503: if (BufferNeedsWAL(relation, buffer))
>
> So I conclude that the CI machinery failed to apply the patch
> correctly.

Hi Horiguchi-san,

Hmm. Here is that line in heapam.c in unpatched master:

https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/access/heap/heapam.c;h=d20f0381f3bc23f99c505ef8609d63240ac5d44b;hb=HEAD#l2485

It says:

2485 if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))

After applying fix-wal-level-minimal-michael-horiguchi-3.patch from this message:

https://www.postgresql.org/message-id/20170912.131441.20602611.horiguchi.kyotaro%40lab.ntt.co.jp

... that line is unchanged, although it has moved to line number 2502. It doesn't compile for me, because your patch removed the definition of HEAP_INSERT_SKIP_WAL but hasn't removed that reference to it. I'm not sure what happened. Is it possible that your patch was not created by diffing against master?

-- 
Thomas Munro
http://www.enterprisedb.com
Re: [HACKERS] WAL logging problem in 9.4.3?
Hello,

(does this seem to be a top post?)

The CF status of this patch turned into "Waiting on Author" by automated CI checking. However, I still don't get any error even on the current master (69835bc) after make distclean. Also, the "problematic" patch and my working branch have no differences other than line shifts. (So I haven't posted a new one.)

I looked at the location heapam.c:2502, where the CI complains, in my working branch and found different code there.

https://travis-ci.org/postgresql-cfbot/postgresql/builds/27450

1363 heapam.c:2502:18: error: ‘HEAP_INSERT_SKIP_WAL’ undeclared (first use in this function)
1364 if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))

heapam.c:2502@work branch
2502: /* XLOG stuff */
2503: if (BufferNeedsWAL(relation, buffer))

So I conclude that the CI machinery failed to apply the patch correctly.

At Thu, 13 Apr 2017 15:29:35 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI wrote in <20170413.152935.100104316.horiguchi.kyot...@lab.ntt.co.jp>
> > > > I'll post a new patch in this way soon.
> > >
> > > Here it is.
> >
> > It contained trailing space and a missing test script. This is the
> > correct patch.
> >
> > > - Relation has new members no_pending_sync and pending_sync that
> > >   work as an instant cache of an entry in the pendingSync hash.
> > >
> > > - Commit-time synchronizing is restored as in Michael's patch.
> > >
> > > - If the relfilenode is replaced, pending_sync for the old node is
> > >   removed. Anyway, this is ignored on abort and meaningless on
> > >   commit.
> > >
> > > - The TAP test is renamed to 012 since some new files have been added.
> > >
> > > Accessing the pending sync hash occurred on every call of
> > > HeapNeedsWAL() (per insertion/update/freeze of a tuple) if any of
> > > the accessed relations had a pending sync. Almost all of those
> > > accesses are eliminated as a result.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
Re: [HACKERS] WAL logging problem in 9.4.3?
Hello,

At Fri, 08 Sep 2017 16:30:01 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI wrote in <20170908.163001.53230385.horiguchi.kyot...@lab.ntt.co.jp>
> > >> 2017-04-13 12:11:27.065 JST [85441] t/102_vacuumdb_stages.pl
> > >> STATEMENT: ANALYZE;
> > >> 2017-04-13 12:12:25.766 JST [85492] LOG: BufferNeedsWAL: pendingSyncs
> > >> = 0x0, no_pending_sync = 0
> > >>
> > >> - lsn = XLogInsert(RM_SMGR_ID,
> > >> -     XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
> > >> + rel->no_pending_sync = false;
> > >> + rel->pending_sync = pending;
> > >> + }
> > >>
> > >> It seems to me that those flags and the pending_sync data should be
> > >> kept in the context of the backend process and not be part of the
> > >> Relation data...
> > >
> > > I understand that the context of "backend process" means
> > > storage.c local. I don't mind the context the data lives in,
> > > but that is the only place I found that can get rid of frequent
> > > hash searching. For pending deletions, just appending to a list is
> > > enough and costs almost nothing; on the other hand, pending syncs
> > > are required to be referenced, sometimes very frequently.
> > >
> > >> +void
> > >> +RecordPendingSync(Relation rel)
> > >> I don't think that I agree that this should be part of relcache.c. The
> > >> syncs should be tracked out of the relation context.
> > >
> > > Yeah.. It's in storage.c in the latest patch. (Sorry for the
> > > duplicate name.) I think it is a kind of bond between smgr and
> > > relation.
> > >
> > >> Seeing how invasive this change is, I would also advocate for this
> > >> patch as only being a HEAD-only change; not many people are
> > >> complaining about this optimization of TRUNCATE missing when wal_level
> > >> = minimal, and this needs a very careful review.
> > >
> > > Agreed.
> > >
> > >> Should I code something? Or Horiguchi-san, would you take care of it?
> > >> The previous crash I saw has been taken care of, but it's been really
> > >> some time since I looked at this patch...
> > >
> > > My point is that hash-search on every tuple insertion should be avoided
> > > even if it happens rarely. Once it was a bit apart from your
> > > original patch, but in the latest patch the significant part
> > > (pending-sync hash) is revived from the original one.
> >
> > This patch has followed along since CF 2016-03; do we think we can reach a
> > conclusion in this CF? It was marked as "Waiting on Author"; based on
> > developments since in this thread, I've changed it back to "Needs Review"
> > again.
>
> I managed to reload its context into my head. It doesn't apply on
> the current master and needs some amendment. I'm going to work on
> this.

Rebased and slightly modified.

Michael's latest patch, on which this patch is piggybacking, seems to work perfectly. The motivation for my addition is to avoid the frequent hash accesses (I think specifically one per tuple modification) that occur while pending syncs exist. The hash contains at least six or more entries. The attached patch emits extra log messages, to be removed in the final shape, showing how much the addition reduces hash accesses.

As a basis for determining the worthiness of the additional mechanism, I'll show an example set of queries below. In the log messages, "r" is the relation oid, "b" is the buffer number, "hash" is the pointer to the backend-global hash table for pending syncs, "ent" is the entry in the hash belonging to the relation, and "neg" is a flag indicating that the existing pending sync hash has no entry for the relation.

=# set log_min_messages to debug2;
=# begin;
=# create table test1 (a text primary key);
> DEBUG: BufferNeedsWAL(r 2608, b 55): hash = (nil), ent=(nil), neg = 0
# relid=2608, buf=55; hash has not been created
=# insert into test1 values ('inserted row');
> DEBUG: BufferNeedsWAL(r 24807, b 0): hash = (nil), ent=(nil), neg = 0
# relid=24807, first buffer; hash has not been created
=# copy test1 from '//copy_data.txt';
> DEBUG: BufferNeedsWAL(r 24807, b 0): hash = 0x171de00, ent=0x171f390, neg = 0
# hash created, pending sync entry linked; no longer needs hash access
# (repeats for the number of buffers)
COPY 200
=# create table test3 (a text primary key);
> DEBUG: BufferNeedsWAL(r 2608, b 55): hash = 0x171de00, ent=(nil), neg = 1
# no pending sync entry for this relation; no longer needs hash access
=# insert into test3 (select a from generate_series(0, 99) a);
> DEBUG: BufferNeedsWAL(r 24816, b 0): hash = 0x171de00, ent=(nil), neg = 0
> DEBUG: BufferNeedsWAL: accessing hash : not found
> DEBUG: BufferNeedsWAL(r 24816, b 0): hash = 0x171de00, ent=(nil), neg = 1
# this table no longer needs hash access (repeats for the number of tuples)
=# truncate test3;
=# insert into test3 (select a from generate_series(0, 99) a);
> DEBUG: BufferNeedsWAL(r 24816, b 0): hash = 0x171de00, ent=(nil), neg = 0
> DEBUG: BufferNeedsWAL: accessing hash : found
>
Re: [HACKERS] WAL logging problem in 9.4.3?
Thank you for your notification.

At Tue, 5 Sep 2017 12:05:01 +0200, Daniel Gustafsson wrote in
> > On 13 Apr 2017, at 11:42, Kyotaro HORIGUCHI wrote:
> >
> > At Thu, 13 Apr 2017 13:52:40 +0900, Michael Paquier wrote in
> >
> >> On Tue, Apr 11, 2017 at 5:38 PM, Kyotaro HORIGUCHI wrote:
> >>> Sorry, what I have just sent was broken.
> >>
> >> You can use PROVE_TESTS when running make check to select a subset of
> >> tests you want to run. I use that all the time when working on patches
> >> dedicated to certain code paths.
> >
> > Thank you for the information. Removing unwanted test scripts
> > from t/ directories was an annoyance. This makes me happy.
>
> - Relation has new members no_pending_sync and pending_sync that
>   work as an instant cache of an entry in the pendingSync hash.
> - Commit-time synchronizing is restored as in Michael's patch.
> - If the relfilenode is replaced, pending_sync for the old node is
>   removed. Anyway, this is ignored on abort and meaningless on
>   commit.
> - The TAP test is renamed to 012 since some new files have been added.
>
> Accessing the pending sync hash occurred on every call of
> HeapNeedsWAL() (per insertion/update/freeze of a tuple) if any of
> the accessed relations had a pending sync. Almost all of those
> accesses are eliminated as a result.
>
> >> Did you actually test this patch? One of the logs added makes the
> >> tests take a long time to run:
> >
> > Maybe this patch requires make clean since it extends the
> > structure RelationData. (Perhaps I saw the same trouble.)
> >
> >> 2017-04-13 12:11:27.065 JST [85441] t/102_vacuumdb_stages.pl
> >> STATEMENT: ANALYZE;
> >> 2017-04-13 12:12:25.766 JST [85492] LOG: BufferNeedsWAL: pendingSyncs
> >> = 0x0, no_pending_sync = 0
> >>
> >> - lsn = XLogInsert(RM_SMGR_ID,
> >> -     XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
> >> + rel->no_pending_sync = false;
> >> + rel->pending_sync = pending;
> >> + }
> >>
> >> It seems to me that those flags and the pending_sync data should be
> >> kept in the context of the backend process and not be part of the
> >> Relation data...
> >
> > I understand that the context of "backend process" means
> > storage.c local. I don't mind the context the data lives in,
> > but that is the only place I found that can get rid of frequent
> > hash searching. For pending deletions, just appending to a list is
> > enough and costs almost nothing; on the other hand, pending syncs
> > are required to be referenced, sometimes very frequently.
> >
> >> +void
> >> +RecordPendingSync(Relation rel)
> >> I don't think that I agree that this should be part of relcache.c. The
> >> syncs should be tracked out of the relation context.
> >
> > Yeah.. It's in storage.c in the latest patch. (Sorry for the
> > duplicate name.) I think it is a kind of bond between smgr and
> > relation.
> >
> >> Seeing how invasive this change is, I would also advocate for this
> >> patch as only being a HEAD-only change; not many people are
> >> complaining about this optimization of TRUNCATE missing when wal_level
> >> = minimal, and this needs a very careful review.
> >
> > Agreed.
> >
> >> Should I code something? Or Horiguchi-san, would you take care of it?
> >> The previous crash I saw has been taken care of, but it's been really
> >> some time since I looked at this patch...
> >
> > My point is that hash-search on every tuple insertion should be avoided
> > even if it happens rarely. Once it was a bit apart from your
> > original patch, but in the latest patch the significant part
> > (pending-sync hash) is revived from the original one.
>
> This patch has followed along since CF 2016-03; do we think we can reach a
> conclusion in this CF? It was marked as "Waiting on Author"; based on
> developments since in this thread, I've changed it back to "Needs Review"
> again.

I managed to reload its context into my head. It doesn't apply on the current master and needs some amendment. I'm going to work on this.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
Re: [HACKERS] WAL logging problem in 9.4.3?
> On 13 Apr 2017, at 11:42, Kyotaro HORIGUCHI wrote:
>
> At Thu, 13 Apr 2017 13:52:40 +0900, Michael Paquier wrote in
>
>> On Tue, Apr 11, 2017 at 5:38 PM, Kyotaro HORIGUCHI wrote:
>>> Sorry, what I have just sent was broken.
>>
>> You can use PROVE_TESTS when running make check to select a subset of
>> tests you want to run. I use that all the time when working on patches
>> dedicated to certain code paths.
>
> Thank you for the information. Removing unwanted test scripts
> from t/ directories was an annoyance. This makes me happy.
>

- Relation has new members no_pending_sync and pending_sync that work as an instant cache of an entry in the pendingSync hash.
- Commit-time synchronizing is restored as in Michael's patch.
- If the relfilenode is replaced, pending_sync for the old node is removed. Anyway, this is ignored on abort and meaningless on commit.
- The TAP test is renamed to 012 since some new files have been added.

Accessing the pending sync hash occurred on every call of HeapNeedsWAL() (per insertion/update/freeze of a tuple) if any of the accessed relations had a pending sync. Almost all of those accesses are eliminated as a result.

>> Did you actually test this patch? One of the logs added makes the
>> tests take a long time to run:
>
> Maybe this patch requires make clean since it extends the
> structure RelationData. (Perhaps I saw the same trouble.)
>
>> 2017-04-13 12:11:27.065 JST [85441] t/102_vacuumdb_stages.pl
>> STATEMENT: ANALYZE;
>> 2017-04-13 12:12:25.766 JST [85492] LOG: BufferNeedsWAL: pendingSyncs
>> = 0x0, no_pending_sync = 0
>>
>> - lsn = XLogInsert(RM_SMGR_ID,
>> -     XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
>> + rel->no_pending_sync = false;
>> + rel->pending_sync = pending;
>> + }
>>
>> It seems to me that those flags and the pending_sync data should be
>> kept in the context of the backend process and not be part of the
>> Relation data...
>
> I understand that the context of "backend process" means
> storage.c local. I don't mind the context the data lives in,
> but that is the only place I found that can get rid of frequent
> hash searching. For pending deletions, just appending to a list is
> enough and costs almost nothing; on the other hand, pending syncs
> are required to be referenced, sometimes very frequently.
>
>> +void
>> +RecordPendingSync(Relation rel)
>> I don't think that I agree that this should be part of relcache.c. The
>> syncs should be tracked out of the relation context.
>
> Yeah.. It's in storage.c in the latest patch. (Sorry for the
> duplicate name.) I think it is a kind of bond between smgr and
> relation.
>
>> Seeing how invasive this change is, I would also advocate for this
>> patch as only being a HEAD-only change; not many people are
>> complaining about this optimization of TRUNCATE missing when wal_level
>> = minimal, and this needs a very careful review.
>
> Agreed.
>
>> Should I code something? Or Horiguchi-san, would you take care of it?
>> The previous crash I saw has been taken care of, but it's been really
>> some time since I looked at this patch...
>
> My point is that hash-search on every tuple insertion should be avoided
> even if it happens rarely. Once it was a bit apart from your
> original patch, but in the latest patch the significant part
> (pending-sync hash) is revived from the original one.

This patch has followed along since CF 2016-03; do we think we can reach a conclusion in this CF? It was marked as "Waiting on Author"; based on developments since in this thread, I've changed it back to "Needs Review" again.

cheers ./daniel
Re: [HACKERS] WAL logging problem in 9.4.3?
At Thu, 13 Apr 2017 13:52:40 +0900, Michael Paquier wrote in
> On Tue, Apr 11, 2017 at 5:38 PM, Kyotaro HORIGUCHI wrote:
> > Sorry, what I have just sent was broken.
>
> You can use PROVE_TESTS when running make check to select a subset of
> tests you want to run. I use that all the time when working on patches
> dedicated to certain code paths.

Thank you for the information. Removing unwanted test scripts from t/ directories was an annoyance. This makes me happy.

> >> - Relation has new members no_pending_sync and pending_sync that
> >>   work as an instant cache of an entry in the pendingSync hash.
> >> - Commit-time synchronizing is restored as in Michael's patch.
> >> - If the relfilenode is replaced, pending_sync for the old node is
> >>   removed. Anyway, this is ignored on abort and meaningless on
> >>   commit.
> >> - The TAP test is renamed to 012 since some new files have been added.
> >>
> >> Accessing the pending sync hash occurred on every call of
> >> HeapNeedsWAL() (per insertion/update/freeze of a tuple) if any of
> >> the accessed relations had a pending sync. Almost all of those
> >> accesses are eliminated as a result.
>
> Did you actually test this patch? One of the logs added makes the
> tests take a long time to run:

Maybe this patch requires make clean since it extends the structure RelationData. (Perhaps I saw the same trouble.)

> 2017-04-13 12:11:27.065 JST [85441] t/102_vacuumdb_stages.pl
> STATEMENT: ANALYZE;
> 2017-04-13 12:12:25.766 JST [85492] LOG: BufferNeedsWAL: pendingSyncs
> = 0x0, no_pending_sync = 0
>
> - lsn = XLogInsert(RM_SMGR_ID,
> -     XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
> + rel->no_pending_sync = false;
> + rel->pending_sync = pending;
> + }
>
> It seems to me that those flags and the pending_sync data should be
> kept in the context of the backend process and not be part of the
> Relation data...

I understand that the context of "backend process" means storage.c local. I don't mind the context the data lives in, but that is the only place I found that can get rid of frequent hash searching. For pending deletions, just appending to a list is enough and costs almost nothing; on the other hand, pending syncs are required to be referenced, sometimes very frequently.

> +void
> +RecordPendingSync(Relation rel)
> I don't think that I agree that this should be part of relcache.c. The
> syncs should be tracked out of the relation context.

Yeah.. It's in storage.c in the latest patch. (Sorry for the duplicate name.) I think it is a kind of bond between smgr and relation.

> Seeing how invasive this change is, I would also advocate for this
> patch as only being a HEAD-only change; not many people are
> complaining about this optimization of TRUNCATE missing when wal_level
> = minimal, and this needs a very careful review.

Agreed.

> Should I code something? Or Horiguchi-san, would you take care of it?
> The previous crash I saw has been taken care of, but it's been really
> some time since I looked at this patch...

My point is that hash-search on every tuple insertion should be avoided even if it happens rarely. Once it was a bit apart from your original patch, but in the latest patch the significant part (pending-sync hash) is revived from the original one.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
Re: [HACKERS] WAL logging problem in 9.4.3?
I'd like to add a supplementary explanation.

At Tue, 11 Apr 2017 17:38:12 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI wrote in <20170411.173812.133964522.horiguchi.kyot...@lab.ntt.co.jp>
> Sorry, what I have just sent was broken.
>
> At Tue, 11 Apr 2017 17:33:41 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI wrote in
> <20170411.173341.257028732.horiguchi.kyot...@lab.ntt.co.jp>
> > At Tue, 11 Apr 2017 09:56:06 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI wrote in
> > <20170411.095606.245908357.horiguchi.kyot...@lab.ntt.co.jp>
> > > Hello, thank you for looking at this.
> > >
> > > At Fri, 07 Apr 2017 20:38:35 -0400, Tom Lane wrote
> > > in <27309.1491611...@sss.pgh.pa.us>
> > > > Alvaro Herrera writes:
> > > > > Interesting. I wonder if it's possible that a relcache invalidation
> > > > > would cause these values to get lost for some reason, because that
> > > > > would be dangerous.
> > > > > I suppose the rationale is that this shouldn't happen because any
> > > > > operation that does things this way must hold an exclusive lock on the
> > > > > relation. But that doesn't guarantee that the relcache entry is
> > > > > completely stable,
> > > >
> > > > It ABSOLUTELY is not safe. Relcache flushes can happen regardless of
> > > > how strong a lock you hold.
> > > >
> > > > regards, tom lane
> > >
> > > Ugh. Yes, relcache invalidation happens anytime and it resets the

The pending locations are not stored in the relcache hash, so the problem here is not invalidation but that Relation objects are created as necessary, anywhere. Even if no invalidation happens, the same thing will happen in a somewhat different form.

> > > added values. pg_stat_info deceived me into thinking that it can store
> > > transient values. But I came up with another thought.
> > >
> > > The reason I proposed it was that I thought that hash_search for every
> > > buffer is not good. Instead, like pg_stat_info, we can link the

> buffer => buffer modification

> > > pending-sync hash entry to Relation. This greatly reduces the
> > > frequency of hash searching.
> > >
> > > I'll post a new patch in this way soon.
> >
> > Here it is.
>
> It contained trailing space and a missing test script. This is the
> correct patch.
>
> > - Relation has new members no_pending_sync and pending_sync that
> >   work as an instant cache of an entry in the pendingSync hash.
> >
> > - Commit-time synchronizing is restored as in Michael's patch.
> >
> > - If the relfilenode is replaced, pending_sync for the old node is
> >   removed. Anyway, this is ignored on abort and meaningless on
> >   commit.
> >
> > - The TAP test is renamed to 012 since some new files have been added.
> >
> > Accessing the pending sync hash occurred on every call of
> > HeapNeedsWAL() (per insertion/update/freeze of a tuple) if any of
> > the accessed relations had a pending sync. Almost all of those
> > accesses are eliminated as a result.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
Re: [HACKERS] WAL logging problem in 9.4.3?
On Tue, Apr 11, 2017 at 5:38 PM, Kyotaro HORIGUCHI wrote:
> Sorry, what I have just sent was broken.

You can use PROVE_TESTS when running make check to select a subset of tests you want to run. I use that all the time when working on patches dedicated to certain code paths.

>> - Relation has new members no_pending_sync and pending_sync that
>>   work as an instant cache of an entry in the pendingSync hash.
>> - Commit-time synchronizing is restored as in Michael's patch.
>> - If the relfilenode is replaced, pending_sync for the old node is
>>   removed. Anyway, this is ignored on abort and meaningless on
>>   commit.
>> - The TAP test is renamed to 012 since some new files have been added.
>>
>> Accessing the pending sync hash occurred on every call of
>> HeapNeedsWAL() (per insertion/update/freeze of a tuple) if any of
>> the accessed relations had a pending sync. Almost all of those
>> accesses are eliminated as a result.

Did you actually test this patch? One of the logs added makes the tests take a long time to run:

2017-04-13 12:11:27.065 JST [85441] t/102_vacuumdb_stages.pl STATEMENT: ANALYZE;
2017-04-13 12:12:25.766 JST [85492] LOG: BufferNeedsWAL: pendingSyncs = 0x0, no_pending_sync = 0

- lsn = XLogInsert(RM_SMGR_ID,
-     XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+ rel->no_pending_sync = false;
+ rel->pending_sync = pending;
+ }

It seems to me that those flags and the pending_sync data should be kept in the context of the backend process and not be part of the Relation data...

+void
+RecordPendingSync(Relation rel)

I don't think that I agree that this should be part of relcache.c. The syncs should be tracked out of the relation context.

Seeing how invasive this change is, I would also advocate for this patch as only being a HEAD-only change; not many people are complaining about this optimization of TRUNCATE missing when wal_level = minimal, and this needs a very careful review.

Should I code something? Or Horiguchi-san, would you take care of it? The previous crash I saw has been taken care of, but it's been really some time since I looked at this patch...

-- 
Michael
Re: [HACKERS] WAL logging problem in 9.4.3?
Sorry, what I have just sent was broken. At Tue, 11 Apr 2017 17:33:41 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHIwrote in <20170411.173341.257028732.horiguchi.kyot...@lab.ntt.co.jp> > At Tue, 11 Apr 2017 09:56:06 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI > wrote in > <20170411.095606.245908357.horiguchi.kyot...@lab.ntt.co.jp> > > Hello, thank you for looking this. > > > > At Fri, 07 Apr 2017 20:38:35 -0400, Tom Lane wrote in > > <27309.1491611...@sss.pgh.pa.us> > > > Alvaro Herrera writes: > > > > Interesting. I wonder if it's possible that a relcache invalidation > > > > would cause these values to get lost for some reason, because that would > > > > be dangerous. > > > > > > > I suppose the rationale is that this shouldn't happen because any > > > > operation that does things this way must hold an exclusive lock on the > > > > relation. But that doesn't guarantee that the relcache entry is > > > > completely stable, > > > > > > It ABSOLUTELY is not safe. Relcache flushes can happen regardless of > > > how strong a lock you hold. > > > > > > regards, tom lane > > > > Ugh. Yes, relcache invalidation happens anytime and it resets the > > added values. pg_stat_info deceived me that it can store > > transient values. But I came up with another thought. > > > > The reason I proposed it was I thought that hash_search for every > > buffer is not good. Instead, like pg_stat_info, we can link the > > buffer => buffer modification > > > pending-sync hash entry to Relation. This greately reduces the > > frequency of hash-searching. > > > > I'll post new patch in this way soon. > > Here it is. It contained tariling space and missing test script. This is the correct patch. > - Relation has new members no_pending_sync and pending_sync that > works as instant cache of an entry in pendingSync hash. > > - Commit-time synchronizing is restored as Michael's patch. > > - If relfilenode is replaced, pending_sync for the old node is > removed. 
Anyway this is ignored on abort and meaningless on > commit. > > - TAP test is renamed to 012 since some new files have been added. > > Accessing the pending sync hash occurred on every call of > HeapNeedsWAL() (per insertion/update/freeze of a tuple) if any of the > accessed relations had a pending sync. Almost all of them are > eliminated as a result. -- Kyotaro Horiguchi NTT Open Source Software Center diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c index 0c3e2b0..23a6d56 100644 --- a/src/backend/access/heap/heapam.c +++ b/src/backend/access/heap/heapam.c @@ -34,6 +34,28 @@ * the POSTGRES heap access method used for all POSTGRES * relations. * + * WAL CONSIDERATIONS + * All heap operations are normally WAL-logged. but there are a few + * exceptions. Temporary and unlogged relations never need to be + * WAL-logged, but we can also skip WAL-logging for a table that was + * created in the same transaction, if we don't need WAL for PITR or + * WAL archival purposes (i.e. if wal_level=minimal), and we fsync() + * the file to disk at COMMIT instead. + * + * The same-relation optimization is not employed automatically on all + * updates to a table that was created in the same transacton, because + * for a small number of changes, it's cheaper to just create the WAL + * records than fsyncing() the whole relation at COMMIT. It is only + * worthwhile for (presumably) large operations like COPY, CLUSTER, + * or VACUUM FULL. Use heap_register_sync() to initiate such an + * operation; it will cause any subsequent updates to the table to skip + * WAL-logging, if possible, and cause the heap to be synced to disk at + * COMMIT. + * + * To make that work, all modifications to heap must use + * HeapNeedsWAL() to check if WAL-logging is needed in this transaction + * for the given block.
+ * *- */ #include "postgres.h" @@ -56,6 +78,7 @@ #include "access/xlogutils.h" #include "catalog/catalog.h" #include "catalog/namespace.h" +#include "catalog/storage.h" #include "miscadmin.h" #include "pgstat.h" #include "storage/bufmgr.h" @@ -2356,12 +2379,6 @@ ReleaseBulkInsertStatePin(BulkInsertState bistate) * The new tuple is stamped with current transaction ID and the specified * command ID. * - * If the HEAP_INSERT_SKIP_WAL option is specified, the new tuple is not - * logged in WAL, even for a non-temp relation. Safe usage of this behavior - * requires that we arrange that all new tuples go into new pages not - * containing any tuples from other transactions, and that the relation gets - * fsync'd before commit. (See also heap_sync() comments) - * * The HEAP_INSERT_SKIP_FSM option is passed directly to * RelationGetBufferForTuple, which see for more info. * @@ -2392,6
Re: [HACKERS] WAL logging problem in 9.4.3?
At Tue, 11 Apr 2017 09:56:06 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI wrote in <20170411.095606.245908357.horiguchi.kyot...@lab.ntt.co.jp> > Hello, thank you for looking at this. > > At Fri, 07 Apr 2017 20:38:35 -0400, Tom Lane wrote in > <27309.1491611...@sss.pgh.pa.us> > > Alvaro Herrera writes: > > > Interesting. I wonder if it's possible that a relcache invalidation > > > would cause these values to get lost for some reason, because that would > > > be dangerous. > > > > > I suppose the rationale is that this shouldn't happen because any > > > operation that does things this way must hold an exclusive lock on the > > > relation. But that doesn't guarantee that the relcache entry is > > > completely stable, > > > > It ABSOLUTELY is not safe. Relcache flushes can happen regardless of > > how strong a lock you hold. > > > > regards, tom lane > > Ugh. Yes, relcache invalidation happens anytime and it resets the > added values. pg_stat_info deceived me into thinking that it can store > transient values. But I came up with another thought. > > The reason I proposed it was I thought that hash_search for every > buffer is not good. Instead, like pg_stat_info, we can link the > pending-sync hash entry to Relation. This greatly reduces the > frequency of hash-searching. > > I'll post a new patch in this way soon. Here it is. - Relation has new members no_pending_sync and pending_sync that work as an instant cache of an entry in the pendingSync hash. - Commit-time synchronizing is restored as in Michael's patch. - If relfilenode is replaced, pending_sync for the old node is removed. Anyway this is ignored on abort and meaningless on commit. - TAP test is renamed to 012 since some new files have been added. Accessing the pending sync hash occurred on every call of HeapNeedsWAL() (per insertion/update/freeze of a tuple) if any of the accessed relations had a pending sync. Almost all of them are eliminated as a result.
regards, -- Kyotaro Horiguchi NTT Open Source Software Center diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c index 0c3e2b0..aa1b97d 100644 --- a/src/backend/access/heap/heapam.c +++ b/src/backend/access/heap/heapam.c @@ -34,6 +34,28 @@ * the POSTGRES heap access method used for all POSTGRES * relations. * + * WAL CONSIDERATIONS + * All heap operations are normally WAL-logged. but there are a few + * exceptions. Temporary and unlogged relations never need to be + * WAL-logged, but we can also skip WAL-logging for a table that was + * created in the same transaction, if we don't need WAL for PITR or + * WAL archival purposes (i.e. if wal_level=minimal), and we fsync() + * the file to disk at COMMIT instead. + * + * The same-relation optimization is not employed automatically on all + * updates to a table that was created in the same transacton, because + * for a small number of changes, it's cheaper to just create the WAL + * records than fsyncing() the whole relation at COMMIT. It is only + * worthwhile for (presumably) large operations like COPY, CLUSTER, + * or VACUUM FULL. Use heap_register_sync() to initiate such an + * operation; it will cause any subsequent updates to the table to skip + * WAL-logging, if possible, and cause the heap to be synced to disk at + * COMMIT. + * + * To make that work, all modifications to heap must use + * HeapNeedsWAL() to check if WAL-logging is needed in this transaction + * for the given block. + * *- */ #include "postgres.h" @@ -56,6 +78,7 @@ #include "access/xlogutils.h" #include "catalog/catalog.h" #include "catalog/namespace.h" +#include "catalog/storage.h" #include "miscadmin.h" #include "pgstat.h" #include "storage/bufmgr.h" @@ -2356,12 +2379,6 @@ ReleaseBulkInsertStatePin(BulkInsertState bistate) * The new tuple is stamped with current transaction ID and the specified * command ID. 
* - * If the HEAP_INSERT_SKIP_WAL option is specified, the new tuple is not - * logged in WAL, even for a non-temp relation. Safe usage of this behavior - * requires that we arrange that all new tuples go into new pages not - * containing any tuples from other transactions, and that the relation gets - * fsync'd before commit. (See also heap_sync() comments) - * * The HEAP_INSERT_SKIP_FSM option is passed directly to * RelationGetBufferForTuple, which see for more info. * @@ -2465,7 +2482,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid, MarkBufferDirty(buffer); /* XLOG stuff */ - if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation)) + if (BufferNeedsWAL(relation, buffer)) { xl_heap_insert xlrec; xl_heap_header xlhdr; @@ -2664,12 +2681,10 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples, int ndone; char
Re: [HACKERS] WAL logging problem in 9.4.3?
Hello, thank you for looking at this. At Fri, 07 Apr 2017 20:38:35 -0400, Tom Lane wrote in <27309.1491611...@sss.pgh.pa.us> > Alvaro Herrera writes: > > Interesting. I wonder if it's possible that a relcache invalidation > > would cause these values to get lost for some reason, because that would > > be dangerous. > > > I suppose the rationale is that this shouldn't happen because any > > operation that does things this way must hold an exclusive lock on the > > relation. But that doesn't guarantee that the relcache entry is > > completely stable, > > It ABSOLUTELY is not safe. Relcache flushes can happen regardless of > how strong a lock you hold. > > regards, tom lane Ugh. Yes, relcache invalidation happens anytime and it resets the added values. pg_stat_info deceived me into thinking that it can store transient values. But I came up with another thought. The reason I proposed it was I thought that hash_search for every buffer is not good. Instead, like pg_stat_info, we can link the pending-sync hash entry to Relation. This greatly reduces the frequency of hash-searching. I'll post a new patch in this way soon. regards, -- Kyotaro Horiguchi NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WAL logging problem in 9.4.3?
Alvaro Herrera writes: > Interesting. I wonder if it's possible that a relcache invalidation > would cause these values to get lost for some reason, because that would > be dangerous. > I suppose the rationale is that this shouldn't happen because any > operation that does things this way must hold an exclusive lock on the > relation. But that doesn't guarantee that the relcache entry is > completely stable, It ABSOLUTELY is not safe. Relcache flushes can happen regardless of how strong a lock you hold. regards, tom lane
Re: [HACKERS] WAL logging problem in 9.4.3?
Alvaro Herrera wrote: > I suppose the rationale is that this shouldn't happen because any > operation that does things this way must hold an exclusive lock on the > relation. But that doesn't guarantee that the relcache entry is > completely stable, does it? If we can get proof of that, then this > technique should be safe, I think. It occurs to me that in order to test this we could run the recovery tests (including Michael's new 006 file, which you didn't include in your patch) under -D CLOBBER_CACHE_ALWAYS. I think that'd be sufficient proof that it is solid. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] WAL logging problem in 9.4.3?
I have claimed this patch as committer FWIW. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] WAL logging problem in 9.4.3?
Kyotaro HORIGUCHI wrote: > The attached patch is quiiiccck-and-dirty-hack of Michael's patch > just as a PoC of my proposal quoted above. This also passes the > 006 test. The major changes are the following. > > - Moved sync_above and truncated_to into RelationData. Interesting. I wonder if it's possible that a relcache invalidation would cause these values to get lost for some reason, because that would be dangerous. I suppose the rationale is that this shouldn't happen because any operation that does things this way must hold an exclusive lock on the relation. But that doesn't guarantee that the relcache entry is completely stable, does it? If we can get proof of that, then this technique should be safe, I think. In your version of the patch, which I spent some time skimming, I am missing comments on various functions. I added some as I went along, including one XXX indicating it must be filled. RecordPendingSync() should really live in relcache.c (and probably get a different name). > X I feel that I have dropped one of the features of the original > patch during the hack, but I don't recall it clearly now:( Hah :-) > X I haven't considered relfilenode replacement, which didn't matter > for the original patch. (but there are a few places to consider). Hmm ... Please provide. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c index 0c3e2b0..aa1b97d 100644 --- a/src/backend/access/heap/heapam.c +++ b/src/backend/access/heap/heapam.c @@ -34,6 +34,28 @@ * the POSTGRES heap access method used for all POSTGRES * relations. * + * WAL CONSIDERATIONS + * All heap operations are normally WAL-logged. but there are a few + * exceptions.
Temporary and unlogged relations never need to be + * WAL-logged, but we can also skip WAL-logging for a table that was + * created in the same transaction, if we don't need WAL for PITR or + * WAL archival purposes (i.e. if wal_level=minimal), and we fsync() + * the file to disk at COMMIT instead. + * + * The same-relation optimization is not employed automatically on all + * updates to a table that was created in the same transacton, because + * for a small number of changes, it's cheaper to just create the WAL + * records than fsyncing() the whole relation at COMMIT. It is only + * worthwhile for (presumably) large operations like COPY, CLUSTER, + * or VACUUM FULL. Use heap_register_sync() to initiate such an + * operation; it will cause any subsequent updates to the table to skip + * WAL-logging, if possible, and cause the heap to be synced to disk at + * COMMIT. + * + * To make that work, all modifications to heap must use + * HeapNeedsWAL() to check if WAL-logging is needed in this transaction + * for the given block. + * *- */ #include "postgres.h" @@ -56,6 +78,7 @@ #include "access/xlogutils.h" #include "catalog/catalog.h" #include "catalog/namespace.h" +#include "catalog/storage.h" #include "miscadmin.h" #include "pgstat.h" #include "storage/bufmgr.h" @@ -2356,12 +2379,6 @@ ReleaseBulkInsertStatePin(BulkInsertState bistate) * The new tuple is stamped with current transaction ID and the specified * command ID. * - * If the HEAP_INSERT_SKIP_WAL option is specified, the new tuple is not - * logged in WAL, even for a non-temp relation. Safe usage of this behavior - * requires that we arrange that all new tuples go into new pages not - * containing any tuples from other transactions, and that the relation gets - * fsync'd before commit. (See also heap_sync() comments) - * * The HEAP_INSERT_SKIP_FSM option is passed directly to * RelationGetBufferForTuple, which see for more info. 
* @@ -2465,7 +2482,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid, MarkBufferDirty(buffer); /* XLOG stuff */ - if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation)) + if (BufferNeedsWAL(relation, buffer)) { xl_heap_insert xlrec; xl_heap_header xlhdr; @@ -2664,12 +2681,10 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples, int ndone; char *scratch = NULL; Pagepage; - boolneedwal; SizesaveFreeSpace; boolneed_tuple_data = RelationIsLogicallyLogged(relation); boolneed_cids = RelationIsAccessibleInLogicalDecoding(relation); - needwal = !(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation); saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
Re: [HACKERS] WAL logging problem in 9.4.3?
On 1/30/17 11:33 PM, Michael Paquier wrote: > On Fri, Dec 2, 2016 at 1:39 PM, Haribabu Kommi wrote: >> The latest proposed patch still has problems. >> Closed in 2016-11 commitfest with "moved to next CF" status because of a bug >> fix patch. >> Please feel free to update the status once you submit the updated patch. > And moved to CF 2017-03... Are there any plans to post a new patch? This thread is now 18 months old and it would be good to get a resolution in this CF. Thanks, -- -David da...@pgmasters.net
Re: [HACKERS] WAL logging problem in 9.4.3?
On Fri, Dec 2, 2016 at 1:39 PM, Haribabu Kommi wrote: > The latest proposed patch still has problems. > Closed in 2016-11 commitfest with "moved to next CF" status because of a bug > fix patch. > Please feel free to update the status once you submit the updated patch. And moved to CF 2017-03... -- Michael
Re: [HACKERS] WAL logging problem in 9.4.3?
On Wed, Nov 9, 2016 at 5:55 PM, Michael Paquier wrote: > > > On Wed, Nov 9, 2016 at 9:27 AM, Michael Paquier > wrote: > > On Wed, Nov 9, 2016 at 5:39 AM, Robert Haas > wrote: > >> On Thu, Feb 4, 2016 at 7:24 AM, Heikki Linnakangas > wrote: > >>> I dropped the ball on this one back in July, so here's an attempt to > revive > >>> this thread. > >>> > >>> I spent some time fixing the remaining issues with the prototype patch > I > >>> posted earlier, and rebased that on top of current git master. See > attached. > >>> > >>> Some review of that would be nice. If there are no major issues with > it, I'm > >>> going to create backpatchable versions of this for 9.4 and below. > >> > >> Are you going to commit something here? This thread and patch are > >> now 14 months old, which is a long time to make people wait for a bug > >> fix. The status in the CF is "Ready for Committer" although I am not > >> sure if that's accurate. > > > > "Needs Review" is definitely a better definition of its current state. > > The last time I had a look at this patch I thought that it was in > > pretty good shape (not Horiguchi-san's version, but the one in > > https://www.postgresql.org/message-id/CAB7nPqR+3JjS=JB3R= > axxkxcyeb-q77u-erw7_ukajctwnt...@mail.gmail.com). > > With some of the recent changes, surely it needs a second look, things > > related to heap handling tend to rot quickly. > > > > I'll look into it once again by the end of this week if Heikki does > > not show up, the rest will be on him I am afraid...
> > I have been able to hit a crash with recovery test 008: > (lldb) bt > * thread #1: tid = 0x, 0x7fff96d48f06 > libsystem_kernel.dylib`__pthread_kill > + 10, stop reason = signal SIGSTOP > * frame #0: 0x7fff96d48f06 libsystem_kernel.dylib`__pthread_kill + > 10 > frame #1: 0x7fff9102e4ec libsystem_pthread.dylib`pthread_kill + 90 > frame #2: 0x7fff8e5cc6df libsystem_c.dylib`abort + 129 > frame #3: 0x000106ef10f0 > postgres`ExceptionalCondition(conditionName="!(( > !( ((void) ((bool) (! (!((buffer) <= NBuffers && (buffer) >= -NLocBuffer)) > || (ExceptionalCondition(\"!((buffer) <= NBuffers && (buffer) >= > -NLocBuffer)\", (\"FailedAssertion\"), \"bufmgr.c\", 2593), 0, (buffer) > != 0 ) ? ((bool) 0) : ((buffer) < 0) ? (LocalRefCount[-(buffer) - 1] > 0) : > (GetPrivateRefCount(buffer) > 0) ))", errorType="FailedAssertion", > fileName="bufmgr.c", lineNumber=2593) + 128 at assert.c:54 > frame #4: 0x000106cf4a2c postgres`BufferGetBlockNumber(buffer=0) > + 204 at bufmgr.c:2593 > frame #5: 0x00010694e6ad postgres`HeapNeedsWAL(rel=0x7f9454804118, > buf=0) + 61 at heapam.c:9234 > frame #6: 0x00010696d8bd > postgres`visibilitymap_set(rel=0x7f9454804118, > heapBlk=1, heapBuf=0, recptr=50841176, vmBuf=118, cutoff_xid=866, > flags='\x01') + 989 at visibilitymap.c:310 > frame #7: 0x00010695d020 > postgres`heap_xlog_visible(record=0x7f94520035d0) > + 896 at heapam.c:8148 > frame #8: 0x00010695c582 > postgres`heap2_redo(record=0x7f94520035d0) > + 242 at heapam.c:9107 > frame #9: 0x0001069d132d postgres`StartupXLOG + 9181 at xlog.c:6950 > frame #10: 0x000106c9d783 postgres`StartupProcessMain + 339 at > startup.c:216 > frame #11: 0x0001069ee6ec postgres`AuxiliaryProcessMain(argc=2, > argv=0x7fff59316d80) + 1676 at bootstrap.c:420 > frame #12: 0x000106c98002 > postgres`StartChildProcess(type=StartupProcess) > + 322 at postmaster.c:5221 > frame #13: 0x000106c96031 postgres`PostmasterMain(argc=3, > argv=0x7f9451c04210) + 6033 at postmaster.c:1301 > frame #14: 0x000106bc30cf 
postgres`main(argc=3, > argv=0x7f9451c04210) + 751 at main.c:228 > (lldb) up 1 > frame #4: 0x000106cf4a2c postgres`BufferGetBlockNumber(buffer=0) + > 204 at bufmgr.c:2593 >2590{ >2591BufferDesc *bufHdr; >2592 > -> 2593Assert(BufferIsPinned(buffer)); >2594 >2595if (BufferIsLocal(buffer)) >2596bufHdr = GetLocalBufferDescriptor(-buffer - 1); > The latest proposed patch still having problems. Closed in 2016-11 commitfest with "moved to next CF" status because of a bug fix patch. Please feel free to update the status once you submit the updated patch. Regards, Hari Babu Fujitsu Australia
Re: [HACKERS] WAL logging problem in 9.4.3?
On Wed, Nov 9, 2016 at 9:27 AM, Michael Paquierwrote: > On Wed, Nov 9, 2016 at 5:39 AM, Robert Haas wrote: >> On Thu, Feb 4, 2016 at 7:24 AM, Heikki Linnakangas wrote: >>> I dropped the ball on this one back in July, so here's an attempt to revive >>> this thread. >>> >>> I spent some time fixing the remaining issues with the prototype patch I >>> posted earlier, and rebased that on top of current git master. See attached. >>> >>> Some review of that would be nice. If there are no major issues with it, I'm >>> going to create backpatchable versions of this for 9.4 and below. >> >> Are you going to do commit something here? This thread and patch are >> now 14 months old, which is a long time to make people wait for a bug >> fix. The status in the CF is "Ready for Committer" although I am not >> sure if that's accurate. > > "Needs Review" is definitely a better definition of its current state. > The last time I had a look at this patch I thought that it was in > pretty good shape (not Horiguchi-san's version, but the one in > https://www.postgresql.org/message-id/CAB7nPqR+3JjS=JB3R=axxkxcyeb-q77u-erw7_ukajctwnt...@mail.gmail.com ). > With some of the recent changes, surely it needs a second look, things > related to heap handling tend to rot quickly. > > I'll look into it once again by the end of this week if Heikki does > not show up, the rest will be on him I am afraid... I have been able to hit a crash with recovery test 008: (lldb) bt * thread #1: tid = 0x, 0x7fff96d48f06 libsystem_kernel.dylib`__pthread_kill + 10, stop reason = signal SIGSTOP * frame #0: 0x7fff96d48f06 libsystem_kernel.dylib`__pthread_kill + 10 frame #1: 0x7fff9102e4ec libsystem_pthread.dylib`pthread_kill + 90 frame #2: 0x7fff8e5cc6df libsystem_c.dylib`abort + 129 frame #3: 0x000106ef10f0 postgres`ExceptionalCondition(conditionName="!(( !( ((void) ((bool) (! 
(!((buffer) <= NBuffers && (buffer) >= -NLocBuffer)) || (ExceptionalCondition(\"!((buffer) <= NBuffers && (buffer) >= -NLocBuffer)\", (\"FailedAssertion\"), \"bufmgr.c\", 2593), 0, (buffer) != 0 ) ? ((bool) 0) : ((buffer) < 0) ? (LocalRefCount[-(buffer) - 1] > 0) : (GetPrivateRefCount(buffer) > 0) ))", errorType="FailedAssertion", fileName="bufmgr.c", lineNumber=2593) + 128 at assert.c:54 frame #4: 0x000106cf4a2c postgres`BufferGetBlockNumber(buffer=0) + 204 at bufmgr.c:2593 frame #5: 0x00010694e6ad postgres`HeapNeedsWAL(rel=0x7f9454804118, buf=0) + 61 at heapam.c:9234 frame #6: 0x00010696d8bd postgres`visibilitymap_set(rel=0x7f9454804118, heapBlk=1, heapBuf=0, recptr=50841176, vmBuf=118, cutoff_xid=866, flags='\x01') + 989 at visibilitymap.c:310 frame #7: 0x00010695d020 postgres`heap_xlog_visible(record=0x7f94520035d0) + 896 at heapam.c:8148 frame #8: 0x00010695c582 postgres`heap2_redo(record=0x7f94520035d0) + 242 at heapam.c:9107 frame #9: 0x0001069d132d postgres`StartupXLOG + 9181 at xlog.c:6950 frame #10: 0x000106c9d783 postgres`StartupProcessMain + 339 at startup.c:216 frame #11: 0x0001069ee6ec postgres`AuxiliaryProcessMain(argc=2, argv=0x7fff59316d80) + 1676 at bootstrap.c:420 frame #12: 0x000106c98002 postgres`StartChildProcess(type=StartupProcess) + 322 at postmaster.c:5221 frame #13: 0x000106c96031 postgres`PostmasterMain(argc=3, argv=0x7f9451c04210) + 6033 at postmaster.c:1301 frame #14: 0x000106bc30cf postgres`main(argc=3, argv=0x7f9451c04210) + 751 at main.c:228 (lldb) up 1 frame #4: 0x000106cf4a2c postgres`BufferGetBlockNumber(buffer=0) + 204 at bufmgr.c:2593 2590{ 2591BufferDesc *bufHdr; 2592 -> 2593Assert(BufferIsPinned(buffer)); 2594 2595if (BufferIsLocal(buffer)) 2596bufHdr = GetLocalBufferDescriptor(-buffer - 1); -- Michael
Re: [HACKERS] WAL logging problem in 9.4.3?
On Wed, Nov 9, 2016 at 5:39 AM, Robert Haas wrote: > On Thu, Feb 4, 2016 at 7:24 AM, Heikki Linnakangas wrote: >> I dropped the ball on this one back in July, so here's an attempt to revive >> this thread. >> >> I spent some time fixing the remaining issues with the prototype patch I >> posted earlier, and rebased that on top of current git master. See attached. >> >> Some review of that would be nice. If there are no major issues with it, I'm >> going to create backpatchable versions of this for 9.4 and below. > > Are you going to commit something here? This thread and patch are > now 14 months old, which is a long time to make people wait for a bug > fix. The status in the CF is "Ready for Committer" although I am not > sure if that's accurate. "Needs Review" is definitely a better definition of its current state. The last time I had a look at this patch I thought that it was in pretty good shape (not Horiguchi-san's version, but the one in https://www.postgresql.org/message-id/CAB7nPqR+3JjS=JB3R=axxkxcyeb-q77u-erw7_ukajctwnt...@mail.gmail.com ). With some of the recent changes, surely it needs a second look, things related to heap handling tend to rot quickly. I'll look into it once again by the end of this week if Heikki does not show up, the rest will be on him I am afraid... -- Michael
Re: [HACKERS] WAL logging problem in 9.4.3?
On Thu, Feb 4, 2016 at 7:24 AM, Heikki Linnakangas wrote: > I dropped the ball on this one back in July, so here's an attempt to revive > this thread. > > I spent some time fixing the remaining issues with the prototype patch I > posted earlier, and rebased that on top of current git master. See attached. > > Some review of that would be nice. If there are no major issues with it, I'm > going to create backpatchable versions of this for 9.4 and below. Heikki: Are you going to commit something here? This thread and patch are now 14 months old, which is a long time to make people wait for a bug fix. The status in the CF is "Ready for Committer" although I am not sure if that's accurate. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] WAL logging problem in 9.4.3?
Hi, At Sun, 2 Oct 2016 21:43:46 +0900, Michael Paquier wrote in > On Thu, Sep 29, 2016 at 10:02 PM, Kyotaro HORIGUCHI > wrote: > > Hello, > > > > At Thu, 29 Sep 2016 16:59:55 +0900, Michael Paquier > > wrote in > > > >> On Mon, Sep 26, 2016 at 5:03 PM, Kyotaro HORIGUCHI > >> wrote: > >> > Hello, I return to this before my things:) > >> > > >> > Though I haven't played with the patch yet.. > >> > >> Be sure to run the test cases in the patch or base your tests on them then! > > > > All items of 006_truncate_opt fail on ed0b228 and they are fixed > > with the patch. > > > >> > Though I don't know how it actually impacts the performance, it > >> > seems to me that we can live with truncated_to and sync_above in > >> > RelationData and BufferNeedsWAL(rel, buf) instead of > >> > HeapNeedsWAL(rel, buf). Anyway up to one entry for one relation > >> > seems to exist at once in the hash. > >> > >> TBH, I still think that the design of this patch as proposed is pretty > >> cool and easy to follow. > > > > It is clean from a certain viewpoint, but the additional hash, > > especially hash-searching on every HeapNeedsWAL, seems to me to be > > unacceptable. Do you see it as acceptable? > > > > > > The attached patch is quiiiccck-and-dirty-hack of Michael's patch > > just as a PoC of my proposal quoted above. This also passes the > > 006 test. The major changes are the following. > > > > - Moved sync_above and truncated_to into RelationData. > > > > - Cleaning up is done in AtEOXact_cleanup instead of explicitly > > calling smgrDoPendingSyncs(). > > > > * BufferNeedsWAL (replacing HeapNeedsWAL) no longer requires > > hash_search. It just refers to the additional members in the > > given Relation. > > > > X I feel that I have dropped one of the features of the original > > patch during the hack, but I don't recall it clearly now:( > > > > X I haven't considered relfilenode replacement, which didn't matter > > for the original patch. (but there are a few places to consider).
> > > > What do you think about this? > > I have moved this patch to next CF. (I still need to look at your patch.) Thanks for considering that. regards, -- Kyotaro Horiguchi NTT Open Source Software Center
Re: [HACKERS] WAL logging problem in 9.4.3?
On Thu, Sep 29, 2016 at 10:02 PM, Kyotaro HORIGUCHI wrote: > Hello, > > At Thu, 29 Sep 2016 16:59:55 +0900, Michael Paquier > wrote in > >> On Mon, Sep 26, 2016 at 5:03 PM, Kyotaro HORIGUCHI >> wrote: >> > Hello, I return to this before my things:) >> > >> > Though I haven't played with the patch yet.. >> >> Be sure to run the test cases in the patch or base your tests on them then! > > All items of 006_truncate_opt fail on ed0b228 and they are fixed > with the patch. > >> > Though I don't know how it actually impacts the performance, it >> > seems to me that we can live with truncated_to and sync_above in >> > RelationData and BufferNeedsWAL(rel, buf) instead of >> > HeapNeedsWAL(rel, buf). Anyway up to one entry for one relation >> > seems to exist at once in the hash. >> >> TBH, I still think that the design of this patch as proposed is pretty >> cool and easy to follow. > > It is clean from a certain viewpoint, but the additional hash, > especially hash-searching on every HeapNeedsWAL, seems to me to be > unacceptable. Do you see it as acceptable? > > > The attached patch is quiiiccck-and-dirty-hack of Michael's patch > just as a PoC of my proposal quoted above. This also passes the > 006 test. The major changes are the following. > > - Moved sync_above and truncated_to into RelationData. > > - Cleaning up is done in AtEOXact_cleanup instead of explicitly > calling smgrDoPendingSyncs(). > > * BufferNeedsWAL (replacing HeapNeedsWAL) no longer requires > hash_search. It just refers to the additional members in the > given Relation. > > X I feel that I have dropped one of the features of the original > patch during the hack, but I don't recall it clearly now:( > > X I haven't considered relfilenode replacement, which didn't matter > for the original patch. (but there are a few places to consider). > > What do you think about this? I have moved this patch to next CF. (I still need to look at your patch.)
-- Michael
Re: [HACKERS] WAL logging problem in 9.4.3?
Hello, At Thu, 29 Sep 2016 16:59:55 +0900, Michael Paquier wrote in > On Mon, Sep 26, 2016 at 5:03 PM, Kyotaro HORIGUCHI > wrote: > > Hello, I return to this before my things:) > > > > Though I haven't played with the patch yet.. > > Be sure to run the test cases in the patch or base your tests on them then! All items of 006_truncate_opt fail on ed0b228 and they are fixed with the patch. > > Though I don't know how it actually impacts the performance, it > > seems to me that we can live with truncated_to and sync_above in > > RelationData and BufferNeedsWAL(rel, buf) instead of > > HeapNeedsWAL(rel, buf). Anyway up to one entry for one relation > > seems to exist at once in the hash. > > TBH, I still think that the design of this patch as proposed is pretty > cool and easy to follow. It is clean from a certain viewpoint, but the additional hash, especially hash-searching on every HeapNeedsWAL, seems to me to be unacceptable. Do you see it as acceptable? The attached patch is quiiiccck-and-dirty-hack of Michael's patch just as a PoC of my proposal quoted above. This also passes the 006 test. The major changes are the following. - Moved sync_above and truncated_to into RelationData. - Cleaning up is done in AtEOXact_cleanup instead of explicitly calling smgrDoPendingSyncs(). * BufferNeedsWAL (replacing HeapNeedsWAL) no longer requires hash_search. It just refers to the additional members in the given Relation. X I feel that I have dropped one of the features of the original patch during the hack, but I don't recall it clearly now:( X I haven't considered relfilenode replacement, which didn't matter for the original patch. (but there are a few places to consider). What do you think about this?
regards, -- Kyotaro Horiguchi NTT Open Source Software Center diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c index 38bba16..02e33cc 100644 --- a/src/backend/access/heap/heapam.c +++ b/src/backend/access/heap/heapam.c @@ -34,6 +34,28 @@ * the POSTGRES heap access method used for all POSTGRES * relations. * + * WAL CONSIDERATIONS + * All heap operations are normally WAL-logged. but there are a few + * exceptions. Temporary and unlogged relations never need to be + * WAL-logged, but we can also skip WAL-logging for a table that was + * created in the same transaction, if we don't need WAL for PITR or + * WAL archival purposes (i.e. if wal_level=minimal), and we fsync() + * the file to disk at COMMIT instead. + * + * The same-relation optimization is not employed automatically on all + * updates to a table that was created in the same transacton, because + * for a small number of changes, it's cheaper to just create the WAL + * records than fsyncing() the whole relation at COMMIT. It is only + * worthwhile for (presumably) large operations like COPY, CLUSTER, + * or VACUUM FULL. Use heap_register_sync() to initiate such an + * operation; it will cause any subsequent updates to the table to skip + * WAL-logging, if possible, and cause the heap to be synced to disk at + * COMMIT. + * + * To make that work, all modifications to heap must use + * HeapNeedsWAL() to check if WAL-logging is needed in this transaction + * for the given block. + * *- */ #include "postgres.h" @@ -55,6 +77,7 @@ #include "access/xlogutils.h" #include "catalog/catalog.h" #include "catalog/namespace.h" +#include "catalog/storage.h" #include "miscadmin.h" #include "pgstat.h" #include "storage/bufmgr.h" @@ -2331,12 +2354,6 @@ FreeBulkInsertState(BulkInsertState bistate) * The new tuple is stamped with current transaction ID and the specified * command ID. 
* - * If the HEAP_INSERT_SKIP_WAL option is specified, the new tuple is not - * logged in WAL, even for a non-temp relation. Safe usage of this behavior - * requires that we arrange that all new tuples go into new pages not - * containing any tuples from other transactions, and that the relation gets - * fsync'd before commit. (See also heap_sync() comments) - * * The HEAP_INSERT_SKIP_FSM option is passed directly to * RelationGetBufferForTuple, which see for more info. * @@ -2440,7 +2457,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid, MarkBufferDirty(buffer); /* XLOG stuff */ - if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation)) + if (BufferNeedsWAL(relation, buffer)) { xl_heap_insert xlrec; xl_heap_header xlhdr; @@ -2639,12 +2656,10 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples, int ndone; char *scratch = NULL; Page page; - bool needwal; Size saveFreeSpace; bool need_tuple_data = RelationIsLogicallyLogged(relation); bool need_cids = RelationIsAccessibleInLogicalDecoding(relation); - needwal
Re: [HACKERS] WAL logging problem in 9.4.3?
On Mon, Sep 26, 2016 at 5:03 PM, Kyotaro HORIGUCHI wrote: > Hello, I return to this before my things:) > > Though I haven't played with the patch yet.. Be sure to run the test cases in the patch or base your tests on them then! > Though I don't know how it actually impacts the performance, it > seems to me that we can live with truncated_to and sync_above in > RelationData and BufferNeedsWAL(rel, buf) instead of > HeapNeedsWAL(rel, buf). Anyway up to one entry for one relation > seems to exist at once in the hash. TBH, I still think that the design of this patch as proposed is pretty cool and easy to follow. -- Michael
Re: [HACKERS] WAL logging problem in 9.4.3?
Hello, I return to this before my things:) Though I haven't played with the patch yet.. At Fri, 29 Jul 2016 16:54:42 +0900, Michael Paquier wrote in > > Well, not that soon at the end, but I am back on that... I have not > > completely reviewed all the code yet, and the case of index relation > > referring to a relation optimized with truncate is still broken, but > > for now here is a rebased patch if people are interested. I am going > > to get as well a TAP test out of my pocket to ease testing. > > The patch I sent yesterday was based on an incorrect version. Attached > is a slightly-modified version of the last one I found here > (https://www.postgresql.org/message-id/56b342f5.1050...@iki.fi), which > is rebased on HEAD at ed0b228. I have also converted the test case > script of upthread into a TAP test in src/test/recovery that covers 3 > cases and I included that in the patch: > 1) CREATE + INSERT + COPY => crash > 2) CREATE + trigger + COPY => crash > 3) CREATE + TRUNCATE + COPY => incorrect number of rows. > The first two tests make the system crash, the third one reports an > incorrect number of rows. At first glance, managing sync_above and truncate_to is workable for these cases, but it seems too complicated for the problem to be resolved. This provides smgr with a capability to manage pending page syncs. But the postpone-page-syncs-or-not issue rather seems to be a matter of the users of that, who are responsible for WAL issuing. Anyway heap_register_sync doesn't use any secret of smgr. So I think this approach binds smgr with Relation too tightly. By this patch, many RelationNeedsWAL calls, which just access a local struct, are replaced with HeapNeedsWAL, which eventually accesses a hash added by this patch. Especially in log_heap_update, it is called for every update of a single tuple (on a relation that needs WAL).
Though I don't know how it actually impacts the performance, it seems to me that we can live with truncated_to and sync_above in RelationData and BufferNeedsWAL(rel, buf) instead of HeapNeedsWAL(rel, buf). Anyway up to one entry for one relation seems to exist at once in the hash. What do you think? regards, -- Kyotaro Horiguchi NTT Open Source Software Center
Re: [HACKERS] WAL logging problem in 9.4.3?
On Thu, Jul 28, 2016 at 4:59 PM, Michael Paquier wrote: > On Wed, Apr 6, 2016 at 3:11 PM, Michael Paquier > wrote: >> On Wed, Mar 23, 2016 at 12:45 PM, Michael Paquier >> wrote: >>> On Wed, Mar 23, 2016 at 11:11 AM, David Steele wrote: I would prefer not to bump it to the next CF unless we decide this will not get fixed for 9.6. >>> >>> It may make sense to add that to the list of open items for 9.6 >>> instead. That's not a feature. >> >> So I have moved this patch to the next CF for now, and will work on >> fixing it rather soonishly as an effort to stabilize 9.6 as well as >> back-branches. > > Well, not that soon at the end, but I am back on that... I have not > completely reviewed all the code yet, and the case of index relation > referring to a relation optimized with truncate is still broken, but > for now here is a rebased patch if people are interested. I am going > to get as well a TAP test out of my pocket to ease testing. The patch I sent yesterday was based on an incorrect version. Attached is a slightly-modified version of the last one I found here (https://www.postgresql.org/message-id/56b342f5.1050...@iki.fi), which is rebased on HEAD at ed0b228. I have also converted the test case script of upthread into a TAP test in src/test/recovery that covers 3 cases and I included that in the patch: 1) CREATE + INSERT + COPY => crash 2) CREATE + trigger + COPY => crash 3) CREATE + TRUNCATE + COPY => incorrect number of rows. The first two tests make the system crash, the third one reports an incorrect number of rows. This is registered in next CF by the way: https://commitfest.postgresql.org/10/528/ Thoughts? -- Michael fix-wal-level-minimal-michael-2.patch
Re: [HACKERS] WAL logging problem in 9.4.3?
On Wed, Apr 6, 2016 at 3:11 PM, Michael Paquierwrote: > On Wed, Mar 23, 2016 at 12:45 PM, Michael Paquier > wrote: >> On Wed, Mar 23, 2016 at 11:11 AM, David Steele wrote: >>> I would prefer not to bump it to the next CF unless we decide this will >>> not get fixed for 9.6. >> >> It may make sense to add that to the list of open items for 9.6 >> instead. That's not a feature. > > So I have moved this patch to the next CF for now, and will work on > fixing it rather soonishly as an effort to stabilize 9.6 as well as > back-branches. Well, not that soon at the end, but I am back on that... I have not completely reviewed all the code yet, and the case of index relation referring to a relation optimized with truncate is still broken, but for now here is a rebased patch if people are interested. I am going to get as well a TAP tests out of my pocket to ease testing. -- Michael diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c index 38bba16..bbc09cd 100644 --- a/src/backend/access/heap/heapam.c +++ b/src/backend/access/heap/heapam.c @@ -55,6 +55,7 @@ #include "access/xlogutils.h" #include "catalog/catalog.h" #include "catalog/namespace.h" +#include "catalog/storage.h" #include "miscadmin.h" #include "pgstat.h" #include "storage/bufmgr.h" @@ -2331,12 +2332,6 @@ FreeBulkInsertState(BulkInsertState bistate) * The new tuple is stamped with current transaction ID and the specified * command ID. * - * If the HEAP_INSERT_SKIP_WAL option is specified, the new tuple is not - * logged in WAL, even for a non-temp relation. Safe usage of this behavior - * requires that we arrange that all new tuples go into new pages not - * containing any tuples from other transactions, and that the relation gets - * fsync'd before commit. (See also heap_sync() comments) - * * The HEAP_INSERT_SKIP_FSM option is passed directly to * RelationGetBufferForTuple, which see for more info. 
* @@ -2440,7 +2435,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid, MarkBufferDirty(buffer); /* XLOG stuff */ - if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation)) + if (HeapNeedsWAL(relation, buffer)) { xl_heap_insert xlrec; xl_heap_header xlhdr; @@ -2639,12 +2634,10 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples, int ndone; char *scratch = NULL; Page page; - bool needwal; Size saveFreeSpace; bool need_tuple_data = RelationIsLogicallyLogged(relation); bool need_cids = RelationIsAccessibleInLogicalDecoding(relation); - needwal = !(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation); saveFreeSpace = RelationGetTargetPageFreeSpace(relation, HEAP_DEFAULT_FILLFACTOR); @@ -2659,7 +2652,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples, * palloc() within a critical section is not safe, so we allocate this * beforehand. */ - if (needwal) + if (RelationNeedsWAL(relation)) scratch = palloc(BLCKSZ); /* @@ -2727,7 +2720,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples, * We don't use heap_multi_insert for catalog tuples yet, but * better be prepared... */ - if (needwal && need_cids) + if (HeapNeedsWAL(relation, buffer) && need_cids) log_heap_new_cid(relation, heaptup); } @@ -2747,7 +2740,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples, MarkBufferDirty(buffer); /* XLOG stuff */ - if (needwal) + if (HeapNeedsWAL(relation, buffer)) { XLogRecPtr recptr; xl_heap_multi_insert *xlrec; @@ -3261,7 +3254,7 @@ l1: * NB: heap_abort_speculative() uses the same xlog record and replay * routines. 
*/ - if (RelationNeedsWAL(relation)) + if (HeapNeedsWAL(relation, buffer)) { xl_heap_delete xlrec; XLogRecPtr recptr; @@ -3982,7 +3975,7 @@ l2: MarkBufferDirty(buffer); - if (RelationNeedsWAL(relation)) + if (HeapNeedsWAL(relation, buffer)) { xl_heap_lock xlrec; XLogRecPtr recptr; @@ -4194,7 +4187,7 @@ l2: MarkBufferDirty(buffer); /* XLOG stuff */ - if (RelationNeedsWAL(relation)) + if (HeapNeedsWAL(relation, buffer)) { XLogRecPtr recptr; @@ -5148,7 +5141,7 @@ failed: * (Also, in a PITR log-shipping or 2PC environment, we have to have XLOG * entries for everything anyway.) */ - if (RelationNeedsWAL(relation)) + if (HeapNeedsWAL(relation, *buffer)) { xl_heap_lock xlrec; XLogRecPtr recptr; @@ -5825,7 +5818,7 @@ l4: MarkBufferDirty(buf); /* XLOG stuff */ - if (RelationNeedsWAL(rel)) + if (HeapNeedsWAL(rel, buf)) { xl_heap_lock_updated xlrec; XLogRecPtr recptr; @@ -5980,7 +5973,7 @@ heap_finish_speculative(Relation relation, HeapTuple tuple) htup->t_ctid = tuple->t_self; /* XLOG stuff */ - if (RelationNeedsWAL(relation)) + if (HeapNeedsWAL(relation, buffer)) { xl_heap_confirm xlrec; XLogRecPtr recptr; @@
Re: [HACKERS] WAL logging problem in 9.4.3?
On Wed, Mar 23, 2016 at 12:45 PM, Michael Paquier wrote: > On Wed, Mar 23, 2016 at 11:11 AM, David Steele wrote: >> I would prefer not to bump it to the next CF unless we decide this will >> not get fixed for 9.6. > > It may make sense to add that to the list of open items for 9.6 > instead. That's not a feature. So I have moved this patch to the next CF for now, and will work on fixing it rather soonishly as an effort to stabilize 9.6 as well as back-branches. -- Michael
Re: [HACKERS] WAL logging problem in 9.4.3?
On Wed, Mar 23, 2016 at 11:11 AM, David Steele wrote: > I would prefer not to bump it to the next CF unless we decide this will > not get fixed for 9.6. It may make sense to add that to the list of open items for 9.6 instead. That's not a feature. -- Michael
Re: [HACKERS] WAL logging problem in 9.4.3?
On 3/22/16 8:54 PM, Michael Paquier wrote: > On Wed, Mar 23, 2016 at 9:52 AM, Michael Paquier > wrote: >> On Wed, Mar 23, 2016 at 1:38 AM, David Steele wrote: >>> On 3/15/16 10:01 PM, Kyotaro HORIGUCHI wrote: >>> Ok, I understand that this is not an issue in a hurry. I'll go to another patch that needs review. >>> >>> Since we're getting towards the end of the CF is it time to pick this up >>> again? >> >> Perhaps not. This is a legit bug with an unfinished patch (see index >> relation truncation) that is going to need a careful review. I don't >> think that this should be impacted by the 4/8 feature freeze, so we >> could still work on that after the embargo and we've had this bug for >> months actually. FWIW, I am still planning to work on it once the CF >> is done, in order to keep my manpower focused on actual patch reviews >> as much as possible... > > In short, we may want to bump that to next CF... I have already marked > this ticket as something to work on soonish on my side, so it does not > change much seen from here if it's part of the next CF. What we should > just be sure is not to lose track of its existence. I would prefer not to bump it to the next CF unless we decide this will not get fixed for 9.6. -- -David da...@pgmasters.net
Re: [HACKERS] WAL logging problem in 9.4.3?
On Wed, Mar 23, 2016 at 9:52 AM, Michael Paquier wrote: > On Wed, Mar 23, 2016 at 1:38 AM, David Steele wrote: >> On 3/15/16 10:01 PM, Kyotaro HORIGUCHI wrote: >> >>> Ok, I understand that this is not an issue in a hurry. I'll go to >>> another patch that needs review. >> >> Since we're getting towards the end of the CF is it time to pick this up >> again? > > Perhaps not. This is a legit bug with an unfinished patch (see index > relation truncation) that is going to need a careful review. I don't > think that this should be impacted by the 4/8 feature freeze, so we > could still work on that after the embargo and we've had this bug for > months actually. FWIW, I am still planning to work on it once the CF > is done, in order to keep my manpower focused on actual patch reviews > as much as possible... In short, we may want to bump that to next CF... I have already marked this ticket as something to work on soonish on my side, so it does not change much seen from here if it's part of the next CF. What we should just be sure is not to lose track of its existence. -- Michael
Re: [HACKERS] WAL logging problem in 9.4.3?
On Wed, Mar 23, 2016 at 1:38 AM, David Steele wrote: > On 3/15/16 10:01 PM, Kyotaro HORIGUCHI wrote: > >> Ok, I understand that this is not an issue in a hurry. I'll go to >> another patch that needs review. > > Since we're getting towards the end of the CF is it time to pick this up > again? Perhaps not. This is a legit bug with an unfinished patch (see index relation truncation) that is going to need a careful review. I don't think that this should be impacted by the 4/8 feature freeze, so we could still work on that after the embargo and we've had this bug for months actually. FWIW, I am still planning to work on it once the CF is done, in order to keep my manpower focused on actual patch reviews as much as possible... -- Michael
Re: [HACKERS] WAL logging problem in 9.4.3?
On 3/15/16 10:01 PM, Kyotaro HORIGUCHI wrote: > Ok, I understand that this is not an issue in a hurry. I'll go to > another patch that needs review. Since we're getting towards the end of the CF is it time to pick this up again? Thanks, -- -David da...@pgmasters.net
Re: [HACKERS] WAL logging problem in 9.4.3?
Thank you for the comment. I understand that this is not an issue in a hurry, so don't bother to reply. At Tue, 15 Mar 2016 18:21:34 +0100, Michael Paquier wrote in
Re: [HACKERS] WAL logging problem in 9.4.3?
On Fri, Mar 11, 2016 at 9:32 AM, Kyotaro HORIGUCHI wrote: > At Fri, 19 Feb 2016 22:27:00 +0900, Michael Paquier > wrote in > >> > Worth noting that this patch does not address the problem with index >> > relations when a TRUNCATE is used in the same transaction as its > > Focusing on this issue, what we should do is somehow build an empty > index just after an index truncation. The attached patch does the > following things to fix this. > > - make index_build use ambuildempty when the relation on which > the index will be built is apparently empty. That is, when the > relation has no block. > - add one parameter "persistent" to ambuildempty(). It behaves as > before if the parameter is false. If not, it creates an empty > index on MAIN_FORK and emits logs even if wal_level is minimal. Hm. It seems to me that this patch is just a bandaid for the real problem, which is that we should not TRUNCATE the underlying index relations when the TRUNCATE optimization is running. In short I would let the empty routines in AM code paths alone, and just continue using them for the generation of INIT_FORKNUM with unlogged relations. Your patch is not something backpatchable anyway, I think. > The new parameter 'persistent' would better be forknum because > it actually represents the persistency of the index to be > created. But I'm out of time now.. I actually have some users running with wal_level set to minimal, and even if I don't think they use this optimization, we had better fix even index relations at the same time as table relations.. I'll try to get some time once the patch review storm goes down a little, except if someone beats me to it first. -- Michael
Re: [HACKERS] WAL logging problem in 9.4.3?
Hello, I considered on the original issue. At Fri, 19 Feb 2016 22:27:00 +0900, Michael Paquierwrote in > > Worth noting that this patch does not address the problem with index > > relations when a TRUNCATE is used in the same transaction as its Focusing this issue, what we should do is somehow building empty index just after a index truncation. The attached patch does the following things to fix this. - make index_build use ambuildempty when the relation on which the index will be built is apparently empty. That is, when the relation has no block. - add one parameter "persistent" to ambuildempty(). It behaves as before if the parameter is false. If not, it creates an empty index on MAIN_FORK and emits logs even if wal_level is minimal. Creation of an index for an empty table can be safely done by ambuildempty, since it creates the image for init fork, which can be simply copied as main fork on initialization. And the heap is always empty when RelationTruncateIndexes calls index_build. For nonempty tables, ambuild properly initializes the new index. The new parameter 'persistent' would be better be forknum because it actually represents the persistency of the index to be created. But I'm out of time now.. regards, -- Kyotaro Horiguchi NTT Open Source Software Center diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c index c740952..7f0d3f9 100644 --- a/src/backend/access/brin/brin.c +++ b/src/backend/access/brin/brin.c @@ -675,13 +675,14 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo) } void -brinbuildempty(Relation index) +brinbuildempty(Relation index, bool persistent) { Buffer metabuf; + ForkNumber forknum = (persistent ? MAIN_FORKNUM : INIT_FORKNUM); /* An empty BRIN index has a metapage only. */ metabuf = - ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL); + ReadBufferExtended(index, forknum, P_NEW, RBM_NORMAL, NULL); LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE); /* Initialize and xlog metabuffer. 
*/ diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c index cd21e0e..c041360 100644 --- a/src/backend/access/gin/gininsert.c +++ b/src/backend/access/gin/gininsert.c @@ -430,20 +430,23 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo) } /* - * ginbuildempty() -- build an empty gin index in the initialization fork + * ginbuildempty() -- build an empty gin index + * the new index is built in the intialization fork or main fork according + * to the parameter persistent. */ void -ginbuildempty(Relation index) +ginbuildempty(Relation index, bool persistent) { Buffer RootBuffer, MetaBuffer; + ForkNumber forknum = (persistent ? MAIN_FORKNUM : INIT_FORKNUM); /* An empty GIN index has two pages. */ MetaBuffer = - ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL); + ReadBufferExtended(index, forknum, P_NEW, RBM_NORMAL, NULL); LockBuffer(MetaBuffer, BUFFER_LOCK_EXCLUSIVE); RootBuffer = - ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL); + ReadBufferExtended(index, forknum, P_NEW, RBM_NORMAL, NULL); LockBuffer(RootBuffer, BUFFER_LOCK_EXCLUSIVE); /* Initialize and xlog metabuffer and root buffer. */ diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c index 996363c..3d73083 100644 --- a/src/backend/access/gist/gist.c +++ b/src/backend/access/gist/gist.c @@ -110,15 +110,18 @@ createTempGistContext(void) } /* - * gistbuildempty() -- build an empty gist index in the initialization fork + * gistbuildempty() -- build an empty gist index. + * the new index is built in the intialization fork or main fork according + * to the parameter persistent. */ void -gistbuildempty(Relation index) +gistbuildempty(Relation index, bool persistent) { Buffer buffer; + ForkNumber forknum = (persistent ? 
MAIN_FORKNUM : INIT_FORKNUM); /* Initialize the root page */ - buffer = ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL); + buffer = ReadBufferExtended(index, forknum, P_NEW, RBM_NORMAL, NULL); LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE); /* Initialize and xlog buffer */ diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c index 3d48c4f..3b9cd66 100644 --- a/src/backend/access/hash/hash.c +++ b/src/backend/access/hash/hash.c @@ -156,12 +156,14 @@ hashbuild(Relation heap, Relation index, IndexInfo *indexInfo) } /* - * hashbuildempty() -- build an empty hash index in the initialization fork + * hashbuildempty() -- build an empty hash index + * the new index is built in the intialization fork or main fork according + * to the parameter persistent. */ void -hashbuildempty(Relation index) +hashbuildempty(Relation index, bool persistent) { - _hash_metapinit(index, 0, INIT_FORKNUM); + _hash_metapinit(index, 0, persistent ? MAIN_FORKNUM :
Re: [HACKERS] WAL logging problem in 9.4.3?
On Fri, Feb 19, 2016 at 4:33 PM, Michael Paquierwrote: > On Thu, Feb 18, 2016 at 4:27 PM, Michael Paquier > wrote: >> On Thu, Feb 4, 2016 at 3:24 PM, Heikki Linnakangas wrote: >>> I dropped the ball on this one back in July, so here's an attempt to revive >>> this thread. >>> >>> I spent some time fixing the remaining issues with the prototype patch I >>> posted earlier, and rebased that on top of current git master. See attached. >>> >>> Some review of that would be nice. If there are no major issues with it, I'm >>> going to create backpatchable versions of this for 9.4 and below. >> >> I am going to look into that very soon. For now and to not forget >> about this bug, I have added an entry in the CF app: >> https://commitfest.postgresql.org/9/528/ > > Worth noting that this patch does not address the problem with index > relations when a TRUNCATE is used in the same transaction as its > CREATE TABLE, take that for example when wal_level = minimal: > 1) Run transaction > =# begin; > BEGIN > =# create table ab (a int primary key); > CREATE TABLE > =# truncate ab; > TRUNCATE TABLE > =# commit; > COMMIT > 2) Restart server with immediate mode. > 3) Failure: > =# table ab; > ERROR: XX001: could not read block 0 in file "base/16384/16388": read > only 0 of 8192 bytes > LOCATION: mdread, md.c:728 > > The case where a COPY is issued after TRUNCATE is fixed though, so > that's still an improvement. > > Here are other comments. > > + /* Flush updates to relations that we didn't WAL-logged */ > + smgrDoPendingSyncs(true); > "Flush updates to relations there were not WAL-logged"? > > +void > +FlushRelationBuffersWithoutRelCache(RelFileNode rnode, bool islocal) > +{ > + FlushRelationBuffers_common(smgropen(rnode, InvalidBackendId), islocal); > +} > islocal is always set as false, I'd rather remove it this argument > from FlushRelationBuffersWithoutRelCache. > > for (i = 0; i < nrels; i++) > + { > smgrclose(srels[i]); > + } > Looks like noise. 
> > + if (!found) > + { > + pending->truncated_to = InvalidBlockNumber; > + pending->sync_above = nblocks; > + > + elog(DEBUG2, "registering new pending sync for rel %u/%u/%u at > block %u", > +rnode.spcNode, rnode.dbNode, rnode.relNode, nblocks); > + > + } > + else if (pending->sync_above == InvalidBlockNumber) > + { > + elog(DEBUG2, "registering pending sync for rel %u/%u/%u at block %u", > +rnode.spcNode, rnode.dbNode, rnode.relNode, nblocks); > + pending->sync_above = nblocks; > + } > + else > Here couldn't it be possible that when (sync_above != > InvalidBlockNumber), nblocks can be higher than sync_above? In which > case we had better increase sync_above to nblocks, no? > > + if (!pendingSyncs) > + createPendingSyncsHash(); > + pending = (PendingRelSync *) hash_search(pendingSyncs, > +(void *) >rd_node, > +HASH_ENTER, ); > This is lacking comments. > > - if (XLogHintBitIsNeeded() && (bufHdr->flags & BM_PERMANENT)) > + BufferGetTag(buffer, , , ); > + if (XLogHintBitIsNeeded() && (bufHdr->flags & BM_PERMANENT) && > + !smgrIsSyncPending(rnode, blknum)) > Here as well explaining in more details why the buffer does not need > to go through XLogSaveBufferForHint would be nice. An additional one: - XLogRegisterBuffer(0, newbuf, bufflags); - if (oldbuf != newbuf) - XLogRegisterBuffer(1, oldbuf, REGBUF_STANDARD); In log_heap_update, the new buffer is now conditionally logged depending on if the heap needs WAL or not. Now during replay the following thing is done: - oldaction = XLogReadBufferForRedo(record, (oldblk == newblk) ? 0 : 1, - ); + if (oldblk == newblk) + oldaction = XLogReadBufferForRedo(record, 0, ); + else if (XLogRecHasBlockRef(record, 1)) + oldaction = XLogReadBufferForRedo(record, 1, ); + else + oldaction = BLK_DONE; Shouldn't we check for XLogRecHasBlockRef(record, 0) when the tuple is updated on the same page? 
-- Michael
Re: [HACKERS] WAL logging problem in 9.4.3?
On Thu, Feb 18, 2016 at 4:27 PM, Michael Paquier wrote: > On Thu, Feb 4, 2016 at 3:24 PM, Heikki Linnakangas wrote: >> I dropped the ball on this one back in July, so here's an attempt to revive >> this thread. >> >> I spent some time fixing the remaining issues with the prototype patch I >> posted earlier, and rebased that on top of current git master. See attached. >> >> Some review of that would be nice. If there are no major issues with it, I'm >> going to create backpatchable versions of this for 9.4 and below. > > I am going to look into that very soon. For now and to not forget > about this bug, I have added an entry in the CF app: > https://commitfest.postgresql.org/9/528/ Worth noting that this patch does not address the problem with index relations when a TRUNCATE is used in the same transaction as its CREATE TABLE, take that for example when wal_level = minimal: 1) Run transaction =# begin; BEGIN =# create table ab (a int primary key); CREATE TABLE =# truncate ab; TRUNCATE TABLE =# commit; COMMIT 2) Restart server with immediate mode. 3) Failure: =# table ab; ERROR: XX001: could not read block 0 in file "base/16384/16388": read only 0 of 8192 bytes LOCATION: mdread, md.c:728 The case where a COPY is issued after TRUNCATE is fixed though, so that's still an improvement. Here are other comments. + /* Flush updates to relations that we didn't WAL-logged */ + smgrDoPendingSyncs(true); "Flush updates to relations that were not WAL-logged"? +void +FlushRelationBuffersWithoutRelCache(RelFileNode rnode, bool islocal) +{ + FlushRelationBuffers_common(smgropen(rnode, InvalidBackendId), islocal); +} islocal is always set as false, I'd rather remove this argument from FlushRelationBuffersWithoutRelCache. for (i = 0; i < nrels; i++) + { smgrclose(srels[i]); + } Looks like noise.
+ if (!found) + { + pending->truncated_to = InvalidBlockNumber; + pending->sync_above = nblocks; + + elog(DEBUG2, "registering new pending sync for rel %u/%u/%u at block %u", +rnode.spcNode, rnode.dbNode, rnode.relNode, nblocks); + + } + else if (pending->sync_above == InvalidBlockNumber) + { + elog(DEBUG2, "registering pending sync for rel %u/%u/%u at block %u", +rnode.spcNode, rnode.dbNode, rnode.relNode, nblocks); + pending->sync_above = nblocks; + } + else Here couldn't it be possible that when (sync_above != InvalidBlockNumber), nblocks can be higher than sync_above? In which case we had better increase sync_above to nblocks, no? + if (!pendingSyncs) + createPendingSyncsHash(); + pending = (PendingRelSync *) hash_search(pendingSyncs, +(void *) >rd_node, +HASH_ENTER, ); This is lacking comments. - if (XLogHintBitIsNeeded() && (bufHdr->flags & BM_PERMANENT)) + BufferGetTag(buffer, , , ); + if (XLogHintBitIsNeeded() && (bufHdr->flags & BM_PERMANENT) && + !smgrIsSyncPending(rnode, blknum)) Here as well, explaining in more detail why the buffer does not need to go through XLogSaveBufferForHint would be nice. -- Michael
Re: [HACKERS] WAL logging problem in 9.4.3?
On Thu, Feb 4, 2016 at 3:24 PM, Heikki Linnakangas wrote:
> I dropped the ball on this one back in July, so here's an attempt to revive
> this thread.
>
> I spent some time fixing the remaining issues with the prototype patch I
> posted earlier, and rebased that on top of current git master. See attached.
>
> Some review of that would be nice. If there are no major issues with it, I'm
> going to create backpatchable versions of this for 9.4 and below.

I am going to look into that very soon. For now and to not forget
about this bug, I have added an entry in the CF app:
https://commitfest.postgresql.org/9/528/
--
Michael
Re: [HACKERS] WAL logging problem in 9.4.3?
On 22/10/15 03:56, Michael Paquier wrote:
> On Wed, Oct 21, 2015 at 11:53 PM, Alvaro Herrera wrote:
>> Heikki Linnakangas wrote:
>>> Thanks. For comparison, I wrote a patch to implement what I had in mind.
>>>
>>> When a WAL-skipping COPY begins, we add an entry for that relation in a
>>> "pending-fsyncs" hash table. Whenever we perform any action on a heap
>>> that would normally be WAL-logged, we check if the relation is in the
>>> hash table, and skip WAL-logging if so.
>>
>> I think this wasn't applied, was it?
>
> No, it was not applied.

I dropped the ball on this one back in July, so here's an attempt to revive
this thread.

I spent some time fixing the remaining issues with the prototype patch I
posted earlier, and rebased that on top of current git master. See attached.

Some review of that would be nice. If there are no major issues with it, I'm
going to create backpatchable versions of this for 9.4 and below.

- Heikki

>From 063e1aa258800873783190a9678d551b43c0e39e Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas
Date: Thu, 4 Feb 2016 15:21:09 +0300
Subject: [PATCH 1/1] Fix the optimization to skip WAL-logging on table
 created in same xact.

There were several bugs in the optimization to skip WAL-logging for a table
that was created (or truncated) in the same transaction, with
wal_level=minimal, leading to data loss if a crash happened after the
optimization was used:

* If the table was created, and then truncated, and then loaded with COPY,
  we would replay the truncate record at commit, and the table would end up
  being empty after replay.

* If there is a trigger on a table that modifies the same table, and you
  COPY to the table in the transaction that created it, you might have some
  WAL-logged operations on a page, performed by the trigger, intermixed
  with the non-WAL-logged inserts done by the COPY. That can lead to crash
  at recovery, because we might try to replay a WAL record that e.g.
  updates a tuple, but insertion of the tuple was not WAL-logged.
--- src/backend/access/heap/heapam.c| 254 +++- src/backend/access/heap/pruneheap.c | 2 +- src/backend/access/heap/rewriteheap.c | 4 +- src/backend/access/heap/visibilitymap.c | 2 +- src/backend/access/transam/xact.c | 7 + src/backend/catalog/storage.c | 250 --- src/backend/commands/copy.c | 14 +- src/backend/commands/createas.c | 9 +- src/backend/commands/matview.c | 6 +- src/backend/commands/tablecmds.c| 5 +- src/backend/commands/vacuumlazy.c | 6 +- src/backend/storage/buffer/bufmgr.c | 47 -- src/include/access/heapam.h | 8 +- src/include/access/heapam_xlog.h| 2 + src/include/catalog/storage.h | 3 + src/include/storage/bufmgr.h| 2 + 16 files changed, 487 insertions(+), 134 deletions(-) diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c index f443742..79298e2 100644 --- a/src/backend/access/heap/heapam.c +++ b/src/backend/access/heap/heapam.c @@ -34,6 +34,28 @@ * the POSTGRES heap access method used for all POSTGRES * relations. * + * WAL CONSIDERATIONS + * All heap operations are normally WAL-logged. but there are a few + * exceptions. Temporary and unlogged relations never need to be + * WAL-logged, but we can also skip WAL-logging for a table that was + * created in the same transaction, if we don't need WAL for PITR or + * WAL archival purposes (i.e. if wal_level=minimal), and we fsync() + * the file to disk at COMMIT instead. + * + * The same-relation optimization is not employed automatically on all + * updates to a table that was created in the same transacton, because + * for a small number of changes, it's cheaper to just create the WAL + * records than fsyncing() the whole relation at COMMIT. It is only + * worthwhile for (presumably) large operations like COPY, CLUSTER, + * or VACUUM FULL. Use heap_register_sync() to initiate such an + * operation; it will cause any subsequent updates to the table to skip + * WAL-logging, if possible, and cause the heap to be synced to disk at + * COMMIT. 
+ * + * To make that work, all modifications to heap must use + * HeapNeedsWAL() to check if WAL-logging is needed in this transaction + * for the given block. + * *- */ #include "postgres.h" @@ -55,6 +77,7 @@ #include "access/xlogutils.h" #include "catalog/catalog.h" #include "catalog/namespace.h" +#include "catalog/storage.h" #include "miscadmin.h" #include "pgstat.h" #include "storage/bufmgr.h" @@ -2332,12 +2355,6 @@ FreeBulkInsertState(BulkInsertState bistate) * The new tuple is stamped with current transaction ID and the specified * command ID. * - * If the HEAP_INSERT_SKIP_WAL option is specified, the new tuple is
Re: [HACKERS] WAL logging problem in 9.4.3?
Heikki Linnakangas wrote:
> Thanks. For comparison, I wrote a patch to implement what I had in mind.
>
> When a WAL-skipping COPY begins, we add an entry for that relation in a
> "pending-fsyncs" hash table. Whenever we perform any action on a heap that
> would normally be WAL-logged, we check if the relation is in the hash
> table, and skip WAL-logging if so.

I think this wasn't applied, was it?

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] WAL logging problem in 9.4.3?
On Wed, Oct 21, 2015 at 11:53 PM, Alvaro Herrera wrote:
> Heikki Linnakangas wrote:
>
>> Thanks. For comparison, I wrote a patch to implement what I had in mind.
>>
>> When a WAL-skipping COPY begins, we add an entry for that relation in a
>> "pending-fsyncs" hash table. Whenever we perform any action on a heap that
>> would normally be WAL-logged, we check if the relation is in the hash
>> table, and skip WAL-logging if so.
>
> I think this wasn't applied, was it?

No, it was not applied.
--
Michael
Re: [HACKERS] WAL logging problem in 9.4.3?
On 07/23/2015 09:38 PM, Robert Haas wrote:
> On Wed, Jul 22, 2015 at 12:21 PM, Heikki Linnakangas <hlinn...@iki.fi> wrote:
>> This is more invasive than I'd like to backpatch, but I think it's the
>> simplest approach that works, and doesn't disable any of the important
>> optimizations we have.
>
> Hmm, isn't HeapNeedsWAL() a lot more costly than RelationNeedsWAL()?

Yes. But it's still very cheap, especially in the common case that the
pending syncs hash table is empty.

> Should we be worried about that?

It doesn't worry me.

- Heikki
Re: [HACKERS] WAL logging problem in 9.4.3?
On Wed, Jul 22, 2015 at 12:21 PM, Heikki Linnakangas <hlinn...@iki.fi> wrote:
> This is more invasive than I'd like to backpatch, but I think it's the
> simplest approach that works, and doesn't disable any of the important
> optimizations we have.

Hmm, isn't HeapNeedsWAL() a lot more costly than RelationNeedsWAL()?
Should we be worried about that?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] WAL logging problem in 9.4.3?
On 22 July 2015 at 17:21, Heikki Linnakangas <hlinn...@iki.fi> wrote:

> When a WAL-skipping COPY begins, we add an entry for that relation in a
> "pending-fsyncs" hash table. Whenever we perform any action on a heap that
> would normally be WAL-logged, we check if the relation is in the hash
> table, and skip WAL-logging if so.
>
> That was a simplified explanation. In reality, when WAL-skipping COPY
> begins, we also memorize the current size of the relation. Any actions on
> blocks greater than the old size are not WAL-logged, and any actions on
> smaller-numbered blocks are. This ensures that if you did any INSERTs on
> the table before the COPY, any new actions on the blocks that were already
> WAL-logged by the INSERT are also WAL-logged. And likewise if you perform
> any INSERTs after (or during, by trigger) the COPY, and they modify the
> new pages, those actions are not WAL-logged. So starting a WAL-skipping
> COPY splits the relation into two parts: the first part that is WAL-logged
> as usual, and the later part that is not WAL-logged. (there is one loose
> end marked with XXX in the patch on this, when one of the pages involved
> in a cold UPDATE is before the watermark and the other is after)
>
> The actual fsync() has been moved to the end of transaction, as we are now
> skipping WAL-logging of any actions after the COPY as well.
>
> And truncations complicate things further. If we emit a truncation WAL
> record in the transaction, we also make an entry in the hash table to
> record that. All operations on a relation that has been truncated must be
> WAL-logged as usual, because replaying the truncate record will destroy
> all data even if we fsync later. But we still optimize for BEGIN; CREATE;
> COPY; TRUNCATE; COPY; style patterns, because if we truncate a relation
> that has already been marked for fsync-at-COMMIT, we don't need to WAL-log
> the truncation either.
> This is more invasive than I'd like to backpatch, but I think it's the
> simplest approach that works, and doesn't disable any of the important
> optimizations we have.

I didn't like it when I first read this, but I do now. As a by-product of
fixing the bug it actually extends the optimization.

You can optimize this approach so we always write WAL unless one of the two
subid fields are set, so there is no need to call smgrIsSyncPending() every
time. I couldn't see where this depended upon wal_level, but I guess it's
there somewhere.

I'm unhappy about the call during MarkBufferDirtyHint() which is just too
costly. The only way to do this cheaply is to specifically mark buffers as
being BM_WAL_SKIPPED, so they do not need to be hinted. That flag would be
removed when we flush the buffers for the relation.

>>> And what reason is there to think that this would fix all the problems?
>>
>> I don't think either suggested fix could be claimed to be a great
>> solution, since there is little principle here, only heuristic. Heikki's
>> solution would be the only safe way, but is not backpatchable.
>
> I can't get too excited about a half-fix that leaves you with data
> corruption in some scenarios.

On further consideration, it seems obvious that Andres' suggestion would
not work for UPDATE or DELETE, so I now agree.

It does seem a big thing to backpatch; alternative suggestions?

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] WAL logging problem in 9.4.3?
On 07/22/2015 11:18 AM, Simon Riggs wrote:
> On 10 July 2015 at 00:06, Tom Lane <t...@sss.pgh.pa.us> wrote:
>> Andres Freund <and...@anarazel.de> writes:
>>> On 2015-07-06 11:49:54 -0400, Tom Lane wrote:
>>>> Rather than reverting cab9a0656c36739f, which would re-introduce a
>>>> different performance problem, perhaps we could have COPY create a new
>>>> relfilenode when it does this. That should be safe if the table was
>>>> previously empty.
>>>
>>> I'm not convinced that cab9a0656c36739f needs to survive in that form.
>>> To me only allowing one COPY to benefit from the wal_level = minimal
>>> optimization has a significantly higher cost than cab9a0656c36739f.
>>
>> What evidence have you got to base that value judgement on?
>> cab9a0656c36739f was based on an actual user complaint, so we have good
>> evidence that there are people out there who care about the cost of
>> truncating a table many times in one transaction. On the other hand, I
>> know of no evidence that anyone's depending on multiple sequential
>> COPYs, nor intermixed COPY and INSERT, to be fast. The original argument
>> for having this COPY optimization at all was to make restoring pg_dump
>> scripts in a single transaction fast; and that use-case doesn't care
>> about anything but a single COPY into a virgin table.
>
> We have to backpatch this fix, so it must be both simple and effective.
> Heikki's suggestions may be best, maybe not, but they don't seem
> backpatchable. Tom's suggestion looks good. So does Andres' suggestion.
> I have coded both.

Thanks. For comparison, I wrote a patch to implement what I had in mind.

When a WAL-skipping COPY begins, we add an entry for that relation in a
"pending-fsyncs" hash table. Whenever we perform any action on a heap that
would normally be WAL-logged, we check if the relation is in the hash
table, and skip WAL-logging if so.

That was a simplified explanation. In reality, when WAL-skipping COPY
begins, we also memorize the current size of the relation. Any actions on
blocks greater than the old size are not WAL-logged, and any actions on
smaller-numbered blocks are. This ensures that if you did any INSERTs on
the table before the COPY, any new actions on the blocks that were already
WAL-logged by the INSERT are also WAL-logged. And likewise if you perform
any INSERTs after (or during, by trigger) the COPY, and they modify the new
pages, those actions are not WAL-logged. So starting a WAL-skipping COPY
splits the relation into two parts: the first part that is WAL-logged as
usual, and the later part that is not WAL-logged. (there is one loose end
marked with XXX in the patch on this, when one of the pages involved in a
cold UPDATE is before the watermark and the other is after)

The actual fsync() has been moved to the end of transaction, as we are now
skipping WAL-logging of any actions after the COPY as well.

And truncations complicate things further. If we emit a truncation WAL
record in the transaction, we also make an entry in the hash table to
record that. All operations on a relation that has been truncated must be
WAL-logged as usual, because replaying the truncate record will destroy all
data even if we fsync later. But we still optimize for BEGIN; CREATE; COPY;
TRUNCATE; COPY; style patterns, because if we truncate a relation that has
already been marked for fsync-at-COMMIT, we don't need to WAL-log the
truncation either.

This is more invasive than I'd like to backpatch, but I think it's the
simplest approach that works, and doesn't disable any of the important
optimizations we have.

>> And what reason is there to think that this would fix all the problems?
>
> I don't think either suggested fix could be claimed to be a great
> solution, since there is little principle here, only heuristic. Heikki's
> solution would be the only safe way, but is not backpatchable.

I can't get too excited about a half-fix that leaves you with data
corruption in some scenarios.
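The watermark rule described above can be condensed into a rough decision
function. This is a hypothetical, self-contained model with invented names
(the real patch spreads this logic across HeapNeedsWAL() and the
pending-syncs hash table); it captures the three cases: no pending sync
means normal WAL rules, a WAL-logged truncation forces logging everywhere,
and otherwise only blocks below the memorized size are logged.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define InvalidBlockNumber ((unsigned) 0xFFFFFFFF)

/* Simplified stand-in for a relation's pending-fsyncs hash entry. */
typedef struct PendingSync
{
    unsigned sync_above;    /* first block covered by fsync-at-COMMIT */
    bool     wal_truncated; /* a truncate WAL record was emitted later */
} PendingSync;

/* Does a change to block blkno need a WAL record? (ps == NULL means the
 * relation has no pending-sync entry at all.) */
static bool
block_needs_wal(const PendingSync *ps, unsigned blkno)
{
    if (ps == NULL || ps->sync_above == InvalidBlockNumber)
        return true;        /* no pending sync: normal WAL rules apply */
    if (ps->wal_truncated)
        return true;        /* replaying the truncate destroys data, so
                             * everything after it must be WAL-logged */
    return blkno < ps->sync_above;  /* old part logged, new part synced */
}
```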
I wrote a little test script to test all these different scenarios (attached). Both of your patches fail with the script. - Heikki diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c index 2e3b9d2..9ef688b 100644 --- a/src/backend/access/heap/heapam.c +++ b/src/backend/access/heap/heapam.c @@ -2021,12 +2021,6 @@ FreeBulkInsertState(BulkInsertState bistate) * The new tuple is stamped with current transaction ID and the specified * command ID. * - * If the HEAP_INSERT_SKIP_WAL option is specified, the new tuple is not - * logged in WAL, even for a non-temp relation. Safe usage of this behavior - * requires that we arrange that all new tuples go into new pages not - * containing any tuples from other transactions, and that the relation gets - * fsync'd before commit. (See also heap_sync() comments) - * * The HEAP_INSERT_SKIP_FSM option is passed directly to * RelationGetBufferForTuple, which see for more info. * @@ -2115,7 +2109,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid, MarkBufferDirty(buffer);
Re: [HACKERS] WAL logging problem in 9.4.3?
On 2015-07-21 21:37:41 +0200, Martijn van Oosterhout wrote:
> On Tue, Jul 21, 2015 at 02:24:47PM -0400, Todd A. Cook wrote:
>> Hi,
>>
>> This thread seemed to trail off without a resolution. Was anything done?

Not that I can tell. Heikki and I had some in-person conversation about it
at a conference, but we didn't really find anything we both liked...

> I was the original poster of this thread. We've worked around the issue
> by placing a CHECKPOINT command at the end of the migration script. For
> us it's not a performance issue, more a correctness one, tables were
> empty when they shouldn't have been.

If it's just correctness, you could just use wal_level = archive.

> I'm hoping a fix will appear in the 9.5 release, since we're intending
> to release with that version. A forced checkpoint every now and then
> probably won't be a serious problem though.

We're imo going to have to fix this in the back branches.

Andres
Re: [HACKERS] WAL logging problem in 9.4.3?
On 10 July 2015 at 00:06, Tom Lane <t...@sss.pgh.pa.us> wrote:

> Andres Freund <and...@anarazel.de> writes:
>> On 2015-07-06 11:49:54 -0400, Tom Lane wrote:
>>> Rather than reverting cab9a0656c36739f, which would re-introduce a
>>> different performance problem, perhaps we could have COPY create a new
>>> relfilenode when it does this. That should be safe if the table was
>>> previously empty.
>>
>> I'm not convinced that cab9a0656c36739f needs to survive in that form.
>> To me only allowing one COPY to benefit from the wal_level = minimal
>> optimization has a significantly higher cost than cab9a0656c36739f.
>
> What evidence have you got to base that value judgement on?
>
> cab9a0656c36739f was based on an actual user complaint, so we have good
> evidence that there are people out there who care about the cost of
> truncating a table many times in one transaction. On the other hand, I
> know of no evidence that anyone's depending on multiple sequential COPYs,
> nor intermixed COPY and INSERT, to be fast. The original argument for
> having this COPY optimization at all was to make restoring pg_dump
> scripts in a single transaction fast; and that use-case doesn't care
> about anything but a single COPY into a virgin table.

We have to backpatch this fix, so it must be both simple and effective.

Heikki's suggestions may be best, maybe not, but they don't seem
backpatchable. Tom's suggestion looks good. So does Andres' suggestion.
I have coded both.

> And what reason is there to think that this would fix all the problems?

I don't think either suggested fix could be claimed to be a great solution,
since there is little principle here, only heuristic. Heikki's solution
would be the only safe way, but is not backpatchable.

Forcing SKIP_FSM to always extend has no negative side effects in other
code paths, AFAICS.

Patches attached. Martijn, please verify.
--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

fix_wal_logging_copy_truncate.v1.patch
Description: Binary data

fix_copy_zero_blocks.v1.patch
Description: Binary data
Re: [HACKERS] WAL logging problem in 9.4.3?
Hi,

This thread seemed to trail off without a resolution. Was anything done?
(See more below.)

On 07/09/15 19:06, Tom Lane wrote:
> Andres Freund <and...@anarazel.de> writes:
>> On 2015-07-06 11:49:54 -0400, Tom Lane wrote:
>>> Rather than reverting cab9a0656c36739f, which would re-introduce a
>>> different performance problem, perhaps we could have COPY create a new
>>> relfilenode when it does this. That should be safe if the table was
>>> previously empty.
>>
>> I'm not convinced that cab9a0656c36739f needs to survive in that form.
>> To me only allowing one COPY to benefit from the wal_level = minimal
>> optimization has a significantly higher cost than cab9a0656c36739f.
>
> What evidence have you got to base that value judgement on?
> cab9a0656c36739f was based on an actual user complaint, so we have good
> evidence that there are people out there who care about the cost of
> truncating a table many times in one transaction.

I'm the complainer mentioned in the cab9a0656c36739f commit message. :)

FWIW, we use a temp table to split a join across 4 largish tables (10^8
rows or more each) and 2 small tables (10^6 rows each). We write the
results of joining the 2 largest tables into the temp table, and then join
that to the other 4. This gave significant performance benefits because
the planner would know the exact row count of the 2-way join heading into
the 4-way join. After commit cab9a0656c36739f, we got another noticeable
performance improvement (I did timings before and after, but I can't seem
to put my hands on the numbers right now).

We do millions of these queries every day in batches. Each batch reuses a
single temp table (truncating it before each pair of joins) so as to
reduce the churn in the system catalogs. In case it matters, the temp
table is created with ON COMMIT DROP.

This was (and still is) done on 9.2.x. HTH.

-- todd cook
-- tc...@blackducksoftware.com

> On the other hand, I know of no evidence that anyone's depending on
> multiple sequential COPYs, nor intermixed COPY and INSERT, to be fast.
> The original argument for having this COPY optimization at all was to
> make restoring pg_dump scripts in a single transaction fast; and that
> use-case doesn't care about anything but a single COPY into a virgin
> table.
>
> I think you're worrying about exactly the wrong case.
>
>> My tentative guess is that the best course is to
>> a) Make heap_truncate_one_rel() create a new relfilenode. That fixes
>> the truncation replay issue.
>> b) Force new pages to be used when using the heap_sync mode in COPY.
>> That avoids the INIT danger you found. It seems rather reasonable to
>> avoid using pages that have already been the target of WAL logging here
>> in general.
>
> And what reason is there to think that this would fix all the problems?
> We know of those two, but we've not exactly looked hard for other cases.
>
> Again, the only known field usage for the COPY optimization is the
> pg_dump scenario; were that not so, we'd have noticed the problem long
> since. So I don't have any faith that this is a well-tested area.
>
> regards, tom lane
Re: [HACKERS] WAL logging problem in 9.4.3?
On Tue, Jul 21, 2015 at 02:24:47PM -0400, Todd A. Cook wrote:
> Hi,
>
> This thread seemed to trail off without a resolution. Was anything done?

Not that I can tell.

I was the original poster of this thread. We've worked around the issue by
placing a CHECKPOINT command at the end of the migration script. For us
it's not a performance issue, more a correctness one, tables were empty
when they shouldn't have been.

I'm hoping a fix will appear in the 9.5 release, since we're intending to
release with that version. A forced checkpoint every now and then probably
won't be a serious problem though.

Have a nice day,
--
Martijn van Oosterhout <klep...@svana.org> http://svana.org/kleptog/
> He who writes carelessly confesses thereby at the very outset that he
> does not attach much importance to his own thoughts.
  -- Arthur Schopenhauer

signature.asc
Description: Digital signature
Re: [HACKERS] WAL logging problem in 9.4.3?
On 07/10/2015 02:06 AM, Tom Lane wrote:
> Andres Freund <and...@anarazel.de> writes:
>> On 2015-07-06 11:49:54 -0400, Tom Lane wrote:
>>> Rather than reverting cab9a0656c36739f, which would re-introduce a
>>> different performance problem, perhaps we could have COPY create a new
>>> relfilenode when it does this. That should be safe if the table was
>>> previously empty.
>>
>> I'm not convinced that cab9a0656c36739f needs to survive in that form.
>> To me only allowing one COPY to benefit from the wal_level = minimal
>> optimization has a significantly higher cost than cab9a0656c36739f.
>
> What evidence have you got to base that value judgement on?
> cab9a0656c36739f was based on an actual user complaint, so we have good
> evidence that there are people out there who care about the cost of
> truncating a table many times in one transaction.

Yeah, if we specifically made that case cheap, in response to a complaint,
it would be a regression to make it expensive again. We might get away
with it in a major version, but would hate to backpatch that.

>> My tentative guess is that the best course is to
>> a) Make heap_truncate_one_rel() create a new relfilenode. That fixes
>> the truncation replay issue.
>> b) Force new pages to be used when using the heap_sync mode in COPY.
>> That avoids the INIT danger you found. It seems rather reasonable to
>> avoid using pages that have already been the target of WAL logging here
>> in general.
>
> And what reason is there to think that this would fix all the problems?
> We know of those two, but we've not exactly looked hard for other cases.

Hmm. Perhaps that could be made to work, but it feels pretty fragile. For
example, you could have an insert trigger on the table that inserts
additional rows to the same table, and those inserts would be intermixed
with the rows inserted by COPY. You'll have to avoid that somehow.
Full-page images in general are a problem. If a checkpoint happens, and a
trigger modifies the page we're COPYing to in any way, you have the same
issue.

Even reading a page can cause a full-page image of it to be written: If
you update a hint bit on the page while reading it, and checksums are
enabled, and a checkpoint happened since the page was last updated, bang.
I don't think that's a problem in this case because there are no hint bits
to be set on pages that we're COPYing to, but it's a whole new subtle
assumption.

I think we should 1. reliably and explicitly keep track of whether we've
WAL-logged any TRUNCATE, INSERT/UPDATE+INIT, or any other
full-page-logging operations on the relation, and 2. make sure we never
skip WAL-logging again if we have.

Let's add a flag, rd_skip_wal_safe, to RelationData that's initially set
when a new relfilenode is created, i.e. whenever rd_createSubid or
rd_newRelfilenodeSubid is set. Whenever a TRUNCATE or a full-page image
(including INSERT/UPDATE+INIT) is WAL-logged, clear the flag. In copy.c,
only skip WAL-logging if the flag is still set. To deal with the case that
the flag gets cleared in the middle of COPY, also check the flag whenever
we're about to skip WAL-logging in heap_insert, and if it's been cleared,
ignore the HEAP_INSERT_SKIP_WAL option and WAL-log anyway.

Compared to the status quo, that disables the WAL-skipping optimization in
the scenario where you CREATE, INSERT, then COPY to a table in the same
transaction. I think that's acceptable.

(Alternatively, to handle the case that the flag gets cleared in the
middle of COPY, add another flag to RelationData indicating that a
WAL-skipping COPY is in-progress, and refrain from WAL-logging any
FPW-writing operations on the table when it's set (or any operations
whatsoever). That'd be more efficient, but it's such a rare corner case
that it hardly matters.)

- Heikki
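A minimal sketch of the proposed rd_skip_wal_safe lifecycle, with invented
helper names (the real fields would live in RelationData alongside
rd_createSubid, and the checks would sit in copy.c and heap_insert rather
than in standalone functions):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for the relevant bits of RelationData. */
typedef struct RelationData
{
    bool rd_is_new;        /* created / new relfilenode in this xact */
    bool rd_skip_wal_safe; /* still safe to run a WAL-skipping COPY */
} RelationData;

/* Set when rd_createSubid or rd_newRelfilenodeSubid would be set. */
static void
relation_new_filenode(RelationData *rel)
{
    rel->rd_is_new = true;
    rel->rd_skip_wal_safe = true;
}

/* Any WAL-logged TRUNCATE or full-page image (INSERT/UPDATE+INIT, hint-bit
 * FPI, ...) permanently clears the flag for this relfilenode. */
static void
logged_fpi_or_truncate(RelationData *rel)
{
    rel->rd_skip_wal_safe = false;
}

/* copy.c would consult this before skipping WAL; heap_insert re-checks it
 * so that a flag cleared mid-COPY turns HEAP_INSERT_SKIP_WAL into a no-op. */
static bool
copy_may_skip_wal(const RelationData *rel)
{
    return rel->rd_is_new && rel->rd_skip_wal_safe;
}
```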
Re: [HACKERS] WAL logging problem in 9.4.3?
On 2015-07-09 19:06:11 -0400, Tom Lane wrote:
> What evidence have you got to base that value judgement on?
> cab9a0656c36739f was based on an actual user complaint, so we have good
> evidence that there are people out there who care about the cost of
> truncating a table many times in one transaction. On the other hand, I
> know of no evidence that anyone's depending on multiple sequential COPYs,
> nor intermixed COPY and INSERT, to be fast. The original argument for
> having this COPY optimization at all was to make restoring pg_dump
> scripts in a single transaction fast; and that use-case doesn't care
> about anything but a single COPY into a virgin table.

Well, you'll hardly have heard complaints about COPY, given that we
behaved like currently for a long while. I definitely know of ETL like
processes that have relied on subsequent COPYs into truncated relations
being cheaper. Can't remember the same for intermixed COPY and INSERT, but
it'd not surprise me if somebody mixed COPY and UPDATEs rather freely for
ETL.

> I think you're worrying about exactly the wrong case.
>
>> My tentative guess is that the best course is to
>> a) Make heap_truncate_one_rel() create a new relfilenode. That fixes
>> the truncation replay issue.
>> b) Force new pages to be used when using the heap_sync mode in COPY.
>> That avoids the INIT danger you found. It seems rather reasonable to
>> avoid using pages that have already been the target of WAL logging here
>> in general.
>
> And what reason is there to think that this would fix all the problems?

Yea, that's the big problem.

> Again, the only known field usage for the COPY optimization is the
> pg_dump scenario; were that not so, we'd have noticed the problem long
> since. So I don't have any faith that this is a well-tested area.

You need to crash in the right moment. I don't think that's that
frequently exercised...
Re: [HACKERS] WAL logging problem in 9.4.3?
On 2015-07-10 19:23:28 +0900, Fujii Masao wrote:
> Maybe I'm missing something. But I start wondering why TRUNCATE and
> INSERT (or even all the operations on the table created at the current
> transaction) need to be WAL-logged while COPY can be optimized. If no
> WAL records are generated on that table, the problem we're talking about
> seems not to occur. Also this seems safe and doesn't degrade the
> performance of data loading. Thought?

Skipping WAL logging means that you need to scan through the whole shared
buffers to write out dirty buffers and fsync the segments. A single insert
WAL record is a couple orders of magnitude cheaper than that. Essentially
doing this just for COPY is a heuristic.
Re: [HACKERS] WAL logging problem in 9.4.3?
On 2015-07-10 13:38:50 +0300, Heikki Linnakangas wrote:
> In the long-term, I'd like to refactor this whole thing so that we never
> WAL-log any operations on a relation that's created in the same
> transaction (when wal_level=minimal). Instead, at COMMIT, we'd fsync()
> the relation, or if it's smaller than some threshold, WAL-log the
> contents of the whole file at that point. That would move all that
> more-difficult-than-it-seems-at-first-glance logic from COPY and
> indexam's to a central location, and it would allow the same optimization
> for all operations, not just COPY. But that probably isn't feasible to
> backpatch.

I don't think that's really realistic until we have a buffer manager that
lets you efficiently scan for all pages of a relation :(
Re: [HACKERS] WAL logging problem in 9.4.3?
On Fri, Jul 10, 2015 at 2:27 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Fujii Masao <masao.fu...@gmail.com> writes:
>> On Tue, Jul 7, 2015 at 12:49 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>>> One idea I had was to allow the COPY optimization only if the heap
>>> file is physically zero-length at the time the COPY starts.
>>
>> This seems not helpful for the case where TRUNCATE is executed before
>> COPY. No?
>
> Huh? The heap file would be zero length in that case.
>
>> So, if COPY is executed multiple times at the same transaction, only
>> first COPY can be optimized?
>
> This is true, and I don't think we should care, especially not if we're
> going to take risks of incorrect behavior in order to optimize that
> third-order case. The fact that we're dealing with this bug at all
> should remind us that this stuff is harder than it looks. I want a
> simple, reliable, back-patchable fix, and I do not believe that what you
> are suggesting would be any of those.

Maybe I'm missing something. But I start wondering why TRUNCATE and INSERT
(or even all the operations on the table created at the current
transaction) need to be WAL-logged while COPY can be optimized. If no WAL
records are generated on that table, the problem we're talking about seems
not to occur. Also this seems safe and doesn't degrade the performance of
data loading. Thought?

Regards,

--
Fujii Masao
Re: [HACKERS] WAL logging problem in 9.4.3?
Fujii Masao masao.fu...@gmail.com writes: On Tue, Jul 7, 2015 at 12:49 AM, Tom Lane t...@sss.pgh.pa.us wrote: One idea I had was to allow the COPY optimization only if the heap file is physically zero-length at the time the COPY starts. This seems not helpful for the case where TRUNCATE is executed before COPY. No? Huh? The heap file would be zero length in that case. So, if COPY is executed multiple times in the same transaction, only the first COPY can be optimized? This is true, and I don't think we should care, especially not if we're going to take risks of incorrect behavior in order to optimize that third-order case. The fact that we're dealing with this bug at all should remind us that this stuff is harder than it looks. I want a simple, reliable, back-patchable fix, and I do not believe that what you are suggesting would be any of those. regards, tom lane
Re: [HACKERS] WAL logging problem in 9.4.3?
On 2015-07-06 11:49:54 -0400, Tom Lane wrote: One idea I had was to allow the COPY optimization only if the heap file is physically zero-length at the time the COPY starts. That would still be able to optimize in all the cases we care about making COPY fast for. Rather than reverting cab9a0656c36739f, which would re-introduce a different performance problem, perhaps we could have COPY create a new relfilenode when it does this. That should be safe if the table was previously empty. I'm not convinced that cab9a0656c36739f needs to survive in that form. To me only allowing one COPY to benefit from the wal_level = minimal optimization has a significantly higher cost than cab9a0656c36739f. My tentative guess is that the best course is to a) Make heap_truncate_one_rel() create a new relfilenode. That fixes the truncation replay issue. b) Force new pages to be used when using the heap_sync mode in COPY. That avoids the INIT danger you found. It seems rather reasonable to avoid using pages that have already been the target of WAL logging here in general. Andres
Re: [HACKERS] WAL logging problem in 9.4.3?
On Tue, Jul 7, 2015 at 12:49 AM, Tom Lane t...@sss.pgh.pa.us wrote: Andres Freund and...@anarazel.de writes: On 2015-07-06 11:14:40 -0400, Tom Lane wrote: The COUNT() correctly says 11 rows, but after crash-and-recover, only the row with -1 is there. This is because the INSERT writes out an INSERT+INIT WAL record, which we happily replay, clobbering the data added later by COPY. ISTM any WAL logged action that touches a relfilenode essentially needs to disable further optimization based on the knowledge that the relation is new. After a bit more thought, I think it's not so much any WAL logged action as any unconditionally-replayed action. INSERT+INIT breaks this example because heap_xlog_insert will unconditionally replay the action, even if the page is valid and has same or newer LSN. Similarly, TRUNCATE is problematic because we redo it unconditionally (and in that case it's hard to see an alternative). It'd not be impossible to add more state to the relcache entry for the relation. Whether it's likely that we'd find all the places that'd need updating that state, I'm not sure. Yeah, the sticking point is mainly being sure that the state is correctly tracked, both now and after future changes. We'd need to identify a state invariant that we could be pretty confident we'd not break. One idea I had was to allow the COPY optimization only if the heap file is physically zero-length at the time the COPY starts. This seems not helpful for the case where TRUNCATE is executed before COPY. No? That would still be able to optimize in all the cases we care about making COPY fast for. Rather than reverting cab9a0656c36739f, which would re-introduce a different performance problem, perhaps we could have COPY create a new relfilenode when it does this. That should be safe if the table was previously empty. So, if COPY is executed multiple times at the same transaction, only first COPY can be optimized? 
On second thought, I'm thinking that we can safely optimize COPY if no problematic WAL records like INSERT+INIT or TRUNCATE have been generated since the current REDO location, or the table was created in the same transaction. That is, even if INSERT or TRUNCATE is executed after the table creation, if a CHECKPOINT happens subsequently, we don't need to log the COPY. The subsequent crash recovery will not replay such problematic WAL records. So the example cases where we can optimize COPY are:

BEGIN; CREATE TABLE; COPY; COPY  -- subsequent COPY also can be optimized
BEGIN; CREATE TABLE; TRUNCATE; CHECKPOINT; COPY
BEGIN; CREATE TABLE; INSERT; CHECKPOINT; COPY

Crash recovery can start from the previous REDO location (i.e., the REDO location of the last checkpoint record), so we might need to check whether such problematic WAL records have been generated since the previous REDO location instead of the current one. Regards, -- Fujii Masao
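Fujii-san's rule can be phrased as a predicate over the LSNs of "problematic" records for the table. The following Python toy states it under invented names (this is not PostgreSQL's API; `can_skip_copy_wal`, the LSN list, and `prev_redo_lsn` are all illustrative):

```python
# Hedged sketch of the decision rule proposed above: skip WAL for COPY
# only if the table was created in this transaction and no problematic
# records (INSERT+INIT, TRUNCATE) for it were written at or after the
# last checkpoint's REDO location, since those would be replayed after
# a crash and could clobber the un-logged COPY data.

def can_skip_copy_wal(table_created_in_xact, problematic_record_lsns,
                      prev_redo_lsn):
    if not table_created_in_xact:
        return False
    return all(lsn < prev_redo_lsn for lsn in problematic_record_lsns)

# CREATE TABLE; INSERT (lsn 100); CHECKPOINT moves REDO to 200; COPY:
assert can_skip_copy_wal(True, [100], prev_redo_lsn=200) is True
# CREATE TABLE; TRUNCATE (lsn 250) after the checkpoint: must log COPY.
assert can_skip_copy_wal(True, [250], prev_redo_lsn=200) is False
```

The second case is exactly why the check must use the previous REDO location, not the current insert position: replay starts from the checkpoint's REDO point.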
Re: [HACKERS] WAL logging problem in 9.4.3?
Andres Freund and...@anarazel.de writes: On 2015-07-06 11:49:54 -0400, Tom Lane wrote: Rather than reverting cab9a0656c36739f, which would re-introduce a different performance problem, perhaps we could have COPY create a new relfilenode when it does this. That should be safe if the table was previously empty. I'm not convinced that cab9a0656c36739f needs to survive in that form. To me only allowing one COPY to benefit from the wal_level = minimal optimization has a significantly higher cost than cab9a0656c36739f. What evidence have you got to base that value judgement on? cab9a0656c36739f was based on an actual user complaint, so we have good evidence that there are people out there who care about the cost of truncating a table many times in one transaction. On the other hand, I know of no evidence that anyone's depending on multiple sequential COPYs, nor intermixed COPY and INSERT, to be fast. The original argument for having this COPY optimization at all was to make restoring pg_dump scripts in a single transaction fast; and that use-case doesn't care about anything but a single COPY into a virgin table. I think you're worrying about exactly the wrong case. My tentative guess is that the best course is to a) Make heap_truncate_one_rel() create a new relfilenode. That fixes the truncation replay issue. b) Force new pages to be used when using the heap_sync mode in COPY. That avoids the INIT danger you found. It seems rather reasonable to avoid using pages that have already been the target of WAL logging here in general. And what reason is there to think that this would fix all the problems? We know of those two, but we've not exactly looked hard for other cases. Again, the only known field usage for the COPY optimization is the pg_dump scenario; were that not so, we'd have noticed the problem long since. So I don't have any faith that this is a well-tested area.
regards, tom lane
Re: [HACKERS] WAL logging problem in 9.4.3?
On Sat, Jul 4, 2015 at 2:26 AM, Andres Freund and...@anarazel.de wrote: On 2015-07-03 19:02:29 +0200, Andres Freund wrote: Maybe I'm just daft right now (35C outside, 32 inside, so ...), but I'm right now missing how the whole skip wal logging if relation has just been truncated optimization can ever actually be crashsafe unless we use a new relfilenode (which we don't!). Agreed... When I ran the following test scenario, I found that the loaded data disappeared after the crash recovery.

1. start PostgreSQL server with wal_level = minimal
2. execute the following SQL statements
   \copy (SELECT num FROM generate_series(1,10) num) to /tmp/num.csv with csv
   BEGIN;
   CREATE TABLE test (i int primary key);
   TRUNCATE TABLE test;
   \copy test from /tmp/num.csv with csv
   COMMIT;
   SELECT COUNT(*) FROM test; -- returns 10
3. shutdown the server with immediate mode
4. restart the server
5. execute the following SQL statement after crash recovery ends
   SELECT COUNT(*) FROM test; -- returns 0..

In #2, 10 rows were copied and the transaction was committed. The subsequent SELECT COUNT(*) obviously returned 10. However, after crash recovery, in #5, the same statement returned 0. That is, the 10 loaded (and committed) rows were lost after the crash. We actually used to use a different relfilenode, but optimized that away: cab9a0656c36739f59277b34fea8ab9438395869 commit cab9a0656c36739f59277b34fea8ab9438395869 Author: Tom Lane t...@sss.pgh.pa.us Date: Sun Aug 23 19:23:41 2009 + Make TRUNCATE do truncate-in-place when processing a relation that was created or previously truncated in the current (sub)transaction. This is safe since if the (sub)transaction later rolls back, we'd just discard the rel's current physical file anyway. This avoids unreasonable growth in the number of transient files when a relation is repeatedly truncated. Per a performance gripe a couple weeks ago from Todd Cook. to me the reasoning here looks flawed.
Before this commit, when I ran the above test scenario, no data loss happened. Regards, -- Fujii Masao
Re: [HACKERS] WAL logging problem in 9.4.3?
Fujii Masao masao.fu...@gmail.com writes: On Sat, Jul 4, 2015 at 2:26 AM, Andres Freund and...@anarazel.de wrote: We actually used to use a different relfilenode, but optimized that away: cab9a0656c36739f59277b34fea8ab9438395869 commit cab9a0656c36739f59277b34fea8ab9438395869 Author: Tom Lane t...@sss.pgh.pa.us Date: Sun Aug 23 19:23:41 2009 + Make TRUNCATE do truncate-in-place when processing a relation that was created or previously truncated in the current (sub)transaction. This is safe since if the (sub)transaction later rolls back, we'd just discard the rel's current physical file anyway. This avoids unreasonable growth in the number of transient files when a relation is repeatedly truncated. Per a performance gripe a couple weeks ago from Todd Cook. to me the reasoning here looks flawed. Before this commit, when I ran the above test scenario, no data loss happened. Actually I think what is broken here is COPY's test to decide whether it can omit writing WAL:

 * Check to see if we can avoid writing WAL
 *
 * If archive logging/streaming is not enabled *and* either
 *    - table was created in same transaction as this COPY
 *    - data is being written to relfilenode created in this transaction
 * then we can skip writing WAL. It's safe because if the transaction
 * doesn't commit, we'll discard the table (or the new relfilenode file).
 * If it does commit, we'll have done the heap_sync at the bottom of this
 * routine first.

The problem with that analysis is that it supposes that, if we crash and recover, the WAL replay sequence will not touch the data. What's killing us in this example is the replay of the TRUNCATE, but that is not the only possibility. For example consider this modification of Fujii-san's test case:

BEGIN;
CREATE TABLE test (i int primary key);
INSERT INTO test VALUES(-1);
\copy test from /tmp/num.csv with csv
COMMIT;
SELECT COUNT(*) FROM test;

The COUNT() correctly says 11 rows, but after crash-and-recover, only the row with -1 is there.
This is because the INSERT writes out an INSERT+INIT WAL record, which we happily replay, clobbering the data added later by COPY. We might have to give up on this COPY optimization :-(. I'm not sure what would be a safe rule for deciding that we can skip WAL logging in this situation, but I am pretty sure that it would require keeping information we don't currently keep about what's happened earlier in the transaction. regards, tom lane
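Tom's failure mode can be reproduced in miniature: replaying an unconditional insert(init) record rebuilds the page from the record alone, discarding whatever COPY wrote there without WAL. A toy Python model (purely illustrative; no real PostgreSQL structures or APIs):

```python
# Toy model (not PostgreSQL code) of why replaying an INSERT+INIT
# record clobbers rows that COPY wrote without WAL.  A "page" is just a
# list of values; all names are invented for illustration.

def run_transaction(wal, disk):
    # INSERT INTO test VALUES(-1): page 0 is brand new, so an
    # "insert(init)" record is logged -- replay re-creates the page.
    wal.append(("insert_init", 0, -1))
    disk[0] = [-1]
    # COPY skips WAL (table created in this xact); its 10 rows go to
    # the same page, and the file is merely fsync'd at commit.
    disk[0].extend(range(1, 11))

def replay(wal, disk):
    for kind, page, value in wal:
        if kind == "insert_init":
            disk[page] = [value]   # unconditional re-init clobbers page

disk, wal = {}, []
run_transaction(wal, disk)
assert len(disk[0]) == 11          # SELECT COUNT(*) sees 11 rows
replay(wal, disk)                  # crash recovery
assert disk[0] == [-1]             # only the WAL-logged row survives
```

The point of the model: the COPY'd rows are on disk but not in WAL, so any replayed record that rewrites the page unconditionally erases them.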
Re: [HACKERS] WAL logging problem in 9.4.3?
Andres Freund and...@anarazel.de writes: On 2015-07-06 11:14:40 -0400, Tom Lane wrote: The COUNT() correctly says 11 rows, but after crash-and-recover, only the row with -1 is there. This is because the INSERT writes out an INSERT+INIT WAL record, which we happily replay, clobbering the data added later by COPY. ISTM any WAL logged action that touches a relfilenode essentially needs to disable further optimization based on the knowledge that the relation is new. After a bit more thought, I think it's not so much any WAL logged action as any unconditionally-replayed action. INSERT+INIT breaks this example because heap_xlog_insert will unconditionally replay the action, even if the page is valid and has same or newer LSN. Similarly, TRUNCATE is problematic because we redo it unconditionally (and in that case it's hard to see an alternative). It'd not be impossible to add more state to the relcache entry for the relation. Whether it's likely that we'd find all the places that'd need updating that state, I'm not sure. Yeah, the sticking point is mainly being sure that the state is correctly tracked, both now and after future changes. We'd need to identify a state invariant that we could be pretty confident we'd not break. One idea I had was to allow the COPY optimization only if the heap file is physically zero-length at the time the COPY starts. That would still be able to optimize in all the cases we care about making COPY fast for. Rather than reverting cab9a0656c36739f, which would re-introduce a different performance problem, perhaps we could have COPY create a new relfilenode when it does this. That should be safe if the table was previously empty. regards, tom lane
Re: [HACKERS] WAL logging problem in 9.4.3?
On 2015-07-06 11:14:40 -0400, Tom Lane wrote: BEGIN; CREATE TABLE test (i int primary key); INSERT INTO test VALUES(-1); \copy test from /tmp/num.csv with csv COMMIT; SELECT COUNT(*) FROM test; The COUNT() correctly says 11 rows, but after crash-and-recover, only the row with -1 is there. This is because the INSERT writes out an INSERT+INIT WAL record, which we happily replay, clobbering the data added later by COPY. ISTM any WAL logged action that touches a relfilenode essentially needs to disable further optimization based on the knowledge that the relation is new. We might have to give up on this COPY optimization :-(. A crazy, not well thought through, bandaid for the INSERT+INIT case would be to force COPY to use a new page when using the SKIP_WAL codepath. I'm not sure what would be a safe rule for deciding that we can skip WAL logging in this situation, but I am pretty sure that it would require keeping information we don't currently keep about what's happened earlier in the transaction. It'd not be impossible to add more state to the relcache entry for the relation. Whether it's likely that we'd find all the places that'd need updating that state, I'm not sure.
Re: [HACKERS] WAL logging problem in 9.4.3?
On Fri, Jul 3, 2015 at 3:01 PM, Martijn van Oosterhout klep...@svana.org wrote: On Fri, Jul 03, 2015 at 02:34:44PM +0900, Fujii Masao wrote: Hmm, for me it is 100% reproducible. Are you familiar with Docker? I can probably construct a Dockerfile that reproduces it pretty reliably. I could reproduce the problem in the master branch by doing the following steps. Thank you, I wasn't sure if you could kill the server fast enough without containers, but it looks like immediate mode is enough.

1. start the PostgreSQL server with wal_level = minimal
2. execute the following SQL statements
   begin;
   create table test(id serial primary key);
   truncate table test;
   commit;
3. shutdown the server with immediate mode
4. restart the server (crash recovery occurs)
5. execute the following SQL statement
   select * from test;

The optimization of the TRUNCATE operation that we can use when CREATE TABLE and TRUNCATE are executed in the same transaction block seems to cause the problem. In this case, only the index file truncation is logged, and the index creation in btbuild() is not logged because wal_level is minimal. Then at the subsequent crash recovery, the index file is truncated to 0 bytes... A very simple fix is to log the index creation in that case, but I'm not sure if that's OK to do.. In 9.2 or before, this problem doesn't occur because no such error is thrown even if an index file size is zero. But in 9.3 or later, since the planner tries to read the meta page of an index to get the height of the btree, an empty index file causes such an error. The planner was changed that way by commit 31f38f28, and the problem seems to be an oversight of that commit. I'm not familiar with that change of the planner, but ISTM that we can simply change _bt_getrootheight() so that 0 is returned if an index file is empty, i.e., the meta page cannot be read, in order to work around the problem. Thought?
Regards, -- Fujii Masao
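The _bt_getrootheight() workaround floated above amounts to treating an unreadable metapage as a btree of height zero. A toy Python rendering of the idea (illustrative only; the real function is C code inside the btree AM, and Tom rejects this approach downthread):

```python
# Sketch of the proposed workaround: return 0 from the root-height
# lookup when the index file is empty (no metapage to read) instead of
# raising an error.  The index is modeled as a list of page dicts;
# names are invented for illustration.

def bt_getrootheight(index_pages):
    if not index_pages:             # zero-length file: metapage missing
        return 0                    # pretend an empty btree
    meta = index_pages[0]           # block 0 is the metapage
    return meta["root_height"]

assert bt_getrootheight([]) == 0                       # truncated index
assert bt_getrootheight([{"root_height": 2}]) == 2     # healthy index
```

This only papers over the planner's read path; the index file itself remains corrupt, which is the substance of Tom's objection.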
Re: [HACKERS] WAL logging problem in 9.4.3?
Fujii Masao masao.fu...@gmail.com writes: The optimization of the TRUNCATE operation that we can use when CREATE TABLE and TRUNCATE are executed in the same transaction block seems to cause the problem. In this case, only the index file truncation is logged, and the index creation in btbuild() is not logged because wal_level is minimal. Then at the subsequent crash recovery, the index file is truncated to 0 bytes... A very simple fix is to log the index creation in that case, but I'm not sure if that's OK to do.. In 9.2 or before, this problem doesn't occur because no such error is thrown even if an index file size is zero. But in 9.3 or later, since the planner tries to read the meta page of an index to get the height of the btree, an empty index file causes such an error. The planner was changed that way by commit 31f38f28, and the problem seems to be an oversight of that commit. What? You want to blame the planner for failing because the index was left corrupt by broken WAL replay? A failure would occur anyway at execution. regards, tom lane
Re: [HACKERS] WAL logging problem in 9.4.3?
On Fri, Jul 03, 2015 at 12:53:56PM -0400, Tom Lane wrote: Fujii Masao masao.fu...@gmail.com writes: Okay, so probably we need to change WAL replay of TRUNCATE so that the index file is truncated to one containing only the meta page instead of an empty one. That is, the WAL replay of TRUNCATE would need to call index_build() after smgrtruncate() maybe. That seems completely unworkable. For one thing, index_build would expect to be able to do catalog lookups, but we can't assume that the catalogs are in a good state yet. I think the responsibility has to be on the WAL-writing end to emit WAL instructions that lead to a correct on-disk state. Putting complex behavior into the reading side is fundamentally misguided. Am I missing something? ISTM that if the truncate record was simply not logged at all everything would work fine. The whole point is that the table was created in this transaction and so if it exists the table on disk must be the correct representation. The broken index is just one symptom. The heap also shouldn't be truncated at all. If you insert a row before commit then after replay the tuple should be there still. Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/ He who writes carelessly confesses thereby at the very outset that he does not attach much importance to his own thoughts. -- Arthur Schopenhauer
Re: [HACKERS] WAL logging problem in 9.4.3?
Fujii Masao masao.fu...@gmail.com writes: Okay, so probably we need to change WAL replay of TRUNCATE so that the index file is truncated to one containing only meta page instead of empty one. That is, the WAL replay of TRUNCATE would need to call index_build() after smgrtruncate() maybe. That seems completely unworkable. For one thing, index_build would expect to be able to do catalog lookups, but we can't assume that the catalogs are in a good state yet. I think the responsibility has to be on the WAL-writing end to emit WAL instructions that lead to a correct on-disk state. Putting complex behavior into the reading side is fundamentally misguided. regards, tom lane
Re: [HACKERS] WAL logging problem in 9.4.3?
On 2015-07-03 18:49:31 +0200, Andres Freund wrote: But the more interesting question is why that's not happening today. RelationTruncateIndexes() does call the index_build() which should end up WAL logging the index creation. So that's because there's an XLogIsNeeded() preventing it. Maybe I'm just daft right now (35C outside, 32 inside, so ...), but I'm right now missing how the whole skip wal logging if relation has just been truncated optimization can ever actually be crashsafe unless we use a new relfilenode (which we don't!). Sure, we do a heap_sync() at the end of the transaction. That's nice and all. But it doesn't help if we crash and re-start WAL apply from a checkpoint before the table was created. Because that'll replay the truncation. That's much worse than just the indexes - the rows added by a COPY without WAL logging will also be truncated away, no?
Re: [HACKERS] WAL logging problem in 9.4.3?
On 2015-07-03 19:14:26 +0200, Martijn van Oosterhout wrote: Am I missing something? ISTM that if the truncate record was simply not logged at all everything would work fine. The whole point is that the table was created in this transaction and so if it exists the table on disk must be the correct representation. That'd not work either. Consider: BEGIN; CREATE TABLE ...; INSERT; TRUNCATE; INSERT; COMMIT; If you replay that without a truncation wal record the second INSERT will try to add stuff to already occupied space. And they can have different lengths and stuff, so you cannot just ignore that fact. The broken index is just one symptom. Agreed. I think the problem is something else though. Namely that we reuse the relfilenode for heap_truncate_one_rel(). That's just entirely broken afaics. We need to allocate a new relfilenode and write stuff into that. Then we can forgo WAL logging the truncation record. If you insert a row before commit then after replay the tuple should be there still. The insert would be WAL logged. COPY skips wal logging tho.
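Andres's point, that dropping the truncate record leaves the second INSERT's target slots already occupied at replay, can be shown with a toy replay loop (Python, illustrative only; a page is a dict of line-pointer slot to tuple, and all names are invented):

```python
# Toy model of why the TRUNCATE must be replayed: the WAL-logged INSERT
# after the TRUNCATE targets slots that assume an emptied page.

def replay(records):
    page = {}
    for kind, slot, value in records:
        if kind == "truncate":
            page = {}              # replayed truncation empties the page
        else:                      # "insert"
            if slot in page:
                raise RuntimeError("overlapping tuples on target page")
            page[slot] = value
    return page

wal = [("insert", 1, "largeval"),
       ("truncate", None, None),
       ("insert", 1, "one")]
assert replay(wal) == {1: "one"}   # with the truncate record: fine

# Same history with the truncate record omitted from WAL:
no_trunc = [r for r in wal if r[0] != "truncate"]
try:
    replay(no_trunc)
    raise AssertionError("expected a collision")
except RuntimeError:
    pass                           # second insert hits occupied space
```

As in the real system, the tuples can also differ in length, so the stale contents cannot simply be ignored or overwritten blindly.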
Re: [HACKERS] WAL logging problem in 9.4.3?
On 2015-07-03 19:02:29 +0200, Andres Freund wrote: Maybe I'm just daft right now (35C outside, 32 inside, so ...), but I'm right now missing how the whole skip wal logging if relation has just been truncated optimization can ever actually be crashsafe unless we use a new relfilenode (which we don't!). We actually used to use a different relfilenode, but optimized that away: cab9a0656c36739f59277b34fea8ab9438395869 commit cab9a0656c36739f59277b34fea8ab9438395869 Author: Tom Lane t...@sss.pgh.pa.us Date: Sun Aug 23 19:23:41 2009 + Make TRUNCATE do truncate-in-place when processing a relation that was created or previously truncated in the current (sub)transaction. This is safe since if the (sub)transaction later rolls back, we'd just discard the rel's current physical file anyway. This avoids unreasonable growth in the number of transient files when a relation is repeatedly truncated. Per a performance gripe a couple weeks ago from Todd Cook. to me the reasoning here looks flawed.
Re: [HACKERS] WAL logging problem in 9.4.3?
On Fri, Jul 3, 2015 at 11:52 PM, Tom Lane t...@sss.pgh.pa.us wrote: Fujii Masao masao.fu...@gmail.com writes: The optimization of the TRUNCATE operation that we can use when CREATE TABLE and TRUNCATE are executed in the same transaction block seems to cause the problem. In this case, only the index file truncation is logged, and the index creation in btbuild() is not logged because wal_level is minimal. Then at the subsequent crash recovery, the index file is truncated to 0 bytes... A very simple fix is to log the index creation in that case, but I'm not sure if that's OK to do.. In 9.2 or before, this problem doesn't occur because no such error is thrown even if an index file size is zero. But in 9.3 or later, since the planner tries to read the meta page of an index to get the height of the btree, an empty index file causes such an error. The planner was changed that way by commit 31f38f28, and the problem seems to be an oversight of that commit. What? You want to blame the planner for failing because the index was left corrupt by broken WAL replay? A failure would occur anyway at execution. Yep, right. I was not thinking that such an index with file size 0 is corrupted because the reported problem didn't happen before that commit was added. But that's my fault. Such an index can cause an error even in other code paths. Okay, so probably we need to change WAL replay of TRUNCATE so that the index file is truncated to one containing only a meta page instead of an empty one. That is, the WAL replay of TRUNCATE would need to call index_build() after smgrtruncate() maybe. Then how should we implement that? Invent a new WAL record type that calls smgrtruncate() and index_build() during WAL replay? Or add a special flag to the XLOG_SMGR_TRUNCATE record, and make WAL replay call index_build() only if the flag is found? Any other good idea? Anyway ISTM that we might need to add or modify a WAL record.
Regards, -- Fujii Masao
Re: [HACKERS] WAL logging problem in 9.4.3?
On 2015-07-04 01:39:42 +0900, Fujii Masao wrote: Okay, so probably we need to change WAL replay of TRUNCATE so that the index file is truncated to one containing only meta page instead of empty one. That is, the WAL replay of TRUNCATE would need to call index_build() after smgrtruncate() maybe. Then how should we implement that? Invent new WAL record type that calls smgrtruncate() and index_build() during WAL replay? Or add the special flag to XLOG_SMGR_TRUNCATE record, and make WAL replay call index_build() only if the flag is found? Any other good idea? Anyway ISTM that we might need to add or modify WAL record. It's easy enough to log something like a metapage with log_newpage(). But the more interesting question is why that's not happening today. RelationTruncateIndexes() does call the index_build() which should end up WAL logging the index creation.
Re: [HACKERS] WAL logging problem in 9.4.3?
On 2015-07-03 18:38:37 -0400, Tom Lane wrote: Why exactly? The first truncation in the (sub)xact would have assigned a new relfilenode, why do we need another one? The file in question will go away on crash/rollback in any case, and no other transaction can see it yet. Consider: BEGIN; CREATE TABLE; INSERT largeval; TRUNCATE; INSERT 1; COPY; INSERT 2; COMMIT; INSERT 1 is going to be WAL logged. For that to work correctly TRUNCATE has to be WAL logged, as otherwise there'll be conflicting/overlapping tuples on the target page. But: The truncation itself is not fully wal logged, neither is the COPY. Both rely on heap_sync()/immedsync(). For that to be correct the current relfilenode's truncation may *not* be wal-logged, because the contents of the COPY or the truncation itself will only be on-disk, not in the WAL. Only being on-disk but not in the WAL is a problem if we crash and replay the truncate record. I'm prepared to believe that some bit of logic is doing the wrong thing in this state, but I do not agree that truncate-in-place is unworkable. Unless we're prepared to make everything that potentially WAL logs something do the rel->rd_createSubid == mySubid dance, I can't see that working.
Re: [HACKERS] WAL logging problem in 9.4.3?
Martijn van Oosterhout klep...@svana.org writes: With inserts the WAL records look as follows (relfilenodes changed): ... And amazingly, the database cluster successfully recovers and there's no error now. So the problem is *only* because there is no data in the table at commit time. Which indicates that it's the 'newroot' record that saves the day normally. And it's apparently generated by the first insert. Yeah, because the correct empty state of a btree index is to have a metapage but no root page, so the first insert forces creation of a root page. And, by chance, btree_xlog_newroot restores the metapage from scratch, so this works even if the metapage had been missing or corrupt. However, things would still break if the first access to the index was a read attempt rather than an insert. regards, tom lane
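Tom's "works by accident" observation can be modeled: the first insert into an empty btree logs a newroot record, and its replay rebuilds the metapage from scratch, which silently repairs a missing metapage, but only if an insert happens before any read. A toy Python sketch (illustrative only, no real btree structures):

```python
# Toy model of the accidental recovery path described above.

def read_root_height(index):
    if "meta" not in index:        # index file truncated by replay
        raise RuntimeError("could not read metapage")
    return index["meta"]["height"]

def replay_newroot(index):
    # btree_xlog_newroot rewrites the metapage from the record alone,
    # so a missing/corrupt metapage is recreated as a side effect.
    index["meta"] = {"height": 0, "root": 1}
    index[1] = []                  # fresh root page

broken = {}                        # metapage lost by broken WAL replay
try:
    read_root_height(broken)       # first access is a read: hard failure
    raise AssertionError("expected a metapage error")
except RuntimeError:
    pass
replay_newroot(broken)             # first access is an insert: repaired
assert read_root_height(broken) == 0
```

This is why Fujii-san's original reproduction only fails when the table is still empty at commit: any insert before the crash regenerates the metapage via newroot replay.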
Re: [HACKERS] WAL logging problem in 9.4.3?
On 2015-07-03 19:26:05 +0200, Andres Freund wrote: On 2015-07-03 19:02:29 +0200, Andres Freund wrote: Maybe I'm just daft right now (35C outside, 32 inside, so ...), but I'm right now missing how the whole skip wal logging if relation has just been truncated optimization can ever actually be crashsafe unless we use a new relfilenode (which we don't!). We actually used to use a different relfilenode, but optimized that away: cab9a0656c36739f59277b34fea8ab9438395869 commit cab9a0656c36739f59277b34fea8ab9438395869 Author: Tom Lane t...@sss.pgh.pa.us Date: Sun Aug 23 19:23:41 2009 + Make TRUNCATE do truncate-in-place when processing a relation that was created or previously truncated in the current (sub)transaction. This is safe since if the (sub)transaction later rolls back, we'd just discard the rel's current physical file anyway. This avoids unreasonable growth in the number of transient files when a relation is repeatedly truncated. Per a performance gripe a couple weeks ago from Todd Cook. to me the reasoning here looks flawed. It looks to me we need to re-neg on this a bit. I think we can still be more efficient than the general codepath: We can drop the old relfilenode immediately. But pg_class.relfilenode has to differ from the old after the truncation.
Re: [HACKERS] WAL logging problem in 9.4.3?
Andres Freund and...@anarazel.de writes:
On 2015-07-03 19:26:05 +0200, Andres Freund wrote:

commit cab9a0656c36739f59277b34fea8ab9438395869
Author: Tom Lane t...@sss.pgh.pa.us
Date: Sun Aug 23 19:23:41 2009 +

    Make TRUNCATE do truncate-in-place when processing a relation that was created or previously truncated in the current (sub)transaction. This is safe since if the (sub)transaction later rolls back, we'd just discard the rel's current physical file anyway. This avoids unreasonable growth in the number of transient files when a relation is repeatedly truncated. Per a performance gripe a couple weeks ago from Todd Cook.

to me the reasoning here looks flawed. It looks to me we need to re-neg on this a bit. I think we can still be more efficient than the general codepath: We can drop the old relfilenode immediately. But pg_class.relfilenode has to differ from the old after the truncation.

Why exactly? The first truncation in the (sub)xact would have assigned a new relfilenode, why do we need another one? The file in question will go away on crash/rollback in any case, and no other transaction can see it yet. I'm prepared to believe that some bit of logic is doing the wrong thing in this state, but I do not agree that truncate-in-place is unworkable.

regards, tom lane
Re: [HACKERS] WAL logging problem in 9.4.3?
On Fri, Jul 03, 2015 at 07:21:21PM +0200, Andres Freund wrote:
On 2015-07-03 19:14:26 +0200, Martijn van Oosterhout wrote:

Am I missing something? ISTM that if the truncate record was simply not logged at all everything would work fine. The whole point is that the table was created in this transaction and so if it exists the table on disk must be the correct representation.

That'd not work either. Consider:

BEGIN;
CREATE TABLE ...
INSERT;
TRUNCATE;
INSERT;
COMMIT;

If you replay that without a truncation wal record the second INSERT will try to add stuff to already occupied space. And they can have different lengths and stuff, so you cannot just ignore that fact.

I was about to disagree with you by suggesting that if the table was created in this transaction then WAL logging is skipped. But testing shows that inserts are indeed logged, as you point out. With inserts the WAL records look as follows (relfilenodes changed):

martijn@martijn-jessie:~/git/ctm/docker$ sudo /usr/lib/postgresql/9.4/bin/pg_xlogdump -p /tmp/pgtest/postgres/pg_xlog/ 00010001 |grep -wE '16386|16384|16390'
rmgr: Storage len (rec/tot): 16/48, tx: 0, lsn: 0/016A79C8, prev 0/016A79A0, bkp: , desc: file create: base/12139/16384
rmgr: Sequence len (rec/tot): 158/190, tx: 683, lsn: 0/016B4258, prev 0/016B2508, bkp: , desc: log: rel 1663/12139/16384
rmgr: Storage len (rec/tot): 16/48, tx: 683, lsn: 0/016B4318, prev 0/016B4258, bkp: , desc: file create: base/12139/16386
rmgr: Storage len (rec/tot): 16/48, tx: 683, lsn: 0/016B9468, prev 0/016B9418, bkp: , desc: file create: base/12139/16390
rmgr: Sequence len (rec/tot): 158/190, tx: 683, lsn: 0/016BC938, prev 0/016BC880, bkp: , desc: log: rel 1663/12139/16384
rmgr: Sequence len (rec/tot): 158/190, tx: 683, lsn: 0/016BCAF0, prev 0/016BCAA0, bkp: , desc: log: rel 1663/12139/16384
rmgr: Heap len (rec/tot): 35/67, tx: 683, lsn: 0/016BCBB0, prev 0/016BCAF0, bkp: , desc: insert(init): rel 1663/12139/16386; tid 0/1
rmgr: Btree len (rec/tot): 20/52, tx: 683, lsn:
0/016BCBF8, prev 0/016BCBB0, bkp: , desc: newroot: rel 1663/12139/16390; root 1 lev 0
rmgr: Btree len (rec/tot): 34/66, tx: 683, lsn: 0/016BCC30, prev 0/016BCBF8, bkp: , desc: insert: rel 1663/12139/16390; tid 1/1
rmgr: Storage len (rec/tot): 16/48, tx: 683, lsn: 0/016BCC78, prev 0/016BCC30, bkp: , desc: file truncate: base/12139/16386 to 0 blocks
rmgr: Storage len (rec/tot): 16/48, tx: 683, lsn: 0/016BCCA8, prev 0/016BCC78, bkp: , desc: file truncate: base/12139/16390 to 0 blocks
rmgr: Heap len (rec/tot): 35/67, tx: 683, lsn: 0/016BCCD8, prev 0/016BCCA8, bkp: , desc: insert(init): rel 1663/12139/16386; tid 0/1
rmgr: Btree len (rec/tot): 20/52, tx: 683, lsn: 0/016BCD20, prev 0/016BCCD8, bkp: , desc: newroot: rel 1663/12139/16390; root 1 lev 0
rmgr: Btree len (rec/tot): 34/66, tx: 683, lsn: 0/016BCD58, prev 0/016BCD20, bkp: , desc: insert: rel 1663/12139/16390; tid 1/1

 relname     | relfilenode
-------------+-------------
 test        | 16386
 test_id_seq | 16384
 test_pkey   | 16390
(3 rows)

And amazingly, the database cluster successfully recovers and there's no error now. So the problem is *only* because there is no data in the table at commit time. Which indicates that it's the 'newroot' record that saves the day normally. And it's apparently generated by the first insert.

Agreed. I think the problem is something else though. Namely that we reuse the relfilenode for heap_truncate_one_rel(). That's just entirely broken afaics. We need to allocate a new relfilenode and write stuff into that. Then we can forgo WAL logging the truncation record.

Would that properly initialise the index though? Anyway, this is way outside my expertise, so I'll bow out now. Let me know if I can be of more assistance.

Have a nice day,
-- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/ He who writes carelessly confesses thereby at the very outset that he does not attach much importance to his own thoughts. -- Arthur Schopenhauer
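Andres's BEGIN; CREATE TABLE; INSERT; TRUNCATE; INSERT; COMMIT scenario, the reason the truncate record cannot simply be dropped, can be replayed in miniature (plain Python; a deliberately crude model, not PostgreSQL code):

```python
# Toy heap replay: block number -> list of tuple ids.  The WAL holds two
# "insert(init)" records for block 0 with a truncate record in between.

def replay(wal, skip_truncate=False):
    heap = {}
    for rec in wal:
        if rec == "truncate":
            if not skip_truncate:
                heap.clear()                 # replay the truncate record
        else:
            _, blk, tid = rec                # ("insert(init)", block, tid)
            page = heap.setdefault(blk, [])
            if tid in page:
                raise ValueError("tid already occupied")
            page.append(tid)
    return heap

wal = [("insert(init)", 0, 1), "truncate", ("insert(init)", 0, 1)]

# With the truncate record replayed, the end state is consistent:
assert replay(wal) == {0: [1]}

# Without it, the second insert lands on already occupied space,
# which is exactly the corruption Andres describes:
try:
    replay(wal, skip_truncate=True)
except ValueError:
    pass
else:
    raise AssertionError("expected the second insert to collide")
```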
Re: [HACKERS] WAL logging problem in 9.4.3?
On Fri, Jul 03, 2015 at 02:34:44PM +0900, Fujii Masao wrote:

Hmm, for me it is 100% reproducible. Are you familiar with Docker? I can probably construct a Dockerfile that reproduces it pretty reliably.

I could reproduce the problem in the master branch by doing the following steps.

Thank you, I wasn't sure if you could kill the server fast enough without containers, but it looks like immediate mode is enough.

1. start the PostgreSQL server with wal_level = minimal
2. execute the following SQL statements
   begin;
   create table test(id serial primary key);
   truncate table test;
   commit;
3. shutdown the server with immediate mode
4. restart the server (crash recovery occurs)
5. execute the following SQL statement
   select * from test;

The optimization of TRUNCATE operation that we can use when CREATE TABLE and TRUNCATE are executed in the same transaction block seems to cause the problem. In this case, only index file truncation is logged, and index creation in btbuild() is not logged because wal_level is minimal. Then at the subsequent crash recovery, index file is truncated to 0 bytes...

Very simple fix is to log an index creation in that case, but not sure if that's ok to do..

Looks plausible to me. For reference I attach a small tarball for reproduction with docker.

1. Unpack tarball into empty dir (it has three small files)
2. docker build -t test .
3. docker run -v /tmp/pgtest:/data test
4. docker run -v /tmp/pgtest:/data test

Data dir is in /tmp/pgtest

Have a nice day,
-- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/ He who writes carelessly confesses thereby at the very outset that he does not attach much importance to his own thoughts. -- Arthur Schopenhauer

postgresql-test.tgz Description: GNU Unix tar archive
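Fujii's diagnosis reduces to a simple asymmetry, sketched below (plain Python; an illustration of the failure mode, not PostgreSQL code): with wal_level = minimal the btbuild() writes never enter the WAL (the file is fsync()ed at commit instead), while the index truncation *is* logged, so crash recovery can only shrink the file.

```python
# Toy crash recovery for the index relfilenode: the only thing replay
# knows about is the logged truncate record; the 8192-byte index build
# happened outside the WAL and can never be replayed.

def crash_recovery(size_on_disk, wal):
    size = size_on_disk
    for rec in wal:
        if rec == "file truncate to 0 blocks":
            size = 0            # replayed: index file truncated
        # note: no record in this WAL ever grows the file back
    return size

# On disk at crash time: the 8 kB index written directly by btbuild().
# In the WAL: only the truncate.  After recovery the index is empty,
# hence "could not read block 0 ... read only 0 of 8192 bytes".
assert crash_recovery(8192, ["file truncate to 0 blocks"]) == 0
```

Fujii's "very simple fix" of logging the index creation would add exactly the missing replayable record that regrows the file.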
Re: [HACKERS] WAL logging problem in 9.4.3?
On Fri, Jul 3, 2015 at 2:20 PM, Martijn van Oosterhout klep...@svana.org wrote:
On Fri, Jul 03, 2015 at 12:21:02AM +0200, Andres Freund wrote:
Hi,
On 2015-07-03 00:05:24 +0200, Martijn van Oosterhout wrote:

=== Start with an empty database

My guess is you have wal_level = minimal?

Default config, was just initdb'd. So yes, the default wal_level = minimal.

=== Note the index file is 8KB.
=== At this point nuke the database server (in this case it was simply
=== destroying the container it was running in.

How did you continue from there? The container has persistent storage? Or are you reapplying the WAL to somewhere else?

The container has persistent storage on the host. What I think is actually unusual is that the script that started postgres was missing an 'exec' so postgres never gets the signal to shut down.

martijn@martijn-jessie:$ sudo /usr/lib/postgresql/9.4/bin/pg_xlogdump -p /data/postgres/pg_xlog/ 00010001 |grep -wE '16389|16387|16393'
rmgr: XLOG len (rec/tot): 72/104, tx: 0, lsn: 0/016A9240, prev 0/016A9200, bkp: , desc: checkpoint: redo 0/16A9240; tli 1; prev tli 1; fpw true; xid 0/686; oid 16387; multi 1; offset 0; oldest xid 673 in DB 1; oldest multi 1 in DB 1; oldest running xid 0; shutdown
rmgr: Storage len (rec/tot): 16/48, tx: 0, lsn: 0/016A92D0, prev 0/016A92A8, bkp: , desc: file create: base/16385/16387
rmgr: Sequence len (rec/tot): 158/190, tx: 686, lsn: 0/016B5E50, prev 0/016B5D88, bkp: , desc: log: rel 1663/16385/16387
rmgr: Storage len (rec/tot): 16/48, tx: 686, lsn: 0/016B5F10, prev 0/016B5E50, bkp: , desc: file create: base/16385/16389
rmgr: Storage len (rec/tot): 16/48, tx: 686, lsn: 0/016BB028, prev 0/016BAFD8, bkp: , desc: file create: base/16385/16393
rmgr: Sequence len (rec/tot): 158/190, tx: 686, lsn: 0/016BE4F8, prev 0/016BE440, bkp: , desc: log: rel 1663/16385/16387
rmgr: Storage len (rec/tot): 16/48, tx: 686, lsn: 0/016BE6B0, prev 0/016BE660, bkp: , desc: file truncate: base/16385/16389 to 0 blocks
rmgr: Storage len (rec/tot): 16/48,
tx: 686, lsn: 0/016BE6E0, prev 0/016BE6B0, bkp: , desc: file truncate: base/16385/16393 to 0 blocks
pg_xlogdump: FATAL: error in WAL record at 0/16BE710: record with zero length at 0/16BE740

Note that the truncate will lead to a new, different, relfilenode.

Really? Comparing the relfilenodes gives the same values before and after the truncate.

Yep, the relfilenodes are not changed in this case because CREATE TABLE and TRUNCATE were executed in the same transaction block.

ctmp=# select * from test;
ERROR: could not read block 0 in file base/16385/16393: read only 0 of 8192 bytes

Hm. I can't reproduce this. Can you include a bit more details about how to reproduce?

Hmm, for me it is 100% reproducible. Are you familiar with Docker? I can probably construct a Dockerfile that reproduces it pretty reliably.

I could reproduce the problem in the master branch by doing the following steps.

1. start the PostgreSQL server with wal_level = minimal
2. execute the following SQL statements
   begin;
   create table test(id serial primary key);
   truncate table test;
   commit;
3. shutdown the server with immediate mode
4. restart the server (crash recovery occurs)
5. execute the following SQL statement
   select * from test;

The optimization of TRUNCATE operation that we can use when CREATE TABLE and TRUNCATE are executed in the same transaction block seems to cause the problem. In this case, only index file truncation is logged, and index creation in btbuild() is not logged because wal_level is minimal. Then at the subsequent crash recovery, index file is truncated to 0 bytes...

Very simple fix is to log an index creation in that case, but not sure if that's ok to do..

Regards,
-- Fujii Masao
Re: [HACKERS] WAL logging problem in 9.4.3?
On Fri, Jul 03, 2015 at 12:21:02AM +0200, Andres Freund wrote:
Hi,
On 2015-07-03 00:05:24 +0200, Martijn van Oosterhout wrote:

=== Start with an empty database

My guess is you have wal_level = minimal?

Default config, was just initdb'd. So yes, the default wal_level = minimal.

=== Note the index file is 8KB.
=== At this point nuke the database server (in this case it was simply
=== destroying the container it was running in.

How did you continue from there? The container has persistent storage? Or are you reapplying the WAL to somewhere else?

The container has persistent storage on the host. What I think is actually unusual is that the script that started postgres was missing an 'exec' so postgres never gets the signal to shut down.

martijn@martijn-jessie:$ sudo /usr/lib/postgresql/9.4/bin/pg_xlogdump -p /data/postgres/pg_xlog/ 00010001 |grep -wE '16389|16387|16393'
rmgr: XLOG len (rec/tot): 72/104, tx: 0, lsn: 0/016A9240, prev 0/016A9200, bkp: , desc: checkpoint: redo 0/16A9240; tli 1; prev tli 1; fpw true; xid 0/686; oid 16387; multi 1; offset 0; oldest xid 673 in DB 1; oldest multi 1 in DB 1; oldest running xid 0; shutdown
rmgr: Storage len (rec/tot): 16/48, tx: 0, lsn: 0/016A92D0, prev 0/016A92A8, bkp: , desc: file create: base/16385/16387
rmgr: Sequence len (rec/tot): 158/190, tx: 686, lsn: 0/016B5E50, prev 0/016B5D88, bkp: , desc: log: rel 1663/16385/16387
rmgr: Storage len (rec/tot): 16/48, tx: 686, lsn: 0/016B5F10, prev 0/016B5E50, bkp: , desc: file create: base/16385/16389
rmgr: Storage len (rec/tot): 16/48, tx: 686, lsn: 0/016BB028, prev 0/016BAFD8, bkp: , desc: file create: base/16385/16393
rmgr: Sequence len (rec/tot): 158/190, tx: 686, lsn: 0/016BE4F8, prev 0/016BE440, bkp: , desc: log: rel 1663/16385/16387
rmgr: Storage len (rec/tot): 16/48, tx: 686, lsn: 0/016BE6B0, prev 0/016BE660, bkp: , desc: file truncate: base/16385/16389 to 0 blocks
rmgr: Storage len (rec/tot): 16/48, tx: 686, lsn: 0/016BE6E0, prev 0/016BE6B0, bkp: , desc: file truncate:
base/16385/16393 to 0 blocks
pg_xlogdump: FATAL: error in WAL record at 0/16BE710: record with zero length at 0/16BE740

Note that the truncate will lead to a new, different, relfilenode.

Really? Comparing the relfilenodes gives the same values before and after the truncate.

ctmp=# select * from test;
ERROR: could not read block 0 in file base/16385/16393: read only 0 of 8192 bytes

Hm. I can't reproduce this. Can you include a bit more details about how to reproduce?

Hmm, for me it is 100% reproducible. Are you familiar with Docker? I can probably construct a Dockerfile that reproduces it pretty reliably.

Have a nice day,
-- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/ He who writes carelessly confesses thereby at the very outset that he does not attach much importance to his own thoughts. -- Arthur Schopenhauer
[HACKERS] WAL logging problem in 9.4.3?
Hoi, I ran into this in our CI setup and I think it's an actual bug. The issue appears to be that when a table is created *and truncated* in a single transaction, the WAL log is logging a truncate record it shouldn't, such that if the database crashes you have a broken index. It would also lose any data that was in the table at commit time. I didn't test 9.4.4 yet, though I don't see anything in the release notes that resembles this.

Detail:

=== Start with an empty database

martijn@martijn-jessie:$ psql ctmp -h localhost -U username
Password for user username:
psql (9.4.3)
Type "help" for help.

ctmp=# begin;
BEGIN
ctmp=# create table test(id serial primary key);
CREATE TABLE
ctmp=# truncate table test;
TRUNCATE TABLE
ctmp=# commit;
COMMIT
ctmp=# select relname, relfilenode from pg_class where relname like 'test%';
 relname     | relfilenode
-------------+-------------
 test        | 16389
 test_id_seq | 16387
 test_pkey   | 16393
(3 rows)

=== Note: if you do a CHECKPOINT here the issue doesn't happen
=== obviously.

ctmp=# \q
martijn@martijn-jessie:$ sudo ls -l /data/postgres/base/16385/{16389,16387,16393}
[sudo] password for martijn:
-rw------- 1 messagebus ssl-cert 8192 Jul 2 23:34 /data/postgres/base/16385/16387
-rw------- 1 messagebus ssl-cert    0 Jul 2 23:34 /data/postgres/base/16385/16389
-rw------- 1 messagebus ssl-cert 8192 Jul 2 23:34 /data/postgres/base/16385/16393

=== Note the index file is 8KB.
=== At this point nuke the database server (in this case it was simply
=== destroying the container it was running in.
=== Dump the xlogs just to show what got recorded. Note there's a
=== truncate for the data file and the index file.
martijn@martijn-jessie:$ sudo /usr/lib/postgresql/9.4/bin/pg_xlogdump -p /data/postgres/pg_xlog/ 00010001 |grep -wE '16389|16387|16393'
rmgr: XLOG len (rec/tot): 72/104, tx: 0, lsn: 0/016A9240, prev 0/016A9200, bkp: , desc: checkpoint: redo 0/16A9240; tli 1; prev tli 1; fpw true; xid 0/686; oid 16387; multi 1; offset 0; oldest xid 673 in DB 1; oldest multi 1 in DB 1; oldest running xid 0; shutdown
rmgr: Storage len (rec/tot): 16/48, tx: 0, lsn: 0/016A92D0, prev 0/016A92A8, bkp: , desc: file create: base/16385/16387
rmgr: Sequence len (rec/tot): 158/190, tx: 686, lsn: 0/016B5E50, prev 0/016B5D88, bkp: , desc: log: rel 1663/16385/16387
rmgr: Storage len (rec/tot): 16/48, tx: 686, lsn: 0/016B5F10, prev 0/016B5E50, bkp: , desc: file create: base/16385/16389
rmgr: Storage len (rec/tot): 16/48, tx: 686, lsn: 0/016BB028, prev 0/016BAFD8, bkp: , desc: file create: base/16385/16393
rmgr: Sequence len (rec/tot): 158/190, tx: 686, lsn: 0/016BE4F8, prev 0/016BE440, bkp: , desc: log: rel 1663/16385/16387
rmgr: Storage len (rec/tot): 16/48, tx: 686, lsn: 0/016BE6B0, prev 0/016BE660, bkp: , desc: file truncate: base/16385/16389 to 0 blocks
rmgr: Storage len (rec/tot): 16/48, tx: 686, lsn: 0/016BE6E0, prev 0/016BE6B0, bkp: , desc: file truncate: base/16385/16393 to 0 blocks
pg_xlogdump: FATAL: error in WAL record at 0/16BE710: record with zero length at 0/16BE740

=== Start the DB up again

database_1 | LOG: database system was interrupted; last known up at 2015-07-02 21:08:05 UTC
database_1 | LOG: database system was not properly shut down; automatic recovery in progress
database_1 | LOG: redo starts at 0/16A92A8
database_1 | LOG: record with zero length at 0/16BE740
database_1 | LOG: redo done at 0/16BE710
database_1 | LOG: last completed transaction was at log time 2015-07-02 21:34:45.664989+00
database_1 | LOG: database system is ready to accept connections
database_1 | LOG: autovacuum launcher started

=== Oops, the index file is empty now

martijn@martijn-jessie:$ sudo ls -l
/data/postgres/base/16385/{16389,16387,16393}
-rw------- 1 messagebus ssl-cert 8192 Jul 2 23:37 /data/postgres/base/16385/16387
-rw------- 1 messagebus ssl-cert    0 Jul 2 23:34 /data/postgres/base/16385/16389
-rw------- 1 messagebus ssl-cert    0 Jul 2 23:37 /data/postgres/base/16385/16393

martijn@martijn-jessie:$ psql ctmp -h localhost -U username
Password for user username:
psql (9.4.3)
Type "help" for help.

=== And now the index is broken. I think the only reason it doesn't
=== complain about the data file is because zero bytes there is OK. But if
=== the table had data before it would be gone now.

ctmp=# select * from test;
ERROR: could not read block 0 in file base/16385/16393: read only 0 of 8192 bytes
ctmp=# select version();
 version
---------
 PostgreSQL 9.4.3 on x86_64-unknown-linux-gnu, compiled by gcc (Debian 4.9.2-10) 4.9.2, 64-bit
(1 row)

Hope this
Re: [HACKERS] WAL logging problem in 9.4.3?
Hi,

On 2015-07-03 00:05:24 +0200, Martijn van Oosterhout wrote:

=== Start with an empty database

My guess is you have wal_level = minimal?

ctmp=# begin;
BEGIN
ctmp=# create table test(id serial primary key);
CREATE TABLE
ctmp=# truncate table test;
TRUNCATE TABLE
ctmp=# commit;
COMMIT
ctmp=# select relname, relfilenode from pg_class where relname like 'test%';
 relname     | relfilenode
-------------+-------------
 test        | 16389
 test_id_seq | 16387
 test_pkey   | 16393
(3 rows)

=== Note the index file is 8KB.
=== At this point nuke the database server (in this case it was simply
=== destroying the container it was running in.

How did you continue from there? The container has persistent storage? Or are you reapplying the WAL to somewhere else?

=== Dump the xlogs just to show what got recorded. Note there's a
=== truncate for the data file and the index file.

That should be ok.

martijn@martijn-jessie:$ sudo /usr/lib/postgresql/9.4/bin/pg_xlogdump -p /data/postgres/pg_xlog/ 00010001 |grep -wE '16389|16387|16393'
rmgr: XLOG len (rec/tot): 72/104, tx: 0, lsn: 0/016A9240, prev 0/016A9200, bkp: , desc: checkpoint: redo 0/16A9240; tli 1; prev tli 1; fpw true; xid 0/686; oid 16387; multi 1; offset 0; oldest xid 673 in DB 1; oldest multi 1 in DB 1; oldest running xid 0; shutdown
rmgr: Storage len (rec/tot): 16/48, tx: 0, lsn: 0/016A92D0, prev 0/016A92A8, bkp: , desc: file create: base/16385/16387
rmgr: Sequence len (rec/tot): 158/190, tx: 686, lsn: 0/016B5E50, prev 0/016B5D88, bkp: , desc: log: rel 1663/16385/16387
rmgr: Storage len (rec/tot): 16/48, tx: 686, lsn: 0/016B5F10, prev 0/016B5E50, bkp: , desc: file create: base/16385/16389
rmgr: Storage len (rec/tot): 16/48, tx: 686, lsn: 0/016BB028, prev 0/016BAFD8, bkp: , desc: file create: base/16385/16393
rmgr: Sequence len (rec/tot): 158/190, tx: 686, lsn: 0/016BE4F8, prev 0/016BE440, bkp: , desc: log: rel 1663/16385/16387
rmgr: Storage len (rec/tot): 16/48, tx: 686, lsn: 0/016BE6B0, prev 0/016BE660, bkp: , desc: file truncate: base/16385/16389 to 0 blocks
rmgr:
Storage len (rec/tot): 16/48, tx: 686, lsn: 0/016BE6E0, prev 0/016BE6B0, bkp: , desc: file truncate: base/16385/16393 to 0 blocks
pg_xlogdump: FATAL: error in WAL record at 0/16BE710: record with zero length at 0/16BE740

Note that the truncate will lead to a new, different, relfilenode.

=== Start the DB up again

database_1 | LOG: database system was interrupted; last known up at 2015-07-02 21:08:05 UTC
database_1 | LOG: database system was not properly shut down; automatic recovery in progress
database_1 | LOG: redo starts at 0/16A92A8
database_1 | LOG: record with zero length at 0/16BE740
database_1 | LOG: redo done at 0/16BE710
database_1 | LOG: last completed transaction was at log time 2015-07-02 21:34:45.664989+00
database_1 | LOG: database system is ready to accept connections
database_1 | LOG: autovacuum launcher started

=== Oops, the index file is empty now

That's probably just the old index file?

martijn@martijn-jessie:$ sudo ls -l /data/postgres/base/16385/{16389,16387,16393}
-rw------- 1 messagebus ssl-cert 8192 Jul 2 23:37 /data/postgres/base/16385/16387
-rw------- 1 messagebus ssl-cert    0 Jul 2 23:34 /data/postgres/base/16385/16389
-rw------- 1 messagebus ssl-cert    0 Jul 2 23:37 /data/postgres/base/16385/16393

martijn@martijn-jessie:$ psql ctmp -h localhost -U username
Password for user username:
psql (9.4.3)
Type "help" for help.

=== And now the index is broken. I think the only reason it doesn't
=== complain about the data file is because zero bytes there is OK. But if
=== the table had data before it would be gone now.

ctmp=# select * from test;
ERROR: could not read block 0 in file base/16385/16393: read only 0 of 8192 bytes

Hm. I can't reproduce this. Can you include a bit more details about how to reproduce?

Greetings, Andres Freund