Re: [HACKERS] Unlogged tables cleanup
On Fri, Dec 9, 2016 at 4:54 AM, Robert Haas wrote: > On Wed, Dec 7, 2016 at 11:20 PM, Michael Paquier > wrote: >> OK, I rewrote a bit the patch as attached. What do you think? > > Committed and back-patched all the way back to 9.2. Thanks! >>> Right (I think). If we set and clear delayChkpt around this work, we >>> don't need the immediate sync. >> >> My point is a bit different than what you mean I think: the >> transaction creating an unlogged relfilenode would not need to even >> set delayChkpt in the empty() routines because other transactions >> would not refer to it until this transaction has committed. So I am >> arguing about just removing the sync phase. > > That doesn't sound right; see the comment for heap_create_init_fork. > Suppose the transaction creating the unlogged table commits, a > checkpoint happens, and then the operating system crashes. Without > the immediate sync, the operating system crash can cause the un-sync'd > file to crash, and because of the checkpoint the WAL record that > creates it isn't replayed either. So the file's just gone. Doh. That would have made sense if the checkpoint was actually flushing the page if it was in shared buffers. But that's not the case. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Unlogged tables cleanup
On Wed, Dec 7, 2016 at 11:20 PM, Michael Paquier wrote: > OK, I rewrote a bit the patch as attached. What do you think? Committed and back-patched all the way back to 9.2. >> Right (I think). If we set and clear delayChkpt around this work, we >> don't need the immediate sync. > > My point is a bit different than what you mean I think: the > transaction creating an unlogged relfilenode would not need to even > set delayChkpt in the empty() routines because other transactions > would not refer to it until this transaction has committed. So I am > arguing about just removing the sync phase. That doesn't sound right; see the comment for heap_create_init_fork. Suppose the transaction creating the unlogged table commits, a checkpoint happens, and then the operating system crashes. Without the immediate sync, the operating system crash can cause the un-sync'd file to crash, and because of the checkpoint the WAL record that creates it isn't replayed either. So the file's just gone. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Unlogged tables cleanup
On Thu, Dec 8, 2016 at 6:03 AM, Robert Haas wrote: > Michael, your greetings were passed on to me with a request that I > look at this thread. Thanks for showing up! > On Fri, Nov 18, 2016 at 12:33 PM, Michael Paquier > wrote: More seriously, if there could be more details regarding that, I would think that we could say something like "logging the init fork is mandatory in any case to ensure its on-disk presence when at recovery replay, even on non-default tablespaces whose base location are deleted and re-created from scratch if the WAL record in charge of creating this tablespace gets replayed". The problem shows up because of tablespaces being deleted at replay at the end... So perhaps this makes sense. What do you think? >>> >>> Yes, that's about what I think we need to explain. >> >> OK, what do you think about the attached then? >> >> I came up with something like that for those code paths: >> - /* Write the page. If archiving/streaming, XLOG it. */ >> + /* >> +* Write the page and log it unconditionally. This is important >> +* particularly for indexes created on tablespaces and databases >> +* whose creation happened after the last redo pointer as recovery >> +* removes any of their existing content when the corresponding >> +* create records are replayed. >> +*/ >> I have not been able to use the word "create" less than that. The >> patch is really repetitive, but I think that we had better mention the >> need of logging the content in each code path and not touch >> log_newpage() to keep it a maximum flexible. > > In blinsert.c, nbtree.c, etc. how about: > > Write the page and log it. It might seem that an immediate sync would > be sufficient to guarantee that the file exists on disk, but recovery > itself might remove it while replaying, for example, an > XLOG_DBASE_CREATE record. Therefore, we need this even when > wal_level=minimal. OK, I rewrote a bit the patch as attached. What do you think? >>> Actually, I'm >>> wondering if we ought to just switch this over to using the delayChkpt >>> mechanism instead of going through this complicated song-and-dance. >>> Incurring an immediate sync to avoid having to WAL-log this is a bit >>> tenuous, but having to WAL-log it AND do the immediate sync just seems >>> silly, and I'm actually a bit worried that even with your fix there >>> might still be a latent bug somewhere. We couldn't switch mechanisms >>> cleanly in the 9.2 branch (cf. >>> f21bb9cfb5646e1793dcc9c0ea697bab99afa523) so perhaps we should >>> back-patch it in the form you have it in already, but it's probably >>> worth switching over at least in master. >> >> Thinking through it... Could we not just rip off the sync phase >> because there is delayChkpt? It seems to me that what counts is that >> the commit of the transaction that creates the relation does not get >> past the redo point. It would matter for read uncommitted transactions >> but that concept does not exist in PG, and never will. On >> back-branches it is definitely safer to keep the sync, I am just >> thinking about a HEAD-only optimization here as you do. > > Right (I think). If we set and clear delayChkpt around this work, we > don't need the immediate sync. My point is a bit different than what you mean I think: the transaction creating an unlogged relfilenode would not need to even set delayChkpt in the empty() routines because other transactions would not refer to it until this transaction has committed. So I am arguing about just removing the sync phase. -- Michael diff --git a/contrib/bloom/blinsert.c b/contrib/bloom/blinsert.c index 0946aa29ec..a31149f044 100644 --- a/contrib/bloom/blinsert.c +++ b/contrib/bloom/blinsert.c @@ -164,13 +164,18 @@ blbuildempty(Relation index) metapage = (Page) palloc(BLCKSZ); BloomFillMetapage(index, metapage); - /* Write the page. If archiving/streaming, XLOG it. */ + /* +* Write the page and log it. It might seem that an immediate sync +* would be sufficient to guarantee that the file exists on disk, but +* recovery itself might remove it while replaying, for example, an +* XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE record. Therefore, we +* need this even when wal_level=minimal. +*/ PageSetChecksumInplace(metapage, BLOOM_METAPAGE_BLKNO); smgrwrite(index->rd_smgr, INIT_FORKNUM, BLOOM_METAPAGE_BLKNO, (char *) metapage, true); - if (XLogIsNeeded()) - log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM, - BLOOM_METAPAGE_BLKNO, metapage, false); + log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM, + BLOOM_METAPAGE_BLKNO, metapage, false); /* * An immediate sync is required even if we xlog'd the page, because the diff --git a/src/backend/access/nbtree/nbtree.c b/src
Re: [HACKERS] Unlogged tables cleanup
Michael, your greetings were passed on to me with a request that I look at this thread. On Fri, Nov 18, 2016 at 12:33 PM, Michael Paquier wrote: >>> More seriously, if there could be more details regarding that, I would >>> think that we could say something like "logging the init fork is >>> mandatory in any case to ensure its on-disk presence when at recovery >>> replay, even on non-default tablespaces whose base location are >>> deleted and re-created from scratch if the WAL record in charge of >>> creating this tablespace gets replayed". The problem shows up because >>> of tablespaces being deleted at replay at the end... So perhaps this >>> makes sense. What do you think? >> >> Yes, that's about what I think we need to explain. > > OK, what do you think about the attached then? > > I came up with something like that for those code paths: > - /* Write the page. If archiving/streaming, XLOG it. */ > + /* > +* Write the page and log it unconditionally. This is important > +* particularly for indexes created on tablespaces and databases > +* whose creation happened after the last redo pointer as recovery > +* removes any of their existing content when the corresponding > +* create records are replayed. > +*/ > I have not been able to use the word "create" less than that. The > patch is really repetitive, but I think that we had better mention the > need of logging the content in each code path and not touch > log_newpage() to keep it a maximum flexible. In blinsert.c, nbtree.c, etc. how about: Write the page and log it. It might seem that an immediate sync would be sufficient to guarantee that the file exists on disk, but recovery itself might remove it while replaying, for example, an XLOG_DBASE_CREATE record. Therefore, we need this even when wal_level=minimal. >> Actually, I'm >> wondering if we ought to just switch this over to using the delayChkpt >> mechanism instead of going through this complicated song-and-dance. >> Incurring an immediate sync to avoid having to WAL-log this is a bit >> tenuous, but having to WAL-log it AND do the immediate sync just seems >> silly, and I'm actually a bit worried that even with your fix there >> might still be a latent bug somewhere. We couldn't switch mechanisms >> cleanly in the 9.2 branch (cf. >> f21bb9cfb5646e1793dcc9c0ea697bab99afa523) so perhaps we should >> back-patch it in the form you have it in already, but it's probably >> worth switching over at least in master. > > Thinking through it... Could we not just rip off the sync phase > because there is delayChkpt? It seems to me that what counts is that > the commit of the transaction that creates the relation does not get > past the redo point. It would matter for read uncommitted transactions > but that concept does not exist in PG, and never will. On > back-branches it is definitely safer to keep the sync, I am just > thinking about a HEAD-only optimization here as you do. Right (I think). If we set and clear delayChkpt around this work, we don't need the immediate sync. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Unlogged tables cleanup
On Thu, Nov 17, 2016 at 7:06 AM, Robert Haas wrote: > Yeah, but surely it's obvious that if you don't WAL log it, it's not > going to work for archiving or streaming. It's a lot less obvious why > you have to WAL log it when you're not doing either of those things if > you've already written it and fsync'd it locally. How is it going to > disappear if it's been fsync'd, one wonders? That's not obvious that replaying a WAL record for a database creation or tablespace creation would cause that for sure! I didn't know that redo was wiping them out with rmtree() at replay before looking at this bug. One more thing to recall when fixing an issue in this area in the future. >> More seriously, if there could be more details regarding that, I would >> think that we could say something like "logging the init fork is >> mandatory in any case to ensure its on-disk presence when at recovery >> replay, even on non-default tablespaces whose base location are >> deleted and re-created from scratch if the WAL record in charge of >> creating this tablespace gets replayed". The problem shows up because >> of tablespaces being deleted at replay at the end... So perhaps this >> makes sense. What do you think? > > Yes, that's about what I think we need to explain. OK, what do you think about the attached then? I came up with something like that for those code paths: - /* Write the page. If archiving/streaming, XLOG it. */ + /* +* Write the page and log it unconditionally. This is important +* particularly for indexes created on tablespaces and databases +* whose creation happened after the last redo pointer as recovery +* removes any of their existing content when the corresponding +* create records are replayed. +*/ I have not been able to use the word "create" less than that. The patch is really repetitive, but I think that we had better mention the need of logging the content in each code path and not touch log_newpage() to keep it a maximum flexible. > Actually, I'm > wondering if we ought to just switch this over to using the delayChkpt > mechanism instead of going through this complicated song-and-dance. > Incurring an immediate sync to avoid having to WAL-log this is a bit > tenuous, but having to WAL-log it AND do the immediate sync just seems > silly, and I'm actually a bit worried that even with your fix there > might still be a latent bug somewhere. We couldn't switch mechanisms > cleanly in the 9.2 branch (cf. > f21bb9cfb5646e1793dcc9c0ea697bab99afa523) so perhaps we should > back-patch it in the form you have it in already, but it's probably > worth switching over at least in master. Thinking through it... Could we not just rip off the sync phase because there is delayChkpt? It seems to me that what counts is that the commit of the transaction that creates the relation does not get past the redo point. It would matter for read uncommitted transactions but that concept does not exist in PG, and never will. On back-branches it is definitely safer to keep the sync, I am just thinking about a HEAD-only optimization here as you do. -- Michael unlogged-tbspace-fix-v4.patch Description: application/download -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Unlogged tables cleanup
On Wed, Nov 16, 2016 at 11:55 PM, Michael Paquier wrote: > On Wed, Nov 16, 2016 at 7:09 PM, Robert Haas wrote: >> On Wed, Nov 16, 2016 at 3:54 PM, Michael Paquier >> wrote: >>> Indeed I missed this comment block. Please let me suggest the following >>> instead: >>> /* >>> * Set up an init fork for an unlogged table so that it can be correctly >>> - * reinitialized on restart. Since we're going to do an immediate sync, we >>> - * only need to xlog this if archiving or streaming is enabled. And the >>> - * immediate sync is required, because otherwise there's no guarantee that >>> - * this will hit the disk before the next checkpoint moves the redo >>> pointer. >>> + * reinitialized on restart. An immediate sync is required even if the >>> + * page has been logged, because the write did not go through >>> + * shared_buffers and therefore a concurrent checkpoint may have moved >>> + * the redo pointer past our xlog record. >>> */ >> >> Hmm. Well, that deletes the comment that's no longer true, but it >> doesn't replace it with any explanation of why we also need to WAL-log >> it unconditionally, and I think that explanation is not entirely >> trivial? > > OK, the original code does not give any special reason either > regarding why doing so is safe for archiving or streaming :) Yeah, but surely it's obvious that if you don't WAL log it, it's not going to work for archiving or streaming. It's a lot less obvious why you have to WAL log it when you're not doing either of those things if you've already written it and fsync'd it locally. How is it going to disappear if it's been fsync'd, one wonders? > More seriously, if there could be more details regarding that, I would > think that we could say something like "logging the init fork is > mandatory in any case to ensure its on-disk presence when at recovery > replay, even on non-default tablespaces whose base location are > deleted and re-created from scratch if the WAL record in charge of > creating this tablespace gets replayed". The problem shows up because > of tablespaces being deleted at replay at the end... So perhaps this > makes sense. What do you think? Yes, that's about what I think we need to explain. Actually, I'm wondering if we ought to just switch this over to using the delayChkpt mechanism instead of going through this complicated song-and-dance. Incurring an immediate sync to avoid having to WAL-log this is a bit tenuous, but having to WAL-log it AND do the immediate sync just seems silly, and I'm actually a bit worried that even with your fix there might still be a latent bug somewhere. We couldn't switch mechanisms cleanly in the 9.2 branch (cf. f21bb9cfb5646e1793dcc9c0ea697bab99afa523) so perhaps we should back-patch it in the form you have it in already, but it's probably worth switching over at least in master. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Unlogged tables cleanup
On Wed, Nov 16, 2016 at 7:09 PM, Robert Haas wrote: > On Wed, Nov 16, 2016 at 3:54 PM, Michael Paquier > wrote: >> Indeed I missed this comment block. Please let me suggest the following >> instead: >> /* >> * Set up an init fork for an unlogged table so that it can be correctly >> - * reinitialized on restart. Since we're going to do an immediate sync, we >> - * only need to xlog this if archiving or streaming is enabled. And the >> - * immediate sync is required, because otherwise there's no guarantee that >> - * this will hit the disk before the next checkpoint moves the redo pointer. >> + * reinitialized on restart. An immediate sync is required even if the >> + * page has been logged, because the write did not go through >> + * shared_buffers and therefore a concurrent checkpoint may have moved >> + * the redo pointer past our xlog record. >> */ > > Hmm. Well, that deletes the comment that's no longer true, but it > doesn't replace it with any explanation of why we also need to WAL-log > it unconditionally, and I think that explanation is not entirely > trivial? OK, the original code does not give any special reason either regarding why doing so is safe for archiving or streaming :) More seriously, if there could be more details regarding that, I would think that we could say something like "logging the init fork is mandatory in any case to ensure its on-disk presence when at recovery replay, even on non-default tablespaces whose base location are deleted and re-created from scratch if the WAL record in charge of creating this tablespace gets replayed". The problem shows up because of tablespaces being deleted at replay at the end... So perhaps this makes sense. What do you think? -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Unlogged tables cleanup
On Wed, Nov 16, 2016 at 3:54 PM, Michael Paquier wrote: > On Wed, Nov 16, 2016 at 11:45 AM, Robert Haas wrote: >> The header comment for heap_create_init_fork() says this: >> >> /* >> * Set up an init fork for an unlogged table so that it can be correctly >> * reinitialized on restart. Since we're going to do an immediate sync, we >> * only need to xlog this if archiving or streaming is enabled. And the >> * immediate sync is required, because otherwise there's no guarantee that >> * this will hit the disk before the next checkpoint moves the redo pointer. >> */ >> >> Your patch causes the code not to match the comment any more. And the >> comment explains why at the time I wrote this code I thought it was OK >> to have the XLogIsNeeded() test in there, so it clearly needs >> updating. > > Indeed I missed this comment block. Please let me suggest the following > instead: > /* > * Set up an init fork for an unlogged table so that it can be correctly > - * reinitialized on restart. Since we're going to do an immediate sync, we > - * only need to xlog this if archiving or streaming is enabled. And the > - * immediate sync is required, because otherwise there's no guarantee that > - * this will hit the disk before the next checkpoint moves the redo pointer. > + * reinitialized on restart. An immediate sync is required even if the > + * page has been logged, because the write did not go through > + * shared_buffers and therefore a concurrent checkpoint may have moved > + * the redo pointer past our xlog record. > */ Hmm. Well, that deletes the comment that's no longer true, but it doesn't replace it with any explanation of why we also need to WAL-log it unconditionally, and I think that explanation is not entirely trivial? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Unlogged tables cleanup
On Wed, Nov 16, 2016 at 11:45 AM, Robert Haas wrote: > The header comment for heap_create_init_fork() says this: > > /* > * Set up an init fork for an unlogged table so that it can be correctly > * reinitialized on restart. Since we're going to do an immediate sync, we > * only need to xlog this if archiving or streaming is enabled. And the > * immediate sync is required, because otherwise there's no guarantee that > * this will hit the disk before the next checkpoint moves the redo pointer. > */ > > Your patch causes the code not to match the comment any more. And the > comment explains why at the time I wrote this code I thought it was OK > to have the XLogIsNeeded() test in there, so it clearly needs > updating. Indeed I missed this comment block. Please let me suggest the following instead: /* * Set up an init fork for an unlogged table so that it can be correctly - * reinitialized on restart. Since we're going to do an immediate sync, we - * only need to xlog this if archiving or streaming is enabled. And the - * immediate sync is required, because otherwise there's no guarantee that - * this will hit the disk before the next checkpoint moves the redo pointer. + * reinitialized on restart. An immediate sync is required even if the + * page has been logged, because the write did not go through + * shared_buffers and therefore a concurrent checkpoint may have moved + * the redo pointer past our xlog record. */ -- Michael diff --git a/contrib/bloom/blinsert.c b/contrib/bloom/blinsert.c index 0946aa2..c9c7049 100644 --- a/contrib/bloom/blinsert.c +++ b/contrib/bloom/blinsert.c @@ -164,13 +164,12 @@ blbuildempty(Relation index) metapage = (Page) palloc(BLCKSZ); BloomFillMetapage(index, metapage); - /* Write the page. If archiving/streaming, XLOG it. */ + /* Write the page and log it. */ PageSetChecksumInplace(metapage, BLOOM_METAPAGE_BLKNO); smgrwrite(index->rd_smgr, INIT_FORKNUM, BLOOM_METAPAGE_BLKNO, (char *) metapage, true); - if (XLogIsNeeded()) - log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM, - BLOOM_METAPAGE_BLKNO, metapage, false); + log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM, +BLOOM_METAPAGE_BLKNO, metapage, false); /* * An immediate sync is required even if we xlog'd the page, because the diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c index 128744c..624aa84 100644 --- a/src/backend/access/nbtree/nbtree.c +++ b/src/backend/access/nbtree/nbtree.c @@ -242,13 +242,12 @@ btbuildempty(Relation index) metapage = (Page) palloc(BLCKSZ); _bt_initmetapage(metapage, P_NONE, 0); - /* Write the page. If archiving/streaming, XLOG it. */ + /* Write the page and log it */ PageSetChecksumInplace(metapage, BTREE_METAPAGE); smgrwrite(index->rd_smgr, INIT_FORKNUM, BTREE_METAPAGE, (char *) metapage, true); - if (XLogIsNeeded()) - log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM, - BTREE_METAPAGE, metapage, false); + log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM, +BTREE_METAPAGE, metapage, false); /* * An immediate sync is required even if we xlog'd the page, because the diff --git a/src/backend/access/spgist/spginsert.c b/src/backend/access/spgist/spginsert.c index 01c8d21..8ac3b00 100644 --- a/src/backend/access/spgist/spginsert.c +++ b/src/backend/access/spgist/spginsert.c @@ -161,13 +161,12 @@ spgbuildempty(Relation index) page = (Page) palloc(BLCKSZ); SpGistInitMetapage(page); - /* Write the page. If archiving/streaming, XLOG it. */ + /* Write the page and log it. */ PageSetChecksumInplace(page, SPGIST_METAPAGE_BLKNO); smgrwrite(index->rd_smgr, INIT_FORKNUM, SPGIST_METAPAGE_BLKNO, (char *) page, true); - if (XLogIsNeeded()) - log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM, - SPGIST_METAPAGE_BLKNO, page, false); + log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM, +SPGIST_METAPAGE_BLKNO, page, false); /* Likewise for the root page. */ SpGistInitPage(page, SPGIST_LEAF); @@ -175,9 +174,8 @@ spgbuildempty(Relation index) PageSetChecksumInplace(page, SPGIST_ROOT_BLKNO); smgrwrite(index->rd_smgr, INIT_FORKNUM, SPGIST_ROOT_BLKNO, (char *) page, true); - if (XLogIsNeeded()) - log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM, - SPGIST_ROOT_BLKNO, page, true); + log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM, +SPGIST_ROOT_BLKNO, page, true); /* Likewise for the null-tuples root page. */ SpGistInitPage(page, SPGIST_LEAF | SPGIST_NULLS); @@ -185,9 +183,8 @@ spgbuildempty(Relation index) PageSetChecksumInplace(page, SPGIST_NULL_BLKNO); smgrwrite(index->rd_smgr, INIT_FORKNUM, SPGIST_NULL_BLKNO, (char *) page, true); - if (XLogIsNeeded()) - log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM, - SPGIST_NULL_BLKNO, page, true); + log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM, +SPGIST_NULL_BLKNO, page, true); /* * An
Re: [HACKERS] Unlogged tables cleanup
On Thu, Nov 10, 2016 at 9:25 PM, Michael Paquier wrote: > On Thu, Nov 10, 2016 at 10:52 PM, Kuntal Ghosh > wrote: >> On Thu, Nov 10, 2016 at 3:42 PM, Michael Paquier >> wrote: >>> Nah. Looking at the code the fix is quite obvious. >>> heap_create_init_fork() is checking for XLogIsNeeded() to decide if >>> the INIT forknum should be logged or not. But this is wrong, it needs >>> to be done unconditionally to ensure that the relation gets created at >>> replay. >> >> I think that we should also update other *buildempty() functions as well. >> For example, if the table has a primary key, we'll encounter the error again >> for btree index. > > Good point. I just had a look at all the AMs: bloom, btree and SP-gist > are plainly broken. The others are fine. Attached is an updated patch. > > Here are the tests I have done with wal_level = minimal to test all the AMs: > \! rm -rf /Users/mpaquier/Desktop/tbsp/PG* > create extension bloom; > create extension btree_gist; > create extension btree_gin; > create tablespace popo location '/Users/mpaquier/Desktop/tbsp'; > set default_tablespace = popo; > create unlogged table foo (a int); > insert into foo values (generate_series(1,1)); > create index foo_bt on foo(a); > create index foo_bloom on foo using bloom(a); > create index foo_gin on foo using gin (a); > create index foo_gist on foo using gist (a); > create index foo_brin on foo using brin (a); > create unlogged table foo_geo (a box); > insert into foo_geo values ('(1,2),(2,3)'); > create index foo_spgist ON foo_geo using spgist(a); > > -- crash PG > -- Insert some data > insert into foo values (100); > insert into foo_geo values ('(50,12),(312,3)'); > > This should create 8 INIT forks, 6 for the indexes and 2 for the > tables. On HEAD, only 3 are getting created and with the patch all of > them are. The header comment for heap_create_init_fork() says this: /* * Set up an init fork for an unlogged table so that it can be correctly * reinitialized on restart. Since we're going to do an immediate sync, we * only need to xlog this if archiving or streaming is enabled. And the * immediate sync is required, because otherwise there's no guarantee that * this will hit the disk before the next checkpoint moves the redo pointer. */ Your patch causes the code not to match the comment any more. And the comment explains why at the time I wrote this code I thought it was OK to have the XLogIsNeeded() test in there, so it clearly needs updating. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Unlogged tables cleanup
On Thu, Nov 10, 2016 at 10:52 PM, Kuntal Ghosh wrote: > On Thu, Nov 10, 2016 at 3:42 PM, Michael Paquier > wrote: >> Nah. Looking at the code the fix is quite obvious. >> heap_create_init_fork() is checking for XLogIsNeeded() to decide if >> the INIT forknum should be logged or not. But this is wrong, it needs >> to be done unconditionally to ensure that the relation gets created at >> replay. > > I think that we should also update other *buildempty() functions as well. > For example, if the table has a primary key, we'll encounter the error again > for btree index. Good point. I just had a look at all the AMs: bloom, btree and SP-gist are plainly broken. The others are fine. Attached is an updated patch. Here are the tests I have done with wal_level = minimal to test all the AMs: \! rm -rf /Users/mpaquier/Desktop/tbsp/PG* create extension bloom; create extension btree_gist; create extension btree_gin; create tablespace popo location '/Users/mpaquier/Desktop/tbsp'; set default_tablespace = popo; create unlogged table foo (a int); insert into foo values (generate_series(1,1)); create index foo_bt on foo(a); create index foo_bloom on foo using bloom(a); create index foo_gin on foo using gin (a); create index foo_gist on foo using gist (a); create index foo_brin on foo using brin (a); create unlogged table foo_geo (a box); insert into foo_geo values ('(1,2),(2,3)'); create index foo_spgist ON foo_geo using spgist(a); -- crash PG -- Insert some data insert into foo values (100); insert into foo_geo values ('(50,12),(312,3)'); This should create 8 INIT forks, 6 for the indexes and 2 for the tables. On HEAD, only 3 are getting created and with the patch all of them are. -- Michael diff --git a/contrib/bloom/blinsert.c b/contrib/bloom/blinsert.c index 0946aa2..c9c7049 100644 --- a/contrib/bloom/blinsert.c +++ b/contrib/bloom/blinsert.c @@ -164,13 +164,12 @@ blbuildempty(Relation index) metapage = (Page) palloc(BLCKSZ); BloomFillMetapage(index, metapage); - /* Write the page. If archiving/streaming, XLOG it. */ + /* Write the page and log it. */ PageSetChecksumInplace(metapage, BLOOM_METAPAGE_BLKNO); smgrwrite(index->rd_smgr, INIT_FORKNUM, BLOOM_METAPAGE_BLKNO, (char *) metapage, true); - if (XLogIsNeeded()) - log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM, - BLOOM_METAPAGE_BLKNO, metapage, false); + log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM, +BLOOM_METAPAGE_BLKNO, metapage, false); /* * An immediate sync is required even if we xlog'd the page, because the diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c index 128744c..624aa84 100644 --- a/src/backend/access/nbtree/nbtree.c +++ b/src/backend/access/nbtree/nbtree.c @@ -242,13 +242,12 @@ btbuildempty(Relation index) metapage = (Page) palloc(BLCKSZ); _bt_initmetapage(metapage, P_NONE, 0); - /* Write the page. If archiving/streaming, XLOG it. */ + /* Write the page and log it */ PageSetChecksumInplace(metapage, BTREE_METAPAGE); smgrwrite(index->rd_smgr, INIT_FORKNUM, BTREE_METAPAGE, (char *) metapage, true); - if (XLogIsNeeded()) - log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM, - BTREE_METAPAGE, metapage, false); + log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM, +BTREE_METAPAGE, metapage, false); /* * An immediate sync is required even if we xlog'd the page, because the diff --git a/src/backend/access/spgist/spginsert.c b/src/backend/access/spgist/spginsert.c index 01c8d21..8ac3b00 100644 --- a/src/backend/access/spgist/spginsert.c +++ b/src/backend/access/spgist/spginsert.c @@ -161,13 +161,12 @@ spgbuildempty(Relation index) page = (Page) palloc(BLCKSZ); SpGistInitMetapage(page); - /* Write the page. If archiving/streaming, XLOG it. */ + /* Write the page and log it. */ PageSetChecksumInplace(page, SPGIST_METAPAGE_BLKNO); smgrwrite(index->rd_smgr, INIT_FORKNUM, SPGIST_METAPAGE_BLKNO, (char *) page, true); - if (XLogIsNeeded()) - log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM, - SPGIST_METAPAGE_BLKNO, page, false); + log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM, +SPGIST_METAPAGE_BLKNO, page, false); /* Likewise for the root page. */ SpGistInitPage(page, SPGIST_LEAF); @@ -175,9 +174,8 @@ spgbuildempty(Relation index) PageSetChecksumInplace(page, SPGIST_ROOT_BLKNO); smgrwrite(index->rd_smgr, INIT_FORKNUM, SPGIST_ROOT_BLKNO, (char *) page, true); - if (XLogIsNeeded()) - log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM, - SPGIST_ROOT_BLKNO, page, true); + log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM, +SPGIST_ROOT_BLKNO, page, true); /* Likewise for the null-tuples root page. */ SpGistInitPage(page, SPGIST_LEAF | SPGIST_NULLS); @@ -185,9 +183,8 @@ spgbuildempty(Relation index) PageSetChecksumInplace(page, SPGIST_NULL_BLKNO); smgrwrite(index->rd_smgr, INIT_FORKNUM, SPGIST_NULL_BLKNO, (char *
Re: [HACKERS] Unlogged tables cleanup
On Thu, Nov 10, 2016 at 3:42 PM, Michael Paquier wrote: > On Thu, Nov 10, 2016 at 5:15 PM, Michael Paquier > wrote: >> Okay, so what happens is that the CREATE TABLESPACE record removes the >> tablespace directory and recreates a fresh one, but as no CREATE >> records are created for unlogged tables the init fork is not >> re-created. It seems to me that we should log a record to recreate the >> INIT fork, and that heap_create_with_catalog() is missing something. >> Generating a record in RelationCreateStorage() is the more direct way, >> and that actually fixes the issue. Now the comments at the top of it >> mention that RelationCreateStorage() should only be used to create the >> MAIN forknum. So, wouldn't a correct fix be to log this INIT record at >> the end of heap_create()? > > Nah. Looking at the code the fix is quite obvious. > heap_create_init_fork() is checking for XLogIsNeeded() to decide if > the INIT forknum should be logged or not. But this is wrong, it needs > to be done unconditionally to ensure that the relation gets created at > replay. I think that we should also update other *buildempty() functions as well. For example, if the table has a primary key, we'll encounter the error again for btree index. -- Thanks & Regards, Kuntal Ghosh EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Unlogged tables cleanup
On Thu, Nov 10, 2016 at 5:15 PM, Michael Paquier wrote: > Okay, so what happens is that the CREATE TABLESPACE record removes the > tablespace directory and recreates a fresh one, but as no CREATE > records are created for unlogged tables the init fork is not > re-created. It seems to me that we should log a record to recreate the > INIT fork, and that heap_create_with_catalog() is missing something. > Generating a record in RelationCreateStorage() is the more direct way, > and that actually fixes the issue. Now the comments at the top of it > mention that RelationCreateStorage() should only be used to create the > MAIN forknum. So, wouldn't a correct fix be to log this INIT record at > the end of heap_create()? Nah. Looking at the code the fix is quite obvious. heap_create_init_fork() is checking for XLogIsNeeded() to decide if the INIT forknum should be logged or not. But this is wrong, it needs to be done unconditionally to ensure that the relation gets created at replay. -- Michael diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c index 0cf7b9e..2497a1e 100644 --- a/src/backend/catalog/heap.c +++ b/src/backend/catalog/heap.c @@ -1380,8 +1380,7 @@ heap_create_init_fork(Relation rel) { RelationOpenSmgr(rel); smgrcreate(rel->rd_smgr, INIT_FORKNUM, false); - if (XLogIsNeeded()) - log_smgrcreate(&rel->rd_smgr->smgr_rnode.node, INIT_FORKNUM); + log_smgrcreate(&rel->rd_smgr->smgr_rnode.node, INIT_FORKNUM); smgrimmedsync(rel->rd_smgr, INIT_FORKNUM); } -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Unlogged tables cleanup
On Thu, Nov 10, 2016 at 4:33 PM, Michael Paquier wrote: > On Thu, Nov 10, 2016 at 4:23 PM, konstantin knizhnik > wrote: >> No, it is latest sources from Postgres repository. >> Please notice that you should create new database and tablespace to >> reproduce this issue. >> So actually the whole sequence is >> >> mkdir fs >> initdb -D pgsql >> pg_ctl -D pgsql -l logfile start >> psql postgres >> # create tablespace fs location '/home/knizhnik/dtm-data/fs'; >> # set default_tablespace=fs; >> # create unlogged table foo(x integer); >> # insert into foo values(generate_series(1,10)); >> # ^D >> pkill -9 postgres >> pg_ctl -D pgsql -l logfile start >> # select * from foo; > > OK, I understood what I was missing. This can be reproduced with > wal_level = minimal. When using hot_standby things are working > properly. Okay, so what happens is that the CREATE TABLESPACE record removes the tablespace directory and recreates a fresh one, but as no CREATE records are created for unlogged tables the init fork is not re-created. It seems to me that we should log a record to recreate the INIT fork, and that heap_create_with_catalog() is missing something. Generating a record in RelationCreateStorage() is the more direct way, and that actually fixes the issue. Now the comments at the top of it mention that RelationCreateStorage() should only be used to create the MAIN forknum. So, wouldn't a correct fix be to log this INIT record at the end of heap_create()? -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Unlogged tables cleanup
On Thu, Nov 10, 2016 at 4:23 PM, konstantin knizhnik wrote: > No, it is latest sources from Postgres repository. > Please notice that you should create new database and tablespace to reproduce > this issue. > So actually the whole sequence is > > mkdir fs > initdb -D pgsql > pg_ctl -D pgsql -l logfile start > psql postgres > # create tablespace fs location '/home/knizhnik/dtm-data/fs'; > # set default_tablespace=fs; > # create unlogged table foo(x integer); > # insert into foo values(generate_series(1,10)); > # ^D > pkill -9 postgres > pg_ctl -D pgsql -l logfile start > # select * from foo; OK, I understood what I was missing. This can be reproduced with wal_level = minimal. When using hot_standby things are working properly. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Unlogged tables cleanup
On Nov 10, 2016, at 10:17 AM, Michael Paquier wrote: > > Hm.. I cannot reproduce what you see on Linux or macos. Perhaps you > have locally a standby pointing as well to this tablespace? No, it is latest sources from Postgres repository. Please notice that you should create new database and tablespace to reproduce this issue. So actually the whole sequence is mkdir fs initdb -D pgsql pg_ctl -D pgsql -l logfile start psql postgres # create tablespace fs location '/home/knizhnik/dtm-data/fs'; # set default_tablespace=fs; # create unlogged table foo(x integer); # insert into foo values(generate_series(1,10)); # ^D pkill -9 postgres pg_ctl -D pgsql -l logfile start # select * from foo; > -- > Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Unlogged tables cleanup
On Thu, Nov 10, 2016 at 3:29 AM, Robert Haas wrote: > On Wed, Nov 9, 2016 at 11:56 AM, Konstantin Knizhnik > wrote: >> Now simulate server crash using using "pkill -9 postgres". >> >> knizhnik@knizhnik:~/dtm-data$ rm -f logfile ; pg_ctl -D pgsql.master -l >> logfile start >> pg_ctl: another server might be running; trying to start server anyway >> server starting >> knizhnik@knizhnik:~/dtm-data$ psql postgres >> psql (10devel) >> Type "help" for help. >> >> postgres=# select * from foo; >> ERROR: could not open file "pg_tblspc/16384/PG_10_201611041/12289/16385": >> No such file or directory >> >> knizhnik@knizhnik:~/dtm-data$ ls fs >> PG_10_201611041 >> knizhnik@knizhnik:~/dtm-data$ ls fs/PG_10_201611041/ >> >> So all relation directory is removed! >> It happens only for first table created in tablespace. >> If you create table in Postgres data directory everything is ok: first >> segment of relation is truncated but not deleted. > > Whoa. There should be an _init fork that doesn't get removed, and > without removing the _init fork you shouldn't be able to remove the > directory that contains it. Hm.. I cannot reproduce what you see on Linux or macos. Perhaps you have locally a standby pointing as well to this tablespace? -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Unlogged tables cleanup
On Wed, Nov 9, 2016 at 11:56 AM, Konstantin Knizhnik wrote: > Now simulate server crash using using "pkill -9 postgres". > > knizhnik@knizhnik:~/dtm-data$ rm -f logfile ; pg_ctl -D pgsql.master -l > logfile start > pg_ctl: another server might be running; trying to start server anyway > server starting > knizhnik@knizhnik:~/dtm-data$ psql postgres > psql (10devel) > Type "help" for help. > > postgres=# select * from foo; > ERROR: could not open file "pg_tblspc/16384/PG_10_201611041/12289/16385": > No such file or directory > > knizhnik@knizhnik:~/dtm-data$ ls fs > PG_10_201611041 > knizhnik@knizhnik:~/dtm-data$ ls fs/PG_10_201611041/ > > So all relation directory is removed! > It happens only for first table created in tablespace. > If you create table in Postgres data directory everything is ok: first > segment of relation is truncated but not deleted. Whoa. There should be an _init fork that doesn't get removed, and without removing the _init fork you shouldn't be able to remove the directory that contains it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Unlogged tables cleanup
Hi, hackers I wonder if such behavior can be considered as a bug: knizhnik@knizhnik:~/dtm-data$ psql postgres psql (10devel) Type "help" for help. postgres=# create tablespace fs location '/home/knizhnik/dtm-data/fs'; CREATE TABLESPACE postgres=# set default_tablespace=fs; SET postgres=# create unlogged table foo(x integer); CREATE TABLE postgres=# insert into foo values(generate_series(1,10)); INSERT 0 10 Now simulate server crash using using "pkill -9 postgres". knizhnik@knizhnik:~/dtm-data$ rm -f logfile ; pg_ctl -D pgsql.master -l logfile start pg_ctl: another server might be running; trying to start server anyway server starting knizhnik@knizhnik:~/dtm-data$ psql postgres psql (10devel) Type "help" for help. postgres=# select * from foo; ERROR: could not open file "pg_tblspc/16384/PG_10_201611041/12289/16385": No such file or directory knizhnik@knizhnik:~/dtm-data$ ls fs PG_10_201611041 knizhnik@knizhnik:~/dtm-data$ ls fs/PG_10_201611041/ So all relation directory is removed! It happens only for first table created in tablespace. If you create table in Postgres data directory everything is ok: first segment of relation is truncated but not deleted. Also if you create one more unlogged table in tablespace it is truncated correctly: postgres=# set default_tablespace=fs; SET postgres=# create unlogged table foo1(x integer); CREATE TABLE postgres=# insert into foo1 values(generate_series(1,10)); INSERT 0 10 postgres=# \q knizhnik@knizhnik:~/dtm-data$ pkill -9 postgres knizhnik@knizhnik:~/dtm-data$ rm -f logfile ; pg_ctl -D pgsql.master -l logfile start pg_ctl: another server might be running; trying to start server anyway server starting knizhnik@knizhnik:~/dtm-data$ psql postgres psql (10devel) Type "help" for help. postgres=# select * from foo1; x --- (0 rows) knizhnik@knizhnik:~/dtm-data$ ls -l fs/PG_10_201611041/12289/* -rw--- 1 knizhnik knizhnik 0 Nov 9 19:52 fs/PG_10_201611041/12289/32768 -rw--- 1 knizhnik knizhnik 0 Nov 9 19:52 fs/PG_10_201611041/12289/32768_init -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers