Re: [HACKERS] Unlogged tables cleanup

2016-12-08 Thread Michael Paquier
On Fri, Dec 9, 2016 at 4:54 AM, Robert Haas  wrote:
> On Wed, Dec 7, 2016 at 11:20 PM, Michael Paquier
>  wrote:
>> OK, I rewrote a bit the patch as attached. What do you think?
>
> Committed and back-patched all the way back to 9.2.

Thanks!

>>> Right (I think).  If we set and clear delayChkpt around this work, we
>>> don't need the immediate sync.
>>
>> My point is a bit different than what you mean I think: the
>> transaction creating an unlogged relfilenode would not need to even
>> set delayChkpt in the empty() routines because other transactions
>> would not refer to it until this transaction has committed. So I am
>> arguing about just removing the sync phase.
>
> That doesn't sound right; see the comment for heap_create_init_fork.
> Suppose the transaction creating the unlogged table commits, a
> checkpoint happens, and then the operating system crashes.  Without
> the immediate sync, the operating system crash can cause the un-sync'd
> file to crash, and because of the checkpoint the WAL record that
> creates it isn't replayed either.  So the file's just gone.

Doh. That would have made sense if the checkpoint was actually
flushing the page if it was in shared buffers. But that's not the
case.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unlogged tables cleanup

2016-12-08 Thread Robert Haas
On Wed, Dec 7, 2016 at 11:20 PM, Michael Paquier
 wrote:
> OK, I rewrote a bit the patch as attached. What do you think?

Committed and back-patched all the way back to 9.2.

>> Right (I think).  If we set and clear delayChkpt around this work, we
>> don't need the immediate sync.
>
> My point is a bit different than what you mean I think: the
> transaction creating an unlogged relfilenode would not need to even
> set delayChkpt in the empty() routines because other transactions
> would not refer to it until this transaction has committed. So I am
> arguing about just removing the sync phase.

That doesn't sound right; see the comment for heap_create_init_fork.
Suppose the transaction creating the unlogged table commits, a
checkpoint happens, and then the operating system crashes.  Without
the immediate sync, the operating system crash can cause the un-sync'd
file to crash, and because of the checkpoint the WAL record that
creates it isn't replayed either.  So the file's just gone.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unlogged tables cleanup

2016-12-07 Thread Michael Paquier
On Thu, Dec 8, 2016 at 6:03 AM, Robert Haas  wrote:
> Michael, your greetings were passed on to me with a request that I
> look at this thread.

Thanks for showing up!

> On Fri, Nov 18, 2016 at 12:33 PM, Michael Paquier
>  wrote:
 More seriously, if there could be more details regarding that, I would
 think that we could say something like "logging the init fork is
 mandatory in any case to ensure its on-disk presence when at recovery
 replay, even on non-default tablespaces whose base location are
 deleted and re-created from scratch if the WAL record in charge of
 creating this tablespace gets replayed". The problem shows up because
 of tablespaces being deleted at replay at the end... So perhaps this
 makes sense. What do you think?
>>>
>>> Yes, that's about what I think we need to explain.
>>
>> OK, what do you think about the attached then?
>>
>> I came up with something like that for those code paths:
>> -   /* Write the page.  If archiving/streaming, XLOG it. */
>> +   /*
>> +* Write the page and log it unconditionally.  This is important
>> +* particularly for indexes created on tablespaces and databases
>> +* whose creation happened after the last redo pointer as recovery
>> +* removes any of their existing content when the corresponding
>> +* create records are replayed.
>> +*/
>> I have not been able to use the word "create" less than that. The
>> patch is really repetitive, but I think that we had better mention the
>> need of logging the content in each code path and not touch
>> log_newpage() to keep it a maximum flexible.
>
> In blinsert.c, nbtree.c, etc. how about:
>
> Write the page and log it.  It might seem that an immediate sync would
> be sufficient to guarantee that the file exists on disk, but recovery
> itself might remove it while replaying, for example, an
> XLOG_DBASE_CREATE record.  Therefore, we need this even when
> wal_level=minimal.

OK, I rewrote a bit the patch as attached. What do you think?

>>> Actually, I'm
>>> wondering if we ought to just switch this over to using the delayChkpt
>>> mechanism instead of going through this complicated song-and-dance.
>>> Incurring an immediate sync to avoid having to WAL-log this is a bit
>>> tenuous, but having to WAL-log it AND do the immediate sync just seems
>>> silly, and I'm actually a bit worried that even with your fix there
>>> might still be a latent bug somewhere.  We couldn't switch mechanisms
>>> cleanly in the 9.2 branch (cf.
>>> f21bb9cfb5646e1793dcc9c0ea697bab99afa523) so perhaps we should
>>> back-patch it in the form you have it in already, but it's probably
>>> worth switching over at least in master.
>>
>> Thinking through it... Could we not just rip off the sync phase
>> because there is delayChkpt? It seems to me that what counts is that
>> the commit of the transaction that creates the relation does not get
>> past the redo point. It would matter for read uncommitted transactions
>> but that concept does not exist in PG, and never will. On
>> back-branches it is definitely safer to keep the sync, I am just
>> thinking about a HEAD-only optimization here as you do.
>
> Right (I think).  If we set and clear delayChkpt around this work, we
> don't need the immediate sync.

My point is a bit different than what you mean I think: the
transaction creating an unlogged relfilenode would not need to even
set delayChkpt in the empty() routines because other transactions
would not refer to it until this transaction has committed. So I am
arguing about just removing the sync phase.
-- 
Michael
diff --git a/contrib/bloom/blinsert.c b/contrib/bloom/blinsert.c
index 0946aa29ec..a31149f044 100644
--- a/contrib/bloom/blinsert.c
+++ b/contrib/bloom/blinsert.c
@@ -164,13 +164,18 @@ blbuildempty(Relation index)
metapage = (Page) palloc(BLCKSZ);
BloomFillMetapage(index, metapage);
 
-   /* Write the page.  If archiving/streaming, XLOG it. */
+   /*
+* Write the page and log it.  It might seem that an immediate sync
+* would be sufficient to guarantee that the file exists on disk, but
+* recovery itself might remove it while replaying, for example, an
+* XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE record.  Therefore, we
+* need this even when wal_level=minimal.
+*/
PageSetChecksumInplace(metapage, BLOOM_METAPAGE_BLKNO);
smgrwrite(index->rd_smgr, INIT_FORKNUM, BLOOM_METAPAGE_BLKNO,
  (char *) metapage, true);
-   if (XLogIsNeeded())
-   log_newpage(>rd_smgr->smgr_rnode.node, INIT_FORKNUM,
-   BLOOM_METAPAGE_BLKNO, metapage, false);
+   log_newpage(>rd_smgr->smgr_rnode.node, INIT_FORKNUM,
+   BLOOM_METAPAGE_BLKNO, metapage, false);
 
/*
 * An immediate sync is required even if we xlog'd the page, because the
diff --git 

Re: [HACKERS] Unlogged tables cleanup

2016-12-07 Thread Robert Haas
Michael, your greetings were passed on to me with a request that I
look at this thread.

On Fri, Nov 18, 2016 at 12:33 PM, Michael Paquier
 wrote:
>>> More seriously, if there could be more details regarding that, I would
>>> think that we could say something like "logging the init fork is
>>> mandatory in any case to ensure its on-disk presence when at recovery
>>> replay, even on non-default tablespaces whose base location are
>>> deleted and re-created from scratch if the WAL record in charge of
>>> creating this tablespace gets replayed". The problem shows up because
>>> of tablespaces being deleted at replay at the end... So perhaps this
>>> makes sense. What do you think?
>>
>> Yes, that's about what I think we need to explain.
>
> OK, what do you think about the attached then?
>
> I came up with something like that for those code paths:
> -   /* Write the page.  If archiving/streaming, XLOG it. */
> +   /*
> +* Write the page and log it unconditionally.  This is important
> +* particularly for indexes created on tablespaces and databases
> +* whose creation happened after the last redo pointer as recovery
> +* removes any of their existing content when the corresponding
> +* create records are replayed.
> +*/
> I have not been able to use the word "create" less than that. The
> patch is really repetitive, but I think that we had better mention the
> need of logging the content in each code path and not touch
> log_newpage() to keep it a maximum flexible.

In blinsert.c, nbtree.c, etc. how about:

Write the page and log it.  It might seem that an immediate sync would
be sufficient to guarantee that the file exists on disk, but recovery
itself might remove it while replaying, for example, an
XLOG_DBASE_CREATE record.  Therefore, we need this even when
wal_level=minimal.

>> Actually, I'm
>> wondering if we ought to just switch this over to using the delayChkpt
>> mechanism instead of going through this complicated song-and-dance.
>> Incurring an immediate sync to avoid having to WAL-log this is a bit
>> tenuous, but having to WAL-log it AND do the immediate sync just seems
>> silly, and I'm actually a bit worried that even with your fix there
>> might still be a latent bug somewhere.  We couldn't switch mechanisms
>> cleanly in the 9.2 branch (cf.
>> f21bb9cfb5646e1793dcc9c0ea697bab99afa523) so perhaps we should
>> back-patch it in the form you have it in already, but it's probably
>> worth switching over at least in master.
>
> Thinking through it... Could we not just rip off the sync phase
> because there is delayChkpt? It seems to me that what counts is that
> the commit of the transaction that creates the relation does not get
> past the redo point. It would matter for read uncommitted transactions
> but that concept does not exist in PG, and never will. On
> back-branches it is definitely safer to keep the sync, I am just
> thinking about a HEAD-only optimization here as you do.

Right (I think).  If we set and clear delayChkpt around this work, we
don't need the immediate sync.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unlogged tables cleanup

2016-11-18 Thread Michael Paquier
On Thu, Nov 17, 2016 at 7:06 AM, Robert Haas  wrote:
> Yeah, but surely it's obvious that if you don't WAL log it, it's not
> going to work for archiving or streaming.  It's a lot less obvious why
> you have to WAL log it when you're not doing either of those things if
> you've already written it and fsync'd it locally.  How is it going to
> disappear if it's been fsync'd, one wonders?

That's not obvious that replaying a WAL record for a database creation
or tablespace creation would cause that for sure! I didn't know that
redo was wiping them out with rmtree() at replay before looking at
this bug. One more thing to recall when fixing an issue in this area
in the future.

>> More seriously, if there could be more details regarding that, I would
>> think that we could say something like "logging the init fork is
>> mandatory in any case to ensure its on-disk presence when at recovery
>> replay, even on non-default tablespaces whose base location are
>> deleted and re-created from scratch if the WAL record in charge of
>> creating this tablespace gets replayed". The problem shows up because
>> of tablespaces being deleted at replay at the end... So perhaps this
>> makes sense. What do you think?
>
> Yes, that's about what I think we need to explain.

OK, what do you think about the attached then?

I came up with something like that for those code paths:
-   /* Write the page.  If archiving/streaming, XLOG it. */
+   /*
+* Write the page and log it unconditionally.  This is important
+* particularly for indexes created on tablespaces and databases
+* whose creation happened after the last redo pointer as recovery
+* removes any of their existing content when the corresponding
+* create records are replayed.
+*/
I have not been able to use the word "create" less than that. The
patch is really repetitive, but I think that we had better mention the
need of logging the content in each code path and not touch
log_newpage() to keep it a maximum flexible.

> Actually, I'm
> wondering if we ought to just switch this over to using the delayChkpt
> mechanism instead of going through this complicated song-and-dance.
> Incurring an immediate sync to avoid having to WAL-log this is a bit
> tenuous, but having to WAL-log it AND do the immediate sync just seems
> silly, and I'm actually a bit worried that even with your fix there
> might still be a latent bug somewhere.  We couldn't switch mechanisms
> cleanly in the 9.2 branch (cf.
> f21bb9cfb5646e1793dcc9c0ea697bab99afa523) so perhaps we should
> back-patch it in the form you have it in already, but it's probably
> worth switching over at least in master.

Thinking through it... Could we not just rip off the sync phase
because there is delayChkpt? It seems to me that what counts is that
the commit of the transaction that creates the relation does not get
past the redo point. It would matter for read uncommitted transactions
but that concept does not exist in PG, and never will. On
back-branches it is definitely safer to keep the sync, I am just
thinking about a HEAD-only optimization here as you do.
-- 
Michael


unlogged-tbspace-fix-v4.patch
Description: application/download

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unlogged tables cleanup

2016-11-17 Thread Robert Haas
On Wed, Nov 16, 2016 at 11:55 PM, Michael Paquier
 wrote:
> On Wed, Nov 16, 2016 at 7:09 PM, Robert Haas  wrote:
>> On Wed, Nov 16, 2016 at 3:54 PM, Michael Paquier
>>  wrote:
>>> Indeed I missed this comment block. Please let me suggest the following 
>>> instead:
>>>  /*
>>>   * Set up an init fork for an unlogged table so that it can be correctly
>>> - * reinitialized on restart.  Since we're going to do an immediate sync, we
>>> - * only need to xlog this if archiving or streaming is enabled.  And the
>>> - * immediate sync is required, because otherwise there's no guarantee that
>>> - * this will hit the disk before the next checkpoint moves the redo 
>>> pointer.
>>> + * reinitialized on restart.  An immediate sync is required even if the
>>> + * page has been logged, because the write did not go through
>>> + * shared_buffers and therefore a concurrent checkpoint may have moved
>>> + * the redo pointer past our xlog record.
>>>   */
>>
>> Hmm.  Well, that deletes the comment that's no longer true, but it
>> doesn't replace it with any explanation of why we also need to WAL-log
>> it unconditionally, and I think that explanation is not entirely
>> trivial?
>
> OK, the original code does not give any special reason either
> regarding why doing so is safe for archiving or streaming :)

Yeah, but surely it's obvious that if you don't WAL log it, it's not
going to work for archiving or streaming.  It's a lot less obvious why
you have to WAL log it when you're not doing either of those things if
you've already written it and fsync'd it locally.  How is it going to
disappear if it's been fsync'd, one wonders?

> More seriously, if there could be more details regarding that, I would
> think that we could say something like "logging the init fork is
> mandatory in any case to ensure its on-disk presence when at recovery
> replay, even on non-default tablespaces whose base location are
> deleted and re-created from scratch if the WAL record in charge of
> creating this tablespace gets replayed". The problem shows up because
> of tablespaces being deleted at replay at the end... So perhaps this
> makes sense. What do you think?

Yes, that's about what I think we need to explain.  Actually, I'm
wondering if we ought to just switch this over to using the delayChkpt
mechanism instead of going through this complicated song-and-dance.
Incurring an immediate sync to avoid having to WAL-log this is a bit
tenuous, but having to WAL-log it AND do the immediate sync just seems
silly, and I'm actually a bit worried that even with your fix there
might still be a latent bug somewhere.  We couldn't switch mechanisms
cleanly in the 9.2 branch (cf.
f21bb9cfb5646e1793dcc9c0ea697bab99afa523) so perhaps we should
back-patch it in the form you have it in already, but it's probably
worth switching over at least in master.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unlogged tables cleanup

2016-11-16 Thread Michael Paquier
On Wed, Nov 16, 2016 at 7:09 PM, Robert Haas  wrote:
> On Wed, Nov 16, 2016 at 3:54 PM, Michael Paquier
>  wrote:
>> Indeed I missed this comment block. Please let me suggest the following 
>> instead:
>>  /*
>>   * Set up an init fork for an unlogged table so that it can be correctly
>> - * reinitialized on restart.  Since we're going to do an immediate sync, we
>> - * only need to xlog this if archiving or streaming is enabled.  And the
>> - * immediate sync is required, because otherwise there's no guarantee that
>> - * this will hit the disk before the next checkpoint moves the redo pointer.
>> + * reinitialized on restart.  An immediate sync is required even if the
>> + * page has been logged, because the write did not go through
>> + * shared_buffers and therefore a concurrent checkpoint may have moved
>> + * the redo pointer past our xlog record.
>>   */
>
> Hmm.  Well, that deletes the comment that's no longer true, but it
> doesn't replace it with any explanation of why we also need to WAL-log
> it unconditionally, and I think that explanation is not entirely
> trivial?

OK, the original code does not give any special reason either
regarding why doing so is safe for archiving or streaming :)

More seriously, if there could be more details regarding that, I would
think that we could say something like "logging the init fork is
mandatory in any case to ensure its on-disk presence when at recovery
replay, even on non-default tablespaces whose base location are
deleted and re-created from scratch if the WAL record in charge of
creating this tablespace gets replayed". The problem shows up because
of tablespaces being deleted at replay at the end... So perhaps this
makes sense. What do you think?
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unlogged tables cleanup

2016-11-16 Thread Robert Haas
On Wed, Nov 16, 2016 at 3:54 PM, Michael Paquier
 wrote:
> On Wed, Nov 16, 2016 at 11:45 AM, Robert Haas  wrote:
>> The header comment for heap_create_init_fork() says this:
>>
>> /*
>>  * Set up an init fork for an unlogged table so that it can be correctly
>>  * reinitialized on restart.  Since we're going to do an immediate sync, we
>>  * only need to xlog this if archiving or streaming is enabled.  And the
>>  * immediate sync is required, because otherwise there's no guarantee that
>>  * this will hit the disk before the next checkpoint moves the redo pointer.
>>  */
>>
>> Your patch causes the code not to match the comment any more.  And the
>> comment explains why at the time I wrote this code I thought it was OK
>> to have the XLogIsNeeded() test in there, so it clearly needs
>> updating.
>
> Indeed I missed this comment block. Please let me suggest the following 
> instead:
>  /*
>   * Set up an init fork for an unlogged table so that it can be correctly
> - * reinitialized on restart.  Since we're going to do an immediate sync, we
> - * only need to xlog this if archiving or streaming is enabled.  And the
> - * immediate sync is required, because otherwise there's no guarantee that
> - * this will hit the disk before the next checkpoint moves the redo pointer.
> + * reinitialized on restart.  An immediate sync is required even if the
> + * page has been logged, because the write did not go through
> + * shared_buffers and therefore a concurrent checkpoint may have moved
> + * the redo pointer past our xlog record.
>   */

Hmm.  Well, that deletes the comment that's no longer true, but it
doesn't replace it with any explanation of why we also need to WAL-log
it unconditionally, and I think that explanation is not entirely
trivial?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unlogged tables cleanup

2016-11-16 Thread Michael Paquier
On Wed, Nov 16, 2016 at 11:45 AM, Robert Haas  wrote:
> The header comment for heap_create_init_fork() says this:
>
> /*
>  * Set up an init fork for an unlogged table so that it can be correctly
>  * reinitialized on restart.  Since we're going to do an immediate sync, we
>  * only need to xlog this if archiving or streaming is enabled.  And the
>  * immediate sync is required, because otherwise there's no guarantee that
>  * this will hit the disk before the next checkpoint moves the redo pointer.
>  */
>
> Your patch causes the code not to match the comment any more.  And the
> comment explains why at the time I wrote this code I thought it was OK
> to have the XLogIsNeeded() test in there, so it clearly needs
> updating.

Indeed I missed this comment block. Please let me suggest the following instead:
 /*
  * Set up an init fork for an unlogged table so that it can be correctly
- * reinitialized on restart.  Since we're going to do an immediate sync, we
- * only need to xlog this if archiving or streaming is enabled.  And the
- * immediate sync is required, because otherwise there's no guarantee that
- * this will hit the disk before the next checkpoint moves the redo pointer.
+ * reinitialized on restart.  An immediate sync is required even if the
+ * page has been logged, because the write did not go through
+ * shared_buffers and therefore a concurrent checkpoint may have moved
+ * the redo pointer past our xlog record.
  */
-- 
Michael
diff --git a/contrib/bloom/blinsert.c b/contrib/bloom/blinsert.c
index 0946aa2..c9c7049 100644
--- a/contrib/bloom/blinsert.c
+++ b/contrib/bloom/blinsert.c
@@ -164,13 +164,12 @@ blbuildempty(Relation index)
 	metapage = (Page) palloc(BLCKSZ);
 	BloomFillMetapage(index, metapage);
 
-	/* Write the page.  If archiving/streaming, XLOG it. */
+	/* Write the page and log it. */
 	PageSetChecksumInplace(metapage, BLOOM_METAPAGE_BLKNO);
 	smgrwrite(index->rd_smgr, INIT_FORKNUM, BLOOM_METAPAGE_BLKNO,
 			  (char *) metapage, true);
-	if (XLogIsNeeded())
-		log_newpage(>rd_smgr->smgr_rnode.node, INIT_FORKNUM,
-	BLOOM_METAPAGE_BLKNO, metapage, false);
+	log_newpage(>rd_smgr->smgr_rnode.node, INIT_FORKNUM,
+BLOOM_METAPAGE_BLKNO, metapage, false);
 
 	/*
 	 * An immediate sync is required even if we xlog'd the page, because the
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 128744c..624aa84 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -242,13 +242,12 @@ btbuildempty(Relation index)
 	metapage = (Page) palloc(BLCKSZ);
 	_bt_initmetapage(metapage, P_NONE, 0);
 
-	/* Write the page.  If archiving/streaming, XLOG it. */
+	/* Write the page and log it */
 	PageSetChecksumInplace(metapage, BTREE_METAPAGE);
 	smgrwrite(index->rd_smgr, INIT_FORKNUM, BTREE_METAPAGE,
 			  (char *) metapage, true);
-	if (XLogIsNeeded())
-		log_newpage(>rd_smgr->smgr_rnode.node, INIT_FORKNUM,
-	BTREE_METAPAGE, metapage, false);
+	log_newpage(>rd_smgr->smgr_rnode.node, INIT_FORKNUM,
+BTREE_METAPAGE, metapage, false);
 
 	/*
 	 * An immediate sync is required even if we xlog'd the page, because the
diff --git a/src/backend/access/spgist/spginsert.c b/src/backend/access/spgist/spginsert.c
index 01c8d21..8ac3b00 100644
--- a/src/backend/access/spgist/spginsert.c
+++ b/src/backend/access/spgist/spginsert.c
@@ -161,13 +161,12 @@ spgbuildempty(Relation index)
 	page = (Page) palloc(BLCKSZ);
 	SpGistInitMetapage(page);
 
-	/* Write the page.  If archiving/streaming, XLOG it. */
+	/* Write the page and log it. */
 	PageSetChecksumInplace(page, SPGIST_METAPAGE_BLKNO);
 	smgrwrite(index->rd_smgr, INIT_FORKNUM, SPGIST_METAPAGE_BLKNO,
 			  (char *) page, true);
-	if (XLogIsNeeded())
-		log_newpage(>rd_smgr->smgr_rnode.node, INIT_FORKNUM,
-	SPGIST_METAPAGE_BLKNO, page, false);
+	log_newpage(>rd_smgr->smgr_rnode.node, INIT_FORKNUM,
+SPGIST_METAPAGE_BLKNO, page, false);
 
 	/* Likewise for the root page. */
 	SpGistInitPage(page, SPGIST_LEAF);
@@ -175,9 +174,8 @@ spgbuildempty(Relation index)
 	PageSetChecksumInplace(page, SPGIST_ROOT_BLKNO);
 	smgrwrite(index->rd_smgr, INIT_FORKNUM, SPGIST_ROOT_BLKNO,
 			  (char *) page, true);
-	if (XLogIsNeeded())
-		log_newpage(>rd_smgr->smgr_rnode.node, INIT_FORKNUM,
-	SPGIST_ROOT_BLKNO, page, true);
+	log_newpage(>rd_smgr->smgr_rnode.node, INIT_FORKNUM,
+SPGIST_ROOT_BLKNO, page, true);
 
 	/* Likewise for the null-tuples root page. */
 	SpGistInitPage(page, SPGIST_LEAF | SPGIST_NULLS);
@@ -185,9 +183,8 @@ spgbuildempty(Relation index)
 	PageSetChecksumInplace(page, SPGIST_NULL_BLKNO);
 	smgrwrite(index->rd_smgr, INIT_FORKNUM, SPGIST_NULL_BLKNO,
 			  (char *) page, true);
-	if (XLogIsNeeded())
-		log_newpage(>rd_smgr->smgr_rnode.node, INIT_FORKNUM,
-	SPGIST_NULL_BLKNO, page, true);
+	log_newpage(>rd_smgr->smgr_rnode.node, INIT_FORKNUM,
+SPGIST_NULL_BLKNO, page, true);
 
 	/*
 	 * An immediate sync is required even if we xlog'd 

Re: [HACKERS] Unlogged tables cleanup

2016-11-16 Thread Robert Haas
On Thu, Nov 10, 2016 at 9:25 PM, Michael Paquier
 wrote:
> On Thu, Nov 10, 2016 at 10:52 PM, Kuntal Ghosh
>  wrote:
>> On Thu, Nov 10, 2016 at 3:42 PM, Michael Paquier
>>  wrote:
>>> Nah. Looking at the code the fix is quite obvious.
>>> heap_create_init_fork() is checking for XLogIsNeeded() to decide if
>>> the INIT forknum should be logged or not. But this is wrong, it needs
>>> to be done unconditionally to ensure that the relation gets created at
>>> replay.
>>
>> I think that we should also update other *buildempty() functions as well.
>> For example, if the table has a primary key, we'll encounter the error again
>> for btree index.
>
> Good point. I just had a look at all the AMs: bloom, btree and SP-gist
> are plainly broken. The others are fine. Attached is an updated patch.
>
> Here are the tests I have done with wal_level = minimal to test all the AMs:
> \! rm -rf /Users/mpaquier/Desktop/tbsp/PG*
> create extension bloom;
> create extension btree_gist;
> create extension btree_gin;
> create tablespace popo location '/Users/mpaquier/Desktop/tbsp';
> set default_tablespace = popo;
> create unlogged table foo (a int);
> insert into foo values (generate_series(1,1));
> create index foo_bt on foo(a);
> create index foo_bloom on foo using bloom(a);
> create index foo_gin on foo using gin (a);
> create index foo_gist on foo using gist (a);
> create index foo_brin on foo using brin (a);
> create unlogged table foo_geo (a box);
> insert into foo_geo values ('(1,2),(2,3)');
> create index foo_spgist ON foo_geo using spgist(a);
>
> -- crash PG
> -- Insert some data
> insert into foo values (100);
> insert into foo_geo values ('(50,12),(312,3)');
>
> This should create 8 INIT forks, 6 for the indexes and 2 for the
> tables. On HEAD, only 3 are getting created and with the patch all of
> them are.

The header comment for heap_create_init_fork() says this:

/*
 * Set up an init fork for an unlogged table so that it can be correctly
 * reinitialized on restart.  Since we're going to do an immediate sync, we
 * only need to xlog this if archiving or streaming is enabled.  And the
 * immediate sync is required, because otherwise there's no guarantee that
 * this will hit the disk before the next checkpoint moves the redo pointer.
 */

Your patch causes the code not to match the comment any more.  And the
comment explains why at the time I wrote this code I thought it was OK
to have the XLogIsNeeded() test in there, so it clearly needs
updating.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unlogged tables cleanup

2016-11-10 Thread Michael Paquier
On Thu, Nov 10, 2016 at 10:52 PM, Kuntal Ghosh
 wrote:
> On Thu, Nov 10, 2016 at 3:42 PM, Michael Paquier
>  wrote:
>> Nah. Looking at the code the fix is quite obvious.
>> heap_create_init_fork() is checking for XLogIsNeeded() to decide if
>> the INIT forknum should be logged or not. But this is wrong, it needs
>> to be done unconditionally to ensure that the relation gets created at
>> replay.
>
> I think that we should also update other *buildempty() functions as well.
> For example, if the table has a primary key, we'll encounter the error again
> for btree index.

Good point. I just had a look at all the AMs: bloom, btree and SP-gist
are plainly broken. The others are fine. Attached is an updated patch.

Here are the tests I have done with wal_level = minimal to test all the AMs:
\! rm -rf /Users/mpaquier/Desktop/tbsp/PG*
create extension bloom;
create extension btree_gist;
create extension btree_gin;
create tablespace popo location '/Users/mpaquier/Desktop/tbsp';
set default_tablespace = popo;
create unlogged table foo (a int);
insert into foo values (generate_series(1,1));
create index foo_bt on foo(a);
create index foo_bloom on foo using bloom(a);
create index foo_gin on foo using gin (a);
create index foo_gist on foo using gist (a);
create index foo_brin on foo using brin (a);
create unlogged table foo_geo (a box);
insert into foo_geo values ('(1,2),(2,3)');
create index foo_spgist ON foo_geo using spgist(a);

-- crash PG
-- Insert some data
insert into foo values (100);
insert into foo_geo values ('(50,12),(312,3)');

This should create 8 INIT forks, 6 for the indexes and 2 for the
tables. On HEAD, only 3 are getting created and with the patch all of
them are.
-- 
Michael
diff --git a/contrib/bloom/blinsert.c b/contrib/bloom/blinsert.c
index 0946aa2..c9c7049 100644
--- a/contrib/bloom/blinsert.c
+++ b/contrib/bloom/blinsert.c
@@ -164,13 +164,12 @@ blbuildempty(Relation index)
 	metapage = (Page) palloc(BLCKSZ);
 	BloomFillMetapage(index, metapage);
 
-	/* Write the page.  If archiving/streaming, XLOG it. */
+	/* Write the page and log it. */
 	PageSetChecksumInplace(metapage, BLOOM_METAPAGE_BLKNO);
 	smgrwrite(index->rd_smgr, INIT_FORKNUM, BLOOM_METAPAGE_BLKNO,
 			  (char *) metapage, true);
-	if (XLogIsNeeded())
-		log_newpage(>rd_smgr->smgr_rnode.node, INIT_FORKNUM,
-	BLOOM_METAPAGE_BLKNO, metapage, false);
+	log_newpage(>rd_smgr->smgr_rnode.node, INIT_FORKNUM,
+BLOOM_METAPAGE_BLKNO, metapage, false);
 
 	/*
 	 * An immediate sync is required even if we xlog'd the page, because the
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 128744c..624aa84 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -242,13 +242,12 @@ btbuildempty(Relation index)
 	metapage = (Page) palloc(BLCKSZ);
 	_bt_initmetapage(metapage, P_NONE, 0);
 
-	/* Write the page.  If archiving/streaming, XLOG it. */
+	/* Write the page and log it */
 	PageSetChecksumInplace(metapage, BTREE_METAPAGE);
 	smgrwrite(index->rd_smgr, INIT_FORKNUM, BTREE_METAPAGE,
 			  (char *) metapage, true);
-	if (XLogIsNeeded())
-		log_newpage(>rd_smgr->smgr_rnode.node, INIT_FORKNUM,
-	BTREE_METAPAGE, metapage, false);
+	log_newpage(>rd_smgr->smgr_rnode.node, INIT_FORKNUM,
+BTREE_METAPAGE, metapage, false);
 
 	/*
 	 * An immediate sync is required even if we xlog'd the page, because the
diff --git a/src/backend/access/spgist/spginsert.c b/src/backend/access/spgist/spginsert.c
index 01c8d21..8ac3b00 100644
--- a/src/backend/access/spgist/spginsert.c
+++ b/src/backend/access/spgist/spginsert.c
@@ -161,13 +161,12 @@ spgbuildempty(Relation index)
 	page = (Page) palloc(BLCKSZ);
 	SpGistInitMetapage(page);
 
-	/* Write the page.  If archiving/streaming, XLOG it. */
+	/* Write the page and log it. */
 	PageSetChecksumInplace(page, SPGIST_METAPAGE_BLKNO);
 	smgrwrite(index->rd_smgr, INIT_FORKNUM, SPGIST_METAPAGE_BLKNO,
 			  (char *) page, true);
-	if (XLogIsNeeded())
-		log_newpage(>rd_smgr->smgr_rnode.node, INIT_FORKNUM,
-	SPGIST_METAPAGE_BLKNO, page, false);
+	log_newpage(>rd_smgr->smgr_rnode.node, INIT_FORKNUM,
+SPGIST_METAPAGE_BLKNO, page, false);
 
 	/* Likewise for the root page. */
 	SpGistInitPage(page, SPGIST_LEAF);
@@ -175,9 +174,8 @@ spgbuildempty(Relation index)
 	PageSetChecksumInplace(page, SPGIST_ROOT_BLKNO);
 	smgrwrite(index->rd_smgr, INIT_FORKNUM, SPGIST_ROOT_BLKNO,
 			  (char *) page, true);
-	if (XLogIsNeeded())
-		log_newpage(>rd_smgr->smgr_rnode.node, INIT_FORKNUM,
-	SPGIST_ROOT_BLKNO, page, true);
+	log_newpage(>rd_smgr->smgr_rnode.node, INIT_FORKNUM,
+SPGIST_ROOT_BLKNO, page, true);
 
 	/* Likewise for the null-tuples root page. */
 	SpGistInitPage(page, SPGIST_LEAF | SPGIST_NULLS);
@@ -185,9 +183,8 @@ spgbuildempty(Relation index)
 	PageSetChecksumInplace(page, SPGIST_NULL_BLKNO);
 	smgrwrite(index->rd_smgr, INIT_FORKNUM, SPGIST_NULL_BLKNO,
 			  (char 

Re: [HACKERS] Unlogged tables cleanup

2016-11-10 Thread Kuntal Ghosh
On Thu, Nov 10, 2016 at 3:42 PM, Michael Paquier
 wrote:
> On Thu, Nov 10, 2016 at 5:15 PM, Michael Paquier
>  wrote:
>> Okay, so what happens is that the CREATE TABLESPACE record removes the
>> tablespace directory and recreates a fresh one, but as no CREATE
>> records are created for unlogged tables the init fork is not
>> re-created. It seems to me that we should log a record to recreate the
>> INIT fork, and that heap_create_with_catalog() is missing something.
>> Generating a record in RelationCreateStorage() is the more direct way,
>> and that actually fixes the issue. Now the comments at the top of it
>> mention that RelationCreateStorage() should only be used to create the
>> MAIN forknum. So, wouldn't a correct fix be to log this INIT record at
>> the end of heap_create()?
>
> Nah. Looking at the code the fix is quite obvious.
> heap_create_init_fork() is checking for XLogIsNeeded() to decide if
> the INIT forknum should be logged or not. But this is wrong, it needs
> to be done unconditionally to ensure that the relation gets created at
> replay.
I think that we should also update other *buildempty() functions as well.
For example, if the table has a primary key, we'll encounter the error again
for btree index.


-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unlogged tables cleanup

2016-11-10 Thread Michael Paquier
On Thu, Nov 10, 2016 at 5:15 PM, Michael Paquier
 wrote:
> Okay, so what happens is that the CREATE TABLESPACE record removes the
> tablespace directory and recreates a fresh one, but as no CREATE
> records are created for unlogged tables the init fork is not
> re-created. It seems to me that we should log a record to recreate the
> INIT fork, and that heap_create_with_catalog() is missing something.
> Generating a record in RelationCreateStorage() is the more direct way,
> and that actually fixes the issue. Now the comments at the top of it
> mention that RelationCreateStorage() should only be used to create the
> MAIN forknum. So, wouldn't a correct fix be to log this INIT record at
> the end of heap_create()?

Nah. Looking at the code the fix is quite obvious.
heap_create_init_fork() is checking for XLogIsNeeded() to decide if
the INIT forknum should be logged or not. But this is wrong, it needs
to be done unconditionally to ensure that the relation gets created at
replay.
-- 
Michael
diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c
index 0cf7b9e..2497a1e 100644
--- a/src/backend/catalog/heap.c
+++ b/src/backend/catalog/heap.c
@@ -1380,8 +1380,7 @@ heap_create_init_fork(Relation rel)
 {
RelationOpenSmgr(rel);
smgrcreate(rel->rd_smgr, INIT_FORKNUM, false);
-   if (XLogIsNeeded())
-   log_smgrcreate(>rd_smgr->smgr_rnode.node, INIT_FORKNUM);
+   log_smgrcreate(>rd_smgr->smgr_rnode.node, INIT_FORKNUM);
smgrimmedsync(rel->rd_smgr, INIT_FORKNUM);
 }
 

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unlogged tables cleanup

2016-11-10 Thread Michael Paquier
On Thu, Nov 10, 2016 at 4:33 PM, Michael Paquier
 wrote:
> On Thu, Nov 10, 2016 at 4:23 PM, konstantin knizhnik
>  wrote:
>> No, it is latest sources from Postgres repository.
>> Please notice that you should create new database and tablespace to 
>> reproduce this issue.
>> So actually the whole sequence is
>>
>> mkdir fs
>> initdb -D pgsql
>> pg_ctl -D pgsql -l logfile start
>> psql postgres
>> # create tablespace fs location '/home/knizhnik/dtm-data/fs';
>> # set default_tablespace=fs;
>> # create unlogged table foo(x integer);
>> # insert into foo values(generate_series(1,10));
>> # ^D
>> pkill -9 postgres
>> pg_ctl -D pgsql -l logfile start
>> # select * from foo;
>
> OK, I understood what I was missing. This can be reproduced with
> wal_level = minimal. When using hot_standby things are working
> properly.

Okay, so what happens is that the CREATE TABLESPACE record removes the
tablespace directory and recreates a fresh one, but as no CREATE
records are created for unlogged tables the init fork is not
re-created. It seems to me that we should log a record to recreate the
INIT fork, and that heap_create_with_catalog() is missing something.
Generating a record in RelationCreateStorage() is the more direct way,
and that actually fixes the issue. Now the comments at the top of it
mention that RelationCreateStorage() should only be used to create the
MAIN forknum. So, wouldn't a correct fix be to log this INIT record at
the end of heap_create()?
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unlogged tables cleanup

2016-11-09 Thread Michael Paquier
On Thu, Nov 10, 2016 at 4:23 PM, konstantin knizhnik
 wrote:
> No, it is latest sources from Postgres repository.
> Please notice that you should create new database and tablespace to reproduce 
> this issue.
> So actually the whole sequence is
>
> mkdir fs
> initdb -D pgsql
> pg_ctl -D pgsql -l logfile start
> psql postgres
> # create tablespace fs location '/home/knizhnik/dtm-data/fs';
> # set default_tablespace=fs;
> # create unlogged table foo(x integer);
> # insert into foo values(generate_series(1,10));
> # ^D
> pkill -9 postgres
> pg_ctl -D pgsql -l logfile start
> # select * from foo;

OK, I understood what I was missing. This can be reproduced with
wal_level = minimal. When using hot_standby things are working
properly.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unlogged tables cleanup

2016-11-09 Thread konstantin knizhnik

On Nov 10, 2016, at 10:17 AM, Michael Paquier wrote:

> 
> Hm.. I cannot reproduce what you see on Linux or macos. Perhaps you
> have locally a standby pointing as well to this tablespace?

No, it is latest sources from Postgres repository.
Please notice that you should create new database and tablespace to reproduce 
this issue.
So actually the whole sequence is

mkdir fs
initdb -D pgsql 
pg_ctl -D pgsql -l logfile start
psql postgres
# create tablespace fs location '/home/knizhnik/dtm-data/fs';
# set default_tablespace=fs;
# create unlogged table foo(x integer);
# insert into foo values(generate_series(1,10));
# ^D
pkill -9 postgres
pg_ctl -D pgsql -l logfile start
# select * from foo;


> -- 
> Michael



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unlogged tables cleanup

2016-11-09 Thread Michael Paquier
On Thu, Nov 10, 2016 at 3:29 AM, Robert Haas  wrote:
> On Wed, Nov 9, 2016 at 11:56 AM, Konstantin Knizhnik
>  wrote:
>> Now simulate server crash using using "pkill -9 postgres".
>>
>> knizhnik@knizhnik:~/dtm-data$ rm -f logfile ; pg_ctl -D pgsql.master -l
>> logfile start
>> pg_ctl: another server might be running; trying to start server anyway
>> server starting
>> knizhnik@knizhnik:~/dtm-data$ psql postgres
>> psql (10devel)
>> Type "help" for help.
>>
>> postgres=# select * from foo;
>> ERROR:  could not open file "pg_tblspc/16384/PG_10_201611041/12289/16385":
>> No such file or directory
>>
>> knizhnik@knizhnik:~/dtm-data$ ls fs
>> PG_10_201611041
>> knizhnik@knizhnik:~/dtm-data$ ls fs/PG_10_201611041/
>>
>> So all relation directory is removed!
>> It happens only for first table created in tablespace.
>> If you create table in Postgres data directory everything is ok: first
>> segment of relation is truncated but not deleted.
>
> Whoa.  There should be an _init fork that doesn't get removed, and
> without removing the _init fork you shouldn't be able to remove the
> directory that contains it.

Hm.. I cannot reproduce what you see on Linux or macos. Perhaps you
have locally a standby pointing as well to this tablespace?
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unlogged tables cleanup

2016-11-09 Thread Robert Haas
On Wed, Nov 9, 2016 at 11:56 AM, Konstantin Knizhnik
 wrote:
> Now simulate server crash using using "pkill -9 postgres".
>
> knizhnik@knizhnik:~/dtm-data$ rm -f logfile ; pg_ctl -D pgsql.master -l
> logfile start
> pg_ctl: another server might be running; trying to start server anyway
> server starting
> knizhnik@knizhnik:~/dtm-data$ psql postgres
> psql (10devel)
> Type "help" for help.
>
> postgres=# select * from foo;
> ERROR:  could not open file "pg_tblspc/16384/PG_10_201611041/12289/16385":
> No such file or directory
>
> knizhnik@knizhnik:~/dtm-data$ ls fs
> PG_10_201611041
> knizhnik@knizhnik:~/dtm-data$ ls fs/PG_10_201611041/
>
> So all relation directory is removed!
> It happens only for first table created in tablespace.
> If you create table in Postgres data directory everything is ok: first
> segment of relation is truncated but not deleted.

Whoa.  There should be an _init fork that doesn't get removed, and
without removing the _init fork you shouldn't be able to remove the
directory that contains it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Unlogged tables cleanup

2016-11-09 Thread Konstantin Knizhnik

Hi, hackers

I wonder if such behavior can be considered as a bug:

knizhnik@knizhnik:~/dtm-data$ psql postgres
psql (10devel)
Type "help" for help.

postgres=# create tablespace fs location '/home/knizhnik/dtm-data/fs';
CREATE TABLESPACE
postgres=# set default_tablespace=fs;
SET
postgres=# create unlogged table foo(x integer);
CREATE TABLE
postgres=# insert into foo values(generate_series(1,10));
INSERT 0 10



Now simulate server crash using using "pkill -9 postgres".

knizhnik@knizhnik:~/dtm-data$ rm -f logfile ; pg_ctl -D pgsql.master -l 
logfile start

pg_ctl: another server might be running; trying to start server anyway
server starting
knizhnik@knizhnik:~/dtm-data$ psql postgres
psql (10devel)
Type "help" for help.

postgres=# select * from foo;
ERROR:  could not open file 
"pg_tblspc/16384/PG_10_201611041/12289/16385": No such file or directory


knizhnik@knizhnik:~/dtm-data$ ls fs
PG_10_201611041
knizhnik@knizhnik:~/dtm-data$ ls fs/PG_10_201611041/


So all relation directory is removed!
It happens only for first table created in tablespace.
If you create table in Postgres data directory everything is ok: first 
segment of relation is truncated but not deleted.
Also if you create one more unlogged table in tablespace it is truncated 
correctly:


postgres=# set default_tablespace=fs;
SET
postgres=# create unlogged table foo1(x integer);
CREATE TABLE
postgres=# insert into foo1 values(generate_series(1,10));
INSERT 0 10
postgres=# \q
knizhnik@knizhnik:~/dtm-data$ pkill -9 postgres
knizhnik@knizhnik:~/dtm-data$ rm -f logfile ; pg_ctl -D pgsql.master -l 
logfile start

pg_ctl: another server might be running; trying to start server anyway
server starting
knizhnik@knizhnik:~/dtm-data$ psql postgres
psql (10devel)
Type "help" for help.

postgres=# select * from foo1;
 x
---
(0 rows)

knizhnik@knizhnik:~/dtm-data$ ls -l fs/PG_10_201611041/12289/*
-rw--- 1 knizhnik knizhnik 0 Nov  9 19:52 fs/PG_10_201611041/12289/32768
-rw--- 1 knizhnik knizhnik 0 Nov  9 19:52 
fs/PG_10_201611041/12289/32768_init



--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers