Re: [HACKERS] Unlogged tables cleanup

2016-12-08 Thread Michael Paquier
On Fri, Dec 9, 2016 at 4:54 AM, Robert Haas  wrote:
> On Wed, Dec 7, 2016 at 11:20 PM, Michael Paquier
>  wrote:
>> OK, I rewrote a bit the patch as attached. What do you think?
>
> Committed and back-patched all the way back to 9.2.

Thanks!

>>> Right (I think).  If we set and clear delayChkpt around this work, we
>>> don't need the immediate sync.
>>
>> My point is a bit different than what you mean I think: the
>> transaction creating an unlogged relfilenode would not need to even
>> set delayChkpt in the empty() routines because other transactions
>> would not refer to it until this transaction has committed. So I am
>> arguing about just removing the sync phase.
>
> That doesn't sound right; see the comment for heap_create_init_fork.
> Suppose the transaction creating the unlogged table commits, a
> checkpoint happens, and then the operating system crashes.  Without
> the immediate sync, the operating system crash can cause the un-sync'd
> file to crash, and because of the checkpoint the WAL record that
> creates it isn't replayed either.  So the file's just gone.

Doh. That would have made sense if the checkpoint was actually
flushing the page if it was in shared buffers. But that's not the
case.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unlogged tables cleanup

2016-12-08 Thread Robert Haas
On Wed, Dec 7, 2016 at 11:20 PM, Michael Paquier
 wrote:
> OK, I rewrote a bit the patch as attached. What do you think?

Committed and back-patched all the way back to 9.2.

>> Right (I think).  If we set and clear delayChkpt around this work, we
>> don't need the immediate sync.
>
> My point is a bit different than what you mean I think: the
> transaction creating an unlogged relfilenode would not need to even
> set delayChkpt in the empty() routines because other transactions
> would not refer to it until this transaction has committed. So I am
> arguing about just removing the sync phase.

That doesn't sound right; see the comment for heap_create_init_fork.
Suppose the transaction creating the unlogged table commits, a
checkpoint happens, and then the operating system crashes.  Without
the immediate sync, the operating system crash can cause the un-sync'd
file to crash, and because of the checkpoint the WAL record that
creates it isn't replayed either.  So the file's just gone.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unlogged tables cleanup

2016-12-07 Thread Michael Paquier
On Thu, Dec 8, 2016 at 6:03 AM, Robert Haas  wrote:
> Michael, your greetings were passed on to me with a request that I
> look at this thread.

Thanks for showing up!

> On Fri, Nov 18, 2016 at 12:33 PM, Michael Paquier
>  wrote:
 More seriously, if there could be more details regarding that, I would
 think that we could say something like "logging the init fork is
 mandatory in any case to ensure its on-disk presence when at recovery
 replay, even on non-default tablespaces whose base location are
 deleted and re-created from scratch if the WAL record in charge of
 creating this tablespace gets replayed". The problem shows up because
 of tablespaces being deleted at replay at the end... So perhaps this
 makes sense. What do you think?
>>>
>>> Yes, that's about what I think we need to explain.
>>
>> OK, what do you think about the attached then?
>>
>> I came up with something like that for those code paths:
>> -   /* Write the page.  If archiving/streaming, XLOG it. */
>> +   /*
>> +* Write the page and log it unconditionally.  This is important
>> +* particularly for indexes created on tablespaces and databases
>> +* whose creation happened after the last redo pointer as recovery
>> +* removes any of their existing content when the corresponding
>> +* create records are replayed.
>> +*/
>> I have not been able to use the word "create" less than that. The
>> patch is really repetitive, but I think that we had better mention the
>> need of logging the content in each code path and not touch
>> log_newpage() to keep it a maximum flexible.
>
> In blinsert.c, nbtree.c, etc. how about:
>
> Write the page and log it.  It might seem that an immediate sync would
> be sufficient to guarantee that the file exists on disk, but recovery
> itself might remove it while replaying, for example, an
> XLOG_DBASE_CREATE record.  Therefore, we need this even when
> wal_level=minimal.

OK, I rewrote a bit the patch as attached. What do you think?

>>> Actually, I'm
>>> wondering if we ought to just switch this over to using the delayChkpt
>>> mechanism instead of going through this complicated song-and-dance.
>>> Incurring an immediate sync to avoid having to WAL-log this is a bit
>>> tenuous, but having to WAL-log it AND do the immediate sync just seems
>>> silly, and I'm actually a bit worried that even with your fix there
>>> might still be a latent bug somewhere.  We couldn't switch mechanisms
>>> cleanly in the 9.2 branch (cf.
>>> f21bb9cfb5646e1793dcc9c0ea697bab99afa523) so perhaps we should
>>> back-patch it in the form you have it in already, but it's probably
>>> worth switching over at least in master.
>>
>> Thinking through it... Could we not just rip off the sync phase
>> because there is delayChkpt? It seems to me that what counts is that
>> the commit of the transaction that creates the relation does not get
>> past the redo point. It would matter for read uncommitted transactions
>> but that concept does not exist in PG, and never will. On
>> back-branches it is definitely safer to keep the sync, I am just
>> thinking about a HEAD-only optimization here as you do.
>
> Right (I think).  If we set and clear delayChkpt around this work, we
> don't need the immediate sync.

My point is a bit different than what you mean I think: the
transaction creating an unlogged relfilenode would not need to even
set delayChkpt in the empty() routines because other transactions
would not refer to it until this transaction has committed. So I am
arguing about just removing the sync phase.
-- 
Michael
diff --git a/contrib/bloom/blinsert.c b/contrib/bloom/blinsert.c
index 0946aa29ec..a31149f044 100644
--- a/contrib/bloom/blinsert.c
+++ b/contrib/bloom/blinsert.c
@@ -164,13 +164,18 @@ blbuildempty(Relation index)
metapage = (Page) palloc(BLCKSZ);
BloomFillMetapage(index, metapage);
 
-   /* Write the page.  If archiving/streaming, XLOG it. */
+   /*
+* Write the page and log it.  It might seem that an immediate sync
+* would be sufficient to guarantee that the file exists on disk, but
+* recovery itself might remove it while replaying, for example, an
+* XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE record.  Therefore, we
+* need this even when wal_level=minimal.
+*/
PageSetChecksumInplace(metapage, BLOOM_METAPAGE_BLKNO);
smgrwrite(index->rd_smgr, INIT_FORKNUM, BLOOM_METAPAGE_BLKNO,
  (char *) metapage, true);
-   if (XLogIsNeeded())
-   log_newpage(>rd_smgr->smgr_rnode.node, INIT_FORKNUM,
-   BLOOM_METAPAGE_BLKNO, metapage, false);
+   log_newpage(>rd_smgr->smgr_rnode.node, INIT_FORKNUM,
+   BLOOM_METAPAGE_BLKNO, metapage, false);
 
/*
 * An immediate sync is required even if we xlog'd the page, because the
diff --git 

Re: [HACKERS] Unlogged tables cleanup

2016-12-07 Thread Robert Haas
Michael, your greetings were passed on to me with a request that I
look at this thread.

On Fri, Nov 18, 2016 at 12:33 PM, Michael Paquier
 wrote:
>>> More seriously, if there could be more details regarding that, I would
>>> think that we could say something like "logging the init fork is
>>> mandatory in any case to ensure its on-disk presence when at recovery
>>> replay, even on non-default tablespaces whose base location are
>>> deleted and re-created from scratch if the WAL record in charge of
>>> creating this tablespace gets replayed". The problem shows up because
>>> of tablespaces being deleted at replay at the end... So perhaps this
>>> makes sense. What do you think?
>>
>> Yes, that's about what I think we need to explain.
>
> OK, what do you think about the attached then?
>
> I came up with something like that for those code paths:
> -   /* Write the page.  If archiving/streaming, XLOG it. */
> +   /*
> +* Write the page and log it unconditionally.  This is important
> +* particularly for indexes created on tablespaces and databases
> +* whose creation happened after the last redo pointer as recovery
> +* removes any of their existing content when the corresponding
> +* create records are replayed.
> +*/
> I have not been able to use the word "create" less than that. The
> patch is really repetitive, but I think that we had better mention the
> need of logging the content in each code path and not touch
> log_newpage() to keep it a maximum flexible.

In blinsert.c, nbtree.c, etc. how about:

Write the page and log it.  It might seem that an immediate sync would
be sufficient to guarantee that the file exists on disk, but recovery
itself might remove it while replaying, for example, an
XLOG_DBASE_CREATE record.  Therefore, we need this even when
wal_level=minimal.

>> Actually, I'm
>> wondering if we ought to just switch this over to using the delayChkpt
>> mechanism instead of going through this complicated song-and-dance.
>> Incurring an immediate sync to avoid having to WAL-log this is a bit
>> tenuous, but having to WAL-log it AND do the immediate sync just seems
>> silly, and I'm actually a bit worried that even with your fix there
>> might still be a latent bug somewhere.  We couldn't switch mechanisms
>> cleanly in the 9.2 branch (cf.
>> f21bb9cfb5646e1793dcc9c0ea697bab99afa523) so perhaps we should
>> back-patch it in the form you have it in already, but it's probably
>> worth switching over at least in master.
>
> Thinking through it... Could we not just rip off the sync phase
> because there is delayChkpt? It seems to me that what counts is that
> the commit of the transaction that creates the relation does not get
> past the redo point. It would matter for read uncommitted transactions
> but that concept does not exist in PG, and never will. On
> back-branches it is definitely safer to keep the sync, I am just
> thinking about a HEAD-only optimization here as you do.

Right (I think).  If we set and clear delayChkpt around this work, we
don't need the immediate sync.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unlogged tables cleanup

2016-11-18 Thread Michael Paquier
On Thu, Nov 17, 2016 at 7:06 AM, Robert Haas  wrote:
> Yeah, but surely it's obvious that if you don't WAL log it, it's not
> going to work for archiving or streaming.  It's a lot less obvious why
> you have to WAL log it when you're not doing either of those things if
> you've already written it and fsync'd it locally.  How is it going to
> disappear if it's been fsync'd, one wonders?

That's not obvious that replaying a WAL record for a database creation
or tablespace creation would cause that for sure! I didn't know that
redo was wiping them out with rmtree() at replay before looking at
this bug. One more thing to recall when fixing an issue in this area
in the future.

>> More seriously, if there could be more details regarding that, I would
>> think that we could say something like "logging the init fork is
>> mandatory in any case to ensure its on-disk presence when at recovery
>> replay, even on non-default tablespaces whose base location are
>> deleted and re-created from scratch if the WAL record in charge of
>> creating this tablespace gets replayed". The problem shows up because
>> of tablespaces being deleted at replay at the end... So perhaps this
>> makes sense. What do you think?
>
> Yes, that's about what I think we need to explain.

OK, what do you think about the attached then?

I came up with something like that for those code paths:
-   /* Write the page.  If archiving/streaming, XLOG it. */
+   /*
+* Write the page and log it unconditionally.  This is important
+* particularly for indexes created on tablespaces and databases
+* whose creation happened after the last redo pointer as recovery
+* removes any of their existing content when the corresponding
+* create records are replayed.
+*/
I have not been able to use the word "create" less than that. The
patch is really repetitive, but I think that we had better mention the
need of logging the content in each code path and not touch
log_newpage() to keep it a maximum flexible.

> Actually, I'm
> wondering if we ought to just switch this over to using the delayChkpt
> mechanism instead of going through this complicated song-and-dance.
> Incurring an immediate sync to avoid having to WAL-log this is a bit
> tenuous, but having to WAL-log it AND do the immediate sync just seems
> silly, and I'm actually a bit worried that even with your fix there
> might still be a latent bug somewhere.  We couldn't switch mechanisms
> cleanly in the 9.2 branch (cf.
> f21bb9cfb5646e1793dcc9c0ea697bab99afa523) so perhaps we should
> back-patch it in the form you have it in already, but it's probably
> worth switching over at least in master.

Thinking through it... Could we not just rip off the sync phase
because there is delayChkpt? It seems to me that what counts is that
the commit of the transaction that creates the relation does not get
past the redo point. It would matter for read uncommitted transactions
but that concept does not exist in PG, and never will. On
back-branches it is definitely safer to keep the sync, I am just
thinking about a HEAD-only optimization here as you do.
-- 
Michael


unlogged-tbspace-fix-v4.patch
Description: application/download

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unlogged tables cleanup

2016-11-17 Thread Robert Haas
On Wed, Nov 16, 2016 at 11:55 PM, Michael Paquier
 wrote:
> On Wed, Nov 16, 2016 at 7:09 PM, Robert Haas  wrote:
>> On Wed, Nov 16, 2016 at 3:54 PM, Michael Paquier
>>  wrote:
>>> Indeed I missed this comment block. Please let me suggest the following 
>>> instead:
>>>  /*
>>>   * Set up an init fork for an unlogged table so that it can be correctly
>>> - * reinitialized on restart.  Since we're going to do an immediate sync, we
>>> - * only need to xlog this if archiving or streaming is enabled.  And the
>>> - * immediate sync is required, because otherwise there's no guarantee that
>>> - * this will hit the disk before the next checkpoint moves the redo 
>>> pointer.
>>> + * reinitialized on restart.  An immediate sync is required even if the
>>> + * page has been logged, because the write did not go through
>>> + * shared_buffers and therefore a concurrent checkpoint may have moved
>>> + * the redo pointer past our xlog record.
>>>   */
>>
>> Hmm.  Well, that deletes the comment that's no longer true, but it
>> doesn't replace it with any explanation of why we also need to WAL-log
>> it unconditionally, and I think that explanation is not entirely
>> trivial?
>
> OK, the original code does not give any special reason either
> regarding why doing so is safe for archiving or streaming :)

Yeah, but surely it's obvious that if you don't WAL log it, it's not
going to work for archiving or streaming.  It's a lot less obvious why
you have to WAL log it when you're not doing either of those things if
you've already written it and fsync'd it locally.  How is it going to
disappear if it's been fsync'd, one wonders?

> More seriously, if there could be more details regarding that, I would
> think that we could say something like "logging the init fork is
> mandatory in any case to ensure its on-disk presence when at recovery
> replay, even on non-default tablespaces whose base location are
> deleted and re-created from scratch if the WAL record in charge of
> creating this tablespace gets replayed". The problem shows up because
> of tablespaces being deleted at replay at the end... So perhaps this
> makes sense. What do you think?

Yes, that's about what I think we need to explain.  Actually, I'm
wondering if we ought to just switch this over to using the delayChkpt
mechanism instead of going through this complicated song-and-dance.
Incurring an immediate sync to avoid having to WAL-log this is a bit
tenuous, but having to WAL-log it AND do the immediate sync just seems
silly, and I'm actually a bit worried that even with your fix there
might still be a latent bug somewhere.  We couldn't switch mechanisms
cleanly in the 9.2 branch (cf.
f21bb9cfb5646e1793dcc9c0ea697bab99afa523) so perhaps we should
back-patch it in the form you have it in already, but it's probably
worth switching over at least in master.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unlogged tables cleanup

2016-11-16 Thread Michael Paquier
On Wed, Nov 16, 2016 at 7:09 PM, Robert Haas  wrote:
> On Wed, Nov 16, 2016 at 3:54 PM, Michael Paquier
>  wrote:
>> Indeed I missed this comment block. Please let me suggest the following 
>> instead:
>>  /*
>>   * Set up an init fork for an unlogged table so that it can be correctly
>> - * reinitialized on restart.  Since we're going to do an immediate sync, we
>> - * only need to xlog this if archiving or streaming is enabled.  And the
>> - * immediate sync is required, because otherwise there's no guarantee that
>> - * this will hit the disk before the next checkpoint moves the redo pointer.
>> + * reinitialized on restart.  An immediate sync is required even if the
>> + * page has been logged, because the write did not go through
>> + * shared_buffers and therefore a concurrent checkpoint may have moved
>> + * the redo pointer past our xlog record.
>>   */
>
> Hmm.  Well, that deletes the comment that's no longer true, but it
> doesn't replace it with any explanation of why we also need to WAL-log
> it unconditionally, and I think that explanation is not entirely
> trivial?

OK, the original code does not give any special reason either
regarding why doing so is safe for archiving or streaming :)

More seriously, if there could be more details regarding that, I would
think that we could say something like "logging the init fork is
mandatory in any case to ensure its on-disk presence when at recovery
replay, even on non-default tablespaces whose base location are
deleted and re-created from scratch if the WAL record in charge of
creating this tablespace gets replayed". The problem shows up because
of tablespaces being deleted at replay at the end... So perhaps this
makes sense. What do you think?
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unlogged tables cleanup

2016-11-16 Thread Robert Haas
On Wed, Nov 16, 2016 at 3:54 PM, Michael Paquier
 wrote:
> On Wed, Nov 16, 2016 at 11:45 AM, Robert Haas  wrote:
>> The header comment for heap_create_init_fork() says this:
>>
>> /*
>>  * Set up an init fork for an unlogged table so that it can be correctly
>>  * reinitialized on restart.  Since we're going to do an immediate sync, we
>>  * only need to xlog this if archiving or streaming is enabled.  And the
>>  * immediate sync is required, because otherwise there's no guarantee that
>>  * this will hit the disk before the next checkpoint moves the redo pointer.
>>  */
>>
>> Your patch causes the code not to match the comment any more.  And the
>> comment explains why at the time I wrote this code I thought it was OK
>> to have the XLogIsNeeded() test in there, so it clearly needs
>> updating.
>
> Indeed I missed this comment block. Please let me suggest the following 
> instead:
>  /*
>   * Set up an init fork for an unlogged table so that it can be correctly
> - * reinitialized on restart.  Since we're going to do an immediate sync, we
> - * only need to xlog this if archiving or streaming is enabled.  And the
> - * immediate sync is required, because otherwise there's no guarantee that
> - * this will hit the disk before the next checkpoint moves the redo pointer.
> + * reinitialized on restart.  An immediate sync is required even if the
> + * page has been logged, because the write did not go through
> + * shared_buffers and therefore a concurrent checkpoint may have moved
> + * the redo pointer past our xlog record.
>   */

Hmm.  Well, that deletes the comment that's no longer true, but it
doesn't replace it with any explanation of why we also need to WAL-log
it unconditionally, and I think that explanation is not entirely
trivial?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unlogged tables cleanup

2016-11-16 Thread Michael Paquier
On Wed, Nov 16, 2016 at 11:45 AM, Robert Haas  wrote:
> The header comment for heap_create_init_fork() says this:
>
> /*
>  * Set up an init fork for an unlogged table so that it can be correctly
>  * reinitialized on restart.  Since we're going to do an immediate sync, we
>  * only need to xlog this if archiving or streaming is enabled.  And the
>  * immediate sync is required, because otherwise there's no guarantee that
>  * this will hit the disk before the next checkpoint moves the redo pointer.
>  */
>
> Your patch causes the code not to match the comment any more.  And the
> comment explains why at the time I wrote this code I thought it was OK
> to have the XLogIsNeeded() test in there, so it clearly needs
> updating.

Indeed I missed this comment block. Please let me suggest the following instead:
 /*
  * Set up an init fork for an unlogged table so that it can be correctly
- * reinitialized on restart.  Since we're going to do an immediate sync, we
- * only need to xlog this if archiving or streaming is enabled.  And the
- * immediate sync is required, because otherwise there's no guarantee that
- * this will hit the disk before the next checkpoint moves the redo pointer.
+ * reinitialized on restart.  An immediate sync is required even if the
+ * page has been logged, because the write did not go through
+ * shared_buffers and therefore a concurrent checkpoint may have moved
+ * the redo pointer past our xlog record.
  */
-- 
Michael
diff --git a/contrib/bloom/blinsert.c b/contrib/bloom/blinsert.c
index 0946aa2..c9c7049 100644
--- a/contrib/bloom/blinsert.c
+++ b/contrib/bloom/blinsert.c
@@ -164,13 +164,12 @@ blbuildempty(Relation index)
 	metapage = (Page) palloc(BLCKSZ);
 	BloomFillMetapage(index, metapage);
 
-	/* Write the page.  If archiving/streaming, XLOG it. */
+	/* Write the page and log it. */
 	PageSetChecksumInplace(metapage, BLOOM_METAPAGE_BLKNO);
 	smgrwrite(index->rd_smgr, INIT_FORKNUM, BLOOM_METAPAGE_BLKNO,
 			  (char *) metapage, true);
-	if (XLogIsNeeded())
-		log_newpage(>rd_smgr->smgr_rnode.node, INIT_FORKNUM,
-	BLOOM_METAPAGE_BLKNO, metapage, false);
+	log_newpage(>rd_smgr->smgr_rnode.node, INIT_FORKNUM,
+BLOOM_METAPAGE_BLKNO, metapage, false);
 
 	/*
 	 * An immediate sync is required even if we xlog'd the page, because the
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 128744c..624aa84 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -242,13 +242,12 @@ btbuildempty(Relation index)
 	metapage = (Page) palloc(BLCKSZ);
 	_bt_initmetapage(metapage, P_NONE, 0);
 
-	/* Write the page.  If archiving/streaming, XLOG it. */
+	/* Write the page and log it */
 	PageSetChecksumInplace(metapage, BTREE_METAPAGE);
 	smgrwrite(index->rd_smgr, INIT_FORKNUM, BTREE_METAPAGE,
 			  (char *) metapage, true);
-	if (XLogIsNeeded())
-		log_newpage(>rd_smgr->smgr_rnode.node, INIT_FORKNUM,
-	BTREE_METAPAGE, metapage, false);
+	log_newpage(>rd_smgr->smgr_rnode.node, INIT_FORKNUM,
+BTREE_METAPAGE, metapage, false);
 
 	/*
 	 * An immediate sync is required even if we xlog'd the page, because the
diff --git a/src/backend/access/spgist/spginsert.c b/src/backend/access/spgist/spginsert.c
index 01c8d21..8ac3b00 100644
--- a/src/backend/access/spgist/spginsert.c
+++ b/src/backend/access/spgist/spginsert.c
@@ -161,13 +161,12 @@ spgbuildempty(Relation index)
 	page = (Page) palloc(BLCKSZ);
 	SpGistInitMetapage(page);
 
-	/* Write the page.  If archiving/streaming, XLOG it. */
+	/* Write the page and log it. */
 	PageSetChecksumInplace(page, SPGIST_METAPAGE_BLKNO);
 	smgrwrite(index->rd_smgr, INIT_FORKNUM, SPGIST_METAPAGE_BLKNO,
 			  (char *) page, true);
-	if (XLogIsNeeded())
-		log_newpage(>rd_smgr->smgr_rnode.node, INIT_FORKNUM,
-	SPGIST_METAPAGE_BLKNO, page, false);
+	log_newpage(>rd_smgr->smgr_rnode.node, INIT_FORKNUM,
+SPGIST_METAPAGE_BLKNO, page, false);
 
 	/* Likewise for the root page. */
 	SpGistInitPage(page, SPGIST_LEAF);
@@ -175,9 +174,8 @@ spgbuildempty(Relation index)
 	PageSetChecksumInplace(page, SPGIST_ROOT_BLKNO);
 	smgrwrite(index->rd_smgr, INIT_FORKNUM, SPGIST_ROOT_BLKNO,
 			  (char *) page, true);
-	if (XLogIsNeeded())
-		log_newpage(>rd_smgr->smgr_rnode.node, INIT_FORKNUM,
-	SPGIST_ROOT_BLKNO, page, true);
+	log_newpage(>rd_smgr->smgr_rnode.node, INIT_FORKNUM,
+SPGIST_ROOT_BLKNO, page, true);
 
 	/* Likewise for the null-tuples root page. */
 	SpGistInitPage(page, SPGIST_LEAF | SPGIST_NULLS);
@@ -185,9 +183,8 @@ spgbuildempty(Relation index)
 	PageSetChecksumInplace(page, SPGIST_NULL_BLKNO);
 	smgrwrite(index->rd_smgr, INIT_FORKNUM, SPGIST_NULL_BLKNO,
 			  (char *) page, true);
-	if (XLogIsNeeded())
-		log_newpage(>rd_smgr->smgr_rnode.node, INIT_FORKNUM,
-	SPGIST_NULL_BLKNO, page, true);
+	log_newpage(>rd_smgr->smgr_rnode.node, INIT_FORKNUM,
+SPGIST_NULL_BLKNO, page, true);
 
 	/*
 	 * An immediate sync is required even if we xlog'd 

Re: [HACKERS] Unlogged tables cleanup

2016-11-16 Thread Robert Haas
On Thu, Nov 10, 2016 at 9:25 PM, Michael Paquier
 wrote:
> On Thu, Nov 10, 2016 at 10:52 PM, Kuntal Ghosh
>  wrote:
>> On Thu, Nov 10, 2016 at 3:42 PM, Michael Paquier
>>  wrote:
>>> Nah. Looking at the code the fix is quite obvious.
>>> heap_create_init_fork() is checking for XLogIsNeeded() to decide if
>>> the INIT forknum should be logged or not. But this is wrong, it needs
>>> to be done unconditionally to ensure that the relation gets created at
>>> replay.
>>
>> I think that we should also update other *buildempty() functions as well.
>> For example, if the table has a primary key, we'll encounter the error again
>> for btree index.
>
> Good point. I just had a look at all the AMs: bloom, btree and SP-gist
> are plainly broken. The others are fine. Attached is an updated patch.
>
> Here are the tests I have done with wal_level = minimal to test all the AMs:
> \! rm -rf /Users/mpaquier/Desktop/tbsp/PG*
> create extension bloom;
> create extension btree_gist;
> create extension btree_gin;
> create tablespace popo location '/Users/mpaquier/Desktop/tbsp';
> set default_tablespace = popo;
> create unlogged table foo (a int);
> insert into foo values (generate_series(1,1));
> create index foo_bt on foo(a);
> create index foo_bloom on foo using bloom(a);
> create index foo_gin on foo using gin (a);
> create index foo_gist on foo using gist (a);
> create index foo_brin on foo using brin (a);
> create unlogged table foo_geo (a box);
> insert into foo_geo values ('(1,2),(2,3)');
> create index foo_spgist ON foo_geo using spgist(a);
>
> -- crash PG
> -- Insert some data
> insert into foo values (100);
> insert into foo_geo values ('(50,12),(312,3)');
>
> This should create 8 INIT forks, 6 for the indexes and 2 for the
> tables. On HEAD, only 3 are getting created and with the patch all of
> them are.

The header comment for heap_create_init_fork() says this:

/*
 * Set up an init fork for an unlogged table so that it can be correctly
 * reinitialized on restart.  Since we're going to do an immediate sync, we
 * only need to xlog this if archiving or streaming is enabled.  And the
 * immediate sync is required, because otherwise there's no guarantee that
 * this will hit the disk before the next checkpoint moves the redo pointer.
 */

Your patch causes the code not to match the comment any more.  And the
comment explains why at the time I wrote this code I thought it was OK
to have the XLogIsNeeded() test in there, so it clearly needs
updating.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unlogged tables cleanup

2016-11-10 Thread Michael Paquier
On Thu, Nov 10, 2016 at 10:52 PM, Kuntal Ghosh
 wrote:
> On Thu, Nov 10, 2016 at 3:42 PM, Michael Paquier
>  wrote:
>> Nah. Looking at the code the fix is quite obvious.
>> heap_create_init_fork() is checking for XLogIsNeeded() to decide if
>> the INIT forknum should be logged or not. But this is wrong, it needs
>> to be done unconditionally to ensure that the relation gets created at
>> replay.
>
> I think that we should also update other *buildempty() functions as well.
> For example, if the table has a primary key, we'll encounter the error again
> for btree index.

Good point. I just had a look at all the AMs: bloom, btree and SP-gist
are plainly broken. The others are fine. Attached is an updated patch.

Here are the tests I have done with wal_level = minimal to test all the AMs:
\! rm -rf /Users/mpaquier/Desktop/tbsp/PG*
create extension bloom;
create extension btree_gist;
create extension btree_gin;
create tablespace popo location '/Users/mpaquier/Desktop/tbsp';
set default_tablespace = popo;
create unlogged table foo (a int);
insert into foo values (generate_series(1,1));
create index foo_bt on foo(a);
create index foo_bloom on foo using bloom(a);
create index foo_gin on foo using gin (a);
create index foo_gist on foo using gist (a);
create index foo_brin on foo using brin (a);
create unlogged table foo_geo (a box);
insert into foo_geo values ('(1,2),(2,3)');
create index foo_spgist ON foo_geo using spgist(a);

-- crash PG
-- Insert some data
insert into foo values (100);
insert into foo_geo values ('(50,12),(312,3)');

This should create 8 INIT forks, 6 for the indexes and 2 for the
tables. On HEAD, only 3 are getting created and with the patch all of
them are.
-- 
Michael
diff --git a/contrib/bloom/blinsert.c b/contrib/bloom/blinsert.c
index 0946aa2..c9c7049 100644
--- a/contrib/bloom/blinsert.c
+++ b/contrib/bloom/blinsert.c
@@ -164,13 +164,12 @@ blbuildempty(Relation index)
 	metapage = (Page) palloc(BLCKSZ);
 	BloomFillMetapage(index, metapage);
 
-	/* Write the page.  If archiving/streaming, XLOG it. */
+	/* Write the page and log it. */
 	PageSetChecksumInplace(metapage, BLOOM_METAPAGE_BLKNO);
 	smgrwrite(index->rd_smgr, INIT_FORKNUM, BLOOM_METAPAGE_BLKNO,
 			  (char *) metapage, true);
-	if (XLogIsNeeded())
-		log_newpage(>rd_smgr->smgr_rnode.node, INIT_FORKNUM,
-	BLOOM_METAPAGE_BLKNO, metapage, false);
+	log_newpage(>rd_smgr->smgr_rnode.node, INIT_FORKNUM,
+BLOOM_METAPAGE_BLKNO, metapage, false);
 
 	/*
 	 * An immediate sync is required even if we xlog'd the page, because the
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 128744c..624aa84 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -242,13 +242,12 @@ btbuildempty(Relation index)
 	metapage = (Page) palloc(BLCKSZ);
 	_bt_initmetapage(metapage, P_NONE, 0);
 
-	/* Write the page.  If archiving/streaming, XLOG it. */
+	/* Write the page and log it */
 	PageSetChecksumInplace(metapage, BTREE_METAPAGE);
 	smgrwrite(index->rd_smgr, INIT_FORKNUM, BTREE_METAPAGE,
 			  (char *) metapage, true);
-	if (XLogIsNeeded())
-		log_newpage(>rd_smgr->smgr_rnode.node, INIT_FORKNUM,
-	BTREE_METAPAGE, metapage, false);
+	log_newpage(>rd_smgr->smgr_rnode.node, INIT_FORKNUM,
+BTREE_METAPAGE, metapage, false);
 
 	/*
 	 * An immediate sync is required even if we xlog'd the page, because the
diff --git a/src/backend/access/spgist/spginsert.c b/src/backend/access/spgist/spginsert.c
index 01c8d21..8ac3b00 100644
--- a/src/backend/access/spgist/spginsert.c
+++ b/src/backend/access/spgist/spginsert.c
@@ -161,13 +161,12 @@ spgbuildempty(Relation index)
 	page = (Page) palloc(BLCKSZ);
 	SpGistInitMetapage(page);
 
-	/* Write the page.  If archiving/streaming, XLOG it. */
+	/* Write the page and log it. */
 	PageSetChecksumInplace(page, SPGIST_METAPAGE_BLKNO);
 	smgrwrite(index->rd_smgr, INIT_FORKNUM, SPGIST_METAPAGE_BLKNO,
 			  (char *) page, true);
-	if (XLogIsNeeded())
-		log_newpage(>rd_smgr->smgr_rnode.node, INIT_FORKNUM,
-	SPGIST_METAPAGE_BLKNO, page, false);
+	log_newpage(>rd_smgr->smgr_rnode.node, INIT_FORKNUM,
+SPGIST_METAPAGE_BLKNO, page, false);
 
 	/* Likewise for the root page. */
 	SpGistInitPage(page, SPGIST_LEAF);
@@ -175,9 +174,8 @@ spgbuildempty(Relation index)
 	PageSetChecksumInplace(page, SPGIST_ROOT_BLKNO);
 	smgrwrite(index->rd_smgr, INIT_FORKNUM, SPGIST_ROOT_BLKNO,
 			  (char *) page, true);
-	if (XLogIsNeeded())
-		log_newpage(>rd_smgr->smgr_rnode.node, INIT_FORKNUM,
-	SPGIST_ROOT_BLKNO, page, true);
+	log_newpage(>rd_smgr->smgr_rnode.node, INIT_FORKNUM,
+SPGIST_ROOT_BLKNO, page, true);
 
 	/* Likewise for the null-tuples root page. */
 	SpGistInitPage(page, SPGIST_LEAF | SPGIST_NULLS);
@@ -185,9 +183,8 @@ spgbuildempty(Relation index)
 	PageSetChecksumInplace(page, SPGIST_NULL_BLKNO);
 	smgrwrite(index->rd_smgr, INIT_FORKNUM, SPGIST_NULL_BLKNO,
 			  (char 

Re: [HACKERS] Unlogged tables cleanup

2016-11-10 Thread Kuntal Ghosh
On Thu, Nov 10, 2016 at 3:42 PM, Michael Paquier
 wrote:
> On Thu, Nov 10, 2016 at 5:15 PM, Michael Paquier
>  wrote:
>> Okay, so what happens is that the CREATE TABLESPACE record removes the
>> tablespace directory and recreates a fresh one, but as no CREATE
>> records are created for unlogged tables the init fork is not
>> re-created. It seems to me that we should log a record to recreate the
>> INIT fork, and that heap_create_with_catalog() is missing something.
>> Generating a record in RelationCreateStorage() is the more direct way,
>> and that actually fixes the issue. Now the comments at the top of it
>> mention that RelationCreateStorage() should only be used to create the
>> MAIN forknum. So, wouldn't a correct fix be to log this INIT record at
>> the end of heap_create()?
>
> Nah. Looking at the code the fix is quite obvious.
> heap_create_init_fork() is checking for XLogIsNeeded() to decide if
> the INIT forknum should be logged or not. But this is wrong, it needs
> to be done unconditionally to ensure that the relation gets created at
> replay.
I think that we should also update other *buildempty() functions as well.
For example, if the table has a primary key, we'll encounter the error again
for btree index.


-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unlogged tables cleanup

2016-11-10 Thread Michael Paquier
On Thu, Nov 10, 2016 at 5:15 PM, Michael Paquier
 wrote:
> Okay, so what happens is that the CREATE TABLESPACE record removes the
> tablespace directory and recreates a fresh one, but as no CREATE
> records are created for unlogged tables the init fork is not
> re-created. It seems to me that we should log a record to recreate the
> INIT fork, and that heap_create_with_catalog() is missing something.
> Generating a record in RelationCreateStorage() is the more direct way,
> and that actually fixes the issue. Now the comments at the top of it
> mention that RelationCreateStorage() should only be used to create the
> MAIN forknum. So, wouldn't a correct fix be to log this INIT record at
> the end of heap_create()?

Nah. Looking at the code the fix is quite obvious.
heap_create_init_fork() is checking for XLogIsNeeded() to decide if
the INIT forknum should be logged or not. But this is wrong, it needs
to be done unconditionally to ensure that the relation gets created at
replay.
-- 
Michael
diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c
index 0cf7b9e..2497a1e 100644
--- a/src/backend/catalog/heap.c
+++ b/src/backend/catalog/heap.c
@@ -1380,8 +1380,7 @@ heap_create_init_fork(Relation rel)
 {
RelationOpenSmgr(rel);
smgrcreate(rel->rd_smgr, INIT_FORKNUM, false);
-   if (XLogIsNeeded())
-   log_smgrcreate(>rd_smgr->smgr_rnode.node, INIT_FORKNUM);
+   log_smgrcreate(>rd_smgr->smgr_rnode.node, INIT_FORKNUM);
smgrimmedsync(rel->rd_smgr, INIT_FORKNUM);
 }
 

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unlogged tables cleanup

2016-11-10 Thread Michael Paquier
On Thu, Nov 10, 2016 at 4:33 PM, Michael Paquier
 wrote:
> On Thu, Nov 10, 2016 at 4:23 PM, konstantin knizhnik
>  wrote:
>> No, it is latest sources from Postgres repository.
>> Please notice that you should create new database and tablespace to 
>> reproduce this issue.
>> So actually the whole sequence is
>>
>> mkdir fs
>> initdb -D pgsql
>> pg_ctl -D pgsql -l logfile start
>> psql postgres
>> # create tablespace fs location '/home/knizhnik/dtm-data/fs';
>> # set default_tablespace=fs;
>> # create unlogged table foo(x integer);
>> # insert into foo values(generate_series(1,10));
>> # ^D
>> pkill -9 postgres
>> pg_ctl -D pgsql -l logfile start
>> # select * from foo;
>
> OK, I understood what I was missing. This can be reproduced with
> wal_level = minimal. When using hot_standby things are working
> properly.

Okay, so what happens is that the CREATE TABLESPACE record removes the
tablespace directory and recreates a fresh one, but as no CREATE
records are created for unlogged tables the init fork is not
re-created. It seems to me that we should log a record to recreate the
INIT fork, and that heap_create_with_catalog() is missing something.
Generating a record in RelationCreateStorage() is the more direct way,
and that actually fixes the issue. Now the comments at the top of it
mention that RelationCreateStorage() should only be used to create the
MAIN forknum. So, wouldn't a correct fix be to log this INIT record at
the end of heap_create()?
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unlogged tables cleanup

2016-11-09 Thread Michael Paquier
On Thu, Nov 10, 2016 at 4:23 PM, konstantin knizhnik
 wrote:
> No, it is latest sources from Postgres repository.
> Please notice that you should create new database and tablespace to reproduce 
> this issue.
> So actually the whole sequence is
>
> mkdir fs
> initdb -D pgsql
> pg_ctl -D pgsql -l logfile start
> psql postgres
> # create tablespace fs location '/home/knizhnik/dtm-data/fs';
> # set default_tablespace=fs;
> # create unlogged table foo(x integer);
> # insert into foo values(generate_series(1,10));
> # ^D
> pkill -9 postgres
> pg_ctl -D pgsql -l logfile start
> # select * from foo;

OK, I understood what I was missing. This can be reproduced with
wal_level = minimal. When using hot_standby things are working
properly.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unlogged tables cleanup

2016-11-09 Thread konstantin knizhnik

On Nov 10, 2016, at 10:17 AM, Michael Paquier wrote:

> 
> Hm.. I cannot reproduce what you see on Linux or macos. Perhaps you
> have locally a standby pointing as well to this tablespace?

No, it is latest sources from Postgres repository.
Please notice that you should create new database and tablespace to reproduce 
this issue.
So actually the whole sequence is

mkdir fs
initdb -D pgsql 
pg_ctl -D pgsql -l logfile start
psql postgres
# create tablespace fs location '/home/knizhnik/dtm-data/fs';
# set default_tablespace=fs;
# create unlogged table foo(x integer);
# insert into foo values(generate_series(1,10));
# ^D
pkill -9 postgres
pg_ctl -D pgsql -l logfile start
# select * from foo;


> -- 
> Michael



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unlogged tables cleanup

2016-11-09 Thread Michael Paquier
On Thu, Nov 10, 2016 at 3:29 AM, Robert Haas  wrote:
> On Wed, Nov 9, 2016 at 11:56 AM, Konstantin Knizhnik
>  wrote:
>> Now simulate server crash using using "pkill -9 postgres".
>>
>> knizhnik@knizhnik:~/dtm-data$ rm -f logfile ; pg_ctl -D pgsql.master -l
>> logfile start
>> pg_ctl: another server might be running; trying to start server anyway
>> server starting
>> knizhnik@knizhnik:~/dtm-data$ psql postgres
>> psql (10devel)
>> Type "help" for help.
>>
>> postgres=# select * from foo;
>> ERROR:  could not open file "pg_tblspc/16384/PG_10_201611041/12289/16385":
>> No such file or directory
>>
>> knizhnik@knizhnik:~/dtm-data$ ls fs
>> PG_10_201611041
>> knizhnik@knizhnik:~/dtm-data$ ls fs/PG_10_201611041/
>>
>> So all relation directory is removed!
>> It happens only for first table created in tablespace.
>> If you create table in Postgres data directory everything is ok: first
>> segment of relation is truncated but not deleted.
>
> Whoa.  There should be an _init fork that doesn't get removed, and
> without removing the _init fork you shouldn't be able to remove the
> directory that contains it.

Hm.. I cannot reproduce what you see on Linux or macos. Perhaps you
have locally a standby pointing as well to this tablespace?
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unlogged tables cleanup

2016-11-09 Thread Robert Haas
On Wed, Nov 9, 2016 at 11:56 AM, Konstantin Knizhnik
 wrote:
> Now simulate server crash using using "pkill -9 postgres".
>
> knizhnik@knizhnik:~/dtm-data$ rm -f logfile ; pg_ctl -D pgsql.master -l
> logfile start
> pg_ctl: another server might be running; trying to start server anyway
> server starting
> knizhnik@knizhnik:~/dtm-data$ psql postgres
> psql (10devel)
> Type "help" for help.
>
> postgres=# select * from foo;
> ERROR:  could not open file "pg_tblspc/16384/PG_10_201611041/12289/16385":
> No such file or directory
>
> knizhnik@knizhnik:~/dtm-data$ ls fs
> PG_10_201611041
> knizhnik@knizhnik:~/dtm-data$ ls fs/PG_10_201611041/
>
> So all relation directory is removed!
> It happens only for first table created in tablespace.
> If you create table in Postgres data directory everything is ok: first
> segment of relation is truncated but not deleted.

Whoa.  There should be an _init fork that doesn't get removed, and
without removing the _init fork you shouldn't be able to remove the
directory that contains it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Unlogged tables cleanup

2016-11-09 Thread Konstantin Knizhnik

Hi, hackers

I wonder if such behavior can be considered as a bug:

knizhnik@knizhnik:~/dtm-data$ psql postgres
psql (10devel)
Type "help" for help.

postgres=# create tablespace fs location '/home/knizhnik/dtm-data/fs';
CREATE TABLESPACE
postgres=# set default_tablespace=fs;
SET
postgres=# create unlogged table foo(x integer);
CREATE TABLE
postgres=# insert into foo values(generate_series(1,10));
INSERT 0 10



Now simulate server crash using using "pkill -9 postgres".

knizhnik@knizhnik:~/dtm-data$ rm -f logfile ; pg_ctl -D pgsql.master -l 
logfile start

pg_ctl: another server might be running; trying to start server anyway
server starting
knizhnik@knizhnik:~/dtm-data$ psql postgres
psql (10devel)
Type "help" for help.

postgres=# select * from foo;
ERROR:  could not open file 
"pg_tblspc/16384/PG_10_201611041/12289/16385": No such file or directory


knizhnik@knizhnik:~/dtm-data$ ls fs
PG_10_201611041
knizhnik@knizhnik:~/dtm-data$ ls fs/PG_10_201611041/


So all relation directory is removed!
It happens only for first table created in tablespace.
If you create table in Postgres data directory everything is ok: first 
segment of relation is truncated but not deleted.
Also if you create one more unlogged table in tablespace it is truncated 
correctly:


postgres=# set default_tablespace=fs;
SET
postgres=# create unlogged table foo1(x integer);
CREATE TABLE
postgres=# insert into foo1 values(generate_series(1,10));
INSERT 0 10
postgres=# \q
knizhnik@knizhnik:~/dtm-data$ pkill -9 postgres
knizhnik@knizhnik:~/dtm-data$ rm -f logfile ; pg_ctl -D pgsql.master -l 
logfile start

pg_ctl: another server might be running; trying to start server anyway
server starting
knizhnik@knizhnik:~/dtm-data$ psql postgres
psql (10devel)
Type "help" for help.

postgres=# select * from foo1;
 x
---
(0 rows)

knizhnik@knizhnik:~/dtm-data$ ls -l fs/PG_10_201611041/12289/*
-rw--- 1 knizhnik knizhnik 0 Nov  9 19:52 fs/PG_10_201611041/12289/32768
-rw--- 1 knizhnik knizhnik 0 Nov  9 19:52 
fs/PG_10_201611041/12289/32768_init



--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Unlogged tables can vanish after a crash

2014-11-19 Thread Albe Laurenz
I observed an interesting (and I think buggy) behaviour today after one of
our clusters crashed due to an out of space condition in the data directory.

Five databases in that cluster have each one unlogged table.

The log reads as follows:

PANIC  could not write to file pg_xlog/xlogtemp.1820: No space left on device
...
LOGterminating any other active server processes
...
LOGall server processes terminated; reinitializing
LOGdatabase system was interrupted; last known up at 2014-11-18 18:04:28 CET
LOGdatabase system was not properly shut down; automatic recovery in 
progress
LOGredo starts at C9/50403B20
LOGredo done at C9/5A98
LOGcheckpoint starting: end-of-recovery immediate
LOGcheckpoint complete: ...
LOGautovacuum launcher started
LOGdatabase system is ready to accept connections
...
PANIC  could not write to file pg_xlog/xlogtemp.4417: No space left on device
...
LOGterminating any other active server processes
...
LOGall server processes terminated; reinitializing
LOGdatabase system was interrupted; last known up at 2014-11-18 18:04:38 CET
LOGdatabase system was not properly shut down; automatic recovery in 
progress
LOGredo starts at C9/5B70
LOGredo done at C9/5FFFE4E0
LOGcheckpoint starting: end-of-recovery immediate
LOGcheckpoint complete: ...
FATAL  could not write to file pg_xlog/xlogtemp.4442: No space left on device
LOGstartup process (PID 4442) exited with exit code 1
LOGaborting startup due to startup process failure

After the problem was removed, the cluster was restarted.
The log reads as follows:

LOGending log output to stderr  Future log output will go to log 
destination csvlog.
LOGdatabase system was shut down at 2014-11-18 18:05:03 CET
LOGautovacuum launcher started
LOGdatabase system is ready to accept connections


So no crash recovery was performed, probably because the startup process
failed *after* it completed the end-of-recovery checkpoint.

Now the main fork files for all five unlogged tables are gone; the init fork 
files
are still there.

Obviously the main fork got nuked during recovery, but the startup process died
before it could recreate them:

/*
 * Preallocate additional log files, if wanted.
 */
PreallocXlogFiles(EndOfLog);

/*
 * Reset initial contents of unlogged relations.  This has to be done
 * AFTER recovery is complete so that any unlogged relations created
 * during recovery also get picked up.
 */
if (InRecovery)
ResetUnloggedRelations(UNLOGGED_RELATION_INIT);

It seems to me that the right fix would be to recreate the unlogged
relations *before* the checkpoint.

Yours,
Laurenz Albe

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unlogged tables can vanish after a crash

2014-11-19 Thread Andres Freund
Hi,

On 2014-11-19 11:26:56 +, Albe Laurenz wrote:
 I observed an interesting (and I think buggy) behaviour today after one of
 our clusters crashed due to an out of space condition in the data directory.

Hah, just a couple days I pushed a fix for that ;)

http://archives.postgresql.org/message-id/20140912112246.GA4984%40alap3.anarazel.de
and
http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=d3586fc8aa5d9365a5c50cb5e555971eb633a4ec

 So no crash recovery was performed, probably because the startup process
 failed *after* it completed the end-of-recovery checkpoint.
 
 Now the main fork files for all five unlogged tables are gone; the init fork 
 files
 are still there.

You can recover them by restarting with -m immediate or so again.

 It seems to me that the right fix would be to recreate the unlogged
 relations *before* the checkpoint.

Yep, that's what we're doing now.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unlogged tables can vanish after a crash

2014-11-19 Thread Albe Laurenz
Andres Freund wrote:
 On 2014-11-19 11:26:56 +, Albe Laurenz wrote:
 I observed an interesting (and I think buggy) behaviour today after one of
 our clusters crashed due to an out of space condition in the data 
 directory.
 
 Hah, just a couple days I pushed a fix for that ;)
 
 http://archives.postgresql.org/message-id/20140912112246.GA4984%40alap3.anarazel.de
 and
 http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=d3586fc8aa5d9365a5c50cb5e555971eb633a4ec

Thanks, I didn't see that.
PostgreSQL, the database system where your bugs get fixed before you report 
them!

Yours,
Laurenz Albe

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] unlogged tables vs. GIST

2013-02-12 Thread Jeevan Chalke
Hi Heikki,


On Mon, Feb 11, 2013 at 7:28 PM, Heikki Linnakangas hlinnakan...@vmware.com
 wrote:

 On 11.02.2013 08:44, Jeevan Chalke wrote:

 Hi,

 Any review comments on this ?


 Sorry for the delay.

 I did some minor cleanup on this. I added code to pg_resetxlog and
 pg_controldata to reset / display the current unlogged LSN value. I moved
 the static counter, for temporary relations, back to gistutil.c, so that
 the function in xlog.c only deals with unlogged relations. It's debatable
 if that's better, but IMHO it is. Also, the unloggedLSN counter is now
 reset to 1 at crash recovery. There's no fundamental reason it needs to be
 reset, rather than just continue from the last shutdowned value like
 nothing happened, but it seems cleaner that way.

 I'm happy with this now, but please take one more look before I commit
 this.


This morning I had a look over this. But it seems that you have already
committed it.

Changes are fine and even better.
No issues with my unit testing too.

Thanks for the commit.



 - Heikki




-- 
Jeevan B Chalke
Senior Software Engineer, RD
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Phone: +91 20 30589500

Website: www.enterprisedb.com
EnterpriseDB Blog: http://blogs.enterprisedb.com/
Follow us on Twitter: http://www.twitter.com/enterprisedb

This e-mail message (and any attachment) is intended for the use of the
individual or entity to whom it is addressed. This message contains
information from EnterpriseDB Corporation that may be privileged,
confidential, or exempt from disclosure under applicable law. If you are
not the intended recipient or authorized to receive this for the intended
recipient, any use, dissemination, distribution, retention, archiving, or
copying of this communication is strictly prohibited. If you have received
this e-mail in error, please notify the sender immediately by reply e-mail
and delete this message.


Re: [HACKERS] unlogged tables vs. GIST

2013-02-11 Thread Heikki Linnakangas

On 11.02.2013 08:44, Jeevan Chalke wrote:

Hi,

Any review comments on this ?


Sorry for the delay.

I did some minor cleanup on this. I added code to pg_resetxlog and 
pg_controldata to reset / display the current unlogged LSN value. I 
moved the static counter, for temporary relations, back to gistutil.c, 
so that the function in xlog.c only deals with unlogged relations. It's 
debatable if that's better, but IMHO it is. Also, the unloggedLSN 
counter is now reset to 1 at crash recovery. There's no fundamental 
reason it needs to be reset, rather than just continue from the last 
shutdowned value like nothing happened, but it seems cleaner that way.


I'm happy with this now, but please take one more look before I commit this.

- Heikki
diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml
index 8872920..af11eb0 100644
--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -182,8 +182,7 @@ CREATE [ [ GLOBAL | LOCAL ] { TEMPORARY | TEMP } | UNLOGGED ] TABLE [ IF NOT EXI
   automatically truncated after a crash or unclean shutdown.  The contents
   of an unlogged table are also not replicated to standby servers.
   Any indexes created on an unlogged table are automatically unlogged as
-  well; however, unlogged link linkend=GiSTGiST indexes/link are
-  currently not supported and cannot be created on an unlogged table.
+  well.
  /para
 /listitem
/varlistentry
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index e2d3390..eba95f1 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -16,6 +16,7 @@
 
 #include access/genam.h
 #include access/gist_private.h
+#include access/heapam_xlog.h
 #include catalog/index.h
 #include catalog/pg_collation.h
 #include miscadmin.h
@@ -71,9 +72,22 @@ createTempGistContext(void)
 Datum
 gistbuildempty(PG_FUNCTION_ARGS)
 {
-	ereport(ERROR,
-			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-			 errmsg(unlogged GiST indexes are not supported)));
+	Relation	index = (Relation) PG_GETARG_POINTER(0);
+	Buffer		buffer;
+
+	/* Initialize the root page */
+	buffer = ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
+	LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+
+	/* Initialize and xlog buffer */
+	START_CRIT_SECTION();
+	GISTInitBuffer(buffer, F_LEAF);
+	MarkBufferDirty(buffer);
+	log_newpage_buffer(buffer);
+	END_CRIT_SECTION();
+
+	/* Unlock and release the buffer */
+	UnlockReleaseBuffer(buffer);
 
 	PG_RETURN_VOID();
 }
@@ -391,7 +405,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
    dist, oldrlink, oldnsn, leftchildbuf,
    markfollowright);
 		else
-			recptr = GetXLogRecPtrForTemp();
+			recptr = gistGetFakeLSN(rel);
 
 		for (ptr = dist; ptr; ptr = ptr-next)
 		{
@@ -448,7 +462,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 		}
 		else
 		{
-			recptr = GetXLogRecPtrForTemp();
+			recptr = gistGetFakeLSN(rel);
 			PageSetLSN(page, recptr);
 		}
 
diff --git a/src/backend/access/gist/gistbuild.c b/src/backend/access/gist/gistbuild.c
index aec5b52..0cf22cd 100644
--- a/src/backend/access/gist/gistbuild.c
+++ b/src/backend/access/gist/gistbuild.c
@@ -158,16 +158,6 @@ gistbuild(PG_FUNCTION_ARGS)
 		elog(ERROR, index \%s\ already contains data,
 			 RelationGetRelationName(index));
 
-	/*
-	 * We can't yet handle unlogged GiST indexes, because we depend on LSNs.
-	 * This is duplicative of an error in gistbuildempty, but we want to check
-	 * here so as to throw error before doing all the index-build work.
-	 */
-	if (heap-rd_rel-relpersistence == RELPERSISTENCE_UNLOGGED)
-		ereport(ERROR,
-(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg(unlogged GiST indexes are not supported)));
-
 	/* no locking is needed */
 	buildstate.giststate = initGISTstate(index);
 
@@ -204,7 +194,7 @@ gistbuild(PG_FUNCTION_ARGS)
 		PageSetTLI(page, ThisTimeLineID);
 	}
 	else
-		PageSetLSN(page, GetXLogRecPtrForTemp());
+		PageSetLSN(page, gistGetFakeLSN(heap));
 
 	UnlockReleaseBuffer(buffer);
 
diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index e5c3d69..c5b2c87 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -798,16 +798,30 @@ gistoptions(PG_FUNCTION_ARGS)
 }
 
 /*
- * Temporary GiST indexes are not WAL-logged, but we need LSNs to detect
- * concurrent page splits anyway. GetXLogRecPtrForTemp() provides a fake
- * sequence of LSNs for that purpose. Each call generates an LSN that is
- * greater than any previous value returned by this function in the same
- * session.
+ * Temporary and unlogged GiST indexes are not WAL-logged, but we need LSNs
+ * to detect concurrent page splits anyway. This function provides a fake
+ * sequence of LSNs for that purpose.
  */
 XLogRecPtr
-GetXLogRecPtrForTemp(void)
+gistGetFakeLSN(Relation rel)
 {
 	static XLogRecPtr counter = 1;
-	

Re: [HACKERS] unlogged tables vs. GIST

2013-02-10 Thread Jeevan Chalke
Hi,

Any review comments on this ?


On Tue, Jan 29, 2013 at 6:03 PM, Jeevan Chalke 
jeevan.cha...@enterprisedb.com wrote:

 Hi Heikki,


 On Mon, Jan 28, 2013 at 2:34 PM, Heikki Linnakangas 
 hlinnakan...@vmware.com wrote:

 On 23.01.2013 17:30, Robert Haas wrote:

 On Wed, Jan 23, 2013 at 4:04 AM, Jeevan Chalke
 jeevan.chalke@enterprisedb.**com jeevan.cha...@enterprisedb.com
  wrote:

 I guess my earlier patch, which was directly incrementing

 ControlFile-unloggedLSN counter was the concern as it will take
 ControlFileLock several times.

 In this version of patch I did what Robert has suggested. At start of
 the
 postmaster, copying unloggedLSn value to XLogCtl, a shared memory
 struct.
 And
 in all access to unloggedLSN, using this shared variable using a
 SpinLock.
 And since we want to keep this counter persistent across clean shutdown,
 storing it in ControlFile before updating it.

 With this approach, we are updating ControlFile only when we shutdown
 the
 server, rest of the time we are having a shared memory counter. That
 means
 we
 are not touching pg_control every other millisecond or so. Also since
 we are
 not caring about crashes, XLogging this counter like OID counter is not
 required as such.


 On a quick read-through this looks reasonable to me, but others may
 have different opinions, and I haven't reviewed in detail.

  ...

 [a couple of good points]


 In addition to those things Robert pointed out:

  /*
 + * Temporary GiST indexes are not WAL-logged, but we need LSNs to detect
 + * concurrent page splits anyway. GetXLogRecPtrForUnloggedRel()
 provides a fake
 + * sequence of LSNs for that purpose. Each call generates an LSN that is
 + * greater than any previous value returned by this function in the same
 + * session using static counter
 + * Similarily unlogged GiST indexes are also not WAL-logged. But we
 need a
 + * persistent counter across clean shutdown. Use counter from
 ControlFile which
 + * is copied in XLogCtl.unloggedLSN to accomplish that
 + * If relation is UNLOGGED, return persistent counter from XLogCtl else
 return
 + * session wide temporary counter
 + */
 +XLogRecPtr
 +GetXLogRecPtrForUnloggedRel(**Relation rel)


 From a modularity point of view, it's not good that you xlog.c needs to
 know about Relation struct. Perhaps move the logic to check the relation is
 unlogged or not to a function in gistutil.c, and only have a small
 GetUnloggedLSN() function in xlog.c


 I kept a function as is, but instead sending Relation as a parameter, it
 now takes boolean, isUnlogged. Added a MACRO for that.



 I'd suggest adding a separate spinlock to protect unloggedLSN. I'm not
 sure if there's much contention on XLogCtl-info_lck today, but
 nevertheless it'd be better to keep this separate, also from a modularity
 point of view.


 Hmm. OK.
 Added ulsn_lck for this.



  @@ -7034,6 +7078,8 @@ CreateCheckPoint(int flags)
 LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
 ControlFile-state = DB_SHUTDOWNING;
 ControlFile-time = (pg_time_t) time(NULL);
 +   /* Store unloggedLSN value as we want it persistent
 across shutdown */
 +   ControlFile-unloggedLSN = XLogCtl-unloggedLSN;
 UpdateControlFile();
 LWLockRelease(ControlFileLock)**;
 }


 This needs to acquire the spinlock to read unloggedLSN.


 OK. Done.



 Do we need to do anything to unloggedLSN in pg_resetxlog?


 I guess NO.

 Also, added Robert's comment in bufmgr.c
 I am still unsure about the spinlock around buf flags as we are just
 examining it.

 Please let me know if I missed any.

 Thanks



 - Heikki




 --
 Jeevan B Chalke
 Senior Software Engineer, RD
 EnterpriseDB Corporation
 The Enterprise PostgreSQL Company

 Phone: +91 20 30589500

 Website: www.enterprisedb.com
 EnterpriseDB Blog: http://blogs.enterprisedb.com/
 Follow us on Twitter: http://www.twitter.com/enterprisedb

 This e-mail message (and any attachment) is intended for the use of the
 individual or entity to whom it is addressed. This message contains
 information from EnterpriseDB Corporation that may be privileged,
 confidential, or exempt from disclosure under applicable law. If you are
 not the intended recipient or authorized to receive this for the intended
 recipient, any use, dissemination, distribution, retention, archiving, or
 copying of this communication is strictly prohibited. If you have received
 this e-mail in error, please notify the sender immediately by reply e-mail
 and delete this message.




-- 
Jeevan B Chalke
Senior Software Engineer, RD
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Phone: +91 20 30589500

Website: www.enterprisedb.com
EnterpriseDB Blog: http://blogs.enterprisedb.com/
Follow us on Twitter: http://www.twitter.com/enterprisedb

This e-mail message (and any attachment) is intended for the use of the
individual or entity to whom it is addressed. This message contains
information 

Re: [HACKERS] unlogged tables vs. GIST

2013-01-29 Thread Jeevan Chalke
Hi Heikki,


On Mon, Jan 28, 2013 at 2:34 PM, Heikki Linnakangas hlinnakan...@vmware.com
 wrote:

 On 23.01.2013 17:30, Robert Haas wrote:

 On Wed, Jan 23, 2013 at 4:04 AM, Jeevan Chalke
 jeevan.chalke@enterprisedb.**com jeevan.cha...@enterprisedb.com
  wrote:

 I guess my earlier patch, which was directly incrementing

 ControlFile-unloggedLSN counter was the concern as it will take
 ControlFileLock several times.

 In this version of patch I did what Robert has suggested. At start of the
 postmaster, copying unloggedLSn value to XLogCtl, a shared memory struct.
 And
 in all access to unloggedLSN, using this shared variable using a
 SpinLock.
 And since we want to keep this counter persistent across clean shutdown,
 storing it in ControlFile before updating it.

 With this approach, we are updating ControlFile only when we shutdown the
 server, rest of the time we are having a shared memory counter. That
 means
 we
 are not touching pg_control every other millisecond or so. Also since we
 are
 not caring about crashes, XLogging this counter like OID counter is not
 required as such.


 On a quick read-through this looks reasonable to me, but others may
 have different opinions, and I haven't reviewed in detail.

  ...

 [a couple of good points]


 In addition to those things Robert pointed out:

  /*
 + * Temporary GiST indexes are not WAL-logged, but we need LSNs to detect
 + * concurrent page splits anyway. GetXLogRecPtrForUnloggedRel() provides
 a fake
 + * sequence of LSNs for that purpose. Each call generates an LSN that is
 + * greater than any previous value returned by this function in the same
 + * session using static counter
 + * Similarily unlogged GiST indexes are also not WAL-logged. But we need
 a
 + * persistent counter across clean shutdown. Use counter from
 ControlFile which
 + * is copied in XLogCtl.unloggedLSN to accomplish that
 + * If relation is UNLOGGED, return persistent counter from XLogCtl else
 return
 + * session wide temporary counter
 + */
 +XLogRecPtr
 +GetXLogRecPtrForUnloggedRel(**Relation rel)


 From a modularity point of view, it's not good that you xlog.c needs to
 know about Relation struct. Perhaps move the logic to check the relation is
 unlogged or not to a function in gistutil.c, and only have a small
 GetUnloggedLSN() function in xlog.c


I kept a function as is, but instead sending Relation as a parameter, it
now takes boolean, isUnlogged. Added a MACRO for that.



 I'd suggest adding a separate spinlock to protect unloggedLSN. I'm not
 sure if there's much contention on XLogCtl-info_lck today, but
 nevertheless it'd be better to keep this separate, also from a modularity
 point of view.


Hmm. OK.
Added ulsn_lck for this.



  @@ -7034,6 +7078,8 @@ CreateCheckPoint(int flags)
 LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
 ControlFile-state = DB_SHUTDOWNING;
 ControlFile-time = (pg_time_t) time(NULL);
 +   /* Store unloggedLSN value as we want it persistent
 across shutdown */
 +   ControlFile-unloggedLSN = XLogCtl-unloggedLSN;
 UpdateControlFile();
 LWLockRelease(ControlFileLock)**;
 }


 This needs to acquire the spinlock to read unloggedLSN.


OK. Done.



 Do we need to do anything to unloggedLSN in pg_resetxlog?


I guess NO.

Also, added Robert's comment in bufmgr.c
I am still unsure about the spinlock around buf flags as we are just
examining it.

Please let me know if I missed any.

Thanks



 - Heikki




-- 
Jeevan B Chalke
Senior Software Engineer, RD
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Phone: +91 20 30589500

Website: www.enterprisedb.com
EnterpriseDB Blog: http://blogs.enterprisedb.com/
Follow us on Twitter: http://www.twitter.com/enterprisedb

This e-mail message (and any attachment) is intended for the use of the
individual or entity to whom it is addressed. This message contains
information from EnterpriseDB Corporation that may be privileged,
confidential, or exempt from disclosure under applicable law. If you are
not the intended recipient or authorized to receive this for the intended
recipient, any use, dissemination, distribution, retention, archiving, or
copying of this communication is strictly prohibited. If you have received
this e-mail in error, please notify the sender immediately by reply e-mail
and delete this message.


support_unlogged_gist_indexes_v3.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] unlogged tables vs. GIST

2013-01-28 Thread Heikki Linnakangas

On 23.01.2013 17:30, Robert Haas wrote:

On Wed, Jan 23, 2013 at 4:04 AM, Jeevan Chalke
jeevan.cha...@enterprisedb.com  wrote:

I guess my earlier patch, which was directly incrementing
ControlFile-unloggedLSN counter was the concern as it will take
ControlFileLock several times.

In this version of patch I did what Robert has suggested. At start of the
postmaster, copying unloggedLSn value to XLogCtl, a shared memory struct.
And
in all access to unloggedLSN, using this shared variable using a SpinLock.
And since we want to keep this counter persistent across clean shutdown,
storing it in ControlFile before updating it.

With this approach, we are updating ControlFile only when we shutdown the
server, rest of the time we are having a shared memory counter. That means
we
are not touching pg_control every other millisecond or so. Also since we are
not caring about crashes, XLogging this counter like OID counter is not
required as such.


On a quick read-through this looks reasonable to me, but others may
have different opinions, and I haven't reviewed in detail.

 ...

[a couple of good points]


In addition to those things Robert pointed out:


/*
+ * Temporary GiST indexes are not WAL-logged, but we need LSNs to detect
+ * concurrent page splits anyway. GetXLogRecPtrForUnloggedRel() provides a fake
+ * sequence of LSNs for that purpose. Each call generates an LSN that is
+ * greater than any previous value returned by this function in the same
+ * session using static counter
+ * Similarily unlogged GiST indexes are also not WAL-logged. But we need a
+ * persistent counter across clean shutdown. Use counter from ControlFile which
+ * is copied in XLogCtl.unloggedLSN to accomplish that
+ * If relation is UNLOGGED, return persistent counter from XLogCtl else return
+ * session wide temporary counter
+ */
+XLogRecPtr
+GetXLogRecPtrForUnloggedRel(Relation rel)


From a modularity point of view, it's not good that you xlog.c needs to 
know about Relation struct. Perhaps move the logic to check the relation 
is unlogged or not to a function in gistutil.c, and only have a small 
GetUnloggedLSN() function in xlog.c


I'd suggest adding a separate spinlock to protect unloggedLSN. I'm not 
sure if there's much contention on XLogCtl-info_lck today, but 
nevertheless it'd be better to keep this separate, also from a 
modularity point of view.



@@ -7034,6 +7078,8 @@ CreateCheckPoint(int flags)
LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
ControlFile-state = DB_SHUTDOWNING;
ControlFile-time = (pg_time_t) time(NULL);
+   /* Store unloggedLSN value as we want it persistent across 
shutdown */
+   ControlFile-unloggedLSN = XLogCtl-unloggedLSN;
UpdateControlFile();
LWLockRelease(ControlFileLock);
}


This needs to acquire the spinlock to read unloggedLSN.

Do we need to do anything to unloggedLSN in pg_resetxlog?

- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] unlogged tables vs. GIST

2013-01-28 Thread Robert Haas
On Mon, Jan 28, 2013 at 4:04 AM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 Do we need to do anything to unloggedLSN in pg_resetxlog?

Does the server go into recovery after pg_resetxlog?  If so, no.  If
not, probably, but I have no idea what.  There's no safe value in
that case; what we ought to do is force a reset of all unlogged
relations, or just punt and hope nothing bad happens (which after all
is what pg_resetxlog is mostly about anyway).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] unlogged tables vs. GIST

2013-01-23 Thread Jeevan Chalke
On Wed, Jan 16, 2013 at 3:24 AM, Robert Haas robertmh...@gmail.com wrote:

 On Tue, Jan 15, 2013 at 4:26 PM, Tom Lane t...@sss.pgh.pa.us wrote:
  Robert Haas robertmh...@gmail.com writes:
  I think that might be acceptable from a performance point of view -
  after all, if the index is unlogged, you're saving the cost of WAL -
  but I guess I still prefer a generic solution to this problem (a
  generalization of GetXLogRecPtrForTemp) rather than a special-purpose
  solution based on the nitty-gritty of how GiST uses these values.
  What's the difference between storing this value in pg_control and,
  say, the OID counter?
 
  Well, the modularity argument is that GiST shouldn't have any special
  privileges compared to a third-party index AM.  (I realize that
  third-party AMs already have problems plugging into WAL replay, but
  that doesn't mean we should create more problems.)
 
  We could possibly dodge that objection by regarding the global counter
  as some sort of generic unlogged operation counter, available to
  anybody who needs it.  It would be good to have a plausible example of
  something else needing it, but assume somebody can think of one.
 
  The bigger issue is that the reason we don't have to update pg_control
  every other millisecond is that the OID counter is capable of tracking
  its state between checkpoints without touching pg_control, that is it
  can emit WAL records to track its increments.  I think that we should
  insist that GiST do likewise, even if we give it some space in
  pg_control.  Remember that pg_control is a single point of failure for
  the database, and the more often it's written to, the more likely it is
  that something will go wrong there.
 
  So I guess what would make sense to me is that we invent an unlogged
  ops counter that is managed exactly like the OID counter, including
  having WAL records that are treated as consuming some number of values
  in advance.  If it's 64 bits wide then the WAL records could safely be
  made to consume quite a lot of values, like a thousand or so, thus
  reducing the actual WAL I/O burden to about nothing.

 I didn't look at the actual patch (silly me?) but the only time you
 need to update the control file is when writing the shutdown
 checkpoint just before stopping the database server.  If the server
 crashes, it's OK to roll the value back to some smaller value, because
 unlogged relations will be reset anyway.  And while the server is
 running the information can live in a shared memory copy protected by
 a spinlock.  So the control file traffic should be limited to once per
 server lifetime, AFAICS.


Yes.

I guess my earlier patch, which was directly incrementing
ControlFile-unloggedLSN counter was the concern as it will take
ControlFileLock several times.

In this version of patch I did what Robert has suggested. At start of the
postmaster, copying unloggedLSn value to XLogCtl, a shared memory struct.
And
in all access to unloggedLSN, using this shared variable using a SpinLock.
And since we want to keep this counter persistent across clean shutdown,
storing it in ControlFile before updating it.

With this approach, we are updating ControlFile only when we shutdown the
server, rest of the time we are having a shared memory counter. That means
we
are not touching pg_control every other millisecond or so. Also since we are
not caring about crashes, XLogging this counter like OID counter is not
required as such.

Thanks


  --
 Robert Haas
 EnterpriseDB: http://www.enterprisedb.com
 The Enterprise PostgreSQL Company




-- 
Jeevan B Chalke
Senior Software Engineer, RD
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Phone: +91 20 30589500

Website: www.enterprisedb.com
EnterpriseDB Blog: http://blogs.enterprisedb.com/
Follow us on Twitter: http://www.twitter.com/enterprisedb

This e-mail message (and any attachment) is intended for the use of the
individual or entity to whom it is addressed. This message contains
information from EnterpriseDB Corporation that may be privileged,
confidential, or exempt from disclosure under applicable law. If you are
not the intended recipient or authorized to receive this for the intended
recipient, any use, dissemination, distribution, retention, archiving, or
copying of this communication is strictly prohibited. If you have received
this e-mail in error, please notify the sender immediately by reply e-mail
and delete this message.


support_unlogged_gist_indexes_v2.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] unlogged tables vs. GIST

2013-01-23 Thread Robert Haas
On Wed, Jan 23, 2013 at 4:04 AM, Jeevan Chalke
jeevan.cha...@enterprisedb.com wrote:
 Yes.

 I guess my earlier patch, which was directly incrementing
 ControlFile-unloggedLSN counter was the concern as it will take
 ControlFileLock several times.

 In this version of patch I did what Robert has suggested. At start of the
 postmaster, copying unloggedLSn value to XLogCtl, a shared memory struct.
 And
 in all access to unloggedLSN, using this shared variable using a SpinLock.
 And since we want to keep this counter persistent across clean shutdown,
 storing it in ControlFile before updating it.

 With this approach, we are updating ControlFile only when we shutdown the
 server, rest of the time we are having a shared memory counter. That means
 we
 are not touching pg_control every other millisecond or so. Also since we are
 not caring about crashes, XLogging this counter like OID counter is not
 required as such.

On a quick read-through this looks reasonable to me, but others may
have different opinions, and I haven't reviewed in detail.

One obvious oversight is that bufmgr.c needs a detailed comment
explaining the reason behind the change you are making there.
Otherwise, someone might think to undo it in the future, and that
would be bad.  Perhaps add something like:

However, this rule does not apply for unlogged relations, which will
be lost after a crash anyway.  Most unlogged relation pages do not
bear LSNs since we never emit WAL records for them, and therefore
flushing up through the buffer LSN would be useless, but harmless.
However, GiST indexes use LSNs internally to track page-splits, and
therefore temporary and unlogged GiST pages bear fake LSNs generated
by GetXLogRecPtrForUnloggedRel.  It is unlikely but possible that the
fake-LSN counter used for unlogged relations could advance past the
WAL insertion point; and if it did happen, attempting to flush WAL
through that location would fail, with disastrous system-wide
consequences.  To make sure that can't happen, skip the flush if the
buffer isn't permanent.

A related problem is that you're examining the buffer header flags
without taking the buffer-header spinlock.  I'm not sure this can
actually matter, because we'll always hold the content lock on the
buffer at this point, which presumably precludes any operation that
might flip that bit, but it looks like the callers all have the buffer
header flags conveniently available, so maybe they should pass that
information down to FlushBuffer.  That would also avoid accessing that
word in memory both before and after releasing the spinlock (all
callers have just released it) which can lead to memory-access stalls.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] unlogged tables vs. GIST

2013-01-15 Thread Heikki Linnakangas

On 15.01.2013 08:54, Jeevan Chalke wrote:

For (2), I have added a new function called, GetXLogRecPtrForUnloogedRel()
which returns a fake LSN for GiST indexes. However, I have removed
GetXLogRecPtrForTemp() function and added its functionality inside this new
function itself to avoid complexity.


I don't much care for using a new field in the control file for this. 
First, it seems like a big modularity violation to store a gist-specific 
counter in the control file. Second, you'd be generating a lot of 
traffic on the ControlFileLock. It's not heavily contended at the 
moment, but when the control file is updated, it's held over an fsync, 
which could cause unnecessary stalls to insertions to unlogged gist 
tables. And it's just a bad idea to share a lock for two things with 
completely different characteristics in general.


Could we stash the counter e.g. in the root page of the index?

- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] unlogged tables vs. GIST

2013-01-15 Thread Robert Haas
On Tue, Jan 15, 2013 at 1:10 PM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 On 15.01.2013 08:54, Jeevan Chalke wrote:

 For (2), I have added a new function called, GetXLogRecPtrForUnloogedRel()
 which returns a fake LSN for GiST indexes. However, I have removed
 GetXLogRecPtrForTemp() function and added its functionality inside this
 new
 function itself to avoid complexity.


 I don't much care for using a new field in the control file for this. First,
 it seems like a big modularity violation to store a gist-specific counter in
 the control file. Second, you'd be generating a lot of traffic on the
 ControlFileLock. It's not heavily contended at the moment, but when the
 control file is updated, it's held over an fsync, which could cause
 unnecessary stalls to insertions to unlogged gist tables. And it's just a
 bad idea to share a lock for two things with completely different
 characteristics in general.

 Could we stash the counter e.g. in the root page of the index?

That would require maintaining a counter per table rather than a
single global counter, which would be bad because then we'd need to
store one counter in shared memory for every table, rather than just
one, period, which runs up against the fixed sizing of shared memory.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] unlogged tables vs. GIST

2013-01-15 Thread Heikki Linnakangas

On 15.01.2013 20:33, Robert Haas wrote:

On Tue, Jan 15, 2013 at 1:10 PM, Heikki Linnakangas

Could we stash the counter e.g. in the root page of the index?


That would require maintaining a counter per table rather than a
single global counter, which would be bad because then we'd need to
store one counter in shared memory for every table, rather than just
one, period, which runs up against the fixed sizing of shared memory.


I was thinking of just adding a new field to the root page header, and 
use that field as the counter. Something like:


XLogRecPtr
GetXLogRecPtrForTemp(void)
{
rootbuf = ReadBuffer(rel, GIST_ROOT_BLKNO);
opaq = GistPageGetOpaque(BufferGetPage(rootbuf));

LockBuffer(rootbuf, GIST_EXCLUSIVE);
nsn = opaq-counter++
UnlockReleaseBuffer(rootbuf)
return nsn;
}

or perhaps we need to use locking mechanism for that, like just a new 
global lwlock or spinlock, to avoid deadlocks if someone is just 
splitting the root page. In any case, the fixed-sizedness of shared 
memory isn't an issue here.


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] unlogged tables vs. GIST

2013-01-15 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Tue, Jan 15, 2013 at 1:10 PM, Heikki Linnakangas
 hlinnakan...@vmware.com wrote:
 Could we stash the counter e.g. in the root page of the index?

 That would require maintaining a counter per table rather than a
 single global counter, which would be bad because then we'd need to
 store one counter in shared memory for every table, rather than just
 one, period, which runs up against the fixed sizing of shared memory.

I think what Heikki had in mind was that the copy in the index would be
the authoritative one, not some image in shared memory.  This'd imply
dirtying the root page on every insert, as well as increased contention
for the root page, so it might have performance problems.

I think a bigger issue is where we'd find any space for it.  There's no
easily-spare space in a GIST page.  This reminds me again that the lack
of a metapage in GIST was a serious design error, which we should
correct if we ever break on-disk compatibility again.

I concur that adding such a counter to pg_control is a nonstarter,
though.

Given that we don't need crash recovery for an unlogged table, could
we get away with some variant of NSN that has weaker semantics than
XLOG LSNs?

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] unlogged tables vs. GIST

2013-01-15 Thread Robert Haas
On Tue, Jan 15, 2013 at 1:48 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 On Tue, Jan 15, 2013 at 1:10 PM, Heikki Linnakangas
 hlinnakan...@vmware.com wrote:
 Could we stash the counter e.g. in the root page of the index?

 That would require maintaining a counter per table rather than a
 single global counter, which would be bad because then we'd need to
 store one counter in shared memory for every table, rather than just
 one, period, which runs up against the fixed sizing of shared memory.

 I think what Heikki had in mind was that the copy in the index would be
 the authoritative one, not some image in shared memory.  This'd imply
 dirtying the root page on every insert, as well as increased contention
 for the root page, so it might have performance problems.

 I think a bigger issue is where we'd find any space for it.  There's no
 easily-spare space in a GIST page.  This reminds me again that the lack
 of a metapage in GIST was a serious design error, which we should
 correct if we ever break on-disk compatibility again.

 I concur that adding such a counter to pg_control is a nonstarter,
 though.

 Given that we don't need crash recovery for an unlogged table, could
 we get away with some variant of NSN that has weaker semantics than
 XLOG LSNs?

It needs to be strictly ascending and survive clean shutdowns.  Is
there some place we could preserve it other than the control file?

I was assuming we wanted a single sequence shared across all relations
rather than a sequence per relation, but I don't know of any reason
why that's actually required.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] unlogged tables vs. GIST

2013-01-15 Thread Heikki Linnakangas

On 15.01.2013 20:48, Tom Lane wrote:

Robert Haasrobertmh...@gmail.com  writes:

On Tue, Jan 15, 2013 at 1:10 PM, Heikki Linnakangas
hlinnakan...@vmware.com  wrote:

Could we stash the counter e.g. in the root page of the index?



That would require maintaining a counter per table rather than a
single global counter, which would be bad because then we'd need to
store one counter in shared memory for every table, rather than just
one, period, which runs up against the fixed sizing of shared memory.


I think what Heikki had in mind was that the copy in the index would be
the authoritative one, not some image in shared memory.  This'd imply
dirtying the root page on every insert, as well as increased contention
for the root page, so it might have performance problems.


Not every insert, just every split. Which might still be a performance 
problem, but an order of magnitude smaller.



I think a bigger issue is where we'd find any space for it.  There's no
easily-spare space in a GIST page.


We could use a larger opaque struct, with the additional field, for the 
root page, for new indexes. As long as we continue to support the 
current layout too, that won't break on-disk compatibility. We didn't 
support unlogged gist indexes before, so we won't have to worry about 
upgrading unlogged indexes that miss the field.


Or if 32 bits is enough for this, we could reuse the right-link. The 
root page has no right link, so it can be repurposed.



This reminds me again that the lack
of a metapage in GIST was a serious design error, which we should
correct if we ever break on-disk compatibility again.


Yeah.


I concur that adding such a counter to pg_control is a nonstarter,
though.

Given that we don't need crash recovery for an unlogged table, could
we get away with some variant of NSN that has weaker semantics than
XLOG LSNs?


One thought I had is that we only generate an NSN when a page is split, 
and gist never deletes pages, so how about simply using the block number 
of the newly split page as the NSN? That closes the chance of 
reinventing page recycling in the future, though.


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] unlogged tables vs. GIST

2013-01-15 Thread Andres Freund
On 2013-01-15 20:58:00 +0200, Heikki Linnakangas wrote:
 On 15.01.2013 20:48, Tom Lane wrote:
 Robert Haasrobertmh...@gmail.com  writes:
 On Tue, Jan 15, 2013 at 1:10 PM, Heikki Linnakangas
 hlinnakan...@vmware.com  wrote:
 Could we stash the counter e.g. in the root page of the index?
 
 That would require maintaining a counter per table rather than a
 single global counter, which would be bad because then we'd need to
 store one counter in shared memory for every table, rather than just
 one, period, which runs up against the fixed sizing of shared memory.
 
 I think what Heikki had in mind was that the copy in the index would be
 the authoritative one, not some image in shared memory.  This'd imply
 dirtying the root page on every insert, as well as increased contention
 for the root page, so it might have performance problems.
 
 Not every insert, just every split. Which might still be a performance
 problem, but an order of magnitude smaller.

I might be dense here and I don't really know that code, but if its only
splits why not do an XLogInsert(XLOG_GIST_NSN) or something there?
Inventing some other form of logging just because its an unlogged table
seems like reinventing the wheel.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] unlogged tables vs. GIST

2013-01-15 Thread Robert Haas
On Tue, Jan 15, 2013 at 1:58 PM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 I think what Heikki had in mind was that the copy in the index would be
 the authoritative one, not some image in shared memory.  This'd imply
 dirtying the root page on every insert, as well as increased contention
 for the root page, so it might have performance problems.

 Not every insert, just every split. Which might still be a performance
 problem, but an order of magnitude smaller.

I think that might be acceptable from a performance point of view -
after all, if the index is unlogged, you're saving the cost of WAL -
but I guess I still prefer a generic solution to this problem (a
generalization of GetXLogRecPtrForTemp) rather than a special-purpose
solution based on the nitty-gritty of how GiST uses these values.
What's the difference between storing this value in pg_control and,
say, the OID counter?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] unlogged tables vs. GIST

2013-01-15 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 I think that might be acceptable from a performance point of view -
 after all, if the index is unlogged, you're saving the cost of WAL -
 but I guess I still prefer a generic solution to this problem (a
 generalization of GetXLogRecPtrForTemp) rather than a special-purpose
 solution based on the nitty-gritty of how GiST uses these values.
 What's the difference between storing this value in pg_control and,
 say, the OID counter?

Well, the modularity argument is that GiST shouldn't have any special
privileges compared to a third-party index AM.  (I realize that
third-party AMs already have problems plugging into WAL replay, but
that doesn't mean we should create more problems.)

We could possibly dodge that objection by regarding the global counter
as some sort of generic unlogged operation counter, available to
anybody who needs it.  It would be good to have a plausible example of
something else needing it, but assume somebody can think of one.

The bigger issue is that the reason we don't have to update pg_control
every other millisecond is that the OID counter is capable of tracking
its state between checkpoints without touching pg_control, that is it
can emit WAL records to track its increments.  I think that we should
insist that GiST do likewise, even if we give it some space in
pg_control.  Remember that pg_control is a single point of failure for
the database, and the more often it's written to, the more likely it is
that something will go wrong there.

So I guess what would make sense to me is that we invent an unlogged
ops counter that is managed exactly like the OID counter, including
having WAL records that are treated as consuming some number of values
in advance.  If it's 64 bits wide then the WAL records could safely be
made to consume quite a lot of values, like a thousand or so, thus
reducing the actual WAL I/O burden to about nothing.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] unlogged tables vs. GIST

2013-01-15 Thread Robert Haas
On Tue, Jan 15, 2013 at 4:26 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 I think that might be acceptable from a performance point of view -
 after all, if the index is unlogged, you're saving the cost of WAL -
 but I guess I still prefer a generic solution to this problem (a
 generalization of GetXLogRecPtrForTemp) rather than a special-purpose
 solution based on the nitty-gritty of how GiST uses these values.
 What's the difference between storing this value in pg_control and,
 say, the OID counter?

 Well, the modularity argument is that GiST shouldn't have any special
 privileges compared to a third-party index AM.  (I realize that
 third-party AMs already have problems plugging into WAL replay, but
 that doesn't mean we should create more problems.)

 We could possibly dodge that objection by regarding the global counter
 as some sort of generic unlogged operation counter, available to
 anybody who needs it.  It would be good to have a plausible example of
 something else needing it, but assume somebody can think of one.

 The bigger issue is that the reason we don't have to update pg_control
 every other millisecond is that the OID counter is capable of tracking
 its state between checkpoints without touching pg_control, that is it
 can emit WAL records to track its increments.  I think that we should
 insist that GiST do likewise, even if we give it some space in
 pg_control.  Remember that pg_control is a single point of failure for
 the database, and the more often it's written to, the more likely it is
 that something will go wrong there.

 So I guess what would make sense to me is that we invent an unlogged
 ops counter that is managed exactly like the OID counter, including
 having WAL records that are treated as consuming some number of values
 in advance.  If it's 64 bits wide then the WAL records could safely be
 made to consume quite a lot of values, like a thousand or so, thus
 reducing the actual WAL I/O burden to about nothing.

I didn't look at the actual patch (silly me?) but the only time you
need to update the control file is when writing the shutdown
checkpoint just before stopping the database server.  If the server
crashes, it's OK to roll the value back to some smaller value, because
unlogged relations will be reset anyway.  And while the server is
running the information can live in a shared memory copy protected by
a spinlock.  So the control file traffic should be limited to once per
server lifetime, AFAICS.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] unlogged tables vs. GIST

2013-01-14 Thread Jeevan Chalke
Hi Robert / Tom

I think to support GiST with unlogged table we need to do three things:

1. Teach the buffer manager that the LSN of a page not marked
BM_PERMANENT can be ignored

2. Teach GetXLogRecPtrForTemp() to allocate fake LSNs for GiST buffers
using a 64-bit, counter that is persisted across clean shutdowns

3. Remove the error checks that prevent creating an unlogged GiST
index and update the documentation accordingly

PFA, patch which try to do above things.

For (2), I have added a new function called, GetXLogRecPtrForUnloogedRel()
which returns a fake LSN for GiST indexes. However, I have removed
GetXLogRecPtrForTemp() function and added its functionality inside this new
function itself to avoid complexity.

With this patch I am able to create GiST for unlogged tables.
It works well with my examples.
Also make check has no issues.

Please have a look and let me know your views.

Thanks



On Sat, Dec 18, 2010 at 7:20 AM, Robert Haas robertmh...@gmail.com wrote:

 On Fri, Dec 17, 2010 at 4:17 PM, Tom Lane t...@sss.pgh.pa.us wrote:
  Since these bits will only be set/cleared when the buffer mapping is
  changed, can we examine this bit without taking the spinlock?
 
  Only if you're willing for the result to be unreliable.

 If we read the bits while someone else is writing them, we'll get
 either the old or the new value, but the BM_FLUSH_XLOG bit will be the
 same either way.  When FlushBuffer() is called, a shared content log
 and a pin are held, so the buffer can't be renamed under us.  If
 that's really unacceptable, we can do something like the attached, but
 I think this is unnecessarily gross.  We already assume in other
 places that the read or write of an integer is atomic.  In this case
 we don't even need it to be fully atomic - we just need the particular
 byte that contains this bit not to go through some intermediate state
 where the bit is unset.  Is there really a memory architecture out
 there that's crazy enough to let such a thing happen?  If so, I'd
 expect things to be breaking right and left on that machine anyway.

 --
 Robert Haas
 EnterpriseDB: http://www.enterprisedb.com
 The Enterprise PostgreSQL Company


 --
 Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-hackers




-- 
Jeevan B Chalke
Senior Software Engineer, RD
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Phone: +91 20 30589500

Website: www.enterprisedb.com
EnterpriseDB Blog: http://blogs.enterprisedb.com/
Follow us on Twitter: http://www.twitter.com/enterprisedb

This e-mail message (and any attachment) is intended for the use of the
individual or entity to whom it is addressed. This message contains
information from EnterpriseDB Corporation that may be privileged,
confidential, or exempt from disclosure under applicable law. If you are
not the intended recipient or authorized to receive this for the intended
recipient, any use, dissemination, distribution, retention, archiving, or
copying of this communication is strictly prohibited. If you have received
this e-mail in error, please notify the sender immediately by reply e-mail
and delete this message.


support_unlogged_gist_indexes.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Unlogged tables and BufferSync()

2012-01-06 Thread Heikki Linnakangas

In BufferSync(), we have this:


/*
 * Unless this is a shutdown checkpoint, we write only permanent, dirty
 * buffers.  But at shutdown time, we write all dirty buffers.
 */
if (!(flags  CHECKPOINT_IS_SHUTDOWN))
flags |= BM_PERMANENT;


That seems bogus to me. We're mixing CHECKPOINT_* flags and buffer BM_* 
flags in the same variable. Furthermore, we only use that flags variable 
to pass it down to CheckpointWriteDelay(), which only pays attention to 
CHECKPOINT_IMMEDIATE flag. The intention was probably to do mask |= 
BM_PERMANENT ?


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unlogged tables and BufferSync()

2012-01-06 Thread Robert Haas
On Fri, Jan 6, 2012 at 7:52 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 In BufferSync(), we have this:

        /*
         * Unless this is a shutdown checkpoint, we write only permanent,
 dirty
         * buffers.  But at shutdown time, we write all dirty buffers.
         */
        if (!(flags  CHECKPOINT_IS_SHUTDOWN))
                flags |= BM_PERMANENT;


 That seems bogus to me. We're mixing CHECKPOINT_* flags and buffer BM_*
 flags in the same variable. Furthermore, we only use that flags variable to
 pass it down to CheckpointWriteDelay(), which only pays attention to
 CHECKPOINT_IMMEDIATE flag. The intention was probably to do mask |=
 BM_PERMANENT ?

Ouch.  Yeah, that's a bug, will fix.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unlogged tables, persistent kind

2011-05-03 Thread Simon Riggs
On Tue, Apr 26, 2011 at 8:49 AM, Leonardo Francalanci m_li...@yahoo.it wrote:
  If that 1%  is random (not time/transaction related), usually you'd rather
have an empty  table.

 Why do you think it would be random?

 Heap blocks would be zeroed if they were found to be damaged, following a
 crash.

 If you erase full blocks, you have no idea what data you erased; it could be
 something changed 1 hour ago, 1 month ago, 1 year ago. This is very different
 from,
 say, synchronous_commit=off: in that case, the most recent transactions may 
 be
 lost if the database should crash. In your case, some (who knows which???)
 data
 is lost. So, to me that sounds like random loss. I don't think that that is
 different
 from a corrupted table. You're not deleting rows recently changed; you're
 deleting
 everything that is physically close to it.

  In other  words: is a table that is not consistant with anything else in 
  the
db  useful?

 That's too big a leap. Why would it suddenly be inconsistent with  the
 rest of the database?


 If you delete some data, and you have no idea what data you lost, I don't 
 think
 you have a
 consistent db. Unless, of course, your table has no relation with any other
 table in the db.

 Of course, all these thoughts are based on the assumption that I know what
 happens when a
 block is erased; but my knowledge of postgresql internals is not so good, so I
 might be
 *very* wrong

You're assuming that there are referential links *from* other tables
to the table with damage. In which case you would be correct. But of
course, if you needed that data for integrity you would never do that,
so the problem is a nonexistent use case. The suggested mode is for
Fact data, not reference tables.

The current assessment is that UNLOGGED tables are useful only for
running a data cache. If the database crashes, then the table is
truncated and you must refill the cache. If that is the case, then it
must surely be better to have a cache that is already 99% full, than
one which starts at empty. There is no damage or loss because parts of
the cache were missing.

Unlogged Tables are currently so volatile they are unusable for any
other purpose. I want to see a table that is useful for low value
data, such as sensor data.
If you had 10 TB of sensor data and the database crashes, then you
want to lose a few blocks, not the whole lot. Low value = rare, minor
loss is acceptable, but it doesn;t mean total data loss is acceptable.
For that use case, total loss is catastrophic, not just mildly
irritating. If you are a Telco, losing a few minutes billing data
costs much less than having every server have better hardware so it
can cope with high WAL traffic as well. They don't want to lose the
data, but its a cost based trade off. Consistency is not an issue, you
are just missing some data. That is normal anyway, since sensor data
generators (mobile devices etc) frequently fail, are offline, turned
off etc, so there isn't even a definition of what complete data is
supposed to look like. The missing data looks exactly like lots of
people turned their phones off for a few minutes.

So my suggestion makes UNLOGGED tables more useful for the use case
they were designed to address - cached data (AIUI), plus they allow
another use case that doesn't seem to be well understood, low value
data in massive data volumes.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unlogged tables, persistent kind

2011-05-03 Thread Greg Stark
On Tue, May 3, 2011 at 8:21 PM, Simon Riggs si...@2ndquadrant.com wrote:
 The current assessment is that UNLOGGED tables are useful only for
 running a data cache. If the database crashes, then the table is
 truncated and you must refill the cache. If that is the case, then it
 must surely be better to have a cache that is already 99% full, than
 one which starts at empty. There is no damage or loss because parts of
 the cache were missing.

That's not necessarily the case of course. I've written caches before
where a set of records would be present for each object being cached.
I could deal with a whole set not being present but if just some of
the records were missing then I would serve incorrect results. Since
they were created in a single transacion I should be able to rely on
the whole set being present or missing consistently.

That doesn't mean there aren't cases where that's true of course. In
an ideal world the database would guarantee that you couldn't use the
table in this way and fail to get the consistency you expect.
Something like getting an error if you try to modify two rows in an
unlogged table during a single transaction. I'm not sure how to do
that reasonably though.


 For that use case, total loss is catastrophic, not just mildly
 irritating. If you are a Telco, losing a few minutes billing data
 costs much less than having every server have better hardware so it
 can cope with high WAL traffic as well. They don't want to lose the
 data, but its a cost based trade off.

This analysis is dead on. That's precisely how businesses evaluate
this question.


 Consistency is not an issue, you
 are just missing some data. That is normal anyway, since sensor data
 generators (mobile devices etc) frequently fail, are offline, turned
 off etc, so there isn't even a definition of what complete data is
 supposed to look like. The missing data looks exactly like lots of
 people turned their phones off for a few minutes.

I don't think that's true however. Consider if I have a rollup table
that contains aggregated sums of that data. If you lose some of the
records then my aggregates don't add up to the correct values any
more.

Or consider if you are counting sensor data like total data
transferred and session count -- and then reporting the average data
transferred per session. And you accidentally lose a bunch of
sessions. Now your data/session report will be reporting false
information.

This is true even if you only lose recently committed rows. But losing
whole blocks means you risk losing random old data which makes it hard
to work around by, say, purging recently aggregated data.

 So my suggestion makes UNLOGGED tables more useful for the use case
 they were designed to address - cached data (AIUI), plus they allow
 another use case that doesn't seem to be well understood, low value
 data in massive data volumes.

There other approaches as well. Foreign data wrappers mean you could
do things like store the low value data in raw text files or other
systems like memcached or Hadoop or whatever. I'm not saying there's
no reason to do something in Postgres but if you're being bitten by
Postgres's block-oriented random access storage it may be a problem
too fundamental to solve without addressing the underlying storage
strategy?


-- 
greg

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unlogged tables, persistent kind

2011-05-03 Thread Tom Lane
Simon Riggs si...@2ndquadrant.com writes:
 You're assuming that there are referential links *from* other tables
 to the table with damage. In which case you would be correct. But of
 course, if you needed that data for integrity you would never do that,
 so the problem is a nonexistent use case. The suggested mode is for
 Fact data, not reference tables.

So I suppose your notion of fact data includes no fields that are wide
enough to need toasting?  Because as soon as you have any out-of-line
values, there's an equivalent of foreign keys behind the scenes, where
the user can't see it (until he gets missing chunk number or some such
error).

 The current assessment is that UNLOGGED tables are useful only for
 running a data cache. If the database crashes, then the table is
 truncated and you must refill the cache. If that is the case, then it
 must surely be better to have a cache that is already 99% full, than
 one which starts at empty. There is no damage or loss because parts of
 the cache were missing.

A cache that starts at 99% full of untrustworthy data is NOT better.

 Unlogged Tables are currently so volatile they are unusable for any
 other purpose. I want to see a table that is useful for low value
 data, such as sensor data.

Basically, you're being hopelessly optimistic both about the extent to
which a crash is likely to render data inconsistent, and our ability to
detect that inconsistency.  It doesn't matter whether the data is low
value, the difficulty of cleaning up remains the same.  I don't want to
deal with trying to detect that, and I definitely don't want to dump the
problems onto users.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unlogged tables, persistent kind

2011-05-03 Thread Christopher Browne
On Tue, May 3, 2011 at 4:54 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Simon Riggs si...@2ndquadrant.com writes:
 You're assuming that there are referential links *from* other tables
 to the table with damage. In which case you would be correct. But of
 course, if you needed that data for integrity you would never do that,
 so the problem is a nonexistent use case. The suggested mode is for
 Fact data, not reference tables.

 So I suppose your notion of fact data includes no fields that are wide
 enough to need toasting?  Because as soon as you have any out-of-line
 values, there's an equivalent of foreign keys behind the scenes, where
 the user can't see it (until he gets missing chunk number or some such
 error).

 The current assessment is that UNLOGGED tables are useful only for
 running a data cache. If the database crashes, then the table is
 truncated and you must refill the cache. If that is the case, then it
 must surely be better to have a cache that is already 99% full, than
 one which starts at empty. There is no damage or loss because parts of
 the cache were missing.

 A cache that starts at 99% full of untrustworthy data is NOT better.

That's a bit pessimistic.

The case that bugs me is that a cache that's 99% trustworthy, but
where I have no idea that:
a) 1% of it is crud, and
b) Which 1% of it is crud
is still a pretty unacceptable scenario.

I head back to our policy for handling caches:
  If in doubt, TRUNCATE.

That policy would be nicely consistent with the way 9.1 deals with
unlogged tables.

 Unlogged Tables are currently so volatile they are unusable for any
 other purpose. I want to see a table that is useful for low value
 data, such as sensor data.

 Basically, you're being hopelessly optimistic both about the extent to
 which a crash is likely to render data inconsistent, and our ability to
 detect that inconsistency.  It doesn't matter whether the data is low
 value, the difficulty of cleaning up remains the same.  I don't want to
 deal with trying to detect that, and I definitely don't want to dump the
 problems onto users.

+1, on both grounds.
-- 
When confronted by a difficult problem, solve it by reducing it to the
question, How would the Lone Ranger handle this?

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unlogged tables, persistent kind

2011-04-26 Thread Leonardo Francalanci
  If that 1%  is random (not time/transaction related), usually you'd rather 
have an empty  table.
 
 Why do you think it would be random?

Heap blocks would be zeroed if they were found to be damaged, following a 
crash.

If you erase full blocks, you have no idea what data you erased; it could be
something changed 1 hour ago, 1 month ago, 1 year ago. This is very different 
from,
say, synchronous_commit=off: in that case, the most recent transactions may be
lost if the database should crash. In your case, some (who knows which???) 
data
is lost. So, to me that sounds like random loss. I don't think that that is 
different
from a corrupted table. You're not deleting rows recently changed; you're 
deleting
everything that is physically close to it.

  In other  words: is a table that is not consistant with anything else in 
  the 
db  useful?
 
 That's too big a leap. Why would it suddenly be inconsistent with  the
 rest of the database?


If you delete some data, and you have no idea what data you lost, I don't think 
you have a
consistent db. Unless, of course, your table has no relation with any other 
table in the db.

Of course, all these thoughts are based on the assumption that I know what 
happens when a
block is erased; but my knowledge of postgresql internals is not so good, so I 
might be
*very* wrong

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unlogged tables, persistent kind

2011-04-26 Thread Cédric Villemain
2011/4/25 Christopher Browne cbbro...@gmail.com:
 On Mon, Apr 25, 2011 at 2:03 PM, Jesper Krogh jes...@krogh.cc wrote:
 On 2011-04-25 20:00, Leonardo Francalanci wrote:
 The amount of data loss on a big table will be 1% of the data
 loss caused by truncating the whole table.

 If that 1% is random (not time/transaction related), usually you'd
 rather have an empty table. In other words: is a table that is not
 consistant with anything else in the db useful?

 Depends on the application, if it serves for pure caching then it is fully
 acceptable and way
 better than dropping everything.

 Whoah...

 When cacheing, the application already needs to be able to cope with
 the case where there's nothing in the cache.

 This means that if the cache gets truncated, it's reasonable to expect
 that the application won't get deranged - it already needs to cope
 with the case where data's not there and needs to get constructed.

That is true but the application performance has already to cope with
a server crash/restart. Many things you can add to make the restart
(for the application)  more 'smooth' is good.


 In contrast, if *wrong* data is in the cache, that could very well
 lead to wrong behavior on the part of the application.

 And there may not be any mechanism aside from cache truncation that
 will rectify that.

 It seems to me that it's a lot riskier to try to preserve contents of
 such tables than it is to truncate them.
 --
 When confronted by a difficult problem, solve it by reducing it to the
 question, How would the Lone Ranger handle this?

 --
 Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-hackers




-- 
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unlogged tables, persistent kind

2011-04-25 Thread Robert Haas
On Apr 24, 2011, at 1:22 PM, Simon Riggs si...@2ndquadrant.com wrote:
 Unlogged tables are a good new feature.

Thanks.

 I noticed Bruce had mentioned they were the equivalent of NoSQL, which
 I don't really accept.

Me neither. I thought that was poorly said.

 Heap blocks would be zeroed if they were found to be damaged, following a 
 crash.

The problem is not so much the blocks that are damaged (e.g. half-written, torn 
page) but the ones that were never written at all. For example, read page A, 
read page B, update tuple on page A putting new version on page B, write one 
but not both of A and B out to the O/S, crash.  Everything on disk is a valid 
page, but they are not coherent taken as a whole.  It's normally XLOG replay 
that fixes this type of situation...

I thought about this problem a bit and I think you could perhaps deal with it 
by having some sort of partially logged table, where we would XLOG just enough 
to know which blocks or relations had been modified and only nuke enough data 
to be certain of being safe. But it isn't clear that there is much use case for 
this, especially because I think it would give up nearly all the performance 
benefit.

I do think it might be useful to have an unlogged index on a logged table, 
somehow frobnicated so that on a crash the index is known invalid and not used 
until a REINDEX is performed.

...Robert
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unlogged tables, persistent kind

2011-04-25 Thread Leonardo Francalanci
The only data we can't rebuild it's the heap. So what about an option for 
UNlogged indexes on a LOGged table? It would always preserve data, and it would 
'only' cost a rebuilding of the indexes in case of an unclean shutdown. I think 
it would give a boost in performance for all those cases where the IO 
(especially random IO) is caused by the indexes, and it doesn't look too 
complicated (but maybe I'm missing something).

I proposed the unlogged to logged patch (BTW has anyone given a look at it?) 
because we partition data based on a timestamp, and we can risk loosing the 
last N minutes of data, but after N minutes we want to know data will always be 
there, so we would like to set a partition table to 'logged'. 

Leonardo

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unlogged tables, persistent kind

2011-04-25 Thread Simon Riggs
On Mon, Apr 25, 2011 at 8:36 AM, Leonardo Francalanci m_li...@yahoo.it wrote:
 The only data we can't rebuild it's the heap. So what about an option for 
 UNlogged indexes on a LOGged table? It would always preserve data, and it 
 would 'only' cost a rebuilding of the indexes in case of an unclean shutdown. 
 I think it would give a boost in performance for all those cases where the IO 
 (especially random IO) is caused by the indexes, and it doesn't look too 
 complicated (but maybe I'm missing something).

 I proposed the unlogged to logged patch (BTW has anyone given a look at it?) 
 because we partition data based on a timestamp, and we can risk loosing the 
 last N minutes of data, but after N minutes we want to know data will always 
 be there, so we would like to set a partition table to 'logged'.

I agree that unlogged indexes on a logged heap are better for
resilience and are likely to be the best first step.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unlogged tables, persistent kind

2011-04-25 Thread Simon Riggs
On Mon, Apr 25, 2011 at 8:14 AM, Robert Haas robertmh...@gmail.com wrote:
 On Apr 24, 2011, at 1:22 PM, Simon Riggs si...@2ndquadrant.com wrote:
 Unlogged tables are a good new feature.

 Thanks.

 I noticed Bruce had mentioned they were the equivalent of NoSQL, which
 I don't really accept.

 Me neither. I thought that was poorly said.

 Heap blocks would be zeroed if they were found to be damaged, following a 
 crash.

 The problem is not so much the blocks that are damaged (e.g. half-written, 
 torn page) but the ones that were never written at all. For example, read 
 page A, read page B, update tuple on page A putting new version on page B, 
 write one but not both of A and B out to the O/S, crash.  Everything on disk 
 is a valid page, but they are not coherent taken as a whole.  It's normally 
 XLOG replay that fixes this type of situation...

Not really sure it matters what the cause of data loss is, does it?
The zeroing of the blocks definitely causes data loss but the
intention is to bring the table back to a consistent physical state,
not to in any way repair the data loss.

Repeating my words above, this proposed option trades potential minor
data loss for performance.

The amount of data loss on a big table will be 1% of the data loss
caused by truncating the whole table.

This is important on big tables where reloading from a backup might
take a long time.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unlogged tables, persistent kind

2011-04-25 Thread Robert Haas
On Mon, Apr 25, 2011 at 5:04 AM, Simon Riggs si...@2ndquadrant.com wrote:
 On Mon, Apr 25, 2011 at 8:14 AM, Robert Haas robertmh...@gmail.com wrote:
 On Apr 24, 2011, at 1:22 PM, Simon Riggs si...@2ndquadrant.com wrote:
 Unlogged tables are a good new feature.

 Thanks.

 I noticed Bruce had mentioned they were the equivalent of NoSQL, which
 I don't really accept.

 Me neither. I thought that was poorly said.

 Heap blocks would be zeroed if they were found to be damaged, following a 
 crash.

 The problem is not so much the blocks that are damaged (e.g. half-written, 
 torn page) but the ones that were never written at all. For example, read 
 page A, read page B, update tuple on page A putting new version on page B, 
 write one but not both of A and B out to the O/S, crash.  Everything on disk 
 is a valid page, but they are not coherent taken as a whole.  It's normally 
 XLOG replay that fixes this type of situation...

 Not really sure it matters what the cause of data loss is, does it?
 The zeroing of the blocks definitely causes data loss but the
 intention is to bring the table back to a consistent physical state,
 not to in any way repair the data loss.

Right, but the trick is how you identify which blocks you need to
zero.  You used the word damaged, which to me implied that the block
had been modified in some way but ended up with other than the
expected contents, so that something like a CRC check might detect the
problem.  My point (as perhaps you already understand) is that you
could easily have a situation where every block in the table passes a
hypothetical block-level CRC check, but the table as a whole is still
damaged because update chains aren't coherent.  So you need some kind
of mechanism for identifying which portions of the table you need to
zero to get back to a guaranteed-coherent state.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unlogged tables, persistent kind

2011-04-25 Thread Robert Haas
On Mon, Apr 25, 2011 at 3:36 AM, Leonardo Francalanci m_li...@yahoo.it wrote:
 The only data we can't rebuild it's the heap. So what about an option for 
 UNlogged indexes on a LOGged table? It would always preserve data, and it 
 would 'only' cost a rebuilding of the indexes in case of an unclean shutdown. 
 I think it would give a boost in performance for all those cases where the IO 
 (especially random IO) is caused by the indexes, and it doesn't look too 
 complicated (but maybe I'm missing something).

+1.

 I proposed the unlogged to logged patch (BTW has anyone given a look at it?) 
 because we partition data based on a timestamp, and we can risk loosing the 
 last N minutes of data, but after N minutes we want to know data will always 
 be there, so we would like to set a partition table to 'logged'.

That approach is something I had also given some thought to, and I'm
glad to hear that people are thinking about doing it in the real
world.  I'm planning to look at your patch, but I haven't gotten to it
yet, because I'm giving priority to anything that must be done to get
9.1beta1 out the door.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unlogged tables, persistent kind

2011-04-25 Thread Leonardo Francalanci
 The amount of data loss on a big
 table will be 1% of the data loss
caused by truncating the whole table.

If that 1% is random (not time/transaction related), usually you'd rather have 
an empty table. In other words: is a table that is not consistant with anything 
else in the db useful?

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unlogged tables, persistent kind

2011-04-25 Thread Jesper Krogh

On 2011-04-25 20:00, Leonardo Francalanci wrote:

 The amount of data loss on a big table will be 1% of the data
 loss caused by truncating the whole table.

 If that 1% is random (not time/transaction related), usually you'd
 rather have an empty table. In other words: is a table that is not
 consistant with anything else in the db useful?

Depends on the application, if it serves for pure caching then it is 
fully acceptable and way

better than dropping everything.

--
Jesper



Re: [HACKERS] Unlogged tables, persistent kind

2011-04-25 Thread Kevin Grittner
Jesper Krogh jes...@krogh.cc wrote:
 On 2011-04-25 20:00, Leonardo Francalanci wrote:
 The amount of data loss on a big table will be 1% of the data
 loss caused by truncating the whole table.

  If that 1% is random (not time/transaction related), usually
  you'd rather have an empty table. In other words: is a table
  that is not consistant with anything else in the db useful?

 Depends on the application, if it serves for pure caching then it
 is fully acceptable and way better than dropping everything.
 
I buy this *if* we can be sure we're not keeping information which
is duplicated or mangled, and if we can avoid crashing the server to
a panic because of broken pointers or other infelicities.  I'm not
sure that can't be done, but I don't think I've heard an explanation
of how that could be accomplished, particularly without overhead
which would wipe out the performance benefit of unlogged tables. 
(And without a performance benefit, what's the point?)
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unlogged tables, persistent kind

2011-04-25 Thread Christopher Browne
On Mon, Apr 25, 2011 at 2:03 PM, Jesper Krogh jes...@krogh.cc wrote:
 On 2011-04-25 20:00, Leonardo Francalanci wrote:
 The amount of data loss on a big table will be 1% of the data
 loss caused by truncating the whole table.

 If that 1% is random (not time/transaction related), usually you'd
 rather have an empty table. In other words: is a table that is not
 consistant with anything else in the db useful?

 Depends on the application, if it serves for pure caching then it is fully
 acceptable and way
 better than dropping everything.

Whoah...

When cacheing, the application already needs to be able to cope with
the case where there's nothing in the cache.

This means that if the cache gets truncated, it's reasonable to expect
that the application won't get deranged - it already needs to cope
with the case where data's not there and needs to get constructed.

In contrast, if *wrong* data is in the cache, that could very well
lead to wrong behavior on the part of the application.

And there may not be any mechanism aside from cache truncation that
will rectify that.

It seems to me that it's a lot riskier to try to preserve contents of
such tables than it is to truncate them.
-- 
When confronted by a difficult problem, solve it by reducing it to the
question, How would the Lone Ranger handle this?

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unlogged tables, persistent kind

2011-04-25 Thread Simon Riggs
On Mon, Apr 25, 2011 at 7:00 PM, Leonardo Francalanci m_li...@yahoo.it wrote:
 The amount of data loss on a big
 table will be 1% of the data loss
caused by truncating the whole table.

 If that 1% is random (not time/transaction related), usually you'd rather 
 have an empty table.

Why do you think it would be random?


 In other words: is a table that is not consistant with anything else in the 
 db useful?

That's too big a leap. Why would it suddenly be inconsistent with the
rest of the database?


Not good arguments.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unlogged tables, persistent kind

2011-04-25 Thread Simon Riggs
On Mon, Apr 25, 2011 at 1:42 PM, Robert Haas robertmh...@gmail.com wrote:
 On Mon, Apr 25, 2011 at 5:04 AM, Simon Riggs si...@2ndquadrant.com wrote:
 On Mon, Apr 25, 2011 at 8:14 AM, Robert Haas robertmh...@gmail.com wrote:
 On Apr 24, 2011, at 1:22 PM, Simon Riggs si...@2ndquadrant.com wrote:
 Unlogged tables are a good new feature.

 Thanks.

 I noticed Bruce had mentioned they were the equivalent of NoSQL, which
 I don't really accept.

 Me neither. I thought that was poorly said.

 Heap blocks would be zeroed if they were found to be damaged, following a 
 crash.

 The problem is not so much the blocks that are damaged (e.g. half-written, 
 torn page) but the ones that were never written at all. For example, read 
 page A, read page B, update tuple on page A putting new version on page B, 
 write one but not both of A and B out to the O/S, crash.  Everything on 
 disk is a valid page, but they are not coherent taken as a whole.  It's 
 normally XLOG replay that fixes this type of situation...

 Not really sure it matters what the cause of data loss is, does it?
 The zeroing of the blocks definitely causes data loss but the
 intention is to bring the table back to a consistent physical state,
 not to in any way repair the data loss.

 Right, but the trick is how you identify which blocks you need to
 zero.  You used the word damaged, which to me implied that the block
 had been modified in some way but ended up with other than the
 expected contents, so that something like a CRC check might detect the
 problem.  My point (as perhaps you already understand) is that you
 could easily have a situation where every block in the table passes a
 hypothetical block-level CRC check, but the table as a whole is still
 damaged because update chains aren't coherent.  So you need some kind
 of mechanism for identifying which portions of the table you need to
 zero to get back to a guaranteed-coherent state.

That sounds like progress.

The current mechanism is truncate complete table. There are clearly
other mechanisms that would not remove all data.

Probably the common case would be for insert-only data.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unlogged tables, persistent kind

2011-04-25 Thread Robert Haas
On Mon, Apr 25, 2011 at 2:21 PM, Simon Riggs si...@2ndquadrant.com wrote:
 Right, but the trick is how you identify which blocks you need to
 zero.  You used the word damaged, which to me implied that the block
 had been modified in some way but ended up with other than the
 expected contents, so that something like a CRC check might detect the
 problem.  My point (as perhaps you already understand) is that you
 could easily have a situation where every block in the table passes a
 hypothetical block-level CRC check, but the table as a whole is still
 damaged because update chains aren't coherent.  So you need some kind
 of mechanism for identifying which portions of the table you need to
 zero to get back to a guaranteed-coherent state.

 That sounds like progress.

 The current mechanism is truncate complete table. There are clearly
 other mechanisms that would not remove all data.

No doubt.  Consider a block B.  If the system crashes when block B is
dirty either in the OS cache or shared_buffers, then you must zero B,
or truncate it away.  If it was clean in both places, however, it's
good data and you can keep it.

So you can imagine for example a scheme where imagine that the
relation is divided into 8MB chunks, and we WAL-log the first
operation after each checkpoint that touches a chunk.  Replay zeroes
the chunk, and we also invalidate all the indexes (the user must
REINDEX to get them working again).  I think that would be safe, and
certainly the WAL-logging overhead would be far less than WAL-logging
every change, since we'd need to emit only ~16 bytes of WAL for every
8MB written, rather than ~8MB of WAL for every 8MB written.  It
wouldn't allow some of the optimizations that the current unlogged
tables can get away with only because they WAL-log exactly nothing -
and selectively zeroing chunks of a large table might slow down
startup quite a bit - but it might still be useful to someone.

However, I think that the logged table, unlogged index idea is
probably the most promising thing to think about doing first.  It's
easy to imagine all sorts of uses for that sort of thing even in cases
where people can't afford to have any data get zeroed, and it would
provide a convenient building block for something like the above if we
eventually wanted to go that way.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unlogged tables, persistent kind

2011-04-25 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 However, I think that the logged table, unlogged index idea is
 probably the most promising thing to think about doing first.

+1 for that --- it's clean, has a clear use-case, and would allow us
to manage the current mess around hash indexes more cleanly.
That is, hash indexes would always be treated as unlogged.

(Or of course we could fix the lack of WAL logging for hash indexes,
but I notice a lack of people stepping up to do that.)

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Unlogged tables, persistent kind

2011-04-24 Thread Simon Riggs
Unlogged tables are a good new feature.

I noticed Bruce had mentioned they were the equivalent of NoSQL, which
I don't really accept. I guess it depends upon whether you mean NoSQL
for caches (e.g. memcached) or NoSQL for useful datastores (e.g.
Mongo). It seems worth discussin now before we get too far into the
marketing hype around Beta.

If you don't log changes to tables you have two choices if we crash
1) truncate the table and any indexes
2) rebuild any indexes damaged by the crash

Currently, we do (1). That certainly has its place but most data
stores don't do this if they crash, since it would lead to data loss.
Not just a couple of rows either - serious, major data loss if you put
the wrong kind of data in it. We even delete data that has been safely
on disk for weeks, months, which IMHO some people could easily get
confused about.

In the future, I would like to work on (2), which preserves as much
data as possible, while recognising indexes may be damaged. I don't
really have any name for this, since the current naming seems to
assume there is only one kind of unlogged table.

My implementation path for that would be to add a crash_number onto
pg_control and pg_index. Any index marked as unlogged, persistent
would only be usable if it's crash number is the same as current
system crash number.

REINDEX would update the index crash number to current value. That
also allows us to imagine a repair index command in the future as
well.

Heap blocks would be zeroed if they were found to be damaged, following a crash.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unlogged tables, persistent kind

2011-04-24 Thread Greg Stark
On Sun, Apr 24, 2011 at 6:22 PM, Simon Riggs si...@2ndquadrant.com wrote:
 My implementation path for that would be to add a crash_number onto
 pg_control and pg_index. Any index marked as unlogged, persistent
 would only be usable if it's crash number is the same as current
 system crash number.

 REINDEX would update the index crash number to current value. That
 also allows us to imagine a repair index command in the future as
 well.

This seems useful for non-crash-safe indexes in general.

 Heap blocks would be zeroed if they were found to be damaged, following a 
 crash.


How do you propose to detect that? Until we solve the whole checksum
story I don't think we have a reliable way to detect bad pages. And in
some cases where do detect them we would detect them by crashing.

-- 
greg

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unlogged tables, persistent kind

2011-04-24 Thread Tom Lane
Simon Riggs si...@2ndquadrant.com writes:
 If you don't log changes to tables you have two choices if we crash
 1) truncate the table and any indexes
 2) rebuild any indexes damaged by the crash

No, you have only one choice, and that's (1), because there's no
guarantee that what's in the table file is meaningful.

 Heap blocks would be zeroed if they were found to be damaged, following a 
 crash.

This is sheerest fantasy.  And even if you could implement it, what sort
of feature would you be offering?  Your data is preserved except when
it isn't?  People who want that can go use mysql.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unlogged tables, persistent kind

2011-04-24 Thread Simon Riggs
On Sun, Apr 24, 2011 at 10:27 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Simon Riggs si...@2ndquadrant.com writes:
 If you don't log changes to tables you have two choices if we crash
 1) truncate the table and any indexes
 2) rebuild any indexes damaged by the crash

 No, you have only one choice, and that's (1), because there's no
 guarantee that what's in the table file is meaningful.

 Heap blocks would be zeroed if they were found to be damaged, following a 
 crash.

 This is sheerest fantasy.  And even if you could implement it, what sort
 of feature would you be offering?  Your data is preserved except when
 it isn't?  People who want that can go use mysql.

AFAIUI, a great many people do.

I am proposing a non-default mode, requiring explicit activation by
user which preserves as much data as possible. I am fully aware that
what is proposed is not an optimisation, but a downgrading of normal
resilience in exchange for some data loss in the event of a crash.

Yes, many other systems support this and people are becoming persuaded
that such risk/reward choices make sense for them.
I see no reason not to provide an option to do this, so people can
make informed choices.

For large sets of low value data, it makes sense. Deleting all data,
just simply because some of it might be damaged, is not the only
option. IMHO deleting all the data is a surprising option that will
cause many people to curse us. I don't see preserving some of the data
as being worse.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unlogged tables, persistent kind

2011-04-24 Thread Christopher Browne
On Sun, Apr 24, 2011 at 6:15 PM, Simon Riggs si...@2ndquadrant.com wrote:
 For large sets of low value data, it makes sense. Deleting all data,
 just simply because some of it might be damaged, is not the only
 option. IMHO deleting all the data is a surprising option that will
 cause many people to curse us. I don't see preserving some of the data
 as being worse.

For the cache table case, it is *certainly* reasonable to delete
everything upon discovering the risk that some might be damaged.

I have seen cases in production where the fix to 'oh, looks like
something's corrupted' is indeed truncate the cache, where that has
become a Standard Operating Procedure.

Sure, to have Postgres do this is a bit heavy-handed, but it's not as
if this isn't going to be heavily documented with warnings like:
Contains live plague bacteria.
Beware the Rabid Hippopotami.  May cause bleeding at the eyes.  If
your database crashes, this table WILL get truncated at startup time.
 If the docs are short on warnings, that should probably get
rectified.  (I'm allowed to be volunteered to do so :-).)

I'd actually find it unsurprising for such tables to get truncated
even on a clean restart; I'd favor that being an option, as along as
make it thus were generally unproblematic.  If the application using
such a table can't cope with that logic, better to induce an
understanding of that sooner rather than later ;-).

It seems like a losing battle to try terribly hard to keep the data
around when, by marking it unlogged, the data definition specifically
warned that this was risky.

I'd not go so far as to suggest having autovac TRUNCATE such tables at
random intervals, but that's a pathology that's not completely
incompatible with the DDL declaration :-)
-- 
When confronted by a difficult problem, solve it by reducing it to the
question, How would the Lone Ranger handle this?

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unlogged tables, persistent kind

2011-04-24 Thread Simon Riggs
On Sun, Apr 24, 2011 at 7:41 PM, Greg Stark gsst...@mit.edu wrote:
 On Sun, Apr 24, 2011 at 6:22 PM, Simon Riggs si...@2ndquadrant.com wrote:
 My implementation path for that would be to add a crash_number onto
 pg_control and pg_index. Any index marked as unlogged, persistent
 would only be usable if it's crash number is the same as current
 system crash number.

 REINDEX would update the index crash number to current value. That
 also allows us to imagine a repair index command in the future as
 well.

 This seems useful for non-crash-safe indexes in general.

 Heap blocks would be zeroed if they were found to be damaged, following a 
 crash.


 How do you propose to detect that? Until we solve the whole checksum
 story I don't think we have a reliable way to detect bad pages. And in
 some cases where do detect them we would detect them by crashing.


That should be changed.


-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unlogged tables, persistent kind

2011-04-24 Thread Greg Stark
On Sun, Apr 24, 2011 at 11:15 PM, Simon Riggs si...@2ndquadrant.com wrote:
 IMHO deleting all the data is a surprising option that will
 cause many people to curse us. I don't see preserving some of the data
 as being worse.

What possible damage to you want to recover from?

Without WAL logging after a software crash it's possible for update
chains to be broken, for multiple copies of the same tuple to be
visible, for some tuples to disappear but not others, etc.

And without checksums after a hardware crash it'll be possible for
pages to be torn resulting in tuple pointers that land in the middle
of nowhere or tuples that start off fine but are half overwritten with
unrelated garbage.

-- 
greg

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] UNLOGGED tables in psql \d

2011-02-22 Thread Cédric Villemain
2011/2/22 Itagaki Takahiro itagaki.takah...@gmail.com:
 psql \d(+) doesn't show any information about UNLOGGED and TEMP attributes
 for the table. So, we cannot know the table is unlogged or not unless
 we directly select from pg_class.relpersistence.  Is this a TODO item?

 The same issue is in TEMP tables, but we can determine them by their
 schema; they are always created in pg_temp_N schema.

I believe it is in the title of the table presented by \d (Table,
Unlogged table, Temp table)

-- 
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] UNLOGGED tables in psql \d

2011-02-22 Thread Itagaki Takahiro
On Tue, Feb 22, 2011 at 18:11, Cédric Villemain
cedric.villemain.deb...@gmail.com wrote:
 2011/2/22 Itagaki Takahiro itagaki.takah...@gmail.com:
 psql \d(+) doesn't show any information about UNLOGGED and TEMP attributes
 for the table. So, we cannot know the table is unlogged or not unless
 we directly select from pg_class.relpersistence.  Is this a TODO item?

 I believe it is in the title of the table presented by \d (Table,
 Unlogged table, Temp table)

Ah, I see. Thank you.  They are shown as Unlogged Table and
Unlogged Index.

- We don't have Temp for temp tables. Should we have?
- The head characters of the second words would be small letters.
  We describe other objects as Composite type, Foreign table, or so.

-- 
Itagaki Takahiro

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] UNLOGGED tables in psql \d

2011-02-22 Thread Cédric Villemain
2011/2/22 Itagaki Takahiro itagaki.takah...@gmail.com:
 On Tue, Feb 22, 2011 at 18:11, Cédric Villemain
 cedric.villemain.deb...@gmail.com wrote:
 2011/2/22 Itagaki Takahiro itagaki.takah...@gmail.com:
 psql \d(+) doesn't show any information about UNLOGGED and TEMP attributes
 for the table. So, we cannot know the table is unlogged or not unless
 we directly select from pg_class.relpersistence.  Is this a TODO item?

 I believe it is in the title of the table presented by \d (Table,
 Unlogged table, Temp table)

 Ah, I see. Thank you.  They are shown as Unlogged Table and
 Unlogged Index.

 - We don't have Temp for temp tables. Should we have?

Hum, First, I would say yes to be more consistent with the unlogged
case, but 2nd we must refrain from outputting too much information
where it is not relevant.

The fact that you didn''t saw it might be enough to reconsider the way
we display the unlogged state (and temp state) of a relation.

Maybe some a Durability: normal, temp, unlogged  line at bottom of
the \d output  ?

 - The head characters of the second words would be small letters.
  We describe other objects as Composite type, Foreign table, or so.

 --
 Itagaki Takahiro




-- 
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] UNLOGGED tables in psql \d

2011-02-22 Thread Robert Haas
On Tue, Feb 22, 2011 at 8:03 AM, Cédric Villemain
cedric.villemain.deb...@gmail.com wrote:
 The fact that you didn''t saw it might be enough to reconsider the way
 we display the unlogged state (and temp state) of a relation.

 Maybe some a Durability: normal, temp, unlogged  line at bottom of
 the \d output  ?

The term we use internally is persistence, but I'm not really sure
it's worth making the \d output longer.  In fact I'd much rather find
some way of going the other direction.  Note also that the temp-ness
of a table can already be inferred from the schema.

Another thought is that we might eventually have global temporary
tables, where the schema is globally visible (like a permanent or
unlogged table) but each backend sees only its own contents.  So
whatever we come up with here should generalize to that case as well.

Still another point is that temp can apply to any object type, but
unlogged can only apply to tables, indexes, and sequences.  (We
don't currently implement it for sequences, though.)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] UNLOGGED tables in psql \d

2011-02-21 Thread Itagaki Takahiro
psql \d(+) doesn't show any information about UNLOGGED and TEMP attributes
for the table. So, we cannot know the table is unlogged or not unless
we directly select from pg_class.relpersistence.  Is this a TODO item?

The same issue is in TEMP tables, but we can determine them by their
schema; they are always created in pg_temp_N schema.

-- 
Itagaki Takahiro

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] unlogged tables v5

2010-12-27 Thread Simon Riggs
On Fri, 2010-12-24 at 23:36 -0500, Robert Haas wrote:
 Here's an updated patch for unlogged tables, incorporating the
 following changes since v4:

Looks good

 5. Support unlogged GIN indexes.

Not sure from reading the docs whether unlogged indexes are supported on
logged tables? Could you clarify (or clarify more often)? Does this
solve the hash index situation?

The contents
+  of an unlogged table are also not replicated to standby servers.
would prefer to remove also

Unlogged tables aren't replicated, but they would be copied as part of a
base backup. I'd request that we have a different forkname, or some
other indicator that allows a base backup to allow exclusion of such
files, since they are going to get reset to zero very soon afterwards
anyway. Not everyone would wish it, but its a good option.

 There are a couple of open issues which I'm thinking can be left for
 future work.
 
 A. Minimization of fsync traffic.  fsyncs on unlogged relations can
 potentially be postponed until shutdown time.  Right now, they'll
 happen as part of the next checkpoint.

Half the fun of unlogged tables was for me the ability to skip the fsync
and the checkpoint writes. If we're using PostgreSQL as a cache, it will
be a little hard to explain why it still does writes in a huge storm
every so often. A performance feature that doesn't avoid a performance
hit seems a little strange, to say the least.

Should be easy enough to mark a flag on each buffer, only examined at
checkpoint.

 B. Unlogged GIST indexes aren't supported.

No problem. Understand specific difficulties.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/books/
 PostgreSQL Development, 24x7 Support, Training and Services
 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] unlogged tables v5

2010-12-27 Thread Robert Haas
On Mon, Dec 27, 2010 at 3:22 AM, Simon Riggs si...@2ndquadrant.com wrote:
 Not sure from reading the docs whether unlogged indexes are supported on
 logged tables? Could you clarify (or clarify more often)? Does this
 solve the hash index situation?

They are not.  The only place you'll see that the grammar has been
modified is in the CREATE TABLE documentation.  The table is either
unlogged or not, and if it is unlogged, then its indices are also
unlogged.  As I mentioned on one of the previous threads, I think
unlogged indexes on logged tables would be a useful thing to support,
but it turns out to require somewhat different infrastructure from
what I've built here, and I don't have any imminent plans to work on
it.

 Unlogged tables aren't replicated, but they would be copied as part of a
 base backup. I'd request that we have a different forkname, or some
 other indicator that allows a base backup to allow exclusion of such
 files, since they are going to get reset to zero very soon afterwards
 anyway. Not everyone would wish it, but its a good option.

You can identify them by looking for files that have an _init fork,
and then skipping the corresponding main-fork files.  I initially had
the idea of actually giving the main fork a distinguishing name, but
this turns out to have two serious defects: (1) it makes the patch a
lot more invasive, to the point where if I'd had to keep going that
route I likely would have given up on the feature altogether, and (2)
you still need the init fork anyway.  I agree this is a bit of a drag
to work with for base backups but I don't think there's a reasonable
way to do better.

 There are a couple of open issues which I'm thinking can be left for
 future work.

 A. Minimization of fsync traffic.  fsyncs on unlogged relations can
 potentially be postponed until shutdown time.  Right now, they'll
 happen as part of the next checkpoint.

 Half the fun of unlogged tables was for me the ability to skip the fsync
 and the checkpoint writes. If we're using PostgreSQL as a cache, it will
 be a little hard to explain why it still does writes in a huge storm
 every so often. A performance feature that doesn't avoid a performance
 hit seems a little strange, to say the least.

 Should be easy enough to mark a flag on each buffer, only examined at
 checkpoint.

Well, there is such a flag, but it just skips the checkpoint write.
If the cleaning scan or a backend does a write, an fsync is still
scheduled (and gets performed at the next checkpoint).  The problem
here is that there was a groundswell of support for making unlogged
tables survive a clean shutdown, and you've got to eventually perform
these fsyncs in order for that to work.  Now you could postpone them
until the shutdown checkpoint, but that's only going to help in a
limited set of cases.  Specifically:

1. If the unlogged tables fit entirely within shared_buffers, it won't help.
2. If the operating system writes back the dirty buffers before the
next checkpoint, it won't help.
3. If you're running ext3, then the first fsync on anything is going
to flush the entire buffer cache for that FS out to disk anyway, so
unless pg_xlog is on a separate partition AND you have absolutely no
writes to any normal tables since the last checkpoint, it won't help.

We could probably get a test system set up where we run XFS on Linux
with an unlogged table larger than shared buffers and demonstrate
that, in fact, there is some preventable suckiness there.

But even then, I'm not convinced I should try to fix that as part of
this patch.  First, of course, there is a huge performance gain from
using unlogged tables even without this optimization.  I have tested
it; it's big.  Second, we already know that the background writer's
algorithm for scheduling fsyncs is not optimal.  It may be that if we
end up rejiggering that algorithm, something will fall out to handle
this case, too.  I'd rather not contort the logic too much to handle
this case until we have a better idea of how we want it to work in
general.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] unlogged tables

2010-12-20 Thread Alvaro Herrera
Excerpts from Robert Haas's message of sáb dic 18 02:21:41 -0300 2010:
 Here's an attempt to summarize the remaining issues with this patch
 that I know about.  I may have forgotten something, so please mention
 it if you notice something missing.
 
 1. pg_dump needs an option to control whether unlogged tables are
 dumped.  --no-unlogged-tables seems like the obvious choice, assuming
 we want the default to be to dump them, which seems like the safest
 option.

If there are valid use cases for some unlogged tables being dumped and
some others not, would it make sense to be able to specify a pattern of
tables to be dumped or skipped?

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] unlogged tables

2010-12-20 Thread Robert Haas
On Mon, Dec 20, 2010 at 9:05 AM, Alvaro Herrera
alvhe...@commandprompt.com wrote:
 Excerpts from Robert Haas's message of sáb dic 18 02:21:41 -0300 2010:
 Here's an attempt to summarize the remaining issues with this patch
 that I know about.  I may have forgotten something, so please mention
 it if you notice something missing.

 1. pg_dump needs an option to control whether unlogged tables are
 dumped.  --no-unlogged-tables seems like the obvious choice, assuming
 we want the default to be to dump them, which seems like the safest
 option.

 If there are valid use cases for some unlogged tables being dumped and
 some others not, would it make sense to be able to specify a pattern of
 tables to be dumped or skipped?

Well, if you want to dump a subset of the tables in your database, you
can already do that.  I think that adding a pattern to
--no-unlogged-tables (or whatever we end up calling it) would be an
unnecessary frammish.  There's no particular reason to think that
unlogged tables are going to be so widely used or that concerns about
which ones are going to be so widespread that we should do something
here when we don't even have much simpler things like --function,
which IMHO would extremely useful.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] unlogged tables

2010-12-20 Thread Tom Lane
Alvaro Herrera alvhe...@commandprompt.com writes:
 Excerpts from Robert Haas's message of sáb dic 18 02:21:41 -0300 2010:
 1. pg_dump needs an option to control whether unlogged tables are
 dumped.  --no-unlogged-tables seems like the obvious choice, assuming
 we want the default to be to dump them, which seems like the safest
 option.

 If there are valid use cases for some unlogged tables being dumped and
 some others not, would it make sense to be able to specify a pattern of
 tables to be dumped or skipped?

Presumably you could still do that with the regular --tables name
pattern switch.  I don't see a reason for unlogged tables to respond to
a different name pattern.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] unlogged tables

2010-12-18 Thread Kevin Grittner
Robert Haas  wrote:
 
 If there's any third-party code out there that is checking
 rd_istemp, it likely also needs to be revised to check whether
 WAL-logging is needed, not whether the relation is temp. The way
 I've coded it, such code will fail to compile, and can be very
 easily fixed by substituting a call to RelationNeedsWAL() or
 RelationUsesLocalBuffers() or RelationUsesTempNamespace(),
 depending on which property the caller actually cares about.
 
Hmm...  This broke the SSI patch, which was using rd_istemp to omit
conflict checking where it was set to true.  The property I care
about is whether tuples in one backend can be read by an transaction
in a different backend, which I assumed would not be true for
temporary tables.  Which of the above would be appropriate for that
use?
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] unlogged tables

2010-12-18 Thread Robert Haas
On Sat, Dec 18, 2010 at 12:27 PM, Kevin Grittner
kevin.gritt...@wicourts.gov wrote:
 Robert Haas  wrote:

 If there's any third-party code out there that is checking
 rd_istemp, it likely also needs to be revised to check whether
 WAL-logging is needed, not whether the relation is temp. The way
 I've coded it, such code will fail to compile, and can be very
 easily fixed by substituting a call to RelationNeedsWAL() or
 RelationUsesLocalBuffers() or RelationUsesTempNamespace(),
 depending on which property the caller actually cares about.

 Hmm...  This broke the SSI patch, which was using rd_istemp to omit
 conflict checking where it was set to true.  The property I care
 about is whether tuples in one backend can be read by an transaction
 in a different backend, which I assumed would not be true for
 temporary tables.  Which of the above would be appropriate for that
 use?

RelationUsesLocalBuffers().

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] unlogged tables vs. GIST

2010-12-17 Thread Robert Haas
On Tue, Dec 14, 2010 at 5:14 PM, Robert Haas robertmh...@gmail.com wrote:
 On Tue, Dec 14, 2010 at 4:55 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 On Tue, Dec 14, 2010 at 4:24 PM, Heikki Linnakangas
 heikki.linnakan...@enterprisedb.com wrote:
 Hmm, the first idea that comes to mind is to use a counter like the
 GetXLogRecPtrForTemp() counter I used for temp tables, but global, in 
 shared
 memory. However, that's a bit problematic because if we store a value from
 that counter to LSN, it's possible that the counter overtakes the XLOG
 insert location, and you start to get xlog flush errors. We could avoid 
 that
 if we added a new field to the GiST page header, and used that to store the
 value in the parent page instead of the LSN.

 That doesn't seem ideal, either, because now you're eating up some
 number of bytes per page in every GIST index just on the off chance
 that one of them is unlogged.

 On-disk compatibility seems problematic here as well.

 Good point.

Given the foregoing discussion, I see only two possible paths forward here.

1. Just decide that that unlogged tables can't have GIST indexes, at
least until someone figures out a way to make it work.  That's sort of
an annoying limitation, but I think we could live with it.

2. Write WAL records even though the GIST index is supposedly
unlogged.  We could either (1) write the usual XLOG_GIST_* records,
and arrange to ignore them on replay when the relation in question is
unlogged, or (2) write an XLOG_NOOP record to advance the current LSN.
 The latter sounds safer to me, but it will mean that the XLOG code
for GIST needs three separate cases (temp, perm, unlogged).  Either
way we give up a significant chunk of the benefit of making the
relation unlogged in the first place.

Thoughts?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] unlogged tables vs. GIST

2010-12-17 Thread Andy Colson


Given the foregoing discussion, I see only two possible paths forward here.

1. Just decide that that unlogged tables can't have GIST indexes, at
least until someone figures out a way to make it work.  That's sort of
an annoying limitation, but I think we could live with it.



+1

In the small subset of situations that need unlogged tables, I would 
think the subset of those that need gist is exceedingly small.


Unless someone can come up with a use case that needs both unlogged and 
gist, I'd vote not to spend time on it.  (And, if ever someone does come 
along with a really good use, then time can be put toward it).


-Andy

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] unlogged tables vs. GIST

2010-12-17 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 Given the foregoing discussion, I see only two possible paths forward here.

 1. Just decide that that unlogged tables can't have GIST indexes, at
 least until someone figures out a way to make it work.  That's sort of
 an annoying limitation, but I think we could live with it.

 2. Write WAL records even though the GIST index is supposedly
 unlogged.  We could either (1) write the usual XLOG_GIST_* records,
 and arrange to ignore them on replay when the relation in question is
 unlogged, or (2) write an XLOG_NOOP record to advance the current LSN.
  The latter sounds safer to me, but it will mean that the XLOG code
 for GIST needs three separate cases (temp, perm, unlogged).  Either
 way we give up a significant chunk of the benefit of making the
 relation unlogged in the first place.

 Thoughts?

IIUC, the problem is that the bufmgr might think that a GIST NSN is an
LSN that should affect when to force out a dirty buffer?  What if we
taught it the difference?  We could for example dedicate a pd_flags
bit to marking pages whose pd_lsn isn't actually an LSN.

This solution would probably imply that all pages in the shared buffer
pool have to have a standard PageHeaderData header, not just an LSN at
the front as is assumed now.  But that doesn't seem like a bad thing to
me, unless maybe we were dumb enough to not use a standard page header
in some of the secondary forks.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] unlogged tables vs. GIST

2010-12-17 Thread Heikki Linnakangas

On 17.12.2010 21:07, Tom Lane wrote:

IIUC, the problem is that the bufmgr might think that a GIST NSN is an
LSN that should affect when to force out a dirty buffer?  What if we
taught it the difference?  We could for example dedicate a pd_flags
bit to marking pages whose pd_lsn isn't actually an LSN.

This solution would probably imply that all pages in the shared buffer
pool have to have a standard PageHeaderData header, not just an LSN at
the front as is assumed now.  But that doesn't seem like a bad thing to
me, unless maybe we were dumb enough to not use a standard page header
in some of the secondary forks.


I'm not very fond of expanding buffer manager's knowledge of the page 
layout. How about a new flag in the buffer desc, BM_UNLOGGED? There was 
some talk about skipping flushing of unlogged tables at checkpoints, I 
think we'd need BM_UNLOGGED for that anyway. Or I guess we could hang 
that behavior on the pd_flags bit too, but it doesn't seem like the 
right place for that information.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] unlogged tables vs. GIST

2010-12-17 Thread Tom Lane
Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
 On 17.12.2010 21:07, Tom Lane wrote:
 IIUC, the problem is that the bufmgr might think that a GIST NSN is an
 LSN that should affect when to force out a dirty buffer?  What if we
 taught it the difference?  We could for example dedicate a pd_flags
 bit to marking pages whose pd_lsn isn't actually an LSN.

 I'm not very fond of expanding buffer manager's knowledge of the page 
 layout. How about a new flag in the buffer desc, BM_UNLOGGED?

That could work too, if you can explain how the flag comes to be set
without a bunch of ugliness all over the system.  I don't want callers
of ReadBuffer to have to supply the bit for example.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] unlogged tables vs. GIST

2010-12-17 Thread Robert Haas
On Fri, Dec 17, 2010 at 2:07 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 IIUC, the problem is that the bufmgr might think that a GIST NSN is an
 LSN that should affect when to force out a dirty buffer?  What if we
 taught it the difference?  We could for example dedicate a pd_flags
 bit to marking pages whose pd_lsn isn't actually an LSN.

 This solution would probably imply that all pages in the shared buffer
 pool have to have a standard PageHeaderData header, not just an LSN at
 the front as is assumed now.  But that doesn't seem like a bad thing to
 me, unless maybe we were dumb enough to not use a standard page header
 in some of the secondary forks.

Ah, interesting idea.  visibilitymap.c has this:

#define MAPSIZE (BLCKSZ - MAXALIGN(SizeOfPageHeaderData))

and fsm_internals.h has this:

#define NodesPerPage (BLCKSZ - MAXALIGN(SizeOfPageHeaderData) - \
  offsetof(FSMPageData, fp_nodes))

...which seems to imply, at least on a quick look, that we might be
OK, as far as the extra forks go.

I have a vague recollection that the documentation says somewhere that
a full page header on every page isn't required.  And I'm a little
worried about my ability to track down every place in the system that
might get broken by a change in this area.  What do you think about
committing the unlogged tables patch initially without support for
GIST indexes, and working on fixing this as a separate commit?

Another possibly-useful thing about mandating a full page header for
every page is that it might give us a way of avoiding unnecessary full
page writes.  As I wrote previously:

 However, I think we can avoid this too, by
 allocating an additional bit in pd_flags, PD_FPI.  Instead of emitting
 an FPI when the old LSN precedes the redo pointer, we'll emit an FPI
 when the FPI bit is set (in which case we'll also clear the bit) OR
 when the old LSN precedes the redo pointer.  Upon emitting a WAL
 record that is torn-page safe (such as a freeze or all-visible
 record), we'll pass a flag to XLogInsert that arranges to suppress
 FPIs, bump the LSN, and set PD_FPI.  That way, if the page is touched
 again before the next checkpoint by an operation that does NOT
 suppress FPI, one will be emitted then.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] unlogged tables vs. GIST

2010-12-17 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 Another possibly-useful thing about mandating a full page header for
 every page is that it might give us a way of avoiding unnecessary full
 page writes.  As I wrote previously:

Could we do that via a bufmgr status bit, instead?  Heikki's idea has
the merit that it actually reduces bufmgr's knowledge of page headers,
rather than increasing it (since a buffer marked UNLOGGED would need
no assumptions at all about its content).

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] unlogged tables vs. GIST

2010-12-17 Thread Robert Haas
On Fri, Dec 17, 2010 at 2:22 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
 On 17.12.2010 21:07, Tom Lane wrote:
 IIUC, the problem is that the bufmgr might think that a GIST NSN is an
 LSN that should affect when to force out a dirty buffer?  What if we
 taught it the difference?  We could for example dedicate a pd_flags
 bit to marking pages whose pd_lsn isn't actually an LSN.

 I'm not very fond of expanding buffer manager's knowledge of the page
 layout. How about a new flag in the buffer desc, BM_UNLOGGED?

 That could work too, if you can explain how the flag comes to be set
 without a bunch of ugliness all over the system.  I don't want callers
 of ReadBuffer to have to supply the bit for example.

That's easy enough to solve - ReadBuffer() takes a Relation as an
argument, so you can easily check RelationNeedsWAL(reln).

I guess the question is whether it's right to conflate table is
unlogged with LSN is fake.  It's not immediately obvious to me that
those concepts are isomorphic, although though the reverse isn't
obvious to me either.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] unlogged tables vs. GIST

2010-12-17 Thread Robert Haas
On Fri, Dec 17, 2010 at 2:31 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 Another possibly-useful thing about mandating a full page header for
 every page is that it might give us a way of avoiding unnecessary full
 page writes.  As I wrote previously:

 Could we do that via a bufmgr status bit, instead?  Heikki's idea has
 the merit that it actually reduces bufmgr's knowledge of page headers,
 rather than increasing it (since a buffer marked UNLOGGED would need
 no assumptions at all about its content).

That was my first thought, but it doesn't work.  The buffer could be
evicted from shared_buffers and read back in.  If a checkpoint
intervenes meanwhile, we're OK, but otherwise you fail to emit an
otherwise-needed FPI.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] unlogged tables vs. GIST

2010-12-17 Thread Heikki Linnakangas

On 17.12.2010 21:32, Robert Haas wrote:

I guess the question is whether it's right to conflate table is
unlogged with LSN is fake.  It's not immediately obvious to me that
those concepts are isomorphic, although though the reverse isn't
obvious to me either.


The buffer manager only needs to know if it has to flush the WAL before 
writing the page to disk. The flag just means that the buffer manager 
never needs to do that for this buffer. You're still free to store a 
real LSN there if you want to, it just won't cause any WAL flushes.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] unlogged tables vs. GIST

2010-12-17 Thread Tom Lane
Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
 On 17.12.2010 21:32, Robert Haas wrote:
 I guess the question is whether it's right to conflate table is
 unlogged with LSN is fake.  It's not immediately obvious to me that
 those concepts are isomorphic, although though the reverse isn't
 obvious to me either.

 The buffer manager only needs to know if it has to flush the WAL before 
 writing the page to disk. The flag just means that the buffer manager 
 never needs to do that for this buffer. You're still free to store a 
 real LSN there if you want to, it just won't cause any WAL flushes.

Yeah.  I think that BM_UNLOGGED might be a poor choice for the flag name,
just because it overstates what the bufmgr needs to assume.  It might be
better to reverse the flag sense, and have a new flag that *is* set if
the page contains an LSN that we have to check against WAL.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] unlogged tables vs. GIST

2010-12-17 Thread Robert Haas
On Fri, Dec 17, 2010 at 3:03 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
 On 17.12.2010 21:32, Robert Haas wrote:
 I guess the question is whether it's right to conflate table is
 unlogged with LSN is fake.  It's not immediately obvious to me that
 those concepts are isomorphic, although though the reverse isn't
 obvious to me either.

 The buffer manager only needs to know if it has to flush the WAL before
 writing the page to disk. The flag just means that the buffer manager
 never needs to do that for this buffer. You're still free to store a
 real LSN there if you want to, it just won't cause any WAL flushes.

 Yeah.  I think that BM_UNLOGGED might be a poor choice for the flag name,
 just because it overstates what the bufmgr needs to assume.  It might be
 better to reverse the flag sense, and have a new flag that *is* set if
 the page contains an LSN that we have to check against WAL.

I was actually thinking of adding BM_UNLOGGED even before this
discussion, because that would allow unlogged buffers to be excluded
from non-shutdown checkpoints.  We could add two flags with different
semantics that take on, under present rules, the same value, but I'd
be disinclined to burn the extra bit without a concrete need.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] unlogged tables vs. GIST

2010-12-17 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Fri, Dec 17, 2010 at 3:03 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Yeah.  I think that BM_UNLOGGED might be a poor choice for the flag name,
 just because it overstates what the bufmgr needs to assume.

 I was actually thinking of adding BM_UNLOGGED even before this
 discussion, because that would allow unlogged buffers to be excluded
 from non-shutdown checkpoints.  We could add two flags with different
 semantics that take on, under present rules, the same value, but I'd
 be disinclined to burn the extra bit without a concrete need.

bufmgr is currently using eight bits out of a 16-bit flag field, and
IIRC at least five of those have been there since the beginning.  So our
accretion rate is something like one bit every four years.  I think not
being willing to use two bits to describe two unrelated behaviors is
penny-wise and pound-foolish --- bufmgr is already complicated enough,
let's not add useless barriers to readability.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] unlogged tables vs. GIST

2010-12-17 Thread Robert Haas
On Fri, Dec 17, 2010 at 3:15 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 On Fri, Dec 17, 2010 at 3:03 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Yeah.  I think that BM_UNLOGGED might be a poor choice for the flag name,
 just because it overstates what the bufmgr needs to assume.

 I was actually thinking of adding BM_UNLOGGED even before this
 discussion, because that would allow unlogged buffers to be excluded
 from non-shutdown checkpoints.  We could add two flags with different
 semantics that take on, under present rules, the same value, but I'd
 be disinclined to burn the extra bit without a concrete need.

 bufmgr is currently using eight bits out of a 16-bit flag field, and
 IIRC at least five of those have been there since the beginning.  So our
 accretion rate is something like one bit every four years.  I think not
 being willing to use two bits to describe two unrelated behaviors is
 penny-wise and pound-foolish --- bufmgr is already complicated enough,
 let's not add useless barriers to readability.

Allright, what do you want to call the other bit, then?  BM_SKIP_XLOG_FLUSH?

I have a feeling we may also be creating BM_UNTIDY rather soon, per
previous discussion of hint bit I/O.

Since these bits will only be set/cleared when the buffer mapping is
changed, can we examine this bit without taking the spinlock?  If not,
we're going to have to stick an extra spinlock acquire/release into
FlushBuffer(), which sounds rather unappealing.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] unlogged tables vs. GIST

2010-12-17 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 Allright, what do you want to call the other bit, then?  BM_SKIP_XLOG_FLUSH?

Like I said, I'd be tempted to invert the sense, so that the flag is set
for normal relations.  Then it becomes something like BM_FLUSH_XLOG.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] unlogged tables vs. GIST

2010-12-17 Thread Tom Lane
[ hit send too soon ... ]

Robert Haas robertmh...@gmail.com writes:
 Since these bits will only be set/cleared when the buffer mapping is
 changed, can we examine this bit without taking the spinlock?

Only if you're willing for the result to be unreliable.  In the
case of the xlog flush bit, I don't believe an extra locking cycle
should be necessary anyway, as you surely had the lock when you 
found the page to be dirty in the first place.  You could grab the
bit then.  I'm not sure where you envision checking the other bit,
but I doubt it can be all that far removed from a lock acquisition.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] unlogged tables vs. GIST

2010-12-17 Thread Robert Haas
On Fri, Dec 17, 2010 at 4:17 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Since these bits will only be set/cleared when the buffer mapping is
 changed, can we examine this bit without taking the spinlock?

 Only if you're willing for the result to be unreliable.

If we read the bits while someone else is writing them, we'll get
either the old or the new value, but the BM_FLUSH_XLOG bit will be the
same either way.  When FlushBuffer() is called, a shared content log
and a pin are held, so the buffer can't be renamed under us.  If
that's really unacceptable, we can do something like the attached, but
I think this is unnecessarily gross.  We already assume in other
places that the read or write of an integer is atomic.  In this case
we don't even need it to be fully atomic - we just need the particular
byte that contains this bit not to go through some intermediate state
where the bit is unset.  Is there really a memory architecture out
there that's crazy enough to let such a thing happen?  If so, I'd
expect things to be breaking right and left on that machine anyway.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


bm-flush-xlog-paranoid.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] unlogged tables

2010-12-17 Thread Robert Haas
Here's an attempt to summarize the remaining issues with this patch
that I know about.  I may have forgotten something, so please mention
it if you notice something missing.

1. pg_dump needs an option to control whether unlogged tables are
dumped.  --no-unlogged-tables seems like the obvious choice, assuming
we want the default to be to dump them, which seems like the safest
option.

2. storage.sgml likely needs to be updated.  We have a section on the
free space map and one on the visibility map, so I suppose the logical
thing to do is add a similar section on the initialization fork.

3. It's unnecessary to include unlogged relation buffers in
non-shutdown checkpoints.  I've recently realized that this is true
independently of whether or not we want unlogged tables to survive a
clean shutdown.  Whether or not we can survive a clean shutdown is a
function of whether we register dirty segments when buffers are
written, which is independent of whether we choose to write such
buffers as part of a checkpoint.  And indeed, unless we're about to
shut down, there's no reason to do so, because the whole point of
checkpointing is to advance the redo pointer, and that's irrelevant
for unlogged tables.

4. It's arguably unnecessary to register dirty segments for unlogged
relations.  Given #3, this now seems a little less important.  If the
unlogged relation is hot and fits in shared_buffers, then omitting it
from the checkpoint process means we'll never write out those dirty
buffers, so the fact that they'd cause fsyncs if we did write them
doesn't matter.  However, it's still not totally irrelevant, because a
relation that fits in the OS buffer cache but not in shared buffers
will probably generate fsyncs at every checkpoint.  (And on the third
hand, the OS may decide to write the dirty data anyway, especially if
it's a largish percentage of RAM.)  There are a couple of possible
ways of dealing with this:

4A. The solution Andres proposed - Iterate through all unlogged
relations at shutdown time and fsyncing them all.  Possibly
complicated to handle fsync failures.
4B. Another idea I just thought of - register dirty segments as
normal, but teach the background writer to accumulate them in a
separate queue that is only flushed at shutdown, or when it reaches
some maximum size, rather than at every checkpoint.
4C. Decree that this is an area for future enhancement and forget
about it for now.  I am leaning toward this option.

5. Make it work with GIST indexes.  Per discussion on the other
thread, the current proposal seems to be: (a) add a BM_FLUSH_XLOG bit;
when clear, don't flush XLOG; this then allows pages to have fake
LSNs; (b) add an XLogRecPtr structure in shared memory, protected by a
spinlock; (c) use the structure described in (b) to generate fake LSNs
every time an operation is performed on an unlogged GIST index.  I am
not clear on how we make this work across shutdowns - it seems you'd
need to save this structure somewhere during a clean shutdown (where?)
and restore it on startup, unless we go back to truncating even on a
clean shutdown.

6. Make it work with GIN indexes.  I haven't looked at what's involved here yet.

Advice, comments, feedback appreciated...  I'd like to put this one to bed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


  1   2   3   >