Re: [HACKERS] Error with index on unlogged table

2015-12-12 Thread Michael Paquier
On Fri, Dec 11, 2015 at 4:54 PM, Michael Paquier
 wrote:
> On Fri, Dec 11, 2015 at 4:27 AM, Andres Freund  wrote:
>> On 2015-12-10 18:36:32 +0100, Andres Freund wrote:
>>> On 2015-12-10 12:19:12 -0500, Robert Haas wrote:
>>> > > The real problem there imo isn't that the copy_relation_data() doesn't
>>> > > deal with 0 block tables, but that ATExecSetTableSpace() doesn't have a
>>> > > unlogged table specific codepath like heap_create_with_catalog() has.
>>> >
>>> > It looks to me like somewhere we need to do log_smgrcreate(...,
>>> > INIT_FORKNUM) in the unlogged table case.
>>>
>>> Yes.
>>>
>>> > RelationCreateStorage()
>>> > skips this for the main forknum of an unlogged table, which seems OK,
>>> > but there's nothing that even attempts it for the init fork, which
>>> > does not seem OK.
>>>
>>> We unfortunately can't trivially delegate that work to
>>> RelationCreateStorage(). E.g. heap_create() documents that only the main
>>> fork is created :(
>>>
>>> > I guess that logic should happen in
>>> > ATExecSetTableSpace just after smgrcreate(dstrel, forkNum, false).
>>>
>>> Looks like it's the easiest place.
>>
>>> > > A second problem is that the smgrimmedsync() in copy_relation_data()
>>> > > isn't called for the init fork of unlogged relations, even if it needs
>>> > > to.
>>
>> Here's a patch doing that. It's not yet fully polished, but I wanted to
>> get it out, because I noticed one thing:
>>
>> In ATExecSetTableSpace(), for !main forks, we currently call
>> smgrcreate(), but not log_smgrcreate(). Even for PERSISTENT
>> relations. That seems a bit odd to me. It currently seems to be without
>> further consequence because, if there's actual data in the fork, we'll
>> just create the relation in _mdfd_getseg(); or we can cope with the
>> relation not being there.  But to me that feels wrong.
>>
>> It seems better to do the log_smgrcreate() for RELPERSISTENCE_PERMANENT,
>> not just INIT_FORKNUM. What do you guys think?
>
> This fixes the problem in my environment.
>
> +   if (rel->rd_rel->relpersistence ==
> RELPERSISTENCE_PERMANENT ||
> +   (rel->rd_rel->relpersistence ==
> RELPERSISTENCE_UNLOGGED &&
> +forkNum == INIT_FORKNUM))
> +   log_smgrcreate(, forkNum);
> There should be a XLogIsNeeded() check as well. Removing the check on
> RELPERSISTENCE_UNLOGGED is fine as well... Not mandatory though :)
>
> +* The init fork for an unlogged relation in many respects has to be
> +* treated the same as normal relation, changes need to be WAL
> logged and
> +* it needs to be synced to disk.
> +*/
> +   copying_initfork = relpersistence == RELPERSISTENCE_UNLOGGED &&
> +   forkNum == INIT_FORKNUM;
> Here as well just a check on INIT_FORKNUM would be fine.

Should we consider this bug a 9.5 blocker? I feel uneasy with the fact
of releasing a new major version knowing that we know some bugs on it,
and this one is not cool so I have added it in the list of open items.
We know the problem and there is a patch, so this is definitely
solvable.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Error with index on unlogged table

2015-12-12 Thread Andres Freund
On 2015-12-11 16:54:45 +0900, Michael Paquier wrote:
> +   if (rel->rd_rel->relpersistence ==
> RELPERSISTENCE_PERMANENT ||
> +   (rel->rd_rel->relpersistence ==
> RELPERSISTENCE_UNLOGGED &&
> +forkNum == INIT_FORKNUM))
> +   log_smgrcreate(, forkNum);
> There should be a XLogIsNeeded() check as well.

RelationCreateStorage(), which creates the main fork, has no such
check. Seems better to stay symmetric, even if it's not strictly
necessary.

> Removing the check on RELPERSISTENCE_UNLOGGED is fine as well... Not
> mandatory though :)

I've gone back and forth on this, I can't really convince myself either
way. We might end up reusing init forks for 'global temporary tables' or
somesuch, so being a bit stricter seems slight better. Anyway, it's of
no actual consequence for now.

Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Error with index on unlogged table

2015-12-12 Thread Andres Freund
On 2015-12-12 20:49:52 +0900, Michael Paquier wrote:
> Should we consider this bug a 9.5 blocker?

I don't think so. But either way, I'm right now working on getting it
fixed anyway.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Error with index on unlogged table

2015-12-12 Thread Andres Freund
On 2015-12-12 12:52:21 +0100, Andres Freund wrote:
> On 2015-12-12 20:49:52 +0900, Michael Paquier wrote:
> > Should we consider this bug a 9.5 blocker?
> 
> I don't think so. But either way, I'm right now working on getting it
> fixed anyway.

And done. Took a bit longer than predicted - I had to adjust my test
scripts to cope with < 9.4 not having tablespace mapping. Rather
annoying.


It'd be really convenient if tablespaces relative to the main data
directory were supported...


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Error with index on unlogged table

2015-12-12 Thread Tom Lane
Michael Paquier  writes:
> Should we consider this bug a 9.5 blocker? I feel uneasy with the fact
> of releasing a new major version knowing that we know some bugs on it,
> and this one is not cool so I have added it in the list of open items.
> We know the problem and there is a patch, so this is definitely
> solvable.

FWIW, general project policy has been that pre-existing bugs are not
release blockers as such.  However, if we have a fix in hand that we
would like to get some field testing on, it's certainly desirable to
push it out first in a beta or RC release, rather than having it first
hit the field in stable-branch updates.  So we might delay a beta/RC
to make sure such a fix is available for testing.  But that's not quite
the same thing as a release blocker.  To my mind, a release blocker
is something that would make it undesirable for users to upgrade to
the new release compared to the previous branch, so bugs the branches
have in common are never that.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Error with index on unlogged table

2015-12-12 Thread Michael Paquier
On Sat, Dec 12, 2015 at 10:31 PM, Andres Freund  wrote:
> On 2015-12-12 12:52:21 +0100, Andres Freund wrote:
>> On 2015-12-12 20:49:52 +0900, Michael Paquier wrote:
>> > Should we consider this bug a 9.5 blocker?
>>
>> I don't think so. But either way, I'm right now working on getting it
>> fixed anyway.
>
> And done. Took a bit longer than predicted - I had to adjust my test
> scripts to cope with < 9.4 not having tablespace mapping. Rather
> annoying.

Thanks! Cool to see this thread finally addressed.

> It'd be really convenient if tablespaces relative to the main data
> directory were supported...

I got bitten at some point as well by the fact that tablespaces
created after a base backup is taken have their location map to the
same point for a master and a standby :)
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Error with index on unlogged table

2015-12-11 Thread Andres Freund
On 2015-12-11 15:43:24 +0900, Kyotaro HORIGUCHI wrote:
> What it is doing seems to me reasonable but copying_initfork
> doesn't seems to be necessary. Kicking both of log_newpage() and
> smgrimmedsync() by use_wal, which has the value considering
> INIT_FORKNUM would be more descriptive. (more readable, in other
> words)

The smgrimmedsync() has a different condition, it doesn't, and may not,
check for XLogIsNeeded().

Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Error with index on unlogged table

2015-12-10 Thread Andres Freund
Hi,

On 2015-12-10 18:36:32 +0100, Andres Freund wrote:
> On 2015-12-10 12:19:12 -0500, Robert Haas wrote:
> > > The real problem there imo isn't that the copy_relation_data() doesn't
> > > deal with 0 block tables, but that ATExecSetTableSpace() doesn't have a
> > > unlogged table specific codepath like heap_create_with_catalog() has.
> > 
> > It looks to me like somewhere we need to do log_smgrcreate(...,
> > INIT_FORKNUM) in the unlogged table case.
> 
> Yes.
> 
> > RelationCreateStorage()
> > skips this for the main forknum of an unlogged table, which seems OK,
> > but there's nothing that even attempts it for the init fork, which
> > does not seem OK.
> 
> We unfortunately can't trivially delegate that work to
> RelationCreateStorage(). E.g. heap_create() documents that only the main
> fork is created :(
> 
> > I guess that logic should happen in
> > ATExecSetTableSpace just after smgrcreate(dstrel, forkNum, false).
> 
> Looks like it's the easiest place.

> > > A second problem is that the smgrimmedsync() in copy_relation_data()
> > > isn't called for the init fork of unlogged relations, even if it needs
> > > to.

Here's a patch doing that. It's not yet fully polished, but I wanted to
get it out, because I noticed one thing:

In ATExecSetTableSpace(), for !main forks, we currently call
smgrcreate(), but not log_smgrcreate(). Even for PERSISTENT
relations. That seems a bit odd to me. It currently seems to be without
further consequence because, if there's actual data in the fork, we'll
just create the relation in _mdfd_getseg(); or we can cope with the
relation not being there.  But to me that feels wrong.

It seems better to do the log_smgrcreate() for RELPERSISTENCE_PERMANENT,
not just INIT_FORKNUM. What do you guys think?

> It sounds worthwhile to check that other locations rewriting tables,
> e.g. cluster/vacuum full/reindex are safe.

Seems to be ok, on a first glance.

Andres
>From a823b8aa2a24e61786ce924a3b237d846063f5fb Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Thu, 10 Dec 2015 20:14:39 +0100
Subject: [PATCH] WIP: Fix ALTER TABLE ... SET TABLESPACE for unlogged
 relations.

XXX: ATST over streaming rep didn't create the init fork on the other
side. Fix that. Also fix syncing of the init fork of unlogged relations
on the primary. Additionally always WAL log file creation for !main
forks of persistent relations.

Symptom:
ERROR:  58P01: could not open file "pg_tblspc/...": No such file or directory
on a promoted standby.

Reported-By: Michael Paquier
Author: Michael Paquier and Andres Freund
Backpatch: 9.1, where unlogged relations were introduced
---
 src/backend/commands/tablecmds.c | 24 ++--
 1 file changed, 22 insertions(+), 2 deletions(-)

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 0ddde72..6df4939 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -42,6 +42,7 @@
 #include "catalog/pg_type.h"
 #include "catalog/pg_type_fn.h"
 #include "catalog/storage.h"
+#include "catalog/storage_xlog.h"
 #include "catalog/toasting.h"
 #include "commands/cluster.h"
 #include "commands/comment.h"
@@ -9659,6 +9660,15 @@ ATExecSetTableSpace(Oid tableOid, Oid newTableSpace, LOCKMODE lockmode)
 		if (smgrexists(rel->rd_smgr, forkNum))
 		{
 			smgrcreate(dstrel, forkNum, false);
+
+			/*
+			 * WAL log creation if the relation is persistent, or this is the
+			 * init fork of an unlogged relation.
+			 */
+			if (rel->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT ||
+(rel->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED &&
+ forkNum == INIT_FORKNUM))
+log_smgrcreate(, forkNum);
 			copy_relation_data(rel->rd_smgr, dstrel, forkNum,
 			   rel->rd_rel->relpersistence);
 		}
@@ -9878,6 +9888,7 @@ copy_relation_data(SMgrRelation src, SMgrRelation dst,
 	char	   *buf;
 	Page		page;
 	bool		use_wal;
+	bool		copying_initfork;
 	BlockNumber nblocks;
 	BlockNumber blkno;
 
@@ -9891,10 +9902,19 @@ copy_relation_data(SMgrRelation src, SMgrRelation dst,
 	page = (Page) buf;
 
 	/*
+	 * The init fork for an unlogged relation in many respects has to be
+	 * treated the same as normal relation, changes need to be WAL logged and
+	 * it needs to be synced to disk.
+	 */
+	copying_initfork = relpersistence == RELPERSISTENCE_UNLOGGED &&
+		forkNum == INIT_FORKNUM;
+
+	/*
 	 * We need to log the copied data in WAL iff WAL archiving/streaming is
 	 * enabled AND it's a permanent relation.
 	 */
-	use_wal = XLogIsNeeded() && relpersistence == RELPERSISTENCE_PERMANENT;
+	use_wal = XLogIsNeeded() &&
+		(relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork);
 
 	nblocks = smgrnblocks(src, forkNum);
 
@@ -9949,7 +9969,7 @@ copy_relation_data(SMgrRelation src, SMgrRelation dst,
 	 * wouldn't replay our earlier WAL entries. If we do not fsync those pages
 	 * here, they might still not be on disk when the crash occurs.
 	 */
-	if (relpersistence == RELPERSISTENCE_PERMANENT)

Re: [HACKERS] Error with index on unlogged table

2015-12-10 Thread Andres Freund
On 2015-12-10 18:36:32 +0100, Andres Freund wrote:
> On 2015-12-10 12:19:12 -0500, Robert Haas wrote:
> > On Thu, Dec 10, 2015 at 11:32 AM, Andres Freund  wrote:
> > > I've, pushed the fix for the promotion related issue. I'm afraid that
> > > the ALTER TABLE  SET TABLESPACE issue is a bit bigger
> > > than it though.
> > 
> > I think that I would have preferred to fix this by flushing unlogged
> > buffers in bulk at the end of recovery, rather than by flushing them
> > as they are generated. This approach seems needlessly inefficient.
> 
> We touched on that somewhere in the thread - having to scan all of
> shared buffers isn't free either, and it'd mean that promotion would
> take longer because it'd do all the writes at once. As we don't fsync
> them during the flush itself, and as init forks don't ever get
> rewritten, I don't think it makes any actual difference? The total
> number of writes to the OS is the same, no?

Oh, and the primary, in most places, immediately does an smgrimmedsync()
after creating the init fork anyway. So in comparison to that a plain
smrwrite() during replay is nearly free.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Error with index on unlogged table

2015-12-10 Thread Michael Paquier
On Fri, Dec 11, 2015 at 1:57 AM, Robert Haas  wrote:
> On Mon, Nov 30, 2015 at 9:53 PM, Michael Paquier
>  wrote:
>>> I feel quite uncomfortable that it solves the problem from a kind
>>> of nature of unlogged object by arbitrary flagging which is not
>>> fully corresponds to the nature. If we can deduce the necessity
>>> of fsync from some nature, it would be preferable.
>>
>> INIT_FORKNUM is not something only related to unlogged relations,
>> indexes use them as well.
>
> Eh, what?
>
> Indexes use them if they are indexes on unlogged tables, but they'd
> better not use them in any other situation.  Otherwise bad things are
> going to happen.

Yes, this was badly formulated, and caused by my lack of knowledge of
unlogged tables, I think I got it now :) Why don't we actually put
some asserts in those code paths to say that INIT_FORKNUM specific
code can just be used for unlogged relations? Just a thought...
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Error with index on unlogged table

2015-12-10 Thread Michael Paquier
On Thu, Dec 10, 2015 at 8:56 PM, Andres Freund  wrote:
>> So, do we go for something like the patch you attached in
>> 20151208125716.gs4...@alap3.anarazel.de for master and 9.5, and for
>> ~9.4 we use the one I wrote in
>> cab7npqsxerpzj+bz-mfopzfzp5pabie9jwbucjy6qayertt...@mail.gmail.com?
>
> I'm more thinking of using something like my patch for all branches. Why
> would we want to go for the more complicated approach in the more
> distant branches?

That's not what I think it meant: I don't wish to do the complicated
approach either anymore. I sent two patches on the mail mentioned
above: one for master/9.5 that used the approach of changing WAL, and
a second aimed for 9.4 and old versions that is close to what you
sent. It looks that you did not look at the second patch, named
20151209_replay_unlogged_94.patch that does some stuff with
XLOG_HEAP_NEWPAGE to fix the issue.

>> Note that in both cases the patches are not complete, we need to fix
>> as well copy_relation_data@tablecmds.c so as the INIT_FORKNUM pages
>> are logged all the time.
>
> Agreed. It's probably better treated as an entirely different - pretty
> ugly - bug though. I mean it's not some issue of a race during replay,
> it's entirely missing WAL logging for SET TABLESPACE of unlogged
> relations.

Okidoki.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Error with index on unlogged table

2015-12-10 Thread Andres Freund
On 2015-12-10 21:10:57 +0900, Michael Paquier wrote:
> In short: should I send patches for all those things or are you on it?

I'm on it. I don't think the new name you gave the function, and the new
comment, are really an improvement. We already have 'SyncOneBuffer'
(unusable for our purpose unfortunately), so going for Single here
doesn't seem like a good idea.

Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Error with index on unlogged table

2015-12-10 Thread Michael Paquier
On Thu, Dec 10, 2015 at 9:13 PM, Andres Freund  wrote:
> On 2015-12-10 21:10:57 +0900, Michael Paquier wrote:
>> In short: should I send patches for all those things or are you on it?
>
> I'm on it. I don't think the new name you gave the function, and the new
> comment, are really an improvement. We already have 'SyncOneBuffer'
> (unusable for our purpose unfortunately), so going for Single here
> doesn't seem like a good idea.

OK, no problem. Thanks.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Error with index on unlogged table

2015-12-10 Thread Michael Paquier
On Thu, Dec 10, 2015 at 4:57 PM, Andres Freund  wrote:
> On December 10, 2015 5:02:27 AM GMT+01:00, Michael Paquier 
>  wrote:
>>On Wed, Dec 9, 2015 at 9:07 PM, Andres Freund 
>>wrote:
>>> On 2015-12-09 21:03:47 +0900, Michael Paquier wrote:
 Oh, OK. I didn't read though your lines correctly. So you basically
 mean that we would look at the init files that are on disk, and
>>check
 if they are empty. If they are, we simply use XLogReadBufferExtended
 to fetch the INIT_FORKNUM content and fill in another buffer for the
 MAIN_FORKNUM. More or less right?
>>>
>>> We would not just do so if they're empty, we would just generally
>>copy the file
>>> via shared buffers, instead of copy_file(). But we'd get the file
>>size
>>> from the filesystem (which is fine, we make sure it is correct during
>>> replay).
>>
>>So, this suggestion is basically implementing the reverse operation of
>>GetRelationPath() to be able to rebuild a RelFileNode from scratch and
>>then look at the shared buffers needed. Isn't it a bit brittle for
>>back-branches? RelFileNode stuff is available easily through records
>>when replaying individually FPIs, but not really in this code path.
>
> Oh, sorry. In definitely not thinking about doing this for the back branches. 
> That was more musing about a way to optimize this.

OK sure. Yeah that may be something worth investigating. The reverse
engineering of GetRelationPath would be interesting perhaps for some
utilities.

So, do we go for something like the patch you attached in
20151208125716.gs4...@alap3.anarazel.de for master and 9.5, and for
~9.4 we use the one I wrote in
cab7npqsxerpzj+bz-mfopzfzp5pabie9jwbucjy6qayertt...@mail.gmail.com?
Note that in both cases the patches are not complete, we need to fix
as well copy_relation_data@tablecmds.c so as the INIT_FORKNUM pages
are logged all the time. I am also thinking that the patch changing
WAL format of XLOG_FPI is overdoing it as we know that all the
INIT_FORKNUM will be generated for unlogged relations, so that's fine
to flush unconditionally INIT_FORKNUM buffers at replay.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Error with index on unlogged table

2015-12-10 Thread Andres Freund
> So, do we go for something like the patch you attached in
> 20151208125716.gs4...@alap3.anarazel.de for master and 9.5, and for
> ~9.4 we use the one I wrote in
> cab7npqsxerpzj+bz-mfopzfzp5pabie9jwbucjy6qayertt...@mail.gmail.com?

I'm more thinking of using something like my patch for all branches. Why
would we want to go for the more complicated approach in the more
distant branches?

> Note that in both cases the patches are not complete, we need to fix
> as well copy_relation_data@tablecmds.c so as the INIT_FORKNUM pages
> are logged all the time.

Aggreed. It's probably better treated as an entirely different - pretty
ugly - bug though. I mean it's not some issue of a race during replay,
it's entirely missing WAL logging for SET TABLESPACE of unlogged
relations.

Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Error with index on unlogged table

2015-12-10 Thread Michael Paquier
On Thu, Dec 10, 2015 at 9:07 PM, Michael Paquier
 wrote:
> On Thu, Dec 10, 2015 at 8:56 PM, Andres Freund  wrote:
>>> So, do we go for something like the patch you attached in
>>> 20151208125716.gs4...@alap3.anarazel.de for master and 9.5, and for
>>> ~9.4 we use the one I wrote in
>>> cab7npqsxerpzj+bz-mfopzfzp5pabie9jwbucjy6qayertt...@mail.gmail.com?
>>
>> I'm more thinking of using something like my patch for all branches. Why
>> would we want to go for the more complicated approach in the more
>> distant branches?
>
> That's not what I think it meant: I don't wish to do the complicated
> approach either anymore. I sent two patches on the mail mentioned
> above: one for master/9.5 that used the approach of changing WAL, and
> a second aimed for 9.4 and old versions that is close to what you
> sent. It looks that you did not look at the second patch, named
> 20151209_replay_unlogged_94.patch that does some stuff with
> XLOG_HEAP_NEWPAGE to fix the issue.
>
>>> Note that in both cases the patches are not complete, we need to fix
>>> as well copy_relation_data@tablecmds.c so as the INIT_FORKNUM pages
>>> are logged all the time.
>>
>> Agreed. It's probably better treated as an entirely different - pretty
>> ugly - bug though. I mean it's not some issue of a race during replay,
>> it's entirely missing WAL logging for SET TABLESPACE of unlogged
>> relations.
>
> Okidoki.

In short: should I send patches for all those things or are you on it?
It seems that we are on the same page: using the simple approach, with
XLOG_FPI that enforces the flushes for 9.5/master and
XLOG_HEAP_NEWPAGE that does the same for ~9.4.

For the second issue, I would just need to extract the fix from one of
the patches upthread.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Error with index on unlogged table

2015-12-10 Thread Robert Haas
On Mon, Nov 30, 2015 at 9:53 PM, Michael Paquier
 wrote:
>> I feel quite uncomfortable that it solves the problem from a kind
>> of nature of unlogged object by arbitrary flagging which is not
>> fully corresponds to the nature. If we can deduce the necessity
>> of fsync from some nature, it would be preferable.
>
> INIT_FORKNUM is not something only related to unlogged relations,
> indexes use them as well.

Eh, what?

Indexes use them if they are indexes on unlogged tables, but they'd
better not use them in any other situation.  Otherwise bad things are
going to happen.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Error with index on unlogged table

2015-12-10 Thread Andres Freund
On 2015-12-10 12:19:12 -0500, Robert Haas wrote:
> On Thu, Dec 10, 2015 at 11:32 AM, Andres Freund  wrote:
> > I've, pushed the fix for the promotion related issue. I'm afraid that
> > the ALTER TABLE  SET TABLESPACE issue is a bit bigger
> > than it though.
> 
> I think that I would have preferred to fix this by flushing unlogged
> buffers in bulk at the end of recovery, rather than by flushing them
> as they are generated. This approach seems needlessly inefficient.

We touched on that somewhere in the thread - having to scan all of
shared buffers isn't free either, and it'd mean that promotion would
take longer because it'd do all the writes at once. As we don't fsync
them during the flush itself, and as init forks don't ever get
rewritten, I don't think it makes any actual difference? The total
number of writes to the OS is the same, no?


> Maybe it doesn't matter, but you made the same sort of decision with
> the find_multixact_start() fix: flushing sooner instead of working
> harder to make the delayed flush safe. I think that's a bad direction
> in general, and that it will bite us.

Working harder often means added complexity, and complexity for things
executed relatively infrequently has the tendency to bite. I don't think
that's a simple tradeoff.

E.g. in the find_multixact_start() case the flushing isn't just about
"itself", it's also about interactions with SlruScanDirectory. Requiring
that to also check in-memory buffers wouldn't be entirely trivial... And
all that for an option that's essentially executed infrequently.


> > The real problem there imo isn't that the copy_relation_data() doesn't
> > deal with 0 block tables, but that ATExecSetTableSpace() doesn't have a
> > unlogged table specific codepath like heap_create_with_catalog() has.
> 
> It looks to me like somewhere we need to do log_smgrcreate(...,
> INIT_FORKNUM) in the unlogged table case.

Yes.

> RelationCreateStorage()
> skips this for the main forknum of an unlogged table, which seems OK,
> but there's nothing that even attempts it for the init fork, which
> does not seem OK.

We unfortunately can't trivially delegate that work to
RelationCreateStorage(). E.g. heap_create() documents that only the main
fork is created :(

> I guess that logic should happen in
> ATExecSetTableSpace just after smgrcreate(dstrel, forkNum, false).

Looks like it's the easiest place.

It sounds worthwhile to check that other locations rewriting tables,
e.g. cluster/vacuum full/reindex are safe.

> > A second problem is that the smgrimmedsync() in copy_relation_data()
> > isn't called for the init fork of unlogged relations, even if it needs
> > to.
> 
> That seems easy enough to fix.  Can't we just sync all forks at that
> location for permanent relations, and additionally sync the init fork
> only for unlogged relations?

Yes, it's not hard, I just wanted to mention it so it doesn't get
forgotten.

- Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Error with index on unlogged table

2015-12-10 Thread Robert Haas
On Thu, Dec 10, 2015 at 11:32 AM, Andres Freund  wrote:
> I've, pushed the fix for the promotion related issue. I'm afraid that
> the ALTER TABLE  SET TABLESPACE issue is a bit bigger
> than it though.

I think that I would have preferred to fix this by flushing unlogged
buffers in bulk at the end of recovery, rather than by flushing them
as they are generated. This approach seems needlessly inefficient.
Maybe it doesn't matter, but you made the same sort of decision with
the find_multixact_start() fix: flushing sooner instead of working
harder to make the delayed flush safe.  I think that's a bad direction
in general, and that it will bite us.  Nonetheless, I've been
unresponsive on this thread, and it's certainly better to have fixed
the bug in a way that isn't what I would prefer than to have not fixed
it.

> Robert, to catch you up: The isssue, discovered by Michael somewhere in
> this thread, is that copy_relation_data(), used by SET TABLESPACE, does
> not deal with unlogged tables.

Hmm.

[ ... ]

> But that's not sufficient for a bunch of reasons:
>
> First, and that's the big one, that approach doesn't guarantee that the
> file will be created in the new tablespace if the file does not have any
> blocks. Like e.g. the heap relation itself. This will lead to errors
> like:
> ERROR:  58P01: could not open file 
> "pg_tblspc/16384/PG_9.6_201511071/12384/16438": No such file or
> LOCATION:  mdopen, md.c:602
>
> The real problem there imo isn't that the copy_relation_data() doesn't
> deal with 0 block tables, but that ATExecSetTableSpace() doesn't have a
> unlogged table specific codepath like heap_create_with_catalog() has.

It looks to me like somewhere we need to do log_smgrcreate(...,
INIT_FORKNUM) in the unlogged table case.  RelationCreateStorage()
skips this for the main forknum of an unlogged table, which seems OK,
but there's nothing that even attempts it for the init fork, which
does not seem OK.  I guess that logic should happen in
ATExecSetTableSpace just after smgrcreate(dstrel, forkNum, false).

> A second problem is that the smgrimmedsync() in copy_relation_data()
> isn't called for the init fork of unlogged relations, even if it needs
> to.

That seems easy enough to fix.  Can't we just sync all forks at that
location for permanent relations, and additionally sync the init fork
only for unlogged relations?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Error with index on unlogged table

2015-12-10 Thread Michael Paquier
On Fri, Dec 11, 2015 at 1:32 AM, Andres Freund  wrote:
> I've, pushed the fix for the promotion related issue.

Thanks! It is great to see this issue addressed.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Error with index on unlogged table

2015-12-10 Thread Michael Paquier
On Fri, Dec 11, 2015 at 4:27 AM, Andres Freund  wrote:
> Hi,
>
> On 2015-12-10 18:36:32 +0100, Andres Freund wrote:
>> On 2015-12-10 12:19:12 -0500, Robert Haas wrote:
>> > > The real problem there imo isn't that the copy_relation_data() doesn't
>> > > deal with 0 block tables, but that ATExecSetTableSpace() doesn't have a
>> > > unlogged table specific codepath like heap_create_with_catalog() has.
>> >
>> > It looks to me like somewhere we need to do log_smgrcreate(...,
>> > INIT_FORKNUM) in the unlogged table case.
>>
>> Yes.
>>
>> > RelationCreateStorage()
>> > skips this for the main forknum of an unlogged table, which seems OK,
>> > but there's nothing that even attempts it for the init fork, which
>> > does not seem OK.
>>
>> We unfortunately can't trivially delegate that work to
>> RelationCreateStorage(). E.g. heap_create() documents that only the main
>> fork is created :(
>>
>> > I guess that logic should happen in
>> > ATExecSetTableSpace just after smgrcreate(dstrel, forkNum, false).
>>
>> Looks like it's the easiest place.
>
>> > > A second problem is that the smgrimmedsync() in copy_relation_data()
>> > > isn't called for the init fork of unlogged relations, even if it needs
>> > > to.
>
> Here's a patch doing that. It's not yet fully polished, but I wanted to
> get it out, because I noticed one thing:
>
> In ATExecSetTableSpace(), for !main forks, we currently call
> smgrcreate(), but not log_smgrcreate(). Even for PERSISTENT
> relations. That seems a bit odd to me. It currently seems to be without
> further consequence because, if there's actual data in the fork, we'll
> just create the relation in _mdfd_getseg(); or we can cope with the
> relation not being there.  But to me that feels wrong.
>
> It seems better to do the log_smgrcreate() for RELPERSISTENCE_PERMANENT,
> not just INIT_FORKNUM. What do you guys think?

This fixes the problem in my environment.

+   if (rel->rd_rel->relpersistence ==
RELPERSISTENCE_PERMANENT ||
+   (rel->rd_rel->relpersistence ==
RELPERSISTENCE_UNLOGGED &&
+forkNum == INIT_FORKNUM))
+   log_smgrcreate(, forkNum);
There should be a XLogIsNeeded() check as well. Removing the check on
RELPERSISTENCE_UNLOGGED is fine as well... Not mandatory though :)

+* The init fork for an unlogged relation in many respects has to be
+* treated the same as normal relation, changes need to be WAL
logged and
+* it needs to be synced to disk.
+*/
+   copying_initfork = relpersistence == RELPERSISTENCE_UNLOGGED &&
+   forkNum == INIT_FORKNUM;
Here as well just a check on INIT_FORKNUM would be fine.

>> It sounds worthwhile to check that other locations rewriting tables,
>> e.g. cluster/vacuum full/reindex are safe.
>
> Seems to be ok, on a first glance.

Yeah. REINDEX relies on index_build to recreate what it should... The
others are looking fine as well. I have tested it in case and the
files produced are consistent on standby and its master.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Error with index on unlogged table

2015-12-10 Thread Kyotaro HORIGUCHI
Hello,

At Thu, 10 Dec 2015 20:27:01 +0100, Andres Freund  wrote in 
<20151210192701.gc11...@alap3.anarazel.de>
> > > > A second problem is that the smgrimmedsync() in copy_relation_data()
> > > > isn't called for the init fork of unlogged relations, even if it needs
> > > > to.
> 
> Here's a patch doing that. It's not yet fully polished, but I wanted to
> get it out, because I noticed one thing:
> 
> In ATExecSetTableSpace(), for !main forks, we currently call
> smgrcreate(), but not log_smgrcreate(). Even for PERSISTENT
> relations. That seems a bit odd to me. It currently seems to be without
> further consequence because, if there's actual data in the fork, we'll
> just create the relation in _mdfd_getseg(); or we can cope with the
> relation not being there.  But to me that feels wrong.

FWIW I agree that.

> It seems better to do the log_smgrcreate() for RELPERSISTENCE_PERMANENT,
> not just INIT_FORKNUM. What do you guys think?

What it is doing seems to me reasonable but copying_initfork
doesn't seems to be necessary. Kicking both of log_newpage() and
smgrimmedsync() by use_wal, which has the value considering
INIT_FORKNUM would be more descriptive. (more readable, in other
words)

> > It sounds worthwhile to check that other locations rewriting tables,
> > e.g. cluster/vacuum full/reindex are safe.
> 
> Seems to be ok, on a first glance.


regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Error with index on unlogged table

2015-12-10 Thread Andres Freund
Hi Michael, Robert,

On 2015-12-10 21:10:57 +0900, Michael Paquier wrote:
> On Thu, Dec 10, 2015 at 9:07 PM, Michael Paquier
>  wrote:
> >>> Note that in both cases the patches are not complete, we need to fix
> >>> as well copy_relation_data@tablecmds.c so as the INIT_FORKNUM pages
> >>> are logged all the time.
> >>
> >> Agreed. It's probably better treated as an entirely different - pretty
> >> ugly - bug though. I mean it's not some issue of a race during replay,
> >> it's entirely missing WAL logging for SET TABLESPACE of unlogged
> >> relations.
> 
> For the second issue, I would just need to extract the fix from one of
> the patches upthread.

I've, pushed the fix for the promotion related issue. I'm afraid that
the ALTER TABLE  SET TABLESPACE issue is a bit bigger
than it though.

Robert, to catch you up: The isssue, discovered by Michael somewhere in
this thread, is that copy_relation_data(), used by SET TABLESPACE, does
not deal with unlogged tables.

Michael, what your earlier patch did was basically

@@ -9892,9 +9892,14 @@ copy_relation_data(SMgrRelation src, SMgrRelation dst,
 
/*
 * We need to log the copied data in WAL iff WAL archiving/streaming is
-* enabled AND it's a permanent relation.
-*/
-   use_wal = XLogIsNeeded() && relpersistence == RELPERSISTENCE_PERMANENT;
+* enabled AND it's a permanent relation. Unlogged relations need to have
+* their init fork logged as well, to ensure a consistent on-disk state
+* when reset at the end of recovery.
+*/
+   use_wal = XLogIsNeeded() &&
+   (relpersistence == RELPERSISTENCE_PERMANENT ||
+(relpersistence == RELPERSISTENCE_UNLOGGED &&
+ forkNum == INIT_FORKNUM));

...

nblocks = smgrnblocks(src, forkNum);

for (blkno = 0; blkno < nblocks; blkno++)
{
...
if (use_wal)
log_newpage(>smgr_rnode.node, forkNum, blkno, 
page);


But that's not sufficient for a bunch of reasons:

First, and that's the big one, that approach doesn't guarantee that the
file will be created in the new tablespace if the file does not have any
blocks. Like e.g. the heap relation itself. This will lead to errors
like:
ERROR:  58P01: could not open file 
"pg_tblspc/16384/PG_9.6_201511071/12384/16438": No such file or
LOCATION:  mdopen, md.c:602

The real problem there imo isn't that the copy_relation_data() doesn't
deal with 0 block tables, but that ATExecSetTableSpace() doesn't have a
unlogged table specific codepath like heap_create_with_catalog() has.


A second problem is that the smgrimmedsync() in copy_relation_data()
isn't called for the init fork of unlogged relations, even if it needs
to.

As ATExecSetTableSpace() is also used for indices, the problem exists
there as well.

Yuck.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Error with index on unlogged table

2015-12-09 Thread Michael Paquier
On Wed, Dec 9, 2015 at 4:41 PM, Andres Freund  wrote:
> On 2015-12-09 16:30:47 +0900, Michael Paquier wrote:
>> > I'm kinda wondering if it wouldn't have been better to go through shared
>> > buffers in ResetUnloggedRelationsInDbspaceDir() instead of using
>> > copy_file().
>>
>> For deployment with large shared_buffers settings, wouldn't that be
>> actually more costly than the current way of doing? We would need to
>> go through all the buffers at least once and look for the INIT_FORKNUM
>> present to flush them.
>
> We could just check the file sizes on disk, and the check for the
> contents of all the pages for each file.

By doing it at replay, the flushes are spread across time. And by
doing it at the end of recovery, all the flushes would be grouped. Do
you think that's fine?
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Error with index on unlogged table

2015-12-09 Thread Andres Freund
On 2015-12-09 19:36:11 +0900, Michael Paquier wrote:
> On Wed, Dec 9, 2015 at 4:41 PM, Andres Freund  wrote:
> > On 2015-12-09 16:30:47 +0900, Michael Paquier wrote:
> >> > I'm kinda wondering if it wouldn't have been better to go through shared
> >> > buffers in ResetUnloggedRelationsInDbspaceDir() instead of using
> >> > copy_file().
> >>
> >> For deployment with large shared_buffers settings, wouldn't that be
> >> actually more costly than the current way of doing? We would need to
> >> go through all the buffers at least once and look for the INIT_FORKNUM
> >> present to flush them.
> >
> > We could just check the file sizes on disk, and the check for the
> > contents of all the pages for each file.
> 
> By doing it at replay, the flushes are spread across time. And by
> doing it at the end of recovery, all the flushes would be grouped. Do
> you think that's fine?

The point is that we'd no flushes, because the data would come directly
from shared buffers...


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Error with index on unlogged table

2015-12-09 Thread Andres Freund
On 2015-12-09 21:03:47 +0900, Michael Paquier wrote:
> Oh, OK. I didn't read though your lines correctly. So you basically
> mean that we would look at the init files that are on disk, and check
> if they are empty. If they are, we simply use XLogReadBufferExtended
> to fetch the INIT_FORKNUM content and fill in another buffer for the
> MAIN_FORKNUM. More or less right?

We'd not just do so if they're empty, we'd just generally copy the file
via shared buffers, instead of copy_file(). But we'd get the file size
from the filesystem (which is fine, we make sure it is correct during
replay).


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Error with index on unlogged table

2015-12-09 Thread Michael Paquier
On Wed, Dec 9, 2015 at 8:04 PM, Andres Freund  wrote:
> On 2015-12-09 19:36:11 +0900, Michael Paquier wrote:
>> On Wed, Dec 9, 2015 at 4:41 PM, Andres Freund  wrote:
>> > On 2015-12-09 16:30:47 +0900, Michael Paquier wrote:
>> >> > I'm kinda wondering if it wouldn't have been better to go through shared
>> >> > buffers in ResetUnloggedRelationsInDbspaceDir() instead of using
>> >> > copy_file().
>> >>
>> >> For deployment with large shared_buffers settings, wouldn't that be
>> >> actually more costly than the current way of doing? We would need to
>> >> go through all the buffers at least once and look for the INIT_FORKNUM
>> >> present to flush them.
>> >
>> > We could just check the file sizes on disk, and the check for the
>> > contents of all the pages for each file.
>>
>> By doing it at replay, the flushes are spread across time. And by
>> doing it at the end of recovery, all the flushes would be grouped. Do
>> you think that's fine?
>
> The point is that we'd no flushes, because the data would come directly
> from shared buffers...

Oh, OK. I didn't read though your lines correctly. So you basically
mean that we would look at the init files that are on disk, and check
if they are empty. If they are, we simply use XLogReadBufferExtended
to fetch the INIT_FORKNUM content and fill in another buffer for the
MAIN_FORKNUM. More or less right?
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Error with index on unlogged table

2015-12-09 Thread Michael Paquier
On Wed, Dec 9, 2015 at 9:07 PM, Andres Freund  wrote:
> On 2015-12-09 21:03:47 +0900, Michael Paquier wrote:
>> Oh, OK. I didn't read though your lines correctly. So you basically
>> mean that we would look at the init files that are on disk, and check
>> if they are empty. If they are, we simply use XLogReadBufferExtended
>> to fetch the INIT_FORKNUM content and fill in another buffer for the
>> MAIN_FORKNUM. More or less right?
>
> We would not just do so if they're empty, we would just generally copy the 
> file
> via shared buffers, instead of copy_file(). But we'd get the file size
> from the filesystem (which is fine, we make sure it is correct during
> replay).

So, this suggestion is basically implementing the reverse operation of
GetRelationPath() to be able to rebuild a RelFileNode from scratch and
then look at the shared buffers needed. Isn't it a bit brittle for
back-branches? RelFileNode stuff is available easily through records
when replaying individually FPIs, but not really in this code path.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Error with index on unlogged table

2015-12-09 Thread Andres Freund
On December 10, 2015 5:02:27 AM GMT+01:00, Michael Paquier 
 wrote:
>On Wed, Dec 9, 2015 at 9:07 PM, Andres Freund 
>wrote:
>> On 2015-12-09 21:03:47 +0900, Michael Paquier wrote:
>>> Oh, OK. I didn't read though your lines correctly. So you basically
>>> mean that we would look at the init files that are on disk, and
>check
>>> if they are empty. If they are, we simply use XLogReadBufferExtended
>>> to fetch the INIT_FORKNUM content and fill in another buffer for the
>>> MAIN_FORKNUM. More or less right?
>>
>> We would not just do so if they're empty, we would just generally
>copy the file
>> via shared buffers, instead of copy_file(). But we'd get the file
>size
>> from the filesystem (which is fine, we make sure it is correct during
>> replay).
>
>So, this suggestion is basically implementing the reverse operation of
>GetRelationPath() to be able to rebuild a RelFileNode from scratch and
>then look at the shared buffers needed. Isn't it a bit brittle for
>back-branches? RelFileNode stuff is available easily through records
>when replaying individually FPIs, but not really in this code path.

Oh, sorry. In definitely not thinking about doing this for the back branches. 
That was more musing about a way to optimize this.

Andres

--- 
Please excuse brevity and formatting - I am writing this on my mobile phone.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Error with index on unlogged table

2015-12-08 Thread Andres Freund
On 2015-12-03 22:09:43 +0100, Andres Freund wrote:
> On 2015-11-20 16:11:15 +0900, Michael Paquier wrote:
> > +   if (bkpb.fork == INIT_FORKNUM)
> > +   {
> > +   SMgrRelation srel;
> > +   srel = smgropen(bkpb.node, InvalidBackendId);
> > +   smgrimmedsync(srel, INIT_FORKNUM);
> > +   smgrclose(srel);
> > +   }
>
> A smgrwrite() instead of a smgrimmedsync() should be sufficient here.

More importantly, the smgrimmedsync() won't actually achieve anything -
RestoreBackupBlockContents() will just have written the data into shared
buffers. smgrimmedsync() doesn't flush that.

And further, just flushing the buffer directly to disk using
smgrwrite(), without reflecting that in the in-memory state, doesn't
seem prudent. At the very least we'll write all those blocks twice. It
also likely misses dealing with checksums and such.

What I think we really ought to do instead is to use "proper" buffer
functionality, e.g. flush the buffer via FlushBuffer().

It seems slightly better not to directly expose FlushBuffer() and
instead add a tiny wrapper. Couldn't come up with a grand name tho.

For me the attached, preliminary, patch, fixes the problem in master;
previous branches ought to look mostly similar, except the flush moved
to RestoreBackupBlockContents/RestoreBackupBlock.


Does anybody have a better idea? Suitable for the back-branches?


I'm kinda wondering if it wouldn't have been better to go through shared
buffers in ResetUnloggedRelationsInDbspaceDir() instead of using
copy_file().

Regareds,

Andres
>From 9c4a7691595c1edfc12c5583cd007bb569f9db94 Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Tue, 8 Dec 2015 13:56:18 +0100
Subject: [PATCH] Preliminary-unlogged-rel-recovery-fix

---
 src/backend/access/transam/xlogutils.c |  9 +
 src/backend/storage/buffer/bufmgr.c| 21 +
 src/include/storage/bufmgr.h   |  1 +
 3 files changed, 31 insertions(+)

diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index a5003c3..9073a85 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -368,6 +368,15 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
 
 		MarkBufferDirty(*buf);
 
+		/*
+		 * At the end of crash recovery the init forks of unlogged relations
+		 * are copied, without going through shared buffers. So we need to
+		 * force the on-disk state of init forks to always be in sync with the
+		 * state in shared buffers.
+		 */
+		if (forknum == INIT_FORKNUM)
+			FlushOneBuffer(*buf);
+
 		return BLK_RESTORED;
 	}
 	else
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 63188a3..b32543b 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -2989,6 +2989,27 @@ FlushDatabaseBuffers(Oid dbid)
 }
 
 /*
+ * Flush a previously, shared or exclusively, locked and pinned buffer to the
+ * OS.
+ */
+void
+FlushOneBuffer(Buffer buffer)
+{
+	BufferDesc *bufHdr;
+
+	/* currently not needed, but no fundamental reason not to support */
+	Assert(!BufferIsLocal(buffer));
+
+	Assert(BufferIsPinned(buffer));
+
+	bufHdr = GetBufferDescriptor(buffer - 1);
+
+	LWLockHeldByMe(bufHdr->content_lock);
+
+	FlushBuffer(bufHdr, NULL);
+}
+
+/*
  * ReleaseBuffer -- release the pin on a buffer
  */
 void
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 0f59201..09b2b11 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -176,6 +176,7 @@ extern void CheckPointBuffers(int flags);
 extern BlockNumber BufferGetBlockNumber(Buffer buffer);
 extern BlockNumber RelationGetNumberOfBlocksInFork(Relation relation,
 ForkNumber forkNum);
+extern void FlushOneBuffer(Buffer buffer);
 extern void FlushRelationBuffers(Relation rel);
 extern void FlushDatabaseBuffers(Oid dbid);
 extern void DropRelFileNodeBuffers(RelFileNodeBackend rnode,
-- 
2.6.0.rc3


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Error with index on unlogged table

2015-12-08 Thread Andres Freund
Hi,

On 2015-12-04 21:57:54 +0900, Michael Paquier wrote:
> On Fri, Dec 4, 2015 at 5:10 PM, Andres Freund  wrote:
> >> Let's go for XLOG_FPI_FLUSH.
> >
> > I think the other way is a bit better, because we can add new flags
> > without changing the WAL format.
> 
> Hm. On the contrary, I think that it would make more sense to have a
> flag as well for FOR_HINT honestly, those are really the same
> operations, and FOR_HINT is just here for statistic purposes.

I don't think it's all that much the same operation. And WRT statistics
purpose: Being able to easily differentiate FOR_HINT is important for
pg_xlogdump --stats, but not at all for XLOG_FPI_FLUSH.

> >> --- a/src/backend/access/transam/xloginsert.c
> >> +++ b/src/backend/access/transam/xloginsert.c
> >> @@ -932,10 +932,13 @@ XLogSaveBufferForHint(Buffer buffer, bool buffer_std)
> >>   * If the page follows the standard page layout, with a PageHeader and 
> >> unused
> >>   * space between pd_lower and pd_upper, set 'page_std' to TRUE. That 
> >> allows
> >>   * the unused space to be left out from the WAL record, making it smaller.
> >> + *
> >> + * If 'is_flush' is set to TRUE, relation will be requested to flush
> >> + * immediately its data at replay after replaying this full page image.
> >>   */
> >
> > s/is_flush/flush_immed/? And maybe say that it 'will be flushed to the
> > OS immediately after replaying the record'?
> 
> s/OS/stable storage?

It's not flushed to stable storage here. Just to the OS.

> diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
> index 99337b0..fff48ab 100644
> --- a/src/backend/access/brin/brin.c
> +++ b/src/backend/access/brin/brin.c
> @@ -682,7 +682,15 @@ brinbuildempty(PG_FUNCTION_ARGS)
>   brin_metapage_init(BufferGetPage(metabuf), BrinGetPagesPerRange(index),
>  BRIN_CURRENT_VERSION);
>   MarkBufferDirty(metabuf);
> - log_newpage_buffer(metabuf, false);
> +
> + /*
> +  * For unlogged relations, this page should be immediately flushed
> +  * to disk after being replayed. This is necessary to ensure that the
> +  * initial on-disk state of unlogged relations is preserved as the
> +  * on-disk files are copied directly at the end of recovery.
> +  */
> + log_newpage_buffer(metabuf, false,
> + index->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED);
>   END_CRIT_SECTION();
>  
>   UnlockReleaseBuffer(metabuf);
> diff --git a/src/backend/access/brin/brin_pageops.c 
> b/src/backend/access/brin/brin_pageops.c
> index f876f62..572fe20 100644
> --- a/src/backend/access/brin/brin_pageops.c
> +++ b/src/backend/access/brin/brin_pageops.c
> @@ -865,7 +865,7 @@ brin_initialize_empty_new_buffer(Relation idxrel, Buffer 
> buffer)
>   page = BufferGetPage(buffer);
>   brin_page_init(page, BRIN_PAGETYPE_REGULAR);
>   MarkBufferDirty(buffer);
> - log_newpage_buffer(buffer, true);
> + log_newpage_buffer(buffer, true, false);
>   END_CRIT_SECTION();
>  
>   /*
> diff --git a/src/backend/access/gin/gininsert.c 
> b/src/backend/access/gin/gininsert.c
> index 49e9185..17c168a 100644
> --- a/src/backend/access/gin/gininsert.c
> +++ b/src/backend/access/gin/gininsert.c
> @@ -450,14 +450,22 @@ ginbuildempty(PG_FUNCTION_ARGS)
>   ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, 
> NULL);
>   LockBuffer(RootBuffer, BUFFER_LOCK_EXCLUSIVE);
>  
> - /* Initialize and xlog metabuffer and root buffer. */
> + /*
> +  * Initialize and xlog metabuffer and root buffer. For unlogged
> +  * relations, those pages need to be immediately flushed to disk
> +  * after being replayed. This is necessary to ensure that the
> +  * initial on-disk state of unlogged relations is preserved when
> +  * they get reset at the end of recovery.
> +  */
>   START_CRIT_SECTION();
>   GinInitMetabuffer(MetaBuffer);
>   MarkBufferDirty(MetaBuffer);
> - log_newpage_buffer(MetaBuffer, false);
> + log_newpage_buffer(MetaBuffer, false,
> + index->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED);
>   GinInitBuffer(RootBuffer, GIN_LEAF);
>   MarkBufferDirty(RootBuffer);
> - log_newpage_buffer(RootBuffer, false);
> + log_newpage_buffer(RootBuffer, false,
> + index->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED);
>   END_CRIT_SECTION();
>  
>   /* Unlock and release the buffers. */
> diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
> index 53bccf6..6a20031 100644
> --- a/src/backend/access/gist/gist.c
> +++ b/src/backend/access/gist/gist.c
> @@ -84,7 +84,8 @@ gistbuildempty(PG_FUNCTION_ARGS)
>   START_CRIT_SECTION();
>   GISTInitBuffer(buffer, F_LEAF);
>   MarkBufferDirty(buffer);
> - log_newpage_buffer(buffer, true);
> + log_newpage_buffer(buffer, true,
> + index->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED);

Re: [HACKERS] Error with index on unlogged table

2015-12-08 Thread Andres Freund
On 2015-12-09 16:30:47 +0900, Michael Paquier wrote:
> > I'm kinda wondering if it wouldn't have been better to go through shared
> > buffers in ResetUnloggedRelationsInDbspaceDir() instead of using
> > copy_file().
> 
> For deployment with large shared_buffers settings, wouldn't that be
> actually more costly than the current way of doing? We would need to
> go through all the buffers at least once and look for the INIT_FORKNUM
> present to flush them.

We could just check the file sizes on disk, and the check for the
contents of all the pages for each file.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Error with index on unlogged table

2015-12-08 Thread Michael Paquier
On Tue, Dec 8, 2015 at 9:57 PM, Andres Freund  wrote:
> For me the attached, preliminary, patch, fixes the problem in master;
> previous branches ought to look mostly similar, except the flush moved
> to RestoreBackupBlockContents/RestoreBackupBlock.
> Does anybody have a better idea? Suitable for the back-branches?

Not really I am afraid..

> I'm kinda wondering if it wouldn't have been better to go through shared
> buffers in ResetUnloggedRelationsInDbspaceDir() instead of using
> copy_file().

For deployment with large shared_buffers settings, wouldn't that be
actually more costly than the current way of doing? We would need to
go through all the buffers at least once and look for the INIT_FORKNUM
present to flush them.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Error with index on unlogged table

2015-12-08 Thread Michael Paquier
On Tue, Dec 8, 2015 at 10:09 PM, Andres Freund  wrote:
> On 2015-12-04 21:57:54 +0900, Michael Paquier wrote:
>> On Fri, Dec 4, 2015 at 5:10 PM, Andres Freund  wrote:
>> >> Let's go for XLOG_FPI_FLUSH.
>> >
>> > I think the other way is a bit better, because we can add new flags
>> > without changing the WAL format.
>>
>> Hm. On the contrary, I think that it would make more sense to have a
>> flag as well for FOR_HINT honestly, those are really the same
>> operations, and FOR_HINT is just here for statistic purposes.
>
> I don't think it's all that much the same operation. And WRT statistics
> purpose: Being able to easily differentiate FOR_HINT is important for
> pg_xlogdump --stats, but not at all for XLOG_FPI_FLUSH.

OK. Switched back to using XLOG_FPI_FLUSH.

>> [stuff about unlogged relations in code]
>
> I might be missing something here - but isn't it pretty much guaranteed
> that all these are unlogged relations? Otherwise *buildempty() wouldn't
> have been called, right?

Yep, that's ensured in index.c when calling ambuildempty. Let's flush
it unconditionally then in those code paths.

>>   else if (info == XLOG_BACKUP_END)
>>   {
>> @@ -178,9 +183,6 @@ xlog_identify(uint8 info)
>>   case XLOG_FPI:
>>   id = "FPI";
>>   break;
>> - case XLOG_FPI_FOR_HINT:
>> - id = "FPI_FOR_HINT";
>> - break;
>>   }
>
> Really don't want to do that.

OK, added back.

>> @@ -9391,14 +9394,34 @@ xlog_redo(XLogReaderState *record)
>>* resource manager needs to generate conflicts, it has to 
>> define a
>>* separate WAL record type and redo routine.
>>*
>> -  * XLOG_FPI_FOR_HINT records are generated when a page needs 
>> to be
>> -  * WAL- logged because of a hint bit update. They are only 
>> generated
>> -  * when checksums are enabled. There is no difference in 
>> handling
>> -  * XLOG_FPI and XLOG_FPI_FOR_HINT records, they use a 
>> different info
>> -  * code just to distinguish them for statistics purposes.
>> +  * Records flagged with 'for_hint_bits' are generated when a 
>> page needs
>> +  * to be WAL- logged because of a hint bit update. They are 
>> only
>> +  * generated when checksums are enabled. There is no 
>> difference in
>> +  * handling records when this flag is set, it is used for 
>> statistics
>> +  * purposes.
>> +  *
>> +  * Records flagged with 'is_flush' indicate that the page 
>> immediately
>> +  * needs to be written to disk, not just to shared buffers. 
>> This is
>> +  * important if the on-disk state is to be the authoritative, 
>> not the
>> +  * state in shared buffers. E.g. because on-disk files may 
>> later be
>> +  * copied directly.
>>*/
>>   if (XLogReadBufferForRedo(record, 0, ) != BLK_RESTORED)
>>   elog(ERROR, "unexpected XLogReadBufferForRedo result 
>> when restoring backup block");
>> +
>> + if (xlrec.is_flush)
>> + {
>> + RelFileNode rnode;
>> + ForkNumber  forknum;
>> + BlockNumber blkno;
>> + SMgrRelation srel;
>> +
>> + (void) XLogRecGetBlockTag(record, 0, , , 
>> );
>> + srel = smgropen(rnode, InvalidBackendId);
>> + smgrwrite(srel, forknum, blkno, BufferGetPage(buffer), 
>> false);
>> + smgrclose(srel);
>> + }
>
> That'd leave the dirty bit set on the buffer...

OK, I have used an equivalent of FlushBuffer, FlushSingleBuffer being
a copycat of your previous patch... I am attaching as well a patch for
~9.4, which uses XLOG_HEAP_NEWPAGE instead.
-- 
Michael


20151209_replay_unlogged_master_95.patch
Description: binary/octet-stream


20151209_replay_unlogged_94.patch
Description: binary/octet-stream

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Error with index on unlogged table

2015-12-04 Thread Andres Freund
On 2015-12-04 17:00:13 +0900, Michael Paquier wrote:
> Andres Freud wrote:
> >> @@ -450,14 +450,21 @@ ginbuildempty(PG_FUNCTION_ARGS)
> >>   START_CRIT_SECTION();
> >>   GinInitMetabuffer(MetaBuffer);
> >>   MarkBufferDirty(MetaBuffer);
> >> - log_newpage_buffer(MetaBuffer, false);
> >> + log_newpage_buffer(MetaBuffer, false, false);
> >>   GinInitBuffer(RootBuffer, GIN_LEAF);
> >>   MarkBufferDirty(RootBuffer);
> >> - log_newpage_buffer(RootBuffer, false);
> >> + log_newpage_buffer(RootBuffer, false, true);
> >>   END_CRIT_SECTION();
> >>
> > Why isn't the metapage flushed to disk?
> 
> I was not sure if we should only flush only at the last page, as pages
> just before would be already replayed. Switched.

Uh, that's not how it works! The earlier pages would just be in shared
buffers. Neither smgrwrite, nor smgrimmedsync does anything about that!

> >>  extern void InitXLogInsert(void);
> >> diff --git a/src/include/catalog/pg_control.h 
> >> b/src/include/catalog/pg_control.h
> >> index ad1eb4b..91445f1 100644
> >> --- a/src/include/catalog/pg_control.h
> >> +++ b/src/include/catalog/pg_control.h
> >> @@ -73,6 +73,7 @@ typedef struct CheckPoint
> >>  #define XLOG_END_OF_RECOVERY 0x90
> >>  #define XLOG_FPI_FOR_HINT0xA0
> >>  #define XLOG_FPI 0xB0
> >> +#define XLOG_FPI_FOR_SYNC0xC0
> >
> >
> > I'm not a big fan of the XLOG_FPI_FOR_SYNC name. Syncing is a bit too
> > ambigous for my taste. How about either naming it XLOG_FPI_FLUSH or
> > instead adding actual record data and a 'flags' field in there? I
> > slightly prefer the latter - XLOG_FPI and XLOG_FPI_FOR_HINT really are
> > different, XLOG_FPI_FOR_SYNC not so much.
> 
> Let's go for XLOG_FPI_FLUSH.

I think the other way is a bit better, because we can add new flags
without changing the WAL format.

> diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
> index 99337b0..b646101 100644
> --- a/src/backend/access/brin/brin.c
> +++ b/src/backend/access/brin/brin.c
> @@ -682,7 +682,15 @@ brinbuildempty(PG_FUNCTION_ARGS)
>   brin_metapage_init(BufferGetPage(metabuf), BrinGetPagesPerRange(index),
>  BRIN_CURRENT_VERSION);
>   MarkBufferDirty(metabuf);
> - log_newpage_buffer(metabuf, false);
> +
> + /*
> +  * For unlogged relations, this page should be immediately flushed
> +  * to disk after being replayed. This is necessary to ensure that the
> +  * initial on-disk state of unlogged relations is preserved when
> +  * they get reset at the end of recovery.
> +  */
> + log_newpage_buffer(metabuf, false,
> + index->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED);
>   END_CRIT_SECTION();

Maybe write the last sentence as '... as the on disk files are copied
directly at the end of recovery.'?

> @@ -336,7 +336,8 @@ end_heap_rewrite(RewriteState state)
>   MAIN_FORKNUM,
>   state->rs_blockno,
>   state->rs_buffer,
> - true);
> + true,
> + false);
>   RelationOpenSmgr(state->rs_new_rel);
>  
>   PageSetChecksumInplace(state->rs_buffer, state->rs_blockno);
> @@ -685,7 +686,8 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
>   MAIN_FORKNUM,
>   state->rs_blockno,
>   page,
> - true);
> + true,
> + false);

Did you verify that that's ok when a unlogged table is clustered/vacuum
full'ed?

> @@ -181,6 +183,9 @@ xlog_identify(uint8 info)
>   case XLOG_FPI_FOR_HINT:
>   id = "FPI_FOR_HINT";
>   break;
> + case XLOG_FPI_FLUSH:
> + id = "FPI_FOR_SYNC";
> + break;
>   }

Old string.


> --- a/src/backend/access/transam/xloginsert.c
> +++ b/src/backend/access/transam/xloginsert.c
> @@ -932,10 +932,13 @@ XLogSaveBufferForHint(Buffer buffer, bool buffer_std)
>   * If the page follows the standard page layout, with a PageHeader and unused
>   * space between pd_lower and pd_upper, set 'page_std' to TRUE. That allows
>   * the unused space to be left out from the WAL record, making it smaller.
> + *
> + * If 'is_flush' is set to TRUE, relation will be requested to flush
> + * immediately its data at replay after replaying this full page image.
>   */

s/is_flush/flush_immed/? And maybe say that it 

Re: [HACKERS] Error with index on unlogged table

2015-12-04 Thread Michael Paquier
Andres Freud wrote:
> On 2015-11-20 16:11:15 +0900, Michael Paquier wrote:
>> diff --git a/src/backend/access/transam/xlog.c 
>> b/src/backend/access/transam/xlog.c
>> index cc845d2..4883697 100644
>> --- a/src/backend/access/transam/xlog.c
>> +++ b/src/backend/access/transam/xlog.c
>> @@ -9503,6 +9503,14 @@ xlog_redo(XLogRecPtr lsn, XLogRecord *record)
>>   data += sizeof(BkpBlock);
>>
>>   RestoreBackupBlockContents(lsn, bkpb, data, false, false);
>> +
>> + if (bkpb.fork == INIT_FORKNUM)
>> + {
>> + SMgrRelation srel;
>> + srel = smgropen(bkpb.node, InvalidBackendId);
>> + smgrimmedsync(srel, INIT_FORKNUM);
>> + smgrclose(srel);
>> + }
>>   }
>>   else if (info == XLOG_BACKUP_END)
>>   {
>
> A smgrwrite() instead of a smgrimmedsync() should be sufficient here.

Check.

On Fri, Dec 4, 2015 at 6:35 AM, Andres Freund  wrote:
> On 2015-11-27 16:59:20 +0900, Michael Paquier wrote:
>> Attached is a patch that fixes the issue for me in master and 9.5.
>> Actually in the last patch I forgot a call to smgrwrite to ensure that
>> the INIT_FORKNUM is correctly synced to disk when those pages are
>> replayed at recovery, letting the reset routines for unlogged
>> relations do their job correctly. I have noticed as well that we need
>> to do the same for gin and brin relations. In this case I think that
>> we could limit the flush to unlogged relations, my patch does it
>> unconditionally though to generalize the logic. Thoughts?
>
> I think it's a good idea to limit this to unlogged relations. For a
> dataset that fits into shared_buffers and creates many relations, the
> additional disk writes could be noticeable.

OK, then I have switched the patch this way to limit the flush to
unlogged relations where needed.

>> diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
>> index 99337b0..d7964ac 100644
>> --- a/src/backend/access/brin/brin.c
>> +++ b/src/backend/access/brin/brin.c
>> @@ -682,7 +682,12 @@ brinbuildempty(PG_FUNCTION_ARGS)
>>   brin_metapage_init(BufferGetPage(metabuf), BrinGetPagesPerRange(index),
>>  BRIN_CURRENT_VERSION);
>>   MarkBufferDirty(metabuf);
>> - log_newpage_buffer(metabuf, false);
>> + /*
>> +  * When this full page image is replayed, there is no guarantee that
>> +  * this page will be present to disk when replayed particularly for
>> +  * unlogged relations, hence enforce it to be flushed to disk.
>> +  */
>
> The grammar seems a bit off here.

Check.

>> + /*
>> +  * Initialize and xlog metabuffer and root buffer. When those full
>> +  * pages are replayed, it is not guaranteed that those relation
>> +  * init forks will be flushed to disk after replaying them, hence
>> +  * enforce those pages to be flushed to disk at replay, only the
>> +  * last record will enforce a flush for performance reasons and
>> +  * because it is actually unnecessary to do it multiple times.
>> +  */
>
> That comment needs some love too. I think it really only makes sense if
> you know the current state. There's also some language polishing needed.

Check.

>> @@ -450,14 +450,21 @@ ginbuildempty(PG_FUNCTION_ARGS)
>>   START_CRIT_SECTION();
>>   GinInitMetabuffer(MetaBuffer);
>>   MarkBufferDirty(MetaBuffer);
>> - log_newpage_buffer(MetaBuffer, false);
>> + log_newpage_buffer(MetaBuffer, false, false);
>>   GinInitBuffer(RootBuffer, GIN_LEAF);
>>   MarkBufferDirty(RootBuffer);
>> - log_newpage_buffer(RootBuffer, false);
>> + log_newpage_buffer(RootBuffer, false, true);
>>   END_CRIT_SECTION();
>>
> Why isn't the metapage flushed to disk?

I was not sure if we should only flush only at the last page, as pages
just before would be already replayed. Switched.

>> --- a/src/backend/access/spgist/spginsert.c
>> +++ b/src/backend/access/spgist/spginsert.c
>> @@ -173,7 +173,7 @@ spgbuildempty(PG_FUNCTION_ARGS)
>> (char *) page, true);
>>   if (XLogIsNeeded())
>>   log_newpage(>rd_smgr->smgr_rnode.node, INIT_FORKNUM,
>> - SPGIST_METAPAGE_BLKNO, page, false);
>> + SPGIST_METAPAGE_BLKNO, page, false, 
>> false);
>>
>>   /* Likewise for the root page. */
>>   SpGistInitPage(page, SPGIST_LEAF);
>> @@ -183,7 +183,7 @@ spgbuildempty(PG_FUNCTION_ARGS)
>> (char *) page, true);
>>   if (XLogIsNeeded())
>>   log_newpage(>rd_smgr->smgr_rnode.node, INIT_FORKNUM,
>> - SPGIST_ROOT_BLKNO, page, true);
>> + SPGIST_ROOT_BLKNO, page, true, false);
>>
>
> I don't think it's correct to not flush in these cases. Yes, the primary
> does an smgrimmedsync() - but that's not 

Re: [HACKERS] Error with index on unlogged table

2015-12-04 Thread Michael Paquier
On Fri, Dec 4, 2015 at 5:10 PM, Andres Freund  wrote:
> On 2015-12-04 17:00:13 +0900, Michael Paquier wrote:
>> Andres Freud wrote:
>> >>  extern void InitXLogInsert(void);
>> >> diff --git a/src/include/catalog/pg_control.h 
>> >> b/src/include/catalog/pg_control.h
>> >> index ad1eb4b..91445f1 100644
>> >> --- a/src/include/catalog/pg_control.h
>> >> +++ b/src/include/catalog/pg_control.h
>> >> @@ -73,6 +73,7 @@ typedef struct CheckPoint
>> >>  #define XLOG_END_OF_RECOVERY 0x90
>> >>  #define XLOG_FPI_FOR_HINT0xA0
>> >>  #define XLOG_FPI 0xB0
>> >> +#define XLOG_FPI_FOR_SYNC0xC0
>> >
>> >
>> > I'm not a big fan of the XLOG_FPI_FOR_SYNC name. Syncing is a bit too
>> > ambigous for my taste. How about either naming it XLOG_FPI_FLUSH or
>> > instead adding actual record data and a 'flags' field in there? I
>> > slightly prefer the latter - XLOG_FPI and XLOG_FPI_FOR_HINT really are
>> > different, XLOG_FPI_FOR_SYNC not so much.
>>
>> Let's go for XLOG_FPI_FLUSH.
>
> I think the other way is a bit better, because we can add new flags
> without changing the WAL format.

Hm. On the contrary, I think that it would make more sense to have a
flag as well for FOR_HINT honestly, those are really the same
operations, and FOR_HINT is just here for statistic purposes.

>> diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
>> index 99337b0..b646101 100644
>> --- a/src/backend/access/brin/brin.c
>> +++ b/src/backend/access/brin/brin.c
>> @@ -682,7 +682,15 @@ brinbuildempty(PG_FUNCTION_ARGS)
>>   brin_metapage_init(BufferGetPage(metabuf), BrinGetPagesPerRange(index),
>>  BRIN_CURRENT_VERSION);
>>   MarkBufferDirty(metabuf);
>> - log_newpage_buffer(metabuf, false);
>> +
>> + /*
>> +  * For unlogged relations, this page should be immediately flushed
>> +  * to disk after being replayed. This is necessary to ensure that the
>> +  * initial on-disk state of unlogged relations is preserved when
>> +  * they get reset at the end of recovery.
>> +  */
>> + log_newpage_buffer(metabuf, false,
>> + index->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED);
>>   END_CRIT_SECTION();
>
> Maybe write the last sentence as '... as the on disk files are copied
> directly at the end of recovery.'?

Check.

>> @@ -336,7 +336,8 @@ end_heap_rewrite(RewriteState state)
>>   MAIN_FORKNUM,
>>   state->rs_blockno,
>>   state->rs_buffer,
>> - true);
>> + true,
>> + false);
>>   RelationOpenSmgr(state->rs_new_rel);
>>
>>   PageSetChecksumInplace(state->rs_buffer, state->rs_blockno);
>> @@ -685,7 +686,8 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
>>   MAIN_FORKNUM,
>>   state->rs_blockno,
>>   page,
>> - true);
>> + true,
>> + false);
>
> Did you verify that that's ok when a unlogged table is clustered/vacuum
> full'ed?

Yep.

>> @@ -181,6 +183,9 @@ xlog_identify(uint8 info)
>>   case XLOG_FPI_FOR_HINT:
>>   id = "FPI_FOR_HINT";
>>   break;
>> + case XLOG_FPI_FLUSH:
>> + id = "FPI_FOR_SYNC";
>> + break;
>>   }
>
> Old string.

Yeah, that's now completely removed.

>> --- a/src/backend/access/transam/xloginsert.c
>> +++ b/src/backend/access/transam/xloginsert.c
>> @@ -932,10 +932,13 @@ XLogSaveBufferForHint(Buffer buffer, bool buffer_std)
>>   * If the page follows the standard page layout, with a PageHeader and 
>> unused
>>   * space between pd_lower and pd_upper, set 'page_std' to TRUE. That allows
>>   * the unused space to be left out from the WAL record, making it smaller.
>> + *
>> + * If 'is_flush' is set to TRUE, relation will be requested to flush
>> + * immediately its data at replay after replaying this full page image.
>>   */
>
> s/is_flush/flush_immed/? And maybe say that it 'will be flushed to the
> OS immediately after replaying the record'?

s/OS/stable storage?
-- 
Michael
From a25b938d39fe735cf2c4c46a5c54db762510220c Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Fri, 4 Dec 2015 16:58:23 +0900
Subject: [PATCH] Ensure consistent on-disk state of UNLOGGED indexes at
 recovery

Unlogged relation indexes need to have a consistent initial state on-disk

Re: [HACKERS] Error with index on unlogged table

2015-12-03 Thread Andres Freund
On 2015-11-27 16:59:20 +0900, Michael Paquier wrote:
> Attached is a patch that fixes the issue for me in master and 9.5.
> Actually in the last patch I forgot a call to smgrwrite to ensure that
> the INIT_FORKNUM is correctly synced to disk when those pages are
> replayed at recovery, letting the reset routines for unlogged
> relations do their job correctly. I have noticed as well that we need
> to do the same for gin and brin relations. In this case I think that
> we could limit the flush to unlogged relations, my patch does it
> unconditionally though to generalize the logic. Thoughts?

I think it's a good idea to limit this to unlogged relations. For a
dataset that fits into shared_buffers and creates many relations, the
additional disk writes could be noticeable.

> diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
> index 99337b0..d7964ac 100644
> --- a/src/backend/access/brin/brin.c
> +++ b/src/backend/access/brin/brin.c
> @@ -682,7 +682,12 @@ brinbuildempty(PG_FUNCTION_ARGS)
>   brin_metapage_init(BufferGetPage(metabuf), BrinGetPagesPerRange(index),
>  BRIN_CURRENT_VERSION);
>   MarkBufferDirty(metabuf);
> - log_newpage_buffer(metabuf, false);
> + /*
> +  * When this full page image is replayed, there is no guarantee that
> +  * this page will be present to disk when replayed particularly for
> +  * unlogged relations, hence enforce it to be flushed to disk.
> +  */

The grammar seems a bit off here.

> + /*
> +  * Initialize and xlog metabuffer and root buffer. When those full
> +  * pages are replayed, it is not guaranteed that those relation
> +  * init forks will be flushed to disk after replaying them, hence
> +  * enforce those pages to be flushed to disk at replay, only the
> +  * last record will enforce a flush for performance reasons and
> +  * because it is actually unnecessary to do it multiple times.
> +  */

That comment needs some love too. I think it really only makes sense if
you know the current state. There's also some language polishing needed.

> @@ -450,14 +450,21 @@ ginbuildempty(PG_FUNCTION_ARGS)
>   START_CRIT_SECTION();
>   GinInitMetabuffer(MetaBuffer);
>   MarkBufferDirty(MetaBuffer);
> - log_newpage_buffer(MetaBuffer, false);
> + log_newpage_buffer(MetaBuffer, false, false);
>   GinInitBuffer(RootBuffer, GIN_LEAF);
>   MarkBufferDirty(RootBuffer);
> - log_newpage_buffer(RootBuffer, false);
> + log_newpage_buffer(RootBuffer, false, true);
>   END_CRIT_SECTION();
>
Why isn't the metapage flushed to disk?

> --- a/src/backend/access/spgist/spginsert.c
> +++ b/src/backend/access/spgist/spginsert.c
> @@ -173,7 +173,7 @@ spgbuildempty(PG_FUNCTION_ARGS)
> (char *) page, true);
>   if (XLogIsNeeded())
>   log_newpage(>rd_smgr->smgr_rnode.node, INIT_FORKNUM,
> - SPGIST_METAPAGE_BLKNO, page, false);
> + SPGIST_METAPAGE_BLKNO, page, false, 
> false);
>  
>   /* Likewise for the root page. */
>   SpGistInitPage(page, SPGIST_LEAF);
> @@ -183,7 +183,7 @@ spgbuildempty(PG_FUNCTION_ARGS)
> (char *) page, true);
>   if (XLogIsNeeded())
>   log_newpage(>rd_smgr->smgr_rnode.node, INIT_FORKNUM,
> - SPGIST_ROOT_BLKNO, page, true);
> + SPGIST_ROOT_BLKNO, page, true, false);
>

I don't think it's correct to not flush in these cases. Yes, the primary
does an smgrimmedsync() - but that's not done on the standby.

> @@ -9392,9 +9396,28 @@ xlog_redo(XLogReaderState *record)
>* when checksums are enabled. There is no difference in 
> handling
>* XLOG_FPI and XLOG_FPI_FOR_HINT records, they use a different 
> info
>* code just to distinguish them for statistics purposes.
> +  *
> +  * XLOG_FPI_FOR_SYNC records are generated when a relation 
> needs to
> +  * be sync'ed just after replaying a full page. This is 
> important
> +  * to match the master behavior in the case where a page is 
> written
> +  * directly to disk without going through shared buffers, 
> particularly
> +  * when writing init forks for index relations.
>*/

How about

FPI_FOR_SYNC records indicate that the page immediately needs to be
written to disk, not just to shared buffers. This is important if the
on-disk state is to be the authoritative, not the state in shared
buffers. E.g. because on-disk files may later be copied directly.

> diff --git a/src/backend/commands/tablecmds.c 
> b/src/backend/commands/tablecmds.c
> index 0ddde72..eb22a51 100644
> --- a/src/backend/commands/tablecmds.c
> +++ b/src/backend/commands/tablecmds.c
> @@ -9920,7 +9920,8 @@ copy_relation_data(SMgrRelation src, 

Re: [HACKERS] Error with index on unlogged table

2015-12-03 Thread Andres Freund
Hi,

On 2015-11-20 16:11:15 +0900, Michael Paquier wrote:
> diff --git a/src/backend/access/transam/xlog.c 
> b/src/backend/access/transam/xlog.c
> index cc845d2..4883697 100644
> --- a/src/backend/access/transam/xlog.c
> +++ b/src/backend/access/transam/xlog.c
> @@ -9503,6 +9503,14 @@ xlog_redo(XLogRecPtr lsn, XLogRecord *record)
>   data += sizeof(BkpBlock);
>  
>   RestoreBackupBlockContents(lsn, bkpb, data, false, false);
> +
> + if (bkpb.fork == INIT_FORKNUM)
> + {
> + SMgrRelation srel;
> + srel = smgropen(bkpb.node, InvalidBackendId);
> + smgrimmedsync(srel, INIT_FORKNUM);
> + smgrclose(srel);
> + }
>   }
>   else if (info == XLOG_BACKUP_END)
>   {

A smgrwrite() instead of a smgrimmedsync() should be sufficient here.

- Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Error with index on unlogged table

2015-12-01 Thread Michael Paquier
On Tue, Dec 1, 2015 at 3:06 PM, Kyotaro HORIGUCHI
 wrote:
> At Tue, 1 Dec 2015 11:53:35 +0900, Michael Paquier 
>  wrote in 
> 
>> On Tue, Dec 1, 2015 at 11:11 AM, Kyotaro HORIGUCHI
>>  wrote:
>> > Hello, I studied your latest patch.
>>
>> Thanks!
>>
>> > I feel quite uncomfortable that it solves the problem from a kind
>> > of nature of unlogged object by arbitrary flagging which is not
>> > fully corresponds to the nature. If we can deduce the necessity
>> > of fsync from some nature, it would be preferable.
>>
>> INIT_FORKNUM is not something only related to unlogged relations,
>> indexes use them as well. And that's actually
>> If you look at for example BRIN indexes that do not sync immediately
>> their INIT_FORKNUM when index is created, I think that we still are
>> going to need a new flag to control the sync at WAL replay because
>> startup process cannot know a relation's persistence, thing that we
>> can know when the XLOG_FPI record is created. For BRIN indexes, we
>> want particularly to not sync the INIT_FORKNUM when the relation is
>> not an unlogged one.
>
> (The comment added in brinbuildempty looks wrong since it
>  actually doesn't fsync it immediately..)
>
> Hmm, I've already seen that, and having your explanation I wonder
> why brinbuidempty issues WAL for what is not necessary to be
> persistent at the mement. Isn't it breaking agreements about
> Write Ahead Log? INIT_FORKNUM and unconditionally fsync'ing would
> be equally tied excluding the anormally about WAL. (Except for
> succeeding newpages.)

Alvaro, your thoughts regarding those lines? When building an empty
INIT_FORKNUM for a brin index its data is saved into a shared buffer
and not immediately synced into disk. Shouldn't that be necessary for
at least unlogged relations?

>> > In short, it seems to me that the reason to choose using
>> > XLOG_FPI_FOR_SYNC here is only performance of processing
>> > successive FPIs for INIT_FORKNUM.
>>
>> Yeah, there is a one-way link between this WAL record a INIT_FORKNUM.
>> However please note that having a INIT_FORKNUM does not always imply
>> that a sync is wanted. copy_relation_data is an example of that.
>
> As I wrote above, I suppose we should fix(?) the irregular
> relationship between WAL and init fork of brin and so.

Yep.

>> > INIT_FORKNUM is generated only for unlogged tables and their
>> > belongings. I suppose such successive fsyncs doesn't cause
>> > observable performance drop assuming that the number of unlogged
>> > tables and belongings is not so high, especially with smarter
>> > storages. All we should do is that just fsync only for
>> > INIT_FORKNUM's FPIs for the case. If the performance does matter
>> > even so, we still can fsync the last md-file when any wal record
>> > other than FPI for INIT_FORK comes. (But this would be a bit
>> > complex..)
>>
>> Hm. If a system uses a bunch of temporary relations with brin index or
>> other included I would not say so. For back branches we may have to do
>> it unconditionally using INIT_FORKNUM, but having a control flag to
>> have it only done for unlogged relations would leverage that.
>
> It could, and should do so. And if we take such systems with
> bunch of temp relations as significant (I agree with this),
> XLogRegisterBlock() looks to be able to register multiple blocks
> into single wal record and we could eliminate arbitrary flagging
> on individual FPI records using it. Is it possible?

I thought about using a BKPBLOCK flag but all of them are already
taken if that's what you meant. it seems cheaper to do that a record
level...
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Error with index on unlogged table

2015-11-30 Thread Kyotaro HORIGUCHI
Hello, I studied your lastest patch.

At Fri, 27 Nov 2015 16:59:20 +0900, Michael Paquier  
wrote in 

Re: [HACKERS] Error with index on unlogged table

2015-11-30 Thread Michael Paquier
On Tue, Dec 1, 2015 at 11:11 AM, Kyotaro HORIGUCHI
 wrote:
> Hello, I studied your latest patch.

Thanks!

> I feel quite uncomfortable that it solves the problem from a kind
> of nature of unlogged object by arbitrary flagging which is not
> fully corresponds to the nature. If we can deduce the necessity
> of fsync from some nature, it would be preferable.

INIT_FORKNUM is not something only related to unlogged relations,
indexes use them as well. And that's actually
If you look at for example BRIN indexes that do not sync immediately
their INIT_FORKNUM when index is created, I think that we still are
going to need a new flag to control the sync at WAL replay because
startup process cannot know a relation's persistence, thing that we
can know when the XLOG_FPI record is created. For BRIN indexes, we
want particularly to not sync the INIT_FORKNUM when the relation is
not an unlogged one.

> In the current patch, is_sync for log_newpage is generally true
> for and only for INIT_FORKNUM pages. Exceptions as far as I can
> see are,

Yep, that's not always the case.

> copy_relation_data: called with arbitrary forknum but it doesn't
>set is_fsync even for coying INIT_FORKNUM. (Is this not a
>problem?)

Oh. And actually, it seems that use_wal is broken there as well. I
think that we would still want to issue a XLOG_FPI record for an
unlogged relation should it be moved to a new tablespace to copy its
INIT_FORKNUM correctly to its new home.
After moving an unlogged relation to a new tablespace, and after
promoting a standby the standby fails to start because of this error:
FATAL:  could not create file
"pg_tblspc/16388/PG_9.6_201511071/16384/16389": File exists
This could be fixed separately though.

> spgbuildempty, ginbuildempty: these emits two or three newpage
>   logs at once so only the last one is set is_fsync for
>   performance reason.

It doesn't matter to fsync just at the last one. Each one of them
would be replayed either way, the last one triggering the sync, no?

> In short, it seems to me that the reason to choose using
> XLOG_FPI_FOR_SYNC here is only performance of processing
> successive FPIs for INIT_FORKNUM.

Yeah, there is a one-way link between this WAL record a INIT_FORKNUM.
However please note that having a INIT_FORKNUM does not always imply
that a sync is wanted. copy_relation_data is an example of that.

> INIT_FORKNUM is generated only for unlogged tables and their
> belongings. I suppose such successive fsyncs doesn't cause
> observable performance drop assuming that the number of unlogged
> tables and belongings is not so high, especially with smarter
> storages. All we should do is that just fsync only for
> INIT_FORKNUM's FPIs for the case. If the performance does matter
> even so, we still can fsync the last md-file when any wal record
> other than FPI for INIT_FORK comes. (But this would be a bit
> complex..)

Hm. If a system uses a bunch of temporary relations with brin index or
other included I would not say so. For back branches we may have to do
it unconditionally using INIT_FORKNUM, but having a control flag to
have it only done for unlogged relations would leverage that.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Error with index on unlogged table

2015-11-30 Thread Kyotaro HORIGUCHI
Hello, 

At Tue, 1 Dec 2015 11:53:35 +0900, Michael Paquier  
wrote in 
> On Tue, Dec 1, 2015 at 11:11 AM, Kyotaro HORIGUCHI
>  wrote:
> > Hello, I studied your latest patch.
> 
> Thanks!
> 
> > I feel quite uncomfortable that it solves the problem from a kind
> > of nature of unlogged object by arbitrary flagging which is not
> > fully corresponds to the nature. If we can deduce the necessity
> > of fsync from some nature, it would be preferable.
> 
> INIT_FORKNUM is not something only related to unlogged relations,
> indexes use them as well. And that's actually
> If you look at for example BRIN indexes that do not sync immediately
> their INIT_FORKNUM when index is created, I think that we still are
> going to need a new flag to control the sync at WAL replay because
> startup process cannot know a relation's persistence, thing that we
> can know when the XLOG_FPI record is created. For BRIN indexes, we
> want particularly to not sync the INIT_FORKNUM when the relation is
> not an unlogged one.

(The comment added in brinbuildempty looks wrong since it
 actually doesn't fsync it immediately..)

Hmm, I've already seen that, and having your explanation I wonder
why brinbuidempty issues WAL for what is not necessary to be
persistent at the mement. Isn't it breaking agreements about
Write Ahead Log? INIT_FORKNUM and unconditionally fsync'ing would
be equally tied excluding the anormally about WAL. (Except for
succeeding newpages.)

> > In the current patch, is_sync for log_newpage is generally true
> > for and only for INIT_FORKNUM pages. Exceptions as far as I can
> > see are,
> 
> Yep, that's not always the case.
> 
> > copy_relation_data: called with arbitrary forknum but it doesn't
> >set is_fsync even for coying INIT_FORKNUM. (Is this not a
> >problem?)
> 
> Oh. And actually, it seems that use_wal is broken there as well. I
> think that we would still want to issue a XLOG_FPI record for an
> unlogged relation should it be moved to a new tablespace to copy its
> INIT_FORKNUM correctly to its new home.

Agreed.

> After moving an unlogged relation to a new tablespace, and after
> promoting a standby the standby fails to start because of this error:
> FATAL:  could not create file
> "pg_tblspc/16388/PG_9.6_201511071/16384/16389": File exists
> This could be fixed separately though.
> 
> > spgbuildempty, ginbuildempty: these emits two or three newpage
> >   logs at once so only the last one is set is_fsync for
> >   performance reason.
> 
> It doesn't matter to fsync just at the last one. Each one of them
> would be replayed either way, the last one triggering the sync, no?

Yes, it does (except for some unreal case below). It was just a
confirmation and I didn't see this as a problem at all. Sorry for the
ambiguous writing.

> > In short, it seems to me that the reason to choose using
> > XLOG_FPI_FOR_SYNC here is only performance of processing
> > successive FPIs for INIT_FORKNUM.
> 
> Yeah, there is a one-way link between this WAL record a INIT_FORKNUM.
> However please note that having a INIT_FORKNUM does not always imply
> that a sync is wanted. copy_relation_data is an example of that.

As I wrote above, I suppose we should fix(?) the irregular
relationship between WAL and init fork of brin and so.

> > INIT_FORKNUM is generated only for unlogged tables and their
> > belongings. I suppose such successive fsyncs doesn't cause
> > observable performance drop assuming that the number of unlogged
> > tables and belongings is not so high, especially with smarter
> > storages. All we should do is that just fsync only for
> > INIT_FORKNUM's FPIs for the case. If the performance does matter
> > even so, we still can fsync the last md-file when any wal record
> > other than FPI for INIT_FORK comes. (But this would be a bit
> > complex..)
> 
> Hm. If a system uses a bunch of temporary relations with brin index or
> other included I would not say so. For back branches we may have to do
> it unconditionally using INIT_FORKNUM, but having a control flag to
> have it only done for unlogged relations would leverage that.

It could, and should do so. And if we take such systems with
bunch of temp relations as significant (I agree with this),
XLogRegisterBlock() looks to be able to register multiple blocks
into single wal record and we could eliminate arbitrary flagging
on individual FPI records using it. Is it possible?

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Error with index on unlogged table

2015-11-27 Thread Michael Paquier
On Fri, Nov 27, 2015 at 3:42 PM, Michael Paquier wrote:
> I am still investigating for a correct fix, looking at reinit.c the
> code in charge of copying the init fork as the main fork for a
> relation at the end of recovery looks to be doing its job correctly...

Attached is a patch that fixes the issue for me in master and 9.5.
Actually in the last patch I forgot a call to smgrwrite to ensure that
the INIT_FORKNUM is correctly synced to disk when those pages are
replayed at recovery, letting the reset routines for unlogged
relations do their job correctly. I have noticed as well that we need
to do the same for gin and brin relations. In this case I think that
we could limit the flush to unlogged relations, my patch does it
unconditionally though to generalize the logic. Thoughts?
-- 
Michael
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 99337b0..d7964ac 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -682,7 +682,12 @@ brinbuildempty(PG_FUNCTION_ARGS)
 	brin_metapage_init(BufferGetPage(metabuf), BrinGetPagesPerRange(index),
 	   BRIN_CURRENT_VERSION);
 	MarkBufferDirty(metabuf);
-	log_newpage_buffer(metabuf, false);
+	/*
+	 * When this full page image is replayed, there is no guarantee that
+	 * this page will be present to disk when replayed particularly for
+	 * unlogged relations, hence enforce it to be flushed to disk.
+	 */
+	log_newpage_buffer(metabuf, false, true);
 	END_CRIT_SECTION();
 
 	UnlockReleaseBuffer(metabuf);
diff --git a/src/backend/access/brin/brin_pageops.c b/src/backend/access/brin/brin_pageops.c
index f876f62..572fe20 100644
--- a/src/backend/access/brin/brin_pageops.c
+++ b/src/backend/access/brin/brin_pageops.c
@@ -865,7 +865,7 @@ brin_initialize_empty_new_buffer(Relation idxrel, Buffer buffer)
 	page = BufferGetPage(buffer);
 	brin_page_init(page, BRIN_PAGETYPE_REGULAR);
 	MarkBufferDirty(buffer);
-	log_newpage_buffer(buffer, true);
+	log_newpage_buffer(buffer, true, false);
 	END_CRIT_SECTION();
 
 	/*
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 49e9185..755c983 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -450,14 +450,21 @@ ginbuildempty(PG_FUNCTION_ARGS)
 		ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(RootBuffer, BUFFER_LOCK_EXCLUSIVE);
 
-	/* Initialize and xlog metabuffer and root buffer. */
+	/*
+	 * Initialize and xlog metabuffer and root buffer. When those full
+	 * pages are replayed, it is not guaranteed that those relation
+	 * init forks will be flushed to disk after replaying them, hence
+	 * enforce those pages to be flushed to disk at replay, only the
+	 * last record will enforce a flush for performance reasons and
+	 * because it is actually unnecessary to do it multiple times.
+	 */
 	START_CRIT_SECTION();
 	GinInitMetabuffer(MetaBuffer);
 	MarkBufferDirty(MetaBuffer);
-	log_newpage_buffer(MetaBuffer, false);
+	log_newpage_buffer(MetaBuffer, false, false);
 	GinInitBuffer(RootBuffer, GIN_LEAF);
 	MarkBufferDirty(RootBuffer);
-	log_newpage_buffer(RootBuffer, false);
+	log_newpage_buffer(RootBuffer, false, true);
 	END_CRIT_SECTION();
 
 	/* Unlock and release the buffers. */
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 53bccf6..fd53268 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -84,7 +84,7 @@ gistbuildempty(PG_FUNCTION_ARGS)
 	START_CRIT_SECTION();
 	GISTInitBuffer(buffer, F_LEAF);
 	MarkBufferDirty(buffer);
-	log_newpage_buffer(buffer, true);
+	log_newpage_buffer(buffer, true, true);
 	END_CRIT_SECTION();
 
 	/* Unlock and release the buffer */
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 6a6fc3b..e9a9a8f 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -336,7 +336,8 @@ end_heap_rewrite(RewriteState state)
 		MAIN_FORKNUM,
 		state->rs_blockno,
 		state->rs_buffer,
-		true);
+		true,
+		false);
 		RelationOpenSmgr(state->rs_new_rel);
 
 		PageSetChecksumInplace(state->rs_buffer, state->rs_blockno);
@@ -685,7 +686,8 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
 			MAIN_FORKNUM,
 			state->rs_blockno,
 			page,
-			true);
+			true,
+			false);
 
 			/*
 			 * Now write the page. We say isTemp = true even if it's not a
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index cf4a6dc..d211a98 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -206,7 +206,7 @@ btbuildempty(PG_FUNCTION_ARGS)
 			  (char *) metapage, true);
 	if (XLogIsNeeded())
 		log_newpage(>rd_smgr->smgr_rnode.node, INIT_FORKNUM,
-	BTREE_METAPAGE, metapage, false);
+	BTREE_METAPAGE, metapage, false, true);
 
 	/*
 	 * An immediate sync is required even if we xlog'd the page, because 

Re: [HACKERS] Error with index on unlogged table

2015-11-26 Thread Michael Paquier
On Sat, Nov 21, 2015 at 11:30 PM, Michael Paquier
 wrote:
> On Fri, Nov 20, 2015 at 4:11 PM, Michael Paquier
>  wrote:
>> For master and perhaps 9.5, I would think that a dedicated WAL record
>> type enforcing the fsync is the way to go. This special treatment
>> would go only for btree and spgist when they use INIT_FORKNUM and we
>> would not impact other relation types and code paths with this
>> behavior.
>
> So, I have been looking at that, and finished with the attached to
> implement this idea. This way, it is possible to control at replay if
> a relation should be synced to disk just after replaying a FPI or a
> set of FPIs. This makes btree and spgist init forks more consistent at
> replay with what is done on master by syncing them immediately, which
> is not a bad thing to me.

Depending on the type of index used on an unlogged table, the failure
happening is quite various. With gist and btree, a promoted standby
will complain about an empty page. With brin, the standby will
complain with a floating-point error:
ERROR:  22P01: floating-point exception
DETAIL:  An invalid floating-point operation was signaled. This
probably means an out-of-range result or an invalid operation, such as
division by zero.
LOCATION:  FloatExceptionHandler, postgres.c:2702

Now with gin, the system becomes actually unresponsive, being stuck on
PGSemaphoreLock and unable to answer to signals:
frame #1: 0x00010a29f7cf
postgres`PGSemaphoreLock(sema=0x0001138df118) + 63 at
pg_sema.c:387
frame #2: 0x00010a348e81
postgres`LWLockAcquire(lock=0x00010acc5dc0, mode=LW_EXCLUSIVE) +
369 at lwlock.c:1026
frame #3: 0x00010a3190b1 postgres`LockBuffer(buffer=208,
mode=2) + 289 at bufmgr.c:3240
frame #4: 0x000109f341e1
postgres`ginHeapTupleFastInsert(ginstate=0x7fff55d06a08,
collector=0x7fff55d069d8) + 849 at ginfast.c:309
frame #5: 0x000109f1383f
postgres`gininsert(fcinfo=0x7fff55d09010) + 431 at gininsert.c:531

I am still investigating for a correct fix, looking at reinit.c the
code in charge of copying the init fork as the main fork for a
relation at the end of recovery looks to be doing its job correctly...
Regards,
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Error with index on unlogged table

2015-11-21 Thread Michael Paquier
On Fri, Nov 20, 2015 at 4:11 PM, Michael Paquier
 wrote:
> For master and perhaps 9.5, I would think that a dedicated WAL record
> type enforcing the fsync is the way to go. This special treatment
> would go only for btree and spgist when they use INIT_FORKNUM and we
> would not impact other relation types and code paths with this
> behavior.

So, I have been looking at that, and finished with the attached to
implement this idea. This way, it is possible to control at replay if
a relation should be synced to disk just after replaying a FPI or a
set of FPIs. This makes btree and spgist init forks more consistent at
replay with what is done on master by syncing them immediately, which
is not a bad thing to me.

Now, and actually my last email has been misleading as well, this
patch as well as the previous patch I sent for ~9.4 do not actually
fix the initialization of indexes for unlogged tables after promoting
a standby. Hence I guess that we are still missing a trick when
reinitializing those relations at the end of recovery. It is a bit
late here so I am attaching a patch reflecting the progress I have
done. Comments are welcome.
-- 
Michael


20151121_xlog_fpi_replay_master_95.patch
Description: binary/octet-stream

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Error with index on unlogged table

2015-11-19 Thread Robert Haas
On Thu, Nov 19, 2015 at 7:19 AM, Thom Brown  wrote:
> On 27 March 2015 at 04:54, Kyotaro HORIGUCHI
>  wrote:
>> Hello,
>>
>> At Thu, 26 Mar 2015 18:50:24 +0100, Andres Freund  
>> wrote in <20150326175024.gj...@alap3.anarazel.de>
>>> I think the problem here is that the *primary* makes no such
>>> assumptions. Init forks are logged via stuff like
>>>   smgrwrite(index->rd_smgr, INIT_FORKNUM, BTREE_METAPAGE,
>> ..
>>> i.e. the data is written out directly to disk, circumventing
>>> shared_buffers. It's pretty bad that we don't do the same on the
>>> standby. For master I think we should just add a bit to the XLOG_FPI
>>> record saying the data should be forced out to disk. I'm less sure
>>> what's to be done in the back branches. Flushing every HEAP_NEWPAGE
>>> record isn't really an option.
>>
>> The problem exists only for INIT_FORKNUM. So I suppose it is
>> enough to check forknum to decide whether to sync immediately.
>>
>> Specifically for this instance, syncing buffers of INIT_FORKNUM
>> at the end of XLOG_FPI block in xlog_redo fixed the problem.
>>
>> The another (ugly!) solution sould be syncing only buffers for
>> INIT_FORKNUM and is BM_DIRTY in ResetUnlogggedRelations(op =
>> UNLOGGED_RELATION_INIT). This is catching-all-at-once solution
>> though it is a kind of reversion of fast promotion. But buffers
>> to be synced here should be pretty few.
>
> This bug still exists.

Hmm.  This probably should have been on the open items list.  I didn't
pay too much attention this to this before because it seemed like
Andres and Michael were all over it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Error with index on unlogged table

2015-11-19 Thread Thom Brown
On 27 March 2015 at 04:54, Kyotaro HORIGUCHI
 wrote:
> Hello,
>
> At Thu, 26 Mar 2015 18:50:24 +0100, Andres Freund  
> wrote in <20150326175024.gj...@alap3.anarazel.de>
>> I think the problem here is that the *primary* makes no such
>> assumptions. Init forks are logged via stuff like
>>   smgrwrite(index->rd_smgr, INIT_FORKNUM, BTREE_METAPAGE,
> ..
>> i.e. the data is written out directly to disk, circumventing
>> shared_buffers. It's pretty bad that we don't do the same on the
>> standby. For master I think we should just add a bit to the XLOG_FPI
>> record saying the data should be forced out to disk. I'm less sure
>> what's to be done in the back branches. Flushing every HEAP_NEWPAGE
>> record isn't really an option.
>
> The problem exists only for INIT_FORKNUM. So I suppose it is
> enough to check forknum to decide whether to sync immediately.
>
> Specifically for this instance, syncing buffers of INIT_FORKNUM
> at the end of XLOG_FPI block in xlog_redo fixed the problem.
>
> The another (ugly!) solution sould be syncing only buffers for
> INIT_FORKNUM and is BM_DIRTY in ResetUnlogggedRelations(op =
> UNLOGGED_RELATION_INIT). This is catching-all-at-once solution
> though it is a kind of reversion of fast promotion. But buffers
> to be synced here should be pretty few.

This bug still exists.

Thom


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Error with index on unlogged table

2015-11-19 Thread Michael Paquier
On Fri, Nov 20, 2015 at 7:03 AM, Robert Haas  wrote:
> On Thu, Nov 19, 2015 at 7:19 AM, Thom Brown  wrote:
>> This bug still exists.
>
> Hmm.  This probably should have been on the open items list.

I have added an open item for 9.5. That's not a cool bug to release
the next GA even if that's not limited to 9.5, it is easily
reproducible in older stable branches as well.

> I didn't
> pay too much attention this to this before because it seemed like
> Andres and Michael were all over it.

This completely fell off my radar. Sorry about that.

For back-branches, I tend to agree with what Horiguchi-san mentioned
upthread: we had better issue an unconditional fsync on the relation
when INIT_FORKNUM is used for it when replaying XLOG_FPI record. That
would rather localize the impact. An example of patch is attached that
applies on top of REL9_4_STABLE. That's rather ugly but it shows the
idea and fixes the issue.

For master and perhaps 9.5, I would think that a dedicated WAL record
type enforcing the fsync is the way to go. This special treatment
would go only for btree and spgist when they use INIT_FORKNUM and we
would not impact other relation types and code paths with this
behavior.

Thoughts?
-- 
Michael


20151120_xlog_fpi_replay.patch
Description: binary/octet-stream

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Error with index on unlogged table

2015-03-26 Thread Kyotaro HORIGUCHI
Hello,

At Thu, 26 Mar 2015 18:50:24 +0100, Andres Freund and...@2ndquadrant.com 
wrote in 20150326175024.gj...@alap3.anarazel.de
 I think the problem here is that the *primary* makes no such
 assumptions. Init forks are logged via stuff like
   smgrwrite(index-rd_smgr, INIT_FORKNUM, BTREE_METAPAGE,
..
 i.e. the data is written out directly to disk, circumventing
 shared_buffers. It's pretty bad that we don't do the same on the
 standby. For master I think we should just add a bit to the XLOG_FPI
 record saying the data should be forced out to disk. I'm less sure
 what's to be done in the back branches. Flushing every HEAP_NEWPAGE
 record isn't really an option.

The problem exists only for INIT_FORKNUM. So I suppose it is
enough to check forknum to decide whether to sync immediately.

Specifically for this instance, syncing buffers of INIT_FORKNUM
at the end of XLOG_FPI block in xlog_redo fixed the problem.

The another (ugly!) solution sould be syncing only buffers for
INIT_FORKNUM and is BM_DIRTY in ResetUnlogggedRelations(op =
UNLOGGED_RELATION_INIT). This is catching-all-at-once solution
though it is a kind of reversion of fast promotion. But buffers
to be synced here should be pretty few.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Error with index on unlogged table

2015-03-26 Thread Andres Freund
On 2015-03-26 15:13:41 +0100, Andres Freund wrote:
 On 2015-03-26 13:55:22 +, Thom Brown wrote:
  I still, however, have a problem with the separate and original issue of:
  
  # insert into utest (thing) values ('moomoo');
  ERROR:  index utest_pkey contains unexpected zero page at block 0
  HINT:  Please REINDEX it.
  
  I don't see why the user should need to go re-indexing all unlogged tables
  each time a standby is promoted.  The index should just be empty and ready
  to use.
 
 There's definitely something rather broken here. Investigating.

As far as I can see this has been broken at least since the introduction
of fast promotion. WAL replay will update the init fork in shared
memory, but it'll not be guaranteed to be flushed to disk when the reset
happens. d3586fc8a et al. then also made it possible to hit the issue
without fast promotion.

To hit the issue there may not be a restartpoint (requiring a checkpoint
on the primary) since the creation of the unlogged table.

I think the problem here is that the *primary* makes no such
assumptions. Init forks are logged via stuff like
smgrwrite(index-rd_smgr, INIT_FORKNUM, BTREE_METAPAGE,
  (char *) metapage, true);
if (XLogIsNeeded())
log_newpage(index-rd_smgr-smgr_rnode.node, INIT_FORKNUM,
BTREE_METAPAGE, metapage, false);

/*
 * An immediate sync is required even if we xlog'd the page, because the
 * write did not go through shared_buffers and therefore a concurrent
 * checkpoint may have moved the redo pointer past our xlog record.
 */
smgrimmedsync(index-rd_smgr, INIT_FORKNUM);

i.e. the data is written out directly to disk, circumventing
shared_buffers. It's pretty bad that we don't do the same on the
standby. For master I think we should just add a bit to the XLOG_FPI
record saying the data should be forced out to disk. I'm less sure
what's to be done in the back branches. Flushing every HEAP_NEWPAGE
record isn't really an option.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Error with index on unlogged table

2015-03-26 Thread Thom Brown
On 26 March 2015 at 00:55, Michael Paquier michael.paqu...@gmail.com
wrote:

 On Thu, Mar 26, 2015 at 1:02 AM, Fabrízio de Royes Mello
 fabriziome...@gmail.com wrote:
  On Wed, Mar 25, 2015 at 12:46 PM, Fabrízio de Royes Mello
  fabriziome...@gmail.com wrote:
 
 
  On Wed, Mar 25, 2015 at 10:53 AM, Andres Freund and...@2ndquadrant.com
 
  wrote:
  
 
   Did you check whether a similar bug was made in other places of
   85b506bb? Could you additionally add a regression test to this end?
   Seems like something worth testing.
  
 
  I'm checking it and adding some regression tests.
 
 
  I didn't found any other similar bug introduced by 85b506bb.
 
  Attached the original patch provided by Michael with some regression
 tests.

 Thanks for adding a test, this looks fine to me (did some sanity
 checks and tutti-quanti for people wondering). On temporary tables
 this was failing with an error in md.c...


Thanks to both of you for fixing this.

I still, however, have a problem with the separate and original issue of:

# insert into utest (thing) values ('moomoo');
ERROR:  index utest_pkey contains unexpected zero page at block 0
HINT:  Please REINDEX it.

I don't see why the user should need to go re-indexing all unlogged tables
each time a standby is promoted.  The index should just be empty and ready
to use.

-- 
Thom


Re: [HACKERS] Error with index on unlogged table

2015-03-26 Thread Andres Freund
On 2015-03-26 13:55:22 +, Thom Brown wrote:
 I still, however, have a problem with the separate and original issue of:
 
 # insert into utest (thing) values ('moomoo');
 ERROR:  index utest_pkey contains unexpected zero page at block 0
 HINT:  Please REINDEX it.
 
 I don't see why the user should need to go re-indexing all unlogged tables
 each time a standby is promoted.  The index should just be empty and ready
 to use.

There's definitely something rather broken here. Investigating.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Error with index on unlogged table

2015-03-25 Thread Amit Langote
On Wednesday, March 25, 2015, Michael Paquier michael.paqu...@gmail.com wrote:

 On Tue, Mar 24, 2015 at 8:46 PM, Thom Brown wrote:
  The index is unlogged until reindexing...
 
  [...]
  Which is think also raises the question, why are unlogged indexes made
  persistent by a reindex?

 That's a bug of HEAD, ~9.4 keeping the index as unlogged even after
 REINDEX INDEX. What happens is that ReindexIndex relies on
 relpersistence provided by makeRangeVar at parse time, which is just
 incorrect as it uses RELPERSISTENCE_PERMANENT all the time. The patch
 attached fixes that...


How about VACUUM FULL and CLUSTER as the problem seems to have been
reported to be there too?

Amit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Error with index on unlogged table

2015-03-25 Thread Andres Freund
Hi,

On 2015-03-25 11:38:30 +0900, Michael Paquier wrote:
 On Tue, Mar 24, 2015 at 8:46 PM, Thom Brown wrote:
  The index is unlogged until reindexing...
 
  [...]
  Which is think also raises the question, why are unlogged indexes made
  persistent by a reindex?
 
 That's a bug of HEAD, ~9.4 keeping the index as unlogged even after
 REINDEX INDEX. What happens is that ReindexIndex relies on
 relpersistence provided by makeRangeVar at parse time, which is just
 incorrect as it uses RELPERSISTENCE_PERMANENT all the time. The patch
 attached fixes that...

What the hell? That's somewhat nasty. Nice that it got caught before 9.5
was released.

Did you check whether a similar bug was made in other places of
85b506bb? Could you additionally add a regression test to this end?
Seems like something worth testing.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Error with index on unlogged table

2015-03-25 Thread Thom Brown
On 25 March 2015 at 12:22, Amit Langote amitlangot...@gmail.com wrote:

 On Wednesday, March 25, 2015, Michael Paquier michael.paqu...@gmail.com
 wrote:
 
  On Tue, Mar 24, 2015 at 8:46 PM, Thom Brown wrote:
   The index is unlogged until reindexing...
  
   [...]
   Which is think also raises the question, why are unlogged indexes made
   persistent by a reindex?
 
  That's a bug of HEAD, ~9.4 keeping the index as unlogged even after
  REINDEX INDEX. What happens is that ReindexIndex relies on
  relpersistence provided by makeRangeVar at parse time, which is just
  incorrect as it uses RELPERSISTENCE_PERMANENT all the time. The patch
  attached fixes that...
 

 How about VACUUM FULL and CLUSTER as the problem seems to have been
 reported to be there too?


No, those are okay.  They actually revert the index back to the same
persistence level as the table they're attached to.

-- 
Thom


Re: [HACKERS] Error with index on unlogged table

2015-03-25 Thread Amit Langote
On Wednesday, March 25, 2015, Thom Brown t...@linux.com wrote:

 On 25 March 2015 at 12:22, Amit Langote amitlangot...@gmail.com
 javascript:_e(%7B%7D,'cvml','amitlangot...@gmail.com'); wrote:

 On Wednesday, March 25, 2015, Michael Paquier michael.paqu...@gmail.com
 javascript:_e(%7B%7D,'cvml','michael.paqu...@gmail.com'); wrote:
 
  On Tue, Mar 24, 2015 at 8:46 PM, Thom Brown wrote:
   The index is unlogged until reindexing...
  
   [...]
   Which is think also raises the question, why are unlogged indexes made
   persistent by a reindex?
 
  That's a bug of HEAD, ~9.4 keeping the index as unlogged even after
  REINDEX INDEX. What happens is that ReindexIndex relies on
  relpersistence provided by makeRangeVar at parse time, which is just
  incorrect as it uses RELPERSISTENCE_PERMANENT all the time. The patch
  attached fixes that...
 

 How about VACUUM FULL and CLUSTER as the problem seems to have been
 reported to be there too?


 No, those are okay.  They actually revert the index back to the same
 persistence level as the table they're attached to.


Ah, I misread then; sorry about the noise.

Amit


Re: [HACKERS] Error with index on unlogged table

2015-03-25 Thread Fabrízio de Royes Mello
On Wed, Mar 25, 2015 at 10:53 AM, Andres Freund and...@2ndquadrant.com
wrote:

 Hi,

 On 2015-03-25 11:38:30 +0900, Michael Paquier wrote:
  On Tue, Mar 24, 2015 at 8:46 PM, Thom Brown wrote:
   The index is unlogged until reindexing...
  
   [...]
   Which is think also raises the question, why are unlogged indexes made
   persistent by a reindex?
 
  That's a bug of HEAD, ~9.4 keeping the index as unlogged even after
  REINDEX INDEX. What happens is that ReindexIndex relies on
  relpersistence provided by makeRangeVar at parse time, which is just
  incorrect as it uses RELPERSISTENCE_PERMANENT all the time. The patch
  attached fixes that...

 What the hell? That's somewhat nasty. Nice that it got caught before 9.5
 was released.


Unfortunately this is very nasty. Sorry!


 Did you check whether a similar bug was made in other places of
 85b506bb? Could you additionally add a regression test to this end?
 Seems like something worth testing.


I'm checking it and adding some regression tests.

Regards,

--
Fabrízio de Royes Mello
Consultoria/Coaching PostgreSQL
 Timbira: http://www.timbira.com.br
 Blog: http://fabriziomello.github.io
 Linkedin: http://br.linkedin.com/in/fabriziomello
 Twitter: http://twitter.com/fabriziomello
 Github: http://github.com/fabriziomello


Re: [HACKERS] Error with index on unlogged table

2015-03-25 Thread Fabrízio de Royes Mello
On Wed, Mar 25, 2015 at 12:46 PM, Fabrízio de Royes Mello 
fabriziome...@gmail.com wrote:


 On Wed, Mar 25, 2015 at 10:53 AM, Andres Freund and...@2ndquadrant.com
wrote:
 

  Did you check whether a similar bug was made in other places of
  85b506bb? Could you additionally add a regression test to this end?
  Seems like something worth testing.
 

 I'm checking it and adding some regression tests.


I didn't found any other similar bug introduced by 85b506bb.

Attached the original patch provided by Michael with some regression tests.

Regards,

--
Fabrízio de Royes Mello
Consultoria/Coaching PostgreSQL
 Timbira: http://www.timbira.com.br
 Blog: http://fabriziomello.github.io
 Linkedin: http://br.linkedin.com/in/fabriziomello
 Twitter: http://twitter.com/fabriziomello
 Github: http://github.com/fabriziomello
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 1c1d0da..1520d32 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1685,6 +1685,8 @@ ReindexIndex(RangeVar *indexRelation)
 {
 	Oid			indOid;
 	Oid			heapOid = InvalidOid;
+	Relation	irel;
+	char		relpersistence;
 
 	/* lock level used here should match index lock reindex_index() */
 	indOid = RangeVarGetRelidExtended(indexRelation, AccessExclusiveLock,
@@ -1692,7 +1694,11 @@ ReindexIndex(RangeVar *indexRelation)
 	  RangeVarCallbackForReindexIndex,
 	  (void *) heapOid);
 
-	reindex_index(indOid, false, indexRelation-relpersistence);
+	irel = index_open(indOid, AccessExclusiveLock);
+	relpersistence = irel-rd_rel-relpersistence;
+	index_close(irel, NoLock);
+
+	reindex_index(indOid, false, relpersistence);
 
 	return indOid;
 }
diff --git a/src/test/regress/expected/create_table.out b/src/test/regress/expected/create_table.out
index 34b5fc1..3214c19 100644
--- a/src/test/regress/expected/create_table.out
+++ b/src/test/regress/expected/create_table.out
@@ -208,6 +208,21 @@ CREATE TABLE IF NOT EXISTS test_tsvector(
 );
 NOTICE:  relation test_tsvector already exists, skipping
 CREATE UNLOGGED TABLE unlogged1 (a int primary key);			-- OK
+SELECT relname, relkind, relpersistence FROM pg_class WHERE relname ~ '^unlogged1' ORDER BY relname;
+relname | relkind | relpersistence 
++-+
+ unlogged1  | r   | u
+ unlogged1_pkey | i   | u
+(2 rows)
+
+REINDEX INDEX unlogged1_pkey;
+SELECT relname, relkind, relpersistence FROM pg_class WHERE relname ~ '^unlogged1' ORDER BY relname;
+relname | relkind | relpersistence 
++-+
+ unlogged1  | r   | u
+ unlogged1_pkey | i   | u
+(2 rows)
+
 INSERT INTO unlogged1 VALUES (42);
 CREATE UNLOGGED TABLE public.unlogged2 (a int primary key);		-- also OK
 CREATE UNLOGGED TABLE pg_temp.unlogged3 (a int primary key);	-- not OK
diff --git a/src/test/regress/sql/create_table.sql b/src/test/regress/sql/create_table.sql
index 08029a9..acb7eb8 100644
--- a/src/test/regress/sql/create_table.sql
+++ b/src/test/regress/sql/create_table.sql
@@ -246,6 +246,9 @@ CREATE TABLE IF NOT EXISTS test_tsvector(
 );
 
 CREATE UNLOGGED TABLE unlogged1 (a int primary key);			-- OK
+SELECT relname, relkind, relpersistence FROM pg_class WHERE relname ~ '^unlogged1' ORDER BY relname;
+REINDEX INDEX unlogged1_pkey;
+SELECT relname, relkind, relpersistence FROM pg_class WHERE relname ~ '^unlogged1' ORDER BY relname;
 INSERT INTO unlogged1 VALUES (42);
 CREATE UNLOGGED TABLE public.unlogged2 (a int primary key);		-- also OK
 CREATE UNLOGGED TABLE pg_temp.unlogged3 (a int primary key);	-- not OK

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Error with index on unlogged table

2015-03-25 Thread Michael Paquier
On Thu, Mar 26, 2015 at 1:02 AM, Fabrízio de Royes Mello
fabriziome...@gmail.com wrote:
 On Wed, Mar 25, 2015 at 12:46 PM, Fabrízio de Royes Mello
 fabriziome...@gmail.com wrote:


 On Wed, Mar 25, 2015 at 10:53 AM, Andres Freund and...@2ndquadrant.com
 wrote:
 

  Did you check whether a similar bug was made in other places of
  85b506bb? Could you additionally add a regression test to this end?
  Seems like something worth testing.
 

 I'm checking it and adding some regression tests.


 I didn't found any other similar bug introduced by 85b506bb.

 Attached the original patch provided by Michael with some regression tests.

Thanks for adding a test, this looks fine to me (did some sanity
checks and tutti-quanti for people wondering). On temporary tables
this was failing with an error in md.c...
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Error with index on unlogged table

2015-03-25 Thread Michael Paquier
On Wed, Mar 25, 2015 at 10:53 PM, Andres Freund and...@2ndquadrant.com wrote:
 Hi,

 On 2015-03-25 11:38:30 +0900, Michael Paquier wrote:
 On Tue, Mar 24, 2015 at 8:46 PM, Thom Brown wrote:
  The index is unlogged until reindexing...
 
  [...]
  Which is think also raises the question, why are unlogged indexes made
  persistent by a reindex?

 That's a bug of HEAD, ~9.4 keeping the index as unlogged even after
 REINDEX INDEX. What happens is that ReindexIndex relies on
 relpersistence provided by makeRangeVar at parse time, which is just
 incorrect as it uses RELPERSISTENCE_PERMANENT all the time. The patch
 attached fixes that...

 What the hell? That's somewhat nasty. Nice that it got caught before 9.5
 was released.

 Did you check whether a similar bug was made in other places of
 85b506bb?

Yeah I got a look at the other code paths, particularly cluster and
matviews, and the relpersistence used is taken directly from a
Relation.

 Could you additionally add a regression test to this end?
 Seems like something worth testing.

Definitely. And I guess that Fabrizio already did that...
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Error with index on unlogged table

2015-03-24 Thread Thom Brown
On 24 March 2015 at 11:37, Andres Freund and...@anarazel.de wrote:

 On March 24, 2015 12:35:28 PM GMT+01:00, Michael Paquier 
 michael.paqu...@gmail.com wrote:
 On Tue, Mar 24, 2015 at 5:53 PM, Thom Brown t...@linux.com wrote:
  I was attempting to set up a data set to test pg_rewind, when I
 encountered
  an error.  I created a primary and standby, then:
 
  [...]
 
  # insert into utest (thing) values ('moomoo');
  ERROR:  index utest_pkey contains unexpected zero page at block 0
  HINT:  Please REINDEX it.
 
  This is built on commit e5f455f59fed0632371cddacddd79895b148dc07.
 
 Unlogged tables are not in WAL, and cannot be accessed while in
 recovery, so having an empty index relation is expected on a promoted
 standby IMO. Now perhaps we could have a more friendly error message
 in _bt_checkpage(), _hash_checkpage() and gistcheckpage() with an
 additional HINT to mention unlogged tables, but I am not sure that
 this is much worth it. Mentioning this behavior in the docs would be
 good instead.

 I think Thom's point is that he promoted the node...

 Thom, are you sure this want transient?


The index is unlogged until reindexing...

# select oid, relname, relpersistence from pg_class where relname in
('test','test_pkey','utest','utest_pkey');
  oid  |  relname   | relpersistence
---++
 16387 | test   | p
 16394 | test_pkey  | p
 16398 | utest  | u
 16405 | utest_pkey | u
(4 rows)

# reindex index utest_pkey;
REINDEX

# select oid, relname, relpersistence from pg_class where relname in
('test','test_pkey','utest','utest_pkey');
  oid  |  relname   | relpersistence
---++
 16387 | test   | p
 16394 | test_pkey  | p
 16398 | utest  | u
 16405 | utest_pkey | p
(4 rows)

Which is think also raises the question, why are unlogged indexes made
persistent by a reindex?

-- 
Thom


Re: [HACKERS] Error with index on unlogged table

2015-03-24 Thread Michael Paquier
On Tue, Mar 24, 2015 at 5:53 PM, Thom Brown t...@linux.com wrote:
 I was attempting to set up a data set to test pg_rewind, when I encountered
 an error.  I created a primary and standby, then:

 [...]

 # insert into utest (thing) values ('moomoo');
 ERROR:  index utest_pkey contains unexpected zero page at block 0
 HINT:  Please REINDEX it.

 This is built on commit e5f455f59fed0632371cddacddd79895b148dc07.

Unlogged tables are not in WAL, and cannot be accessed while in
recovery, so having an empty index relation is expected on a promoted
standby IMO. Now perhaps we could have a more friendly error message
in _bt_checkpage(), _hash_checkpage() and gistcheckpage() with an
additional HINT to mention unlogged tables, but I am not sure that
this is much worth it. Mentioning this behavior in the docs would be
good instead.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Error with index on unlogged table

2015-03-24 Thread Andres Freund
On March 24, 2015 12:35:28 PM GMT+01:00, Michael Paquier 
michael.paqu...@gmail.com wrote:
On Tue, Mar 24, 2015 at 5:53 PM, Thom Brown t...@linux.com wrote:
 I was attempting to set up a data set to test pg_rewind, when I
encountered
 an error.  I created a primary and standby, then:

 [...]

 # insert into utest (thing) values ('moomoo');
 ERROR:  index utest_pkey contains unexpected zero page at block 0
 HINT:  Please REINDEX it.

 This is built on commit e5f455f59fed0632371cddacddd79895b148dc07.

Unlogged tables are not in WAL, and cannot be accessed while in
recovery, so having an empty index relation is expected on a promoted
standby IMO. Now perhaps we could have a more friendly error message
in _bt_checkpage(), _hash_checkpage() and gistcheckpage() with an
additional HINT to mention unlogged tables, but I am not sure that
this is much worth it. Mentioning this behavior in the docs would be
good instead.

I think Thom's point is that he promoted the node...

Thom, are you sure this want transient?

Andres

--- 
Please excuse brevity and formatting - I am writing this on my mobile phone.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Error with index on unlogged table

2015-03-24 Thread Thom Brown
On 24 March 2015 at 11:46, Thom Brown t...@linux.com wrote:


 On 24 March 2015 at 11:37, Andres Freund and...@anarazel.de wrote:

 On March 24, 2015 12:35:28 PM GMT+01:00, Michael Paquier 
 michael.paqu...@gmail.com wrote:
 On Tue, Mar 24, 2015 at 5:53 PM, Thom Brown t...@linux.com wrote:
  I was attempting to set up a data set to test pg_rewind, when I
 encountered
  an error.  I created a primary and standby, then:
 
  [...]
 
  # insert into utest (thing) values ('moomoo');
  ERROR:  index utest_pkey contains unexpected zero page at block 0
  HINT:  Please REINDEX it.
 
  This is built on commit e5f455f59fed0632371cddacddd79895b148dc07.
 
 Unlogged tables are not in WAL, and cannot be accessed while in
 recovery, so having an empty index relation is expected on a promoted
 standby IMO. Now perhaps we could have a more friendly error message
 in _bt_checkpage(), _hash_checkpage() and gistcheckpage() with an
 additional HINT to mention unlogged tables, but I am not sure that
 this is much worth it. Mentioning this behavior in the docs would be
 good instead.

 I think Thom's point is that he promoted the node...

 Thom, are you sure this want transient?


 The index is unlogged until reindexing...

 # select oid, relname, relpersistence from pg_class where relname in
 ('test','test_pkey','utest','utest_pkey');
   oid  |  relname   | relpersistence
 ---++
  16387 | test   | p
  16394 | test_pkey  | p
  16398 | utest  | u
  16405 | utest_pkey | u
 (4 rows)

 # reindex index utest_pkey;
 REINDEX

 # select oid, relname, relpersistence from pg_class where relname in
 ('test','test_pkey','utest','utest_pkey');
   oid  |  relname   | relpersistence
 ---++
  16387 | test   | p
  16394 | test_pkey  | p
  16398 | utest  | u
  16405 | utest_pkey | p
 (4 rows)

 Which is think also raises the question, why are unlogged indexes made
 persistent by a reindex?


I should also mention that it becomes unlogged again when running VACUUM
FULL or CLUSTER on the table.

-- 
Thom


Re: [HACKERS] Error with index on unlogged table

2015-03-24 Thread Michael Paquier
On Tue, Mar 24, 2015 at 8:46 PM, Thom Brown wrote:
 The index is unlogged until reindexing...

 [...]
 Which is think also raises the question, why are unlogged indexes made
 persistent by a reindex?

That's a bug of HEAD, ~9.4 keeping the index as unlogged even after
REINDEX INDEX. What happens is that ReindexIndex relies on
relpersistence provided by makeRangeVar at parse time, which is just
incorrect as it uses RELPERSISTENCE_PERMANENT all the time. The patch
attached fixes that...
-- 
Michael
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 1c1d0da..1520d32 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1685,6 +1685,8 @@ ReindexIndex(RangeVar *indexRelation)
 {
 	Oid			indOid;
 	Oid			heapOid = InvalidOid;
+	Relation	irel;
+	char		relpersistence;
 
 	/* lock level used here should match index lock reindex_index() */
 	indOid = RangeVarGetRelidExtended(indexRelation, AccessExclusiveLock,
@@ -1692,7 +1694,11 @@ ReindexIndex(RangeVar *indexRelation)
 	  RangeVarCallbackForReindexIndex,
 	  (void *) heapOid);
 
-	reindex_index(indOid, false, indexRelation-relpersistence);
+	irel = index_open(indOid, AccessExclusiveLock);
+	relpersistence = irel-rd_rel-relpersistence;
+	index_close(irel, NoLock);
+
+	reindex_index(indOid, false, relpersistence);
 
 	return indOid;
 }

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Error with index on unlogged table

2015-03-24 Thread Michael Paquier
On Tue, Mar 24, 2015 at 8:37 PM, Andres Freund and...@anarazel.de wrote:
 On March 24, 2015 12:35:28 PM GMT+01:00, Michael Paquier wrote:
 I think Thom's point is that he promoted the node...

 Thom, are you sure this want transient?

Well, I got his point :)
I was just thinking that this error message is legit,
ResetUnloggedRelationsInDbspaceDir reinitializing unlogged relation
and its indexes at the end of recovery.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Error with index on unlogged table

2015-03-24 Thread Thom Brown
Hi,

I was attempting to set up a data set to test pg_rewind, when I encountered
an error.  I created a primary and standby, then:

# create table test (id serial primary key, thing text);
CREATE TABLE

# create unlogged table utest (id serial primary key, thing text);
CREATE TABLE

# insert into test (thing) values ('jifj');
INSERT 0 1

# insert into utest (thing) values ('jifj');
INSERT 0 1

$ pg_ctl -D standby1 promote

$ psql -p 5531 postgres # standby

# insert into test (thing) values ('moomoo');
INSERT 0 1

# insert into utest (thing) values ('moomoo');
ERROR:  index utest_pkey contains unexpected zero page at block 0
HINT:  Please REINDEX it.

This is built on commit e5f455f59fed0632371cddacddd79895b148dc07.
-- 
Thom