Re: [HACKERS] Inadequate thought about buffer locking during hot standby replay

2012-11-13 Thread Heikki Linnakangas

On 12.11.2012 22:53, Tom Lane wrote:

Here's an updated patch that fixes the GIST replay functions as well as
the other minor issues that were mentioned.  Barring objections, I'll
set about back-patching this as far as 9.0.


Ok. It won't help all that much on 9.0, though.


One thing that could use verification is my fix for
gistRedoPageSplitRecord.  AFAICS, the first page listed in the WAL
record is always the original page, and the ones following it are
pages that were split off from it, and can (as yet) only be reached by
following right-links from the original page.  As such, it should be
okay to release locks on the non-first pages as soon as we've written
them.  We have to hold lock on the original page though to avoid letting
readers follow dangling right-links.  Also, the update of
NSN/FOLLOW_RIGHT on the child page (if any) has to be done atomically
with all this, so that has to be done before releasing the original-page
lock as well.  Does that sound right?


Yep.

- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Inadequate thought about buffer locking during hot standby replay

2012-11-13 Thread Tom Lane
Heikki Linnakangas hlinnakan...@vmware.com writes:
 On 12.11.2012 22:53, Tom Lane wrote:
 Here's an updated patch that fixes the GIST replay functions as well as
 the other minor issues that were mentioned.  Barring objections, I'll
 set about back-patching this as far as 9.0.

 Ok. It won't help all that much on 9.0, though.

Well, it won't help GIST much, but the actually-reported-from-the-field
case is in btree, and it does fix that.

It occurs to me that if we're sufficiently scared of this case, we could
probably hack the planner (in 9.0 only) to refuse to use GIST indexes
in hot-standby queries.  That cure might be worse than the disease though.

regards, tom lane




Re: [HACKERS] Inadequate thought about buffer locking during hot standby replay

2012-11-13 Thread Merlin Moncure
On Tue, Nov 13, 2012 at 9:03 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Ok. It won't help all that much on 9.0, though.

 Well, it won't help GIST much, but the actually-reported-from-the-field
 case is in btree, and it does fix that.

 It occurs to me that if we're sufficiently scared of this case, we could
 probably hack the planner (in 9.0 only) to refuse to use GIST indexes
 in hot-standby queries.  That cure might be worse than the disease though.

if anything, it should be documented.  if you do this kind of thing
people will stop installing bugfix releases.

merlin




Re: [HACKERS] Inadequate thought about buffer locking during hot standby replay

2012-11-13 Thread Gavin Flower

On 14/11/12 04:32, Merlin Moncure wrote:

On Tue, Nov 13, 2012 at 9:03 AM, Tom Lane t...@sss.pgh.pa.us wrote:

Ok. It won't help all that much on 9.0, though.

Well, it won't help GIST much, but the actually-reported-from-the-field
case is in btree, and it does fix that.

It occurs to me that if we're sufficiently scared of this case, we could
probably hack the planner (in 9.0 only) to refuse to use GIST indexes
in hot-standby queries.  That cure might be worse than the disease though.

if anything, it should be documented.  if you do this kind of thing
people will stop installing bugfix releases.

merlin


How about displaying a warning when people try to use the 'feature', as 
well as documenting it?


Cheers,
Gavin





Re: [HACKERS] Inadequate thought about buffer locking during hot standby replay

2012-11-13 Thread Robert Haas
On Tue, Nov 13, 2012 at 10:32 AM, Merlin Moncure mmonc...@gmail.com wrote:
 On Tue, Nov 13, 2012 at 9:03 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Ok. It won't help all that much on 9.0, though.

 Well, it won't help GIST much, but the actually-reported-from-the-field
 case is in btree, and it does fix that.

 It occurs to me that if we're sufficiently scared of this case, we could
 probably hack the planner (in 9.0 only) to refuse to use GIST indexes
 in hot-standby queries.  That cure might be worse than the disease though.

 if anything, it should be documented.  if you do this kind of thing
 people will stop installing bugfix releases.

Agreed.  I think doing that in a back-branch release would be
extremely user-hostile.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Inadequate thought about buffer locking during hot standby replay

2012-11-12 Thread Heikki Linnakangas

On 12.11.2012 01:25, Tom Lane wrote:

Worse than that, GIST WAL replay seems to be a total disaster from this
standpoint, and I'm not convinced that it's even fixable without
incompatible WAL changes.  There are several cases where index
insertion operations generate more than one WAL record while holding
exclusive lock on buffer(s).  If the lock continuity is actually
necessary to prevent concurrent queries from seeing inconsistent index
states, then we have a big problem, because WAL replay isn't designed to
hold any buffer lock for more than one WAL action.


Hmm. After the rewrite of that logic I did in 9.1 (commit 9de3aa65f0), 
the GiST index is always in a consistent state. Until the downlink for 
a newly split page has been inserted, its left sibling is flagged so 
that a search knows to follow the right-link to find it. In normal 
operation, the lock continuity is needed to prevent another backend from 
seeing the incomplete split and trying to fix it, but in hot standby, we 
never try to fix incomplete splits anyway.


So I think we're good on >= 9.1. The 9.0 code is broken, however. In 
9.0, when a child page is split, the parent and new children are kept 
locked until the downlinks are inserted/updated. If a concurrent scan 
comes along and sees that incomplete state, it will miss tuples on the 
new right siblings. We rely on a rm_cleanup operation at the end of WAL 
replay to fix that situation, if the downlink insertion record is not 
there. I don't see any easy way to fix that, unfortunately. Perhaps we 
could backpatch the 9.1 rewrite, now that it's gotten some real-world 
testing, but it was a big change so I don't feel very comfortable doing 
that.



Also, gistRedoPageDeleteRecord seems to be dead code?  I don't see
anything that emits XLOG_GIST_PAGE_DELETE.


Yep. It was used in VACUUM FULL, but has been dead since VACUUM FULL was 
removed.


- Heikki




Re: [HACKERS] Inadequate thought about buffer locking during hot standby replay

2012-11-12 Thread Robert Haas
On Sun, Nov 11, 2012 at 6:24 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 I already pointed out the inconsistency in heap_xlog_freeze about
 whether a cleanup lock is needed.  As is, this patch uses a cleanup
 lock, but I suspect that a regular lock is sufficient --- comments?

Well, according to storage/buffer/README:

1. To scan a page for tuples, one must hold a pin and either shared or
exclusive content lock.  To examine the commit status (XIDs and status bits)
of a tuple in a shared buffer, one must likewise hold a pin and either shared
or exclusive lock.

That does indeed make it sound like an x-lock is enough.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Inadequate thought about buffer locking during hot standby replay

2012-11-12 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Sun, Nov 11, 2012 at 6:24 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 I already pointed out the inconsistency in heap_xlog_freeze about
 whether a cleanup lock is needed.  As is, this patch uses a cleanup
 lock, but I suspect that a regular lock is sufficient --- comments?

 Well, according to storage/buffer/README:

 1. To scan a page for tuples, one must hold a pin and either shared or
 exclusive content lock.  To examine the commit status (XIDs and status bits)
 of a tuple in a shared buffer, one must likewise hold a pin and either shared
 or exclusive lock.

 That does indeed make it sound like an x-lock is enough.

Yeah.  AFAICS, the only possible downside is that somebody might be
using the tuple (while holding just a buffer pin), and that its xmin
might change while that's happening.  So for instance a nestloop join
might read out different xmin values for the same row while the join
proceeds.  But that could happen anyway if a different plan had been
chosen (viz, this table on the inside not the outside of the nestloop).
The xmin update ought to be physically atomic, so you shouldn't be able
to get a garbage result, just the old value or the new.

The cleanup lock is needed for cases where we'd like to remove or move a
tuple, but that's not required here.

regards, tom lane




Re: [HACKERS] Inadequate thought about buffer locking during hot standby replay

2012-11-12 Thread Tom Lane
Heikki Linnakangas hlinnakan...@vmware.com writes:
 On 12.11.2012 01:25, Tom Lane wrote:
 Worse than that, GIST WAL replay seems to be a total disaster from this
 standpoint, and I'm not convinced that it's even fixable without
 incompatible WAL changes.

 Hmm. After the rewrite of that logic I did in 9.1 (commit 9de3aa65f0), 
 the GiST index is always in a consistent state. Until the downlink for 
 a newly split page has been inserted, its left sibling is flagged so 
 that a search knows to follow the right-link to find it. In normal 
 operation, the lock continuity is needed to prevent another backend from 
 seeing the incomplete split and trying to fix it, but in hot standby, we 
 never try to fix incomplete splits anyway.

 So I think we're good on >= 9.1.

Okay.  I'll update the patch to make sure that the individual WAL replay
functions hold all locks, but not worry about holding locks across
actions.

 The 9.0 code is broken, however. In 
 9.0, when a child page is split, the parent and new children are kept 
 locked until the downlinks are inserted/updated. If a concurrent scan 
 comes along and sees that incomplete state, it will miss tuples on the 
 new right siblings. We rely on a rm_cleanup operation at the end of WAL 
 replay to fix that situation, if the downlink insertion record is not 
 there. I don't see any easy way to fix that, unfortunately. Perhaps we 
 could backpatch the 9.1 rewrite, now that it's gotten some real-world 
 testing, but it was a big change so I don't feel very comfortable doing 
 that.

Me either.  Given the lack of field complaints, I think we're better
advised to just leave it unfixed in 9.0.  It'd not be a step forward
if we broke something trying to make this work.

 Also, gistRedoPageDeleteRecord seems to be dead code?  I don't see
 anything that emits XLOG_GIST_PAGE_DELETE.

 Yep. It was used in VACUUM FULL, but has been dead since VACUUM FULL was 
 removed.

Okay.  I see we bumped XLOG_PAGE_MAGIC in 9.0, so there's no longer
any way that 9.0 or later versions could see this WAL record type.
I'll delete that replay function rather than bothering to fix it.

regards, tom lane




Re: [HACKERS] Inadequate thought about buffer locking during hot standby replay

2012-11-12 Thread Tom Lane
Andres Freund and...@2ndquadrant.com writes:
 On 2012-11-10 16:24:22 -0500, Tom Lane wrote:
 If any of this stuff were getting used by external modules, changing it
 would be problematic ... but fortunately, it isn't, because we lack
 support for plug-in rmgrs.  So I'm not worried about back-patching the
 change, and would prefer to keep the 9.x branches in sync.

 XLR_BKP_BLOCK_* might be used by things like pg_lesslog and it's surely
 used by xlogdump. Not sure if either are worth that much attention, but
 it seems worth noticing that such a change will break stuff.

Hm.  Okay, how about we leave the old macros in place in the back
branches?

regards, tom lane




Re: [HACKERS] Inadequate thought about buffer locking during hot standby replay

2012-11-12 Thread Andres Freund
On 2012-11-12 10:19:09 -0500, Tom Lane wrote:
 Andres Freund and...@2ndquadrant.com writes:
  On 2012-11-10 16:24:22 -0500, Tom Lane wrote:
  If any of this stuff were getting used by external modules, changing it
  would be problematic ... but fortunately, it isn't, because we lack
  support for plug-in rmgrs.  So I'm not worried about back-patching the
  change, and would prefer to keep the 9.x branches in sync.

  XLR_BKP_BLOCK_* might be used by things like pg_lesslog and it's surely
  used by xlogdump. Not sure if either are worth that much attention, but
  it seems worth noticing that such a change will break stuff.

 Hm.  Okay, how about we leave the old macros in place in the back
 branches?

Sounds good to me. The RestoreBkpBlocks change seems unproblematic to
me. If anything, it's good that it has been renamed.

Thanks,

Andres




Re: [HACKERS] Inadequate thought about buffer locking during hot standby replay

2012-11-12 Thread Simon Riggs
On 11 November 2012 23:24, Tom Lane t...@sss.pgh.pa.us wrote:

 Practically all WAL record types that touch multiple pages have some
 bug of this type.  In addition to btree_xlog_split, I found that
 heap_xlog_update, ginRedoDeletePage, spgRedoAddLeaf, spgRedoMoveLeafs,
 spgRedoAddNode, spgRedoSplitTuple, and spgRedoPickSplit fail to hold
 locks as required to make their updates safe for concurrent queries.
 (I'm not totally sure about ginRedoDeletePage, but the original action
 definitely locks the pages simultaneously, and it's not clear that it's
 safe not to.)  Most of these are okay in cases without any full-page
 images, but could fail if the wrong subset of the pages-to-be-touched
 were processed by RestoreBkpBlocks.  Some had bugs even without that :-(

Hmm, not good. Thanks for spotting.

Do these changes do anything to actions that occur across multiple
records? I assume not and think those are OK, agreed?

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Inadequate thought about buffer locking during hot standby replay

2012-11-12 Thread Simon Riggs
On 12 November 2012 14:51, Tom Lane t...@sss.pgh.pa.us wrote:

 The 9.0 code is broken, however. In
 9.0, when a child page is split, the parent and new children are kept
 locked until the downlinks are inserted/updated. If a concurrent scan
 comes along and sees that incomplete state, it will miss tuples on the
 new right siblings. We rely on a rm_cleanup operation at the end of WAL
 replay to fix that situation, if the downlink insertion record is not
 there. I don't see any easy way to fix that, unfortunately. Perhaps we
 could backpatch the 9.1 rewrite, now that it's gotten some real-world
 testing, but it was a big change so I don't feel very comfortable doing
 that.

 Me either.  Given the lack of field complaints, I think we're better
 advised to just leave it unfixed in 9.0.  It'd not be a step forward
 if we broke something trying to make this work.

Agreed. Most people running 9.0 with GIST indexes have already upgraded.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Inadequate thought about buffer locking during hot standby replay

2012-11-12 Thread Tom Lane
Simon Riggs si...@2ndquadrant.com writes:
 On 11 November 2012 23:24, Tom Lane t...@sss.pgh.pa.us wrote:
 Practically all WAL record types that touch multiple pages have some
 bug of this type.  In addition to btree_xlog_split, I found that
 heap_xlog_update, ginRedoDeletePage, spgRedoAddLeaf, spgRedoMoveLeafs,
 spgRedoAddNode, spgRedoSplitTuple, and spgRedoPickSplit fail to hold
 locks as required to make their updates safe for concurrent queries.
 (I'm not totally sure about ginRedoDeletePage, but the original action
 definitely locks the pages simultaneously, and it's not clear that it's
 safe not to.)  Most of these are okay in cases without any full-page
 images, but could fail if the wrong subset of the pages-to-be-touched
 were processed by RestoreBkpBlocks.  Some had bugs even without that :-(

 Hmm, not good. Thanks for spotting.

 Do these changes do anything to actions that occur across multiple
 records? I assume not and think those are OK, agreed?

Right, we were and still are assuming that any state that exists between
WAL records is consistent and safe to expose to hot-standby queries.
The important thing here is that these WAL replay functions were failing
to ensure that their own actions appear atomic to onlookers.  This is
basically a hangover from pre-hot-standby coding conventions, when no such
concern existed.

regards, tom lane




Re: [HACKERS] Inadequate thought about buffer locking during hot standby replay

2012-11-12 Thread Tom Lane
Here's an updated patch that fixes the GIST replay functions as well as
the other minor issues that were mentioned.  Barring objections, I'll
set about back-patching this as far as 9.0.

One thing that could use verification is my fix for
gistRedoPageSplitRecord.  AFAICS, the first page listed in the WAL
record is always the original page, and the ones following it are
pages that were split off from it, and can (as yet) only be reached by
following right-links from the original page.  As such, it should be
okay to release locks on the non-first pages as soon as we've written
them.  We have to hold lock on the original page though to avoid letting
readers follow dangling right-links.  Also, the update of
NSN/FOLLOW_RIGHT on the child page (if any) has to be done atomically
with all this, so that has to be done before releasing the original-page
lock as well.  Does that sound right?

regards, tom lane



restore-backup-blocks-individually-3.patch.gz
Description: restore-backup-blocks-individually-3.patch.gz



Re: [HACKERS] Inadequate thought about buffer locking during hot standby replay

2012-11-11 Thread Tom Lane
Attached is a complete draft patch against HEAD for this issue.  I found
a depressingly large number of pre-existing bugs:

Practically all WAL record types that touch multiple pages have some
bug of this type.  In addition to btree_xlog_split, I found that
heap_xlog_update, ginRedoDeletePage, spgRedoAddLeaf, spgRedoMoveLeafs,
spgRedoAddNode, spgRedoSplitTuple, and spgRedoPickSplit fail to hold
locks as required to make their updates safe for concurrent queries.
(I'm not totally sure about ginRedoDeletePage, but the original action
definitely locks the pages simultaneously, and it's not clear that it's
safe not to.)  Most of these are okay in cases without any full-page
images, but could fail if the wrong subset of the pages-to-be-touched
were processed by RestoreBkpBlocks.  Some had bugs even without that :-(

Worse than that, GIST WAL replay seems to be a total disaster from this
standpoint, and I'm not convinced that it's even fixable without
incompatible WAL changes.  There are several cases where index
insertion operations generate more than one WAL record while holding
exclusive lock on buffer(s).  If the lock continuity is actually
necessary to prevent concurrent queries from seeing inconsistent index
states, then we have a big problem, because WAL replay isn't designed to
hold any buffer lock for more than one WAL action.  This needs to be
looked at closer by somebody who's more up on the GIST code than I am.
The attached patch doesn't try very hard to make the GIST functions
safe, it just transposes them to the new API.

I also found that spgRedoAddNode could fail to update the parent
downlink at all, if the parent tuple is in the same page as either the
old or new split tuple --- it would get fooled by the LSN having been
advanced already.  This is an index corruption bug not just a
transient-failure problem.

Also, ginHeapTupleFastInsert's merge lists case fails to mark the old
tail page as a candidate for a full-page image; in the worst case this
could result in torn-page corruption.

I already pointed out the inconsistency in heap_xlog_freeze about
whether a cleanup lock is needed.  As is, this patch uses a cleanup
lock, but I suspect that a regular lock is sufficient --- comments?

Also, gistRedoPageDeleteRecord seems to be dead code?  I don't see
anything that emits XLOG_GIST_PAGE_DELETE.

regards, tom lane

#application/octet-stream [restore-backup-blocks-individually-2.patch.gz] 
/home/tgl/pgsql/restore-backup-blocks-individually-2.patch.gz




Re: [HACKERS] Inadequate thought about buffer locking during hot standby replay

2012-11-11 Thread Tom Lane
[ this time *with* the patch ]

Attached is a complete draft patch against HEAD for this issue.  I found
a depressingly large number of pre-existing bugs:

Practically all WAL record types that touch multiple pages have some
bug of this type.  In addition to btree_xlog_split, I found that
heap_xlog_update, ginRedoDeletePage, spgRedoAddLeaf, spgRedoMoveLeafs,
spgRedoAddNode, spgRedoSplitTuple, and spgRedoPickSplit fail to hold
locks as required to make their updates safe for concurrent queries.
(I'm not totally sure about ginRedoDeletePage, but the original action
definitely locks the pages simultaneously, and it's not clear that it's
safe not to.)  Most of these are okay in cases without any full-page
images, but could fail if the wrong subset of the pages-to-be-touched
were processed by RestoreBkpBlocks.  Some had bugs even without that :-(

Worse than that, GIST WAL replay seems to be a total disaster from this
standpoint, and I'm not convinced that it's even fixable without
incompatible WAL changes.  There are several cases where index
insertion operations generate more than one WAL record while holding
exclusive lock on buffer(s).  If the lock continuity is actually
necessary to prevent concurrent queries from seeing inconsistent index
states, then we have a big problem, because WAL replay isn't designed to
hold any buffer lock for more than one WAL action.  This needs to be
looked at closer by somebody who's more up on the GIST code than I am.
The attached patch doesn't try very hard to make the GIST functions
safe, it just transposes them to the new API.

I also found that spgRedoAddNode could fail to update the parent
downlink at all, if the parent tuple is in the same page as either the
old or new split tuple --- it would get fooled by the LSN having been
advanced already.  This is an index corruption bug not just a
transient-failure problem.

Also, ginHeapTupleFastInsert's merge lists case fails to mark the old
tail page as a candidate for a full-page image; in the worst case this
could result in torn-page corruption.

I already pointed out the inconsistency in heap_xlog_freeze about
whether a cleanup lock is needed.  As is, this patch uses a cleanup
lock, but I suspect that a regular lock is sufficient --- comments?

Also, gistRedoPageDeleteRecord seems to be dead code?  I don't see
anything that emits XLOG_GIST_PAGE_DELETE.

regards, tom lane



binVAAMKoREwG.bin
Description: restore-backup-blocks-individually-2.patch.gz



Re: [HACKERS] Inadequate thought about buffer locking during hot standby replay

2012-11-11 Thread Andres Freund
On 2012-11-10 16:24:22 -0500, Tom Lane wrote:
 Simon Riggs si...@2ndquadrant.com writes:
  One thing that seems a bit annoying is the use of zero-based backup
  block indexes in RestoreBackupBlock, when most (not all) of the callers
  are referencing macros with one-based names (XLR_BKP_BLOCK_1 etc).
  That's a bug waiting to happen.  We could address it by changing
  RestoreBackupBlock to accept a one-based argument, but what I'm thinking
  of doing instead is getting rid of the one-based convention entirely;
  that is, remove XLR_BKP_BLOCK_1 through XLR_BKP_BLOCK_4 and instead have
  all code use the XLR_SET_BKP_BLOCK() macro, which is zero-based.  One
  advantage of doing that is that it'll force inspection of all code
  related to this.

  I wouldn't do that in a back branch, but I can see why its a good idea.
 
 If any of this stuff were getting used by external modules, changing it
 would be problematic ... but fortunately, it isn't, because we lack
 support for plug-in rmgrs.  So I'm not worried about back-patching the
 change, and would prefer to keep the 9.x branches in sync.

XLR_BKP_BLOCK_* might be used by things like pg_lesslog and it's surely
used by xlogdump. Not sure if either are worth that much attention, but
it seems worth noticing that such a change will break stuff.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Inadequate thought about buffer locking during hot standby replay

2012-11-10 Thread Simon Riggs
On 9 November 2012 23:24, Tom Lane t...@sss.pgh.pa.us wrote:
 During normal running, operations such as btree page splits are
 extremely careful about the order in which they acquire and release
 buffer locks, if they're doing something that concurrently modifies
 multiple pages.

 During WAL replay, that all goes out the window.  Even if an individual
 WAL-record replay function does things in the right order for standard
 cases, RestoreBkpBlocks has no idea what it's doing.  So if one or more
 of the referenced pages gets treated as a full-page image, we are left
 with no guarantee whatsoever about what order the pages are restored in.
 That never mattered when the code was originally designed, but it sure
 matters during Hot Standby when other queries might be able to see the
 intermediate states.

 I can't prove that this is the cause of bug #7648, but it's fairly easy
 to see that it could explain the symptom.  You only need to assume that
 the page-being-split had been handled as a full-page image, and that the
 new right-hand page had gotten allocated by extending the relation.
 Then there will be an interval just after RestoreBkpBlocks does its
 thing where the updated left-hand sibling is in the index and is not
 locked in any way, but its right-link points off the end of the index.
 If a few indexscans come along before the replay process gets to
 continue, you'd get exactly the reported errors.

 I'm inclined to think that we need to fix this by getting rid of
 RestoreBkpBlocks per se, and instead having the per-WAL-record restore
 routines dictate when each full-page image is restored (and whether or
 not to release the buffer lock immediately).  That's not going to be a
 small change unfortunately :-(

No, but it looks like a clear bug scenario and a clear resolution also.

I'll start looking at it.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Inadequate thought about buffer locking during hot standby replay

2012-11-10 Thread Tom Lane
Simon Riggs si...@2ndquadrant.com writes:
 On 9 November 2012 23:24, Tom Lane t...@sss.pgh.pa.us wrote:
 I'm inclined to think that we need to fix this by getting rid of
 RestoreBkpBlocks per se, and instead having the per-WAL-record restore
 routines dictate when each full-page image is restored (and whether or
 not to release the buffer lock immediately).  That's not going to be a
 small change unfortunately :-(

 No, but it looks like a clear bug scenario and a clear resolution also.
 I'll start looking at it.

I'm on it already.

regards, tom lane




Re: [HACKERS] Inadequate thought about buffer locking during hot standby replay

2012-11-10 Thread Tom Lane
I wrote:
 I'm inclined to think that we need to fix this by getting rid of
 RestoreBkpBlocks per se, and instead having the per-WAL-record restore
 routines dictate when each full-page image is restored (and whether or
 not to release the buffer lock immediately).  That's not going to be a
 small change unfortunately :-(

Here's a WIP patch that attacks it in this way.  I've only gone as far as
fixing btree_xlog_split() for the moment, since the main point is to see
whether the coding change seems reasonable.  At least in this function,
it seems that locking considerations are handled correctly as long as no
full-page image is used, so it's pretty straightforward to tweak it to
handle the case with full-page image(s) as well.  I've not looked
through any other replay functions yet --- some of them may need more
invasive hacking.  But so far this looks pretty good.

Note that this patch isn't correct yet even for btree_xlog_split(),
because I've not removed the RestoreBkpBlocks() call in btree_redo().
All the btree replay routines will have to get fixed before it's
testable at all.

One thing that seems a bit annoying is the use of zero-based backup
block indexes in RestoreBackupBlock, when most (not all) of the callers
are referencing macros with one-based names (XLR_BKP_BLOCK_1 etc).
That's a bug waiting to happen.  We could address it by changing
RestoreBackupBlock to accept a one-based argument, but what I'm thinking
of doing instead is getting rid of the one-based convention entirely;
that is, remove XLR_BKP_BLOCK_1 through XLR_BKP_BLOCK_4 and instead have
all code use the XLR_SET_BKP_BLOCK() macro, which is zero-based.  One
advantage of doing that is that it'll force inspection of all code
related to this.

Comments, opinions?

regards, tom lane

diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index 72ea1719e7d2c3ca230c8f755261b9a439ec2a99..ee7e15fddcfca7dfe7ccba6ab021397937b1a96d 100644
*** a/src/backend/access/nbtree/nbtxlog.c
--- b/src/backend/access/nbtree/nbtxlog.c
*** btree_xlog_split(bool onleft, bool isroo
*** 323,329 
  		datalen -= newitemsz;
  	}
  
! 	/* Reconstruct right (new) sibling from scratch */
  	rbuf = XLogReadBuffer(xlrec->node, xlrec->rightsib, true);
  	Assert(BufferIsValid(rbuf));
  	rpage = (Page) BufferGetPage(rbuf);
--- 323,329 
  		datalen -= newitemsz;
  	}
  
! 	/* Reconstruct right (new) sibling page from scratch */
  	rbuf = XLogReadBuffer(xlrec->node, xlrec->rightsib, true);
  	Assert(BufferIsValid(rbuf));
  	rpage = (Page) BufferGetPage(rbuf);
*** btree_xlog_split(bool onleft, bool isroo
*** 357,374 
  
  	/* don't release the buffer yet; we touch right page's first item below */
  
! 	/*
! 	 * Reconstruct left (original) sibling if needed.  Note that this code
! 	 * ensures that the items remaining on the left page are in the correct
! 	 * item number order, but it does not reproduce the physical order they
! 	 * would have had.	Is this worth changing?  See also _bt_restore_page().
! 	 */
! 	if (!(record->xl_info & XLR_BKP_BLOCK_1))
  	{
  		Buffer		lbuf = XLogReadBuffer(xlrec->node, xlrec->leftsib, false);
  
  		if (BufferIsValid(lbuf))
  		{
  			Page		lpage = (Page) BufferGetPage(lbuf);
  			BTPageOpaque lopaque = (BTPageOpaque) PageGetSpecialPointer(lpage);
  
--- 357,377 
  
  	/* don't release the buffer yet; we touch right page's first item below */
  
! 	/* Now reconstruct left (original) sibling page */
! 	if (record->xl_info & XLR_BKP_BLOCK_1)
! 		(void) RestoreBackupBlock(lsn, record, 0, false, false);
! 	else
  	{
  		Buffer		lbuf = XLogReadBuffer(xlrec->node, xlrec->leftsib, false);
  
  		if (BufferIsValid(lbuf))
  		{
+ 			/*
+ 			 * Note that this code ensures that the items remaining on the
+ 			 * left page are in the correct item number order, but it does not
+ 			 * reproduce the physical order they would have had.  Is this
+ 			 * worth changing?  See also _bt_restore_page().
+ 			 */
  			Page		lpage = (Page) BufferGetPage(lbuf);
  			BTPageOpaque lopaque = (BTPageOpaque) PageGetSpecialPointer(lpage);
  
*** btree_xlog_split(bool onleft, bool isroo
*** 432,439 
  	/* We no longer need the right buffer */
  	UnlockReleaseBuffer(rbuf);
  
! 	/* Fix left-link of the page to the right of the new right sibling */
! 	if (xlrec->rnext != P_NONE && !(record->xl_info & XLR_BKP_BLOCK_2))
  	{
  		Buffer		buffer = XLogReadBuffer(xlrec->node, xlrec->rnext, false);
  
--- 435,451 
  	/* We no longer need the right buffer */
  	UnlockReleaseBuffer(rbuf);
  
! 	/*
! 	 * Fix left-link of the page to the right of the new right sibling.
! 	 *
! 	 * Note: in normal operation, we do this while still holding lock on the
! 	 * two split pages.  However, that's not necessary for correctness in WAL
! 	 * replay, because no other index update can be in progress, and readers
! 	 * will cope properly when following an obsolete left-link.
! 	 */
! 	if (record->xl_info & 

Re: [HACKERS] Inadequate thought about buffer locking during hot standby replay

2012-11-10 Thread Simon Riggs
On 10 November 2012 18:16, Tom Lane t...@sss.pgh.pa.us wrote:
 I wrote:
 I'm inclined to think that we need to fix this by getting rid of
 RestoreBkpBlocks per se, and instead having the per-WAL-record restore
 routines dictate when each full-page image is restored (and whether or
 not to release the buffer lock immediately).  That's not going to be a
 small change unfortunately :-(

 Here's a WIP patch that attacks it in this way.  I've only gone as far as
 fixing btree_xlog_split() for the moment, since the main point is to see
 whether the coding change seems reasonable.  At least in this function,
 it seems that locking considerations are handled correctly as long as no
 full-page image is used, so it's pretty straightforward to tweak it to
 handle the case with full-page image(s) as well.  I've not looked
 through any other replay functions yet --- some of them may need more
 invasive hacking.  But so far this looks pretty good.

Looks fine, but we need a top-level comment in btree code explaining
why/how it follows the very well explained rules in the README.

 Note that this patch isn't correct yet even for btree_xlog_split(),
 because I've not removed the RestoreBkpBlocks() call in btree_redo().
 All the btree replay routines will have to get fixed before it's
 testable at all.

I was just about to write back to say you'd missed that...

 One thing that seems a bit annoying is the use of zero-based backup
 block indexes in RestoreBackupBlock, when most (not all) of the callers
 are referencing macros with one-based names (XLR_BKP_BLOCK_1 etc).
 That's a bug waiting to happen.  We could address it by changing
 RestoreBackupBlock to accept a one-based argument, but what I'm thinking
 of doing instead is getting rid of the one-based convention entirely;
 that is, remove XLR_BKP_BLOCK_1 through XLR_BKP_BLOCK_4 and instead have
 all code use the XLR_SET_BKP_BLOCK() macro, which is zero-based.  One
 advantage of doing that is that it'll force inspection of all code
 related to this.

I wouldn't do that in a back branch, but I can see why it's a good idea.

Thanks for fixing.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Inadequate thought about buffer locking during hot standby replay

2012-11-10 Thread Tom Lane
Simon Riggs si...@2ndquadrant.com writes:
 On 10 November 2012 18:16, Tom Lane t...@sss.pgh.pa.us wrote:
 Here's a WIP patch that attacks it in this way.

 Looks fine, but we need a top-level comment in btree code explaining
 why/how it follows the very well explained rules in the README.

Not sure what you'd want it to say exactly?  Really these issues are per
WAL-replay function, not global.  I've now been through all of the btree
functions, and AFAICS btree_xlog_split is the only one of them that
actually has a bug; the others can be a bit lax about the order in which
things get done.

 One thing that seems a bit annoying is the use of zero-based backup
 block indexes in RestoreBackupBlock, when most (not all) of the callers
 are referencing macros with one-based names (XLR_BKP_BLOCK_1 etc).
 That's a bug waiting to happen.  We could address it by changing
 RestoreBackupBlock to accept a one-based argument, but what I'm thinking
 of doing instead is getting rid of the one-based convention entirely;
 that is, remove XLR_BKP_BLOCK_1 through XLR_BKP_BLOCK_4 and instead have
 all code use the XLR_SET_BKP_BLOCK() macro, which is zero-based.  One
 advantage of doing that is that it'll force inspection of all code
 related to this.

 I wouldn't do that in a back branch, but I can see why its a good idea.

If any of this stuff were getting used by external modules, changing it
would be problematic ... but fortunately, it isn't, because we lack
support for plug-in rmgrs.  So I'm not worried about back-patching the
change, and would prefer to keep the 9.x branches in sync.

Attached is an updated patch in which nbtxlog.c has been fully converted,
but I've not touched other rmgrs yet.  I've tested this to the extent of
having a replication slave follow a master that's running the regression
tests, and it seems to work.  It's difficult to prove much by testing
about concurrent behavior, of course.

I found that there's another benefit of managing backup-block replay
this way, which is that we can de-contort the handling of conflict
resolution by moving it into the per-record-type subroutines, instead of
needing an extra switch before the RestoreBkpBlocks call.  So that gives
me more confidence that this is a good way to go.

regards, tom lane

diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index 72ea1719e7d2c3ca230c8f755261b9a439ec2a99..8f534803a7cea75105869de6eaf2946e03d5437c 100644
*** a/src/backend/access/nbtree/nbtxlog.c
--- b/src/backend/access/nbtree/nbtxlog.c
*** btree_xlog_insert(bool isleaf, bool isme
*** 218,227 
  		datalen -= sizeof(xl_btree_metadata);
  	}
  
! 	if ((record->xl_info & XLR_BKP_BLOCK_1) && !ismeta && isleaf)
! 		return;	/* nothing to do */
! 
! 	if (!(record->xl_info & XLR_BKP_BLOCK_1))
  	{
  		buffer = XLogReadBuffer(xlrec->target.node,
  			 ItemPointerGetBlockNumber(&(xlrec->target.tid)),
--- 218,226 
  		datalen -= sizeof(xl_btree_metadata);
  	}
  
! 	if (record->xl_info & XLR_BKP_BLOCK(0))
! 		(void) RestoreBackupBlock(lsn, record, 0, false, false);
! 	else
  	{
  		buffer = XLogReadBuffer(xlrec->target.node,
  			 ItemPointerGetBlockNumber(&(xlrec->target.tid)),
*** btree_xlog_insert(bool isleaf, bool isme
*** 249,254 
--- 248,260 
  		}
  	}
  
+ 	/*
+ 	 * Note: in normal operation, we'd update the metapage while still holding
+ 	 * lock on the page we inserted into.  But during replay it's not
+ 	 * necessary to hold that lock, since no other index updates can be
+ 	 * happening concurrently, and readers will cope fine with following an
+ 	 * obsolete link from the metapage.
+ 	 */
  	if (ismeta)
  		_bt_restore_meta(xlrec->target.node, lsn,
  		 md.root, md.level,
*** btree_xlog_split(bool onleft, bool isroo
*** 290,296 
  		forget_matching_split(xlrec->node, downlink, false);
  
  		/* Extract left hikey and its size (still assuming 16-bit alignment) */
! 		if (!(record->xl_info & XLR_BKP_BLOCK_1))
  		{
  			/* We assume 16-bit alignment is enough for IndexTupleSize */
  			left_hikey = (Item) datapos;
--- 296,302 
  		forget_matching_split(xlrec->node, downlink, false);
  
  		/* Extract left hikey and its size (still assuming 16-bit alignment) */
! 		if (!(record->xl_info & XLR_BKP_BLOCK(0)))
  		{
  			/* We assume 16-bit alignment is enough for IndexTupleSize */
  			left_hikey = (Item) datapos;
*** btree_xlog_split(bool onleft, bool isroo
*** 310,316 
  		datalen -= sizeof(OffsetNumber);
  	}
  
! 	if (onleft && !(record->xl_info & XLR_BKP_BLOCK_1))
  	{
  		/*
  		 * We assume that 16-bit alignment is enough to apply IndexTupleSize
--- 316,322 
  		datalen -= sizeof(OffsetNumber);
  	}
  
! 	if (onleft && !(record->xl_info & XLR_BKP_BLOCK(0)))
  	{
  		/*
  		 * We assume that 16-bit alignment is enough to apply IndexTupleSize
*** btree_xlog_split(bool onleft, bool isroo
*** 323,329 
  		datalen -= newitemsz;
  	}
  
! 	/* 

Re: [HACKERS] Inadequate thought about buffer locking during hot standby replay

2012-11-10 Thread Tom Lane
While I'm looking at this: there seems to be a bug/inconsistency in
heap_xlog_freeze().  It uses a cleanup lock in the normal code path,
but it doesn't tell RestoreBkpBlocks to use a cleanup lock.  Which
choice is correct?

regards, tom lane




Re: [HACKERS] Inadequate thought about buffer locking during hot standby replay

2012-11-10 Thread Simon Riggs
On 10 November 2012 21:24, Tom Lane t...@sss.pgh.pa.us wrote:
 Simon Riggs si...@2ndquadrant.com writes:
 On 10 November 2012 18:16, Tom Lane t...@sss.pgh.pa.us wrote:
 Here's a WIP patch that attacks it in this way.

 Looks fine, but we need a top-level comment in btree code explaining
 why/how it follows the very well explained rules in the README.

 Not sure what you'd want it to say exactly?  Really these issues are per
 WAL-replay function, not global.  I've now been through all of the btree
 functions, and AFAICS btree_xlog_split is the only one of them that
 actually has a bug; the others can be a bit lax about the order in which
 things get done.

 One thing that seems a bit annoying is the use of zero-based backup
 block indexes in RestoreBackupBlock, when most (not all) of the callers
 are referencing macros with one-based names (XLR_BKP_BLOCK_1 etc).
 That's a bug waiting to happen.  We could address it by changing
 RestoreBackupBlock to accept a one-based argument, but what I'm thinking
 of doing instead is getting rid of the one-based convention entirely;
 that is, remove XLR_BKP_BLOCK_1 through XLR_BKP_BLOCK_4 and instead have
 all code use the XLR_SET_BKP_BLOCK() macro, which is zero-based.  One
 advantage of doing that is that it'll force inspection of all code
 related to this.

 I wouldn't do that in a back branch, but I can see why its a good idea.

 If any of this stuff were getting used by external modules, changing it
 would be problematic ... but fortunately, it isn't, because we lack
 support for plug-in rmgrs.  So I'm not worried about back-patching the
 change, and would prefer to keep the 9.x branches in sync.

 Attached is an updated patch in which nbtxlog.c has been fully converted,
 but I've not touched other rmgrs yet.  I've tested this to the extent of
 having a replication slave follow a master that's running the regression
 tests, and it seems to work.  It's difficult to prove much by testing
 about concurrent behavior, of course.

 I found that there's another benefit of managing backup-block replay
 this way, which is that we can de-contort the handling of conflict
 resolution by moving it into the per-record-type subroutines, instead of
 needing an extra switch before the RestoreBkpBlocks call.  So that gives
 me more confidence that this is a good way to go.

It's a good way to go, I'm just glad you're handling it for the back
branches ;-)

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




[HACKERS] Inadequate thought about buffer locking during hot standby replay

2012-11-09 Thread Tom Lane
During normal running, operations such as btree page splits are
extremely careful about the order in which they acquire and release
buffer locks, if they're doing something that concurrently modifies
multiple pages.

During WAL replay, that all goes out the window.  Even if an individual
WAL-record replay function does things in the right order for standard
cases, RestoreBkpBlocks has no idea what it's doing.  So if one or more
of the referenced pages gets treated as a full-page image, we are left
with no guarantee whatsoever about what order the pages are restored in.
That never mattered when the code was originally designed, but it sure
matters during Hot Standby when other queries might be able to see the
intermediate states.

I can't prove that this is the cause of bug #7648, but it's fairly easy
to see that it could explain the symptom.  You only need to assume that
the page-being-split had been handled as a full-page image, and that the
new right-hand page had gotten allocated by extending the relation.
Then there will be an interval just after RestoreBkpBlocks does its
thing where the updated left-hand sibling is in the index and is not
locked in any way, but its right-link points off the end of the index.
If a few indexscans come along before the replay process gets to
continue, you'd get exactly the reported errors.

I'm inclined to think that we need to fix this by getting rid of
RestoreBkpBlocks per se, and instead having the per-WAL-record restore
routines dictate when each full-page image is restored (and whether or
not to release the buffer lock immediately).  That's not going to be a
small change unfortunately :-(

regards, tom lane




Re: [HACKERS] Inadequate thought about buffer locking during hot standby replay

2012-11-09 Thread Andres Freund
On 2012-11-09 18:24:25 -0500, Tom Lane wrote:
 I can't prove that this is the cause of bug #7648, but it's fairly easy
 to see that it could explain the symptom.  You only need to assume that
 the page-being-split had been handled as a full-page image, and that the
 new right-hand page had gotten allocated by extending the relation.
 Then there will be an interval just after RestoreBkpBlocks does its
 thing where the updated left-hand sibling is in the index and is not
 locked in any way, but its right-link points off the end of the index.
 If a few indexscans come along before the replay process gets to
 continue, you'd get exactly the reported errors.

Sounds plausible.

 I'm inclined to think that we need to fix this by getting rid of
 RestoreBkpBlocks per se, and instead having the per-WAL-record restore
 routines dictate when each full-page image is restored (and whether or
 not to release the buffer lock immediately).  That's not going to be a
 small change unfortunately :-(

I wonder if we couldn't instead fix it by ensuring the backup blocks
are in the right order in the WAL record at the inserting
location. That would just need some care about the order of
XLogRecData blocks.
I am pretty unfamiliar with the nbtree locking but I seem to remember
that we should be fine if we always restore from left to right?

Greetings,

Andres Freund




Re: [HACKERS] Inadequate thought about buffer locking during hot standby replay

2012-11-09 Thread Daniel Farina
On Fri, Nov 9, 2012 at 3:24 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 During normal running, operations such as btree page splits are
 extremely careful about the order in which they acquire and release
 buffer locks, if they're doing something that concurrently modifies
 multiple pages.

 During WAL replay, that all goes out the window.  Even if an individual
 WAL-record replay function does things in the right order for standard
 cases, RestoreBkpBlocks has no idea what it's doing.  So if one or more
 of the referenced pages gets treated as a full-page image, we are left
 with no guarantee whatsoever about what order the pages are restored in.
 That never mattered when the code was originally designed, but it sure
 matters during Hot Standby when other queries might be able to see the
 intermediate states.

 I can't prove that this is the cause of bug #7648,

(I was the reporter of 7648)

To lend slightly more circumstantial evidence in support of this, I
also happened to note that the relfile in question was the last
segment and it was about a quarter full, so the access attempt was
definitely at the extreme outermost edge of the index most generally.

--
fdr




Re: [HACKERS] Inadequate thought about buffer locking during hot standby replay

2012-11-09 Thread Tom Lane
Andres Freund and...@anarazel.de writes:
 On 2012-11-09 18:24:25 -0500, Tom Lane wrote:
 I'm inclined to think that we need to fix this by getting rid of
 RestoreBkpBlocks per se, and instead having the per-WAL-record restore
 routines dictate when each full-page image is restored (and whether or
 not to release the buffer lock immediately).  That's not going to be a
 small change unfortunately :-(

 I wonder if we couldn't instead fix it by ensuring the backup blocks
 are in the right order in the backup blocks at the inserting
 location. That would just need some care about the order of
 XLogRecData blocks.

I don't think that's a good way to go.  In the first place, if we did
that the fix would require incompatible changes in the contents of WAL
streams.  In the second place, there are already severe constraints on
the positioning of backup blocks to ensure that WAL records can be
uniquely decoded (the section of access/transam/README about WAL coding
touches on this) --- I don't think it's a good plan to add still more
constraints there.  And in the third place, the specific problem we're
positing here results from a failure to hold the buffer lock for a
full-page image until after we're done restoring a *non* full-page image
represented elsewhere in the same WAL record.  In general, of the set
of pages touched by a WAL record, any arbitrary subset of them might be
converted to FPIs during XLogInsert; but the replay-time locking
requirements are going to be the same regardless of that.  So AFAICS,
any design in which RestoreBkpBlocks acts independently of the
non-full-page-image updates is just broken.

regards, tom lane

