Re: [HACKERS] Failure while inserting parent tuple to B-tree is not fun

2014-03-18 Thread Heikki Linnakangas

On 02/06/2014 06:42 AM, Peter Geoghegan wrote:

I'm not sure about this:


*** _bt_findinsertloc(Relation rel,
*** 675,680 
--- 701,707 
static void
_bt_insertonpg(Relation rel,
   Buffer buf,
+  Buffer cbuf,
   BTStack stack,
   IndexTuple itup,
   OffsetNumber newitemoff,


Wouldn't lcbuf be a better name for the new argument? After all, in
relevant contexts 'buf' is the parent of both of the pages post-split,
but there are two children in play here. You say:


+  *When inserting to a non-leaf page, 'cbuf' is the left-sibling 
of the
+  *page we're inserting the downlink for.  This function will 
clear the
+  *INCOMPLETE_SPLIT flag on it, and release the buffer.


I suggest also noting in comments at this point that cbuf/the
left-sibling is the old half from the split.

Even having vented about cbuf's name, I can kind of see why you didn't
call it lcbuf: we already have an BTPageOpaque lpageop variable for
the special 'buf' metadata within the _bt_insertonpg() function; so
there might be an ambiguity between the possibly-soon-to-be-left page
(the target page) and the *child* left page that needs to have its
flag cleared. Although I still think you should change the name of
lpageop (further details follow).

Consider this:


/* release buffers */
+   if (!P_ISLEAF(lpageop))
+   _bt_relbuf(rel, cbuf);


(Rebased, so this looks a bit different to your original version IIRC).

This is checking that the _bt_insertonpg() target page (whose special
data lpageop points to) is not a leaf page, and releasing the *child*
left page if it isn't. This is pretty confusing. So I suggest naming
lpageop tpageop ('t' for target, a terminology already used in the
comments above the function).


I don't know, the buf/page/lpageop variable names are used in many 
functions in nbtinsert.c. Perhaps they should all be renamed, but I 
think it's fine as it is, and not this patch's responsibility anyway. 
(The lpageop name makes sense when the page has to be split, and the 
page becomes the left page. Otherwise, admittedly, not so much)



Also, I suggest you change this existing code comment:

  * On entry, we must have the right buffer in which to do the
  * insertion, and the buffer must be pinned and write-locked.  
On return,
  * we will have dropped both the pin and the lock on the buffer.

Change right to correct here. Minor point, but right is a loaded word.


Fixed.


But why are you doing new things while checking P_ISLEAF(lpageop) in
_bt_insertonpg() to begin with? Would you not be better off doing
things when there is a child buffer passed (e.g. if
(BufferIsValid(cbuf))...), and only then asserting
!P_ISLEAF(lpageop)? That seems to me to more naturally establish the
context of _bt_insertonpg() is called to insert downlink after
split. Although maybe that conflates things within XLOG stuff. Hmm.


Changed that way for the places where we deal with the incomplete-split 
flag.


I committed the patch now. Thanks for the review!

Before committing, I spotted a bug in the way the new page-deletion code 
interacts with this: there was a check that the page that's about to be 
deleted was not marked with the INCOMPLETE_SPLIT flag, it was possible 
that the page was the right half of an incomplete split, and so didn't 
have a downlink. Vacuuming that failed with an failed to re-find parent 
key error. I fixed that by checking that the left sibling is also not 
marked with the flag.


It would be fairly easy to delete a page that is the right half of an 
incomplete split. Such a page doesn't have a downlink pointing to it, so 
just skip removing it, and instead clear the INCOMPLETE_SPLIT flag in 
the left sibling. But deleting incompletely split pages isn't important 
from a disk-space point of view, they should be extremely rare, so I 
decided to just punt.


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Failure while inserting parent tuple to B-tree is not fun

2014-03-18 Thread Heikki Linnakangas

On 02/06/2014 01:54 AM, Peter Geoghegan wrote:

On Thu, Jan 23, 2014 at 1:36 PM, Peter Geoghegan p...@heroku.com wrote:

So while post-recovery callbacks no longer exist for any
rmgr-managed-resource, 100% of remaining startup and cleanup callbacks
concern the simple management of memory of AM-specific recovery
contexts (for GiST, GiN and SP-GiST). I have to wonder if there isn't
a better abstraction than that, such as a generic recovery memory
context, allowing you to retire all 3 callbacks. I mean, StartupXLOG()
just calls those callbacks for each resource at exactly the same time
anyway, just as it initializes resource managers in precisely the same
manner earlier on. Plus if you look at what those AM-local memory
management routines do, it all seems very simple.


What are your thoughts on this, as someone that has a broader
perspective here? Are you inclined to keep the startup and cleanup
callbacks in anticipation of a day when that degree of generality is
useful? That would be pretty well-precedented of course, but I would
like to hear your opinion.


So, I just removed the rm_safe_restartpoint callback, as that's clearly 
dead now and I would complain loudly if someone tried to add a resource 
manager that would need it again.


Yeah, it's a bit silly that each resource manager has to do that on 
their own. It would be useful to have a memory context that was 
automatically reset between each WAL record. In fact that should 
probably be the default memory context you run the WAL redo routines in.


But even if we do that, I'm not in a hurry to remove rm_startup/cleanup. 
They seem generally useful, even if they're not actually used for much 
anymore.


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Failure while inserting parent tuple to B-tree is not fun

2014-03-18 Thread Tom Lane
Heikki Linnakangas hlinnakan...@vmware.com writes:
 Yeah, it's a bit silly that each resource manager has to do that on 
 their own. It would be useful to have a memory context that was 
 automatically reset between each WAL record. In fact that should 
 probably be the default memory context you run the WAL redo routines in.

+1

 But even if we do that, I'm not in a hurry to remove rm_startup/cleanup. 
 They seem generally useful, even if they're not actually used for much 
 anymore.

Agreed; they aren't hurting us and they could be useful.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Failure while inserting parent tuple to B-tree is not fun

2014-03-14 Thread Peter Geoghegan
Ping?

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Failure while inserting parent tuple to B-tree is not fun

2014-03-14 Thread Heikki Linnakangas

On 03/14/2014 01:03 PM, Peter Geoghegan wrote:

Ping?


I committed the other patch this depends on now. I'll take another stab 
at this one next.


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Failure while inserting parent tuple to B-tree is not fun

2014-02-05 Thread Peter Geoghegan
On Tue, Feb 4, 2014 at 11:56 PM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 Since, as I mentioned, _bt_finish_split() ultimately unlocks *and
 unpins*, it may not be the same buffer as before, so even with the
 refactoring there are race conditions.

 Care to elaborate? Or are you just referring to the missing buf =  ?

Yes, that's all I meant.

 Attached is a new version of the patch, with those issues fixed.
 btree-incomplete-split-4.patch is a complete patch against the latest
 fix-btree-page-deletion patch, and moveright-assign-fix.patch is just the
 changes to _bt_moveright, if you want to review just the changes since the
 previous patch I posted.

Cool. I'll take a good look at it tomorrow morning PST.


-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Failure while inserting parent tuple to B-tree is not fun

2014-02-05 Thread Peter Geoghegan
On Thu, Jan 23, 2014 at 1:36 PM, Peter Geoghegan p...@heroku.com wrote:
 So while post-recovery callbacks no longer exist for any
 rmgr-managed-resource, 100% of remaining startup and cleanup callbacks
 concern the simple management of memory of AM-specific recovery
 contexts (for GiST, GiN and SP-GiST). I have to wonder if there isn't
 a better abstraction than that, such as a generic recovery memory
 context, allowing you to retire all 3 callbacks. I mean, StartupXLOG()
 just calls those callbacks for each resource at exactly the same time
 anyway, just as it initializes resource managers in precisely the same
 manner earlier on. Plus if you look at what those AM-local memory
 management routines do, it all seems very simple.

What are your thoughts on this, as someone that has a broader
perspective here? Are you inclined to keep the startup and cleanup
callbacks in anticipation of a day when that degree of generality is
useful? That would be pretty well-precedented of course, but I would
like to hear your opinion.


-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Failure while inserting parent tuple to B-tree is not fun

2014-02-05 Thread Peter Geoghegan
On Tue, Feb 4, 2014 at 11:56 PM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 I also changed _bt_moveright to never return a write-locked buffer, when the
 caller asked for a read-lock (an issue you pointed out earlier in this
 thread).

I think that _bt_moveright() looks good now.

There is now bitrot, caused by commit ac8bc3. I rebased myself, but
that was non-trivial, surprisingly; I had to tweak by hand.


-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Failure while inserting parent tuple to B-tree is not fun

2014-02-05 Thread Peter Geoghegan
Some more thoughts:

Please add comments above _bt_mark_page_halfdead(), a new routine from
the dependency patch. I realize that this is substantially similar to
part of how _bt_pagedel() used to work, but it's still incongruous.

 ! Our approach is to create any missing downlinks on-they-fly, when
 ! searching the tree for a new insertion. It could be done during searches,
 ! too, but it seems best not to put any extra updates in what would otherwise

s/on-they-fly/on-the-fly/

I'm not sure about this:

 *** _bt_findinsertloc(Relation rel,
 *** 675,680 
 --- 701,707 
 static void
 _bt_insertonpg(Relation rel,
  Buffer buf,
 +Buffer cbuf,
  BTStack stack,
  IndexTuple itup,
  OffsetNumber newitemoff,

Wouldn't lcbuf be a better name for the new argument? After all, in
relevant contexts 'buf' is the parent of both of the pages post-split,
but there are two children in play here. You say:

+  *   When inserting to a non-leaf page, 'cbuf' is the left-sibling 
of the
+  *   page we're inserting the downlink for.  This function will 
clear the
+  *   INCOMPLETE_SPLIT flag on it, and release the buffer.

I suggest also noting in comments at this point that cbuf/the
left-sibling is the old half from the split.

Even having vented about cbuf's name, I can kind of see why you didn't
call it lcbuf: we already have an BTPageOpaque lpageop variable for
the special 'buf' metadata within the _bt_insertonpg() function; so
there might be an ambiguity between the possibly-soon-to-be-left page
(the target page) and the *child* left page that needs to have its
flag cleared. Although I still think you should change the name of
lpageop (further details follow).

Consider this:

   /* release buffers */
+  if (!P_ISLEAF(lpageop))
+  _bt_relbuf(rel, cbuf);

(Rebased, so this looks a bit different to your original version IIRC).

This is checking that the _bt_insertonpg() target page (whose special
data lpageop points to) is not a leaf page, and releasing the *child*
left page if it isn't. This is pretty confusing. So I suggest naming
lpageop tpageop ('t' for target, a terminology already used in the
comments above the function).

Also, I suggest you change this existing code comment:

 *  On entry, we must have the right buffer in which to do the
 *  insertion, and the buffer must be pinned and write-locked.  
On return,
 *  we will have dropped both the pin and the lock on the buffer.

Change right to correct here. Minor point, but right is a loaded word.

But why are you doing new things while checking P_ISLEAF(lpageop) in
_bt_insertonpg() to begin with? Would you not be better off doing
things when there is a child buffer passed (e.g. if
(BufferIsValid(cbuf))...), and only then asserting
!P_ISLEAF(lpageop)? That seems to me to more naturally establish the
context of _bt_insertonpg() is called to insert downlink after
split. Although maybe that conflates things within XLOG stuff. Hmm.

Perhaps some of these observations could equally well apply to
_bt_split(), which is similarly charged with releasing a left child
page on inserting a downlink. Particularly around reconsidering the
left vs left child vs target terminology.

Consider what happens when _bt_split() is passed a (left) child buffer
(a 'cbuf' argument). We are:

1. Splitting a page (first phase, which may only be finished lazily
some time later).

2. Iff this is a non-leaf page split, simultaneously unmark the flag
from some *other* split which we have a child buffer to unmark. Needs
to be part of same critical section. Unlock + release cbuf, again only
iff target/right split page contains a leaf page.

So, at the risk of belaboring the point:

* Some callers to _bt_split() and _bt_insertonpg() pass a 'cbuf' that
is not invalid.

* If either of those 2 functions don't unlock + release those buffers,
no one ever will.

* Therefore, at the very least we should unlock + release those
buffers **just because they're not invalid**. That's a sufficient
condition to do so, and attaching that to level is unnecessarily
unclear. As I said, assert non-leafness.

Incidentally, I think that in general we could be clearer on the fact
that a root page may also be a leaf page.
-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Failure while inserting parent tuple to B-tree is not fun

2014-02-04 Thread Heikki Linnakangas

On 02/04/2014 02:40 AM, Peter Geoghegan wrote:

On Fri, Jan 31, 2014 at 9:09 AM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:

I refactored the loop in _bt_moveright to, well, not have that bug anymore.
The 'page' and 'opaque' pointers are now fetched at the beginning of the
loop. Did I miss something?


I think so, yes. You still aren't assigning the value returned by
_bt_getbuf() to 'buf'.


D'oh, you're right.


Since, as I mentioned, _bt_finish_split() ultimately unlocks *and
unpins*, it may not be the same buffer as before, so even with the
refactoring there are race conditions.


Care to elaborate? Or are you just referring to the missing buf =  ?


A closely related issue is that you haven't mentioned anything about
buffer pins/refcount side effects in comments above
_bt_finish_split(), even though I believe you should.


Ok.


A minor stylistic concern is that I think it would be better to only
have one pair of _bt_finish_split()/_bt_getbuf() calls regardless of
the initial value of 'access'.


Ok.

I also changed _bt_moveright to never return a write-locked buffer, when 
the caller asked for a read-lock (an issue you pointed out earlier in 
this thread).


Attached is a new version of the patch, with those issues fixed. 
btree-incomplete-split-4.patch is a complete patch against the latest 
fix-btree-page-deletion patch, and moveright-assign-fix.patch is just 
the changes to _bt_moveright, if you want to review just the changes 
since the previous patch I posted.


- Heikki
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 03efc29..43ee75f 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -404,12 +404,34 @@ an additional insertion above that, etc).
 For a root split, the followon WAL entry is a new root entry rather than
 an insertion entry, but details are otherwise much the same.
 
-Because insertion involves multiple atomic actions, the WAL replay logic
-has to detect the case where a page split isn't followed by a matching
-insertion on the parent level, and then do that insertion on its own (and
-recursively for any subsequent parent insertion, of course).  This is
-feasible because the WAL entry for the split contains enough info to know
-what must be inserted in the parent level.
+Because splitting involves multiple atomic actions, it's possible that the
+system crashes between splitting a page and inserting the downlink for the
+new half to the parent. After recovery, the downlink for the new page will
+be missing. The search algorithm works correctly, as the page will be found
+by following the right-link from its left sibling, although if a lot of
+downlinks in the tree are missing, performance will suffer. A more serious
+consequence is that if the page without a downlink gets split again, the
+insertion algorithm will fail to find the location in the parent level to
+insert the downlink.
+
+Our approach is to create any missing downlinks on-they-fly, when
+searching the tree for a new insertion. It could be done during searches,
+too, but it seems best not to put any extra updates in what would otherwise
+be a read-only operation (updating is not possible in hot standby mode
+anyway). To identify missing downlinks, when a page is split, the left page
+is flagged to indicate that the split is not yet complete (INCOMPLETE_SPLIT).
+When the downlink is inserted to the parent, the flag is cleared atomically
+with the insertion. The child page is kept locked until the insertion in the
+parent is finished and the flag in the child cleared, but can be released
+immediately after that, before recursing up the tree, if the parent also
+needs to be split. This ensures that incompletely split pages should not be
+seen under normal circumstances; only when insertion to the parent fails
+for some reason.
+
+We flag the left page, even though it's the right page that's missing the
+downlink, beacuse it's more convenient to know already when following the
+right-link from the left page to the right page that it will need to have
+its downlink inserted to the parent.
 
 When splitting a non-root page that is alone on its level, the required
 metapage update (of the fast root link) is performed and logged as part
@@ -422,6 +444,14 @@ page is a second record.  If vacuum is interrupted for some reason, or the
 system crashes, the tree is consistent for searches and insertions.  The next
 VACUUM will find the half-dead leaf page and continue the deletion.
 
+Before 9.4, we used to keep track of incomplete splits and page deletions
+during recovery and finish them immediately at end of recovery, instead of
+doing it lazily at the next  insertion or vacuum. However, that made the
+recovery much more complicated, and only fixed the problem when crash
+recovery was performed. An incomplete split can also occur if an otherwise
+recoverable error, like out-of-memory or out-of-disk-space, happens while
+inserting the downlink to 

Re: [HACKERS] Failure while inserting parent tuple to B-tree is not fun

2014-02-03 Thread Peter Geoghegan
On Fri, Jan 31, 2014 at 9:09 AM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 I refactored the loop in _bt_moveright to, well, not have that bug anymore.
 The 'page' and 'opaque' pointers are now fetched at the beginning of the
 loop. Did I miss something?

I think so, yes. You still aren't assigning the value returned by
_bt_getbuf() to 'buf'. Since, as I mentioned, _bt_finish_split()
ultimately unlocks *and unpins*, it may not be the same buffer as
before, so even with the refactoring there are race conditions. A
closely related issue is that you haven't mentioned anything about
buffer pins/refcount side effects in comments above
_bt_finish_split(), even though I believe you should.

A minor stylistic concern is that I think it would be better to only
have one pair of _bt_finish_split()/_bt_getbuf() calls regardless of
the initial value of 'access'.


-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Failure while inserting parent tuple to B-tree is not fun

2014-01-31 Thread Heikki Linnakangas

On 01/30/2014 12:46 AM, Peter Geoghegan wrote:

On Mon, Jan 27, 2014 at 10:54 AM, Peter Geoghegan p...@heroku.com wrote:

On Mon, Jan 27, 2014 at 10:27 AM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:

I think I see some bugs in _bt_moveright(). If you examine
_bt_finish_split() in detail, you'll see that it doesn't just drop the
write buffer lock that the caller will have provided (per its
comments) - it also drops the buffer pin. It calls _bt_insert_parent()
last, which was previously only called last thing by _bt_insertonpg()
(some of the time), and _bt_insertonpg() is indeed documented to drop
both the lock and the pin. And if you look at _bt_moveright() again,
you'll see that there is a tacit assumption that the buffer pin isn't
dropped, or at least that opaque (the BTPageOpaque BT special page
area pointer) continues to point to the same page held in the same
buffer, even though the code doesn't set the opaque' pointer again
and it may not point to a pinned buffer or even the appropriate
buffer. Ditto page. So opaque may point to the wrong thing on
subsequent iterations - you don't do anything with the return value of
_bt_getbuf(), unlike all of the existing call sites. I believe you
should store its return value, and get the held page and the special
pointer on the page, and assign that to the opaque pointer before
the next iteration (an iteration in which you very probably really do
make progress not limited to completing a page split, the occurrence
of which the _bt_moveright() loop gets stuck on). You know, do what
is done in the non-split-handling case. It may not always be the same
buffer as before, obviously.


Yep, fixed.


Can you explain what the fix was, please?


Ping? I would like to hear some details here.


I refactored the loop in _bt_moveright to, well, not have that bug 
anymore. The 'page' and 'opaque' pointers are now fetched at the 
beginning of the loop. Did I miss something?


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Failure while inserting parent tuple to B-tree is not fun

2014-01-29 Thread Peter Geoghegan
On Mon, Jan 27, 2014 at 10:54 AM, Peter Geoghegan p...@heroku.com wrote:
 On Mon, Jan 27, 2014 at 10:27 AM, Heikki Linnakangas
 hlinnakan...@vmware.com wrote:
 I think I see some bugs in _bt_moveright(). If you examine
 _bt_finish_split() in detail, you'll see that it doesn't just drop the
 write buffer lock that the caller will have provided (per its
 comments) - it also drops the buffer pin. It calls _bt_insert_parent()
 last, which was previously only called last thing by _bt_insertonpg()
 (some of the time), and _bt_insertonpg() is indeed documented to drop
 both the lock and the pin. And if you look at _bt_moveright() again,
 you'll see that there is a tacit assumption that the buffer pin isn't
 dropped, or at least that opaque (the BTPageOpaque BT special page
 area pointer) continues to point to the same page held in the same
 buffer, even though the code doesn't set the opaque' pointer again
 and it may not point to a pinned buffer or even the appropriate
 buffer. Ditto page. So opaque may point to the wrong thing on
 subsequent iterations - you don't do anything with the return value of
 _bt_getbuf(), unlike all of the existing call sites. I believe you
 should store its return value, and get the held page and the special
 pointer on the page, and assign that to the opaque pointer before
 the next iteration (an iteration in which you very probably really do
 make progress not limited to completing a page split, the occurrence
 of which the _bt_moveright() loop gets stuck on). You know, do what
 is done in the non-split-handling case. It may not always be the same
 buffer as before, obviously.


 Yep, fixed.

 Can you explain what the fix was, please?

Ping? I would like to hear some details here.


-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Failure while inserting parent tuple to B-tree is not fun

2014-01-27 Thread Heikki Linnakangas

On 01/23/2014 11:36 PM, Peter Geoghegan wrote:

The first thing I noticed about this patchset is that it completely
expunges btree_xlog_startup(), btree_xlog_cleanup() and
btree_safe_restartpoint(). The post-recovery cleanup that previously
occurred to address both sets of problems (the problem addressed by
this patch, incomplete page splits, and the problem addressed by the
dependency patch, incomplete page deletions) now both occur
opportunistically/lazily. Notably, this means that there are now
exactly zero entries in the resource manager list with a
'restartpoint' callback. I guess we should consider removing it
entirely for that reason. As you said in the commit message where
gin_safe_restartpoint was similarly retired (commit 631118fe):


The post-recovery cleanup mechanism was never totally reliable, as insertion
to the parent could fail e.g because of running out of memory or disk space,
leaving the tree in an inconsistent state.


I'm of the general impression that post-recovery cleanup is
questionable. I'm surprised that you didn't mention this commit
previously. You just mentioned the original analogous work on GiST,
but this certainly seems to be something you've been thinking about
*systematically* eliminating for a while.


Yes, that's correct, these b-tree patches are part of a grand plan to 
eliminate post-recovery cleanup actions altogether.



So while post-recovery callbacks no longer exist for any
rmgr-managed-resource, 100% of remaining startup and cleanup callbacks
concern the simple management of memory of AM-specific recovery
contexts (for GiST, GiN and SP-GiST). I have to wonder if there isn't
a better abstraction than that, such as a generic recovery memory
context, allowing you to retire all 3 callbacks. I mean, StartupXLOG()
just calls those callbacks for each resource at exactly the same time
anyway, just as it initializes resource managers in precisely the same
manner earlier on. Plus if you look at what those AM-local memory
management routines do, it all seems very simple.

I think you should bump XLOG_PAGE_MAGIC.

Perhaps you should increase the elevel here:

+   elog(DEBUG1, finishing incomplete split of %u/%u,
+BufferGetBlockNumber(lbuf), BufferGetBlockNumber(rbuf));

Since this is a very rare occurrence that involves some subtle
interactions, I can't think why you wouldn't want to LOG this for
forensic purposes.


Hmm. I'm not sure I agree with that line of thinking in general - what 
is an admin supposed to do about the message? But there's a precedent in 
vacuumlazy.c, which logs a message when it finds all-zero pages in the 
heap. So I guess that should be a LOG.



Why did you remove the local linkage of _bt_walk_left(), given that it
is called in exactly the same way as before? I guess that that is just
a vestige of some earlier revision.


Yeah, reverted.


I think I see some bugs in _bt_moveright(). If you examine
_bt_finish_split() in detail, you'll see that it doesn't just drop the
write buffer lock that the caller will have provided (per its
comments) - it also drops the buffer pin. It calls _bt_insert_parent()
last, which was previously only called last thing by _bt_insertonpg()
(some of the time), and _bt_insertonpg() is indeed documented to drop
both the lock and the pin. And if you look at _bt_moveright() again,
you'll see that there is a tacit assumption that the buffer pin isn't
dropped, or at least that opaque (the BTPageOpaque BT special page
area pointer) continues to point to the same page held in the same
buffer, even though the code doesn't set the opaque' pointer again
and it may not point to a pinned buffer or even the appropriate
buffer. Ditto page. So opaque may point to the wrong thing on
subsequent iterations - you don't do anything with the return value of
_bt_getbuf(), unlike all of the existing call sites. I believe you
should store its return value, and get the held page and the special
pointer on the page, and assign that to the opaque pointer before
the next iteration (an iteration in which you very probably really do
make progress not limited to completing a page split, the occurrence
of which the _bt_moveright() loop gets stuck on). You know, do what
is done in the non-split-handling case. It may not always be the same
buffer as before, obviously.


Yep, fixed.


Do you suppose that there are similar problems in other direct callers
of _bt_finish_split()? I'm thinking in particular of
_bt_findinsertloc().


Hmm, no, the other callers look OK to me.


I'm also not sure about the lock escalation that may occur within
_bt_moveright() for callers with BT_READ access - there is nothing to
prevent a caller from ending up being left with a write lock where
before they only had a read lock if they find that
!P_INCOMPLETE_SPLIT() with the write lock (after lock promotion) and
conclude that there is no split finishing to be done after all. It
looks like all callers currently won't care about that, but it seems
like that 

Re: [HACKERS] Failure while inserting parent tuple to B-tree is not fun

2014-01-27 Thread Peter Geoghegan
On Mon, Jan 27, 2014 at 10:27 AM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 I think I see some bugs in _bt_moveright(). If you examine
 _bt_finish_split() in detail, you'll see that it doesn't just drop the
 write buffer lock that the caller will have provided (per its
 comments) - it also drops the buffer pin. It calls _bt_insert_parent()
 last, which was previously only called last thing by _bt_insertonpg()
 (some of the time), and _bt_insertonpg() is indeed documented to drop
 both the lock and the pin. And if you look at _bt_moveright() again,
 you'll see that there is a tacit assumption that the buffer pin isn't
 dropped, or at least that opaque (the BTPageOpaque BT special page
 area pointer) continues to point to the same page held in the same
 buffer, even though the code doesn't set the opaque' pointer again
 and it may not point to a pinned buffer or even the appropriate
 buffer. Ditto page. So opaque may point to the wrong thing on
 subsequent iterations - you don't do anything with the return value of
 _bt_getbuf(), unlike all of the existing call sites. I believe you
 should store its return value, and get the held page and the special
 pointer on the page, and assign that to the opaque pointer before
 the next iteration (an iteration in which you very probably really do
 make progress not limited to completing a page split, the occurrence
 of which the _bt_moveright() loop gets stuck on). You know, do what
 is done in the non-split-handling case. It may not always be the same
 buffer as before, obviously.


 Yep, fixed.

Can you explain what the fix was, please?

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Failure while inserting parent tuple to B-tree is not fun

2014-01-27 Thread Peter Geoghegan
On Mon, Jan 27, 2014 at 10:58 AM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 Okay, promise not to laugh. I did write a bunch of hacks, to generate
 graphviz .dot files from the btree pages, and render them into pictures. It
 consist of multiple parts, all in the attached tarball.

It's funny that you should say that, because I was thinking about
building something similar over the past few days. I find it very
useful to build ad-hoc tools like that, and I thought that it would be
particularly useful to have something visual for testing btree code.
Sure, you can come up with a test and keep the details straight in
your head while coordinating everything, but that won't scale as new
testing scenarios inevitably occur to you. I've done plenty of things
with contrib/pageinspect along these lines in the past, but this is
better. Anything that reduces the friction involved in building an
accurate mental model of things is a very good thing.



-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Failure while inserting parent tuple to B-tree is not fun

2014-01-23 Thread Peter Geoghegan
On Thu, Nov 14, 2013 at 9:23 AM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 Ok, here's a new version of the patch to handle incomplete B-tree splits.

I finally got around to taking a look at this. Unlike with the as-yet
uncommitted Race condition in b-tree page deletion patch that Kevin
looked at, which there is a dependency on here, I could not really
consider your patch in isolation. I'm really reviewing both patches,
but with a particular focus on this recent set of additions
(btree-incomplete-splits-2.patch), and the issues addressed by it.
However, since the distinction between this patch and the patch that
it has a dependency on is fuzzy, expect me to be fuzzy in
differentiating the two. This patch is only very loosely an atomic
unit. This is not a criticism - I understand that it just turned out
that way.

The first thing I noticed about this patchset is that it completely
expunges btree_xlog_startup(), btree_xlog_cleanup() and
btree_safe_restartpoint(). The post-recovery cleanup that previously
occurred to address both sets of problems (the problem addressed by
this patch, incomplete page splits, and the problem addressed by the
dependency patch, incomplete page deletions) now both occur
opportunistically/lazily. Notably, this means that there are now
exactly zero entries in the resource manager list with a
'restartpoint' callback. I guess we should consider removing it
entirely for that reason. As you said in the commit message where
gin_safe_restartpoint was similarly retired (commit 631118fe):


The post-recovery cleanup mechanism was never totally reliable, as insertion
to the parent could fail e.g because of running out of memory or disk space,
leaving the tree in an inconsistent state.


I'm of the general impression that post-recovery cleanup is
questionable. I'm surprised that you didn't mention this commit
previously. You just mentioned the original analogous work on GiST,
but this certainly seems to be something you've been thinking about
*systematically* eliminating for a while.

So while post-recovery callbacks no longer exist for any
rmgr-managed-resource, 100% of remaining startup and cleanup callbacks
concern the simple management of memory of AM-specific recovery
contexts (for GiST, GiN and SP-GiST). I have to wonder if there isn't
a better abstraction than that, such as a generic recovery memory
context, allowing you to retire all 3 callbacks. I mean, StartupXLOG()
just calls those callbacks for each resource at exactly the same time
anyway, just as it initializes resource managers in precisely the same
manner earlier on. Plus if you look at what those AM-local memory
management routines do, it all seems very simple.

I think you should bump XLOG_PAGE_MAGIC.

Perhaps you should increase the elevel here:

+   elog(DEBUG1, finishing incomplete split of %u/%u,
+BufferGetBlockNumber(lbuf), BufferGetBlockNumber(rbuf));

Since this is a very rare occurrence that involves some subtle
interactions, I can't think why you wouldn't want to LOG this for
forensic purposes.

Why did you remove the local linkage of _bt_walk_left(), given that it
is called in exactly the same way as before? I guess that that is just
a vestige of some earlier revision.

I think I see some bugs in _bt_moveright(). If you examine
_bt_finish_split() in detail, you'll see that it doesn't just drop the
write buffer lock that the caller will have provided (per its
comments) - it also drops the buffer pin. It calls _bt_insert_parent()
last, which was previously only called last thing by _bt_insertonpg()
(some of the time), and _bt_insertonpg() is indeed documented to drop
both the lock and the pin. And if you look at _bt_moveright() again,
you'll see that there is a tacit assumption that the buffer pin isn't
dropped, or at least that opaque (the BTPageOpaque BT special page
area pointer) continues to point to the same page held in the same
buffer, even though the code doesn't set the opaque' pointer again
and it may not point to a pinned buffer or even the appropriate
buffer. Ditto page. So opaque may point to the wrong thing on
subsequent iterations - you don't do anything with the return value of
_bt_getbuf(), unlike all of the existing call sites. I believe you
should store its return value, and get the held page and the special
pointer on the page, and assign that to the opaque pointer before
the next iteration (an iteration in which you very probably really do
make progress not limited to completing a page split, the occurrence
of which the _bt_moveright() loop gets stuck on). You know, do what
is done in the non-split-handling case. It may not always be the same
buffer as before, obviously.

For a moment I thought that you might have accounted for that here:

*** _bt_insert_parent(Relation rel,
*** 1685,1696 
* 05/27/97
*/
   ItemPointerSet((stack-bts_btentry.t_tid), bknum, P_HIKEY);
-
   pbuf = _bt_getstackbuf(rel, 

Re: [HACKERS] Failure while inserting parent tuple to B-tree is not fun

2013-10-25 Thread Heikki Linnakangas

On 22.10.2013 19:55, Heikki Linnakangas wrote:

I fixed the the same problem in GiST a few years back, by making it
tolerate missing downlinks, and inserting them lazily. The B-tree code
tolerates them already on scans, but gets confused on insertion, as seen
above. I propose that we use the same approach I used with GiST, and add
a flag to the page header to indicate the downlink hasn't been inserted
yet. When insertion (or vacuum) bumps into a flagged page, it can
finish the incomplete action by inserting the downlink.


This is what I came up with.

One thing I'm not totally happy about is the way page deletions of 
incompletely split pages are handled. Basically, it just bails out and 
refuses to delete a page that is part of an incomplete split. That's 
probably OK in practice, as incomplete splits should be very rare 
anyway, but it's a bit dissatisfying to not handle the case because at 
first glance it seems like it should be even simpler than usual to 
delete a page that has no downlink. Nevertheless, I decided to just skip 
that for now.


After this patch, deleting the only child of a parent and the parent 
itself is still a multi-WAL-record operation that needs to be tracked 
during recovery, and completed at the end of recovery. I'd like to 
eliminate that too, but that's another patch.


- Heikki
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 40f09e3..29d8bd1 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -384,12 +384,41 @@ an additional insertion above that, etc).
 For a root split, the followon WAL entry is a new root entry rather than
 an insertion entry, but details are otherwise much the same.
 
-Because insertion involves multiple atomic actions, the WAL replay logic
-has to detect the case where a page split isn't followed by a matching
-insertion on the parent level, and then do that insertion on its own (and
-recursively for any subsequent parent insertion, of course).  This is
-feasible because the WAL entry for the split contains enough info to know
-what must be inserted in the parent level.
+Because insertion involves multiple atomic actions, it's possible that the
+system crashes between splitting a page and inserting the downlink for the
+new half to the parent. After recovery, the downlink for the new page will
+be missing. The search algorithm works correctly, as the page will be found
+by following the right-link from its left sibling, although if a lot of
+downlinks in the tree are missing, performance will suffer. A more serious
+consequence is that if the page without a downlink gets split again, the
+insertion algorithm will fail to find the location in the parent level to
+insert the downlink.
+
+Our approach is to create any missing downlinks along the way, when
+searching the tree for a new insertion. It could be done during searches,
+too, but it seems best not to put any extra writes in what would otherwise
+be a read-only operation (it would not be possible in hot standby mode
+anyway). To identify missing downlinks, when a page is split, the left page
+is flagged to indicate that the split is not yet complete (INCOMPLETE_SPLIT).
+When the downlink is inserted to the parent, the flag is atomically cleared.
+The child page is kept locked until the insertion in the parent is finished
+and the flag in the child cleared, but can be released immediately after
+that, before recursing up the tree, if the parent also needs to be split.
+That ensures that incompletely split pages should not be seen under normal
+circumstances; only when a transaction fails to insert the parent for some
+reason.
+
+We flag the left page, even though it's the right page that's missing the
+downlink, beacuse it's more convenient to know already when following the
+right-link from the left page to the right page that it will need to have
+its downlink inserted to the parent.
+
+We used to keep track of incomplete splits during recovery and finish them
+immediately at end of recovery, instead of doing it lazily at the next
+insertion. However, that made the recovery much more complicated, and only
+fixed the problem when crash recovery was performed. An incomplete split can
+also occur if an otherwise recoverable error, like out-of-memory or
+out-of-disk-space, happens while inserting the downlink to the parent.
 
 When splitting a non-root page that is alone on its level, the required
 metapage update (of the fast root link) is performed and logged as part
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index a452fea..a810015 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -59,14 +59,16 @@ static void _bt_findinsertloc(Relation rel,
   ScanKey scankey,
   IndexTuple newtup,
   Relation heapRel);
-static void _bt_insertonpg(Relation rel, Buffer buf,
+static void _bt_insertonpg(Relation rel, Buffer buf, Buffer 

Re: [HACKERS] Failure while inserting parent tuple to B-tree is not fun

2013-10-22 Thread Peter Geoghegan
On Tue, Oct 22, 2013 at 9:55 AM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 I propose that we use the same approach I used with GiST, and add a flag to
 the page header to indicate the downlink hasn't been inserted yet. When
 insertion (or vacuum) bumps into a flagged page, it can finish the
 incomplete action by inserting the downlink.

Sounds very reasonable, but I'm interested in how you intend to
structure things, given this sounds like what could loosely be called
btree private state. I may also need to use a flag bit for something
that is of interest to the btree code alone.


-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Failure while inserting parent tuple to B-tree is not fun

2013-10-22 Thread Heikki Linnakangas

On 22.10.2013 20:27, Peter Geoghegan wrote:

On Tue, Oct 22, 2013 at 9:55 AM, Heikki Linnakangas
hlinnakan...@vmware.com  wrote:

I propose that we use the same approach I used with GiST, and add a flag to
the page header to indicate the downlink hasn't been inserted yet. When
insertion (or vacuum) bumps into a flagged page, it can finish the
incomplete action by inserting the downlink.


Sounds very reasonable, but I'm interested in how you intend to
structure things, given this sounds like what could loosely be called
btree private state. I may also need to use a flag bit for something
that is of interest to the btree code alone.


I may be missing something, but there are already plenty of b-tree 
specific flags. See BTP_* in nbtree.h. I'll just add another to that list.


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Failure while inserting parent tuple to B-tree is not fun

2013-10-22 Thread Peter Geoghegan
On Tue, Oct 22, 2013 at 10:30 AM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 I may be missing something, but there are already plenty of b-tree specific
 flags. See BTP_* in nbtree.h. I'll just add another to that list.

Based on your remarks, I thought that you were intent on directly
using page level bits (pd_flags), rather than the private state flag
bits. Blame it on the lack of coffee, I suppose.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Failure while inserting parent tuple to B-tree is not fun

2013-10-22 Thread Andres Freund
On 2013-10-22 19:55:09 +0300, Heikki Linnakangas wrote:
 Splitting a B-tree page is a two-stage process: First, the page is split,
 and then a downlink for the new right page is inserted into the parent
 (which might recurse to split the parent page, too). What happens if
 inserting the downlink fails for some reason? I tried that out, and it turns
 out that it's not nice.
 
 I used this to cause a failure:
 
 --- a/src/backend/access/nbtree/nbtinsert.c
 +++ b/src/backend/access/nbtree/nbtinsert.c
 @@ -1669,6 +1669,8 @@ _bt_insert_parent(Relation rel,
  _bt_relbuf(rel, pbuf);
  }
 
 +elog(ERROR, fail!);
 +
  /* get high key from left page == lowest key on new right page 
  */
  ritem = (IndexTuple) PageGetItem(page,
  
   PageGetItemId(page, P_HIKEY));
 
 postgres=# create table foo (i int4 primary key);
 CREATE TABLE
 postgres=# insert into foo select generate_series(1, 1);
 ERROR:  fail!
 
 That's not surprising. But when I removed that elog again and restarted the
 server, I still can't insert. The index is permanently broken:
 
 postgres=# insert into foo select generate_series(1, 1);
 ERROR:  failed to re-find parent key in index foo_pkey for split pages 4/5
 
 In real life, you would get a failure like this e.g if you run out of memory
 or disk space while inserting the downlink to the parent. Although rare in
 practice, it's no fun if it happens.

Why doesn't the incomplete split mechanism prevent this? Because we do
not delay checkpoints on the primary and a checkpoint happened just
befor your elog(ERROR) above?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Failure while inserting parent tuple to B-tree is not fun

2013-10-22 Thread Heikki Linnakangas

On 22.10.2013 21:25, Andres Freund wrote:

On 2013-10-22 19:55:09 +0300, Heikki Linnakangas wrote:

Splitting a B-tree page is a two-stage process: First, the page is split,
and then a downlink for the new right page is inserted into the parent
(which might recurse to split the parent page, too). What happens if
inserting the downlink fails for some reason? I tried that out, and it turns
out that it's not nice.

I used this to cause a failure:


--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -1669,6 +1669,8 @@ _bt_insert_parent(Relation rel,
_bt_relbuf(rel, pbuf);
}

+   elog(ERROR, fail!);
+
/* get high key from left page == lowest key on new right page 
*/
ritem = (IndexTuple) PageGetItem(page,

 PageGetItemId(page, P_HIKEY));


postgres=# create table foo (i int4 primary key);
CREATE TABLE
postgres=# insert into foo select generate_series(1, 1);
ERROR:  fail!

That's not surprising. But when I removed that elog again and restarted the
server, I still can't insert. The index is permanently broken:

postgres=# insert into foo select generate_series(1, 1);
ERROR:  failed to re-find parent key in index foo_pkey for split pages 4/5

In real life, you would get a failure like this e.g if you run out of memory
or disk space while inserting the downlink to the parent. Although rare in
practice, it's no fun if it happens.


Why doesn't the incomplete split mechanism prevent this? Because we do
not delay checkpoints on the primary and a checkpoint happened just
befor your elog(ERROR) above?


Because there's no recovery involved. The failure I injected (or an 
out-of-memory or out-of-disk-space in the real world) doesn't cause a 
PANIC, just an ERROR that rolls back the current transaction, nothing more.


We could put a critical section around the whole recursion that inserts 
the downlinks, so that you would get a PANIC and the incomplete split 
mechanism would fix it at recovery. But that would hardly be an improvement.


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Failure while inserting parent tuple to B-tree is not fun

2013-10-22 Thread Andres Freund
On 2013-10-22 21:29:13 +0300, Heikki Linnakangas wrote:
 On 22.10.2013 21:25, Andres Freund wrote:
 On 2013-10-22 19:55:09 +0300, Heikki Linnakangas wrote:
 Splitting a B-tree page is a two-stage process: First, the page is split,
 and then a downlink for the new right page is inserted into the parent
 (which might recurse to split the parent page, too). What happens if
 inserting the downlink fails for some reason? I tried that out, and it turns
 out that it's not nice.
 
 I used this to cause a failure:
 
 --- a/src/backend/access/nbtree/nbtinsert.c
 +++ b/src/backend/access/nbtree/nbtinsert.c
 @@ -1669,6 +1669,8 @@ _bt_insert_parent(Relation rel,
_bt_relbuf(rel, pbuf);
}
 
 +  elog(ERROR, fail!);
 +
/* get high key from left page == lowest key on new right page 
  */
ritem = (IndexTuple) PageGetItem(page,

   PageGetItemId(page, P_HIKEY));
 
 postgres=# create table foo (i int4 primary key);
 CREATE TABLE
 postgres=# insert into foo select generate_series(1, 1);
 ERROR:  fail!
 
 That's not surprising. But when I removed that elog again and restarted the
 server, I still can't insert. The index is permanently broken:
 
 postgres=# insert into foo select generate_series(1, 1);
 ERROR:  failed to re-find parent key in index foo_pkey for split pages 4/5
 
 In real life, you would get a failure like this e.g if you run out of memory
 or disk space while inserting the downlink to the parent. Although rare in
 practice, it's no fun if it happens.
 
 Why doesn't the incomplete split mechanism prevent this? Because we do
 not delay checkpoints on the primary and a checkpoint happened just
 befor your elog(ERROR) above?
 
 Because there's no recovery involved. The failure I injected (or an
 out-of-memory or out-of-disk-space in the real world) doesn't cause a PANIC,
 just an ERROR that rolls back the current transaction, nothing more.
 
 We could put a critical section around the whole recursion that inserts the
 downlinks, so that you would get a PANIC and the incomplete split mechanism
 would fix it at recovery. But that would hardly be an improvement.

You were talking about restarting the server, that's why I assumed
recovery had been involved... But you just were talking about removing
the elog() again.

For me this clearly *has* to be in a critical section with the current
code. I had always assumed all multi-part actions would be.

Do you forsee the fix with ignoring missing downlinks to be
back-patchable? FWIW, I think I might have seen real-world cases of
this.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Failure while inserting parent tuple to B-tree is not fun

2013-10-22 Thread Tom Lane
Andres Freund and...@2ndquadrant.com writes:
 On 2013-10-22 21:29:13 +0300, Heikki Linnakangas wrote:
 We could put a critical section around the whole recursion that inserts the
 downlinks, so that you would get a PANIC and the incomplete split mechanism
 would fix it at recovery. But that would hardly be an improvement.

 For me this clearly *has* to be in a critical section with the current
 code. I had always assumed all multi-part actions would be.

No, that's hardly a good idea.  As Heikki says, that would amount to
converting an entirely foreseeable situation into a PANIC.

Note also that the problem might be persistent, eg if you're out of disk
space.  In that case, you'd just get the PANIC again at recovery, so now
not only have you crashed all your sessions but you have a database that
won't start up.

I wonder whether Heikki's approach could be used to remove the need for
the incomplete-split-fixup code altogether, thus eliminating a class of
recovery failure possibilities.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Failure while inserting parent tuple to B-tree is not fun

2013-10-22 Thread Heikki Linnakangas

On 22.10.2013 22:24, Tom Lane wrote:

I wonder whether Heikki's approach could be used to remove the need for
the incomplete-split-fixup code altogether, thus eliminating a class of
recovery failure possibilities.


Yes. I intend to do that, too.

- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Failure while inserting parent tuple to B-tree is not fun

2013-10-22 Thread Andres Freund
On 2013-10-22 15:24:40 -0400, Tom Lane wrote:
 Andres Freund and...@2ndquadrant.com writes:
  On 2013-10-22 21:29:13 +0300, Heikki Linnakangas wrote:
  We could put a critical section around the whole recursion that inserts the
  downlinks, so that you would get a PANIC and the incomplete split mechanism
  would fix it at recovery. But that would hardly be an improvement.
 
  For me this clearly *has* to be in a critical section with the current
  code. I had always assumed all multi-part actions would be.
 
 No, that's hardly a good idea.  As Heikki says, that would amount to
 converting an entirely foreseeable situation into a PANIC.

But IIUC this can currently lead to an index giving wrong answers, not
just fail at further inserts? I think if we half-split (without
inserting the downlink) a page several times, via different child-pages,
we might suceed in splitting but we'll have corrupted left/right
links. With complete splits such things should be prevented by the
page-level locks we hold, but that's not the case anymore.
If so that could, especially in combination with unique indexes, lead to
quite nasty data corruption

 I wonder whether Heikki's approach could be used to remove the need for
 the incomplete-split-fixup code altogether, thus eliminating a class of
 recovery failure possibilities.

That'd be better...

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Failure while inserting parent tuple to B-tree is not fun

2013-10-22 Thread Tom Lane
Andres Freund and...@2ndquadrant.com writes:
 On 2013-10-22 15:24:40 -0400, Tom Lane wrote:
 No, that's hardly a good idea.  As Heikki says, that would amount to
 converting an entirely foreseeable situation into a PANIC.

 But IIUC this can currently lead to an index giving wrong answers, not
 just fail at further inserts?

No.  It's exactly the same situation as when the insert is still in
progress, ie searches have to move right when following the original
downlink.  Go read the nbtree README.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Failure while inserting parent tuple to B-tree is not fun

2013-10-22 Thread Andres Freund
On 2013-10-22 16:38:05 -0400, Tom Lane wrote:
 Andres Freund and...@2ndquadrant.com writes:
  On 2013-10-22 15:24:40 -0400, Tom Lane wrote:
  No, that's hardly a good idea.  As Heikki says, that would amount to
  converting an entirely foreseeable situation into a PANIC.
 
  But IIUC this can currently lead to an index giving wrong answers, not
  just fail at further inserts?
 
 No.  It's exactly the same situation as when the insert is still in
 progress, ie searches have to move right when following the original
 downlink.  Go read the nbtree README.

If the parent insert is still in progress we have a write lock on the
left and right page preventing such issues, right? At least that's what
the nbtree README says...
That doesn't hold true if we split a page but fail before re-inserting
the parent since the transaction rollback will obviously have released
the locks.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers